FastVLM: Efficient Vision Encoding for Vision Language Models is an implementation of efficient vision encoding for vision language models (VLMs). It is designed to cut image encoding time and response latency, particularly for high-resolution images.

Features

Hybrid vision encoder

FastViTHD, our novel hybrid vision encoder, outputs fewer tokens and significantly reduces encoding time for high-resolution images.
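
To illustrate why emitting fewer tokens matters at high resolution, here is a minimal sketch of how the visual token count scales with an encoder's overall downsampling factor. The function name and the stride values are hypothetical, chosen only for illustration; they are not FastViTHD's actual configuration.

```python
def visual_token_count(height: int, width: int, stride: int) -> int:
    """Tokens produced by an encoder that emits one token per stride x stride patch.

    `stride` is the encoder's overall downsampling factor (an assumed,
    illustrative parameter, not a value taken from FastViTHD).
    """
    return (height // stride) * (width // stride)


# Hypothetical strides, for illustration only.
for stride in (16, 32, 64):
    tokens = visual_token_count(1024, 1024, stride)
    print(f"stride={stride:>2}: {tokens} tokens")
```

Doubling the stride cuts the token count by a factor of four, so the LLM has far fewer visual tokens to process before it can start generating.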

Faster performance

FastVLM outperforms LLaVA-OneVision-0.5B while delivering an 85x faster Time-to-First-Token (TTFT) and using a smaller vision encoder.
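
For readers unfamiliar with the metric, TTFT is the delay between submitting a request and receiving the first output token. The sketch below shows one way it could be measured around any streaming generator. `measure_ttft` and `fake_model_stream` are hypothetical names, and the sleep merely stands in for image encoding plus LLM prefill; none of this is FastVLM's own API or benchmarking method.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> Tuple[float, str]:
    """Return (seconds until the first token, the first token).

    The clock starts before the stream is created, so any eager setup
    work inside `start_stream` is included in the measurement.
    """
    start = time.perf_counter()
    first_token = next(iter(start_stream()))
    return time.perf_counter() - start, first_token


def fake_model_stream(prefill_seconds: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming VLM call."""
    time.sleep(prefill_seconds)  # placeholder for image encoding + LLM prefill
    yield "Hello"
    yield " world"


ttft, token = measure_ttft(fake_model_stream)
print(f"TTFT: {ttft * 1000:.1f} ms, first token: {token!r}")
```

In a real measurement, the callable would wrap the model's streaming generation call, and the result would typically be averaged over several runs.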

High versatility

Larger variants built on the Qwen2-7B LLM outperform recent works while using a single image encoder and achieving a 7.9x faster TTFT.

Demo application

Includes a demo iOS app to showcase model performance on mobile devices.