FastVLM: Efficient Vision Encoding for Vision Language Models is an implementation of a vision language model built around a faster vision encoder, aimed at reducing the time spent encoding images.
FastViTHD, our novel hybrid vision encoder, outputs fewer tokens and significantly reduces encoding time for high-resolution images.
The smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and a smaller vision encoder.
Larger variants using the Qwen2-7B LLM outperform recent works while using a single image encoder and delivering 7.9x faster TTFT.
Includes a demo iOS app to showcase model performance on mobile devices.
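
TTFT, as used above, is the delay between submitting a request and receiving the first output token; for a vision language model that delay includes image encoding. A minimal Python sketch of how such a figure and the resulting speedup ratio could be measured is shown below. The names `time_to_first_token`, `speedup`, and `fake_model` are hypothetical and not part of FastVLM, and the fake model only sleeps to imitate latency.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_token(make_stream: Callable[[], Iterable[str]]) -> Tuple[str, float]:
    """Return the first token and the seconds elapsed until it arrived."""
    start = time.perf_counter()
    stream = iter(make_stream())  # includes any setup done when the stream is created
    first = next(stream)          # blocks until the first token is produced
    return first, time.perf_counter() - start


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def fake_model(delay_seconds: float) -> Callable[[], Iterable[str]]:
    """Hypothetical stand-in for a model: waits, then yields tokens."""
    def stream():
        # Assumption: the sleep stands in for image encoding plus prompt processing.
        time.sleep(delay_seconds)
        yield "A"
        yield " cat"
    return stream


if __name__ == "__main__":
    _, slow = time_to_first_token(fake_model(0.20))
    _, fast = time_to_first_token(fake_model(0.02))
    print(f"speedup: {speedup(slow, fast):.1f}x")
```

To time a real model, pass a function that starts generation and returns its token stream in place of `fake_model(...)`. Averaging over several runs would give a steadier figure than a single measurement.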