
Introduction
Vision-Language Models (VLMs) are rapidly becoming the core of many generative AI applications, from multimodal chatbots to automated content analysis systems. As open-source models mature, they offer promising alternatives to proprietary systems, enabling developers and enterprises to build cost-effective, scalable, and customizable AI solutions.
However, the growing number of VLMs presents a common dilemma: how do you choose the right model for your use case? It’s often a balancing act between output quality, latency, throughput, context length, and infrastructure cost.
This blog aims to simplify the decision-making process by providing detailed benchmarks and model descriptions for three leading open-source VLMs: Gemma-3-4B, MiniCPM-o 2.6, and Qwen2.5-VL-7B-Instruct. Our research team ran these benchmarks on our Compute Orchestration platform to ensure consistent and reliable results.
Now, let’s dive into the details of each model, starting with Gemma-3-4B.
Gemma-3-4B
Gemma-3-4B, part of Google’s latest Gemma 3 family of open multimodal models, is designed to handle both text and image inputs, producing coherent and contextually rich text responses. With support for up to 128K context tokens, 140+ languages, and tasks like text generation, image understanding, reasoning, and summarization, it’s built for production-grade applications across diverse use cases.
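To make the multimodal setup concrete, here is a minimal sketch of text + image inference with Gemma-3-4B using the Hugging Face transformers image-text-to-text pipeline (assuming a recent transformers release with Gemma 3 support). The image URL is a placeholder, and our benchmarks were run against a served endpoint rather than this offline pipeline.

```python
# Minimal sketch: text + image inference with Gemma-3-4B via the
# Hugging Face `image-text-to-text` pipeline. The image URL is a placeholder.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/sample-chart.png"},
            {"type": "text", "text": "Describe this chart in two sentences."},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])
```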
Benchmark Summary: Performance on L40S GPU
Before diving into the results, let’s quickly break down the key metrics used in the benchmarks:
- Latency per Token: The time it takes to generate each output token. Lower latency means faster responses, especially important for chat-like experiences.
- Time to First Token (TTFT): Measures how quickly the model generates the first token after receiving the input. It impacts perceived responsiveness in streaming generation tasks.
- End-to-End Throughput: The number of tokens the model can generate per second for a single request, considering the full request processing time. Higher end-to-end throughput means the model can efficiently generate output while keeping latency low.
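To illustrate how these metrics are derived, below is a simplified sketch that measures TTFT, latency per token, and end-to-end throughput for a single streaming request against an OpenAI-compatible endpoint (for example, one served by vLLM). The endpoint URL and model name are placeholders; the harness we actually used is more elaborate.

```python
# Simplified sketch: measuring TTFT, latency per token, and end-to-end
# throughput for one streaming request against an OpenAI-compatible server.
# Endpoint URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
num_tokens = 0

stream = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Explain vision-language models briefly."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # time to first token
        num_tokens += 1  # approximation: one streamed chunk ~ one token

end = time.perf_counter()
ttft = first_token_time - start
latency_per_token = (end - first_token_time) / max(num_tokens - 1, 1)
throughput = num_tokens / (end - start)  # end-to-end tokens/sec for this request

print(f"TTFT: {ttft:.3f}s  latency/token: {latency_per_token:.3f}s  throughput: {throughput:.2f} tok/s")
```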
Gemma-3-4B shows strong performance across text and image tasks, scaling well under concurrency (handling multiple requests simultaneously). Below are the key observations:
- Text-only tasks deliver the best performance:
  - Latency per token: 0.014 sec
  - Time to First Token (TTFT): 0.092 sec
  - End-to-end throughput: 323.49 tokens/sec
- Image input increases latency and reduces throughput as image size grows:
  - 256px images maintain high end-to-end throughput (307.19 tokens/sec)
  - Performance drops gradually with 512px and 1024px images
  - 2048px images see higher latency (3.69 sec/request) and reduced end-to-end throughput (182.10 tokens/sec)
- Concurrency scaling is efficient up to 16 concurrent requests:
  - At 2 concurrent requests: end-to-end throughput ranges from 300.48 tokens/sec (text-only) to 158.49 tokens/sec (2048px image), handling up to 48.75 Requests Per Minute (RPM) for text-only tasks
  - At 8 concurrent requests: end-to-end throughput ranges from 270.81 tokens/sec (text-only) to 118.18 tokens/sec (2048px image), handling up to 158.97 RPM for text-only tasks
  - At 16 concurrent requests: end-to-end throughput ranges from 196.02 tokens/sec (text-only) to 93.70 tokens/sec (2048px image), handling up to 165.77 RPM for text-only tasks
- At 32 concurrent requests, performance dips with large images:
  - Text-only end-to-end throughput: 127.29 tokens/sec, 164.88 RPM
  - 2048px image end-to-end throughput: 60.28 tokens/sec, 13.32 RPM
  - TTFT peaks at 8.59 sec for large images
Overall, Gemma-3-4B handles text-heavy tasks and medium image inputs (up to 1024px) efficiently even under high concurrency. Processing large images at maximum concurrency requires resource scaling to avoid performance drops. If you’re deciding which GPU to use for running this model, we have a blog comparing the A10 vs. L40S to help you choose the best hardware for your workload.
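For reference, the concurrency figures above can be roughly approximated with a simple load generator that fires N requests in parallel against the serving endpoint and reports requests per minute. The sketch below uses asyncio with the OpenAI async client; the endpoint URL, model name, and prompt are placeholders, and it reports aggregate numbers rather than the full per-request breakdown in our benchmark.

```python
# Simplified sketch of a concurrency test: send `concurrency` requests in
# parallel against an OpenAI-compatible endpoint and report RPM.
# Endpoint URL, model name, and prompt are placeholders.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="google/gemma-3-4b-it",
        messages=[{"role": "user", "content": "Summarize the benefits of VLMs."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def run(concurrency: int = 16, rounds: int = 4) -> None:
    start = time.perf_counter()
    total_requests, total_tokens = 0, 0
    for _ in range(rounds):
        results = await asyncio.gather(*[one_request() for _ in range(concurrency)])
        total_requests += len(results)
        total_tokens += sum(results)
    elapsed = time.perf_counter() - start
    print(f"RPM: {total_requests / elapsed * 60:.2f}  "
          f"aggregate throughput: {total_tokens / elapsed:.2f} tok/s")

asyncio.run(run())
```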
MiniCPM-o 2.6
MiniCPM-o 2.6 represents a major leap in end-side multimodal LLMs. It expands input modalities to images, video, audio, and text, offering real-time speech conversation and multimodal streaming support.
With an architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model boasts a total of 8 billion parameters. MiniCPM-o-2.6 demonstrates significant improvements over its predecessor, MiniCPM-V 2.6, and introduces real-time speech conversation, multimodal live streaming, and superior efficiency in token processing.
Benchmark Summary: Performance on L40S GPU
MiniCPM-o-2.6 demonstrates solid performance across various workloads and concurrency levels. Key observations:
- Text-only tasks (native transformers) perform consistently well:
  - Single-concurrency end-to-end throughput ranges from 8.97 to 37.69 RPM
  - Latency per token stays between 0.083s and 0.086s
  - At 32 concurrent requests, end-to-end throughput ranges from 35.41 to 78.03 RPM depending on scale-down settings
- Shared vLLM (text) boosts throughput significantly. Shared vLLM refers to serving the model with the vLLM inference engine within a shared node pool (see the sketch at the end of this section):
  - Single-concurrency end-to-end throughput reaches 52.65 RPM at 209.97 tokens/sec
  - Scales up to 707.61 RPM at 32 concurrent requests
  - Maintains low latency between 0.021s and 0.025s per token
- Image inputs introduce slight overhead but scale well up to 1024px:
  - 256px images: 244.38 requests/min at 32 concurrent requests, 181.26 tokens/sec
  - 512px images: 174.70 requests/min, 112.17 tokens/sec
  - 1024px images: 139.89 requests/min, 82.45 tokens/sec
  - Latency increases with image size
MiniCPM-o-2.6 handles both text and image-heavy tasks efficiently on L40S, scaling smoothly up to 32 concurrent requests. Shared vLLM significantly improves throughput while maintaining stable latency. Performance remains robust even with large image inputs.
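As a reference point for the Shared vLLM configuration mentioned above, here is a minimal sketch of text-only inference with MiniCPM-o-2.6 through the vLLM Python API. In the benchmark the model was served as an endpoint in a shared node pool rather than run offline like this, and the model ID and sampling settings here are illustrative.

```python
# Minimal sketch: text-only batched inference with MiniCPM-o-2.6 via the
# vLLM offline API. In the "Shared vLLM" benchmark the model was served as
# an OpenAI-compatible endpoint instead; model ID and settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,   # MiniCPM-o ships custom model code
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [
    "Summarize the trade-offs between latency and throughput in LLM serving.",
    "List three use cases for multimodal models.",
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```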
Qwen2.5-VL-7B-Instruct
Qwen2.5-VL is a vision-language model designed for visual recognition, reasoning, long video analysis, object localization, and structured data extraction.
Its architecture integrates window attention into the Vision Transformer (ViT), significantly improving both training and inference efficiency. Additional optimizations like SwiGLU activation and RMSNorm further align the ViT with the Qwen2.5 LLM, enhancing overall performance and consistency.
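To show what an image+text request looks like in practice, here is a minimal sketch of querying a served Qwen2.5-VL-7B-Instruct instance over an OpenAI-compatible API (e.g., a vLLM endpoint) with a simple document-understanding prompt. The endpoint URL and image URL are placeholders.

```python
# Minimal sketch: image + text request to a served Qwen2.5-VL-7B-Instruct
# instance over an OpenAI-compatible API (e.g., a vLLM endpoint).
# Endpoint URL and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/invoice.png"}},
                {"type": "text",
                 "text": "Extract the invoice number, date, and total as JSON."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```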
Benchmark Summary: Performance on L40S GPU
Qwen2.5-VL-7B-Instruct performs efficiently across text-only and image+text tasks at varying levels of concurrency.
Which VLM is Right for You?
Choosing the right Vision-Language Model (VLM) depends on your workload type and concurrency needs. Here’s how each model performs under different scenarios:
MiniCPM-o-2.6 delivers strong performance across both single and high-concurrency scenarios, particularly when served with vLLM. It scales efficiently up to 32 concurrent requests while maintaining high end-to-end throughput and low latency. This makes it ideal for general-purpose language tasks that require consistent performance under heavy loads. If your priority is high efficiency at scale, MiniCPM-o-2.6 is a top choice.
Gemma-3-4B is optimized for text-heavy workloads with low to moderate concurrency. It handles text and small images (256px–512px) efficiently but struggles with large images (1024px+) and high concurrency. If your use case involves mostly text processing with occasional small image inputs and you don’t need extreme scalability, Gemma-3-4B offers solid performance.
Qwen2.5-VL-7B-Instruct excels in vision-language applications requiring structured visual understanding, such as object localization, document processing, and reasoning tasks. It scales well with image inputs up to 512px, but larger images (1024px+) increase latency and reduce throughput. If your priority is accurate vision-language reasoning rather than raw throughput, Qwen2.5-VL is a strong choice.
Conclusion
We have seen the benchmarks across MiniCPM-o 2.6, Gemma-3-4B, and Qwen2.5-VL-7B-Instruct, covering latency, throughput, and scalability under different concurrency levels and image sizes. Each model performs differently depending on the task and workload requirements.
If you want to try out these models, we have launched a new AI Playground where you can explore them directly. We will continue adding the latest models to the platform, so keep an eye on our updates and join our Discord community for the latest announcements.
If you are also looking to deploy these open-source VLMs on your own dedicated compute, our platform supports production-grade inference and scalable deployments. You can quickly set up your own node pool and run inference efficiently. Check out the tutorial below to get started.
#Gemma #MiniCPM #Qwen