
vLLM beats Ollama in throughput by 4x at scale

vLLM is running the bigger model and still outperforms Ollama at scale

TLDR: vLLM running Qwen3-VL-4B in BF16 outperforms Ollama running the same model in Q4_K_M once you introduce concurrent requests. At a single request Ollama wins, as expected: the lighter quantization is faster when there is nothing to batch. Push more than 2-3 concurrent requests and vLLM pulls ahead. At 16 concurrent requests it is 4x faster in total throughput, with a larger, higher-precision model.

Code and test harness: michaelcizmar/vllm-vs-ollama 

Why vLLM beats Ollama

Ollama is built for convenience. It handles model loading, unloading, and switching automatically. It is great for a single user running one model at a time locally.

What it is not built for is concurrent throughput. Ollama does support parallel requests via OLLAMA_NUM_PARALLEL, and this benchmark ran it with 6 parallel slots, but that did not affect the outcome: Ollama still flatlines. The reason is architectural. Each parallel slot in Ollama pre-allocates its own fixed KV cache block, so you get 6x the memory usage for 6x the parallelism with no dynamic sharing between requests. Add more slots and you run out of VRAM before you see meaningful throughput gains.

vLLM handles this differently. Three things in particular make the difference:

Continuous batching. vLLM dynamically groups incoming requests into batches and processes them together. New requests join mid-flight. Finished requests leave mid-flight. The GPU stays saturated regardless of when requests arrive.
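
To make the idea concrete, here is a toy simulation (my own simplification, not vLLM's actual scheduler): continuous batching is a decode loop in which requests join and leave the running batch at token granularity.

```python
from collections import deque

def continuous_batching(arrivals, max_batch=4):
    """Toy simulation of continuous batching.

    `arrivals` maps a decode step to a list of (request_id, n_tokens)
    pairs arriving at that step. Each decode step generates one token
    for every active request; finished requests free their slot
    immediately, and waiting requests join mid-flight."""
    waiting = deque()
    active = {}        # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active or any(s >= step for s in arrivals):
        waiting.extend(arrivals.get(step, []))
        # new requests join the in-flight batch as soon as a slot is free
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        # one batched decode step: every active request emits one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished_at[rid] = step
        step += 1
    return finished_at
```

With `continuous_batching({0: [("a", 3), ("b", 5)], 2: [("c", 2)]})`, request "c" arrives while "a" and "b" are mid-generation and starts decoding on the very next step instead of waiting for the batch to drain, which is exactly the property a fixed-slot design lacks.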

PagedAttention. KV cache is managed in fixed-size pages shared across all concurrent sequences. Memory is used only for tokens that actually exist, not pre-allocated per slot. At 0.90 GPU memory utilization on an RTX 5090, vLLM pre-allocates 22GB of KV cache and holds up to 224,000 tokens in flight simultaneously across all active requests.
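
A quick back-of-the-envelope check on those figures, using the 22GB and 224,000-token numbers quoted above:

```python
# Sanity-check the startup numbers: ~22 GB of pre-allocated KV cache
# holding ~224,000 tokens implies a per-token KV footprint of roughly
# 100 KiB, paged out on demand across all active sequences.
kv_cache_bytes = 22 * 1024**3       # pre-allocated at 0.90 GPU utilization
max_tokens_in_flight = 224_000      # reported by vLLM on startup

bytes_per_token = kv_cache_bytes / max_tokens_in_flight
print(f"~{bytes_per_token / 1024:.0f} KiB of KV cache per token")
```

Because that pool is allocated in pages as tokens are generated, 27 concurrent sequences can share it dynamically instead of each pre-claiming a fixed slice the way Ollama's parallel slots do.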

CUDA graphs. vLLM compiles and captures the decode loop as a CUDA graph on startup. Each decode step replays a pre-compiled graph instead of relaunching kernels. Overhead per token drops significantly, especially at small batch sizes where kernel launch latency would otherwise dominate.

Ollama uses llama.cpp under the hood, which has none of these.

Docker makes vLLM usable

vLLM has a large setup surface area. CUDA versions, PyTorch compatibility, driver quirks on WSL2. The docker compose approach collapses all of that into a single file. Pull the image, set your HuggingFace token, run docker compose up.
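
For reference, a minimal compose file along those lines might look like the following. The image tag, port, and server flags here are illustrative assumptions, not copied from the repo:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest          # official vLLM serving image
    command: >
      --model Qwen/Qwen3-VL-4B-Instruct
      --gpu-memory-utilization 0.90
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}  # set in .env or your shell
    ports:
      - "8000:8000"                         # OpenAI-compatible API
    ipc: host                               # shared memory for worker processes
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

After `docker compose up`, any OpenAI-compatible client can point at http://localhost:8000/v1.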

One thing that is not obvious, and is a personal preference: mount your cache directories under the project folder, not your home directory. That way deleting the project also deletes the volumes, and everything about the project lives in one place.

```yaml
volumes:
  - ./cache/huggingface:/root/.cache/huggingface
  - ./cache/vllm:/root/.cache/vllm
```

The ./cache/vllm volume stores the torch compile artifacts and CUDA graph captures from first startup. Without it every cold start pays a 2-3 minute compilation penalty.

The test

I wrote a Python benchmark script that sends concurrent requests to both engines using asyncio.gather. All requests in a concurrency level fire simultaneously. The script measures total wall time, P50/P95 latency, and aggregate tokens per second.
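
Stripped down, the core of such a harness looks something like the sketch below. The HTTP call is abstracted behind a `send_fn` coroutine (a hypothetical wrapper you would implement against each engine's OpenAI-compatible endpoint); the `asyncio.gather` pattern and the aggregate stats are the point:

```python
import asyncio
import time

def percentile(sorted_vals, p):
    """Nearest-rank percentile over an already-sorted list."""
    i = min(int(len(sorted_vals) * p / 100), len(sorted_vals) - 1)
    return sorted_vals[i]

async def run_level(send_fn, concurrency):
    """Fire `concurrency` requests simultaneously and aggregate stats.

    `send_fn` is a coroutine that performs one request (e.g. a POST to
    the engine's /v1/chat/completions endpoint) and returns the number
    of completion tokens it produced."""
    async def timed():
        t0 = time.perf_counter()
        tokens = await send_fn()
        return time.perf_counter() - t0, tokens

    start = time.perf_counter()
    results = await asyncio.gather(*(timed() for _ in range(concurrency)))
    wall = time.perf_counter() - start

    latencies = sorted(lat for lat, _ in results)
    total_tokens = sum(tok for _, tok in results)
    return {
        "wall_s": wall,
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "tok_per_s": total_tokens / wall,  # aggregate throughput
    }
```

Running one `run_level` per concurrency level, first against one engine and then the other, produces the per-level numbers reported below.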

The task is multimodal knowledge graph extraction. Given a scanned document image, extract entities, relationships, and factual assertions as structured JSON. This is a real workload from a document intelligence pipeline, not a synthetic benchmark.

The script runs vLLM first, then pauses and prompts you to switch engines before running the identical test against Ollama. Results are compared side by side at the end.

Test setup: RTX 5090, WSL2 2.7.1, vLLM 0.17.1, 794KB JPEG document image. Ollama configured with OLLAMA_NUM_PARALLEL=6.

Performance

Note: This is not a pure apples-to-apples quantization comparison. vLLM serves Qwen/Qwen3-VL-4B-Instruct in BF16 (native HuggingFace weights). Ollama serves qwen3-vl:4b-instruct in Q4_K_M (~4.5-bit GGUF). Ollama’s lighter quantization gives it a single-request speed advantage. vLLM’s architecture gives it the throughput advantage under load.

| Concurrency | vLLM tok/s | Ollama tok/s | Speedup | vLLM P50 | Ollama P50 |
|------------:|-----------:|-------------:|--------:|---------:|-----------:|
| 1           | 123.3      | 221.6        | 0.56x   | 11.41s   | 5.19s      |
| 4           | 391.3      | 232.4        | 1.68x   | 13.46s   | 14.85s     |
| 8           | 439.2      | 230.0        | 1.91x   | 25.22s   | 26.92s     |
| 16          | 933.6      | 231.3        | 4.04x   | 22.25s   | 61.35s     |
| 27          | 816.6      | 229.2        | 3.56x   | 42.96s   | 76.41s     |

What the numbers say:

  • Ollama wins at concurrency=1. Q4_K_M is faster per token than BF16 with no batching benefit to offset it.
  • vLLM overtakes at concurrency=4. Continuous batching starts paying off.
  • Ollama flatlines at ~231 tok/s from concurrency=4 through concurrency=27. This is not because it is serial. It is because 6 fixed parallel slots with static KV allocation hit a memory and throughput ceiling that does not scale further.
  • vLLM peaks at concurrency=16 with 933 tok/s before hitting KV cache pressure from the large document images.
  • For any pipeline sending more than 2-3 concurrent requests, vLLM wins.

Bottom line

If you are running a single-user local assistant, use Ollama. It is simpler and fast enough.

If you are building a pipeline that processes documents, images, or any workload with concurrent requests, use vLLM. The setup cost is a one-time investment. The throughput difference is not marginal.

The repo has the full docker compose, benchmark script, and instructions for running both engines: michaelcizmar/vllm-vs-ollama 
