Updated May 2026
Most articles on private AI inference repeat the same line. vLLM is for production. Ollama is for hobbyists. We did not believe that without seeing the numbers, so we ran the same Mistral Small 3.2 24B Dense and Gemma 4 31B Dense models across three different machines, both backends, and a real workload. The results overturn the conventional wisdom on small-team deployments. They also explain why two identical NVIDIA RTX PRO 6000 Blackwell GPUs can give you a 3.1x throughput gap on the same software, depending on the CPU you paired them with.
This guide walks through the data, what it means for a small or mid-sized business standing up private AI, and the decision matrix Petronella Technology Group, Inc. uses when architecting in-house AI systems for clients. If you need to talk through a specific deployment, the direct line is 919-348-4912.
What we tested
Four model and backend combinations across three hosts, 12 single-user inference runs in total. The prompt was a real production task: write a 2,000-token cybersecurity and compliance briefing in HTML. Same prompt for every run, same generation cap. We logged tokens per second from each backend's own metrics, then cross-checked against client-side wall-clock.
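The cross-check between backend metrics and client-side timing can be sketched in a few lines. This is a minimal illustration, not the harness used for these runs. It assumes the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns in the final object of a generate call; the helper names, the sample values, and the 10 percent tolerance are hypothetical.

```python
def ollama_decode_tps(response: dict) -> float:
    """Decode throughput from the final JSON object of an Ollama generate call."""
    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def wall_clock_tps(output_tokens: int, elapsed_s: float) -> float:
    """Client-side throughput: tokens received over total request time."""
    return output_tokens / elapsed_s


def agrees(backend_tps: float, client_tps: float, tolerance: float = 0.10) -> bool:
    """True when the client-side figure is within `tolerance` of the backend figure."""
    return abs(backend_tps - client_tps) / backend_tps <= tolerance


# Hypothetical values for illustration only, not measurements from this article.
sample_response = {"eval_count": 2000, "eval_duration": 20_000_000_000}
backend = ollama_decode_tps(sample_response)  # 2000 tokens over 20 s of decode
client = wall_clock_tps(2000, 21.0)           # wall clock also covers prompt processing
print(f"backend {backend:.1f} tok/s, client {client:.1f} tok/s, agree: {agrees(backend, client)}")
```

The client-side number will normally come in a little lower, because wall-clock time includes prompt processing and HTTP overhead. A large disagreement is the signal to distrust the run.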
The three hosts:
- ai5, AMD Ryzen 9 9950X3D, 192 GB DDR5, NVIDIA RTX PRO 6000 Blackwell with 96 GB GDDR7 and roughly 1,792 GB/s of memory bandwidth.
- c1, Intel Core i9-9900K from 2018, 125 GB DDR4, the same RTX PRO 6000 Blackwell card on a PCIe Gen 3 platform.
- msi2, NVIDIA GB10 Grace Superchip (the DGX Spark generation), 128 GB unified LPDDR5X with roughly 273 GB/s of memory bandwidth, ARM-based CPU, integrated SoC.
The two backends:
- vLLM 0.21 with NVFP4 quantization. The reference engine for production inference, with PagedAttention, continuous batching, CUDA graph capture, and the most aggressive scheduler in open source.
- Ollama (llama.cpp) with Q4_K_M GGUF quantization. The popular single-binary stack that wraps llama.cpp for desktop use, well known for cold starts measured in seconds rather than minutes.
The headline table
All numbers are tokens per second, single user, 2,000 output tokens, deterministic prompt. Higher is better.
| Model | ai5 vLLM (NVFP4) | ai5 Ollama (Q4_K_M) | c1 vLLM (NVFP4) | c1 Ollama (Q4_K_M) | msi2 vLLM (NVFP4) | msi2 Ollama (Q4_K_M) |
|---|---|---|---|---|---|---|
| Mistral Small 3.2 24B Dense | 71.4 | 98.4 (+38%) | 69.0 | 31.6 (-54%) | 12.5 | 13.8 (+10%) |
| Gemma 4 31B Dense | 37.7 | 63.5 (+68%) | 36.4 | 21.9 (-40%) | 6.6 | 10.0 (+52%) |
Two patterns jump off the page. On ai5, Ollama is 38 to 68 percent faster than vLLM. On c1, with the same GPU, Ollama is 40 to 54 percent slower than vLLM. Same backend, same model, same GPU, opposite winners. The difference is the CPU.
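The percentages in the table follow directly from the raw tokens-per-second figures. A short sketch that recomputes them (the dictionary layout and function name are illustrative; the numbers are copied from the table):

```python
# Tokens per second from the headline table: (vLLM NVFP4, Ollama Q4_K_M).
RESULTS = {
    ("Mistral Small 3.2 24B", "ai5"): (71.4, 98.4),
    ("Mistral Small 3.2 24B", "c1"): (69.0, 31.6),
    ("Mistral Small 3.2 24B", "msi2"): (12.5, 13.8),
    ("Gemma 4 31B", "ai5"): (37.7, 63.5),
    ("Gemma 4 31B", "c1"): (36.4, 21.9),
    ("Gemma 4 31B", "msi2"): (6.6, 10.0),
}


def ollama_delta_pct(vllm_tps: float, ollama_tps: float) -> int:
    """Ollama throughput relative to vLLM, as a rounded signed percentage."""
    return round((ollama_tps / vllm_tps - 1) * 100)


for (model, host), (vllm_tps, ollama_tps) in RESULTS.items():
    print(f"{model:22s} {host:5s} {ollama_delta_pct(vllm_tps, ollama_tps):+d}%")

# Same GPU, same backend, different CPU: ai5 Ollama versus c1 Ollama on Mistral.
cpu_gap = RESULTS[("Mistral Small 3.2 24B", "ai5")][1] / RESULTS[("Mistral Small 3.2 24B", "c1")][1]
print(f"CPU gap: {cpu_gap:.1f}x")
```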
Surprise one: Ollama beats vLLM for single users on Blackwell
This is the headline result. For a single-user workload on a 2025 desktop CPU paired with an NVIDIA RTX PRO 6000 Blackwell, Ollama running Q4_K_M GGUF is 38 to 68 percent faster than vLLM running NVFP4. That is not a tiny win. On Gemma 4 31B that is the difference between 38 tokens per second and 64 tokens per second, the difference between a slow chatbot and one that feels live.
Three reasons explain it:
- vLLM 0.21 falls back to MARLIN dequantization on Blackwell consumer SKUs. Native FP4 tensor cores are present in the silicon, but the upstream vLLM kernels do not engage them on the PRO 6000 yet. The startup log says it plainly: "Your GPU does not have native support for FP4 computation." MARLIN dequantizes the weights to bf16 on the fly, then runs the matmul. That is extra work per token compared to a native FP4 path.
- Ollama's path through llama.cpp is highly tuned for single-stream decode. No scheduler queue, no PagedAttention indirection, no Python overhead, no request batching machinery. For one user generating one stream of tokens, the lean path wins.
- vLLM is engineered for the opposite regime. Its design wins when you have 8 to 100 concurrent requests, because continuous batching lets one GPU pass amortize across many users. At concurrency 1, you pay for that machinery and get nothing back.
This does not mean vLLM is bad. It means vLLM is correctly optimized for production multi-tenant serving. Before you draw conclusions about your real workload from a single-user benchmark of a multi-tenant engine, check whether you have one user or twenty.
Surprise two: the CPU matters more than you think
The most counterintuitive number in the table is the c1 Ollama column. Same GPU as ai5, same model, same Ollama version, same drivers. On ai5, Mistral Small 3.2 hits 98.4 tokens per second. On c1, the same configuration hits 31.6. A 3.1x gap from nothing but the CPU.
The Intel i9-9900K is a 2018 part: Coffee Lake, 8 cores at 14nm, no 3D V-Cache, no AVX-512. The Ryzen 9 9950X3D is a 2025 part with 16 cores, 128 MB of L3 cache (64 MB of it stacked 3D V-Cache), and a 4nm-class process. That is seven years of IPC improvements plus the V-Cache. While Ollama is decoding, the CPU has to launch a stream of small CUDA kernels for every token. With Q4_K_M GGUF, there is no CUDA graph capture by default, so each kernel launch is a real round trip through the driver. On a fast CPU that is cheap. On a 2018 CPU it caps GPU utilization.
We watched this happen live with nvidia-smi dmon during the c1 Ollama run. Memory clock and GPU clock were both at full boost, but SM utilization sat at 52 percent and memory utilization at 20 percent. The GPU was idle waiting on the CPU about half the time. The Blackwell was simply not being fed fast enough.
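One way to put a number on this kind of observation is to average the `sm` and `mem` columns that `nvidia-smi dmon` prints. The sketch below assumes the usual layout (a `# gpu ...` header line followed by one row per sample) and reads column positions from the header instead of hard-coding them, since the column set varies by driver version. The function names are illustrative, and the sample text is synthetic, not captured output from the c1 run.

```python
def parse_dmon(text: str) -> list[dict]:
    """Parse `nvidia-smi dmon` text into one dict per sample row."""
    columns = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # The name header starts with "gpu"; the units header ("Idx W C ...") is skipped.
            if fields and fields[0] == "gpu":
                columns = fields
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue  # ignore rows that do not match the header
        rows.append(dict(zip(columns, values)))
    return rows


def mean_column(rows: list[dict], name: str) -> float:
    """Average a numeric column, skipping '-' placeholders."""
    samples = [float(r[name]) for r in rows if r.get(name, "-") != "-"]
    if not samples:
        raise ValueError(f"no numeric samples for column {name!r}")
    return sum(samples) / len(samples)


# Synthetic sample for illustration only.
SAMPLE = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec   mclk   pclk
# Idx      W      C      C      %      %      %      %    MHz    MHz
    0    300     60      -     50     20      0      0  14000   2600
    0    300     60      -     54     19      0      0  14000   2600
    0    300     60      -     52     21      0      0  14000   2600
"""

rows = parse_dmon(SAMPLE)
print(f"mean SM {mean_column(rows, 'sm'):.0f}%, mean mem {mean_column(rows, 'mem'):.0f}%")
```

SM utilization well below 100 percent while clocks sit at full boost is the pattern to look for. It suggests the GPU is waiting on something upstream and is not itself saturated.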
vLLM hides this with CUDA graph capture. Many kernel launches collapse into one cudaGraphLaunch call, so the GPU gets a steady stream of work regardless of CPU latency. That is why c1 vLLM at 69.0 tokens per second matches ai5 vLLM at 71.4 within noise, even though c1 Ollama is 3x slower than ai5 Ollama.
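A toy model makes the mechanism concrete. It assumes kernel launches are issued asynchronously, so each token takes whichever is longer: the GPU's own work or the CPU's time to issue all the launches. Every number below is hypothetical and chosen only to show the shape of the effect. None of them are measurements from these hosts.

```python
def decode_estimate(gpu_ms_per_token: float,
                    launches_per_token: int,
                    launch_overhead_us: float) -> tuple[float, float]:
    """Return (tokens_per_second, gpu_utilization) under a simple launch-bound model."""
    cpu_ms_per_token = launches_per_token * launch_overhead_us / 1000.0
    # The slower side sets the pace: GPU-bound if the CPU keeps ahead, else launch-bound.
    token_ms = max(gpu_ms_per_token, cpu_ms_per_token)
    return 1000.0 / token_ms, gpu_ms_per_token / token_ms


# Hypothetical parameters: 10 ms of GPU work and 1,000 kernel launches per token.
GPU_MS, LAUNCHES = 10.0, 1000
FAST_CPU_US, SLOW_CPU_US = 5.0, 25.0  # assumed per-launch CPU cost

fast_eager = decode_estimate(GPU_MS, LAUNCHES, FAST_CPU_US)
slow_eager = decode_estimate(GPU_MS, LAUNCHES, SLOW_CPU_US)
slow_graph = decode_estimate(GPU_MS, 1, SLOW_CPU_US)  # one graph launch per token

for label, (tps, util) in [("fast CPU, per-kernel launches", fast_eager),
                           ("slow CPU, per-kernel launches", slow_eager),
                           ("slow CPU, one graph launch", slow_graph)]:
    print(f"{label:30s} {tps:6.1f} tok/s, GPU busy {util:.0%}")
```

In this model the fast CPU never becomes the bottleneck, the slow CPU caps throughput and leaves the GPU partly idle, and collapsing the launches into a single graph launch removes the CPU from the critical path. Real decode loops are messier, so treat this as intuition only.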
Practical takeaway for owners and IT directors: If you are buying a new private AI workstation and pairing a current Blackwell card with a five-year-old desktop, you are leaving more than half your inference budget on the floor. Pair the GPU with a 2024-or-newer Ryzen 9 or modern Xeon W, or accept that you have to run vLLM (not Ollama) to recover the GPU's potential.
Surprise three: file size beats parameter count on memory-bound hardware
On the NVIDIA GB10 Grace Superchip, Ollama still wins, but the gap depends on the model. Mistral Small 3.2 24B sees only a 10 percent advantage for Ollama. Gemma 4 31B sees a 52 percent advantage. Why the difference?
The GB10 has roughly 273 GB/s of LPDDR5X bandwidth, about one-seventh of what the RTX PRO 6000 has. On bandwidth-bound hardware, decode speed in tokens per second is approximately memory bandwidth divided by model file size, because every token requires streaming the active weights through the math units. Gemma 4 31B in NVFP4 is 30 GB on disk. The same model in Q4_K_M GGUF is 19 GB. The Q4_K_M file is roughly 37 percent smaller, and the throughput advantage on the GB10 is 52 percent, close to the 58 percent that the 30-to-19 file-size ratio predicts. Those two numbers are not a coincidence. Smaller weights, less traffic, faster decode.
For Mistral Small 3.2 24B the file sizes are closer (15 GB NVFP4 versus 13 GB Q4_K_M), so the gap collapses to 10 percent.
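The bandwidth argument reduces to simple arithmetic: if every decoded token streams the full weight file once, tokens per second cannot exceed memory bandwidth divided by file size. The sketch below applies that ceiling to the file sizes and GB10 bandwidth quoted above. The function names are illustrative, and the model ignores KV cache traffic and compute time, so real throughput lands below the ceiling.

```python
GB10_BANDWIDTH_GB_S = 273  # approximate LPDDR5X bandwidth quoted for the GB10


def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode tokens/s when each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


def predicted_advantage_pct(larger_gb: float, smaller_gb: float) -> int:
    """Throughput advantage of the smaller file if decode is purely bandwidth-bound."""
    return round((larger_gb / smaller_gb - 1) * 100)


# File sizes from the text: (NVFP4 GB, Q4_K_M GB), with measured GB10 tokens/s.
MODELS = {
    "Gemma 4 31B": {"sizes": (30, 19), "measured": (6.6, 10.0)},
    "Mistral Small 3.2 24B": {"sizes": (15, 13), "measured": (12.5, 13.8)},
}

for name, m in MODELS.items():
    nvfp4_gb, q4_gb = m["sizes"]
    vllm_tps, ollama_tps = m["measured"]
    print(f"{name}: ceiling {decode_ceiling_tps(GB10_BANDWIDTH_GB_S, nvfp4_gb):.1f} vs "
          f"{decode_ceiling_tps(GB10_BANDWIDTH_GB_S, q4_gb):.1f} tok/s, "
          f"predicted +{predicted_advantage_pct(nvfp4_gb, q4_gb)}%, "
          f"measured +{round((ollama_tps / vllm_tps - 1) * 100)}%")
```

For both models the measured advantage sits a little under the file-size prediction, which is what you would expect when bandwidth dominates decode time without being the only cost.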
If you are running inference on memory-bound hardware (DGX Spark, GB10, M-series Macs, anything with under 300 GB/s of effective bandwidth), prefer Q4_K_M GGUF and accept the smaller quality hit over the larger throughput hit you would take with NVFP4 plus a stack that does not yet exploit native FP4 tensor cores.
The decision matrix we use
When Petronella Technology Group, Inc. designs a private AI deployment for a small or mid-sized business, this is the matrix we walk through. It is opinionated, and that is the point. There is no general best engine. There is only the best engine for your hardware and your concurrency.
| Use case | Recommended engine | Why |
|---|---|---|
| One user, modern (2024+) CPU, dGPU | Ollama (Q4_K_M GGUF) | Cleanest single-stream path, 38 to 68 percent faster than vLLM at concurrency 1, fast cold start. |
| One user, older CPU (pre-2024), dGPU | vLLM (cudagraph on) | CUDA graph capture hides kernel dispatch latency, recovers GPU utilization the CPU cannot keep up with. |
| 4 to 100 concurrent users, dGPU | vLLM | PagedAttention plus continuous batching are the reason vLLM exists. The overhead earns its keep above concurrency 4. |
| SoC with under 300 GB/s bandwidth (GB10, Mac M-series) | Ollama (Q4_K_M) | Smaller weight files reduce memory traffic per token. Bandwidth is the bottleneck, file size is destiny. |
| Cold start in seconds matters (occasional use, dev laptop) | Ollama | Loads weights once and stays warm. vLLM warmup and graph capture take 30 to 90 seconds. |
| Bleeding-edge model not yet in Ollama's library | vLLM | Mistral, Cohere, and others ship to the vLLM ecosystem first. |
| Production deployment with SLOs and metrics | vLLM | Prometheus endpoints, OpenAI-compatible API, mature batch policies, predictable tail latency under load. |
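For teams that want the matrix in a form they can drop into a sizing script, here is one possible encoding. The field names, the order in which the rules are checked, and the treatment of two or three users as single-user territory are assumptions made for this sketch. The cold-start row is left out because it points to the same engine as the single-user rows.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    concurrent_users: int
    cpu_year: int                     # launch year of the host CPU
    mem_bandwidth_gb_s: float         # effective memory bandwidth of the accelerator
    needs_slo_metrics: bool = False   # production SLOs, Prometheus-style monitoring
    model_in_ollama_library: bool = True


def recommend_engine(d: Deployment) -> str:
    """Walk the decision matrix top-down; precedence here is an assumption."""
    if d.needs_slo_metrics or not d.model_in_ollama_library:
        return "vLLM"
    if d.concurrent_users >= 4:
        return "vLLM"
    if d.mem_bandwidth_gb_s < 300:
        return "Ollama (Q4_K_M)"
    if d.cpu_year < 2024:
        return "vLLM (cudagraph on)"
    return "Ollama (Q4_K_M GGUF)"


# The three benchmark hosts, each treated as a single-user workstation.
hosts = {
    "ai5": Deployment(1, 2025, 1792),
    "c1": Deployment(1, 2018, 1792),
    "msi2": Deployment(1, 2025, 273),
}
for name, deployment in hosts.items():
    print(f"{name}: {recommend_engine(deployment)}")
```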
A measurement gotcha that cost us 60 percent on the first run
Worth sharing because every team running benchmarks at home will hit this. Our first ai5 Mistral Small 3.2 vLLM number came in at 44.4 tokens per second, which was suspiciously far below the c1 number (69.0) and made no sense for a Ryzen 9 9950X3D. We almost wrote that result up as a Blackwell quirk.
The cause was background containers. Eight Sam voice-agent and document-retrieval containers were running on ai5 from earlier work. Each one showed under 0.1 percent CPU in docker stats, well below any threshold that would normally raise an alarm. We stopped the stack to see what would happen, and Mistral 3.2 immediately jumped from 44.4 to 71.4 tokens per second. A 60 percent throughput recovery from stopping containers that did not appear to be doing anything.
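The 60 percent figure is a recovery measured from the degraded number. Expressed as a penalty against the clean number, the same two measurements give a smaller percentage, which is worth keeping straight when budgeting for shared hosts. A quick check using the two figures above (the function names are illustrative):

```python
DEGRADED_TPS = 44.4  # ai5 vLLM with the background containers running
CLEAN_TPS = 71.4     # ai5 vLLM after stopping them


def recovery_pct(degraded: float, clean: float) -> int:
    """Gain from cleaning up the host, relative to the degraded figure."""
    return round((clean / degraded - 1) * 100)


def penalty_pct(degraded: float, clean: float) -> int:
    """Throughput lost to background load, relative to the clean figure."""
    return round((1 - degraded / clean) * 100)


print(f"recovery: +{recovery_pct(DEGRADED_TPS, CLEAN_TPS)}%")
print(f"penalty:  -{penalty_pct(DEGRADED_TPS, CLEAN_TPS)}%")
```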
The mechanism is some combination of CPU cache contention, scheduler slot competition, and PCIe bus attention. None of it shows up in normal monitoring because the containers are not consuming CPU time in any meaningful aggregate sense. They are just present, and presence is enough to disturb the very tight kernel-launch loop that single-stream inference depends on.
The lesson for anyone benchmarking or operating production inference: stop every non-essential service on the host. If you cannot stop them, plan for a 20 to 60 percent throughput penalty depending on how many there are and what they do. For a production inference node, do not host anything else. The hardware cost of a dedicated box is less than the throughput tax of sharing.
The concurrency crossover, with numbers
vLLM's claim to fame is concurrency. To put a number on it, we ran a 32-concurrent-request stress test against ai5 vLLM on the same Mistral Small 3.2 build. Aggregate throughput at concurrency 32 was 2,173 tokens per second. That is roughly 30 times higher than the single-user number, because continuous batching amortizes a single forward pass across many active sequences, with PagedAttention managing the KV cache memory for all of them.
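The 30x figure and the per-user speed it implies can be recomputed from the numbers above. Splitting aggregate throughput evenly across the 32 streams is an assumption of this sketch; individual streams may differ.

```python
SINGLE_USER_TPS = 71.4   # ai5 vLLM, Mistral Small 3.2, concurrency 1
AGGREGATE_TPS = 2173.0   # same build at concurrency 32
CONCURRENCY = 32

aggregate_speedup = AGGREGATE_TPS / SINGLE_USER_TPS
per_stream_tps = AGGREGATE_TPS / CONCURRENCY        # assumes an even split across streams
per_stream_retained = per_stream_tps / SINGLE_USER_TPS

print(f"aggregate speedup: {aggregate_speedup:.1f}x")
print(f"per-stream speed:  {per_stream_tps:.1f} tok/s "
      f"({per_stream_retained:.0%} of the single-user rate)")
```

Under that assumption, each of the 32 users still sees close to the single-user rate, which is the practical meaning of amortizing one forward pass across many sequences.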
Ollama and llama.cpp do support multiple parallel requests, but throughput saturates around concurrency 4 to 8 on the same hardware because there is no continuous batching at the kernel level. Two users running side by side on Ollama do not get the 2x throughput that vLLM delivers at the same load.
The practical rule from real client deployments: with fewer than three active users at any moment, default to Ollama. If you plan for four or more concurrent sessions, default to vLLM. The crossover point shifts a little based on prompt length and output length, but it is in that band.
The hardware story: NVIDIA Blackwell versus Grace
One more comparison the numbers expose. A single NVIDIA RTX PRO 6000 Blackwell, paired with the right CPU, is between 5 and 6 times faster than an NVIDIA GB10 Grace Superchip on the same model under vLLM. Mistral 24B at 71 versus 12. Gemma 31B at 38 versus 7. That ratio tracks closely with the bandwidth math: 1,792 GB/s divided by 273 GB/s is 6.6. For decode-heavy inference, memory bandwidth is destiny, and the Blackwell wins by a wide margin.
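The comparison between throughput ratio and bandwidth ratio can be recomputed from the vLLM columns of the headline table and the bandwidth figures quoted for each host (variable names are illustrative):

```python
BLACKWELL_GB_S = 1792  # RTX PRO 6000 Blackwell, approximate
GB10_GB_S = 273        # GB10, approximate

# vLLM tokens per second from the headline table: (ai5, msi2).
VLLM_TPS = {
    "Mistral Small 3.2 24B": (71.4, 12.5),
    "Gemma 4 31B": (37.7, 6.6),
}

bandwidth_ratio = BLACKWELL_GB_S / GB10_GB_S
print(f"bandwidth ratio: {bandwidth_ratio:.1f}x")

for model, (ai5_tps, msi2_tps) in VLLM_TPS.items():
    throughput_ratio = ai5_tps / msi2_tps
    print(f"{model}: {throughput_ratio:.1f}x throughput, "
          f"{throughput_ratio / bandwidth_ratio:.0%} of the bandwidth ratio")
```

Both models land at the same throughput ratio, somewhat below the raw bandwidth ratio, which is consistent with decode being dominated by memory traffic on both machines.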
The GB10 still has its niche. It holds 128 GB of unified memory, so larger or multiple smaller models can sit resident at once. It draws under 250 watts in a small, quiet chassis, which matters for edge or office deployments. And ARM plus unified memory has a performance-per-watt story for some workloads. But if you have the power, the cooling, and the budget for a Blackwell, the throughput delta is real. For more on hardware choices in private AI, see our private AI services overview.
How this fits a Petronella deployment
Our deployments lead with AI, then layer cybersecurity and compliance underneath because the two are inseparable. A private inference cluster on your own hardware solves data leakage out of the gate, but the cluster itself still has to be hardened, monitored, patched, and access-controlled. Otherwise the private AI you built to keep client files off public cloud becomes the single richest target on your network.
The standard architecture pattern we ship looks like this:
- vLLM on a private network for the multi-user chat, document-assistant, and RAG endpoints.
- Ollama on individual workstations for offline drafting, code assistance, and overnight batch generation.
- A retrieval layer that pulls only from your own document store, sandboxed and logged.
- The whole thing wrapped in a hardened identity layer, network segmentation per the relevant framework (CMMC, HIPAA, NIST AI RMF), and the Petronella encrypted data and email system for anything CUI or PHI adjacent.
- Full observability. Every prompt, every retrieval, every response routed through the audit pipeline that maps cleanly to NIST AI RMF and your compliance posture.
The combination is the differentiator. AI built right is fast. AI built right and secure is rare. Our 23 years in cybersecurity, four CMMC Registered Practitioners on staff (RPO #1449), and Craig Petronella's MIT AI Certificate plus North Carolina Licensed Digital Forensic Examiner credential (#604180) are the foundation we build private AI on.
FAQ
Is vLLM always faster than Ollama?
No. For single-user inference on a modern desktop CPU paired with an NVIDIA RTX PRO 6000 Blackwell, our benchmarks showed Ollama running Q4_K_M GGUF was 38 to 68 percent faster than vLLM running NVFP4. vLLM only pulls ahead under concurrent load or on hardware where its CUDA graph capture is needed to hide CPU dispatch overhead.
Why does the CPU matter so much for Ollama?
Ollama uses llama.cpp under the hood and does not enable CUDA graph capture by default. That means the CPU has to launch a stream of small CUDA kernels for every decoded token. On a 2018 Intel i9-9900K paired with the same Blackwell GPU, our test showed only 52 percent SM utilization, because the CPU could not dispatch kernels fast enough. On a 2025 Ryzen 9 9950X3D the same setup hit 98 tokens per second. The gap is 3.1x from the CPU alone.
What is NVFP4 and why does it not always win on Blackwell?
NVFP4 is NVIDIA's 4-bit floating point format introduced with Blackwell. It is designed to be matmul'd natively by Blackwell tensor cores. The catch is that stock vLLM 0.21 does not engage the native FP4 path on consumer-class Blackwell cards yet, so it falls back to MARLIN dequantization, which dequantizes weights to bf16 on the fly. That extra dequant step is what costs the throughput. The hardware will get faster when the software catches up, but as of mid-2026 the stack maturity gap is real.
Do I need a 70B+ model for my small business?
Most small and mid-sized businesses do not. A 24B to 31B dense model in Q4_K_M, hosted on a single RTX PRO 6000 Blackwell, beats most cloud APIs on response latency for SMB use cases (document drafting, internal Q&A, code assistance, RAG over a few hundred thousand documents). The case for 70B+ is narrower: long-context reasoning, multilingual quality, and certain agentic workflows. We size the model to the use case, not the other way around.
Can Ollama be used in production?
For single-workstation deployments and small teams (under three concurrent users), yes. For multi-user production with SLOs, monitoring, and predictable tail latency under load, no, and that is fine because that is exactly what vLLM is for. The right answer for most small businesses is both. vLLM for the shared chat service, Ollama for individual workstations.
What about llama.cpp directly?
Llama.cpp is what Ollama wraps. Running llama.cpp directly gives you slightly more control over server flags and kernel choice. For most teams the Ollama wrapper is worth the ten seconds of overhead because of model management, OpenAI-compatible API, and the ecosystem. For a single advanced operator who wants every flag exposed, llama.cpp's official server binary is the lower-level option.
Where can I read more on these engines?
The vLLM team publishes their architecture and tuning guides in the official vLLM documentation. NVIDIA's Blackwell architecture page covers the FP4 tensor core story. For risk and governance posture around private AI, the NIST AI Risk Management Framework is the current reference. Our blog has additional coverage on compliance and managed IT for businesses building private AI capacity.
What to do next
If you are evaluating private AI for your business and want to skip the trial-and-error stage, our team has already run the benchmarks and built the deployment patterns. We architect private inference clusters for small and mid-sized businesses across North Carolina and the Southeast, sized to actual concurrency needs, paired with the right CPU and GPU, and hardened to the relevant compliance framework from day one.
For a 15-minute private-AI architecture call, call 919-348-4912. Penny will route you to a CMMC Registered Practitioner on our team. No sales script, no boilerplate, just a working conversation about what your team actually needs and what it would cost to build. Or browse our AI services overview for context first.
Petronella Technology Group, Inc. has been securing North Carolina businesses since 2002 and architecting AI for them since the technology was usable in production. The benchmark numbers in this article came off our own bench. The deployment patterns came off real client engagements. Both are available to you.
Petronella Technology Group, Inc.
5540 Centerview Dr Suite 200
Raleigh, NC 27606
919-348-4912