All Posts Next

Updated May 2026

Most articles on private AI inference repeat the same line. vLLM is for production. Ollama is for hobbyists. We did not believe that without seeing the numbers, so we ran the same Mistral Small 3.2 24B Dense and Gemma 4 31B Dense models across three different machines, both backends, and a real workload. The results overturn the conventional wisdom on small-team deployments. They also explain why two identical NVIDIA RTX PRO 6000 Blackwell GPUs can give you a 3.1x throughput gap on the same software, depending on the CPU you paired them with.

This guide walks through the data, what it means for a small or mid-sized business standing up private AI, and the decision matrix Petronella Technology Group, Inc. uses when architecting in-house AI systems for clients. If you need to talk through a specific deployment, the direct line is 919-348-4912.

What we tested

Four model and backend combinations across three hosts, 12 single-user inference runs in total. The prompt was a real production task: write a 2,000-token cybersecurity and compliance briefing in HTML. Same prompt for every run, same generation cap. We logged tokens per second from each backend's own metrics, then cross-checked against client-side wall-clock.

The three hosts:

  • ai5, AMD Ryzen 9 9950X3D, 192 GB DDR5, NVIDIA RTX PRO 6000 Blackwell with 96 GB GDDR7 and roughly 1,792 GB/s of memory bandwidth.
  • c1, Intel Core i9-9900K from 2018, 125 GB DDR4, the same RTX PRO 6000 Blackwell card on a PCIe Gen 3 platform.
  • msi2, NVIDIA GB10 Grace Superchip (the DGX Spark generation), 128 GB unified LPDDR5X with roughly 273 GB/s of memory bandwidth, ARM-based CPU, integrated SoC.

The two backends:

  • vLLM 0.21 with NVFP4 quantization. The reference engine for production inference, with PagedAttention, continuous batching, CUDA graph capture, and the most aggressive scheduler in open source.
  • Ollama (llama.cpp) with Q4_K_M GGUF quantization. The popular single-binary stack that wraps llama.cpp for desktop use, well known for cold starts measured in seconds rather than minutes.

The headline table

All numbers are tokens per second, single user, 2,000 output tokens, deterministic prompt. Higher is better.

Modelai5 vLLM (NVFP4)ai5 Ollama (Q4_K_M)c1 vLLM (NVFP4)c1 Ollama (Q4_K_M)msi2 vLLM (NVFP4)msi2 Ollama (Q4_K_M)
Mistral Small 3.2 24B Dense71.498.4 (+38%)69.031.6 (-54%)12.513.8 (+10%)
Gemma 4 31B Dense37.763.5 (+68%)36.421.9 (-40%)6.610.0 (+52%)

Two patterns jump off the page. On ai5, Ollama is 38 to 68 percent faster than vLLM. On c1, with the same GPU, Ollama is 40 to 54 percent slower than vLLM. Same backend, same model, same GPU, opposite winners. The difference is the CPU.

Surprise one: Ollama beats vLLM for single users on Blackwell

This is the headline result. For a single-user workload on a 2025 desktop CPU paired with an NVIDIA RTX PRO 6000 Blackwell, Ollama running Q4_K_M GGUF is 38 to 68 percent faster than vLLM running NVFP4. That is not a tiny win. On Gemma 4 31B that is the difference between 38 tokens per second and 64 tokens per second, the difference between a slow chatbot and one that feels live.

Three reasons explain it:

  1. vLLM 0.21 falls back to MARLIN dequantization on Blackwell consumer SKUs. Native FP4 tensor cores are present in the silicon, but the upstream vLLM kernels do not engage them on the PRO 6000 yet. The startup log says it plainly: "Your GPU does not have native support for FP4 computation." MARLIN dequantizes the weights to bf16 on the fly, then runs the matmul. That is extra work per token compared to a native FP4 path.
  2. Ollama's path through llama.cpp is highly tuned for single-stream decode. No scheduler queue, no PagedAttention indirection, no Python overhead, no request batching machinery. For one user generating one stream of tokens, the lean path wins.
  3. vLLM is engineered for the opposite regime. Its design wins when you have 8 to 100 concurrent requests, because continuous batching lets one GPU pass amortize across many users. At concurrency 1, you pay for that machinery and get nothing back.

This does not mean vLLM is bad. It means vLLM is correctly optimized for production multi-tenant serving, and you should not benchmark a multi-tenant engine with a single user and conclude anything about your real workload until you check whether you have one user or twenty.

Surprise two: the CPU matters more than you think

The most counterintuitive number in the table is the c1 Ollama column. Same GPU as ai5, same model, same Ollama version, same drivers. On ai5, Mistral Small 3.2 hits 98.4 tokens per second. On c1, the same configuration hits 31.6. A 3.1x gap from nothing but the CPU.

The Intel i9-9900K is a 2018 part. Coffee Lake, 8 cores at 14nm, no 3D V-Cache, no AVX-512. The Ryzen 9 9950X3D is a 2025 part with 16 cores, 96 MB of stacked L3 cache, and 5nm process. Seven years of IPC improvements plus the V-Cache. While Ollama is decoding, the CPU has to launch a stream of small CUDA kernels every token. With Q4_K_M GGUF, there is no CUDA graph capture by default, so each kernel launch is a real round trip through the driver. On a fast CPU that is cheap. On a 2018 CPU it caps GPU utilization.

We watched this happen live with nvidia-smi dmon during the c1 Ollama run. Memory clock and GPU clock were both at full boost, but SM utilization sat at 52 percent and memory utilization at 20 percent. The GPU was idle waiting on the CPU about half the time. The Blackwell was simply not being fed fast enough.

vLLM hides this with CUDA graph capture. Many kernel launches collapse into one cudagraphLaunch, so the GPU gets a steady stream of work regardless of CPU latency. That is why c1 vLLM at 69.0 tokens per second matches ai5 vLLM at 71.4 within noise, even though c1 Ollama is 3x slower than ai5 Ollama.

Practical takeaway for owners and IT directors: If you are buying a new private AI workstation and pairing a current Blackwell card with a five-year-old desktop, you are leaving more than half your inference budget on the floor. Pair the GPU with a 2024-or-newer Ryzen 9 or modern Xeon W, or accept that you have to run vLLM (not Ollama) to recover the GPU's potential.

Surprise three: file size beats parameter count on memory-bound hardware

On the NVIDIA GB10 Grace Superchip, Ollama still wins, but the gap depends on the model. Mistral Small 3.2 24B sees only a 10 percent advantage for Ollama. Gemma 4 31B sees a 52 percent advantage. Why the difference?

The GB10 has roughly 273 GB/s of LPDDR5X bandwidth, about one-seventh of what the RTX PRO 6000 has. On bandwidth-bound hardware, decode speed is approximately model file size divided by memory bandwidth, because every token requires streaming the active weights through the math units. Gemma 4 31B in NVFP4 is 30 GB on disk. The same model in Q4_K_M GGUF is 19 GB. The Q4_K_M file is roughly 37 percent smaller, and the throughput advantage on the GB10 is 52 percent. Those two numbers are not a coincidence. Smaller weights, less traffic, faster decode.

For Mistral Small 3.2 24B the file sizes are closer (15 GB NVFP4 versus 13 GB Q4_K_M), so the gap collapses to 10 percent.

If you are running inference on memory-bound hardware (DGX Spark, GB10, M-series Macs, anything with under 300 GB/s of effective bandwidth), prefer Q4_K_M GGUF and accept the smaller quality hit over the larger throughput hit you would take with NVFP4 plus a stack that does not yet exploit native FP4 tensor cores.

The decision matrix we use

When Petronella Technology Group, Inc. designs a private AI deployment for a small or mid-sized business, this is the matrix we walk through. It is opinionated, and that is the point. There is no general best engine. There is only the best engine for your hardware and your concurrency.

Use caseRecommended engineWhy
One user, modern (2024+) CPU, dGPUOllama (Q4_K_M GGUF)Cleanest single-stream path, 38 to 68 percent faster than vLLM at concurrency 1, fast cold start.
One user, older CPU (pre-2024), dGPUvLLM (cudagraph on)CUDA graph capture hides kernel dispatch latency, recovers GPU utilization the CPU cannot keep up with.
4 to 100 concurrent users, dGPUvLLMPagedAttention plus continuous batching are the reason vLLM exists. The overhead earns its keep above concurrency 4.
SoC with under 300 GB/s bandwidth (GB10, Mac M-series)Ollama (Q4_K_M)Smaller weight files reduce memory traffic per token. Bandwidth is the bottleneck, file size is destiny.
Cold start in seconds matters (occasional use, dev laptop)OllamaLoads weights once and stays warm. vLLM warmup and graph capture take 30 to 90 seconds.
Bleeding-edge model not yet in Ollama's libraryvLLMMistral, Cohere, and others ship to the vLLM ecosystem first.
Production deployment with SLOs and metricsvLLMPrometheus endpoints, OpenAI-compatible API, mature batch policies, predictable tail latency under load.

A measurement gotcha that cost us 60 percent on the first run

Worth sharing because every team running benchmarks at home will hit this. Our first ai5 Mistral 3.2 vLLM number came in at 44.4 tokens per second, which was suspiciously close to the c1 number (69.0) and made no sense for a Ryzen 9 9950X3D. We almost wrote that result up as a Blackwell quirk.

The cause was background containers. Eight Sam voice-agent and document-retrieval containers were running on ai5 from earlier work. Each one showed under 0.1 percent CPU in docker stats, well below any threshold that would normally raise an alarm. We stopped the stack to see what would happen, and Mistral 3.2 immediately jumped from 44.4 to 71.4 tokens per second. A 60 percent throughput recovery from stopping containers that did not appear to be doing anything.

The mechanism is some combination of CPU cache contention, scheduler slot competition, and PCIe bus attention. None of it shows up in normal monitoring because the containers are not consuming CPU time in any meaningful aggregate sense. They are just present, and presence is enough to disturb the very tight kernel-launch loop that single-stream inference depends on.

The lesson for anyone benchmarking or operating production inference: stop every non-essential service on the host. If you cannot stop them, plan for a 20 to 60 percent throughput penalty depending on how many there are and what they do. For a production inference node, do not host anything else. The hardware cost of a dedicated box is less than the throughput tax of sharing.

The concurrency crossover, with numbers

vLLM's claim to fame is concurrency. To put a number on it, we ran a 32-concurrent-request stress test against ai5 vLLM on the same Mistral Small 3.2 build. Aggregate throughput at concurrency 32 was 2,173 tokens per second. That is roughly 30 times higher than the single-user number, because PagedAttention amortizes a single forward pass across many active sequences.

Ollama and llama.cpp do support multiple parallel requests, but throughput saturates around concurrency 4 to 8 on the same hardware because there is no continuous batching at the kernel level. Two users running side by side on Ollama do not get the 2x throughput that vLLM delivers at the same load.

The practical rule from real client deployments. Under three active users at any moment, default to Ollama. Plan for four or more concurrent sessions, default to vLLM. The crossover point shifts a little based on prompt length and output length, but it is in that band.

The hardware story: NVIDIA Blackwell versus Grace

One more comparison the numbers expose. A single NVIDIA RTX PRO 6000 Blackwell, paired with the right CPU, is between 5 and 6 times faster than an NVIDIA GB10 Grace Superchip on the same model. Mistral 24B at 71 versus 12. Gemma 31B at 38 versus 7. That ratio tracks exactly with the bandwidth math: 1,792 GB/s divided by 273 GB/s is 6.6. For decode-heavy inference, memory bandwidth is destiny, and the Blackwell wins by a wide margin.

The GB10 still has its niche. It holds 128 GB of unified memory, so larger or multiple smaller models can sit resident at once. It draws under 250 watts, fanless and quiet, which matters for edge or office deployments. And ARM plus unified memory has a cost-per-watt story for some workloads. But if you have the power, the cooling, and the budget for a Blackwell, the throughput delta is real. For more on hardware choices in private AI, see our private AI services overview.

How this fits a Petronella deployment

Our deployments lead with AI, then layer cybersecurity and compliance underneath because the two are inseparable. A private inference cluster on your own hardware solves data leakage out of the gate, but the cluster itself still has to be hardened, monitored, patched, and access-controlled. Otherwise the private AI you built to keep client files off public cloud becomes the single richest target on your network.

The standard architecture pattern we ship looks like this:

  • vLLM on a private network for the multi-user chat, document-assistant, and RAG endpoints.
  • Ollama on individual workstations for offline drafting, code assistance, and overnight batch generation.
  • A retrieval layer that pulls only from your own document store, sandboxed and logged.
  • The whole thing wrapped in a hardened identity layer, network segmentation per the relevant framework (CMMC, HIPAA, NIST AI RMF), and the Petronella encrypted data and email system for anything CUI or PHI adjacent.
  • Full observability. Every prompt, every retrieval, every response routed through the audit pipeline that maps cleanly to NIST AI RMF and your compliance posture.

The combination is the differentiator. AI built right is fast. AI built right and secure is rare. Our 23 years in cybersecurity, four CMMC Registered Practitioners on staff (RPO #1449), and Craig Petronella's MIT AI Certificate plus North Carolina Licensed Digital Forensic Examiner credential (#604180) are the foundation we build private AI on.

FAQ

Is vLLM always faster than Ollama?

No. For single-user inference on a modern desktop CPU paired with an NVIDIA RTX PRO 6000 Blackwell, our benchmarks showed Ollama running Q4_K_M GGUF was 38 to 68 percent faster than vLLM running NVFP4. vLLM only pulls ahead under concurrent load or on hardware where its CUDA graph capture is needed to hide CPU dispatch overhead.

Why does the CPU matter so much for Ollama?

Ollama uses llama.cpp under the hood and does not enable CUDA graph capture by default. That means the CPU has to launch a stream of small CUDA kernels for every decoded token. On a 2018 Intel i9-9900K paired with the same Blackwell GPU, our test showed only 52 percent SM utilization, because the CPU could not dispatch kernels fast enough. On a 2025 Ryzen 9 9950X3D the same setup hit 98 tokens per second. The gap is 3.1x from the CPU alone.

What is NVFP4 and why does it not always win on Blackwell?

NVFP4 is NVIDIA's 4-bit floating point format introduced with Blackwell. It is designed to be matmul'd natively by Blackwell tensor cores. The catch is that stock vLLM 0.21 does not engage the native FP4 path on consumer-class Blackwell cards yet, so it falls back to MARLIN dequantization, which dequantizes weights to bf16 on the fly. That extra dequant step is what costs the throughput. The hardware will get faster when the software catches up, but as of mid-2026 the stack maturity gap is real.

Do I need a 70B+ model for my small business?

Most small and mid-sized businesses do not. A 24B to 31B dense model in Q4_K_M, hosted on a single RTX PRO 6000 Blackwell, beats most cloud APIs on response latency for SMB use cases (document drafting, internal Q and A, code assistance, RAG over a few hundred thousand documents). The case for 70B+ is narrower. Long-context reasoning, multilingual quality, and certain agentic workflows. We size the model to the use case, not the other way around.

Can Ollama be used in production?

For single-workstation deployments and small teams (under three concurrent users), yes. For multi-user production with SLOs, monitoring, and predictable tail latency under load, no, and that is fine because that is exactly what vLLM is for. The right answer for most small businesses is both. vLLM for the shared chat service, Ollama for individual workstations.

What about llama.cpp directly?

Llama.cpp is what Ollama wraps. Running llama.cpp directly gives you slightly more control over server flags and kernel choice. For most teams the Ollama wrapper is worth the ten seconds of overhead because of model management, OpenAI-compatible API, and the ecosystem. For a single advanced operator who wants every flag exposed, llama.cpp's official server binary is the lower-level option.

Where can I read more on these engines?

The vLLM team publishes their architecture and tuning guides in the official vLLM documentation. NVIDIA's Blackwell architecture page covers the FP4 tensor core story. For risk and governance posture around private AI, the NIST AI Risk Management Framework is the current reference. Our blog has additional coverage on compliance and managed IT for businesses building private AI capacity.

What to do next

If you are evaluating private AI for your business and want to skip the trial-and-error stage, our team has already run the benchmarks and built the deployment patterns. We architect private inference clusters for small and mid-sized businesses across North Carolina and the Southeast, sized to actual concurrency needs, paired with the right CPU and GPU, and hardened to the relevant compliance framework from day one.

For a 15-minute private-AI architecture call, call 919-348-4912. Penny will route you to a CMMC Registered Practitioner on our team. No sales script, no boilerplate, just a working conversation about what your team actually needs and what it would cost to build. Or browse our AI services overview for context first.

Petronella Technology Group, Inc. has been securing North Carolina businesses since 2002 and architecting AI for them since the technology was usable in production. The benchmark numbers in this article came off our own bench. The deployment patterns came off real client engagements. Both are available to you.

Petronella Technology Group, Inc.
5540 Centerview Dr Suite 200
Raleigh, NC 27606
919-348-4912

Need help implementing these strategies? Our cybersecurity experts can assess your environment and build a tailored plan.
Get Free Assessment

About the Author

Craig Petronella, CEO and Founder of Petronella Technology Group
CEO, Founder & AI Architect, Petronella Technology Group

Craig Petronella founded Petronella Technology Group in 2002 and has spent 20+ years professionally at the intersection of cybersecurity, AI, compliance, and digital forensics. He holds the CMMC Registered Practitioner credential issued by the Cyber AB and leads Petronella as a CMMC-AB Registered Provider Organization (RPO #1449). Craig is an NC Licensed Digital Forensics Examiner (License #604180-DFE) and completed MIT Professional Education programs in AI, Blockchain, and Cybersecurity. He also holds CompTIA Security+, CCNA, and Hyperledger certifications.

He is an Amazon #1 Best-Selling Author of 15+ books on cybersecurity and compliance, host of the Encrypted Ambition podcast (95+ episodes on Apple Podcasts, Spotify, and Amazon), and a cybersecurity keynote speaker with 200+ engagements at conferences, law firms, and corporate boardrooms. Craig serves as Contributing Editor for Cybersecurity at NC Triangle Attorney at Law Magazine and is a guest lecturer at NCCU School of Law. He has served as a digital forensics expert witness in federal and state court cases involving cybercrime, cryptocurrency fraud, SIM-swap attacks, and data breaches.

Under his leadership, Petronella Technology Group has served hundreds of regulated SMB clients across NC and the southeast since 2002, earned a BBB A+ rating every year since 2003, and been featured as a cybersecurity authority on CBS, ABC, NBC, FOX, and WRAL. The company leverages SOC 2 Type II certified platforms and specializes in AI implementation, managed cybersecurity, CMMC/HIPAA/SOC 2 compliance, and digital forensics for businesses across the United States.

CMMC-RP NC Licensed DFE MIT Certified CompTIA Security+ Expert Witness 15+ Books
Related Service
Protect Your Business with Our Cybersecurity Services

Our proprietary 39-layer ZeroHack cybersecurity stack defends your organization 24/7.

Explore Cybersecurity Services
All Posts Next
Free cybersecurity consultation available Schedule Now