Updated May 2026. We took two identical NVIDIA RTX PRO 6000 Blackwell graphics cards, dropped one into a 2025 AMD Ryzen 9 9950X3D workstation and the other into a 2018 Intel i9-9900K box. Same model. Same Ollama version. Same prompt. The new workstation ran 3.1 times faster on dense models. That is the kind of gap that breaks budgets and confuses procurement. The surprising part is that the gap all but disappears under the right software stack or model architecture, which means most buyers are about to pay for the wrong half of their next private AI workstation.

This post documents the benchmark, explains the root cause, and gives a four quadrant decision matrix you can hand to a vendor. We are Petronella Technology Group, Inc., a Raleigh based cybersecurity and private AI firm that has been delivering technology services since 2002, and we run these benchmarks because clients are buying real hardware based on guidance that has not caught up to the 2026 reality.

The conventional wisdom is half right and that is the dangerous part

Every blog post about local LLM inference says the same thing. GPU first. Get the most VRAM and the fastest memory bandwidth your budget allows. CPU is fine, any modern desktop chip will do. PCIe generation does not move the needle once weights are resident on the card. That advice is correct for the workloads most people benchmark and for production inference servers that handle dozens of concurrent users. It is correct for almost every dimension of LLM serving except for one specific quadrant that happens to be the most popular use case for solo developers and small businesses: a single user running a dense model through Ollama or another llama.cpp wrapper.

In that exact quadrant the CPU choice can cost you a 3.1 times throughput penalty on the same GPU. Most buyer guidance has not caught up to the measurement, and we keep seeing budgets that spec a brand new 96 GB Blackwell card next to a recycled seven year old desktop CPU. Below is the proof and the cleanup procedure.

The two workstations we tested

Both machines have the same NVIDIA RTX PRO 6000 Blackwell. Both run the same Ubuntu 24.04. Both use the same NVIDIA driver branch. Both use the same Ollama 0.24 with the same Q4_K_M GGUF weights. The only variable that changed between the two boxes was the CPU and the chipset around it.

  • ai5 (new): AMD Ryzen 9 9950X3D, 16 cores and 32 threads, 3D V-Cache, 4 nanometer compute dies, released 2025. 192 GB DDR5. PCIe Gen 5 to the GPU.
  • c1 (old): Intel Core i9-9900K, 8 cores and 16 threads, Coffee Lake refresh, 14 nanometer process, released October 2018. 125 GB DDR4. PCIe Gen 3 to the GPU.

The two CPUs are separated by roughly seven years of process node and architecture progress. The AMD chip uses 3D stacked SRAM (V-Cache) bonded to one of its compute dies, which gives that die a 96 MB L3 pool. The Intel chip predates that technique entirely. We picked this matchup because it represents the worst case a buyer might encounter: someone reusing an old gamer or workstation chassis with a brand new GPU rather than buying a complete new machine.

The numbers

All numbers are tokens per second on a single user request streaming 2,000 output tokens with our standard CMMC blog generation prompt. Higher is better. Same GPU (PRO 6000 Blackwell), same model weights, and same software stack within each comparison. The only variable is the CPU and the chipset.
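For readers who want to take a comparable measurement on their own box, here is a minimal sketch of reading decode throughput from Ollama's local HTTP API. It is not the harness used for the numbers in this post. It assumes Ollama is listening on its default local port and uses a non-streaming request for simplicity. The model tag and prompt you pass in are your own choices.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def decode_tokens_per_second(response: dict) -> float:
    """Decode throughput from a non-streaming /api/generate response.

    eval_count is the number of generated tokens and eval_duration is
    reported in nanoseconds, so the ratio is scaled to seconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_once(model: str, prompt: str, max_tokens: int = 2000) -> float:
    """Send one request and return decode tokens per second."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return decode_tokens_per_second(json.load(reply))
```

Calling run_once several times with the same model tag and prompt, and comparing the results across machines, gives the same kind of single user number reported here.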

Dense models on Ollama: the painful gap

  • Mistral Small 3.2 24B Dense (Q4_K_M): ai5 = 98.4 tokens per second. c1 = 31.6 tokens per second. That is a 3.1 times penalty for the old CPU.
  • Gemma 4 31B Dense (Q4_K_M): ai5 = 63.5 tokens per second. c1 = 21.9 tokens per second. A 2.9 times penalty.

That is not measurement noise. We re-ran the bench three times on each box, with the GPU at default boost and with idle background containers stopped. The gap is real and reproducible.

Dense models on vLLM: the gap mostly vanishes

  • Mistral Small 3.2 24B Dense (NVFP4): ai5 = 71.4 tokens per second. c1 = 69.0 tokens per second. That is a gap of about 3.4 percent.
  • Gemma 4 31B Dense (NVFP4): ai5 = 37.7 tokens per second. c1 = 36.4 tokens per second. Also a gap of about 3.4 percent.

Same GPU. Same prompt. Different inference engine. The seven year old CPU now keeps up with the brand new one. We explain why below, but the headline is that the same physical hardware can be either 3.1 times slower or basically tied depending on a software choice.

Mixture of experts models on Ollama and llama.cpp: gap also vanishes

  • Mistral 4 119B Mixture of Experts (llama.cpp): ai5 = 184.3 tokens per second. c1 = 189.5 tokens per second. The old machine was actually faster, by a hair.
  • gpt-oss 20B Mixture of Experts (Ollama): ai5 = 261.3 tokens per second. c1 = 244.6 tokens per second. The old box ran at 94 percent of the new one.
  • gpt-oss 120B Mixture of Experts (Ollama): ai5 = 179.1 tokens per second. c1 = 176.6 tokens per second. The old box ran at 98 percent of the new one.

Three different mixture of experts models, three different sizes, two very different software stacks (llama.cpp directly for Mistral 4, Ollama for gpt-oss). All three landed at parity. The seven year old CPU is fine for these workloads. The new CPU is wasted budget if mixture of experts is all you plan to run.
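The ratios quoted across the three result groups can be recomputed from the raw tokens per second figures. The short sketch below does only that arithmetic. The numbers are copied from the bullets above, and the helper name is ours.

```python
# (model, engine, ai5 tokens/s, c1 tokens/s), copied from the results above
RESULTS = [
    ("Mistral Small 3.2 24B", "Ollama", 98.4, 31.6),
    ("Gemma 4 31B", "Ollama", 63.5, 21.9),
    ("Mistral Small 3.2 24B", "vLLM", 71.4, 69.0),
    ("Gemma 4 31B", "vLLM", 37.7, 36.4),
    ("Mistral 4 119B MoE", "llama.cpp", 184.3, 189.5),
    ("gpt-oss 20B MoE", "Ollama", 261.3, 244.6),
    ("gpt-oss 120B MoE", "Ollama", 179.1, 176.6),
]


def new_over_old(ai5_tps: float, c1_tps: float) -> float:
    """How many times faster the new workstation is than the old one."""
    return ai5_tps / c1_tps


for model, engine, ai5_tps, c1_tps in RESULTS:
    ratio = new_over_old(ai5_tps, c1_tps)
    print(f"{model:<24} {engine:<10} ai5/c1 = {ratio:.2f}x")
```

A ratio near 1.00 means the old CPU keeps up. A ratio near 3 is the dense Ollama penalty.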

What the GPU profiler actually showed

The first thing we did when c1 came in at 31.6 tokens per second was run nvidia-smi dmon during decode and watch the clocks and utilization counters live. We expected to see one of two patterns. Either the card had thermal throttled, in which case clocks would be capped well below boost, or the card was running at full clocks but starved on something. The actual readout was the second pattern but more extreme than we expected.

  • Memory clock: 13,365 megahertz. That is the PRO 6000 Blackwell at full boost.
  • SM clock: 2,790 megahertz. Also full boost.
  • SM utilization: 52 percent. The streaming multiprocessors were idle nearly half the time.
  • Memory bandwidth utilization: 20 percent. The high bandwidth GDDR7 path to model weights was almost completely idle.

The card had nothing to do. The CPU could not issue kernel launches fast enough to keep the work queue full. On a single user request, Ollama and the underlying llama.cpp runtime launch small CUDA kernels one after another for each layer of each token. Every kernel launch carries a fixed CPU overhead: parameter marshalling, driver call, command buffer write. On a 2018 CPU at decode time, that fixed overhead is large enough relative to the actual matmul time that the GPU spends most of its life waiting for the next instruction. On a 2025 CPU with 3D V-Cache, the same overhead is small enough that the GPU stays fed.
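The starvation argument can be sketched as a toy pipeline model. This is our simplification, not a measurement. It assumes launches are issued asynchronously, so each step costs whichever is larger, the GPU time for the kernel or the CPU time to launch the next one. All three input numbers in the example are made up for illustration and are not readings from ai5 or c1.

```python
def decode_rate(kernels_per_token: int,
                gpu_us_per_kernel: float,
                cpu_us_per_launch: float) -> tuple[float, float]:
    """Toy model of launch-bound decode.

    Returns (tokens per second, fraction of time the GPU is busy).
    Each step is paced by the slower of the GPU kernel and the CPU launch.
    """
    step_us = max(gpu_us_per_kernel, cpu_us_per_launch)
    tokens_per_second = 1e6 / (kernels_per_token * step_us)
    gpu_busy_fraction = gpu_us_per_kernel / step_us
    return tokens_per_second, gpu_busy_fraction


# Hypothetical inputs: 1000 kernels per token, 10 microseconds of GPU work each.
fast_cpu = decode_rate(1000, 10.0, 8.0)   # launch cost hidden behind GPU work
slow_cpu = decode_rate(1000, 10.0, 20.0)  # launch cost sets the pace
print(fast_cpu, slow_cpu)
```

In this toy, once the launch cost exceeds the kernel time, throughput falls in proportion and the GPU sits idle for the difference, which is the shape of the readout above.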

This is not a PCIe bandwidth issue. We tested it. Once weights are GPU resident, PCIe carries only the request and response bytes, which are kilobytes per token. A Gen 3 x16 link has thousands of times the headroom needed for that traffic. The bottleneck is exactly what the profiler said it was: CPU dispatch latency.
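The pattern described above (full clocks, low utilization) can be checked with a small script. This sketch uses nvidia-smi's --query-gpu interface rather than the dmon view we watched live, and its utilization.memory field is only a rough stand-in for memory bandwidth utilization. The thresholds are hypothetical defaults chosen for illustration, not values from our testing.

```python
import subprocess

QUERY_FIELDS = "clocks.sm,clocks.mem,utilization.gpu,utilization.memory"


def read_gpu_sample() -> str:
    """Return one CSV line for the first GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().splitlines()[0]


def parse_sample(line: str) -> dict:
    """Turn 'sm_mhz, mem_mhz, sm_util, mem_util' into a dict of floats."""
    sm_mhz, mem_mhz, sm_util, mem_util = (float(x) for x in line.split(","))
    return {"sm_mhz": sm_mhz, "mem_mhz": mem_mhz,
            "sm_util": sm_util, "mem_util": mem_util}


def looks_dispatch_starved(sample: dict, boost_sm_mhz: float,
                           max_sm_util: float = 60.0,
                           max_mem_util: float = 30.0) -> bool:
    """True when clocks are near boost but the GPU is mostly waiting.

    Thresholds are illustrative assumptions, not calibrated values.
    """
    at_boost = sample["sm_mhz"] >= 0.95 * boost_sm_mhz
    return (at_boost
            and sample["sm_util"] <= max_sm_util
            and sample["mem_util"] <= max_mem_util)
```

Feeding it the c1 readout (2,790 MHz SM clock, 52 percent SM utilization, 20 percent memory utilization) flags the run as starved. A thermally throttled card would fail the clock check instead.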

Why vLLM escapes the trap

vLLM has a feature called cudagraph capture. The idea is simple but the impact is large. Instead of issuing N separate kernel launches per layer per token, vLLM records the entire forward pass once, captures it as a single CUDA Graph object, and then replays the whole thing with one driver call per token. The CPU does the heavy work of building the graph at startup. After that, the per-token CPU cost collapses to a single cudaGraphLaunch call. The documentation lives at docs.vllm.ai and the cudagraph behavior is documented in the engine architecture and performance tuning sections.
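The difference in CPU work can be illustrated with a toy that only counts driver calls. It is not vLLM code and does not touch a GPU. ToyDriver and both decode functions are hypothetical names invented for this sketch, and the kernels are stand-in Python callables.

```python
from typing import Callable, List

Kernel = Callable[[], None]


class ToyDriver:
    """Stand-in for the CUDA driver that counts how often the CPU calls it."""

    def __init__(self) -> None:
        self.calls = 0

    def launch_kernel(self, kernel: Kernel) -> None:
        self.calls += 1          # one CPU-side call per kernel
        kernel()

    def launch_graph(self, graph: List[Kernel]) -> None:
        self.calls += 1          # one CPU-side call for the whole forward pass
        for kernel in graph:
            kernel()


def decode_eager(driver: ToyDriver, kernels: List[Kernel], tokens: int) -> None:
    """Launch every kernel individually for every token."""
    for _ in range(tokens):
        for kernel in kernels:
            driver.launch_kernel(kernel)


def decode_with_graph(driver: ToyDriver, kernels: List[Kernel], tokens: int) -> None:
    """Record the kernel sequence once, then replay it once per token."""
    graph = list(kernels)        # "capture" happens once, up front
    for _ in range(tokens):
        driver.launch_graph(graph)


forward_pass = [lambda: None] * 300   # hypothetical kernel count per token
eager, graphed = ToyDriver(), ToyDriver()
decode_eager(eager, forward_pass, tokens=5)
decode_with_graph(graphed, forward_pass, tokens=5)
print(eager.calls, graphed.calls)
```

The GPU-side work is identical in both paths. Only the number of CPU-side calls changes, which is why a slow CPU stops mattering.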

When we ran the same Mistral Small 3.2 24B model on vLLM on c1, the workstation that had been stuck at 31.6 tokens per second on Ollama now did 69.0 tokens per second. The GPU did not get faster. The CPU did not get faster. The software stopped asking the CPU to do per-kernel work, and the GPU ran at the rate the memory bandwidth allowed. That is the entire story.

This also explains the second mystery in the data. ai5 with vLLM ran at 71.4 tokens per second on the same prompt while ai5 with Ollama hit 98.4 tokens per second. Why was the new workstation faster on Ollama than on vLLM? Because Ollama has lower per-request overhead than vLLM in single user mode. vLLM was built for batched concurrent serving. Its Python scheduler, PagedAttention indirection, and continuous batching state machine add overhead that pays for itself when you have eight or more simultaneous requests, but it costs you on a single stream. The cudagraph dispatch helps the CPU but the scheduler overhead is still there.

Said differently: vLLM is the great equalizer. It hides bad CPUs and it caps good CPUs. Ollama exposes the CPU choice. The faster the CPU, the more headroom Ollama can extract from the GPU on a single user workload.

Why mixture of experts models escape the trap

The other escape hatch in the data is model architecture. Mixture of experts (MoE) models like gpt-oss 20B, gpt-oss 120B, and Mistral 4 119B keep all of their parameters loaded in memory but route each token through only a small fraction of them. The architecture is documented in detail at the llama.cpp repository. Practical active parameter counts on the models we tested:

  • gpt-oss 20B: 3.6 billion active parameters per token out of roughly 20 billion total.
  • gpt-oss 120B: 5.1 billion active parameters per token out of roughly 120 billion total.
  • Mistral 4 119B: 6.5 billion active parameters per token out of roughly 119 billion total.

With fewer active parameters per token, each token needs far less compute and memory traffic than it would on a dense model of the same total size. Our working explanation for the parity we measured is that the GPU has enough work queued between launches that the CPU has breathing room to prepare the next one. The CPU dispatch overhead does not disappear, but it stops being the bottleneck. Net effect: an old CPU keeps up because the per-kernel CPU work is small relative to the per-kernel GPU work.

This is why c1 with its 2018 Intel processor ran gpt-oss 120B at 98 percent of ai5's speed. A 120 billion parameter model sounds intimidating but in MoE form it only activates 5.1 billion parameters per token. The GPU spends enough time on those activations that the CPU has time to prepare the next layer. Old CPU is fine.
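The active fraction for each model follows directly from the parameter counts listed above. This small sketch only does that division. The totals are the rounded figures from the list, so the percentages are approximate.

```python
# (active billions per token, total billions), copied from the list above
MOE_MODELS = {
    "gpt-oss 20B": (3.6, 20.0),
    "gpt-oss 120B": (5.1, 120.0),
    "Mistral 4 119B": (6.5, 119.0),
}


def active_fraction(active_billion: float, total_billion: float) -> float:
    """Share of the model's parameters that are used for each token."""
    return active_billion / total_billion


for name, (active, total) in MOE_MODELS.items():
    share = active_fraction(active, total) * 100
    print(f"{name:<16} {share:.1f} percent of parameters active per token")
```

By this arithmetic, gpt-oss 120B touches under a twentieth of its weights on any given token.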

The four quadrant decision matrix

Put model architecture on one axis and inference engine on the other. You get four combinations. Three of them tolerate an old CPU. Only one demands a new one.

Quadrant 1: Dense model + Ollama (the painful one)

Modern CPU required. You will pay a 2.9 to 3.1 times throughput penalty for using a 2018 era CPU. Spec a recent desktop or workstation chip (2023 or newer) with high single thread performance and large L3 cache. Examples that work well: AMD Ryzen 9 9950X3D, Ryzen 9 7950X3D, Intel Core i9-14900K, Intel Xeon W-3500 series. If your only workload is single user dense model serving (chat, coding agent, writing assistant), the CPU choice is the most consequential single decision you will make after the GPU. The AMD Ryzen 9 9950X3D is documented at amd.com, and the older Intel i9-9900K archive lives at intel.com for spec comparison.

Quadrant 2: Dense model + vLLM (CPU does not matter)

Any reasonable CPU is fine. cudagraph capture neutralizes the per token CPU dispatch cost. A 2018 Intel chip ran within about 3.4 percent of a brand new Ryzen on the same dense models. If you intend to standardize on vLLM, redirect the CPU budget to extra GPU memory, extra storage, or a second GPU. The cudagraph mechanism is documented in the vLLM documentation.

Quadrant 3: MoE model + Ollama (CPU does not matter)

Any reasonable CPU is fine. The low active parameter count gives the GPU enough work per kernel that CPU dispatch overhead stops being the bottleneck. gpt-oss 120B on a 2018 Intel chip ran at 98 percent of the speed of the same model on a 2025 AMD chip. If you can commit to MoE workloads only (gpt-oss, Mistral 4, Qwen MoE, DeepSeek MoE), the CPU budget can go straight into more VRAM.

Quadrant 4: MoE model + vLLM (CPU really does not matter)

Any reasonable CPU is fine. Both escape hatches are engaged at the same time. cudagraph collapses the kernel launches and the low active parameter count keeps per kernel GPU work substantial. This is the most CPU tolerant quadrant of the four.
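The four quadrants reduce to a small lookup. The sketch below encodes the matrix as described in this post for single user serving. The function names and accepted strings are our own choices for illustration.

```python
DENSE, MOE = "dense", "moe"
LAUNCH_PER_KERNEL_ENGINES = {"ollama", "llama.cpp"}
GRAPH_CAPTURE_ENGINES = {"vllm"}


def quadrant(architecture: str, engine: str) -> int:
    """Map a (model architecture, inference engine) pair to quadrant 1 to 4."""
    arch, eng = architecture.lower(), engine.lower()
    if arch not in (DENSE, MOE):
        raise ValueError(f"unknown architecture: {architecture}")
    if eng in LAUNCH_PER_KERNEL_ENGINES:
        return 1 if arch == DENSE else 3
    if eng in GRAPH_CAPTURE_ENGINES:
        return 2 if arch == DENSE else 4
    raise ValueError(f"unknown engine: {engine}")


def needs_modern_cpu(architecture: str, engine: str) -> bool:
    """Only quadrant 1 (dense model on Ollama or llama.cpp) demands a new CPU."""
    return quadrant(architecture, engine) == 1


for arch in (DENSE, MOE):
    for eng in ("ollama", "vllm"):
        print(arch, eng, quadrant(arch, eng), needs_modern_cpu(arch, eng))
```

Handing a vendor the model list and the engine is enough to evaluate this function, which is the point of deciding both before pricing the CPU.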

What this means if you are budgeting a private AI workstation

The buyer recipe falls out directly from the matrix. The order of operations matters.

  1. Decide which models you will actually run. Not "I might run anything." That is how budgets get wasted. List the specific models you plan to deploy in the first 90 days. If the list is all dense (Mistral Small, Gemma, Llama dense variants, Qwen dense variants), you are in quadrant 1 or 2. If the list is all MoE (gpt-oss, Mistral 4, Qwen MoE), you are in quadrant 3 or 4.
  2. Decide which inference engine you will run them through. On a modern CPU, Ollama is faster for a single user (38 to 68 percent faster than vLLM on ai5 in our dense model tests), simpler to operate, and ships with the model library built in. On the 2018 CPU the ranking flips for dense models, where vLLM was roughly 1.7 to 2.2 times faster than Ollama. vLLM is also faster for concurrent serving (4+ users), supports more quant formats, and integrates better with monitoring and orchestration. Pick one.
  3. Map the combination to the matrix. Quadrant 1 (Dense + Ollama) is the only combination that demands a modern CPU. Every other combination tolerates a 2018 era CPU.
  4. Spend the savings on GPU memory. If you land in quadrants 2, 3, or 4, the CPU budget you would otherwise spend on a 2025 desktop chip can go into a card with more VRAM, an additional GPU for tensor parallelism, or upgraded NVMe for fast model swapping. That is usually a better return than a faster CPU.

The only scenario where the new CPU is non negotiable

To be precise about the headline: the modern CPU is only mandatory when you are running a single user workload, on a dense model, through Ollama or another single stream llama.cpp wrapper. That is one quadrant out of four, and it happens to be the most common quadrant for solo developers and small business owners. Even within it there is a graceful escape hatch: you can buy the old CPU, observe the penalty in production, and migrate to vLLM later if the workload outgrows a single user. Migrating to vLLM recovers most of the GPU performance on the old CPU. Going the other way, from vLLM back to Ollama on that same CPU, brings the penalty back.

The background container trap

One result deserves its own warning. Before we stopped our voice agent stack on ai5, Mistral Small 3.2 24B ran at 44.4 tokens per second on the workstation that should have been doing 98.4. The agent containers (sam-orchestrator, hermes, kokoro, whisper, signal-cli, plus an Nginx and a site container) all showed less than 0.1 percent CPU usage each. They looked idle. They were not.

The Linux scheduler still moved threads off optimal cores when the agents woke up for periodic health checks. The L3 cache, including the stacked 3D V-Cache, got disturbed. Once we stopped the background containers, ai5 went from 44.4 to 98.4 tokens per second. That is a 2.2 times improvement (about 122 percent) from doing nothing except shutting down services that looked idle. The lesson for a production AI box: do not co host the inference workload with anything else, even if the other workloads look like they are using zero resources.
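A simple guard before any benchmark run is to list what else is running on the box. The sketch below assumes the background services are Docker containers and that the docker CLI is on the PATH. The allowlist is a hypothetical example, not our production configuration.

```python
import subprocess


def running_containers() -> list[str]:
    """Names of currently running containers, via the docker CLI."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return [name.strip() for name in result.stdout.splitlines() if name.strip()]


def unexpected_containers(running: list[str], allowed: set[str]) -> list[str]:
    """Containers that are running but not on the benchmark allowlist."""
    return sorted(set(running) - allowed)


def assert_quiet_box(allowed: set[str]) -> None:
    """Refuse to benchmark while anything off the allowlist is running."""
    extras = unexpected_containers(running_containers(), allowed)
    if extras:
        raise RuntimeError(f"stop these before benchmarking: {extras}")
```

Calling assert_quiet_box({"ollama"}) at the top of a benchmark script, with whatever container names apply on your machine, would have caught the 44.4 tokens per second run before it was recorded.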

How we run this in production at Petronella

Our internal AI box (ai5) runs the AMD Ryzen 9 9950X3D plus the RTX PRO 6000 Blackwell pairing we recommend in quadrant 1, because we run a mix of single user dense (for blog drafting, code review, and one off summarization) and multi user concurrent (for Sam, our voice agent). The Ryzen 9950X3D gives us headroom on the dense single user side and is more than fast enough for the concurrent vLLM side. The voice agent runs on vLLM with cudagraph capture, the writing assistant runs on Ollama. We let the workload pick the engine.

For client deployments, the recipe depends on what they actually plan to run. A law firm that wants a private chat assistant for a single attorney at a time gets quadrant 1 hardware. A regional CPA firm that wants overnight document analysis for the whole tax team gets quadrant 2 or 4 hardware (vLLM, fewer CPU dollars, more VRAM). A manufacturing client that wants a CMMC compliant production chatbot serving the shop floor gets quadrant 4 (MoE plus vLLM). We size the CPU after we know the workload. We do not lead with a CPU recommendation.

If you are sizing a private AI deployment for your business, our private AI deployment service walks through the workload sizing, hardware selection, and rollout plan. Petronella Technology Group, Inc. has been delivering technology services in the Raleigh North Carolina region since 2002 and we are a registered CMMC RPO (#1449) for clients with DoD obligations. Our principal Craig Petronella is CMMC-RP credentialed, a Licensed Digital Forensic Examiner (License #604180), holds an MIT AI Certificate, and has 23 plus years of experience advising small and mid market businesses on infrastructure decisions. For the IT operations side of the build (Windows imaging, group policy, backup, monitoring), our managed IT services team can take the project from spec through deployment.

FAQ

Does my CPU choice matter for cloud LLM use (OpenAI API, Anthropic API)?

No. Cloud LLM APIs run inference on the provider's hardware. Your CPU only handles request and response handling, which is trivial. If you are exclusively using cloud APIs, spend the CPU budget on faster local storage or a better keyboard.

Will a faster CPU help if I add multiple GPUs?

For tensor parallel inference across two or more GPUs in the same host, yes. The PCIe traffic between GPUs goes through the CPU complex (the PRO 6000 Blackwell does not support NVLink). A modern CPU with a fast memory subsystem helps that traffic move efficiently. The effect is small for single user single stream workloads but compounds at higher concurrency.

I already bought a 2018 era CPU. Can I rescue the throughput?

Yes. Migrate to vLLM. Our c1 workstation went from 31.6 tokens per second on Ollama to 69.0 tokens per second on vLLM for the same Mistral Small 3.2 24B model. That is roughly the same throughput an ai5 with vLLM produces. You do not have to replace the CPU. You have to replace the inference engine.

What about cheaper consumer Blackwell cards like the RTX 5090?

The same logic applies. The dispatch trap is about CPU dispatch overhead and the inference engine, not about which Blackwell card you bought. A 5090 with a 2018 CPU on Ollama will exhibit the same penalty pattern. Absolute throughput numbers will differ (less VRAM means smaller models), but the quadrants stay the same.

How do I know if my workload is single user or concurrent?

If you are the only person typing into the model and you read the answer before sending the next prompt, you are single user. If your model is behind an API that multiple humans or agents call at once, you are concurrent. Most desktop chat usage is single user. Most production deployments end up concurrent within six months.

Bottom line

The headline result holds. Same GPU, two different CPUs separated by seven years of progress, and the throughput gap on dense Ollama workloads is 3.1 times. The headline result also has a precise scope. It only applies to dense model plus Ollama plus single user. Three other large quadrants of usage are CPU tolerant. The mistake to avoid is generalizing the painful quadrant into a universal rule. The opposite mistake (assuming the CPU never matters) is the one most buyer guides have been making, and the one that costs people real throughput on the workloads they most commonly run at home.

Decide the workload first. Pick the engine second. Spec the CPU third. The GPU choice is the same in every quadrant (more VRAM is always better) but the rest of the box should be a consequence of the workload, not a default.

Need help sizing a private AI workstation for your business? Call Petronella Technology Group, Inc. at 919-348-4912 for a no obligation conversation about workload, budget, and the right hardware tier. We have 23 plus years of cybersecurity and IT services experience, we are a CMMC RPO (#1449), and our principal Craig Petronella holds CMMC-RP, Licensed Digital Forensic Examiner (License #604180), and MIT AI Certificate credentials. Learn more about our private AI deployment service or browse our broader private AI offering.

Petronella Technology Group, Inc.
2010 Cameron St, Raleigh, NC 27605
Phone: 919-348-4912
CMMC RPO #1449 | Serving the Research Triangle and Eastern North Carolina since 2002

About the Author

Craig Petronella, CEO, Founder & AI Architect, Petronella Technology Group

Craig Petronella founded Petronella Technology Group in 2002 and has spent 20+ years professionally at the intersection of cybersecurity, AI, compliance, and digital forensics. He holds the CMMC Registered Practitioner credential issued by the Cyber AB and leads Petronella as a CMMC-AB Registered Provider Organization (RPO #1449). Craig is an NC Licensed Digital Forensics Examiner (License #604180-DFE) and completed MIT Professional Education programs in AI, Blockchain, and Cybersecurity. He also holds CompTIA Security+, CCNA, and Hyperledger certifications.

He is an Amazon #1 Best-Selling Author of 15+ books on cybersecurity and compliance, host of the Encrypted Ambition podcast (95+ episodes on Apple Podcasts, Spotify, and Amazon), and a cybersecurity keynote speaker with 200+ engagements at conferences, law firms, and corporate boardrooms. Craig serves as Contributing Editor for Cybersecurity at NC Triangle Attorney at Law Magazine and is a guest lecturer at NCCU School of Law. He has served as a digital forensics expert witness in federal and state court cases involving cybercrime, cryptocurrency fraud, SIM-swap attacks, and data breaches.

Under his leadership, Petronella Technology Group has served hundreds of regulated SMB clients across NC and the southeast since 2002, earned a BBB A+ rating every year since 2003, and been featured as a cybersecurity authority on CBS, ABC, NBC, FOX, and WRAL. The company leverages SOC 2 Type II certified platforms and specializes in AI implementation, managed cybersecurity, CMMC/HIPAA/SOC 2 compliance, and digital forensics for businesses across the United States.
