
Updated May 2026. NVIDIA shipped the GB10 Grace Superchip in DGX Spark form earlier this spring. The marketing pitch is bold: a desktop-class ARM SoC with 128 GB of unified memory and Blackwell GPU cores for roughly four thousand dollars. The obvious question for anyone building private AI on a budget is whether that little cube can credibly replace a ten-thousand-dollar RTX PRO 6000 Blackwell workstation card. We bought one of each, dropped them on the bench at Petronella Technology Group, Inc., and ran the same workloads through both for two weeks. The headline finding will not surprise anyone who has stared at memory bandwidth numbers long enough. The detail behind it might.

This post walks through the spec sheets, the actual measured throughput on three current open-source models, the surprising places where the GB10 wins despite its bandwidth disadvantage, and a buyer-by-use-case matrix for small and mid-sized businesses planning a private AI deployment. If you want to skip to the recommendation, the section near the end called Buyer Decision Matrix distills it. If you want to know why, the rest of the article is the receipts.

Why we ran this test

Three things drove the benchmark. First, customers ask us almost weekly whether the GB10 platform is good enough to host a private AI server for a small team, since the unit price competes with a midrange laptop rather than a server. Second, the bandwidth gap between professional discrete GPUs and unified-memory SoCs is now so large that a quick spec-sheet comparison misjudges how it actually plays out, in both directions. Third, software maturity for the Blackwell consumer architecture is still settling. Stock vLLM 0.21 does not yet light up native NVFP4 tensor cores on the PRO 6000, while a community-patched build of vLLM does on the GB10. Spec sheets cannot show you that.

Our bench at Petronella Technology Group is set up the way a security-conscious SMB would build a private AI lab. We do not benchmark on cloud GPUs. We test on bare metal that we own, in a Tailscale-isolated subnet, with the same telemetry and hardening posture we put on every customer deployment. Hardware decisions made here influence what we recommend on every private AI consulting engagement we run.

The spec sheet, as published

Before any throughput numbers, here is an apples-to-apples comparison of what NVIDIA documents for each platform.

| Specification | RTX PRO 6000 Blackwell | GB10 Grace Superchip (DGX Spark) |
| --- | --- | --- |
| GPU architecture | Blackwell, dGPU on PCIe Gen5 x16 | Blackwell GPU + ARM Grace CPU, single package SoC |
| Memory capacity | 96 GB GDDR7 | 128 GB LPDDR5X unified (CPU+GPU shared) |
| Memory bandwidth | ~1,792 GB/s | ~273 GB/s |
| TDP, typical operating | ~600 W | ~250 W (whole SoC) |
| Form factor | Dual-slot PCIe add-in card | Compact desktop unit, fanless aesthetic |
| Host CPU required | Yes, separate workstation x86 host | Integrated ARM Cortex-X cluster on package |
| Native NVFP4 tensor cores | Yes, but stack support varies | Yes |
| Approximate street price | $9,500 to $11,000 USD | $3,999 to $4,499 USD |

The crude prediction from the spec table is that the PRO 6000 should win decode-heavy LLM workloads by the bandwidth ratio, which is 1,792 divided by 273, or roughly 6.6 times. Large language models in single-stream decode are almost purely memory bandwidth bound, because each token requires the model to stream every weight from memory through the tensor cores once. If the rule held perfectly, the tokens-per-second figure on the PRO 6000 would be about 6.6 times what we see on the GB10. The measured gap is somewhat smaller, and there are good reasons.
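The bandwidth rule of thumb can be written down in a few lines. The sketch below is ours, not part of the bench tooling. It assumes about 0.5 bytes per parameter for a 4-bit quant and ignores scale factors, KV cache reads, and activations, so the ceilings it prints are upper bounds rather than predictions.

```python
# Spec-sheet bandwidth figures quoted in the table above, in GB/s.
PRO_6000_BW_GBPS = 1792.0
GB10_BW_GBPS = 273.0


def decode_ceiling_tps(bandwidth_gbps: float, weight_gb: float) -> float:
    """Upper bound on single-stream tokens/s if every token streams all weights once."""
    return bandwidth_gbps / weight_gb


# ASSUMPTION: ~0.5 bytes per parameter for a 4-bit quant, with no scale factors,
# KV cache traffic, or activations counted. 24e9 params * 0.5 B = 12 GB.
weights_gb_24b = 24e9 * 0.5 / 1e9

predicted_ratio = PRO_6000_BW_GBPS / GB10_BW_GBPS
print(f"bandwidth ratio:  {predicted_ratio:.2f}x")
print(f"PRO 6000 ceiling: {decode_ceiling_tps(PRO_6000_BW_GBPS, weights_gb_24b):.1f} t/s")
print(f"GB10 ceiling:     {decode_ceiling_tps(GB10_BW_GBPS, weights_gb_24b):.1f} t/s")
```

The ratio does not depend on the weight-size assumption, because the same model size divides both ceilings. Only the absolute ceilings move if the real footprint differs.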

For background on the broader architecture story see the NVIDIA Blackwell architecture overview and the NVIDIA Grace family announcement. For the inference engine internals, the vLLM documentation and the llama.cpp repository describe the kernel paths we exercised.

The benchmark setup

We tested three open-source models that we routinely ship to private AI deployments. Each one represents a different shape of workload.

  • Mistral Small 3.2 24B Dense, NVFP4 quant, single-user, 2000 output tokens, served via vLLM. Dense smaller model, fast cold path, common SMB choice for drafting and summarization.
  • Gemma 4 31B Dense, NVFP4 quant, single-user, 2000 output tokens, served via vLLM. Larger dense model, the upper end of what fits comfortably on a single 96 GB card with headroom for KV cache.
  • Gemma 4 26B with 4B active mixture-of-experts, NVFP4 quant, served via vLLM with CUDA graph capture. MoE workload, lighter per-token compute but the full weight working set is still resident.

The prompt was a fixed compliance-themed brief that produces roughly two thousand output tokens. The stack on the PRO 6000 host was stock vllm/vllm-openai:latest built against vLLM 0.21. The stack on the GB10 was a community-patched vllm-v4-sm120 ARM container that engages the native NVFP4 cutlass kernel path. We will return to why that matters.

For concurrent serving, we ran the same NVFP4 MoE model with batched requests at varying concurrency. Single-user numbers tell you how fast one chat session feels. Concurrent numbers tell you how many simultaneous users a unit can serve at production quality of service.
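The two metrics are easy to confuse, so here is a minimal sketch of how each could be computed from per-request timing logs. The `RequestRecord` fields and both function names are hypothetical and are not taken from our harness. The sketch assumes you record when each request was sent, when its first token arrived, and when its last token arrived.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One completed request, as a hypothetical harness might log it."""
    start_s: float        # wall-clock time the request was sent
    first_token_s: float  # wall-clock time the first token arrived
    end_s: float          # wall-clock time the last token arrived
    output_tokens: int    # number of tokens generated


def per_stream_tps(r: RequestRecord) -> float:
    """Decode speed one user experiences, excluding time to first token."""
    return (r.output_tokens - 1) / (r.end_s - r.first_token_s)


def aggregate_tps(records: list[RequestRecord]) -> float:
    """Total tokens produced per second of wall-clock time across all requests."""
    window = max(r.end_s for r in records) - min(r.start_s for r in records)
    return sum(r.output_tokens for r in records) / window


# Two requests served concurrently over the same 10.5 second window.
records = [
    RequestRecord(start_s=0.0, first_token_s=0.5, end_s=10.5, output_tokens=201),
    RequestRecord(start_s=0.0, first_token_s=0.5, end_s=10.5, output_tokens=201),
]
print(f"per-stream: {per_stream_tps(records[0]):.1f} t/s")
print(f"aggregate:  {aggregate_tps(records):.1f} t/s")
```

In this made-up example each user sees 20 tokens per second while the box as a whole produces about 38, which is the distinction the concurrency results rely on.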

The headline results

Here are the measured tokens per second on identical workloads. All numbers are single-user, 2000 output tokens, freshly warmed model.

| Model | RTX PRO 6000 Blackwell | GB10 Grace Superchip | Measured Ratio | Bandwidth Predicts |
| --- | --- | --- | --- | --- |
| Mistral Small 3.2 24B Dense, NVFP4 + vLLM | 71.4 t/s | 12.5 t/s | 5.7x | 6.6x |
| Gemma 4 31B Dense, NVFP4 + vLLM | 37.7 t/s | 6.6 t/s | 5.7x | 6.6x |
| Gemma 4 26B-A4B MoE, NVFP4 + cudagraph | 161.5 t/s | 29.2 t/s | 5.5x | 6.6x |

Three things to note. First, the measured ratio lands consistently in a 5.5 to 5.7 times band, slightly under the pure bandwidth prediction of 6.6. That gap, roughly thirteen to sixteen percent depending on the model, is the overhead of everything that is not weight streaming: kernel launch, scheduler, KV cache index math, attention compute on hot tokens. Second, the MoE model is dramatically faster per token on both platforms, which is what mixture-of-experts is for. Third, the dense 31B model is the most punishing case for the GB10. Below seven tokens per second is right at the edge of what feels usable in an interactive chat, and it is the point where the GB10 stops being a serious answer for single-user dense inference at the 30B class.
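The ratios and the shortfall against the bandwidth prediction can be recomputed directly from the table. This is a small sketch that uses only the figures published above; the function name is ours.

```python
# (PRO 6000 t/s, GB10 t/s) copied from the results table above.
RESULTS = {
    "Mistral Small 3.2 24B Dense": (71.4, 12.5),
    "Gemma 4 31B Dense": (37.7, 6.6),
    "Gemma 4 26B-A4B MoE": (161.5, 29.2),
}
BANDWIDTH_RATIO = 1792 / 273  # spec-sheet bandwidth ratio, about 6.56


def ratio_and_shortfall(pro_tps: float, gb10_tps: float) -> tuple[float, float]:
    """Return the measured speedup and how far it falls below the bandwidth ratio."""
    measured = pro_tps / gb10_tps
    return measured, 1 - measured / BANDWIDTH_RATIO


for model, (pro, gb10) in RESULTS.items():
    measured, shortfall = ratio_and_shortfall(pro, gb10)
    print(f"{model}: {measured:.2f}x measured, {shortfall:.1%} below bandwidth ratio")
```

Against the unrounded bandwidth ratio, the two dense models land about thirteen percent under the prediction and the MoE model about sixteen percent under.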

Concurrent serving turns the comparison on its head

Single-user latency is one workload. Production private AI for a team usually looks different. If you have eight to thirty people firing requests, batching kicks in and the math changes.

We ran a sustained concurrent load test on the same NVFP4 MoE model. The PRO 6000 on a clean host hit roughly 2,173 tokens per second aggregate at concurrency 32. The GB10 unit hit roughly 217 tokens per second aggregate at concurrency 16, with the per-token latency staying inside an acceptable band thanks to vLLM continuous batching plus the unified memory pool absorbing KV cache pressure. The aggregate ratio in that scenario is ten to one. On a per-dollar basis the gap narrows, although it does not flip. Take a roughly ten thousand dollar PRO 6000 against a roughly four thousand dollar GB10. The PRO 6000 delivers about 217 tokens per second per thousand dollars. The GB10 delivers about 54. The PRO 6000 therefore still wins on raw aggregate throughput per dollar at this batch size, by roughly four to one, but the GB10's lower price cuts the ten to one throughput gap by more than half once cost is counted.
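The per-dollar arithmetic in the paragraph above is simple enough to check by hand, and a short sketch makes it easy to rerun with your own quotes. The prices are the round numbers used in the text, not invoices.

```python
def tps_per_thousand_usd(aggregate_tps: float, price_usd: float) -> float:
    """Aggregate tokens per second delivered per $1,000 of hardware."""
    return aggregate_tps / (price_usd / 1000)


pro_6000 = tps_per_thousand_usd(2173, 10_000)  # card only, round-number price
gb10 = tps_per_thousand_usd(217, 4_000)

print(f"PRO 6000: {pro_6000:.1f} t/s per $1k")
print(f"GB10:     {gb10:.1f} t/s per $1k")
print(f"per-dollar advantage: {pro_6000 / gb10:.1f}x")
```

Swapping the card-only price for a full workstation price lowers the PRO 6000 figure, so the per-dollar advantage is sensitive to what you count as system cost.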

Now consider the configuration where the GB10 actually wins: a small office that wants a private AI host for eight to fifteen people doing knowledge-worker drafting and code assistance, where peak concurrency is eight to twelve, where the model is a 14B to 24B dense model in 4-bit that easily fits in twenty to thirty gigabytes, and where the power envelope must stay under three hundred watts because the unit lives on a credenza, not in a server rack. In that scenario the GB10 is a credible single-box solution. The PRO 6000 needs a workstation power supply of at least 1000 W, professional cooling, and almost always a separate host machine in the $1,500 to $3,000 range to drop into. Total system cost for a turnkey PRO 6000 workstation lands closer to twelve thousand dollars, not ten.

Where the GB10 surprises on the upside

There are three workload patterns where the GB10 is the better engineering answer regardless of the bandwidth gap.

Models larger than 96 GB

The PRO 6000 has 96 GB of GDDR7. The GB10 has 128 GB of unified LPDDR5X. If you want to run a 100B+ class model on a single unit without sharding across nodes, the GB10 has the address space and the PRO 6000 does not. We were not able to fit Mistral Small 4 119B in NVFP4 quant on a single PRO 6000 without aggressive KV cache trimming. On the GB10 it fits with room. Throughput is lower because of the bandwidth, but the workload completes on a single box.
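A rough footprint estimate shows why a 119B model is tight on 96 GB. The sketch below assumes about 0.5625 bytes per parameter, meaning 4-bit values plus one 8-bit scale per 16-element block, applied to every parameter. Real checkpoints usually keep some layers in higher precision, so treat the result as a lower bound on weight size.

```python
# ASSUMPTION: 4-bit values plus one 8-bit scale per 16-element block,
# i.e. 4.5 bits = 0.5625 bytes per parameter, applied to all parameters.
BYTES_PER_PARAM_4BIT_BLOCK_SCALED = 0.5625


def weight_footprint_gb(params_billion: float,
                        bytes_per_param: float = BYTES_PER_PARAM_4BIT_BLOCK_SCALED) -> float:
    """Approximate weight size in GB (1e9 params * bytes per param / 1e9)."""
    return params_billion * bytes_per_param


def headroom_gb(capacity_gb: float, params_billion: float) -> float:
    """Memory left for KV cache, activations, and runtime after loading weights."""
    return capacity_gb - weight_footprint_gb(params_billion)


weights = weight_footprint_gb(119)
print(f"119B weights:          {weights:.1f} GB")
print(f"headroom on 96 GB:     {headroom_gb(96, 119):.1f} GB")
print(f"headroom on 128 GB:    {headroom_gb(128, 119):.1f} GB")
```

Whatever headroom remains has to cover KV cache, activations, the CUDA context, and any unquantized layers. On the GB10 the 128 GB pool is also shared with the operating system, so the usable figure there is lower than the raw subtraction suggests.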

Low power envelope

An RTX PRO 6000 draws 600 watts under load and demands serious airflow. A GB10 draws roughly 250 watts and runs quietly. If the deployment is a regional clinic, a law firm with twelve attorneys, a manufacturing site office, or any other small business location without a server room, the GB10 is the realistic answer. Adding a six hundred watt heat source to a small office requires HVAC math the PRO 6000 vendor brochure does not mention.
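The HVAC math is mostly a unit conversion. This sketch assumes all electrical power ends up as heat in the room, which is a reasonable approximation for computing equipment, and uses the power figures quoted above. The PRO 6000 figure is for the card alone and does not include the host.

```python
BTU_PER_HOUR_PER_WATT = 3.412  # standard conversion: 1 W is about 3.412 BTU/h


def heat_load_btu_per_hour(watts: float) -> float:
    """Steady-state heat added to the room, assuming all power becomes heat."""
    return watts * BTU_PER_HOUR_PER_WATT


def energy_kwh(watts: float, hours: float) -> float:
    """Electrical energy consumed over a period at a constant draw."""
    return watts * hours / 1000


for name, watts in (("RTX PRO 6000 (card only)", 600), ("GB10", 250)):
    print(f"{name}: {heat_load_btu_per_hour(watts):.0f} BTU/h, "
          f"{energy_kwh(watts, 8):.1f} kWh per 8-hour day at full load")
```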

Memory-flexible multi-model serving

If your use case is to keep three or four smaller models warm in memory simultaneously, a sub-30B coder, a sub-15B translator, a 7B embed model, and so on, the GB10 holds them all at once in unified memory without paging. Throughput per model is lower than the PRO 6000 could deliver, but the GB10 can serve all of them concurrently with no model swap penalty. The PRO 6000 has more peak speed but a smaller bag.

Where the PRO 6000 stays untouchable

Single-user, low-latency, premium experience

If you are building a customer-facing assistant where ninety-fifth percentile latency on a long answer must stay under three seconds, the PRO 6000 will deliver that on a 24B class model and the GB10 will not. The 5.7 times throughput gap on single-stream decode is real and translates directly to perceived speed in any single-user chat surface.

Native NVFP4 once the stack catches up

Right now, stock vLLM 0.21 on consumer Blackwell falls back to a MARLIN weight-only dequant path because the native FP4 tensor-core kernel for some activation functions is not upstream yet. We logged the warning message "Your GPU does not have native support for FP4 computation" during this benchmark. The GB10, running Rob's patched vllm-v4-sm120 ARM build, did engage the native cutlass NVFP4 path. In effect we are comparing a PRO 6000 running at maybe seventy to eighty percent of its real ceiling against a GB10 running at the architecture's full design point. When vLLM upstream lands the native Blackwell NVFP4 kernel for the MoE activation set we tested, the PRO 6000 should claw back another twenty to thirty percent of throughput. For background on why the inference stack matters as much as the silicon, our previous post on the private AI stack goes deeper.

Heavy serving with sticky cache locality

For a large team of twenty to fifty users with sustained traffic, the PRO 6000 plus a modern host plus serious networking remains the better unit economics if you can power and cool it. The aggregate throughput at concurrency 32 was 2,173 tokens per second, which is enough to keep dozens of concurrent chat sessions feeling instant.

The hidden gotcha: stack maturity

This is the single most important thing for an SMB to understand before purchasing either platform. The silicon is only half the story. The inference stack determines whether you actually get the silicon's performance, or whether you get a fallback path that leaves twenty to forty percent on the table. Three concrete examples from our bench.

  1. Stock vLLM 0.21 on PRO 6000 + NVFP4 + Gemma 4 MoE takes the MARLIN backend because cutlass and flashinfer reject GELU_TANH activation, and flashinfer_trtllm refuses the device class. MARLIN does weight-only dequant to bf16 and matmuls in bf16, which is extra work. Until vLLM upstream lands the native FP4 path for that activation, the PRO 6000 is running with one hand tied.
  2. Stock vLLM on GB10 has its own intermediate-size padding bugs for some MoE shapes in tensor-parallel configurations. We hit "Intermediate size padding for w1 and w3 for VLLM_CUTLASS NvFp4 backend, but this is not currently supported" on Gemma 4 MoE tensor-parallel on a two-node Spark cluster. Rob's vllm-v4-sm120 patched image works around it for single-node, but cross-node TP-2 for this specific model is still blocked at this writing.
  3. llama.cpp via Ollama actually beats vLLM by 38 to 68 percent for single-user inference on PRO 6000. Our cross-engine bench measured Mistral Small 3.2 24B at 98.4 tokens per second on Ollama Q4_K_M versus 71.4 on vLLM NVFP4. The reasons are PagedAttention overhead, scheduler dispatch cost, and the MARLIN fallback compounding. Ollama is single-stream optimized and lighter. We covered this in detail in our companion post on choosing an inference engine.

The lesson for buyers is to budget time and engineering for the stack, not just the hardware. A six month software gap can erase a five times silicon advantage in the wrong workload.

What this means for your private AI deployment

For SMBs evaluating either platform, the framing we use during a private AI services consultation is to start with the workload and work backwards.

Buyer Decision Matrix

| If your team | You want | Why |
| --- | --- | --- |
| Has 1 to 4 users, wants the fastest single chat experience | RTX PRO 6000 Blackwell on a workstation | 5 to 6 times higher single-user decode throughput |
| Has 8 to 15 users at a small office, no server room, under 300 W power envelope | GB10 Grace Superchip (DGX Spark) | Power envelope and form factor win, throughput is adequate at small batch |
| Wants to run 100B+ class models on a single unit | GB10 with 128 GB unified memory | The PRO 6000 cannot hold the working set without sharding |
| Has 20 to 50 concurrent users, sustained load, can power and cool 600 W | RTX PRO 6000, possibly multiple cards | Aggregate throughput at high batch is multiples ahead |
| Wants to keep three or four smaller models warm together | GB10 unified memory pool | Holds all models at once without paging |
| Needs sub-3 second p95 latency on long responses | RTX PRO 6000 | Bandwidth determines tail latency on long generations |
| Lives in a fanless or near-silent environment | GB10 | 250 W and quiet thermals, no workstation acoustics |
| Has a regulatory mandate to keep all inference on-premises | Either, with the right wrapper | Both can be deployed entirely behind your firewall |

The regulatory point is worth emphasizing. Cloud AI services route prompts and outputs across hyperscaler infrastructure that small businesses cannot directly audit. For any workload touching protected health information, controlled unclassified information, attorney-client privileged matter, financial advisory data subject to SEC custody rules, or pre-public deal documents, an on-premises private AI deployment is the cleanest answer. Our team at Petronella Technology Group has designed private AI deployments for regulated SMBs across healthcare, defense subcontracting, legal, and accounting verticals. Where applicable, we map the deployment back to the relevant control frameworks during the design phase. See our work on compliance program build-out and cybersecurity engineering for context.

Security framing: AI first, security built in

One question we get on every hardware decision is whether locating a private AI server inside the office network actually improves the security posture compared to a cloud LLM API. The honest answer is that it does, but only if the deployment is engineered. A box on the network is not a security strategy. The components we layer on top of either platform during a real deployment include identity-aware reverse proxies, mutual TLS between application and inference endpoint, RBAC at the prompt-routing layer, redaction filters on outbound responses, full prompt and completion logging into your SIEM, signed model artifacts with hash verification on load, and a documented data flow diagram that maps every request to a control mapping in your compliance framework. We also encrypt at-rest model artifacts and any retrieval corpora through our Petronella Secure Data Suite. None of that is hardware. All of it is what turns the hardware into a defensible architecture.
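Of the controls listed above, hash verification on load is the easiest to illustrate. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the sketch covers only the hash check: in a real deployment the expected digest would come from a manifest whose signature has already been verified, which is not shown here.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte artifacts do not load into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file's SHA-256 does not match the expected digest."""
    actual = sha256_of_file(path)
    # compare_digest avoids leaking where the first mismatching character is.
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"hash mismatch for {path}: refusing to load")
```

A loader would call `verify_artifact` on every weight file before handing the path to the inference engine, and treat the exception as a hard stop rather than a warning.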

For private AI workloads touching DoD CUI, additional controls apply. Our team holds CMMC Registered Practitioner credentials across all four engineers, and the firm is a registered CMMC RPO #1449. When private AI is part of a CMMC scoping conversation, the inference endpoint, the model artifacts, and the prompt and completion logs are typically in-scope assets that need to be enrolled in the SSP. We have shipped this architecture twice this year for active DoD subcontractors. The managed IT services wrapper handles the operational side: patching, telemetry, backup, and 24x7 monitoring of the inference host. The AI build-out and the security wrapper are sold together because they only work together.

A note on price-per-token economics

It is tempting to compare either platform to a commercial frontier API like GPT-4-class or Claude Opus on a price-per-million-token basis. Two cautions. First, the frontier models behind those APIs are much larger and higher-quality on hard reasoning tasks, which an open 24B to 31B model does not match. Compare like with like, which usually means an open 24B class model in a private deployment against the cheapest competitive cloud tier of an equivalent open model. Second, the amortization period for a private AI box is two to three years. At Mistral 24B's 71 t/s on the PRO 6000, a single unit running eight hours a day generates roughly two million tokens per day. Over a year of roughly 250 working days that is about five hundred million tokens of capacity, well over an order of magnitude more than most SMBs consume in a year. The deciding factor is rarely raw throughput. It is whether you want the prompts to stay on your premises.
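The capacity figures above follow from straightforward multiplication. The sketch below reproduces them for a single continuous decode stream. The 250-day year is our assumption for a business that runs the box on working days only.

```python
SECONDS_PER_HOUR = 3600


def tokens_per_day(tokens_per_second: float, hours_per_day: float) -> float:
    """Tokens generated in a day by one continuously busy decode stream."""
    return tokens_per_second * SECONDS_PER_HOUR * hours_per_day


daily = tokens_per_day(71, 8)   # Mistral 24B single-stream rate on the PRO 6000
yearly_working = daily * 250    # ASSUMPTION: 250 working days per year
yearly_calendar = daily * 365   # upper bound if the box runs every day

print(f"per day:             {daily:,.0f} tokens")
print(f"per 250-day year:    {yearly_working:,.0f} tokens")
print(f"per 365-day year:    {yearly_calendar:,.0f} tokens")
```

Batched serving raises aggregate capacity well beyond this single-stream figure, so the estimate is conservative for a multi-user deployment.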

Common failure modes we have seen in the field

From customer deployments and bench experiments, three things sink a private AI build more often than hardware.

  1. Pairing a modern GPU with an old CPU. We measured a 3.1 times performance gap on Ollama single-user inference between two identical PRO 6000 cards. The faster host had a 2025 Ryzen 9 9950X3D. The slower had a 2018 Intel i9-9900K. Same GPU, same model, same Ollama version. The old CPU could not dispatch GPU kernels fast enough to keep the GPU fed, and Ollama did not use CUDA graph capture to mask the dispatch cost. vLLM with cudagraphs masked it and brought the two hosts within three percent. Buyer lesson: pair the PRO 6000 with a 2024 or newer desktop CPU.
  2. Idle background containers stealing performance. Stopping idle voice and orchestration containers on the test host gave the PRO 6000 a sixty percent throughput boost on Mistral Small 3.2 24B Dense, even though each container reported less than one tenth of one percent CPU usage. Cache locality and scheduler contention are invisible until you measure them. Buyer lesson: isolate inference hosts.
  3. Routing benchmark traffic over the wrong network path. Benchmarking the same host over Tailscale userspace WireGuard versus LAN dropped throughput from 69 to 30 tokens per second. The HTTP token stream chunking hits Tailscale's per-packet processing harder than a steady file transfer. Buyer lesson: the production network path matters as much as the model.

FAQ

Is the GB10 worth buying for a private AI server in a small office?

Yes if your team is fewer than fifteen people, you want a quiet desktop unit on a credenza or shelf, and your workload is drafting, summarization, and code assistance with a 7B to 24B model. No if you need premium single-user latency on a 30B class model or if you expect twenty plus concurrent power users at peak.

Does the PRO 6000 require its own server?

It requires a workstation-class host with at least a 1000 watt power supply, PCIe Gen5 x16, modern CPU, and serious cooling. Plan on $1,500 to $3,000 for the host, on top of the card cost. The GB10 includes its compute host.

Why is single-user decode bandwidth-bound rather than compute-bound?

During decode of a long completion, every output token requires streaming the full set of model weights through the tensor cores once. The arithmetic intensity of decode is low, so the throughput ceiling is set by how fast the GPU can read its weight memory, not by how many FLOPS it can do. Encoding the prompt is different and can become compute-bound. For most chat workloads, decode dominates.
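A back-of-envelope roofline check makes the same point numerically. The sketch below assumes about 2 FLOPs per weight per token, 0.5 bytes per weight for a 4-bit quant, and a placeholder peak compute figure that is purely illustrative and not a published spec for either platform. It also ignores attention and KV cache traffic.

```python
def decode_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weights read: each weight is reused once per token in the batch."""
    return 2 * batch_size / bytes_per_weight


def is_bandwidth_bound(intensity: float, peak_flops: float, bandwidth_bytes_per_s: float) -> bool:
    """True when the workload sits below the roofline ridge point."""
    return intensity < peak_flops / bandwidth_bytes_per_s


PEAK_FLOPS = 1e15           # ASSUMPTION: illustrative placeholder, not a real spec
BANDWIDTH = 1.792e12        # PRO 6000 spec-sheet bandwidth in bytes per second
BYTES_PER_WEIGHT = 0.5      # ASSUMPTION: 4-bit weights, scale factors ignored

for batch in (1, 32, 512):
    intensity = decode_intensity(batch, BYTES_PER_WEIGHT)
    bound = "bandwidth" if is_bandwidth_bound(intensity, PEAK_FLOPS, BANDWIDTH) else "compute"
    print(f"batch {batch:>3}: {intensity:>6.0f} FLOP/byte -> {bound}-bound")
```

Under these assumptions a single stream sits far below the ridge point, while a batch of several hundred tokens, which is what prompt processing looks like to the hardware, crosses into compute-bound territory.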

Will vLLM eventually fix the native NVFP4 fallback on Blackwell PRO 6000?

It is upstream priority work and changes are landing regularly. We expect the gap to close within one to two minor releases. Our team tracks the vLLM release cadence as part of every managed private AI deployment and updates the stack on a controlled schedule.

Can I cluster two GB10 units to double the throughput?

NVIDIA documents two-node Spark cluster configurations and we have begun bench work on cross-node tensor-parallel. As of this writing, some MoE shapes still fail intermediate-size padding checks in cutlass. The single-unit answer is more predictable today. Multi-unit is a 2026 second half question.

What happens if my private AI model needs to be updated?

On a Petronella-managed deployment, model updates run through a tested staging procedure: signed artifact verification, regression evals on a fixed prompt suite, performance regression check, and a controlled cutover with rollback. We do not push model updates blindly.

Is open-source NVFP4 inference production-ready for a regulated SMB?

Yes, with appropriate engineering. The model formats are stable, the inference engines have predictable behavior in well-defined configurations, and the failure modes are observable. The point of partnering with a security-led team is that we have already done the failure-mode mapping during prior deployments. We do not learn it on your dollar.

The bottom line

If you remember nothing else, remember this. For a 1 to 4 user team that wants the fastest single chat response, buy the PRO 6000. For an 8 to 15 user office that needs a quiet box on a credenza, buy the GB10. For 20+ users with serious sustained load, buy multiple PRO 6000s in a real workstation. For models that do not fit in 96 GB, buy the GB10. For premium latency on a 30B dense model, buy the PRO 6000. The hardware decision is straightforward once the workload is honest. The deployment around it is where the real engineering lives.

Craig Petronella, CMMC Registered Practitioner, Licensed Digital Forensic Examiner #604180, MIT AI Certificate holder, and founder of Petronella Technology Group, has spent twenty-three years building security-led infrastructure for regulated small businesses. If you are weighing a private AI build for your team and want a candid, technically specific second opinion, call Petronella Technology Group at 919-348-4912 for a fifteen minute consultation. The engineers who run this bench answer the phone.

Petronella Technology Group, Inc.
5540 Centerview Dr, Suite 200
Raleigh, NC 27606
919-348-4912
petronellatech.com


About the Author

Craig Petronella, CEO and Founder of Petronella Technology Group
CEO, Founder & AI Architect, Petronella Technology Group

Craig Petronella founded Petronella Technology Group in 2002 and has spent 20+ years professionally at the intersection of cybersecurity, AI, compliance, and digital forensics. He holds the CMMC Registered Practitioner credential issued by the Cyber AB and leads Petronella as a CMMC-AB Registered Provider Organization (RPO #1449). Craig is an NC Licensed Digital Forensics Examiner (License #604180-DFE) and completed MIT Professional Education programs in AI, Blockchain, and Cybersecurity. He also holds CompTIA Security+, CCNA, and Hyperledger certifications.

He is an Amazon #1 Best-Selling Author of 15+ books on cybersecurity and compliance, host of the Encrypted Ambition podcast (95+ episodes on Apple Podcasts, Spotify, and Amazon), and a cybersecurity keynote speaker with 200+ engagements at conferences, law firms, and corporate boardrooms. Craig serves as Contributing Editor for Cybersecurity at NC Triangle Attorney at Law Magazine and is a guest lecturer at NCCU School of Law. He has served as a digital forensics expert witness in federal and state court cases involving cybercrime, cryptocurrency fraud, SIM-swap attacks, and data breaches.

Under his leadership, Petronella Technology Group has served hundreds of regulated SMB clients across NC and the southeast since 2002, earned a BBB A+ rating every year since 2003, and been featured as a cybersecurity authority on CBS, ABC, NBC, FOX, and WRAL. The company leverages SOC 2 Type II certified platforms and specializes in AI implementation, managed cybersecurity, CMMC/HIPAA/SOC 2 compliance, and digital forensics for businesses across the United States.
