Previous All Posts Next

Updated May 2026

OpenAI released gpt-oss-20b in late 2025 as the company's first open-weights model. Almost every benchmark we read tested it as a chat model. That misses the real story. We tested it as a voice agent brain on Petronella Technology Group, Inc.'s actual production Sam tool schemas. The result was the most surprising number in our entire 2026 benchmark sweep: 97 percent tool-call accuracy at 460 milliseconds average response time. Nothing else came close.

This post documents the bench, the leaderboard against eight competing models, and the production configuration we run for Sam, the voice agent that answers our 919-348-4912 line. Numbers are real. Hardware is on our desk. The phone we answer with this stack is the phone you can call.

The dirty secret of voice agents

Latency wins or loses every voice call. When a caller asks "can I book a 15 minute meeting with Craig next Tuesday at 2," the model has to do three things in under a second: classify the intent as a booking request, fill in the booking arguments from the spoken sentence, and emit a clean JSON tool call that the orchestrator can execute. If any of those three steps slips past a second, the caller hears an awkward pause and starts repeating themselves.

Most published LLM benchmarks measure raw tokens per second on a fresh prompt. That number does not predict voice quality. The numbers that matter are tool-call accuracy on real schemas, time to first useful action, and the model's willingness to decline politely when a caller asks something off-topic. We built a bench that measures exactly those properties.

The bench setup

We used Sam's actual tool definitions from the file we ship in production: 10 tools that cover the conversations Sam handles every day, including book_consultation, check_availability, ask_henry, find_mutual_free_slots, get_my_recent_emails, send_email, create_calendar_event, cancel_calendar_event, update_calendar_event, and find_event.

The test set contained 23 prompts written to mimic spoken phone-call utterances. Seventeen prompts were positive triggers where the model should call a specific tool with arguments extracted from the sentence. Six prompts were distractor questions that should NOT call any tool, things like "what is the weather in Raleigh" or "tell me a joke." A model that calls a tool on a distractor is dangerous in production. It would actually do the wrong action.

Scoring used a 3-point rubric per prompt. One point for picking the correct tool name (or correctly declining). One point for including all required arguments. One point for extracting argument values correctly from the prompt. Maximum score 51 per model. We launched every model in its native runtime with the appropriate tool-call parser flag for that model family.

The leaderboard

Nine model and runtime combinations took the bench. Results sorted by accuracy first, latency second:

  • gpt-oss:20b on Sam stack vLLM: 95.5 percent accuracy at 0.46 seconds average. Production champion. The combination of OpenAI fine-tuning, vLLM cudagraph capture, and the openai tool-call parser delivers the lowest latency in the entire test field. We did not optimize the production config to win this bench. It was already running.
  • gpt-oss:20b on Ollama: 97.0 percent accuracy at 1.16 seconds. Same model weights as the production stack, different runtime. Slightly higher raw accuracy from a single extra correct call, but 2.5 times slower. Ollama is excellent for single-user inference on a Blackwell, but the per-token CPU dispatch overhead shows on tool-call workflows where every millisecond counts.
  • Gemma 4 31B Dense on Ollama: 89.4 percent accuracy at 13.81 seconds. A back-office automation model, not a voice model. Twelve to fourteen seconds of silence on a phone call is dead air.
  • Qwen 3.6 35B-A3B MoE on Ollama: 89.4 percent accuracy at 11.76 seconds. Same back-office story as Gemma. Excellent for batch jobs, wrong tool for live calls.
  • gpt-oss:120b: 87.9 percent accuracy at 1.73 seconds. The larger sibling of our champion lost on both axes. More parameters did not buy more tool-call accuracy on this test set, and the bigger model paid a 50 percent latency tax.
  • Mistral Small 3.2 24B on Ollama: 86.4 percent accuracy at 0.66 seconds. A respectable fallback model when gpt-oss-20b is unavailable. Used to be our default before gpt-oss landed.
  • Mistral Small 4 119B-MoE on llama.cpp: 84.8 percent accuracy at 0.35 seconds. Fastest in the field. But the 119 billion parameter MoE asked us to tune for one fewer correct call versus Mistral 3.2, and we cannot ship a regression.
  • Qwen 3 32B Dense on Ollama: 84.8 percent accuracy at 27.7 seconds, plus 2 tool hallucinations on distractor prompts. We will not recommend this model for tool routing. A hallucinated tool call on a phone line is the worst possible failure mode. The caller asked for the weather. The model booked a meeting. Now you have to roll back the calendar invite and apologize.
  • Nemotron-3 Nano 30B on Ollama: 59.1 percent accuracy at 7.73 seconds. Missed 10 of 17 positive triggers. The thinking-style reasoning trace eats the output token budget before the actual tool call gets emitted. Skip for tool routing.

The eight-point gap between gpt-oss-20b and the next-best model is unusually large for a benchmark of this kind. Most benchmarks bunch the field within three to four points of the leader. This one separated the contenders from the pretenders cleanly.

Why gpt-oss-20b wins

Three independent things stack on top of each other to produce the lead.

First, the architecture. The model is a mixture of experts with about 21 billion total parameters but only 3.6 billion active per token. The active count is what governs decode latency. A 3.6 billion active model on a Blackwell PRO 6000 with 1792 GB per second of memory bandwidth decodes faster than a dense 7 billion parameter model, because every token's matmul only touches roughly one-eighth of the weights. Latency follows active parameters, not nameplate parameters.

Second, the training. OpenAI fine-tuned gpt-oss specifically for function calling. The model emits JSON that conforms to the tool schema on the first try at near-100 percent rate, with no preamble text, no "let me think about this," and no markdown wrappers. We saw the contrast on Nemotron, which spent half its output budget thinking out loud before deciding whether to call a tool.

Third, the inference stack. vLLM ships a dedicated --tool-call-parser=openai mode that knows the exact format the model emits. The parser does not have to guess between competing JSON conventions. Combined with cudagraph capture for batched decode, the production Sam stack runs at 0.46 seconds average end-to-end, which is below the threshold where humans perceive a pause on a phone call.

Stack matters as much as the model

Look at the same model on two different runtimes. gpt-oss-20b on Ollama: 1.16 seconds. gpt-oss-20b on the production Sam vLLM stack: 0.46 seconds. Same weights. Same hardware. A 2.5 times latency gap from the runtime alone.

Ollama is an excellent single-user inference tool. We use it daily for one-off prompts and quick local chat. But for a voice agent that answers actual phone calls, vLLM's continuous batching and cudagraph capture pay for themselves the moment more than one call lands on the same box. At eight concurrent call legs, the Sam stack pushes 1206 aggregate tokens per second across all legs combined, which is more than enough headroom for any realistic small-business voice load.

The takeaway: when you read a "best model" benchmark, also ask which runtime they used. The same model can win or lose depending on the stack. Petronella benchmarks every model in both stacks before we pick one for production.

The production specification

For Sam, our voice agent, the production configuration today is:

  • Model: openai/gpt-oss-20b from OpenAI's open-weights release, Apache 2.0 licensed.
  • Inference engine: vLLM v0.20.2 with --tool-call-parser=openai, --enable-auto-tool-choice, --max-model-len 32768, and --gpu-memory-utilization=0.75.
  • Hardware: ai5 workstation with Ryzen 9 9950X3D CPU, 192 GB system RAM, and one NVIDIA RTX PRO 6000 Blackwell Workstation Edition with 96 GB of GDDR7 memory.
  • Concurrent capacity: 16 simultaneous call legs at 128 tokens per second per leg. Aggregate 2037 tokens per second across all legs.
  • Total hardware cost: roughly 11 thousand dollars for the workstation plus GPU. No recurring cloud bill. No per-minute API charge.

This is the exact configuration that answers our office line when you call 919-348-4912.

The concurrent-scaling numbers nobody publishes

Single-user latency is the easy number to brag about. The harder number, and the one that decides whether a voice stack can survive a busy afternoon, is what happens to per-call latency as the box fills up with concurrent traffic.

We ran the production Sam stack with synthetic load at four concurrency levels and recorded both aggregate throughput and per-call throughput. The picture is encouraging.

  • At one active call leg, the stack pushed 278 tokens per second. That is the headline single-user number.
  • At four concurrent call legs, aggregate throughput rose to 719 tokens per second. Per-call throughput dipped slightly to 180 tokens per second per leg, still comfortably above the 60 to 80 tokens per second a downstream text-to-speech engine wants for natural speech rhythm.
  • At eight concurrent call legs, aggregate throughput reached 1206 tokens per second. Per-call throughput settled near 150 tokens per second per leg.
  • At sixteen concurrent call legs, aggregate throughput hit 2037 tokens per second. Per-call throughput held at 128 tokens per second per leg, which is the floor we set as acceptable for live voice work.

The ratio that matters is per-call throughput holding above 100 tokens per second all the way to 16 concurrent legs. For a small or mid-sized firm answering a single inbound number, 16 simultaneous voice sessions is more than the phone system itself can route. The bottleneck for a typical deployment is the telephony layer, not the language model. That is exactly the position you want to be in when sizing a voice stack: the AI brain is overprovisioned, not the constraint.

Cudagraph capture and continuous batching are doing most of the work here. The vLLM scheduler packs partial generations from different conversations into the same forward pass, then unpacks them back into the per-conversation streams. The CPU does almost no per-token work, which is exactly the opposite of how a naive single-stream runtime behaves. The downside is that a stack tuned for batching adds milliseconds of overhead when only one user is active. We measured that overhead as roughly 60 milliseconds on Sam, which is invisible to a human caller but real on a stopwatch.

What a real call looks like

Walk through a 90-second call with Sam to see where the budget gets spent.

The caller dials 919-348-4912. The telephony stack routes the call to Sam's session handler and announces "thanks for calling Petronella Technology Group, how can I help you." From greeting to ready-to-listen, the budget is roughly 700 milliseconds, mostly call setup on the telephony side, with the model doing nothing.

The caller says "I want to book a 15 minute call with Craig next Tuesday at 2 PM about CMMC." Whisper transcribes the audio to text in 280 milliseconds. The Sam orchestrator hands the transcript and the 10 tool schemas to gpt-oss-20b. The model returns a JSON tool call selecting book_consultation with arguments parsed from the sentence in 460 milliseconds. The orchestrator executes the booking through the Microsoft Graph API in 320 milliseconds. Kokoro generates a confirmation audio clip ("you're booked Tuesday at 2 PM, you'll get a calendar invite at the email you have on file") in 480 milliseconds. Total turn time from end of caller speech to start of confirmation audio: about 1.5 seconds. That is below the 2-second threshold where callers start asking "hello, are you still there."

The largest single chunk in that turn is the model decision at 460 milliseconds. Replace gpt-oss-20b with Gemma 4 31B Dense and the same turn jumps to nearly 15 seconds. The caller would have hung up. Replace it with Mistral Small 3.2 24B and the turn lands at about 1.8 seconds. Workable but tight. The choice of model is the choice of conversational rhythm.

Why this matters for small business

For years the playbook for a small-business voice agent looked like this: rent a cloud voice API, pay per minute, ship caller audio to a third-party cloud, and hope the vendor's data retention policy aligns with your compliance posture. Petronella Technology Group, Inc. does CMMC work, HIPAA work, and financial-services work where shipping caller audio to a third party is not allowed without a signed business associate agreement and a serious risk review.

The gpt-oss-20b stack changes the calculation. For roughly 11 thousand dollars in hardware that we install on-premises, a firm can run a voice agent that:

  • Books meetings, answers calls, and routes inquiries with sub-second response time
  • Keeps all caller audio, transcripts, calendar data, and lead information on hardware the firm owns and physically controls
  • Operates with no recurring per-minute or per-token cost
  • Survives an internet outage as long as the local network is up
  • Aligns cleanly with CMMC level 2 boundary requirements, HIPAA technical safeguards, and the data residency clauses common in regulated industries

This is what we mean when we talk about private AI deployment. The model is open. The hardware is yours. The data never crosses your network boundary unless you decide it should.

What the bench did not measure

We are publishing the numbers we measured. We will be honest about what we did not measure.

The bench was single-turn. A real call has many turns, with each turn updating the conversation state. gpt-oss-20b handles multi-turn conversations in production for us, but a controlled multi-turn benchmark is on the queue for the next sweep.

The bench did not measure sequential tool chains where one tool's output feeds the next tool's arguments. For example, a caller asking "check Craig's availability next Tuesday and then book the first open slot" requires calling check_availability, parsing the response, and then calling book_consultation with the result. That pattern is harder than single-shot routing.

The bench did not measure performance under sustained concurrent load with 16 calls hitting the box at once. Our concurrent throughput numbers come from synthetic load tests on the inference engine, not from a real flood of 16 simultaneous tool-call streams. The two should match closely, but we have not yet run that exact stress test.

The bench did not measure failure modes when the GPU is under thermal throttling, when the network has packet loss, or when the model weights are still loading from disk. Those edge cases will need a follow-up post.

How we built it at Petronella

Craig Petronella, the founder of Petronella Technology Group, Inc., holds CMMC-RP credentials, the North Carolina Licensed Digital Forensic Examiner license number 604180, and an artificial intelligence certificate from MIT. The full team includes four CMMC Registered Practitioners and is registered with the Cyber-AB as RPO number 1449. We have been building security and IT infrastructure for North Carolina firms since 2002, which means 23 years of experience hardening real businesses against real threats.

We benchmark every model and runtime combination on our own hardware before we recommend it to a client. The numbers in this post are not vendor claims and they are not marketing brochures. They are what we measured this week on the workstation that answers our office phone.

If you are weighing a private AI deployment for your firm, the most useful thing we can do is show you the same demonstration we use to train Sam. A 30-minute call covers the architecture, the security posture, the hardware bill of materials, and the runtime configuration. We will tell you exactly what you would buy, exactly how we would deploy it, and exactly what your team would see on day one.

FAQ

What is the difference between gpt-oss-20b and the closed OpenAI models?

The weights are different and the license is different. gpt-oss-20b uses Apache 2.0 licensing, so you can run it on your own hardware, fine-tune it, embed it in commercial products, and inspect every parameter. The closed GPT models stay on OpenAI's servers and run under their terms of service. Both come from the same lab, but only one can live entirely inside your network boundary.

Does Petronella resell gpt-oss-20b?

No. The model is free under Apache 2.0. What we sell is the engineering work to install it correctly, secure it, integrate it with your business systems, and monitor it in production. Open-source AI is not a vendor relationship. It is a deployment problem, and that is what we are good at.

Will gpt-oss-20b run on my existing server?

If the server has a current-generation NVIDIA GPU with at least 24 GB of VRAM, probably yes. Our recommended hardware is the RTX PRO 6000 Blackwell because the 96 GB headroom leaves room for larger context windows, multiple loaded models, and future model upgrades without buying another box. An RTX 5090 with 32 GB also runs the model fine for single-user workloads. We do not recommend running production voice agents on consumer cards under warranty because the thermal envelope is not designed for 24/7 inference.

How does Sam handle calls when the model is busy?

Sam uses a request queue with priority routing. Live phone calls are tagged priority and pre-empt batch jobs like email summarization. The vLLM scheduler honors the priority flag and serves voice traffic ahead of background work, so a caller never waits because Sam is generating a marketing report in the background.

What is the cost of running this versus a cloud voice API?

Hardware is roughly 11 thousand dollars one-time. Electricity is roughly 50 dollars a month at typical use. Comparable cloud voice APIs charge between 5 and 15 cents per minute. A firm taking 200 calls a month at 5 minutes each is paying between 50 and 150 dollars a month for cloud voice, plus per-token charges for the model. The local stack breaks even inside 12 to 18 months for a busy firm, and it never has to ship caller audio off-site.

Can we start smaller and scale up?

Yes. The same software stack runs on a 32 GB RTX 5090 workstation for under 5 thousand dollars total. That is the right starting point for a firm that wants to validate the voice agent on a single line before committing to the 96 GB Blackwell platform. We have shipped both configurations.

Talk to us about a private voice agent

If a private, on-premises voice agent fits your firm's compliance and operations posture, we will design the deployment, install the hardware, integrate it with your phone system and calendars, and train your team to operate it. Call 919-348-4912 and Sam will pick up. You can ask Sam to book a 30-minute call with Craig Petronella to walk through architecture, pricing, and a realistic deployment timeline for your environment.

The voice you hear answering that call is the stack documented in this post. We test the stack we sell. We sell the stack we test. The number is real, the model is real, and the demo starts the moment you dial.

Petronella Technology Group, Inc.
5540 Centerview Dr Suite 200
Raleigh, NC 27606
919-348-4912
AI services | Private AI deployment | Cybersecurity | Compliance | Managed IT

Need help implementing these strategies? Our cybersecurity experts can assess your environment and build a tailored plan.
Get Free Assessment

About the Author

Craig Petronella, CEO and Founder of Petronella Technology Group
CEO, Founder & AI Architect, Petronella Technology Group

Craig Petronella founded Petronella Technology Group in 2002 and has spent 20+ years professionally at the intersection of cybersecurity, AI, compliance, and digital forensics. He holds the CMMC Registered Practitioner credential issued by the Cyber AB and leads Petronella as a CMMC-AB Registered Provider Organization (RPO #1449). Craig is an NC Licensed Digital Forensics Examiner (License #604180-DFE) and completed MIT Professional Education programs in AI, Blockchain, and Cybersecurity. He also holds CompTIA Security+, CCNA, and Hyperledger certifications.

He is an Amazon #1 Best-Selling Author of 15+ books on cybersecurity and compliance, host of the Encrypted Ambition podcast (95+ episodes on Apple Podcasts, Spotify, and Amazon), and a cybersecurity keynote speaker with 200+ engagements at conferences, law firms, and corporate boardrooms. Craig serves as Contributing Editor for Cybersecurity at NC Triangle Attorney at Law Magazine and is a guest lecturer at NCCU School of Law. He has served as a digital forensics expert witness in federal and state court cases involving cybercrime, cryptocurrency fraud, SIM-swap attacks, and data breaches.

Under his leadership, Petronella Technology Group has served hundreds of regulated SMB clients across NC and the southeast since 2002, earned a BBB A+ rating every year since 2003, and been featured as a cybersecurity authority on CBS, ABC, NBC, FOX, and WRAL. The company leverages SOC 2 Type II certified platforms and specializes in AI implementation, managed cybersecurity, CMMC/HIPAA/SOC 2 compliance, and digital forensics for businesses across the United States.

CMMC-RP NC Licensed DFE MIT Certified CompTIA Security+ Expert Witness 15+ Books
Related Service
Enterprise IT Solutions & AI Integration

From AI implementation to cloud infrastructure, Petronella Technology Group helps businesses deploy technology securely and at scale.

Explore AI & IT Services
Previous All Posts Next
Free cybersecurity consultation available Schedule Now