Every week, another company tells us the same thing. Their developers want to use AI coding assistants, their leadership sees the productivity numbers, and their compliance team has put a hard stop on the whole idea. The reason is simple. Most popular AI coding tools send your source code, your configuration files, and sometimes your secrets to a vendor's cloud. For a defense contractor handling Controlled Unclassified Information, a healthcare practice under HIPAA, or a law firm with privileged client data, that is a non-starter.
The good news is that the gap between cloud-only AI and what you can run on your own hardware is closing fast. Open-weight models, the kind you download and run entirely inside your own network, are now competitive with the closed commercial systems on the tasks that matter most to working engineers. To prove it rather than assume it, we benchmark these models ourselves on a private test harness built around real Petronella Technology Group, Inc. engineering work.
This month we put Cohere's brand-new North Mini Code through that harness. The result was striking enough that we want to walk through exactly what we found, how we measured it, and what it means for any organization that needs strong AI assistance without surrendering control of its data.
Why On-Premise AI Matters More Than the Model Itself
Before we get to scores, it helps to understand why we care about open-weight models at all. The marketing around AI rarely mentions the part that keeps compliance officers awake. When you paste code into a hosted assistant, that code leaves your boundary. Where it goes, how long it is retained, whether it is used to train future models, and who can subpoena it are questions you often cannot answer with certainty.
For regulated work, that uncertainty is the whole problem. The Cybersecurity Maturity Model Certification program and the underlying NIST SP 800-171 control set both require you to know exactly where Controlled Unclassified Information lives and to limit the systems that can touch it. An AI tool that quietly transmits CUI to a third party expands your assessment scope and can break the very controls you are paying to maintain. We cover the broader picture of meeting these obligations on our CMMC compliance resources, and the same logic applies to HIPAA, ITAR, and most financial regulations.
An open-weight model changes the calculation. You download the weights once, host the model on a server you own, and every prompt and response stays inside your firewall. There is no outbound connection to a vendor, no data-retention clause to negotiate, and no fourth-party risk hiding in the supply chain. This is what people mean when they talk about sovereign AI, and it is the foundation of the on-premise and air-gapped AI deployments we design for clients who cannot use the public cloud.
The catch, historically, has been quality. Open models were good enough for demos but trailed the commercial leaders on serious coding work. That is the assumption North Mini Code challenges, and it is why we tested it.
What North Mini Code Actually Is
North Mini Code is the first developer-focused model from Cohere and its research division, Cohere Labs. The company announced it in June 2026 and released the weights under the Apache 2.0 license, which permits commercial use, modification, and private deployment without per-seat fees. For an organization that wants to run AI on its own terms, the license is as important as the benchmark scores.
The technical shape of the model is what makes it practical. North Mini Code is a sparse mixture-of-experts design with 30 billion total parameters but only about 3 billion active for any given token. In plain terms, it carries the knowledge of a large model while doing the work of a small one. That keeps it fast and lets it run on a single modern accelerator rather than a rack of them. Cohere lists a minimum requirement of a single H100-class GPU when the model is served in the FP8 numeric format, which is exactly the configuration we tested.
Cohere built the model specifically for agentic software engineering, meaning it is trained not just to autocomplete a line but to plan, call tools, run commands in a terminal, and iterate on its own output. It supports a very large context window, so it can hold an entire codebase or a long compliance document in working memory. The company reports a score of 67.6 on the widely used SWE-bench Verified test, a respectable number for a model this size. Published vendor benchmarks are a starting point, not a verdict, which is why we run our own.
How We Test, and Why We Trust the Numbers
Our benchmark harness exists because we make production decisions based on it. We route real client work to these models, so a misleading score costs us time and trust. To avoid fooling ourselves, we built the evaluation around a few firm rules.
First, the tasks are ours. The coding portion uses 22 tasks drawn from actual Petronella Technology Group, Inc. engineering work: shell scripts for backups, SEO fixes, debugging exercises, configuration changes, and multi-file reasoning problems. These are the chores our team automates every week, not abstract puzzles a model may have memorized.
Second, no model grades itself. A model asked to score its own answer inflates the result by twenty to forty percent in our testing, so we forbid it. Every answer in this campaign was graded by an independent judge from a different model family, locked to a single grader for the entire run so the rankings stay comparable. Letting a model grade its own family is the most common way benchmark numbers get quietly corrupted, and we guard against it in code.
Third, we measure variance, not single lucky runs. For the headline coding number we ran the full 22-task suite five separate times and report the mean with its spread. A single run can swing by a few points on chance alone, and a responsible comparison has to account for that.
Finally, we test on hardware we control. North Mini Code ran on a single workstation-class Blackwell GPU inside our lab, served with the open-source vLLM engine in the same FP8 format Cohere recommends for production. Nothing about the test depended on a cloud endpoint, which is the entire point. If you want to understand how we apply this discipline to client systems, our cybersecurity services team uses the same evidence-first approach to vet every tool we deploy.
The Results: A New Co-Leader on Coding
On our coding suite, North Mini Code earned a mean score of 0.980 out of a possible 1.0 across five runs, with a tight spread of plus or minus 0.004. It passed all 22 tasks in every single run. That places it in a statistical tie with the strongest open model we have ever benchmarked on this suite, and slightly ahead of Qwen3.6-35B-A3B, the model that has anchored our own fleet as the default coder.
A few points put that number in context:
- The score landed inside a narrow band every time we ran it, which tells us the result is real and repeatable, not a fluke.
- It matched a model that needs two of the most expensive data-center GPUs available, while North Mini Code did the work on one workstation card.
- It produced clean, correct answers quickly, averaging a few seconds per task, which matters when a model sits inside an interactive developer workflow.
For a model you can download for free and run on a single machine, matching the very top of our coding leaderboard is a genuine milestone. Two years ago, this level of capability was available only through a metered cloud API.
Speed, Stability, and Clean Tool Calls
A high score on a leaderboard is worth little if the model is slow, erratic, or sloppy with the mechanics of tool use. We checked all three, because they decide whether a model is pleasant to work with in practice.
On speed, North Mini Code was one of the quickest strong coders we have measured, averaging only a few seconds per task on a single GPU. The efficiency comes from its sparse design, which keeps the active computation small even though the model carries a large body of knowledge. In an interactive workflow, that responsiveness is the difference between a tool developers reach for and one they avoid.
On stability, we ran the coding suite at two different temperature settings, a parameter that controls how much randomness the model uses when it writes. Some models swing wildly between conservative and creative settings, which makes their output hard to predict. North Mini Code held steady, scoring 0.982 at a low setting and 0.977 at the vendor-recommended higher one. A model that behaves consistently regardless of that knob is far easier to deploy with confidence.
On tool calling, which is the foundation of any agentic workflow, the model emitted clean, correctly structured function calls. When we asked it to inspect a directory, it produced a precise, well-formed command and signaled completion properly rather than dumping malformed output that a downstream system would choke on. That reliability is what lets a model safely drive a terminal, edit files, and run tests on its own. Our AI engineering team treats clean tool behavior as a hard requirement before any model touches an automated pipeline.
Loop Amplification: The Test That Separates Real Agents From Pretenders
Raw coding accuracy is only part of the story for agentic work. The more important question for an autonomous assistant is what happens when it gets to try again. A real agent should use extra iterations to fix its own mistakes. A weak one either stays flat or, worse, declares victory on a job it never finished. We call this the loop amplification test, and it is where many otherwise capable models fall apart.
Our loop harness gives the model a small ticket to complete inside a real sandbox with real files. After each attempt, a programmatic verifier checks the actual work against ground truth, and a separate done-gate re-runs the produced files in a fresh sandbox to catch any model that fabricates a finished status. We run the same set of tasks twice: once with a single attempt allowed, and once with up to five attempts.
North Mini Code amplified. With one attempt it completed 0.833 of the tasks and fabricated completion twice. With up to five attempts it climbed to 0.917 and cut its false completions to one. In other words, the extra iterations did exactly what they should: the model found and fixed a failure, and it became more honest about its own status rather than less. It also worked efficiently, solving most tasks on the first try and only looping when it genuinely needed to.
That behavior stands in sharp contrast to several popular small models we have tested. Some stay completely flat between one attempt and five, gaining nothing from the chance to iterate. Others actively get worse, multiplying their fabricated completions under the pressure of a loop. A model that lies about finishing is dangerous in an automated pipeline, because the false success propagates downstream before a human ever sees it. North Mini Code passed this test cleanly, which supports Cohere's claim that it was built for genuine agentic engineering rather than one-shot autocomplete.
Where North Mini Code Is Not the Answer
Honest benchmarking means reporting the weaknesses with the same clarity as the strengths. North Mini Code is a coding specialist, and our broader tests show it.
On our research and reasoning suite, which measures long-document summarization, multi-hop question answering, citation accuracy, and reasoning-heavy generation, North Mini Code scored 0.821. That is respectable but well behind the models we rely on for compliance research and retrieval-augmented generation. If your goal is to query a corpus of regulatory documents and get carefully cited answers, this is not the model to reach for, and we would steer you toward a stronger reasoning model in that role.
On long-form writing, the picture is mixed. North Mini Code produced a clean, well-structured article that our judge scored at 0.945, on par with the best models we have tested for blog generation, and it did so without the repetition problems that plague some reasoning models on long output. The limitation is length. Left to its own defaults it wrote about 2,300 words, short of the 3,000-plus words an in-depth, search-optimized article usually needs. It is a capable writer for shorter pieces, not a drop-in replacement for a dedicated long-form content engine.
None of this is a criticism. A model that is excellent at coding and agentic tool use, decent at writing, and merely average at deep research is precisely what you would expect from a tool built and named for developers. The mistake would be deploying it outside its lane.
Building a Sovereign AI Workflow Around Open Models
The strategic lesson from this benchmark is bigger than one model. A regulated organization no longer has to choose between strong AI and data control. You can assemble a capable, fully on-premise AI stack today by matching the right open model to each job: a coding specialist like North Mini Code for engineering and automation, a stronger reasoning model for compliance research, and a dedicated writer for content. Every one of them runs inside your boundary, under your logging, with no data leaving your network.
This is the architecture we build for clients who operate under CMMC, HIPAA, ITAR, and similar mandates. The model server lives on hardware you own or that we manage for you, often in a segmented or fully air-gapped enclave. Access is authenticated and logged like any other sensitive system. The result is an AI capability that supports an audit rather than undermining it, and that maps cleanly onto the control families your assessor will examine. You can read more about how we structure these programs on our compliance program pages.
Getting the architecture right is not trivial. The model is only one component. You also need a hardened inference server, sensible access controls, monitoring that satisfies your control set, and a process for evaluating new models as they ship, because the field moves monthly. That last point is exactly why we maintain a public benchmark and update it as new models arrive. The work is led by Craig Petronella, a CMMC Registered Practitioner, Digital Forensic Examiner, and the Amazon number one best-selling author of more than a dozen cybersecurity books, and it informs every recommendation we make to clients.
What This Means for Your Organization
If you have been holding back on AI because of where the data goes, the calculus has changed. The capability you wanted is now available in a form you can own. North Mini Code is one strong example among a growing field, and its Apache 2.0 license and single-GPU footprint make it unusually practical for small and mid-sized organizations that do not have a hyperscale budget.
The right next step is rarely to download a model and hope. It is to define the jobs you want AI to do, map them to models that have been honestly tested for those jobs, and stand the whole thing up inside a boundary your compliance program already trusts. That is the work we do every day, and our independent benchmark is how we keep our advice grounded in evidence rather than vendor claims.
FAQ
Is North Mini Code free to use commercially?
Yes. Cohere released North Mini Code under the Apache 2.0 license, which allows commercial use, modification, and private deployment with no per-seat or per-token fees. You are responsible for the hardware you run it on, but the weights themselves carry no licensing cost.
Can it really run on a single GPU?
Yes. Because it activates only about 3 billion of its 30 billion parameters per token, the model is efficient to serve. Cohere lists a single H100-class GPU as the minimum in the FP8 format, and in our lab it ran comfortably on a single workstation-class Blackwell card while matching the top of our coding leaderboard.
Does using an on-premise model keep me compliant with CMMC?
Hosting a model on hardware you control removes one major source of data leakage, because prompts and responses never leave your network. That is a strong start, but compliance depends on the full system: access controls, logging, network segmentation, and documented procedures all have to be in place. We design the complete enclave, not just the model server, and validate it against your control set.
How is your benchmark different from the scores vendors publish?
Vendor benchmarks are a useful baseline, but they are produced by the party with an interest in the outcome. Our tasks come from real engineering work, every answer is graded by an independent model from a different family, we run the coding suite five times to measure variance, and we test on hardware we control. We publish the methodology so the numbers can be checked.
Should I replace my current AI coding tool with North Mini Code today?
For many regulated organizations it is worth a serious pilot, especially where a cloud assistant is currently prohibited. That said, it is a coding specialist, not a general-purpose model. We recommend testing it on your own representative tasks and pairing it with stronger models for research and long-form writing before committing it to production.
Who can help us set this up safely?
Petronella Technology Group, Inc. designs, deploys, and manages on-premise and air-gapped AI systems for organizations under CMMC, HIPAA, and similar mandates. We handle model selection, the inference server, access controls, monitoring, and the documentation your assessor will want to see.
Talk to Us About Sovereign AI
If you want strong AI assistance without sending your most sensitive data to someone else's cloud, we can help you design a system that your compliance program will welcome rather than fear. Call Petronella Technology Group, Inc. at 919-348-4912 or reach our team through our contact page to schedule a conversation about an on-premise or air-gapped AI deployment built for your regulatory requirements.