
Karpathy's Autoresearch: What It Means for Enterprise R&D

Posted in: AI


Andrej Karpathy pushed a small repository to GitHub in early March 2026 that most of the AI world is still trying to digest. The project is called autoresearch. It is roughly 630 lines of training code, a single instruction file written in plain English, and a framing idea that sounds simple until you try to articulate why it matters. The idea is that an AI coding agent can be handed a real machine learning training setup, left alone for two days, and come back having run hundreds of experiments, kept the ones that improved the model, thrown out the ones that did not, and written a log explaining what it tried. No human in the loop. No watchlist of metrics refreshed by hand. Just the agent, a GPU, and a tight feedback signal.

That is a deceptively dry description. The actual behavior is closer to what happens when you hand a good engineer a problem and walk away for a weekend. The agent picks a change, runs the training for five minutes of wall clock time so everything stays comparable, checks a single number, and decides whether the change was worth keeping. Then it iterates. Karpathy ran the loop for two days and it conducted roughly 700 experiments, surfacing 20 optimizations that stuck. He then transferred those same 20 tweaks to a larger model and shaved about 11 percent off the training time needed to hit GPT-2 quality on the nanochat leaderboard. The time to GPT-2 went from 2.02 hours to 1.80 hours.

That is not a huge number in isolation. What is interesting is not the 11 percent. What is interesting is that a human did not write the 20 tweaks. The agent wrote them, tested them, and composed them into a coherent stack of improvements, and the only human judgment involved was in the shape of the harness and the metric it was pointed at.

Petronella Technology Group has been building private AI clusters, research agents, and autonomous workflows for clients across healthcare, defense, finance, engineering, and legal verticals since Claude and the open-source frontier got good enough to trust with serious work. The autoresearch pattern is not a toy. It is the clearest demonstration so far of how an internal research loop can be built inside a regulated business without handing control of the experiments to an external model vendor, and it is worth breaking down why that matters and how a team without a Karpathy-level founder can start running loops of their own.

What "Autoresearch" Actually Is

Most of what has been labeled autonomous AI research over the last three years has been either neural architecture search, which is narrow and expensive, or loose prompt engineering harnesses that claim to do research but really just run the same prompt over and over with small variations. Karpathy's framing is different. In his own words, neural architecture search is "such a weak version of this that it's in its own category of totally useless by comparison." That is a strong statement from someone who typically understates things, and it is worth reading carefully.

The autoresearch loop has four pieces. There is a sealed baseline of training code that the agent can modify. There is a prompt file that tells the agent what to try, what to avoid, and how to interpret results. There is a fixed time budget for each trial so that every experiment is comparable regardless of whether the agent tried a bigger model, a weirder optimizer, or a completely different attention pattern. And there is a single metric that is cheap to compute and hard to game. For autoresearch, the metric is validation bits per byte, a standard measure of how well a language model predicts the next token in held-out text. Lower is better. The agent cannot overfit it with a prompt trick because the training signal is real.
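Those four pieces fit together in a loop small enough to sketch in a few lines. This is an illustrative skeleton, not the actual autoresearch code: `run_trial` and `propose_change` are hypothetical stand-ins for the training run and the agent's proposal step, and the simulated metric is just noise around a baseline.

```python
import random

TRIAL_BUDGET_SECONDS = 300  # fixed per-trial budget keeps every experiment comparable

def run_trial(change, budget_seconds):
    """Hypothetical stand-in: apply a code change, train for the budget,
    and return validation bits per byte (lower is better).
    A real harness would patch the training code, launch the run,
    and parse the metric out of the log. Here we simulate a noisy result."""
    return 1.0 + random.uniform(-0.05, 0.05)

def autoresearch_loop(propose_change, n_trials):
    """Keep-or-discard loop: one sealed baseline, one metric, one budget."""
    best_bpb = run_trial(None, TRIAL_BUDGET_SECONDS)  # sealed baseline score
    kept = []
    for _ in range(n_trials):
        change = propose_change(kept)                  # agent proposes the next experiment
        bpb = run_trial(change, TRIAL_BUDGET_SECONDS)
        if bpb < best_bpb:                             # the single metric decides
            best_bpb = bpb
            kept.append(change)                        # winners compound
    return kept, best_bpb
```

Everything else in the pattern is detail around this skeleton: what the prompt file tells `propose_change` to try, and how honestly `run_trial` reports the metric.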

Those four pieces sound basic. They are not. Most enterprise AI projects that call themselves research lack at least two of them. They do not have a sealed baseline because the code keeps shifting under the team's feet. They do not have a fixed time budget because different experiments take different amounts of time and no one wants to sit through a six-hour run. And they do not have a single crisp metric because stakeholders want eight metrics on a dashboard. Karpathy's contribution is as much about the discipline of the harness as it is about the agent itself.

What changes when you get the four pieces right is that the agent becomes comparable to a junior researcher with infinite stamina. It does not need to take breaks. It does not get frustrated when an experiment fails. It reads its own training logs, forms a hypothesis, writes the code change, runs the trial, and either keeps or discards the result. If it discards, it tries something else. If it keeps, it moves on. Karpathy pointed out in his No Priors interview that he has not personally written much code since December 2025. He directs agents. The autoresearch loop is what that direction looks like when it is pointed at a research problem rather than a product feature.

He also said the pattern has broader reach. "Any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm." That is where enterprise R&D enters the picture. Most businesses do not need a better training method for GPT-2. They need better internal processes, better retrieval systems, better anomaly detection, better forecasting, better document extraction, and better triage of incoming requests. All of those are metric-driven if you can define a crisp enough metric. The autoresearch pattern generalizes.

Why Enterprise R&D Leaders Should Care

There has been a long-running disconnect in enterprise AI between what labs like OpenAI, Anthropic, and Google DeepMind publish and what a typical corporate research team can actually do with it. The labs publish frontier results on clusters of thousands of GPUs. The corporate team has a handful of cards, a compliance officer who wants every data flow documented, and a calendar full of meetings. The gap between the two has historically meant that enterprise AI teams either consume frontier models as a service or run small projects that never compound into anything bigger.

Autoresearch collapses that gap for a specific class of problem. The setup fits on a single NVIDIA H100 and runs in overnight cycles. The metric is bits per byte, which maps cleanly onto internal language tasks. The harness is small enough to audit, which means compliance teams can actually read the 630 lines and the markdown prompt file without drowning in complexity. And the improvements transfer. That last point is underrated. When the agent found 20 changes that helped a depth-12 model, those same 20 changes helped the depth-24 model too. Research done on cheap hardware produced tweaks that paid off when applied to bigger training runs.

For a business with an internal fine-tuning pipeline, that pattern means you can run autoresearch on a small version of your workload, harvest the improvements, and apply them to your real production training. You do not need a frontier cluster to get frontier-quality experimentation. Petronella Technology Group has been running this pattern with clients since the repo went public, and the pattern has held up outside of language modeling. It works for retrieval tuning. It works for document classification. It works for anomaly detection on log streams. It works anywhere you have a tight metric and fast trials.
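What all of those tasks share is the contract they expose to the loop: a fast trial, one scalar metric, and a direction. A minimal sketch of that contract, with illustrative names rather than any real API, looks like this:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Harness:
    """Minimal contract for an autoresearchable problem (names are illustrative).
    Any task fits the pattern if it can supply these three things."""
    run_trial: Callable[[str], float]  # apply a proposed change, return the metric
    budget_seconds: int                # fixed wall-clock cap per trial
    lower_is_better: bool = True       # bits per byte: lower; recall: higher

    def improved(self, candidate: float, incumbent: float) -> bool:
        """The keep-or-discard decision the agent applies after every trial."""
        if self.lower_is_better:
            return candidate < incumbent
        return candidate > incumbent
```

A retrieval-tuning loop would plug in recall at fixed precision with `lower_is_better=False`; a language-modeling loop plugs in bits per byte with the default. The loop code itself does not change.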

The more important shift is what this does to the economics of research. A two-day autoresearch run costs maybe 40 dollars of compute on a rented H100 plus whatever you pay for the agent model API calls. That is less than a single hour of senior engineering time at most firms. A mid-sized team that runs autoresearch continuously on four or five different internal problems is effectively adding a full-time researcher for the price of electricity. The constraint is no longer people. The constraint is the quality of your harness and the clarity of your metric.

Karpathy called out the other half of this shift explicitly. "The next step for autoresearch is that it has to be asynchronously massively collaborative for agents." Individual loops are already useful. What happens when twenty different loops are running at once, sharing a common pool of discovered optimizations, and coordinating on which directions to explore? That is the direction the field is heading. Enterprise teams that get good at the single-loop pattern first will be in a much better position to operate the multi-loop systems when they arrive.

The Regulated Business Problem

Here is where most coverage of autoresearch stops. The loop works. The optimizations transfer. The economics are attractive. Ship it. The problem is that most of our clients cannot ship it, at least not in the form Karpathy published. They cannot use a public agent API to modify arbitrary code that touches customer data. They cannot upload training runs to a cloud provider that has not been assessed under their compliance framework. They cannot even let an autonomous process modify production code without a change review because their auditors will have a strong opinion about that.

This is where Petronella Technology Group spends most of its time. Our clients are regulated. CMMC Level 2 for defense contractors. HIPAA for healthcare. SOC 2 for most of the financial and legal firms. FedRAMP Moderate for the teams working with federal data. GDPR for the European operations. The autoresearch pattern is valuable to them, but it has to be implemented in a way that satisfies the control frameworks. That is not a blocker. It is a design constraint, and once you treat it as a design constraint the problem becomes solvable.

The first constraint is that the agent cannot see production data without explicit authorization and logging. For a healthcare client running internal document extraction, that means the autoresearch loop runs on synthetic or de-identified data only, and production data is only introduced through a reviewed pipeline with its own audit trail. The agent can optimize the model. It cannot peek at the inputs.

The second constraint is that the code the agent modifies has to be sandboxed. We run autoresearch inside isolated containers with no network egress to anything outside the approved model providers and the internal metrics store. The agent cannot exfiltrate anything because it has nowhere to send it. When the loop finishes, the diff is preserved, tagged, and reviewed by a human before any of the changes merge into production training pipelines.

The third constraint is that the model running the loop has to be approved for the data classification level of the experiment. For most of our regulated clients, that rules out consumer API tiers. It points toward Claude on AWS Bedrock with a data processing addendum, or toward self-hosted open-source models on a private AI cluster that the client controls end to end. The latter is what most of our clients pick once they do the math on scaling autoresearch across multiple teams. A private cluster running Llama 3.1 405B or DeepSeek V3 or Qwen 2.5 72B can drive an autoresearch loop without a single byte of experiment data ever leaving the client's own network. That is the posture a serious compliance program wants.

The fourth constraint is that every experiment has to produce an audit trail that a human can read. That is actually easier than it sounds. Autoresearch logs are already readable. The agent writes a few sentences about what it tried, the metric it saw, and whether it kept the change. You wrap that output in a standard evidence format, ship it to an internal logging system, and you have an audit trail that most compliance reviewers will accept. Some will want additional structure, but the raw material is there.
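A minimal version of that evidence wrapper is a few lines of Python. The field names here are illustrative, not a formal compliance schema; hashing the diff gives reviewers a tamper-evident pointer to the exact change the agent made.

```python
import datetime
import hashlib
import json

def audit_record(trial_id, agent_summary, diff_text, metric_name, value, baseline, kept):
    """Wrap one trial's outcome in a reviewable evidence record.
    Illustrative structure: one JSON object per trial, one line per object,
    so the log is greppable and easy to ship to an internal logging system."""
    return {
        "trial_id": trial_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent_summary": agent_summary,  # the agent's own few sentences about what it tried
        "diff_sha256": hashlib.sha256(diff_text.encode()).hexdigest(),
        "metric": {"name": metric_name, "value": value, "baseline": baseline},
        "kept": kept,
    }

# Example entry for a trial the agent decided to keep
record = audit_record(42, "Adjusted the learning rate warmup schedule",
                      "--- a/train.py\n+++ b/train.py\n", "val_bpb",
                      0.932, 0.941, kept=True)
print(json.dumps(record))
```

The reviewer reads the summary, the diff hash ties the record to the preserved code change, and the metric pair shows exactly why the agent kept it.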

What we have found in practice is that regulated businesses that implement autoresearch with these four constraints baked in end up with a research loop that is actually more trustworthy than what most labs operate internally. The labs move fast and break things. Enterprise teams cannot afford to, and the discipline of the compliance overlay tends to produce cleaner science. When every change the agent makes is logged, reviewed, and tied to a specific metric improvement, you get a research record that holds up under scrutiny.

How To Start


The temptation with a project like autoresearch is to spin up a huge initiative. Hire a team. Buy hardware. Write a plan. That is the wrong instinct. The pattern is small. It should stay small for the first six weeks while your team figures out what your actual research problems look like. The teams that get traction with this are the ones that start with one crisp metric and one small model and one overnight loop and iterate from there.

Here is the rough sequence we have found works for a team that has never run an autonomous research loop before.

Pick one problem with a clear metric. Not a dashboard. A single number. Validation accuracy on an internal eval set. Bits per byte on a held-out corpus. Recall at a fixed precision threshold for a classification task. The metric has to be cheap to compute because the agent is going to compute it several hundred times a day.
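Bits per byte is a good example of a metric that is both cheap and hard to game, because it normalizes away the tokenizer. A sketch of the conversion, assuming your training loop reports mean cross-entropy in nats per token:

```python
import math

def bits_per_byte(loss_nats_per_token, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats per token) on held-out text into
    bits per byte, so models with different tokenizers stay comparable.
    Lower is better."""
    total_bits = (loss_nats_per_token / math.log(2)) * n_tokens  # nats -> bits
    return total_bits / n_bytes

# Example: loss of 1.0 nat/token on text averaging 4 bytes per token
bpb = bits_per_byte(1.0, n_tokens=1000, n_bytes=4000)
```

The token and byte counts come from the same held-out corpus every trial, which is part of what makes the number hard for the agent to game.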

Cap the trial time. Five minutes per trial is the autoresearch default and it works well because it forces the agent to focus on changes that show up fast. Longer trials let the agent waste time on slow but not particularly interesting ideas. Shorter trials do not give enough signal to distinguish good changes from noise.
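Enforcing the cap is one line of plumbing if each trial runs as a subprocess. A sketch, where the trial command is whatever launches your training run:

```python
import subprocess

TRIAL_SECONDS = 300  # the five-minute autoresearch default

def run_capped_trial(cmd):
    """Run one training trial under a hard wall-clock cap.
    Returns the trial's stdout, or None if it blew its budget."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=TRIAL_SECONDS)
        return result.stdout
    except subprocess.TimeoutExpired:
        return None  # over-budget trials are discarded, not extended
```

Discarding over-budget trials instead of extending them is the point: the budget is what makes every experiment comparable to every other.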

Sandbox the agent. Isolated container. No network egress to anything except the approved model API and the metrics store. Code diffs preserved per trial. Do this on day one, not day 30.
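One way to keep the isolation rules from being forgotten is to build every trial's container invocation from a single function. The image name and paths below are placeholders; note that the training trial itself needs no network at all, so it can run fully offline, while the agent's API calls and metric uploads happen outside the container through an allowlisted path.

```python
def sandboxed_trial_cmd(image, workdir):
    """Build a container invocation for one trial (illustrative, not a
    hardening guide): no network egress, read-only code mount, and the
    diffs directory as the only writable path."""
    return [
        "docker", "run", "--rm",
        "--network", "none",  # the trial trains offline; no egress anywhere
        "--mount", f"type=bind,src={workdir}/code,dst=/work/code,readonly",
        "--mount", f"type=bind,src={workdir}/diffs,dst=/work/diffs",
        image, "python", "/work/code/train.py",
    ]
```

The diffs mount is what preserves the per-trial code changes for the human review step later.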

Point the agent at a model that can actually write code. For most of our clients that is Claude Opus 4.5 or Claude Sonnet 4.7 for the orchestration, with a locally hosted model doing the actual training. The orchestration model needs to be strong at code synthesis and long context reasoning. The training model does not have to be large.

Run the loop overnight. Look at the log in the morning. Most of the trials will fail. That is fine. Karpathy's run had about 680 experiments that did not stick. You are looking for the 20 that do, and you only find them by running the 700.

Commit the winners to your production pipeline. The whole point of the loop is to produce transferable improvements. If the 20 optimizations sit in a research repo and never reach the production model, you have wasted the exercise. Build the handoff path before you need it.

Review everything with a human. The agent is smart enough to find improvements. It is not smart enough to decide whether those improvements are safe for production. That is still a human judgment. Our experience is that about 80 percent of the changes the agent finds are straightforwardly good, 15 percent are good but need a small human tweak, and 5 percent look good on the metric but would cause some other problem if deployed. A human reviewer catches the last category.

The whole thing should cost under a thousand dollars in compute and agent calls for the first month. If you are spending more than that, you are probably over-engineering the harness. Come back to the 630 lines and ask what you added and whether it earned its keep.

What This Means For The Next Two Years

Karpathy has been careful not to overclaim. He has called 2025 through 2035 "the decade of agents" rather than declaring agent dominance already here. He has said autoresearch at scale is "a lot more complex of course" but that "doing it is just engineering and it's going to work." He has said frontier labs will all build systems like this and called it "the final boss battle" for model training. Those are the statements of someone who has seen this pattern work at small scale and is extrapolating cautiously to larger scales. That is the right temperature for enterprise leaders to match.

The useful framing is that autoresearch is not a product. It is a capability. Once a team has built one successful loop, they can build a second loop in a fraction of the time because the harness generalizes. The first loop might take six weeks. The second takes two. The tenth takes a day. Once you have ten loops running on ten different internal problems, you have something that actually looks like a research capability, and you have it without hiring a research team.

That is the shift. Research used to be a people problem. It is becoming a harness problem. The businesses that figure out how to build good harnesses over the next two years will compound improvements across their AI stack faster than businesses that keep treating AI as a series of one-off projects. The autoresearch pattern is the clearest demonstration yet that the compounding is real.

Petronella Technology Group helps regulated businesses build this kind of capability the right way. We design the sandboxing. We pick the model that fits the data classification. We set up the metrics store and the audit trail. We work with your compliance team to make sure the control framework is satisfied before the first loop runs, not after. And we build the handoff path from the research loop to the production pipeline so the improvements actually make it into shipping systems.

If you are an R&D leader who has been watching the autoresearch news and trying to figure out how to adapt the pattern for a regulated environment, that is exactly the conversation we have every week. Craig Petronella holds CMMC-RP, CCNA, CWNE, and DFE 604180 credentials. Our team is fully CMMC-RP certified, and Petronella Technology Group is a CMMC-AB Registered Provider Organization, RPO 1449. We have been working with enterprise AI, managed AI services, and private AI cluster deployments long enough to know where the compliance rocks are buried, and we know how to build research loops that do not trip over them.

Start with one metric. Start with one model. Start with one overnight loop. The rest follows.

Key Takeaways For Enterprise R&D Leaders

Karpathy's autoresearch is a minimal repository that lets an AI agent modify training code, run experiments, and iterate without human supervision. It is real. It works. The 700-experiment run that produced 20 transferable optimizations was reported by Karpathy himself and covered by Fortune and The New Stack in March 2026.

The pattern generalizes beyond language model training. Any problem with a crisp metric and a fast trial loop can be autoresearched.

Regulated businesses can implement the pattern with four constraints: no production data in the loop without review, sandboxed execution, approved models only, and a full audit trail of every trial.

The economics favor starting small. A single-GPU overnight loop costs tens of dollars, not tens of thousands. The limiting factor is the quality of your harness, not the size of your budget.

Petronella Technology Group builds and operates these loops for regulated clients. Call (919) 348-4912 or use the form at /contact-us/ to talk through what an internal research agent could do for your team.

Our broader work across AI, compliance, infrastructure, and managed services is laid out at /solutions/. The private AI cluster platform most of our clients use for autoresearch is documented at /solutions/private-ai-cluster/. The full AI services catalog lives at /ai/.

Sources

The autoresearch repository itself is at github.com/karpathy/autoresearch. The nanochat repository that autoresearch targets is at github.com/karpathy/nanochat. Karpathy's personal site is karpathy.ai. His education company Eureka Labs is at eurekalabs.ai. His public posts are at x.com/karpathy. Fortune's coverage of the March 2026 autoresearch announcement is at fortune.com. The New Stack's piece on the 630-line Python script is at thenewstack.io. His long-form No Priors interview walkthrough is at pjfp.com.


About the Author

Craig Petronella, CEO and Founder of Petronella Technology Group
CEO, Founder & AI Architect, Petronella Technology Group

Craig Petronella founded Petronella Technology Group in 2002 and has spent more than 30 years working at the intersection of cybersecurity, AI, compliance, and digital forensics. He holds the CMMC Registered Practitioner credential (RP-1372) issued by the Cyber AB, is an NC Licensed Digital Forensics Examiner (License #604180-DFE), and completed MIT Professional Education programs in AI, Blockchain, and Cybersecurity. Craig also holds CompTIA Security+, CCNA, and Hyperledger certifications.

He is an Amazon #1 Best-Selling Author of 15+ books on cybersecurity and compliance, host of the Encrypted Ambition podcast (95+ episodes on Apple Podcasts, Spotify, and Amazon), and a cybersecurity keynote speaker with 200+ engagements at conferences, law firms, and corporate boardrooms. Craig serves as Contributing Editor for Cybersecurity at NC Triangle Attorney at Law Magazine and is a guest lecturer at NCCU School of Law. He has served as a digital forensics expert witness in federal and state court cases involving cybercrime, cryptocurrency fraud, SIM-swap attacks, and data breaches.

Under his leadership, Petronella Technology Group has served 2,500+ clients, maintained a zero-breach record among compliant clients, earned a BBB A+ rating every year since 2003, and been featured as a cybersecurity authority on CBS, ABC, NBC, FOX, and WRAL. The company leverages SOC 2 Type II certified platforms and specializes in AI implementation, managed cybersecurity, CMMC/HIPAA/SOC 2 compliance, and digital forensics for businesses across the United States.
