Build Your Own autoresearch — Applying Autonomous Experimentation to Any Domain

Karpathy's autoresearch is an autonomous experimentation system built for LLM pretraining. In Part 1 we covered the overall architecture, and in Part 2 we dug into the agent's experimentation strategy and result analysis. If you've read this far, one question is probably on your mind:

"Can I use this for my own problem?"

In this post, we extract the core patterns from autoresearch and apply them to three domains: text classification, image classification, and RAG pipelines. At the end, we provide a general-purpose experiment runner and a program.md template you can adapt immediately.

Extracting the Core Pattern from autoresearch

The structure that runs through autoresearch is surprisingly simple: three files, a five-step loop, and a handful of design principles. Extract these, and you can apply the pattern to any ML task.

The 3-File Architecture

Here's autoresearch's file structure broken down by role:

File	Role	Modified by
`prepare.py`	Fixed infrastructure (data, evaluation, utilities)	Human (once)
`train.py`	Experimentation target (model, hyperparameters, training loop)	Agent (every experiment)
`program.md`	Agent protocol (experiment rules, evaluation criteria)	Human (meta-optimization)
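
The "Modified by" column above can also be checked mechanically. The following is a minimal sketch, not part of autoresearch itself: it assumes a hypothetical guard that snapshots SHA-256 digests of the protected files before an experiment and reports any that changed afterwards. The names `PROTECTED`, `AGENT_EDITABLE`, `snapshot`, `changed_protected_files`, and `demo` are illustrative inventions.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical policy mirroring the table: only train.py is agent-editable.
AGENT_EDITABLE = {"train.py"}
PROTECTED = {"prepare.py", "program.md"}


def snapshot(repo: Path) -> dict[str, str]:
    """Return a SHA-256 digest for each protected file in the repo directory."""
    return {
        name: hashlib.sha256((repo / name).read_bytes()).hexdigest()
        for name in sorted(PROTECTED)
    }


def changed_protected_files(repo: Path, baseline: dict[str, str]) -> list[str]:
    """List protected files whose contents differ from the baseline snapshot."""
    current = snapshot(repo)
    return [name for name in sorted(PROTECTED) if current[name] != baseline[name]]


def demo() -> tuple[list[str], list[str]]:
    """Simulate one allowed edit and one forbidden edit in a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        for name in PROTECTED | AGENT_EDITABLE:
            (repo / name).write_text(f"# placeholder for {name}\n")
        baseline = snapshot(repo)

        # Allowed: the agent edits its search space.
        (repo / "train.py").write_text("DEPTH = 12\n")
        after_allowed = changed_protected_files(repo, baseline)

        # Forbidden: the agent tampers with the fixed infrastructure.
        (repo / "prepare.py").write_text("TIME_BUDGET = 9999\n")
        after_forbidden = changed_protected_files(repo, baseline)
    return after_allowed, after_forbidden


if __name__ == "__main__":
    print(demo())
```

A runner built along these lines could discard any experiment for which `changed_protected_files` returns a non-empty list, so that a result is only ever compared against others produced under the same evaluation code.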

`prepare.py` is the stable foundation. It contains data loading, preprocessing, and evaluation functions, and the agent never touches it. In autoresearch's actual `prepare.py`, you'll find constants such as `MAX_SEQ_LEN`, `TIME_BUDGET`, and `EVAL_TOKENS`, along with the `evaluate_bpb()` function, all locked down.

```python
# From prepare.py -- fixed constants the agent cannot modify
MAX_SEQ_LEN = 2048       # context length
TIME_BUDGET = 300        # training time budget in seconds (5 minutes)
EVAL_TOKENS = 40 * 524288  # number of tokens for val eval
```
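
To get a feel for what these fixed constants imply, the arithmetic can be spelled out. This is a small sketch using only the three values quoted above; the helper `eval_summary` is hypothetical, and the "full-length sequences" figure assumes evaluation consumes sequences of exactly `MAX_SEQ_LEN` tokens, which the excerpt does not confirm.

```python
# Values copied from the prepare.py excerpt above.
MAX_SEQ_LEN = 2048
TIME_BUDGET = 300
EVAL_TOKENS = 40 * 524288


def eval_summary() -> dict[str, float]:
    """Derive a few human-readable quantities from the fixed constants."""
    return {
        # Total validation tokens scored per experiment.
        "eval_tokens": EVAL_TOKENS,
        # Assumption: eval data is chunked into sequences of MAX_SEQ_LEN tokens.
        "full_length_sequences": EVAL_TOKENS // MAX_SEQ_LEN,
        # Training wall-clock budget expressed in minutes.
        "budget_minutes": TIME_BUDGET / 60,
    }


if __name__ == "__main__":
    for key, value in eval_summary().items():
        print(f"{key}: {value}")
```

Because every experiment shares the same time budget and the same evaluation token count, two runs can be compared on the metric alone without normalizing for compute.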

`train.py` is the search space. It's the only file the agent modifies, and everything lives here: model architecture, optimizer, hyperparameters, batch size, and so on. In autoresearch, the entire GPT model, the MuonAdamW optimizer, and the training loop are all packed into this single file.

```python
# From train.py -- hyperparameters the agent freely modifies
DEPTH = 8               # number of transformer layers
ASPECT_RATIO = 64       # model_dim = depth * ASPECT_RATIO
TOTAL_BATCH_SIZE = 2**19 # ~524K tokens per optimizer step
MATRIX_LR = 0.04        # learning rate for matrix parameters (Muon)
WEIGHT_DECAY = 0.2      # cautious weight decay for Muon
```
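
The comments in this excerpt already define a couple of derived quantities, and working them out shows how a single edit ripples through the model. The sketch below follows those comments literally; the function `derived_quantities` is hypothetical, and the real `train.py` may round the model dimension or split a step into smaller micro-batches, neither of which is modeled here.

```python
def derived_quantities(
    depth: int, aspect_ratio: int, total_batch_size: int, max_seq_len: int
) -> dict[str, int]:
    """Compute quantities implied by the hyperparameter comments."""
    return {
        # Per the ASPECT_RATIO comment: model_dim = depth * ASPECT_RATIO.
        "model_dim": depth * aspect_ratio,
        # Tokens consumed by one optimizer step.
        "tokens_per_step": total_batch_size,
        # Assumption: each step is made of full-length sequences.
        "sequences_per_step": total_batch_size // max_seq_len,
    }


if __name__ == "__main__":
    # Defaults quoted in the excerpts: DEPTH, ASPECT_RATIO, TOTAL_BATCH_SIZE,
    # and MAX_SEQ_LEN from prepare.py.
    print(derived_quantities(8, 64, 2**19, 2048))
    # An agent raising DEPTH to 12 also widens the model under this rule.
    print(derived_quantities(12, 64, 2**19, 2048))
```

Under these assumptions the default configuration has a model dimension of 512 and 256 full-length sequences per step, and changing `DEPTH` alone changes both depth and width, which is why a one-line edit can be a meaningful experiment.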

`program.md` is the agent's code of conduct. It spells out what the agent can modify, which metric to optimize, and how to log experiment results. Here's a summary of the key rules from program.md:
