ProgramBench is a new LLM coding benchmark where agents rebuild full programs from a compiled binary and documentation. See scores, how it differs from SWE-bench, and why all public models are 0% resolved.
ProgramBench is a hard new LLM coding benchmark for a capability most leaderboards barely touch: can a model rebuild an entire program from observed behavior, without seeing the source?
The benchmark gives an agent a compiled executable and its usage documentation. The agent has to probe the program, infer behavior, choose an implementation approach, write source code, create a build script, and produce a candidate program. The submitted program is then compared against the original with hidden behavioral tests.
That is very different from completing a function, solving a contest problem, or patching a known repository.
ProgramBench matters most if you are evaluating coding agents rather than chat models.
If your workflow is "ask the model to write a small function," ProgramBench is probably too hard and too indirect. HumanEval, LiveCodeBench, and language-specific coding evals will tell you more.
If your workflow is "ask the model to patch a known repository," SWE-bench Pro, SWE-Rebench, and SWE-bench Verified are still more directly relevant.
ProgramBench becomes interesting when the job is less structured: rebuilding legacy behavior without source code, working from incomplete specs, or letting an autonomous agent decide what to inspect before it implements.
For model buyers, ProgramBench is not yet a "pick the top model and ship it" leaderboard. It is a warning that current coding agents still struggle when the source code, issue description, and scaffolding disappear. For agent builders, it is a diagnostic tool: it tells you whether the agent can investigate before it implements.
ProgramBench is built around cleanroom reconstruction. Each task starts with a compiled executable and its usage documentation; the original source code and implementation details are withheld.
The model has to decide what questions to ask the executable. It can run the program with inputs and observe outputs, but it cannot inspect the underlying implementation.
That makes ProgramBench a benchmark for architecture and specification discovery, not just code writing.
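To make the probing step concrete, here is a minimal sketch (not the official baseline agent) of how an agent-side loop could interrogate a task binary, assuming the executable sits in the working directory as `./target`:

```python
import subprocess

def probe(argv, stdin_text=""):
    """Run the black-box target once and capture everything the agent is allowed to see."""
    result = subprocess.run(
        ["./target", *argv],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=10,
    )
    return {
        "argv": argv,
        "stdin": stdin_text,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "exit_code": result.returncode,
    }

# A first pass an agent might try: documented help output, an invalid flag,
# and empty stdin, to learn usage text, error formatting, and exit codes.
for obs in (probe(["--help"]), probe(["--no-such-flag"]), probe([])):
    print(obs["argv"], obs["exit_code"], repr(obs["stderr"][:60]))
```

Everything downstream (architecture, implementation, self-tests) has to be justified by observations like these, because nothing else about the program is visible.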
Most coding benchmarks still give models a lot of structure.
HumanEval gives a function signature and docstring. SWE-bench Verified gives an existing repository and an issue. LiveCodeBench gives competitive-programming-style problem statements.
ProgramBench removes most of that scaffolding. It asks whether the agent can reconstruct the target behavior from interaction alone.
That matters because real software engineering often starts from partial specifications. Developers reverse-engineer workflows, probe APIs, explore old systems, and infer edge cases from observed behavior. ProgramBench compresses that kind of work into an evaluation setting.
ProgramBench is not a replacement for the major coding benchmarks. It measures a different failure mode.
| Benchmark | What the model gets | What it measures best | Main limitation |
|---|---|---|---|
| ProgramBench | Compiled binary plus documentation | Cleanroom architecture, probing, and full-program reconstruction | Current scores are near zero, so it is not yet useful for fine ranking |
| SWE-bench Pro | Existing repository plus issue | Real software issue resolution | Still starts from known source code and repo structure |
| LiveCodeBench | Fresh programming problems | Contamination-resistant code generation and reasoning | More contest-like than production engineering |
| Terminal-Bench 2.0 | Terminal environment and task goal | Agentic execution, debugging, and tool use | Does not isolate cleanroom reconstruction |
| HumanEval | Function signature and docstring | Simple function synthesis baseline | Saturated and high contamination risk |
The closest comparison is SWE-bench, but the task shape is almost inverted. SWE-bench asks whether the agent can modify a real codebase correctly. ProgramBench asks whether the agent can discover what a program does and build a new codebase that behaves the same way.
The best way to understand ProgramBench is to look at what it removes.
Most coding benchmarks provide at least one of these anchors: source files to read, an issue or bug report that points at a location, or a function signature that defines the interface.
Those anchors are useful. They make evaluation controlled and repeatable. But they also hide a major part of real engineering: figuring out what the software is supposed to do before writing it.
ProgramBench removes almost all of those anchors. The agent does not get source files to inspect. It does not get a bug report with a target location. It does not get a type signature that defines the interface. It has to create its own understanding by interacting with the executable.
That creates a different capability profile:
Specification discovery. The agent needs to ask good questions of the binary. What inputs are valid? What errors are produced? How does output formatting work? Which edge cases matter?
Search discipline. A weak agent may run a few obvious commands and then start coding. A stronger agent should systematically explore behavior before committing to an implementation.
Architecture choice. With no skeleton, the agent has to decide file layout, language, modules, parsing strategy, build script, and test strategy.
Behavioral precision. The replacement does not need to match the original source. It needs to match observable behavior. That includes boring details like exit codes, error messages, whitespace, file output, and weird flags.
Stopping judgment. The agent needs to know when it has enough evidence. Stopping too early produces the exact failure mode the current leaderboard suggests: plausible but incomplete submissions.
This is why ProgramBench has value even while the scores are near zero. A benchmark can be useful before it ranks models finely if it exposes a real capability gap.
The initial public leaderboard is deliberately sobering:
| Rank | Model | Fully resolved | Almost resolved |
|---|---|---|---|
| 1 | Claude Opus 4.7 | 0.0% | 3.0% |
| 2 | Claude Opus 4.6 | 0.0% | 2.5% |
| 3 | Claude Sonnet 4.6 | 0.0% | 1.0% |
| 4 | GPT 5.4 | 0.0% | 0.0% |
| 5 | Gemini 3.1 Pro | 0.0% | 0.0% |
The primary metric is fully resolved tasks. Every public model is currently at 0%.
ProgramBench also reports "almost resolved" tasks, meaning runs that pass at least 95% of behavioral tests. BenchLM uses that auxiliary metric on the display page because it is the only visible separator between models right now.
See the mirrored page: ProgramBench leaderboard
The ProgramBench leaderboard needs a more careful reading than a normal coding benchmark.
On most coding leaderboards, a higher score directly means more solved tasks. On ProgramBench today, the official primary metric is flat: every evaluated public model is at 0% fully resolved. The only separation is the auxiliary "almost resolved" rate.
That means three things.
First, do not over-rank the current top three. Claude Opus 4.7 leading at 3.0% almost resolved is meaningful as a signal that it got closer more often, but it is not a production-level success rate. It does not mean it can reliably rebuild programs from binaries. It means it occasionally gets within striking distance under this benchmark's tests.
Second, do not treat 0.0% almost resolved as identical model quality. A model can fail ProgramBench in many ways: it can misunderstand the interface, submit something that does not build, implement only a trivial subset, or get close but miss the 95% threshold. The published public view does not expose enough granularity to separate all those cases.
Third, expect harness effects to matter. ProgramBench is an agent benchmark, not a pure model benchmark. The model, tool loop, prompting, retry policy, exploration budget, and stopping criteria can all affect results. The Reddit launch discussion explicitly notes that custom harnesses should be feasible, which means future results may separate "better base model" from "better agent scaffold."
BenchLM therefore keeps ProgramBench display-only. It is valuable evidence, but it is not mature enough to carry weighted ranking influence.
The most important distinction in ProgramBench is between fully resolved and almost resolved.
A fully resolved task means the submitted replacement program passes the benchmark's behavioral test suite well enough to count as solved. That is the metric people should care about long term. If models start scoring 10%, 20%, or 40% fully resolved, ProgramBench will become a much stronger ranking signal.
Almost resolved is a softer metric. It captures submissions that pass at least 95% of behavioral tests. That is useful now because the fully resolved column is all zero. It tells us that some agents are close on a small fraction of tasks.
But almost resolved has limits: a submission can pass 95% of behavioral tests and still break behavior users depend on, and a near miss says nothing about how hard the remaining gap is to close.
For that reason, almost resolved is best read as "near-miss rate," not "success rate." The right question is not "which model wins ProgramBench?" The right question is "which model is beginning to show signs of the capability ProgramBench is trying to measure?"
The paper describes a pipeline that starts from open-source repositories that build executables. The benchmark authors compile the original program, strip away source code and implementation details, and keep only the executable plus documentation for the agent-facing task.
Evaluation uses behavioral tests generated through agent-driven fuzzing. These tests compare observable behavior: stdout, stderr, exit codes, file outputs, and similar effects. The published benchmark covers 200 tasks and more than 248,000 behavioral tests.
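The published test format is not reproduced here, but the comparison idea can be sketched simply: run the original and a candidate on the same invocation and compare what is observable from the outside. The binary paths below are placeholders, and real tests also cover file outputs that this sketch omits:

```python
import subprocess

def behavior(binary, argv, stdin_text=""):
    """Observable behavior only: streams and exit code, never internals."""
    r = subprocess.run([binary, *argv], input=stdin_text,
                       capture_output=True, text=True, timeout=10)
    return (r.stdout, r.stderr, r.returncode)

def passes(test_case):
    argv, stdin_text = test_case
    return behavior("./original", argv, stdin_text) == behavior("./candidate", argv, stdin_text)

# Illustrative cases only; the real suites are generated by agent-driven fuzzing.
tests = [(["--version"], ""), (["--no-such-flag"], ""), ([], "hello\n")]
print(sum(passes(t) for t in tests), "of", len(tests), "behavioral tests passed")
```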
The task set ranges from small command-line utilities to much larger projects such as FFmpeg, SQLite, and PHP.
ProgramBench evaluates behavior, not source similarity. That is the right choice for cleanroom reconstruction.
If evaluation rewarded source similarity, the benchmark would measure the wrong thing. A correct cleanroom implementation may use a different language, different architecture, or different internal algorithms while matching the public behavior. Source similarity would punish valid replacements and reward questionable imitation.
Behavioral tests focus on what matters: does the replacement program act like the original from the outside?
The hidden-test design also raises the bar. If the tests were public, agents could overfit to known cases. Hidden tests force the agent to infer general behavior rather than memorize visible examples. The paper's use of agent-driven fuzzing is meant to broaden that test coverage beyond hand-written examples.
This makes ProgramBench closer to how real software replacement works. When you replace a legacy tool, users do not care whether your source looks like the old source. They care whether the new tool handles their workflows, edge cases, errors, and outputs.
ProgramBench is not a reverse-engineering benchmark in the usual sense.
The benchmark uses compiled executables, so it naturally sits near reverse-engineering language. But the goal is not to recover the original source code. The goal is to reproduce external behavior from black-box interaction.
That distinction matters. If decompilation, internet search, or direct source lookup were allowed, the benchmark would drift toward measuring binary analysis and retrieval. ProgramBench instead tries to measure whether a coding agent can probe a black-box program, infer its specification, choose an architecture, and implement a behaviorally faithful replacement.
For buyers of coding agents, that is the valuable signal. It tests whether the model can work when the spec is incomplete and the structure is missing.
ProgramBench has a different contamination profile from classic coding benchmarks.
Older static benchmarks such as HumanEval are vulnerable because the tasks and solutions have been public for years. A model may have seen those examples directly or indirectly during training. That does not make HumanEval useless, but it makes it weak for frontier model comparison.
ProgramBench tries to reduce shortcut paths in several ways: the behavioral tests are hidden, runs are offline with no internet access, decompilation-style shortcuts are forbidden, and evaluation rewards matching behavior rather than reproducing any public source.
That said, no public benchmark is magically immune forever. If ProgramBench tasks, harness details, or reconstructed solutions become widely shared, contamination risk will rise. The strongest long-term version of ProgramBench would need active submission governance, careful leakage controls, and possibly refreshed task sets.
For now, ProgramBench is a strong contamination-resistant coding signal in spirit, but BenchLM still treats it cautiously because public model coverage is small and the primary metric is not yet moving.
Low ProgramBench scores should not be read as "models cannot code." They mean the benchmark is testing a harder slice of coding than most public leaderboards: specification discovery, whole-program architecture, and behavior-exact implementation without source code or a skeleton.
This is closer to a stress test than a product-readiness score. A model that scores well on SWE-bench can still fail ProgramBench because it has never had to infer and recreate a whole executable's behavior without the original source.
The current scores suggest that ProgramBench exposes several failure modes at once.
1. Shallow probing. A weak agent may run the executable on obvious inputs, observe a few outputs, and infer too much from too little. That creates programs that handle the demo path but miss edge cases.
2. Premature implementation. Many coding agents are optimized to start editing quickly. That helps on tasks with clear specs, but it hurts when the main challenge is discovering the spec. ProgramBench rewards investigation before implementation.
3. Missing negative cases. Rebuilding a command-line tool is not just about valid inputs. It also means matching invalid flags, missing arguments, malformed files, exit codes, and error formatting.
4. Weak architecture. Without a skeleton, the agent may choose a brittle implementation that works for a handful of cases but cannot scale to the full behavior.
5. Poor self-testing. A strong agent should generate its own tests while probing the target binary; a sketch of that idea appears after this list. If it does not, it has no way to know whether the replacement is close.
6. Overconfidence. The LocalLLaMA thread notes that many runs did not simply hit the step limit. Agents often declared completion and submitted incomplete work. That is a serious practical problem: the agent does not know what it does not know.
7. Tool-loop fragility. ProgramBench is sensitive to agent execution quality. If the harness fails to preserve observations, manage files, run builds, or iterate cleanly, the base model may never get a fair chance.
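To illustrate the self-testing point in item 5, here is a minimal sketch (assumptions: probes are persisted as JSON, the original target is `./target`, and the agent's build lands at `./candidate`) of an agent turning its probe log into a regression suite for its own replacement:

```python
import json
import subprocess

def observe(binary, argv, stdin_text=""):
    """Capture the observable result of one invocation."""
    r = subprocess.run([binary, *argv], input=stdin_text,
                       capture_output=True, text=True, timeout=10)
    return {"stdout": r.stdout, "stderr": r.stderr, "exit_code": r.returncode}

# During exploration: record every probe of the original as a golden case.
golden = [{"argv": ["--help"], "stdin": "", "expected": observe("./target", ["--help"])}]
with open("golden.json", "w") as f:
    json.dump(golden, f)

# Before submitting: replay the whole log against the candidate build.
with open("golden.json") as f:
    failures = [case for case in json.load(f)
                if observe("./candidate", case["argv"], case["stdin"]) != case["expected"]]
print(f"{len(failures)} recorded behaviors still differ; submit only when this reaches zero")
```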
These failure modes are exactly why ProgramBench is useful. They are also why a future public leaderboard should report more than one number. For serious analysis, we need to know whether failures came from bad probing, bad code, failed builds, time limits, missing edge cases, or bad stopping decisions.
The LocalLLaMA launch discussion adds useful implementation context from Kilian Lieret, one of the ProgramBench authors.
First, the benchmark is not limited to the default baseline agent. The authors say custom harnesses should be straightforward because the inference containers are published, as long as submissions respect constraints such as no internet access and no cheating.
Second, public submissions are planned but the rules are still being worked out. That matters for BenchLM because ProgramBench is currently closed-model-heavy; open-weight results are expected later, but the authors note that open-source models have been harder to run reliably on these tasks so far.
Third, the low scores are not mostly a timeout artifact. In the thread, the authors say most agents were not killed by the step limit. They tended to declare the task finished and submit incomplete executables. That reinforces the main interpretation: today's agents often stop with plausible but behaviorally shallow reconstructions.
ProgramBench is not just a static paper benchmark. The team has published the GitHub repository, Hugging Face datasets, and Docker/inference materials.
The public launch notes describe the basic local workflow as installing the package and evaluating a submission:
```bash
pip install programbench
programbench eval <your submission>
```
The important constraint is that the run has to preserve the benchmark setup. Agents should not use internet access, direct source lookup, or decompilation-style shortcuts. The authors said public submissions are planned, but the rules still need to guard against irrelevant or cheating submissions.
For teams building coding agents, the practical value is less about chasing the current leaderboard and more about running controlled internal experiments: holding the base model fixed while varying the tool loop, exploration budget, retry policy, and stopping criteria, then seeing which changes move the near-miss rate.
If you build coding agents, treat ProgramBench as a system benchmark, not just a model leaderboard. The most useful questions are about the whole loop: exploration, implementation, self-testing, and stopping judgment.
A practical ProgramBench harness should include a systematic probing plan, a self-generated behavioral test suite recorded during exploration, build verification before submission, and an explicit stopping criterion; a minimal sketch follows below.
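As a sketch of how those ingredients could be made explicit in a harness, here is a hypothetical policy object; all names and numbers are illustrative assumptions, not the benchmark's defaults:

```python
from dataclasses import dataclass

@dataclass
class HarnessPolicy:
    """Hypothetical knobs for a ProgramBench-style agent harness (illustrative only)."""
    probe_budget: int = 200          # max black-box invocations before committing to a design
    min_golden_tests: int = 50       # self-recorded probe cases required before coding starts
    build_retries: int = 3           # bounded repair attempts when the build script fails
    max_steps: int = 500             # overall step limit for the agent loop
    require_clean_self_tests: bool = True  # refuse to submit while recorded probes still fail

    def may_submit(self, golden_failures: int, built_ok: bool) -> bool:
        """Stopping judgment: only submit when the evidence says the candidate is close."""
        if not built_ok:
            return False
        return golden_failures == 0 or not self.require_clean_self_tests

policy = HarnessPolicy()
print(policy.may_submit(golden_failures=0, built_ok=True))   # True: built and self-tests pass
print(policy.may_submit(golden_failures=4, built_ok=True))   # False: known behavioral gaps remain
```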
Those are not just benchmark tricks. They are the same ingredients needed for reliable production coding agents.
ProgramBench should sit next to, not replace, other coding benchmarks: SWE-bench Verified and SWE-bench Pro for repository patching, LiveCodeBench for contamination-resistant problem solving, and Terminal-Bench 2.0 for agentic terminal work.
For model selection today, ProgramBench is best treated as a frontier warning light. It shows how far current coding agents are from robust cleanroom software reconstruction, even when they look strong on more structured tasks.
ProgramBench is valuable, but it should not become a universal coding metric. If your product mainly needs autocomplete, short code generation, SQL, unit tests, or explanation of known code, this benchmark is probably too indirect.
It becomes relevant when your risk is under-specified engineering: unknown legacy behavior, missing source code, incomplete specs, high edge-case sensitivity, or autonomous agents that must decide what to inspect. In those settings, ProgramBench can reveal problems that SWE-bench-style scores may hide.
ProgramBench will become more useful if five things happen.
First, public submissions need to separate model quality from harness quality. A custom agent scaffold could matter as much as the base model, so future leaderboard rows should make the harness, tool policy, and retry strategy visible.
Second, open-weight results need broader coverage. The initial public leaderboard is mostly closed-model evidence. If open-weight coding models underperform because they are overfit to SWE-bench-style tasks, ProgramBench could become a useful out-of-distribution test for local coding agents.
Third, the primary resolved metric needs movement. Almost-resolved rates are useful right now because everything else is tied at zero, but the benchmark becomes much more actionable once models begin fully resolving tasks.
Fourth, result reports should expose failure categories. A single resolved percentage is clean, but it does not tell builders what to fix. The most useful future ProgramBench leaderboard would show build failures, near misses, early stopping, timeout rate, average cost, call count, and maybe task-family breakdowns.
Fifth, task refresh matters. If ProgramBench becomes popular, contamination pressure will rise. The benchmark will be strongest if it can add new tasks, rotate held-out tests, or separate public development tasks from private evaluation tasks.
BenchLM tracks ProgramBench in three places.
First, it has a dedicated benchmark page: ProgramBench scores. That page mirrors the public ProgramBench leaderboard and uses almost-resolved rate as the visible score because the official fully resolved metric is currently tied at zero.
Second, it is listed on the coding leaderboard as a display-only benchmark. That keeps it visible to people researching coding models while preventing a very new, near-zero benchmark from distorting weighted rankings.
Third, it is linked from this explainer so readers can move from interpretation to live data.
BenchLM will not weight ProgramBench until the signal becomes more stable. The benchmark needs broader model coverage, more public submissions, clearer harness metadata, and meaningful movement in fully resolved scores. Until then, it is better treated as qualitative evidence about frontier coding-agent limits.
ProgramBench is one of the sharpest public tests of whether coding agents can architect from scratch under uncertainty. It is too new and too low-scoring to use as a weighted production ranking signal today, but it is exactly the kind of benchmark to watch as coding agents move from patching repositories toward building and rebuilding complete systems.
See the benchmark page: ProgramBench scores
What is ProgramBench? ProgramBench is a coding-agent benchmark where the model receives a compiled executable and documentation, then must rebuild the program's source repository and build script without seeing the original source.
Why are ProgramBench scores so low? The benchmark requires specification discovery, architecture, and full-program implementation with no source code or skeleton. The initial public leaderboard reports 0% fully resolved tasks for all evaluated models.
How is ProgramBench different from SWE-bench? SWE-bench asks models to patch existing repositories from issue descriptions. ProgramBench asks models to rebuild an entire program from behavior. It tests a different and harder cleanroom reconstruction skill.
Is ProgramBench a reverse-engineering benchmark? No. ProgramBench uses binaries, but it forbids decompilation-style shortcuts because it wants to measure behavioral reconstruction and architecture from black-box interaction, not source recovery.
How can I run ProgramBench? The team has published the GitHub repo, Docker/inference materials, and Hugging Face datasets. The basic flow is to install the package and evaluate a submission with programbench eval, while preserving the no-internet and anti-cheating constraints.
How does BenchLM score ProgramBench? BenchLM keeps ProgramBench display-only. Since all models are tied at 0% fully resolved, the benchmark page displays the published almost-resolved rate and excludes ProgramBench from weighted coding and overall rankings.
Which models lead ProgramBench? Claude Opus 4.7 leads the initial public snapshot by almost-resolved rate at 3.0%, followed by Claude Opus 4.6 at 2.5% and Claude Sonnet 4.6 at 1.0%.
Data sourced from ProgramBench, the extended results, the ProgramBench paper, and the LocalLLaMA launch discussion. Last updated May 5, 2026.