ProgramBench is a new LLM coding benchmark where agents rebuild full programs from a compiled binary and documentation. See scores, how it differs from SWE-bench, and why all public models are 0% resolved.
ProgramBench is a hard new LLM coding benchmark for a capability most leaderboards barely touch: can a model rebuild an entire program from observed behavior, without seeing the source?
The benchmark gives an agent a compiled executable and its usage documentation. The agent has to probe the program, infer behavior, choose an implementation approach, write source code, create a build script, and produce a candidate program. The submitted program is then compared against the original with hidden behavioral tests.
That is very different from completing a function, solving a contest problem, or patching a known repository.
ProgramBench matters most if you are evaluating coding agents rather than chat models.
If your workflow is "ask the model to write a small function," ProgramBench is probably too hard and too indirect. HumanEval, LiveCodeBench, and language-specific coding evals will tell you more.
If your workflow is "ask the model to patch a known repository," SWE-bench Pro, SWE-Rebench, and SWE-bench Verified are still more directly relevant.
ProgramBench becomes interesting when the job is less structured: rebuilding legacy behavior without source code, working from incomplete specs, or letting an autonomous agent decide what to inspect before it implements.
For model buyers, ProgramBench is not yet a "pick the top model and ship it" leaderboard. It is a warning that current coding agents still struggle when the source code, issue description, and scaffolding disappear. For agent builders, it is a diagnostic tool: it tells you whether the agent can investigate before it implements.
ProgramBench is built around cleanroom reconstruction. Each task starts with a compiled executable and its usage documentation; the original source code and implementation details are withheld.
The model has to decide what questions to ask the executable. It can run the program with inputs and observe outputs, but it cannot inspect the underlying implementation.
That makes ProgramBench a benchmark for architecture and specification discovery, not just code writing.
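To make the probing step concrete, here is a minimal sketch (not the official baseline agent) of how an agent-side loop could interrogate a task binary, assuming the executable sits in the working directory as `./target`:

```python
import subprocess

def probe(argv, stdin_text=""):
    """Run the black-box target once and capture everything the agent is allowed to see."""
    result = subprocess.run(
        ["./target", *argv],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=10,
    )
    return {
        "argv": argv,
        "stdin": stdin_text,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "exit_code": result.returncode,
    }

# A first pass an agent might try: documented help output, an invalid flag,
# and empty stdin, to learn usage text, error formatting, and exit codes.
for obs in (probe(["--help"]), probe(["--no-such-flag"]), probe([])):
    print(obs["argv"], obs["exit_code"], repr(obs["stderr"][:60]))
```

Everything downstream (architecture, implementation, self-tests) has to be justified by observations like these, because nothing else about the program is visible.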
Most coding benchmarks still give models a lot of structure.
HumanEval gives a function signature and docstring. SWE-bench Verified gives an existing repository and an issue. LiveCodeBench gives competitive-programming-style problem statements.
ProgramBench removes most of that scaffolding. It asks whether the agent can reconstruct the target behavior from interaction alone.
That matters because real software engineering often starts from partial specifications. Developers reverse-engineer workflows, probe APIs, explore old systems, and infer edge cases from observed behavior. ProgramBench compresses that kind of work into an evaluation setting.
ProgramBench is not a replacement for the major coding benchmarks. It measures a different failure mode.
| Benchmark | What the model gets | What it measures best | Main limitation |
|---|---|---|---|
| ProgramBench | Compiled binary plus documentation | Cleanroom architecture, probing, and full-program reconstruction | Current scores are near zero, so it is not yet useful for fine ranking |
| SWE-bench Pro | Existing repository plus issue | Real software issue resolution | Still starts from known source code and repo structure |
| LiveCodeBench | Fresh programming problems | Contamination-resistant code generation and reasoning | More contest-like than production engineering |
| Terminal-Bench 2.0 | Terminal environment and task goal | Agentic execution, debugging, and tool use | Does not isolate cleanroom reconstruction |
| HumanEval | Function signature and docstring | Simple function synthesis baseline | Saturated and high contamination risk |
The closest comparison is SWE-bench, but the task shape is almost inverted. SWE-bench asks whether the agent can modify a real codebase correctly. ProgramBench asks whether the agent can discover what a program does and build a new codebase that behaves the same way.
The best way to understand ProgramBench is to look at what it removes.
Most coding benchmarks provide at least one of these anchors: source files to read, an issue or bug report that points at a location, or a function signature that defines the interface.
Those anchors are useful. They make evaluation controlled and repeatable. But they also hide a major part of real engineering: figuring out what the software is supposed to do before writing it.
ProgramBench removes almost all of those anchors. The agent does not get source files to inspect. It does not get a bug report with a target location. It does not get a type signature that defines the interface. It has to create its own understanding by interacting with the executable.
That creates a different capability profile:
Specification discovery. The agent needs to ask good questions of the binary. What inputs are valid? What errors are produced? How does output formatting work? Which edge cases matter?
Search discipline. A weak agent may run a few obvious commands and then start coding. A stronger agent should systematically explore behavior before committing to an implementation.
Architecture choice. With no skeleton, the agent has to decide file layout, language, modules, parsing strategy, build script, and test strategy.
Behavioral precision. The replacement does not need to match the original source. It needs to match observable behavior. That includes boring details like exit codes, error messages, whitespace, file output, and weird flags.
Stopping judgment. The agent needs to know when it has enough evidence. Stopping too early produces the exact failure mode the current leaderboard suggests: plausible but incomplete submissions.
This is why ProgramBench has value even while the scores are near zero. A benchmark can be useful before it ranks models finely if it exposes a real capability gap.
The initial public leaderboard is deliberately sobering:
| Rank | Model | Fully resolved | Almost resolved |
|---|---|---|---|
| 1 | Claude Opus 4.7 | 0.0% | 3.0% |
| 2 | Claude Opus 4.6 | 0.0% | 2.5% |
| 3 | Claude Sonnet 4.6 | 0.0% | 1.0% |
| 4 | GPT 5.4 | 0.0% | 0.0% |
| 5 | Gemini 3.1 Pro | 0.0% | 0.0% |
The primary metric is fully resolved tasks. Every public model is currently at 0%.
ProgramBench also reports "almost resolved" tasks, meaning runs that pass at least 95% of behavioral tests. BenchLM uses that auxiliary metric on the display page because it is the only visible separator between models right now.
See the mirrored page: ProgramBench leaderboard
The ProgramBench leaderboard needs a more careful reading than a normal coding benchmark.
On most coding leaderboards, a higher score directly means more solved tasks. On ProgramBench today, the official primary metric is flat: every evaluated public model is at 0% fully resolved. The only separation is the auxiliary "almost resolved" rate.
That means three things.
First, do not over-rank the current top three. Claude Opus 4.7 leading at 3.0% almost resolved is meaningful as a signal that it got closer more often, but it is not a production-level success rate. It does not mean it can reliably rebuild programs from binaries. It means it occasionally gets within striking distance under this benchmark's tests.
Second, do not treat 0.0% almost resolved as identical model quality. A model can fail ProgramBench in many ways: it can misunderstand the interface, submit something that does not build, implement only a trivial subset, or get close but miss the 95% threshold. The published public view does not expose enough granularity to separate all those cases.
Third, expect harness effects to matter. ProgramBench is an agent benchmark, not a pure model benchmark. The model, tool loop, prompting, retry policy, exploration budget, and stopping criteria can all affect results. The Reddit launch discussion explicitly notes that custom harnesses should be feasible, which means future results may separate "better base model" from "better agent scaffold."
BenchLM therefore keeps ProgramBench display-only. It is valuable evidence, but it is not mature enough to carry weighted ranking influence.
The most important distinction in ProgramBench is between fully resolved and almost resolved.
A fully resolved task means the submitted replacement program passes the benchmark's behavioral test suite well enough to count as solved. That is the metric people should care about long term. If models start scoring 10%, 20%, or 40% fully resolved, ProgramBench will become a much stronger ranking signal.
Almost resolved is a softer metric. It captures submissions that pass at least 95% of behavioral tests. That is useful now because the fully resolved column is all zero. It tells us that some agents are close on a small fraction of tasks.
But almost resolved has limits: a submission can pass 95% of behavioral tests and still break behavior users depend on, and a near miss says nothing about how hard the remaining gap is to close.
For that reason, almost resolved is best read as "near-miss rate," not "success rate." The right question is not "which model wins ProgramBench?" The right question is "which model is beginning to show signs of the capability ProgramBench is trying to measure?"
The paper describes a pipeline that starts from open-source repositories that build executables. The benchmark authors compile the original program, strip away source code and implementation details, and keep only the executable plus documentation for the agent-facing task.
Evaluation uses behavioral tests generated through agent-driven fuzzing. These tests compare observable behavior: stdout, stderr, exit codes, file outputs, and similar effects. The published benchmark covers 200 tasks and more than 248,000 behavioral tests.
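The published test format is not reproduced here, but the comparison idea can be sketched simply: run the original and a candidate on the same invocation and compare what is observable from the outside. The binary paths below are placeholders, and real tests also cover file outputs that this sketch omits:

```python
import subprocess

def behavior(binary, argv, stdin_text=""):
    """Observable behavior only: streams and exit code, never internals."""
    r = subprocess.run([binary, *argv], input=stdin_text,
                       capture_output=True, text=True, timeout=10)
    return (r.stdout, r.stderr, r.returncode)

def passes(test_case):
    argv, stdin_text = test_case
    return behavior("./original", argv, stdin_text) == behavior("./candidate", argv, stdin_text)

# Illustrative cases only; the real suites are generated by agent-driven fuzzing.
tests = [(["--version"], ""), (["--no-such-flag"], ""), ([], "hello\n")]
print(sum(passes(t) for t in tests), "of", len(tests), "behavioral tests passed")
```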
The task set ranges from small command-line utilities to much larger projects such as FFmpeg, SQLite, and PHP.
ProgramBench evaluates behavior, not source similarity. That is the right choice for cleanroom reconstruction.
If evaluation rewarded source similarity, the benchmark would measure the wrong thing. A correct cleanroom implementation may use a different language, different architecture, or different internal algorithms while matching the public behavior. Source similarity would punish valid replacements and reward questionable imitation.
Behavioral tests focus on what matters: does the replacement program act like the original from the outside?
The hidden-test design also raises the bar. If the tests were public, agents could overfit to known cases. Hidden tests force the agent to infer general behavior rather than memorize visible examples. The paper's use of agent-driven fuzzing is meant to broaden that test coverage beyond hand-written examples.
This makes ProgramBench closer to how real software replacement works. When you replace a legacy tool, users do not care whether your source looks like the old source. They care whether the new tool handles their workflows, edge cases, errors, and outputs.
ProgramBench is not a reverse-engineering benchmark in the usual sense.
The benchmark uses compiled executables, so it naturally sits near reverse-engineering language. But the goal is not to recover the original source code. The goal is to reproduce external behavior from black-box interaction.
That distinction matters. If decompilation, internet search, or direct source lookup were allowed, the benchmark would drift toward measuring binary analysis and retrieval. ProgramBench instead tries to measure whether a coding agent can probe a black-box program, infer its specification, choose an architecture, and implement a behaviorally faithful replacement.
For buyers of coding agents, that is the valuable signal. It tests whether the model can work when the spec is incomplete and the structure is missing.
ProgramBench has a different contamination profile from classic coding benchmarks.
Older static benchmarks such as HumanEval are vulnerable because the tasks and solutions have been public for years. A model may have seen those examples directly or indirectly during training. That does not make HumanEval useless, but it makes it weak for frontier model comparison.
ProgramBench tries to reduce shortcut paths in several ways: the behavioral tests are hidden, runs are offline with no internet access, decompilation-style shortcuts are forbidden, and evaluation rewards matching behavior rather than reproducing any public source.
That said, no public benchmark is magically immune forever. If ProgramBench tasks, harness details, or reconstructed solutions become widely shared, contamination risk will rise. The strongest long-term version of ProgramBench would need active submission governance, careful leakage controls, and possibly refreshed task sets.
For now, ProgramBench is a strong contamination-resistant coding signal in spirit, but BenchLM still treats it cautiously because public model coverage is small and the primary metric is not yet moving.
Low ProgramBench scores should not be read as "models cannot code." They mean the benchmark is testing a harder slice of coding than most public leaderboards: specification discovery, whole-program architecture, and behavior-exact implementation without source code or a skeleton.
This is closer to a stress test than a product-readiness score. A model that scores well on SWE-bench can still fail ProgramBench because it has never had to infer and recreate a whole executable's behavior without the original source.
The current scores suggest that ProgramBench exposes several failure modes at once.
1. Shallow probing. A weak agent may run the executable on obvious inputs, observe a few outputs, and infer too much from too little. That creates programs that handle the demo path but miss edge cases.
2. Premature implementation. Many coding agents are optimized to start editing quickly. That helps on tasks with clear specs, but it hurts when the main challenge is discovering the spec. ProgramBench rewards investigation before implementation.
3. Missing negative cases. Rebuilding a command-line tool is not just about valid inputs. It also means matching invalid flags, missing arguments, malformed files, exit codes, and error formatting.
4. Weak architecture. Without a skeleton, the agent may choose a brittle implementation that works for a handful of cases but cannot scale to the full behavior.
5. Poor self-testing. A strong agent should generate its own tests while probing the target binary; a sketch of that idea appears after this list. If it does not, it has no way to know whether the replacement is close.
6. Overconfidence. The LocalLLaMA thread notes that many runs did not simply hit the step limit. Agents often declared completion and submitted incomplete work. That is a serious practical problem: the agent does not know what it does not know.
7. Tool-loop fragility. ProgramBench is sensitive to agent execution quality. If the harness fails to preserve observations, manage files, run builds, or iterate cleanly, the base model may never get a fair chance.
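To illustrate the self-testing point in item 5, here is a minimal sketch (assumptions: probes are persisted as JSON, the original target is `./target`, and the agent's build lands at `./candidate`) of an agent turning its probe log into a regression suite for its own replacement:

```python
import json
import subprocess

def observe(binary, argv, stdin_text=""):
    """Capture the observable result of one invocation."""
    r = subprocess.run([binary, *argv], input=stdin_text,
                       capture_output=True, text=True, timeout=10)
    return {"stdout": r.stdout, "stderr": r.stderr, "exit_code": r.returncode}

# During exploration: record every probe of the original as a golden case.
golden = [{"argv": ["--help"], "stdin": "", "expected": observe("./target", ["--help"])}]
with open("golden.json", "w") as f:
    json.dump(golden, f)

# Before submitting: replay the whole log against the candidate build.
with open("golden.json") as f:
    failures = [case for case in json.load(f)
                if observe("./candidate", case["argv"], case["stdin"]) != case["expected"]]
print(f"{len(failures)} recorded behaviors still differ; submit only when this reaches zero")
```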
These failure modes are exactly why ProgramBench is useful. They are also why a future public leaderboard should report more than one number. For serious analysis, we need to know whether failures came from bad probing, bad code, failed builds, time limits, missing edge cases, or bad stopping decisions.
The LocalLLaMA launch discussion adds useful implementation context from Kilian Lieret, one of the ProgramBench authors.
First, the benchmark is not limited to the default baseline agent. The authors say custom harnesses should be straightforward because the inference containers are published, as long as submissions respect constraints such as no internet access and no cheating.
Second, public submissions are planned but the rules are still being worked out. That matters for BenchLM because ProgramBench is currently closed-model-heavy; open-weight results are expected later, but the authors note that open-source models have been harder to run reliably on these tasks so far.
Third, the low scores are not mostly a timeout artifact. In the thread, the authors say most agents were not killed by the step limit. They tended to declare the task finished and submit incomplete executables. That reinforces the main interpretation: today's agents often stop with plausible but behaviorally shallow reconstructions.
ProgramBench is not just a static paper benchmark. The team has published the GitHub repository, Hugging Face datasets, and Docker/inference materials.
The public launch notes describe the basic local workflow as installing the package and evaluating a submission:
```bash
pip install programbench
programbench eval <your submission>
```
The important constraint is that the run has to preserve the benchmark setup. Agents should not use internet access, direct source lookup, or decompilation-style shortcuts. The authors said public submissions are planned, but the rules still need to guard against irrelevant or cheating submissions.
For teams building coding agents, the practical value is less about chasing the current leaderboard and more about running controlled internal experiments: holding the base model fixed while varying the tool loop, exploration budget, retry policy, and stopping criteria, then seeing which changes move the near-miss rate.
If you build coding agents, treat ProgramBench as a system benchmark, not just a model leaderboard. The most useful questions are about the whole loop: exploration, implementation, self-testing, and stopping judgment.
A practical ProgramBench harness should include a systematic probing plan, a self-generated behavioral test suite recorded during exploration, build verification before submission, and an explicit stopping criterion; a minimal sketch follows below.
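As a sketch of how those ingredients could be made explicit in a harness, here is a hypothetical policy object; all names and numbers are illustrative assumptions, not the benchmark's defaults:

```python
from dataclasses import dataclass

@dataclass
class HarnessPolicy:
    """Hypothetical knobs for a ProgramBench-style agent harness (illustrative only)."""
    probe_budget: int = 200          # max black-box invocations before committing to a design
    min_golden_tests: int = 50       # self-recorded probe cases required before coding starts
    build_retries: int = 3           # bounded repair attempts when the build script fails
    max_steps: int = 500             # overall step limit for the agent loop
    require_clean_self_tests: bool = True  # refuse to submit while recorded probes still fail

    def may_submit(self, golden_failures: int, built_ok: bool) -> bool:
        """Stopping judgment: only submit when the evidence says the candidate is close."""
        if not built_ok:
            return False
        return golden_failures == 0 or not self.require_clean_self_tests

policy = HarnessPolicy()
print(policy.may_submit(golden_failures=0, built_ok=True))   # True: built and self-tests pass
print(policy.may_submit(golden_failures=4, built_ok=True))   # False: known behavioral gaps remain
```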
Those are not just benchmark tricks. They are the same ingredients needed for reliable production coding agents.
ProgramBench should sit next to, not replace, other coding benchmarks: SWE-bench Verified and SWE-bench Pro for repository patching, LiveCodeBench for contamination-resistant problem solving, and Terminal-Bench 2.0 for agentic terminal work.
For model selection today, ProgramBench is best treated as a frontier warning light. It shows how far current coding agents are from robust cleanroom software reconstruction, even when they look strong on more structured tasks.
ProgramBench is valuable, but it should not become a universal coding metric. If your product mainly needs autocomplete, short code generation, SQL, unit tests, or explanation of known code, this benchmark is probably too indirect.
It becomes relevant when your risk is under-specified engineering: unknown legacy behavior, missing source code, incomplete specs, high edge-case sensitivity, or autonomous agents that must decide what to inspect. In those settings, ProgramBench can reveal problems that SWE-bench-style scores may hide.
ProgramBench will become more useful if five things happen.
First, public submissions need to separate model quality from harness quality. A custom agent scaffold could matter as much as the base model, so future leaderboard rows should make the harness, tool policy, and retry strategy visible.
Second, open-weight results need broader coverage. The initial public leaderboard is mostly closed-model evidence. If open-weight coding models underperform because they are overfit to SWE-bench-style tasks, ProgramBench could become a useful out-of-distribution test for local coding agents.
Third, the primary resolved metric needs movement. Almost-resolved rates are useful right now because everything else is tied at zero, but the benchmark becomes much more actionable once models begin fully resolving tasks.
Fourth, result reports should expose failure categories. A single resolved percentage is clean, but it does not tell builders what to fix. The most useful future ProgramBench leaderboard would show build failures, near misses, early stopping, timeout rate, average cost, call count, and maybe task-family breakdowns.
Fifth, task refresh matters. If ProgramBench becomes popular, contamination pressure will rise. The benchmark will be strongest if it can add new tasks, rotate held-out tests, or separate public development tasks from private evaluation tasks.
BenchLM tracks ProgramBench in three places.
First, it has a dedicated benchmark page: ProgramBench scores. That page mirrors the public ProgramBench leaderboard and uses almost-resolved rate as the visible score because the official fully resolved metric is currently tied at zero.
Second, it is listed on the coding leaderboard as a display-only benchmark. That keeps it visible to people researching coding models while preventing a very new, near-zero benchmark from distorting weighted rankings.
Third, it is linked from this explainer so readers can move from interpretation to live data.
BenchLM will not weight ProgramBench until the signal becomes more stable. The benchmark needs broader model coverage, more public submissions, clearer harness metadata, and meaningful movement in fully resolved scores. Until then, it is better treated as qualitative evidence about frontier coding-agent limits.
ProgramBench is one of the sharpest public tests of whether coding agents can architect from scratch under uncertainty. It is too new and too low-scoring to use as a weighted production ranking signal today, but it is exactly the kind of benchmark to watch as coding agents move from patching repositories toward building and rebuilding complete systems.
See the benchmark page: ProgramBench scores
What is ProgramBench? ProgramBench is a coding-agent benchmark where the model receives a compiled executable and documentation, then must rebuild the program's source repository and build script without seeing the original source.
Why are ProgramBench scores so low? The benchmark requires specification discovery, architecture, and full-program implementation with no source code or skeleton. The initial public leaderboard reports 0% fully resolved tasks for all evaluated models.
How is ProgramBench different from SWE-bench? SWE-bench asks models to patch existing repositories from issue descriptions. ProgramBench asks models to rebuild an entire program from behavior. It tests a different and harder cleanroom reconstruction skill.
Is ProgramBench a reverse-engineering benchmark? No. ProgramBench uses binaries, but it forbids decompilation-style shortcuts because it wants to measure behavioral reconstruction and architecture from black-box interaction, not source recovery.
How can I run ProgramBench? The team has published the GitHub repo, Docker/inference materials, and Hugging Face datasets. The basic flow is to install the package and evaluate a submission with programbench eval, while preserving the no-internet and anti-cheating constraints.
How does BenchLM score ProgramBench? BenchLM keeps ProgramBench display-only. Since all models are tied at 0% fully resolved, the benchmark page displays the published almost-resolved rate and excludes ProgramBench from weighted coding and overall rankings.
Which models lead ProgramBench? Claude Opus 4.7 leads the initial public snapshot by almost-resolved rate at 3.0%, followed by Claude Opus 4.6 at 2.5% and Claude Sonnet 4.6 at 1.0%.
Data sourced from ProgramBench, the extended results, the ProgramBench paper, and the LocalLLaMA launch discussion. Last updated May 5, 2026.