For two years, no AI model crossed 50% on Humanity's Last Exam. As of July 2026, fourteen models have, and the top tool-assisted score is 64.5. What broke the ceiling matters as much as the number.
Fourteen models in BenchLM's catalog score above 50 on Humanity's Last Exam as of July 2026, and the top row reads 64.5, a number that did not exist anywhere on the leaderboard a year ago. For two years, "no model has crossed 50% on HLE" was the single most quoted sentence about the limits of frontier AI. This week that sentence came out of our own explainer, and deleting it is a bigger story than any launch post.
During a routine freshness pass on July 4, our validators flagged the HLE explainer as stale. Stale posts are normal operations here; scores move and the tooling catches them. This one was different, because the fix was not a number swap. The framing itself was dead. The post's central claim, its FAQ answers, and its closing argument all rested on a ceiling that no longer exists.
We rewrote the explainer the same day. This post is the changelog entry explaining what replaced the ceiling, because a benchmark site that silently edits its own thesis is a benchmark site training its readers not to trust the archive.
One housekeeping note for anyone diffing our history: the same July 4 pass also rescored our overall leaderboard under a new missing-data methodology, which is documented separately. The HLE numbers in this post are raw benchmark values and were not affected by that change.
Humanity's Last Exam was designed as the benchmark that outlives the others. Its 3,000-plus questions were crowdsourced from domain experts at universities and research institutions with an explicit filter: submissions had to stump the frontier models of the day. Advanced mathematics, theoretical physics, niche subfields of chemistry and linguistics and law, the kind of material where being nearly right scores zero.
The design goal was headroom. MMLU had saturated into the high 90s. GPQA Diamond, briefly the hard benchmark, was heading the same direction. HLE's authors wanted a test where progress would stay legible for years, and the sub-50% frontier ceiling became the benchmark's identity. It was the number that let commentators say deep expertise remained human territory, and it held long enough that the sentence calcified into furniture.
Furniture is exactly what expired rhetoric looks like the day before someone moves it.
The current top of the HLE column, as tracked in our catalog:
| Rank | Model | HLE score | Protocol |
|---|---|---|---|
| 1 | Claude Mythos 5 | 64.5 | with tools |
| 1 | Claude Fable 5 | 64.5 | with tools |
| 3 | GPT-5.4 Pro | 58.7 | as published |
| 4 | Claude Opus 4.8 | 57.9 | as published |
| 5 | Claude Sonnet 5 | 57.4 | as published |
| 6 | GPT-5.5 Pro | 57.2 | as published |
| 7 | GLM-5.2 | 54.7 | as published |
| 7 | Claude Opus 4.7 (Adaptive) | 54.7 | as published |
| 9 | Claude Opus 4.6 | 53 | as published |
| 10 | GLM-5.1 | 52.3 | as published |
The asterisk on the top rows, rendered here as the "with tools" label in the Protocol column, is doing honest work: the 64.5 was recorded with tool use, per Anthropic's system card for the Claude 5 family. Tool-assisted means the model could search, browse, and run code while answering.
Purists object that an exam of expert knowledge should be closed-book, and the objection deserves a straight answer rather than a shrug. Closed-book and tool-assisted are different measurements, both legitimate. One asks what the model knows; the other asks what the model can find out, verify, and synthesize under a real deployment loop. The second question is the one that matters for anyone paying API invoices in 2026, because nobody's production agent answers from memory with the internet disconnected. Our position is procedural rather than partisan: label the protocol on every row, refuse to blend the two silently, and let readers pick the column that matches their question. That is more than the launch decks tend to do.
Benchmarks with asterisks age better than benchmarks with obituaries.
Our own archive is the cleanest measuring stick for the pace, because we can diff it. The March 2026 edition of our HLE explainer recorded the state of the art as follows: GPT-5.4 on top at 46, GPT-5.3 Codex at 44, GPT-5.2 at 40, Claude Opus 4.6 at 38, Gemini 3.1 Pro at 35. The post's headline claim, quoted verbatim from the FAQ we shipped then: no model has broken 50%.
Four months later the top tracked score is 64.5 and the count of models above 50 is fourteen.
The comparison is not perfectly apples-to-apples, and saying so is the point of this section. Part of the jump is genuinely new models: the Claude 5 family, GPT-5.5 Pro, the GLM-5 line. Part of it is protocol: the leading rows are tool-assisted where the March snapshot was dominated by as-published closed-book-style numbers. And part of it is sourcing churn on existing rows: GPT-5.4's own entry moved from 46 to 52.1 between snapshots as later-published runs replaced earlier ones in our catalog. An 18.5-point quarterly move decomposes into model progress, protocol drift, and data freshness, and anyone quoting it as pure capability gain is overclaiming by an unknowable amount.
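The arithmetic behind that caveat can be laid out explicitly. What follows is a minimal sketch using only the figures quoted in this post; the variable names are illustrative and not part of any BenchLM tooling. It does not apportion the headline move among the three causes. It only shows how large the sourcing churn was on one row whose model did not change.

```python
# Figures quoted in this post; names are illustrative assumptions.
march_top = 46.0     # GPT-5.4, top tracked score in the March 2026 snapshot
july_top = 64.5      # Claude Mythos 5 / Claude Fable 5, recorded with tools
gpt54_march = 46.0   # GPT-5.4's entry in March
gpt54_july = 52.1    # same model after later-published runs replaced earlier ones

# Headline move in the top tracked score between snapshots.
headline_move = round(july_top - march_top, 1)

# Movement on a row where the model itself is unchanged: sourcing churn only.
sourcing_churn_gpt54 = round(gpt54_july - gpt54_march, 1)

print(f"Headline move in top score: {headline_move} points")
print(f"Move on an unchanged model (GPT-5.4): {sourcing_churn_gpt54} points")
```

A shift of several points on a row whose model never changed is the reason snapshot-to-snapshot differences cannot all be booked as capability.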
What is not ambiguous is the shape of the distribution. The 50-to-59 band, empty in March, now holds twelve models from four different labs. Crossing 50 stopped being a singularity event and became a cohort behavior inside a single quarter.
Who owns the new band is as informative as its existence.
| Lab | Models above 50 on HLE | Best row |
|---|---|---|
| Anthropic | 6 | Claude Mythos 5, 64.5 (with tools) |
| OpenAI | 4 | GPT-5.4 Pro, 58.7 |
| Z.AI | 3 | GLM-5.2, 54.7 |
| Meta | 1 | Muse Spark, 50.4 |
Anthropic's six entries include its mid-tier: Claude Sonnet 5 at 57.4 outscores every non-Anthropic model except GPT-5.4 Pro, which says something uncomfortable about how the knowledge frontier is concentrating. OpenAI's four are all flagship or premium rows.
The Z.AI cluster is the underreported story. GLM-5.2, GLM-5.1, and GLM-5 all clear the line that no model on earth had reached eighteen months ago, and they do it at API prices a fraction of the frontier rate. Whatever the training recipe is, deep expert knowledge is no longer an exclusively American export. Meta's Muse Spark completes the set at 50.4, the narrowest membership in the club.
Google is the conspicuous absence. No Gemini row in our catalog currently clears 50, with Gemini 3.1 Pro's sourced figure at 40. That may reflect Google's disclosure choices as much as its models (a theme we take up in a companion post on what labs decline to publish), and the distinction matters: an absent number and a low number read identically in a ranking and mean different things.
Below the frontier, the picture is unchanged in the way that keeps HLE useful. Mid-tier and budget models still score in single digits, and the spread between the best and the median tracked model is wider on HLE than on any other knowledge benchmark we carry. A benchmark discriminates exactly as long as its scores refuse to bunch, and these refuse.
Three forces, in descending order of measured effect.
Tool use is first, and it is not close. The gap between the best tool-assisted rows and the best as-published rows spans nearly six points at the frontier (64.5 against 58.7), and the Claude 5 system card reports its figure explicitly as an agentic, tool-enabled protocol. HLE questions reward exactly what tool loops provide: the ability to decompose a monstrous question, retrieve the two obscure facts it hinges on, and check the arithmetic before committing. The benchmark was built to resist recall. It was not built to resist research, because in 2023 nobody had to plan for models that could do research.
Test-time compute is second. Every model in the fifties is a reasoning model spending orders of magnitude more inference on hard questions than its 2024 ancestors did. HLE's items are long-horizon by construction, multi-step derivations and cross-domain syntheses, which is the terrain where extended thinking buys the most accuracy per dollar. The same force that moved competition-math benchmarks through the 2025 season arrived at HLE roughly on schedule.
The third force is access tiers, and it is the strangest. The very top of the column is occupied by Claude Mythos 5, a restricted-access configuration available to vetted partners, alongside its public sibling Fable 5. Some of the measured frontier now lives behind vetting programs, which means public benchmark tables increasingly describe capability that most readers cannot rent at any price. A leaderboard has two choices: exclude gated models and misrepresent the frontier, or include them with loud labels and misrepresent availability. We chose inclusion with labels, and we are not fully comfortable with either option. That discomfort deserves its own post, and it will get one.
The milestone everyone quotes is tool-assisted. The milestone worth watching is not.
As-published scores, most of them closer to the closed-book end of the protocol spectrum, now reach 58.7. When a clearly labeled closed-book run crosses 50, the claim that broke this week breaks again in a stronger form: not "a model with a search engine and an interpreter can pass," but "the weights alone hold and can deploy that much expert knowledge." Those are different findings about where capability lives. The first says frontier systems can do expert work; the second says the expertise has been internalized, which bears on everything from distillation risk to what an air-gapped deployment can do.
No such run sits in our catalog today.
That absence is itself worth flagging, because the incentive to publish closed-book HLE numbers weakens as the tool-assisted ones grow more marketable. If the closed-book column goes quiet for two more quarters while headline numbers climb, that silence will be a disclosure choice, and we track those.
HLE remains far from done. Against our saturation threshold (a top score of 90), its 64.5 leaves 25.5 points of headroom, more than any other major knowledge benchmark has. Some 37% of the percentage-scale benchmarks we track are already saturated, and HLE is nowhere near joining them.
| Benchmark | Top tracked score | Status |
|---|---|---|
| MMLU | 99 | saturated for two years |
| GPQA Diamond | mid-90s | effectively saturated |
| MMLU-Pro | 93 | crossing the line now |
| HLE | 64.5 (with tools) | 25.5 points of headroom |
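
The saturation rule behind the table is simple enough to write down. This is a minimal sketch, assuming the threshold of 90 stated above; the function names and the dictionary layout are illustrative, not a description of how our pipeline is built. GPQA Diamond is left out because the table gives only "mid-90s" rather than a figure.

```python
SATURATION_THRESHOLD = 90.0  # a top score at or above this counts as saturated

# Top tracked scores from the table above (GPQA Diamond omitted: no exact figure).
top_scores = {"MMLU": 99.0, "MMLU-Pro": 93.0, "HLE": 64.5}


def is_saturated(top_score: float, threshold: float = SATURATION_THRESHOLD) -> bool:
    """True when the best tracked score has reached the saturation threshold."""
    return top_score >= threshold


def headroom(top_score: float, threshold: float = SATURATION_THRESHOLD) -> float:
    """Points remaining before the threshold, floored at zero."""
    return max(0.0, threshold - top_score)


for name, score in top_scores.items():
    status = "saturated" if is_saturated(score) else f"{headroom(score)} points of headroom"
    print(f"{name}: {score} -> {status}")
```

Headroom here is measured against the threshold rather than against 100, which is why HLE's figure is 25.5 and not 35.5.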
The questions most models still miss are exactly the deep, multi-step expert items the benchmark was commissioned to hold, and the single-digit mid-tier scores mean HLE will keep separating model classes long after the frontier clears 70.
Two risks could spoil that, and they are worth naming precisely rather than gesturing at.
Contamination. HLE's public question set has been circulating since early 2025, which starts the familiar clock: each new training corpus is likelier than the last to contain the questions, the answers, or forum discussions of both. A private held-out split exists for auditing, and the signature of contamination is well known by now: a sudden score jump on the public split without a matching protocol change or held-out gain. Our benchmark-confidence page carries our current contamination assessment for HLE, updated as evidence arrives.
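The contamination signature described above can be expressed as a simple check. This is a sketch under stated assumptions: the `SplitDelta` record, the function name, and both thresholds are hypothetical and chosen for illustration, not values used on our benchmark-confidence page.

```python
from dataclasses import dataclass


@dataclass
class SplitDelta:
    """Change in one model's HLE results between two snapshots (hypothetical record)."""
    public_gain: float      # change on the public split, in points
    held_out_gain: float    # change on the private held-out split, in points
    protocol_changed: bool  # e.g. a closed-book run replaced by a tool-assisted one


def looks_contaminated(
    delta: SplitDelta,
    jump_threshold: float = 5.0,  # assumed: what counts as a "sudden" public jump
    gap_threshold: float = 3.0,   # assumed: public gain not matched on held-out
) -> bool:
    """Flag a public-split jump with no protocol change and no matching held-out gain."""
    sudden_jump = delta.public_gain >= jump_threshold
    unmatched = (delta.public_gain - delta.held_out_gain) >= gap_threshold
    return sudden_jump and unmatched and not delta.protocol_changed


# Illustrative inputs only, not measured results.
print(looks_contaminated(SplitDelta(public_gain=8.0, held_out_gain=0.5, protocol_changed=False)))
print(looks_contaminated(SplitDelta(public_gain=8.0, held_out_gain=7.5, protocol_changed=False)))
```

A flag from a check like this is a prompt for review rather than a verdict, since a public-split jump can also come from sourcing changes.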
Protocol mixing. As the marketing value of an HLE number grows, so does the temptation to report tool-assisted results next to closed-book ones in a single unlabeled column. The launch-deck version of this is not fraud, exactly; it is a footnote conveniently far from the headline. The protocol column in the table above is our standing answer, and any row we cannot attribute to a protocol gets flagged rather than blended.
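The "flag rather than blend" rule can be sketched in a few lines. This is an illustration, not our ingestion code: the row layout and the function name are assumptions, and "Hypothetical Model X" is an invented row included only to show what happens when a protocol is missing.

```python
from collections import defaultdict

# (model, HLE score, protocol). The first two rows come from the table above;
# the third is a hypothetical unlabeled row added for illustration.
rows = [
    ("Claude Mythos 5", 64.5, "with tools"),
    ("GPT-5.4 Pro", 58.7, "as published"),
    ("Hypothetical Model X", 55.0, None),
]


def split_by_protocol(rows):
    """Rank rows within each protocol; set aside rows that have no protocol label."""
    by_protocol = defaultdict(list)
    flagged = []
    for model, score, protocol in rows:
        if protocol is None:
            flagged.append((model, score))  # never merged into a ranked column
        else:
            by_protocol[protocol].append((model, score))
    for entries in by_protocol.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(by_protocol), flagged


columns, flagged = split_by_protocol(rows)
print(columns)
print("Needs protocol attribution:", flagged)
```

Keeping unlabeled rows out of every ranked column means a missing footnote shows up as a visible gap instead of a silent blend.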
Watch the 70 line. At the cadence the last two quarters imply, the first tool-assisted 70 is a near-term event rather than a decade event, and when a labeled closed-book run crosses 50, that will be the quieter, larger milestone. The archived March snapshot with its 46 will still be sitting in our git history when both happen, which is precisely where a benchmark site's receipts belong.
A practical note before the FAQ, because benchmark posts have a way of floating free of purchasing decisions.
HLE deserves weight in a model choice when the workload lives near the benchmark's terrain: research assistance in technical domains, literature synthesis, expert-level question answering where a confident near-miss is worse than a refusal. The spread is wide and real; picking a 57 over a 40 buys measurable capability on exactly those tasks.
HLE deserves almost no weight when the workload is routine coding, drafting, extraction, or support automation. A model in the single digits on HLE can be flawless at summarizing tickets, and the premium attached to frontier knowledge scores is money spent on headroom those workloads never touch. Our category pages exist because no single column, this one included, should pick a model alone.
One asterisk transfers from the leaderboard to the invoice: the 64.5 at the top of this benchmark belongs to tool-assisted runs of models at the top of the price sheet. Buyers comparing HLE numbers across providers should confirm the protocol matches their deployment before paying for the difference.
What is the highest HLE score in 2026?
Claude Mythos 5 and Claude Fable 5 hold the top HLE score tracked by BenchLM at 64.5, recorded with tool use, as of July 2026. The best scores behind them are GPT-5.4 Pro at 58.7, Claude Opus 4.8 at 57.9, and Claude Sonnet 5 at 57.4.
What does "with tools" mean on HLE?
Tool-assisted HLE runs let the model search, browse, and execute code while answering, the way frontier models are actually deployed. Closed-book runs measure stored knowledge alone and score meaningfully lower. Both are legitimate protocols measuring different things, and mixing them in one column without labels is how leaderboards mislead.
How many AI models score above 50 on HLE?
Fourteen models in BenchLM's catalog score above 50 on HLE as of July 2026, spanning Anthropic, OpenAI, Z.AI, and Meta rows. Eighteen months earlier that count was zero, and the benchmark's designers were publicly wondering whether the ceiling would survive the decade.
How fast are HLE scores improving?
BenchLM's own snapshots frame the pace: the top tracked HLE score was 46 in March 2026 and 64.5 in July 2026, an 18.5-point move in one quarter. Most of that jump came from tool-assisted protocols and heavier test-time reasoning rather than from bigger base models.
Is HLE saturated now?
No. Saturation on BenchLM means a top score of 90 or higher, and 37% of tracked percentage-scale benchmarks have crossed that line. HLE's best result remains 64.5 with tools, mid-tier models still land in single digits, and the score spread keeps widening rather than compressing.
Can HLE become contaminated?
Yes, and the clock is running. HLE's public question set has been available since early 2025, so newer training corpora may include it, while a private held-out split exists for auditing. BenchLM's benchmark-confidence page tracks contamination risk, and a sudden score jump without a protocol change would be the warning sign.