# BenchLM AI

> BenchLM AI compares 258 tracked AI models across 247 benchmarks in 9 categories: Agentic, Coding, Multimodal & Grounded, Reasoning, Knowledge, Instruction Following, Multilingual, Mathematics, Korean Benchmarks. Leaderboards exclude generated benchmark rows so the public rankings stay conservative and source-aware.

## Main Pages

- [Homepage](https://benchlm.ai/): Overall leaderboard and benchmark explorer
- [Models Directory](https://benchlm.ai/models): Canonical model families and sibling SKUs
- [Compare](https://benchlm.ai/compare): Head-to-head model comparisons
- [Benchmarks](https://benchlm.ai/benchmarks): Benchmark directory and explainer pages
- [Pricing](https://benchlm.ai/llm-pricing): Token pricing comparison for major models
- [LLM Pricing Deep Dive](https://benchlm.ai/llm-pricing): Full pricing table with benchmark score and value context
- [Price vs Performance](https://benchlm.ai/llm-price-performance): Cost-adjusted model rankings and value leaders
- [LLM Speed](https://benchlm.ai/llm-speed): Tokens/sec and first-answer latency comparisons across providers
- [Benchmark Confidence](https://benchlm.ai/benchmark-confidence): Provenance, verification, and confidence coverage for ranked models
- [AI Race](https://benchlm.ai/ai-race): Release timeline, provider movement, and benchmark freshness snapshot
- [LLM Leaderboard History](https://benchlm.ai/llm-leaderboard-history): Arena Elo history from 2023 to today
- [Alternatives Directory](https://benchlm.ai/alternatives): SEO landing pages for ChatGPT, Claude, Gemini, and OpenAI API alternatives
- [Korean AI Hub](https://benchlm.ai/leaderboards/korean-llm): Best Korean LLM Leaderboard
- [Korean Benchmarks](https://benchlm.ai/leaderboards/korean-benchmarks): Global models evaluated on Korean metrics
- [KMMLU Guide](https://benchlm.ai/guides/kmmlu-explained): KMMLU Benchmark Explained
- [European AI Guide](https://benchlm.ai/best/european-llm): Europe's benchmarked, sovereign, and specialist
model landscape
- [European Models Ranking](https://benchlm.ai/best/european-models): Ranked benchmark view for European model creators
- [Blog](https://benchlm.ai/blog): Benchmark explainers and model analysis

## Top Model Profiles

- [Claude Mythos 5](https://benchlm.ai/models/claude-mythos-5): #1, Anthropic, 99/100, Proprietary, 1M+
- [Claude Fable 5](https://benchlm.ai/models/claude-fable-5): #2, Anthropic, 97/100, Proprietary, 1M+
- [Claude Opus 4.8](https://benchlm.ai/models/claude-opus-4-8): #3, Anthropic, 93/100, Proprietary, 1M
- [Gemini 3.1 Pro](https://benchlm.ai/models/gemini-3-1-pro): #4, Google, 91/100, Proprietary, 1M
- [Qwen3.7 Max](https://benchlm.ai/models/qwen3-7-max): #5, Alibaba, 91/100, Proprietary, 1M
- [GPT-5.4 Pro](https://benchlm.ai/models/gpt-5-4-pro): #6, OpenAI, 90/100, Proprietary, 1.05M
- [GPT-5.5](https://benchlm.ai/models/gpt-5-5): #7, OpenAI, 89/100, Proprietary, 1M
- [Gemini 3 Pro Deep Think](https://benchlm.ai/models/gemini-3-pro-deep-think): #8, Google, 89/100, Proprietary, 2M
- [Grok 4.1](https://benchlm.ai/models/grok-4-1): #9, xAI, 89/100, Proprietary, 1M
- [GPT-5.4](https://benchlm.ai/models/gpt-5-4): #10, OpenAI, 88/100, Proprietary, 1.05M

## Best-Of Rankings

- [Best LLMs for Coding](https://benchlm.ai/coding)
- [Best LLMs for Math](https://benchlm.ai/math)
- [Best LLMs for Knowledge](https://benchlm.ai/knowledge)
- [Best LLMs for Reasoning](https://benchlm.ai/reasoning)
- [Best Agentic AI Models](https://benchlm.ai/agentic)
- [Best Multimodal & Grounded AI Models](https://benchlm.ai/multimodal-grounded)
- [Best LLMs for Instruction Following](https://benchlm.ai/instruction-following)
- [Best Multilingual LLMs](https://benchlm.ai/multilingual)
- [Best Long Context AI Models](https://benchlm.ai/best/long-context)
- [Best Tool Use & Function Calling Models](https://benchlm.ai/best/tool-use)
- [Best AI Models for Web Research](https://benchlm.ai/best/web-research)
- [Best Computer Use AI
Models](https://benchlm.ai/best/computer-use)
- [Best Document AI Models](https://benchlm.ai/best/document-ai)
- [Best Image Understanding Models](https://benchlm.ai/best/image-understanding)
- [Best Frontend & App Dev Models](https://benchlm.ai/best/frontend-app-dev)
- [Best Factuality AI Models](https://benchlm.ai/best/factuality)
- [Best Open Source LLMs](https://benchlm.ai/best/open-source)
- [Best Proprietary LLMs](https://benchlm.ai/best/proprietary)
- [Best Reasoning AI Models](https://benchlm.ai/best/reasoning-models)
- [Best OpenAI Models](https://benchlm.ai/best/openai-models)
- [Best Anthropic Models](https://benchlm.ai/best/anthropic-models)
- [Best Google AI Models](https://benchlm.ai/best/google-models)
- [Best Meta AI Models](https://benchlm.ai/best/meta-models)
- [Best DeepSeek Models](https://benchlm.ai/best/deepseek-models)
- [Best AI Models Overall](https://benchlm.ai/best/overall)
- [Best Large Context Window LLMs](https://benchlm.ai/best/large-context-window)
- [Best Chinese AI Models](https://benchlm.ai/best/chinese-models)
- [European AI Models](https://benchlm.ai/best/european-models)
- [Best Non-Reasoning LLMs](https://benchlm.ai/best/non-reasoning-models)
- [Best Mistral Models](https://benchlm.ai/best/mistral-models)
- [Best xAI Grok Models](https://benchlm.ai/best/xai-models)
- [Best Alibaba Qwen Models](https://benchlm.ai/best/alibaba-models)
- [Best Value LLM for Coding](https://benchlm.ai/best/best-value-coding)
- [Best Value Agentic AI Model](https://benchlm.ai/best/best-value-agentic)
- [Best Value LLM for Reasoning](https://benchlm.ai/best/best-value-reasoning)
- [Best Value LLM for Knowledge](https://benchlm.ai/best/best-value-knowledge)
- [Best Value LLM for Math](https://benchlm.ai/best/best-value-math)
- [Best Value Multimodal AI Model](https://benchlm.ai/best/best-value-multimodal)
- [Best Value LLM Overall](https://benchlm.ai/best/best-value-overall)

## Tools & Resources

- [Pricing](https://benchlm.ai/llm-pricing): Canonical
token pricing table for major models
- [LLM Pricing Deep Dive](https://benchlm.ai/llm-pricing): Extended pricing page with score and Score/$ context
- [Price vs Performance](https://benchlm.ai/llm-price-performance): Benchmark score per dollar across major categories
- [LLM Speed](https://benchlm.ai/llm-speed): Runtime throughput and first-token latency
- [Benchmark Confidence](https://benchlm.ai/benchmark-confidence): Verified vs generated benchmark coverage
- [Alternative Finder](https://benchlm.ai/tools/alternative-finder): Replace ChatGPT, Claude, Google Gemini, or the OpenAI API using benchmark fit, pricing, context, and open-weight filters
- [LLM Selector Quiz](https://benchlm.ai/tools/llm-selector): Personalized model recommendations
- [Cost Calculator](https://benchlm.ai/tools/cost-calculator): Estimate AI cost per blog post, web page, doc, PRD, or feature

## Alternative Landing Pages

- [Best ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt): chatgpt alternatives, best chatgpt alternatives, best alternative to chatgpt
- [Best Claude Alternatives in 2026](https://benchlm.ai/alternatives/claude): claude alternative, best claude alternative, cheaper alternative to claude
- [Best Google Gemini Alternatives in 2026](https://benchlm.ai/alternatives/google-gemini): google gemini alternative, best google gemini alternative, gemini alternative
- [Best OpenAI API Alternatives in 2026](https://benchlm.ai/alternatives/openai-api): openai api alternative, best openai api alternatives, cheaper openai api alternative
- [Best GLM Alternatives in 2026](https://benchlm.ai/alternatives/glm): glm alternative, best glm alternative, z.ai alternative
- [Best Kimi Alternatives in 2026](https://benchlm.ai/alternatives/kimi): kimi alternative, best kimi alternative, moonshot kimi alternative
- [Best Free ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt/free): free chatgpt alternative, best free chatgpt alternative, chatgpt alternative free
- [Best Open Source ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt/open-source): open source chatgpt alternative, best open source chatgpt alternative, open-weight chatgpt alternative
- [Best Claude Alternatives for Coding in 2026](https://benchlm.ai/alternatives/claude/coding): claude alternative for coding, best claude alternative for coding, claude code alternative

## Blog Posts

- [Claude Fable 5 and Mythos 5: The Future of AI Is Gated Intelligence](https://benchlm.ai/blog/posts/claude-fable-5-mythos-5-future-of-ai): Anthropic's Claude Fable 5 brings Mythos-class capability to public users, while Claude Mythos 5 remains trusted-access. The benchmark story is strong, but the real shift is capability-gated deployment.
- [Best LLM for Math 2026: AIME, HMMT & MATH-500 Rankings](https://benchlm.ai/blog/posts/best-llm-math): Best LLM for math 2026: GPT-5.4 leads AIME 2025, MATH-500, and BRUMO. Compare Claude, Gemini, DeepSeek-R1, GPT-5.5, and value picks by use case.
- [Perceptron Mk1 and Frontier Video Models: The Complete Guide to Video Understanding AI](https://benchlm.ai/blog/posts/perceptron-mk1-frontier-video-models): A complete guide to Perceptron Mk1, frontier video understanding models, video AI benchmarks, and where video-language models are headed next.
- [ProgramBench Benchmark Explained: Can LLMs Rebuild Programs From Binaries?](https://benchlm.ai/blog/posts/programbench-cleanroom-coding-benchmark): ProgramBench is a new LLM coding benchmark where agents rebuild full programs from a compiled binary and documentation. See scores, how it differs from SWE-bench, and why all public models are 0% resolved.
- [ARC-AGI-2 Explained: The Hardest Public Reasoning Benchmark](https://benchlm.ai/blog/posts/arc-agi-2-explained): ARC-AGI-2 measures fluid intelligence through visual grid puzzles that can't be solved by memorization. Here's how it works, what scores mean, and where current frontier models stand.
- [LLM Context Window Comparison 2026: Advertised vs Effective, Input vs Output](https://benchlm.ai/blog/posts/context-window-comparison): Four frontier LLMs now advertise 1M+ tokens. DeepSeek V4 Pro's 384K output changes generation workflows. Gemini leads effective-context evals. Here's the real comparison.
- [DeepSeek V4 Pro vs Claude Opus 4.7 vs GPT-5.5: The Frontier in April 2026](https://benchlm.ai/blog/posts/deepseek-v4-vs-claude-opus-4-7-vs-gpt-5-5): Three frontier flagships launched in eight days. DeepSeek V4 Pro undercuts GPT-5.5 by ~9x on output price under MIT license. Here's how they compare on benchmarks, cost, and real use.
- [Claude API Pricing: Haiku 4.5, Sonnet 4.6, and Opus 4.7 (April 2026)](https://benchlm.ai/blog/posts/claude-api-pricing): Current Anthropic Claude API pricing from official model pages and the Claude Opus 4.7 launch announcement, including prompt caching, batch discounts, and current long-context notes.
- [DeepSeek API Pricing: deepseek-chat vs deepseek-reasoner (April 2026)](https://benchlm.ai/blog/posts/deepseek-api-pricing): Current DeepSeek API pricing from the official docs: deepseek-chat and deepseek-reasoner, cache-hit vs cache-miss pricing, output pricing, and the current V3.2 endpoint mapping.
- [Gemini API Pricing: Current Flash, Flash-Lite, and Pro Rates (April 2026)](https://benchlm.ai/blog/posts/gemini-api-pricing): Current Gemini API pricing from Google's official docs: 3.1 Pro Preview, 3.1 Flash-Lite Preview, 3 Flash Preview, 2.5 Flash, 2.5 Pro, plus Batch and Flex pricing.
- [OpenAI API Pricing: GPT-5.4, GPT-5.2, and GPT-5.1 (April 2026)](https://benchlm.ai/blog/posts/openai-api-pricing): Current OpenAI API pricing from official docs: GPT-5.4, GPT-5.2, GPT-5.1, cached input rates, Batch API discounts, and the pricing details that actually matter.
- [GPT-5 vs Gemini in 2026: Full Benchmark Breakdown](https://benchlm.ai/blog/posts/gpt5-vs-gemini-2026): GPT-5.4 and Gemini 3.1 Pro are one point apart on BenchLM's leaderboard. We compare coding, math, reasoning, multimodal, agentic, pricing, and context window, with Gemini Deep Think as the reasoning wildcard.
- [Mythos Preview is the first frontier model Anthropic decided not to ship. The benchmarks show why.](https://benchlm.ai/blog/posts/mythos-preview-anthropic-not-shipping): Claude Mythos Preview beats Opus 4.6 by double digits on every coding benchmark Anthropic released. Then they shelved it. Here's what the numbers actually show, and why the shipping decision matters more than the launch.
- [Best LLM for RAG in 2026: Top Models Ranked for Retrieval-Augmented Generation](https://benchlm.ai/blog/posts/best-llm-rag): We ranked LLMs for RAG by IFEval, knowledge benchmarks, context window, long-context retrieval accuracy, and pricing. Here are the top models for retrieval-augmented generation in 2026.
- [Best LLM for Writing in 2026: AI Models Ranked for Content Creation](https://benchlm.ai/blog/posts/best-llm-writing): Which AI model is best for writing in 2026? We rank Claude, GPT, Gemini, and open source LLMs by creative writing Arena scores, instruction-following benchmarks, and real-world content quality, with pricing for every budget.
- [How to Choose an LLM in 2026: Which AI Model Is Best for Your Use Case](https://benchlm.ai/blog/posts/which-llm-to-use): A step-by-step framework for choosing the right LLM in 2026. We compare Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4, DeepSeek, and open source models by use case, budget, and deployment needs, backed by benchmark data.
- [Best Open Source LLM in 2026: Rankings, Benchmarks, and the Models Worth Running](https://benchlm.ai/blog/posts/best-open-source-llm): Which open source LLM is best in 2026?
We rank the top open weight models by real benchmark data (DeepSeek V4, Kimi K2.6, GLM-5, Qwen3.5, Gemma 4, Llama) and compare them to proprietary leaders.
- [Best Chinese LLMs in 2026: DeepSeek V4, Kimi K2.6, GLM-5, Qwen, and Every Model Ranked](https://benchlm.ai/blog/posts/best-chinese-llm): Which Chinese LLM is best in 2026? We rank DeepSeek V4, Kimi K2.6, GLM-5, GLM-5.1, Qwen3.5, MiMo, and more using current BenchLM data across coding, math, reasoning, and agentic work.
- [ChatGPT vs Claude vs Gemini in 2026: The Definitive Comparison](https://benchlm.ai/blog/posts/chatgpt-vs-claude-vs-gemini-2026): The best AI model depends on your use case. We compare Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.4 across coding, writing, reasoning, multimodal, price, and speed using current benchmark data.
- [How LLM Token Pricing Works: A Complete Guide to API Costs in 2026](https://benchlm.ai/blog/posts/llm-token-pricing): Learn how LLM API pricing works, from tokens, input/output costs, and reasoning tokens to vision, embedding, and fine-tuning pricing. Includes real cost examples, free tiers, and 6 strategies to cut your AI spend.
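
Blog posts are listed above under `https://benchlm.ai/blog/posts/[slug]`, and this file also documents Markdown mirrors at `https://benchlm.ai/md/blog/[slug].md`. The sketch below maps one form to the other. It assumes the mirror uses the same slug as the web URL, which this file does not state; `blog_mirror_url` is a hypothetical helper, not part of any BenchLM API.

```python
from urllib.parse import urlparse

# Mirror pattern documented in this file: https://benchlm.ai/md/blog/[slug].md
MIRROR_BASE = "https://benchlm.ai/md/blog"
POST_PREFIX = "/blog/posts/"


def blog_mirror_url(post_url: str) -> str:
    """Return the assumed Markdown mirror URL for a BenchLM blog post URL.

    Assumption: the mirror slug is identical to the web slug.
    """
    path = urlparse(post_url).path.rstrip("/")
    if not path.startswith(POST_PREFIX):
        raise ValueError(f"not a blog post URL: {post_url}")
    slug = path[len(POST_PREFIX):]
    # Reject empty or nested slugs rather than guessing at a mapping.
    if not slug or "/" in slug:
        raise ValueError(f"unexpected slug in URL: {post_url}")
    return f"{MIRROR_BASE}/{slug}.md"


print(blog_mirror_url("https://benchlm.ai/blog/posts/llm-token-pricing"))
```

A crawler that prefers plain Markdown could try the mirror first and fall back to the HTML page if the request fails.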
## Markdown Mirrors

- [Homepage (md)](https://benchlm.ai/md/index.md)
- [Models directory (md)](https://benchlm.ai/md/models/index.md)
- [Benchmarks directory (md)](https://benchlm.ai/md/benchmarks/index.md)
- [Compare index (md)](https://benchlm.ai/md/compare/index.md)
- [Pricing (md)](https://benchlm.ai/md/pricing.md)
- [LLM Pricing Deep Dive (md)](https://benchlm.ai/md/llm-pricing.md)
- [Price vs Performance (md)](https://benchlm.ai/md/llm-price-performance.md)
- [LLM Speed (md)](https://benchlm.ai/md/llm-speed.md)
- [Benchmark Confidence (md)](https://benchlm.ai/md/benchmark-confidence.md)
- [AI Race (md)](https://benchlm.ai/md/ai-race.md)
- [LLM Leaderboard History (md)](https://benchlm.ai/md/llm-leaderboard-history.md)
- [Alternatives directory (md)](https://benchlm.ai/md/alternatives/index.md)
- [Alternative Finder (md)](https://benchlm.ai/md/tools/alternative-finder.md)
- [Cost Calculator (md)](https://benchlm.ai/md/tools/cost-calculator.md)
- [LLM Selector (md)](https://benchlm.ai/md/tools/llm-selector.md)
- [Korean LLM Leaderboard (md)](https://benchlm.ai/md/leaderboards/korean-llm.md)
- [Korean Benchmarks (md)](https://benchlm.ai/md/leaderboards/korean-benchmarks.md)
- [KMMLU Guide (md)](https://benchlm.ai/md/guides/kmmlu-explained.md)
- [European AI Guide (md)](https://benchlm.ai/md/best/european-llm.md)
- Individual alternative pages available at: `https://benchlm.ai/md/alternatives/[slug].md`
- Individual model pages available at: `https://benchlm.ai/md/models/[slug].md`
- Benchmark pages available at: `https://benchlm.ai/md/benchmarks/[slug].md`
- Best-of ranking pages available at: `https://benchlm.ai/md/best/[slug].md`
- Comparison pages available at: `https://benchlm.ai/md/compare/[slug].md`
- Blog posts available at: `https://benchlm.ai/md/blog/[slug].md`

## Machine-Readable Data

- [Models JSON](https://benchlm.ai/data/models.json): Stable model metadata, rankings, benchmark scores, coverage fields, and model URLs
- [Leaderboard
JSON](https://benchlm.ai/data/leaderboard.json): Overall and category leaderboards derived from the same ranking logic as the site
- [Benchmarks JSON](https://benchlm.ai/data/benchmarks.json): Benchmark definitions, weights, source links, and displayable score coverage counts
- [Pricing JSON](https://benchlm.ai/data/pricing.json): API pricing joined to BenchLM model scores and value fields where available
- [Speed JSON](https://benchlm.ai/data/speed.json): Runtime throughput and first-token latency metrics joined to BenchLM model records where possible
- [Comparisons JSON](https://benchlm.ai/data/comparisons.json): Curated model comparisons with score deltas and category summaries

## Data & Technical Notes

- Data last updated: June 12, 2026
- Canonical model families tracked: 98
- Total pairwise comparisons available: 33153
- Built with Next.js and deployed on Cloudflare Workers via OpenNext
- Sitemap: https://benchlm.ai/sitemap.xml
- Full crawler bundle: https://benchlm.ai/llms-full.txt
- Author: [@glevd](https://x.com/glevd)

---

# FULL CONTENT

The following sections consolidate the main BenchLM datasets, rankings, and route indexes in one file.

## All Benchmark Descriptions

### Knowledge Benchmarks

#### MMLU (Massive Multitask Language Understanding)

A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.

- Year: 2020
- Tasks: 57 subjects
- Format: Multiple choice questions
- Difficulty: Elementary to professional level
- Paper: Measuring Massive Multitask Language Understanding (https://arxiv.org/abs/2009.03300)
- Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

MMLU evaluates models on 57 subjects spanning humanities, social sciences, STEM, and other areas.
Questions range from elementary to advanced professional level, making it a comprehensive test of world knowledge and reasoning ability.

#### GPQA (Graduate-Level Google-Proof Q&A)

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.

- Year: 2023
- Tasks: 448 questions
- Format: Multiple choice questions
- Difficulty: Graduate level
- Paper: GPQA: A Graduate-Level Google-Proof Q&A Benchmark (https://arxiv.org/abs/2311.12022)
- Authors: David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

GPQA questions are crafted by PhD-level domain experts and validated to be answerable by experts but challenging for non-experts even with internet access. This makes it an excellent test of deep scientific knowledge and reasoning.

#### GPQA-D (GPQA Diamond)

A display-only GPQA Diamond reference from provider comparison charts.

- Year: 2026
- Tasks: Graduate-level science questions
- Format: Multiple choice questions
- Difficulty: Graduate level
- Paper: Trinity-Large-Thinking: Scaling an Open Source Frontier Agent (https://www.arcee.ai/blog/trinity-large-thinking)
- Authors: Arcee AI

BenchLM stores GPQA-D separately from the standardized GPQA row when providers publish exact chart values that should not overwrite the core weighted benchmark.

#### SuperGPQA (SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines)

An expanded version of GPQA that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines, providing comprehensive coverage of academic domains.
- Year: 2025
- Tasks: 285 disciplines
- Format: Multiple choice questions
- Difficulty: Graduate level
- Paper: SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines (https://arxiv.org/abs/2502.14739)
- Authors: Xiaoxuan Du, Yao Yao, Kexin Ma, Bowen Wang, Tianyu Zheng, Kaiyan Zhu, Yiming Zhang, Yutao Zhu, Jiawei Zhou, Jingren Zhou

SuperGPQA significantly expands the scope of graduate-level evaluation by covering 285 disciplines compared to GPQA's focus on 3 subjects. It maintains the same rigorous standards while providing broader coverage of academic knowledge.

#### MMLU-Pro (Massive Multitask Language Understanding Professional)

An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.

- Year: 2024
- Tasks: Multiple subjects
- Format: 10-way multiple choice
- Difficulty: Professional level
- Paper: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (https://arxiv.org/abs/2406.01574)
- Authors: Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

MMLU-Pro increases the number of choices from 4 to 10 and integrates more reasoning-focused problems, reducing the chance of correct guessing and better evaluating true understanding. It serves as a more robust discriminator of model capabilities.

#### AGIEval (AGIEval)

A human-centric exam benchmark for general knowledge and reasoning reported in DeepSeek-V4 base-model evaluations.
- Year: 2026
- Tasks: General academic and professional exam questions
- Format: Exact match
- Difficulty: General knowledge
- Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)
- Authors: DeepSeek-AI

BenchLM stores AGIEval as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations.

#### HLE (Humanity's Last Exam)

An extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.

- Year: 2025
- Tasks: Expert-level questions
- Format: Open-ended and multiple choice
- Difficulty: Frontier expert level
- Paper: Humanity's Last Exam (https://lastexam.ai/)
- Authors: Center for AI Safety, Scale AI, and thousands of expert contributors

HLE represents the hardest public benchmark available, with top models scoring only 10-45%. Questions span advanced mathematics, theoretical physics, philosophy, and other fields at the cutting edge of human knowledge.

#### FrontierScience (FrontierScience)

A benchmark for research-level scientific reasoning, designed to separate frontier models on difficult science tasks that mix domain knowledge with deep reasoning.

- Year: 2026
- Tasks: Research-level science tasks
- Format: Scientific reasoning benchmark
- Difficulty: Research frontier
- Paper: FrontierScience (https://openai.com/index/frontierscience/)
- Authors: OpenAI

FrontierScience matters because GPQA-style knowledge alone is not enough for scientific copilots. It better reflects the kind of reasoning needed for research assistance and frontier technical work.

#### Artificial Analysis Intelligence Index (Artificial Analysis Intelligence Index)

A display-only intelligence index published by Artificial Analysis that aggregates provider-reported and benchmark-derived signals into a single model-level score.
- Year: 2026
- Tasks: Cross-benchmark intelligence index
- Format: Aggregated model score
- Difficulty: Display-only external reference
- Paper: Artificial Analysis (https://artificialanalysis.ai/)
- Authors: Artificial Analysis

BenchLM tracks Artificial Analysis as a display-only external reference rather than a weighted benchmark. It is useful as a market snapshot, but it is not a benchmark-native row with a single public task set, scoring harness, or exact-source methodology aligned to BenchLM's core benchmark pages.

#### AA-GPQA Diamond (Artificial Analysis GPQA Diamond)

A display-only Artificial Analysis GPQA Diamond score.

- Year: 2026
- Tasks: Graduate-level science questions
- Format: Accuracy
- Difficulty: Graduate-level science reasoning
- Paper: Artificial Analysis GPQA Diamond Benchmark Leaderboard (https://artificialanalysis.ai/evaluations/gpqa-diamond)
- Authors: Artificial Analysis

BenchLM stores the Artificial Analysis GPQA Diamond result separately from the weighted GPQA lane so AA refreshes remain display-only.

#### AA-HLE (Artificial Analysis Humanity's Last Exam)

A display-only Artificial Analysis Humanity's Last Exam score.

- Year: 2026
- Tasks: Expert-level questions
- Format: Accuracy
- Difficulty: Frontier expert reasoning
- Paper: Artificial Analysis Humanity's Last Exam Benchmark Leaderboard (https://artificialanalysis.ai/evaluations/hle)
- Authors: Artificial Analysis

BenchLM stores the Artificial Analysis HLE result separately from the weighted HLE lane so AA refreshes remain display-only.

#### AA-Omniscience Index (Artificial Analysis Omniscience Index)

A display-only Artificial Analysis factual knowledge index.
- Year: 2026
- Tasks: Knowledge questions
- Format: Index score
- Difficulty: Broad factual knowledge
- Paper: AA-Omniscience: Knowledge and Hallucination Benchmark (https://artificialanalysis.ai/evaluations/omniscience)
- Authors: Artificial Analysis

BenchLM stores the AA-Omniscience index as a display-only factuality signal alongside the accuracy and hallucination-rate rows.

#### AA-Omniscience Accuracy (Artificial Analysis Omniscience Accuracy)

A display-only Artificial Analysis knowledge metric for the proportion of correctly answered questions.

- Year: 2026
- Tasks: Knowledge questions
- Format: Accuracy
- Difficulty: Broad knowledge
- Paper: Artificial Analysis model benchmarks (https://artificialanalysis.ai/models/grok-4-3)
- Authors: Artificial Analysis

BenchLM stores AA-Omniscience Accuracy as a display-only row when a model page publishes the exact Artificial Analysis benchmark card value.

#### AA-Omniscience Hallucination Rate (Artificial Analysis Omniscience Hallucination Rate)

A display-only Artificial Analysis factuality metric for the rate of incorrect answers among non-correct responses.

- Year: 2026
- Tasks: Knowledge questions
- Format: Hallucination rate
- Difficulty: Factuality
- Paper: Artificial Analysis model benchmarks (https://artificialanalysis.ai/models/grok-4-3)
- Authors: Artificial Analysis

BenchLM marks this row lower-is-better because a lower hallucination rate is preferable, even though the OpenRouter card displays the raw percentage.

#### SimpleQA (Measuring Short-Form Factuality in Large Language Models)

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately. Focuses on factual correctness rather than reasoning complexity.
- Year: 2024
- Tasks: Factual questions
- Format: Short-form Q&A
- Difficulty: Factual accuracy focused
- Paper: Measuring short-form factuality in large language models (https://arxiv.org/abs/2411.04368)
- Authors: Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus

SimpleQA prioritizes two key properties: questions should have short, factual answers that can be easily verified, and questions should be diverse and challenging. It serves as a crucial test of factual knowledge and accuracy.

#### Chinese-SimpleQA (Chinese-SimpleQA)

A Chinese short-form factuality benchmark reported by DeepSeek for V4 model evaluations.

- Year: 2026
- Tasks: Chinese factual questions
- Format: Short-form factual QA
- Difficulty: Factual accuracy focused
- Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)
- Authors: DeepSeek-AI

BenchLM stores Chinese-SimpleQA as a display-only provider-table reference for DeepSeek-V4. It is separate from the English SimpleQA row.

#### OpenBookQA (OpenBookQA)

A science question-answering benchmark that tests whether models can apply a small open-book set of elementary science facts to multi-step reasoning questions.

- Year: 2018
- Tasks: Elementary science questions
- Format: 4-way multiple choice
- Difficulty: Elementary science reasoning
- Paper: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering (https://arxiv.org/abs/1809.02789)
- Authors: Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal

OpenBookQA was designed to test grounded science reasoning rather than pure memorization. Each question is paired with a core science fact, but models still need additional commonsense knowledge to infer the correct answer.

#### HealthBench Hard (HealthBench Hard)

A harder subset of OpenAI's HealthBench for evaluating open-ended medical and health reasoning with rubric-based grading.
- Year: 2026
- Tasks: 1,000 health prompts
- Format: Open-ended health evaluation
- Difficulty: Advanced health reasoning
- Paper: Muse Spark Eval Methodology (https://ai.meta.com/static-resource/muse-spark-eval-methodology)
- Authors: Meta AI

Meta describes HealthBench Hard as a 1,000-prompt subset of OpenAI's HealthBench, graded with the same simple-evals implementation and a GPT-4.1-based judge. BenchLM treats it as a display-only health benchmark reference.

#### MedXpertQA (Text) (MedXpertQA Text)

A medical multiple-choice benchmark spanning many specialties with 10 answer options per question.

- Year: 2026
- Tasks: 2,450 medical multiple-choice questions
- Format: Medical MCQ
- Difficulty: Professional medical knowledge
- Paper: Muse Spark Eval Methodology (https://ai.meta.com/static-resource/muse-spark-eval-methodology)
- Authors: Meta AI

Meta describes the text variant as 2,450 specialty-spanning medical questions with answer choices A-J. BenchLM treats it as a display-only health benchmark because it is not part of the weighted core schema.

#### FrontierScience Research (FrontierScience Research)

A research-focused FrontierScience evaluation variant for scientific investigation and problem solving.

- Year: 2026
- Tasks: Scientific research problems
- Format: Research evaluation
- Difficulty: Frontier scientific research
- Paper: Muse Spark Eval Methodology (https://ai.meta.com/static-resource/muse-spark-eval-methodology)
- Authors: Meta AI

Meta uses FrontierScience Research in its Contemplating-mode comparison table as a distinct scientific research variant. BenchLM stores it as a display-only frontier science reference.

#### TruthfulQA (TruthfulQA)

A benchmark designed to measure whether language models produce truthful answers instead of repeating common misconceptions or misleading falsehoods.
- Year: 2021
- Tasks: Truthfulness and misconception resistance
- Format: Question answering
- Difficulty: Hallucination and factuality stress test
- Paper: TruthfulQA: Measuring How Models Mimic Human Falsehoods (https://arxiv.org/abs/2109.07958)
- Authors: Stephanie Lin, Jacob Hilton, Owain Evans

TruthfulQA matters because many models sound confident while repeating popular but false answers. It is a useful factuality and hallucination-adjacent benchmark even though it is older than newer factuality suites.

#### HLE w/o tools (Humanity's Last Exam without tools)

Tool-free variant of Humanity's Last Exam that isolates a model's raw frontier reasoning.

- Year: 2026
- Tasks: Expert-level questions
- Format: Tool-free expert QA
- Difficulty: Frontier expert level
- Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/)
- Authors: OpenAI

This variant removes external tools so the score reflects pure model performance on frontier expert questions.

#### MMLU-Pro (Arcee) (MMLU-Pro first-party comparison snapshot)

A display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart.

- Year: 2026
- Tasks: Professional academic QA
- Format: 10-way multiple choice
- Difficulty: Professional level
- Paper: Trinity-Large-Thinking: Scaling an Open Source Frontier Agent (https://www.arcee.ai/blog/trinity-large-thinking)
- Authors: Arcee AI

BenchLM stores this chart-specific MMLU-Pro row separately so it does not overwrite the standardized weighted MMLU-Pro benchmark values.

#### MMLU-Redux (MMLU-Redux)

A harder refresh of MMLU intended to keep broad knowledge evaluation useful after the original benchmark became too easy for frontier models.

- Year: 2026
- Tasks: Broad academic QA
- Format: Multiple choice questions
- Difficulty: Advanced general knowledge
- Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6)
- Authors: Qwen

MMLU-Redux is useful when MMLU itself has largely saturated.
It acts as a broader knowledge sanity check with fresher or harder questions intended to preserve separation among strong general-purpose models. #### MMMLU (MMMLU) A multilingual MMLU-style benchmark reported in provider evaluation tables. - Year: 2026 - Tasks: Multilingual academic QA - Format: Exact match - Difficulty: Broad multilingual knowledge - Paper: MMMLU (https://huggingface.co/datasets/openai/MMMLU) - Authors: OpenAI BenchLM stores MMMLU as a display-only provider-table row when exact public values are published. #### C-Eval (C-Eval) A Chinese-language academic and professional benchmark spanning humanities, social science, STEM, and applied subjects. - Year: 2023 - Tasks: Chinese academic and professional exams - Format: Multiple choice questions - Difficulty: High school to professional level - Paper: C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models (https://arxiv.org/abs/2305.08322) - Authors: C-Eval authors C-Eval is one of the clearest public signals for non-English academic knowledge performance. It tests whether a model can sustain strong factual recall and reasoning under Chinese-language exam conditions across many domains. #### CMMLU (Chinese Massive Multitask Language Understanding) A Chinese multitask academic benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Chinese academic QA - Format: Exact match - Difficulty: Broad Chinese knowledge - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores CMMLU as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### MultiLoKo (MultiLoKo) A multilingual/localized knowledge benchmark reported in DeepSeek-V4 base-model evaluations. 
- Year: 2026 - Tasks: Localized multilingual knowledge questions - Format: Exact match - Difficulty: Multilingual knowledge - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores MultiLoKo as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### FACTS Parametric (FACTS Parametric) A parametric factuality benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Parametric factual recall - Format: Exact match - Difficulty: Factual accuracy focused - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores FACTS Parametric as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### TriviaQA (TriviaQA) A reading and trivia question-answering benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Trivia and reading-comprehension QA - Format: Exact match - Difficulty: General factual QA - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores TriviaQA as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. ### Coding Benchmarks #### HumanEval (Evaluating Large Language Models Trained on Code) A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes function signature, docstring, body, and several unit tests. 
- Year: 2021 - Tasks: 164 problems - Format: Python function generation - Difficulty: Introductory to intermediate programming - Paper: Evaluating Large Language Models Trained on Code (https://arxiv.org/abs/2107.03374) - Authors: Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba HumanEval measures functional correctness for synthesizing programs from docstrings. It focuses on whether generated code actually works correctly rather than just looking syntactically correct. Problems range from simple string manipulation to more complex algorithmic challenges. #### BigCodeBench (BigCodeBench) A code-generation benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Code generation tasks - Format: Pass@1 - Difficulty: Software engineering - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores BigCodeBench as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### Codeforces (Codeforces Rating) Competitive-programming rating reported for DeepSeek-V4 thinking-mode evaluations. 
- Year: 2026 - Tasks: Competitive programming contests - Format: Rating - Difficulty: Elite competitive programming - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores Codeforces as a display-only provider-table row because its rating scale is not a 0-100 percentage benchmark. #### Terminal-Bench 2.0 (Terminal-Bench 2.0) A benchmark for agentic software engineering tasks executed in real terminal environments. DeepSeek reports it in the agentic section, while BenchLM also mirrors it in coding for models that publish it as a developer-task signal. - Year: 2026 - Tasks: Terminal-based software tasks - Format: Interactive CLI agent evaluation - Difficulty: Professional software engineering - Paper: Terminal-Bench 2.0 (https://www.tbench.ai/) - Authors: Terminal-Bench contributors Terminal-Bench 2.0 focuses on realistic CLI and repository workflows rather than toy code generation. BenchLM keeps coding-category copies display-only unless the scoring weights include them. #### SWE-bench Verified (Software Engineering Benchmark Verified) A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn. - Year: 2024 - Tasks: 500 verified issues - Format: Code patch generation - Difficulty: Professional software engineering - Paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (https://arxiv.org/abs/2310.06770) - Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan SWE-bench Verified is the gold standard for evaluating AI coding agents on real-world software engineering tasks. Each task requires understanding codebases, writing patches, and passing test suites. 
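As a rough illustration of how a resolved rate over patch-and-test tasks like these could be tallied, the sketch below assumes each task yields a per-test pass/fail map once the model's patch has been applied. The `TaskResult` type, its field names, and the sample data are hypothetical and are not part of the SWE-bench harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one issue after the model's patch is applied."""
    task_id: str
    test_outcomes: dict[str, bool]  # test name -> did it pass?


def is_resolved(result: TaskResult) -> bool:
    # A task counts as resolved only if it has tests and every one of them passes.
    return bool(result.test_outcomes) and all(result.test_outcomes.values())


def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks whose test suite passes completely."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Invented sample data, for illustration only.
results = [
    TaskResult("issue-1", {"test_a": True, "test_b": True}),
    TaskResult("issue-2", {"test_a": True, "test_b": False}),
    TaskResult("issue-3", {}),
    TaskResult("issue-4", {"test_a": True}),
]
print(resolved_rate(results))  # 50.0
```

Treating a task with no recorded test outcomes as unresolved is a design choice in this sketch: it keeps a crashed or skipped run from counting as a success.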
#### SWE-Rebench (SWE-Rebench) A continuously updated software engineering benchmark by Nebius using fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffolding; the Resolved Rate (best pass@1) is reported. - Year: 2026 - Tasks: Fresh GitHub issues (rolling window) - Format: Code patch generation - Difficulty: Professional software engineering - Paper: SWE-Rebench: Contamination-Free Evaluation of Software Engineering Agents (https://swe-rebench.com) - Authors: Nebius SWE-Rebench uses a rolling window of fresh problems from real GitHub repositories, sourced after each model's release date to prevent contamination. Each model runs 5 times with a standardized 128K-context ReAct scaffold. Unlike SWE-bench Verified (2023 problems), scores reflect consistent, up-to-date difficulty. #### LiveCodeBench (LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code) A continuously updated benchmark using fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to provide contamination-free code generation evaluation. - Year: 2024 - Tasks: Continuously updated - Format: Competitive programming - Difficulty: Competitive programming level - Paper: LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (https://arxiv.org/abs/2403.07974) - Authors: Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica LiveCodeBench addresses data contamination concerns by continuously sourcing new problems from competitive programming platforms. It evaluates code generation, self-repair, code execution, and test output prediction. #### LiveCodeBench v6 (LiveCodeBench v6) A newer LiveCodeBench slice used in provider comparison tables to benchmark contamination-resistant coding performance on fresher competitive programming sets. 
- Year: 2026 - Tasks: Fresh programming problems - Format: Competitive programming - Difficulty: Competitive programming level - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen Providers often publish a specific LiveCodeBench release or season instead of the rolling aggregate. BenchLM tracks the v6 slice separately so exact first-party values remain visible without overwriting the broader LiveCodeBench row. #### LiveCodeBench Pro (LiveCodeBench Pro) A harder competitive-programming benchmark family built from Codeforces, ICPC, and IOI problems, with quarter-specific public leaderboards and difficulty-aware reporting. - Year: 2025 - Tasks: Quarter-specific contest programming sets - Format: Competitive programming - Difficulty: High-end contest programming - Paper: LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? (https://arxiv.org/abs/2506.11928) - Authors: LiveCodeBench Pro authors LiveCodeBench Pro is distinct from the original LiveCodeBench family. It excludes LeetCode, emphasizes stronger contest difficulty, and the official site publishes quarter-specific leaderboards such as 25Q2 with hard, medium, and easy pass rates. #### FLTEval (FLTEval) A repository-level Lean 4 proof engineering benchmark that measures whether a model can complete formal proofs and correctly define new mathematical concepts inside realistic FLT project pull requests. - Year: 2026 - Tasks: FLT project pull requests - Format: Lean 4 repository task completion - Difficulty: Formal verification / proof engineering - Paper: Leanstral: Open-Source foundation for trustworthy vibe-coding (https://mistral.ai/news/leanstral) - Authors: Mistral AI FLTEval is designed to move evaluation beyond isolated competition-math problems. Instead of proving one-off statements, models must operate inside realistic formal repositories and finish pull-request-style Lean 4 work with Lean itself acting as a verifier. 
#### SWE-bench Pro (SWE-bench Pro) A stronger coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work. - Year: 2026 - Tasks: Real-world software engineering - Format: Repository task completion - Difficulty: Frontier coding agent - Paper: Why we no longer evaluate SWE-bench Verified (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) - Authors: OpenAI SWE-bench Pro is the more relevant frontier signal when selecting coding agents in 2026. It reflects more realistic difficulty than the older verified subset. #### FrontierCode (FrontierCode Diamond) A Cognition software-engineering benchmark that evaluates whether coding agents produce mergeable, production-quality pull requests, scoring correctness, tests, scope, style, and maintainability through maintainer-authored rubrics. - Year: 2026 - Tasks: 50 Diamond tasks (150 total across Extended) - Format: Repository task completion with maintainer rubrics - Difficulty: Frontier coding-agent quality - Paper: Introducing FrontierCode (https://cognition.ai/blog/frontier-code) - Authors: Cognition FrontierCode uses 150 software-engineering tasks built with maintainers of 36 open-source repositories. BenchLM displays the hardest 50-task Diamond score as a display-only coding benchmark because the tasks are private and the public rows combine models with agent harnesses such as Claude Code, Codex, Gemini CLI, mini-swe-agent, and Devin. #### SWE Multilingual (SWE Multilingual) A multilingual software-engineering benchmark for real-world code issue resolution across multiple programming languages. 
- Year: 2026 - Tasks: Multilingual software-engineering tasks - Format: Repository task completion - Difficulty: Professional software engineering - Paper: MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - Authors: MiniMax MiniMax reports SWE Multilingual as a coding benchmark focused on multilingual software-engineering tasks beyond single-language Python issue fixing. #### SWE Multimodal (SWE-bench Multimodal) A multimodal variant of SWE-bench that adds visual context such as screenshots and design mockups to software engineering issue descriptions. - Year: 2025 - Tasks: Multimodal software engineering tasks - Format: Code patch generation with visual context - Difficulty: Frontier multimodal coding - Paper: SWE-bench Multimodal (https://www.swebench.com/multimodal) - Authors: SWE-bench team BenchLM stores provider-reported SWE-bench Multimodal values in the coding category when the model vendor reports the benchmark as part of a software-engineering capability suite. #### CursorBench v3.1 (CursorBench v3.1) Cursor's first-party harder-task benchmark for long-horizon agentic coding behavior inside the Cursor agent loop. - Year: 2026 - Tasks: Harder long-horizon agentic coding tasks - Format: Cursor agent-loop evaluation - Difficulty: Professional agentic software engineering - Paper: CursorBench 3.1 (https://cursor.com/evals) - Authors: Cursor Cursor reports CursorBench v3.1 on its public evals page for ambiguous, multi-file tasks from real Cursor sessions. BenchLM tracks it as display-only because it is a first-party benchmark. #### Multi-SWE Bench (Multi-SWE Bench) A multi-language software-engineering benchmark that measures repository-level bug fixing and implementation across more than one programming ecosystem. 
- Year: 2026 - Tasks: Multi-language repo tasks - Format: Repository task completion - Difficulty: Professional software engineering - Paper: MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - Authors: MiniMax MiniMax positions Multi-SWE Bench as a benchmark closer to real engineering work than isolated code generation, emphasizing multi-language repository workflows. #### VIBE-Pro (VIBE-Pro) A repo-level code generation and full-project delivery benchmark spanning web, mobile, and simulation-style implementation tasks. - Year: 2026 - Tasks: Full project delivery tasks - Format: Repository-level implementation benchmark - Difficulty: End-to-end software delivery - Paper: MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - Authors: MiniMax MiniMax describes VIBE-Pro as an end-to-end project delivery benchmark that tests whether a model can complete substantial product requirements rather than single-file snippets. #### Vibe Code Bench (Vibe Code Bench v1.1) Vals.ai benchmark for evaluating whether models can build complete web applications from natural language specifications in a production-like development environment. - Year: 2026 - Tasks: End-to-end web application builds - Format: Full-stack app implementation benchmark - Difficulty: End-to-end software delivery - Paper: Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development (https://www.vals.ai/benchmarks/vibe-code) - Authors: Vals AI Vibe Code Bench v1.1 asks models to build full web apps with services such as Supabase, Stripe test mode, email, browsing, and file editing available. The score is overall application pass accuracy across private end-to-end app tasks. #### ProgramBench (ProgramBench: Can Language Models Rebuild Programs From Scratch?) 
A cleanroom software-engineering benchmark where agents receive only a compiled executable and documentation, then must architect and implement a complete codebase that reproduces the original program's behavior. - Year: 2026 - Tasks: 200 program reconstruction tasks - Format: Cleanroom executable reimplementation - Difficulty: Full-repository software architecture - Paper: ProgramBench: Can Language Models Rebuild Programs From Scratch? (https://programbench.com/static/paper.pdf) - Authors: John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press ProgramBench turns open-source projects into cleanroom reconstruction tasks. Each task starts from an execute-only binary and usage documentation, with no source code, internet, decompilation, or prescribed skeleton. Evaluation uses hidden behavioral tests generated through agent-driven fuzzing. BenchLM shows ProgramBench as display-only because all current public rows are tied at 0% fully resolved and the visible score is the auxiliary almost-resolved metric. #### Kimi Code Bench v2 (Kimi Code Bench v2) A Moonshot AI internal coding-agent benchmark for realistic software-engineering tasks across mainstream programming languages and production technology stacks. - Year: 2026 - Tasks: Realistic coding-agent tasks - Format: Coding-agent pass rate - Difficulty: Production software engineering - Paper: Kimi K2.7 Code (https://huggingface.co/moonshotai/Kimi-K2.7-Code) - Authors: Moonshot AI Moonshot describes Kimi Code Bench v2 as an in-house coding-agent benchmark covering backend services, infrastructure, performance engineering, systems programming, security, frontend development, and ML/data engineering. BenchLM stores provider-reported exact values as display-only launch evidence. 
#### MLS-Bench Lite (MLS-Bench Lite) A 30-task subset of MLS-Bench that evaluates whether AI systems can invent generalizable and scalable machine-learning methods. - Year: 2026 - Tasks: 30 machine-learning research tasks - Format: Agentic ML task evaluation - Difficulty: ML research and systems engineering - Paper: MLS-Bench (https://mls-bench.com/) - Authors: MLS-Bench Moonshot reports MLS-Bench Lite as a coding-agent result for Kimi K2.7 Code. BenchLM stores the provider-reported exact value separately from weighted coding benchmarks because the row is a newly reported benchmark variant with sparse public model coverage. #### NL2Repo (NL2Repo) A repository-understanding benchmark that measures whether models can map natural-language requests onto the right code locations and system changes. - Year: 2026 - Tasks: Natural language to repository tasks - Format: Repository understanding benchmark - Difficulty: System-level software comprehension - Paper: MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - Authors: MiniMax MiniMax cites NL2Repo as a system-level engineering benchmark that rewards deep understanding of complex repositories and their operational structure. #### React Native Evals (React Native Evals) An open benchmark for AI coding agents on real-world React Native implementation tasks, emphasizing working app behavior, recommended architecture choices, and strict constraint adherence. - Year: 2026 - Tasks: React Native app implementation tasks - Format: Framework-specific app development evaluation - Difficulty: Production mobile app engineering - Paper: React Native Evals (https://rn-evals.vercel.app/) - Authors: Callstack React Native Evals focuses on framework-specific mobile work that generic coding benchmarks often miss. The public dashboard groups tasks into areas like navigation, animation, and async state, with repeated runs and cost tracking across models. 
#### Next.js Evals (AI Agent Evaluations for Next.js) A Vercel benchmark for AI coding agents on Next.js code generation and migration tasks, reporting success rate, average execution time, and an AGENTS.md documentation-assisted split. - Year: 2026 - Tasks: 24 Next.js code generation and migration tasks - Format: Agent task completion with withheld Vitest assertions - Difficulty: Framework-specific web application engineering - Paper: AI Agent Evaluations | Next.js (https://nextjs.org/evals) - Authors: Vercel Next.js Evals focuses on framework-specific web engineering tasks such as Pages Router to App Router migration, server actions, cache directives, proxy middleware, async cookies and headers, and other current Next.js patterns. BenchLM mirrors the public leaderboard as display-only because rows combine model choice with an agent harness. #### SWE-bench Verified* (SWE-bench Verified (mini-swe-agent-v2)) A display-only SWE-bench Verified reference from Arcee AI's Trinity-Large-Thinking comparison chart. - Year: 2026 - Tasks: Repository task completion - Format: Agent scaffold benchmark - Difficulty: Professional software engineering - Paper: Trinity-Large-Thinking: Scaling an Open Source Frontier Agent (https://www.arcee.ai/blog/trinity-large-thinking) - Authors: Arcee AI BenchLM stores this chart-specific SWE-bench Verified row separately because Arcee notes all models were evaluated in mini-swe-agent-v2. #### Spider 2.0-Lite (Spider 2.0-Lite) A text-to-SQL benchmark over realistic warehouse-scale schemas, reported by Interfaze for model comparison. - Year: 2024 - Tasks: Text-to-SQL queries - Format: Execution accuracy - Difficulty: Enterprise text-to-SQL - Paper: Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows (https://github.com/xlang-ai/Spider2) - Authors: Spider 2.0 authors Spider 2.0-Lite tests whether a model can generate executable SQL from natural-language questions against realistic database schemas. 
BenchLM tracks Interfaze's SQLite subset score as a display-only coding and data benchmark. #### SciCode (Scientific Code Benchmark) SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis, and are based on real scripts from published research. - Year: 2024 - Tasks: 80 problems (338 subproblems) #### AA Coding Index (Artificial Analysis Coding Index) A display-only Artificial Analysis coding index. - Year: 2026 - Tasks: Cross-benchmark coding index - Format: Aggregated model score - Difficulty: Display-only external reference - Paper: Artificial Analysis model leaderboards (https://artificialanalysis.ai/leaderboards/models) - Authors: Artificial Analysis BenchLM mirrors this coding index for comparison, but does not use it as a weighted coding benchmark row. #### AA Coding Agents (Artificial Analysis Coding Agent Index) A display-only Artificial Analysis leaderboard for coding-agent systems, combining agent harnesses, host models, and execution settings across software-engineering benchmarks. - Year: 2026 - Tasks: Composite over DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA - Format: Average pass@1 index - Difficulty: Real-world coding-agent workflows - Paper: Artificial Analysis Coding Agent Benchmarks (https://artificialanalysis.ai/agents/coding-agents) - Authors: Artificial Analysis BenchLM mirrors the Artificial Analysis Coding Agent Index v1.1 page as a display-only external leaderboard. The source combines DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA component scores and publishes cost, token, and execution-time metadata. Rows are coding-agent systems rather than pure base-model results.
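To make the "average pass@1 index" format concrete, here is a minimal sketch that assumes the composite is an unweighted mean of the three component scores named above. The actual Artificial Analysis weighting is not stated here, so the equal weights, the `composite_index` helper, and the sample scores are all assumptions for illustration.

```python
from statistics import fmean

# Component benchmarks named in the entry above.
COMPONENTS = ("DeepSWE", "Terminal-Bench v2", "SWE-Atlas-QnA")


def composite_index(scores: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark pass@1 scores (assumed weighting)."""
    missing = [name for name in COMPONENTS if name not in scores]
    if missing:
        # Refuse to average over a partial set, which would inflate or deflate the index.
        raise ValueError(f"missing component scores: {missing}")
    return fmean(scores[name] for name in COMPONENTS)


# Invented pass@1 percentages, not real leaderboard values.
example = {"DeepSWE": 40.0, "Terminal-Bench v2": 50.0, "SWE-Atlas-QnA": 60.0}
print(composite_index(example))  # 50.0
```

Because each row pairs a model with an agent harness, two rows for the same model can produce different index values under this kind of averaging.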
#### AA-SciCode (Artificial Analysis SciCode) A display-only Artificial Analysis SciCode score. - Year: 2026 - Tasks: Scientific coding subproblems - Format: Task success rate - Difficulty: Scientific programming - Paper: Artificial Analysis SciCode Benchmark Leaderboard (https://artificialanalysis.ai/evaluations/scicode) - Authors: Artificial Analysis BenchLM stores the Artificial Analysis SciCode result separately from the weighted SciCode lane so AA refreshes remain display-only. #### Terminal-Bench Hard (Terminal-Bench Hard) A display-only Artificial Analysis coding metric for agentic coding and terminal use on a harder Terminal-Bench slice. - Year: 2026 - Tasks: Agentic coding and terminal tasks - Format: Task success rate - Difficulty: Professional software engineering - Paper: Artificial Analysis model benchmarks (https://artificialanalysis.ai/models/grok-4-3) - Authors: Artificial Analysis BenchLM stores Terminal-Bench Hard separately from Terminal-Bench 2.0 because OpenRouter and Artificial Analysis publish it as a distinct benchmark card. #### VIBE V2 (VIBE V2) A display-only MiniMax provider benchmark for end-to-end coding-agent and product-building tasks. - Year: 2026 - Tasks: End-to-end coding-agent tasks - Format: Task success rate - Difficulty: Frontier coding-agent workflows - Paper: MiniMax M3 model card (https://huggingface.co/MiniMaxAI/MiniMax-M3) - Authors: MiniMax MiniMax reports VIBE V2 in the M3 comparison chart. BenchLM tracks it as a display-only provider row because it is not part of the weighted coding schema. #### SVG-Bench (SVG-Bench) A display-only provider benchmark for generating or manipulating SVG outputs from natural-language requirements. - Year: 2026 - Tasks: SVG generation and editing tasks - Format: Task success rate - Difficulty: Visual coding and structured graphics generation - Paper: MiniMax M3 model card (https://huggingface.co/MiniMaxAI/MiniMax-M3) - Authors: MiniMax MiniMax reports SVG-Bench in the M3 comparison chart. 
BenchLM stores it as display-only because it is a provider-table row outside the weighted schema. #### KernelBench Hard (KernelBench Hard) A display-only benchmark for difficult GPU kernel implementation and optimization tasks. - Year: 2026 - Tasks: Hard GPU kernel coding tasks - Format: Task success rate - Difficulty: Specialized systems programming - Paper: MiniMax M3 model card (https://huggingface.co/MiniMaxAI/MiniMax-M3) - Authors: MiniMax MiniMax reports KernelBench Hard in the M3 comparison chart. BenchLM keeps it separate from Terminal-Bench Hard and excludes it from weighted coding scores. ### Mathematics Benchmarks #### AIME 2023 (American Invitational Mathematics Examination 2023) A 15-question, 3-hour examination where each answer is an integer from 000 to 999. Serves as the intermediate step between AMC 10/12 and the USA Mathematical Olympiad (USAMO). - Year: 2023 - Tasks: 15 problems - Format: Integer answers 000-999 - Difficulty: High school olympiad level - Paper: American Invitational Mathematics Examination (https://www.maa.org/math-competitions/aime) - Authors: Mathematical Association of America AIME is designed for students who score well on AMC 10/12. Problems require creative problem-solving and mathematical insight beyond standard high school curriculum. Only the top scorers qualify for USAMO. #### AIME 2024 (American Invitational Mathematics Examination 2024) The 2024 edition of AIME, maintaining the same format of 15 challenging mathematics problems with integer answers from 000 to 999. - Year: 2024 - Tasks: 15 problems - Format: Integer answers 000-999 - Difficulty: High school olympiad level - Paper: American Invitational Mathematics Examination (https://www.maa.org/math-competitions/aime) - Authors: Mathematical Association of America AIME 2024 continues the tradition of challenging mathematical reasoning problems. These problems test deep understanding of mathematical concepts and creative problem-solving abilities. 
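The integer-answer format makes AIME straightforward to grade by exact match. The sketch below assumes model outputs have already been reduced to a final answer string; `parse_aime_answer`, `score_aime`, and the sample answer key are hypothetical names and data, not an official grader.

```python
import re
from typing import Optional


def parse_aime_answer(text: str) -> Optional[int]:
    """Return the answer as an int in 0..999, or None if it is not a valid AIME answer."""
    # Accept one to three ASCII digits so that "073" and "73" both parse to 73.
    match = re.fullmatch(r"[0-9]{1,3}", text.strip())
    return int(match.group()) if match else None


def score_aime(predictions: list[str], answer_key: list[int]) -> float:
    """Percentage of problems answered exactly right."""
    if not answer_key or len(predictions) != len(answer_key):
        raise ValueError("one prediction is required per problem")
    correct = sum(parse_aime_answer(p) == a for p, a in zip(predictions, answer_key))
    return 100.0 * correct / len(answer_key)


# Invented four-problem example; a real exam has 15 problems.
answer_key = [73, 5, 900, 12]
predictions = ["073", " 5 ", "9000", "12"]
print(score_aime(predictions, answer_key))  # 75.0
```

On a real 15-problem exam, each problem is worth about 6.67 percentage points, which is why single-run AIME scores move in coarse steps.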
#### AIME 2025 (American Invitational Mathematics Examination 2025) The 2025 edition of AIME, featuring 15 challenging mathematics problems testing olympiad-level mathematical reasoning with integer answers from 000 to 999. - Year: 2025 - Tasks: 15 problems - Format: Integer answers 000-999 - Difficulty: High school olympiad level - Paper: American Invitational Mathematics Examination (https://www.maa.org/math-competitions/aime) - Authors: Mathematical Association of America AIME 2025 sits at the intermediate olympiad level, between AMC 10/12 and USAMO. Success requires sophisticated mathematical reasoning and problem-solving techniques. #### GSM8K (Grade School Math 8K) A grade-school mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Grade-school math word problems - Format: Exact match - Difficulty: Grade-school math - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores GSM8K as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### MATH (MATH) A competition-style mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Competition math problems - Format: Exact match - Difficulty: Advanced math reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores MATH as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### CMath (CMath) A Chinese mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.
- Year: 2026 - Tasks: Chinese math problems - Format: Exact match - Difficulty: Math reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores CMath as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### AIME25 (Arcee) (AIME25 first-party comparison snapshot) A display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart. - Year: 2026 - Tasks: 15 problems - Format: Integer answers 000-999 - Difficulty: High school olympiad level - Paper: Trinity-Large-Thinking: Scaling an Open Source Frontier Agent (https://www.arcee.ai/blog/trinity-large-thinking) - Authors: Arcee AI BenchLM stores the Arcee chart version of AIME25 separately so it does not overwrite the weighted AIME 2025 benchmark row. #### HMMT Feb 2023 (Harvard-MIT Mathematics Tournament February 2023) A prestigious high school mathematics competition hosted jointly by Harvard and MIT, featuring challenging problems across various mathematical disciplines. - Year: 2023 - Tasks: Tournament problems - Format: Competition mathematics - Difficulty: High school olympiad level - Paper: Harvard-MIT Mathematics Tournament (https://www.hmmt.org/) - Authors: Harvard and MIT Mathematics Departments HMMT is one of the most competitive high school mathematics tournaments in the US. Problems span algebra, geometry, combinatorics, and number theory, requiring deep mathematical insight. #### HMMT Feb 2024 (Harvard-MIT Mathematics Tournament February 2024) The 2024 February edition of the Harvard-MIT Mathematics Tournament, continuing the tradition of challenging high school mathematics competition. 
- Year: 2024 - Tasks: Tournament problems - Format: Competition mathematics - Difficulty: High school olympiad level - Paper: Harvard-MIT Mathematics Tournament (https://www.hmmt.org/) - Authors: Harvard and MIT Mathematics Departments HMMT Feb 2024 maintains the high standards of mathematical rigor and creativity expected from this premier competition. Problems test advanced mathematical reasoning skills. #### HMMT Feb 2025 (Harvard-MIT Mathematics Tournament February 2025) The February 2025 edition of the Harvard-MIT Mathematics Tournament, featuring challenging problems in competitive mathematics. - Year: 2025 - Tasks: Tournament problems - Format: Competition mathematics - Difficulty: High school olympiad level - Paper: Harvard-MIT Mathematics Tournament (https://www.hmmt.org/) - Authors: Harvard and MIT Mathematics Departments HMMT Feb 2025 is a high-difficulty high school mathematics competition set, with problems designed to challenge the strongest contestants. #### BRUMO 2025 (Brown University Math Olympiad 2025) A challenging mathematical olympiad competition featuring problems that test advanced mathematical reasoning and problem-solving skills at the olympiad level. - Year: 2025 - Tasks: Olympiad problems - Format: Mathematical olympiad - Difficulty: Mathematical olympiad level - Paper: Brown University Math Olympiad (https://www.brumo.org/) - Authors: Brown University Math Olympiad organizers BRUMO is a high school mathematics competition hosted at Brown University, featuring problems that require deep mathematical insight and creative problem-solving approaches. #### MATH-500 (MATH-500 Problem Set) A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.
- Year: 2021 - Tasks: 500 problems - Format: Free-form mathematical answers - Difficulty: High school to undergraduate - Paper: Measuring Mathematical Problem Solving With the MATH Dataset (https://arxiv.org/abs/2103.03874) - Authors: Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt MATH-500 is one of the most widely cited math benchmarks. It is nearing saturation with top reasoning models scoring 96-99%, making it less useful for differentiating frontier models but still a standard baseline. #### AIME26 (AIME 2026) A 2026 American Invitational Mathematics Examination snapshot used in frontier-model comparison tables for mathematical reasoning. - Year: 2026 - Tasks: Competition math problems - Format: Short-answer mathematics - Difficulty: Olympiad-style mathematics - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen AIME-style benchmarks remain one of the fastest ways to separate top reasoning models on olympiad-style math. AIME 2026 is a newer contest-year snapshot than the legacy AIME rows already tracked on BenchLM. #### IPhO 2025 (Theory) (International Physics Olympiad 2025 (Theory)) The three official theory problems from the 2025 International Physics Olympiad, scored with blinded human evaluation. - Year: 2026 - Tasks: 3 olympiad theory problems - Format: Physics olympiad theory - Difficulty: International olympiad physics - Paper: Muse Spark Eval Methodology (https://ai.meta.com/static-resource/muse-spark-eval-methodology) - Authors: Meta AI Meta reports IPhO 2025 Theory as a Contemplating-mode comparison benchmark, scored with human evaluation guided by the official olympiad rubric. BenchLM stores it as a display-only advanced physics benchmark. #### HMMT Feb 2025 (Harvard-MIT Mathematics Tournament February 2025) A February 2025 HMMT slice used in exact-value provider tables for advanced contest-math reasoning. 
- Year: 2025 - Tasks: Competition math problems - Format: Contest mathematics - Difficulty: Olympiad-style mathematics - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen BenchLM stores this HMMT monthly slice separately from the aggregate HMMT rows so first-party exact values remain visible without overwriting the broader yearly HMMT reference. #### HMMT Nov 2025 (Harvard-MIT Mathematics Tournament November 2025) A November 2025 HMMT slice for high-end mathematical reasoning comparisons. - Year: 2025 - Tasks: Competition math problems - Format: Contest mathematics - Difficulty: Olympiad-style mathematics - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen This row preserves exact provider-table values from the late-2025 HMMT contest cycle. It is useful for spotting whether frontier models generalize across separate contest sets rather than a single annual rollup. #### HMMT Feb 2026 (Harvard-MIT Mathematics Tournament February 2026) A February 2026 HMMT slice used in newer frontier-model math comparisons. - Year: 2026 - Tasks: Competition math problems - Format: Contest mathematics - Difficulty: Olympiad-style mathematics - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen HMMT February 2026 matters because small score deltas at the frontier often depend on which contest set is used. BenchLM keeps this newer slice distinct from older HMMT summary rows. #### IMOAnswerBench (IMOAnswerBench) A challenging mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations. 
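
IMOAnswerBench and the Apex rows list Pass@1 as their format. As a rough illustration of what a pass@k number summarizes, here is a minimal sketch of the commonly used unbiased estimator computed from `n` sampled answers of which `c` are correct. The function name `pass_at_k` is illustrative, and whether any provider computes its reported Pass@1 exactly this way is an assumption, not something the source tables state.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: total samples drawn for a problem
    c: number of those samples that were correct
    k: sample budget being scored (k=1 gives Pass@1)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 the estimator reduces to the plain fraction of correct samples.
print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score would then be the mean of this value over all problems.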
- Year: 2026 - Tasks: Advanced mathematical answer generation - Format: Pass@1 math benchmark - Difficulty: Olympiad-level mathematics - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores IMOAnswerBench as a display-only provider-table row when exact values are published for frontier math comparisons. #### Apex (Apex) A high-difficulty mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations. - Year: 2026 - Tasks: Advanced mathematical reasoning - Format: Pass@1 math benchmark - Difficulty: Frontier math reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores Apex as a display-only provider-table reference for exact first-party model comparisons. #### Apex Shortlist (Apex Shortlist) A shortlist subset of the Apex mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations. - Year: 2026 - Tasks: Advanced mathematical reasoning - Format: Pass@1 math benchmark - Difficulty: Frontier math reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores Apex Shortlist separately from the broader Apex row so provider-reported table values remain traceable. #### MMAnswerBench (MMAnswerBench) A multimodal mathematical reasoning benchmark that tests whether models can answer visually grounded math questions correctly. - Year: 2026 - Tasks: Multimodal math questions - Format: Visual and structured mathematical QA - Difficulty: Advanced mathematical reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen MMAnswerBench matters because text-only math ability does not guarantee strong performance when the relevant information is embedded in diagrams, tables, or other visual inputs. 
It acts as a multimodal math transfer check. #### FrontierMath (FrontierMath) An expert-level mathematical reasoning benchmark by Epoch AI featuring original, research-level problems created by mathematicians including IMO gold medalists and Fields Medal recipients. Problems require deep creativity and multi-step reasoning. - Year: 2024 - Tasks: 350 original research-level math problems - Format: Open-ended mathematical reasoning with tool access - Difficulty: Research-level mathematics - Paper: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI (https://epoch.ai/frontiermath) - Authors: Epoch AI FrontierMath is the hardest public math benchmark. It consists of 300 Tier 1-3 problems and 50 Tier 4 problems, all original and unpublished. Models are evaluated with access to Python and computational tools. Top models score under 50%, making it a critical discriminator for frontier mathematical reasoning. #### USAMO 2026 (United States of America Mathematical Olympiad 2026) The premier US mathematical olympiad competition, featuring proof-based problems that require deep mathematical insight and rigorous argumentation at the highest competition level. - Year: 2026 - Tasks: 6 proof-based problems - Format: Mathematical proof construction - Difficulty: International olympiad level - Paper: United States of America Mathematical Olympiad (https://www.maa.org/math-competitions/usamo) - Authors: Mathematical Association of America USAMO represents the highest tier of US math competitions, serving as the selection exam for the International Mathematical Olympiad team. Problems require full proofs rather than just numerical answers. Mythos Preview scored 97.6%, GPT-5.4 scored 95.2%, Gemini 3.1 Pro scored 74.4%. ### Reasoning Benchmarks #### MuSR (Testing the Limits of Chain-of-thought with Multistep Soft Reasoning) A dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. 
Tests the ability to perform complex, structured reasoning. - Year: 2023 - Tasks: Multi-step reasoning - Format: Narrative-based reasoning - Difficulty: Complex reasoning tasks - Paper: MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning (https://arxiv.org/abs/2310.16049) - Authors: Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett MuSR challenges models to perform multistep reasoning over complex narratives. Unlike simple factual questions, it requires models to track multiple entities, relationships, and logical steps across extended contexts. #### BBH (BIG-Bench Hard) A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark where prior language models failed to exceed average human performance, even with chain-of-thought prompting. - Year: 2022 - Tasks: 23 tasks - Format: Mixed reasoning tasks - Difficulty: Advanced reasoning - Paper: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (https://arxiv.org/abs/2210.09261) - Authors: Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei BBH focuses on 23 tasks from BIG-Bench that remain challenging for language models. Tasks include logical deduction, tracking shuffled objects, causal judgement, and other complex reasoning scenarios. #### DROP (Discrete Reasoning Over Paragraphs) A reading-comprehension benchmark requiring discrete reasoning over paragraphs, reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Paragraph reasoning questions - Format: F1 - Difficulty: Reading and numerical reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores DROP as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. 
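
DROP's F1 format scores token overlap between a predicted answer and a reference answer rather than requiring an exact string match. The sketch below assumes lowercase whitespace tokenization only; the official DROP scorer applies further normalization and handles multi-span answers, so treat `token_f1` as a simplified, hypothetical helper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two answer strings (simplified normalization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the 4 touchdowns", "4 touchdowns"))
```

Partial credit is the point of the metric: an answer with one extra token still scores well, while an answer with no shared tokens scores zero.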
#### HellaSwag (HellaSwag) A commonsense natural-language inference benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Commonsense completion questions - Format: Exact match - Difficulty: Commonsense reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores HellaSwag as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### WinoGrande (WinoGrande) A commonsense coreference benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Coreference resolution questions - Format: Exact match - Difficulty: Commonsense reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores WinoGrande as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### CLUEWSC (CLUEWSC) A Chinese Winograd Schema Challenge benchmark reported in DeepSeek-V4 base-model evaluations. - Year: 2026 - Tasks: Chinese coreference questions - Format: Exact match - Difficulty: Chinese commonsense reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores CLUEWSC as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### LisanBench (LisanBench) A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains. 
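
The chain rule described here can be checked mechanically. The following is a minimal sketch that counts how many links of a proposed chain are accepted before the first repeated word or the first step that is not exactly one edit away. It assumes "edit distance 1" means one insertion, deletion, or substitution, and it omits any dictionary-validity check and the difficulty weighting; the names `is_one_edit` and `valid_chain_length` are illustrative, not taken from the benchmark.

```python
def is_one_edit(a: str, b: str) -> bool:
    """True when a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]  # one extra character in b at position i


def valid_chain_length(start: str, chain: list[str]) -> int:
    """Count links accepted before the first repeat or invalid step."""
    seen = {start}
    previous = start
    accepted = 0
    for word in chain:
        if word in seen or not is_one_edit(previous, word):
            break
        seen.add(word)
        previous = word
        accepted += 1
    return accepted


print(valid_chain_length("cat", ["cot", "cog", "dog", "cog"]))
```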
- Year: 2026 - Tasks: 50 starting words × 3 trials - Format: Difficulty-weighted word-chain reasoning - Difficulty: Open-ended lexical planning - Paper: LisanBench methodology (https://lisanbench.com/?tab=about) - Authors: voice-from-the-outer-world BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 128 model variants across 50 starting words with 3 trials per word. #### Pencil Puzzle Bench (Pencil Puzzle Bench) A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions. - Year: 2026 - Tasks: 300 evaluation puzzles - Format: Direct and agentic puzzle solve rate - Difficulty: Multi-step verifiable reasoning - Paper: Pencil Puzzle Bench (https://arxiv.org/abs/2603.02119) - Authors: Approximate Labs BenchLM mirrors the public Pencil Puzzle Bench leaderboard as a display-only reasoning benchmark. The public site reports direct-ask and agentic solve rates across a 300-puzzle evaluation selection from the 62,231-puzzle dataset. #### LongBench v2 (LongBench v2) A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval. - Year: 2025 - Tasks: Long-context tasks - Format: Extended-context retrieval and reasoning - Difficulty: Hard long-context - Paper: LongBench v2 (https://arxiv.org/abs/2412.15204) - Authors: LongBench v2 authors LongBench v2 is useful because context-window size alone is not a capability. It measures whether a model can retain, retrieve, and reason over long inputs effectively. #### MRCRv2 (MRCRv2) A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts. 
- Year: 2025 - Tasks: Long-context retrieval - Format: Multi-round long-context evaluation - Difficulty: Hard long-context - Paper: Introducing GPT-5.2 and GPT-5.2 Pro (https://openai.com/index/introducing-gpt-5-2/) - Authors: OpenAI MRCRv2 is especially useful for models that compete on long context, since it checks whether they can retrieve the right information across long, multi-round interactions. #### MRCR v2 64K-128K (OpenAI MRCR v2 8-needle 64K-128K) MRCR v2 slice focused on long-context retrieval at 64K-128K lengths. - Year: 2026 - Tasks: 8-needle retrieval tasks - Format: Long-context retrieval - Difficulty: Long-context reasoning - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI Measures whether models can recover the right details when multiple relevant items are buried in long contexts. #### MRCR v2 128K-256K (OpenAI MRCR v2 8-needle 128K-256K) MRCR v2 slice focused on very long contexts at 128K-256K lengths. - Year: 2026 - Tasks: 8-needle retrieval tasks - Format: Very-long-context retrieval - Difficulty: Very long-context reasoning - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI A harder MRCR setting that stresses memory discipline and retrieval deeper into long contexts. #### Graphwalks BFS 128K (Graphwalks BFS 0K-128K) Long-context graph traversal benchmark using breadth-first search tasks. - Year: 2026 - Tasks: Graph traversal tasks - Format: Long-context graph reasoning - Difficulty: Algorithmic long-context reasoning - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI Graphwalks BFS tests whether a model can preserve algorithmic state while traversing graph structures across long contexts. #### Graphwalks Parents 128K (Graphwalks parents 0-128K) Long-context benchmark for recovering parent relationships inside graph tasks. 
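
Both Graphwalks rows ask for answers that are simple to compute from the graph itself; the difficulty comes from the graph being buried in a long prompt. The sketch below shows how reference answers for the two task types could be derived from a directed edge list. The edge-list representation, the "exactly at depth d" reading of the BFS task, and the function names are assumptions made for illustration.

```python
from collections import deque

Edge = tuple[str, str]  # (source, destination) in a directed graph


def parents(edges: list[Edge], node: str) -> set[str]:
    """Nodes with an edge pointing directly at `node`."""
    return {src for src, dst in edges if dst == node}


def bfs_frontier(edges: list[Edge], start: str, depth: int) -> set[str]:
    """Nodes whose shortest-path distance from `start` is exactly `depth`."""
    adjacency: dict[str, list[str]] = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == depth:
            continue  # no need to expand beyond the requested depth
        for neighbor in adjacency.get(current, []):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return {n for n, d in distance.items() if d == depth}


edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(bfs_frontier(edges, "a", 2), parents(edges, "d"))
```

A grader would compare a model's returned node set against these reference sets.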
- Year: 2026 - Tasks: Graph parent-retrieval tasks - Format: Long-context graph reasoning - Difficulty: Algorithmic long-context reasoning - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI Measures whether models can keep structural relationships straight across long contexts. #### MRCR 1M (MRCR 1M) A million-token MRCR long-context retrieval benchmark reported in DeepSeek-V4 model evaluations. - Year: 2026 - Tasks: Million-token retrieval - Format: Long-context retrieval MMR - Difficulty: Million-token long context - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores this DeepSeek-reported MRCR 1M value as a display-only row distinct from the existing MRCRv2 keys. #### CorpusQA 1M (CorpusQA 1M) A million-token CorpusQA long-context question-answering benchmark reported in DeepSeek-V4 model evaluations. - Year: 2026 - Tasks: Million-token corpus question answering - Format: Long-context QA accuracy - Difficulty: Million-token long context - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI CorpusQA 1M is tracked as a display-only provider-table row for long-context DeepSeek-V4 comparisons. #### ARC-AGI-2 (Abstraction and Reasoning Corpus for AGI v2) A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — average individual human performance is 66%. 
- Year: 2025 - Tasks: Visual pattern completion and abstract reasoning - Format: Grid transformation puzzles with novel rules - Difficulty: Expert-level — hardest public reasoning benchmark - Paper: ARC-AGI-2: A Harder General Intelligence Benchmark (https://arcprize.org/arc-agi/2/) - Authors: Francois Chollet, ARC Prize Foundation ARC-AGI-2 extends the original ARC benchmark with harder puzzles designed to test genuine fluid intelligence. Four major AI labs (Anthropic, Google, OpenAI, xAI) now report their model performance on this benchmark. Average individual human performance is 66%, the human panel completion rate is 100%, and the grand prize threshold is greater than 85%. Top frontier models reach 75-85 in BenchLM's tracked data, making it one of the few benchmarks that still separates current reasoning systems. #### AI-Needle (AI-Needle) A long-context retrieval benchmark that measures whether a model can recover relevant information embedded deep inside very long contexts. - Year: 2026 - Tasks: Long-context retrieval - Format: Needle-in-a-haystack recall - Difficulty: Long-context memory - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen AI-Needle is useful for testing whether very large context windows are actually usable rather than just headline numbers. It rewards precise recall under distractors and long-document clutter. #### GPQA Diamond (GPQA Diamond) The hardest subset of GPQA featuring the most challenging graduate-level science questions. Sometimes reported separately from the standard GPQA benchmark. - Year: 2023 - Tasks: Expert-level science questions - Format: Multiple choice questions - Difficulty: Graduate-level scientific reasoning - Paper: GPQA: A Graduate-Level Google-Proof Q&A Benchmark (https://arxiv.org/abs/2311.12022) - Authors: David Rein et al. GPQA Diamond is the 'diamond' difficulty tier of GPQA, containing the most expert-validated and challenging questions. 
It is often cited in system cards as a standalone benchmark for scientific reasoning. #### AA-LCR (Artificial Analysis Long Context Reasoning) A display-only Artificial Analysis long-context reasoning evaluation. - Year: 2026 - Tasks: Long-context reasoning tasks - Format: Accuracy - Difficulty: Long-context reasoning - Paper: Artificial Analysis model benchmarks (https://artificialanalysis.ai/models/grok-4-3) - Authors: Artificial Analysis BenchLM stores AA-LCR as a display-only row when OpenRouter or Artificial Analysis publishes the exact long-context reasoning card value. #### CritPt (Critical Physics Tasks) A display-only Artificial Analysis metric for research-level physics reasoning. - Year: 2026 - Tasks: Research-level physics questions - Format: Accuracy - Difficulty: Research-level physics reasoning - Paper: Artificial Analysis model benchmarks (https://artificialanalysis.ai/models/grok-4-3) - Authors: Artificial Analysis BenchLM stores CritPt as a display-only research physics reasoning row from Artificial Analysis benchmark cards. #### BullshitBench v2 (BullshitBench v2) A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. Measures the critical ability to push back on bad input. - Year: 2025 - Tasks: Nonsensical and flawed prompts across multiple domains - Format: Prompt challenge and refusal evaluation - Difficulty: Robustness and critical reasoning - Paper: BullshitBench: Measuring whether AI models challenge nonsensical prompts (https://petergpt.github.io/bullshit-benchmark/) - Authors: Peter Gostev BullshitBench evaluates a crucial real-world capability: knowing when NOT to answer. Models that score highly recognize flawed premises, impossible physics scenarios, and logical contradictions rather than hallucinating plausible-sounding responses. V2 includes harder and more diverse challenge categories. 
#### WildBench (WildBench) An automated evaluation framework using 1,000+ real-world user tasks covering reasoning, planning, coding, and creative writing. Highly correlated with Chatbot Arena human preference rankings. - Year: 2024 - Tasks: 1,024 real-world tasks - Format: Real-world task evaluation - Difficulty: Diverse real-world scenarios - Paper: WildBench: Benchmarking Language Models with Challenging Tasks from Real Users in the Wild (https://arxiv.org/abs/2406.04770) - Authors: Bill Yuchen Lin et al. WildBench bridges the gap between static benchmarks and human preference evaluations. Tasks are derived from real ChatGPT conversations, making it more representative of actual user needs than synthetic benchmarks. ### Instruction Following Benchmarks #### IFEval (Instruction-Following Eval) A benchmark that evaluates language models' ability to follow verifiable instructions such as formatting constraints, keyword inclusion/exclusion, length limits, and structural requirements. - Year: 2023 - Tasks: 500+ instructions - Format: Constrained generation - Difficulty: Instruction precision - Paper: Instruction-Following Evaluation for Large Language Models (https://arxiv.org/abs/2311.07911) - Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou IFEval uses verifiable instructions to objectively measure instruction-following ability. Instructions include requirements like 'write in all caps', 'include exactly 3 bullet points', or 'respond in JSON format', making evaluation automated and reproducible. #### IFBench (Instruction Following Benchmark) IFBench evaluates precise instruction-following generalization on 58 challenging, verifiable out-of-domain constraints. Unlike IFEval which tests familiar constraint types, IFBench specifically measures how well models follow novel instructions they haven't been optimized for, exposing overfitting to common instruction patterns. 
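
Verifiable constraints of the kind IFEval and IFBench rely on can be graded by plain code instead of a judge model. Here is a minimal sketch of three checkers matching the example instructions quoted above ('write in all caps', 'include exactly 3 bullet points', 'respond in JSON format'). The function names and the exact matching rules, such as which characters count as bullet markers, are assumptions rather than the benchmarks' official implementations.

```python
import json


def is_all_caps(response: str) -> bool:
    """Every alphabetic character is uppercase, and at least one exists."""
    letters = [ch for ch in response if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)


def has_exact_bullets(response: str, count: int = 3) -> bool:
    """Exactly `count` lines start with a '- ' or '* ' bullet marker."""
    bullets = [
        line
        for line in response.splitlines()
        if line.lstrip().startswith(("- ", "* "))
    ]
    return len(bullets) == count


def is_json(response: str) -> bool:
    """The whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


print(is_all_caps("HELLO, WORLD"), has_exact_bullets("- a\n- b\n* c"), is_json('{"a": 1}'))
```

Because each check returns a boolean, a score can be reported as the fraction of instructions satisfied, which is what makes this style of evaluation reproducible.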
- Year: 2025 - Tasks: 58 verifiable constraints #### AA-IFBench (Artificial Analysis IFBench) A display-only Artificial Analysis IFBench score. - Year: 2026 - Tasks: Verifiable instruction constraints - Format: Constraint satisfaction accuracy - Difficulty: Instruction precision - Paper: Artificial Analysis IFBench Benchmark Leaderboard (https://artificialanalysis.ai/evaluations/ifbench) - Authors: Artificial Analysis BenchLM stores the Artificial Analysis IFBench result separately from the weighted IFBench lane so AA refreshes remain display-only. #### SOB Value Acc (Structured Output Benchmark Value Accuracy) A structured-output benchmark from Interfaze measuring whether extracted JSON leaf values exactly match verified ground truth. - Year: 2026 - Tasks: Structured output extraction - Format: Value accuracy - Difficulty: Production structured-output reliability - Paper: Structured Output Benchmark Leaderboard (https://interfaze.ai/leaderboards/structured-output-benchmark) - Authors: Interfaze SOB Value Accuracy goes beyond JSON parse success: it measures whether values in the structured response are correct and grounded in the source context across text, image, and audio-normalized inputs. ### Multilingual Benchmarks #### MGSM (Multilingual Grade School Math) A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.
- Year: 2022 - Tasks: 250 problems × 11 languages (the 10 translated languages plus the English originals) - Format: Math word problems - Difficulty: Grade school math, multilingual - Paper: Language Models are Multilingual Chain-of-Thought Reasoners (https://arxiv.org/abs/2210.03057) - Authors: Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei MGSM evaluates mathematical reasoning across languages and shows that performance varies significantly by language, with lower-resource languages (Bengali, Swahili, Telugu) typically showing the largest gaps. #### MMLU-ProX (MMLU-ProX) A multilingual extension of professional-level academic evaluation across many languages. - Year: 2025 - Tasks: Multilingual professional QA - Format: Multilingual multiple choice - Difficulty: Professional multilingual - Paper: MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation (https://arxiv.org/abs/2503.10497) - Authors: MMLU-ProX authors MMLU-ProX expands multilingual evaluation beyond translated arithmetic, making it a better signal for broad cross-lingual reasoning and knowledge. #### NOVA-63 (NOVA-63) A broad multilingual benchmark row from Qwen's launch comparisons intended to measure cross-lingual capability beyond a single language family. - Year: 2026 - Tasks: Broad multilingual evaluation - Format: Cross-lingual benchmark - Difficulty: Broad multilingual capability - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen NOVA-63 appears in multilingual comparison tables as a harder broad-language benchmark than standard translated math. BenchLM tracks it as a display-only multilingual capability signal until a cleaner public benchmark specification is available. #### INCLUDE (INCLUDE) A multilingual benchmark used in provider tables to measure inclusive language coverage and cross-lingual understanding beyond common high-resource languages.
- Year: 2026 - Tasks: Cross-lingual understanding - Format: Multilingual benchmark - Difficulty: Broad multilingual capability - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen INCLUDE is useful as a multilingual breadth check because it is intended to reward stronger performance across a wider and less English-centric language set than basic translated math benchmarks. #### PolyMath (PolyMath) A multilingual mathematical reasoning benchmark that tests whether math performance transfers across languages rather than only in English. - Year: 2026 - Tasks: Multilingual math problems - Format: Cross-lingual mathematical reasoning - Difficulty: Advanced multilingual reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen PolyMath isolates cross-lingual math transfer rather than general chat quality. It is useful for spotting models that keep surface fluency in other languages but lose structured reasoning quality. #### VWT2k-lite (VWT2k-lite) A lighter multilingual benchmark slice published in provider tables for broad cross-lingual transfer and understanding. - Year: 2026 - Tasks: Multilingual transfer tasks - Format: Cross-lingual benchmark - Difficulty: Broad multilingual capability - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen VWT2k-lite acts as a compact multilingual stress test. BenchLM tracks it separately because providers often publish it as a standalone row without enough public detail to merge it into existing multilingual benchmark families. #### MAXIFE (MAXIFE) A multilingual instruction-following and understanding benchmark row published in Qwen's launch comparisons. 
- Year: 2026 - Tasks: Multilingual instruction following - Format: Cross-lingual benchmark - Difficulty: Advanced multilingual instruction following - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen MAXIFE appears as a high-level multilingual benchmark intended to capture both instruction compliance and language transfer. BenchLM tracks it as a display-only multilingual signal pending a fuller public benchmark specification. #### SWE Multilingual (SWE-bench Multilingual) A multilingual extension of SWE-bench covering 300 problems across 9 programming languages, testing code generation and bug fixing beyond Python. - Year: 2025 - Tasks: 300 problems across 9 languages - Format: Multi-language code patch generation - Difficulty: Professional multilingual software engineering - Paper: SWE-bench Multilingual (https://www.swebench.com/multilingual) - Authors: SWE-bench team SWE-bench Multilingual extends evaluation to Java, JavaScript, TypeScript, C++, Go, Rust, Ruby, PHP, and Swift. Mythos Preview achieves 87.3% averaged over 5 trials. ### Agentic Benchmarks #### Terminal-Bench 2.0 (Terminal-Bench 2.0) A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows. - Year: 2026 - Tasks: Terminal-based software tasks - Format: Interactive CLI agent evaluation - Difficulty: Professional software engineering - Paper: Terminal-Bench 2.0 (https://www.tbench.ai/) - Authors: Terminal-Bench contributors Terminal-Bench 2.0 focuses on realistic CLI and repository workflows rather than toy code generation. It is a strong proxy for how useful a model is inside coding agents and autonomous developer tools. #### BrowseComp (BrowseComp) A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions. 
- Year: 2025 - Tasks: Research questions requiring browsing - Format: Web search and evidence synthesis - Difficulty: Hard web research - Paper: BrowseComp (https://openai.com/index/browsecomp/) - Authors: OpenAI BrowseComp is designed to measure real web research behavior, not just latent world knowledge. It rewards models that can plan searches, inspect multiple pages, and avoid shallow answer synthesis. #### HLE w/ tools (Humanity's Last Exam with tools) Tool-augmented Humanity's Last Exam scores reported in DeepSeek-V4 thinking-mode evaluations. - Year: 2026 - Tasks: Expert questions with tool use - Format: Pass@1 - Difficulty: Frontier tool-augmented reasoning - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores HLE w/ tools as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations. #### GDPval-AA (GDPval-AA) An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations. - Year: 2026 - Tasks: Agentic real-world work tasks - Format: Elo - Difficulty: Professional agentic workflows - Paper: DeepSeek-V4 Technical Report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) - Authors: DeepSeek-AI BenchLM stores GDPval-AA as a display-only provider-table row for DeepSeek-V4 because the source reports an Elo score rather than a 0-100 percentage. #### GDPval-AA (GDPval-AA normalized) A display-only Artificial Analysis normalized score for economically valuable tasks. - Year: 2026 - Tasks: Economically valuable tasks - Format: Normalized score - Difficulty: Professional agentic workflows - Paper: Artificial Analysis model benchmarks (https://artificialanalysis.ai/models/grok-4-3) - Authors: Artificial Analysis OpenRouter's Grok 4.3 benchmark card displays GDPval-AA as a normalized percentage. 
BenchLM stores it separately from the Elo-style GDPval-AA rows used in provider comparison tables. #### AA Agentic Index (Artificial Analysis Agentic Index) A display-only Artificial Analysis agentic index. - Year: 2026 - Tasks: Cross-benchmark agentic index - Format: Aggregated model score - Difficulty: Display-only external reference - Paper: Artificial Analysis model leaderboards (https://artificialanalysis.ai/leaderboards/models) - Authors: Artificial Analysis BenchLM mirrors this agentic index for comparison, but does not use it as a weighted agentic benchmark row. #### APEX-Agents-AA (APEX-Agents-AA) Artificial Analysis' implementation of the APEX-Agents benchmark for long-horizon professional-services agent tasks. - Year: 2026 - Tasks: 452 professional-services agent tasks - Format: Pass@1 - Difficulty: Long-horizon workplace agent tasks - Paper: APEX-Agents-AA Benchmark Leaderboard (https://artificialanalysis.ai/evaluations/apex-agents-aa) - Authors: Artificial Analysis / Mercor BenchLM stores APEX-Agents-AA as a display-only agentic row. Artificial Analysis reports pass@1 over 452 public APEX-Agents tasks spanning investment banking, management consulting, and corporate law. #### Gert Labs (Gert Labs Composite Game Benchmark) A game-environment benchmark that evaluates AI models in novel games covering strategic planning, resource management, spatial reasoning, cooperation, and theory of mind. - Year: 2026 - Tasks: Novel game environments - Format: Composite game leaderboard - Difficulty: Agentic coding and decision-making - Paper: Gert Labs rankings (https://gertlabs.com/rankings) - Authors: Gert Labs The public Gert Labs leaderboard reports a composite 0-100 metric derived from average and median percentile across games, success rate, and response-time penalty. The combined leaderboard blends agentic coding, one-shot coding, and social decision-making modes. 
#### OSWorld-Verified (OSWorld-Verified) A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion. - Year: 2025 - Tasks: Desktop and GUI tasks - Format: Interactive computer-use evaluation - Difficulty: Complex multi-step workflows - Paper: OSWorld (https://os-world.github.io/) - Authors: OSWorld contributors OSWorld-Verified measures whether models can operate software interfaces, keep state across steps, and complete practical GUI workflows. It is one of the clearest public signals for computer-use capability. #### CyberGym (CyberGym) A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance. - Year: 2026 - Tasks: 1,507 vulnerability analysis instances - Format: Vulnerability reproduction and PoC generation - Difficulty: Real-world cybersecurity - Paper: CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale (https://www.cybergym.io/) - Authors: Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song CyberGym includes 1,507 benchmark instances from historical vulnerabilities across 188 large software projects. BenchLM stores CyberGym as a display-only agentic security benchmark when exact provider comparison values are published. #### BrowseComp-VL (BrowseComp-VL) A vision-language browsing benchmark for multimodal web research and tool-use workflows. - Year: 2026 - Tasks: Multimodal browsing tasks - Format: Vision-language web research evaluation - Difficulty: Multimodal browser-agent - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM stores BrowseComp-VL as a display-only provider-table reference while keeping BrowseComp as the weighted core browsing benchmark. #### OSWorld (OSWorld) A computer-use benchmark for GUI task completion across the broader OSWorld task suite. 
- Year: 2026 - Tasks: Computer-use tasks - Format: Interactive GUI evaluation - Difficulty: Broad computer-use suite - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM tracks plain OSWorld as a display-only provider-table reference and preserves OSWorld-Verified as the weighted core benchmark key. #### AndroidWorld (AndroidWorld) A mobile GUI agent benchmark for completing Android app workflows and on-device tasks. - Year: 2026 - Tasks: Android app workflows - Format: Interactive mobile-agent evaluation - Difficulty: Complex mobile task completion - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM tracks AndroidWorld as a display-only benchmark reference when providers publish exact values alongside broader GUI-agent summaries. #### WebVoyager (WebVoyager) A browser-agent benchmark for completing multi-step workflows on live websites. - Year: 2026 - Tasks: Live website workflows - Format: Interactive browser-agent evaluation - Difficulty: Multi-step web navigation - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM stores WebVoyager as a display-only browser-agent benchmark reference outside the weighted ranking schema. #### MCP Atlas (MCP Atlas) A benchmark for tool-calling over Model Context Protocol integrations and external tools. - Year: 2026 - Tasks: Tool-integrated agent tasks - Format: Interactive tool-calling evaluation - Difficulty: Advanced tool use - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI OpenAI reports MCP Atlas as a tool-use benchmark that measures how well models work with MCP-backed systems and external tools. #### Kimi Claw 24/7 (Kimi Claw 24/7 Bench) A Moonshot AI internal long-horizon agent benchmark for persistent professional coworking tasks. 
- Year: 2026 - Tasks: 17 professional scenarios, 610 evaluation points - Format: Average pass rate across repeated OpenClaw runs - Difficulty: Long-horizon agentic work - Paper: Kimi K2.7 Code (https://huggingface.co/moonshotai/Kimi-K2.7-Code) - Authors: Moonshot AI Moonshot describes Kimi Claw 24/7 Bench as an in-house benchmark spanning 17 professional scenarios and 610 evaluation points across software engineering, ML research, recruiting, trading, and marketing. BenchLM stores the provider-reported exact value as display-only launch evidence. #### MCP Mark Verified (MCPMark-Verified) A human-verified edition of MCPMark for MCP tool use across Notion, GitHub, Filesystem, Postgres, and Playwright server environments. - Year: 2026 - Tasks: MCP tool-use tasks across five server environments - Format: Interactive MCP task completion - Difficulty: Advanced tool use - Paper: MCPMark (https://mcpmark.ai/) - Authors: MCPMark Moonshot reports MCPMark-Verified as a human-verified edition of MCPMark and says it will be open-sourced. BenchLM stores provider-reported exact values as display-only launch evidence until a stable public leaderboard is available. #### Toolathlon (Toolathlon) A tool-use benchmark focused on selecting, sequencing, and completing tasks with external tools. - Year: 2026 - Tasks: Multi-tool workflows - Format: Interactive tool-calling evaluation - Difficulty: Advanced tool use - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI Toolathlon is useful for judging whether a model can do more than answer in chat and instead complete multi-step tool workflows. #### ZClawBench (ZClawBench) A Z.AI benchmark for OpenClaw-style agent workflows spanning information search, office work, data analysis, development and operations, automation, and security. 
- Year: 2026 - Tasks: OpenClaw agent workflows - Format: End-to-end agent benchmark - Difficulty: Broad productivity and operations workflows - Paper: GLM-5-Turbo (https://docs.z.ai/guides/llm/glm-5-turbo) - Authors: Z.AI BenchLM tracks the overall ZClawBench score only when Z.AI publishes an exact public value for a specific model. #### Tau2-Telecom (Tau2-Telecom) A telecom-oriented tool benchmark that measures structured tool use in domain workflows. - Year: 2026 - Tasks: Telecom tool workflows - Format: Domain-specific tool evaluation - Difficulty: Professional workflow - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI OpenAI reports tau2-bench as a domain-specific tool benchmark for telecom tasks, useful for measuring API-call reliability under constraints. #### DeepSearchQA (DeepSearchQA) An agentic browsing benchmark where models search the web, gather evidence, and answer list-style questions using browser tools. - Year: 2026 - Tasks: Agentic browsing and list-answer questions - Format: Search / open / find browser-agent evaluation - Difficulty: Agentic web research - Paper: Muse Spark Eval Methodology (https://ai.meta.com/static-resource/muse-spark-eval-methodology) - Authors: Meta AI Meta describes DeepSearchQA as a browser-tool evaluation graded with an F1-style semantic set match. BenchLM stores it as a display-only agentic search benchmark. #### Tau2-Airline (Tau2-Airline) An airline-domain tool-use benchmark for structured workflow execution and API correctness. - Year: 2026 - Tasks: Airline support workflows - Format: Domain-specific tool evaluation - Difficulty: Professional workflow - Paper: Trinity-Large-Thinking: Scaling an Open Source Frontier Agent (https://www.arcee.ai/blog/trinity-large-thinking) - Authors: Arcee AI BenchLM stores Tau2-Airline as a display-only provider-table reference alongside tau2-bench telecom scores. 
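The F1-style set match that Meta describes for DeepSearchQA (earlier in this section) can be illustrated with a small sketch. This is not Meta's grader: the source describes a semantic match, while the hypothetical `normalize` helper below only lowercases and collapses whitespace, so paraphrased answers would not match here. The function names and sample answers are assumptions made for illustration.

```python
def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(item.lower().split())


def set_f1(predicted: list[str], gold: list[str]) -> float:
    """F1 between two answer lists, each treated as a set of normalized strings."""
    pred_set = {normalize(p) for p in predicted}
    gold_set = {normalize(g) for g in gold}
    if not pred_set and not gold_set:
        return 1.0  # both empty: treat as a perfect match
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)  # share of predicted items that are correct
    recall = overlap / len(gold_set)     # share of gold items that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical list-style answer: two of three predictions are correct,
# and two of four gold items are recovered.
score = set_f1(["Paris", "berlin ", "Rome"], ["paris", "Berlin", "Madrid", "Lisbon"])
print(round(score, 4))
```

A set-level F1 penalizes both missing items (lower recall) and padded answer lists (lower precision), which is why it suits list-answer questions better than plain accuracy.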
#### PinchBench (PinchBench) An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows. - Year: 2026 - Tasks: 23 OpenClaw agent tasks - Format: Average success rate from official runs - Difficulty: Long-horizon agent workflows - Paper: About PinchBench (https://pinchbench.com/about) - Authors: Kilo Code PinchBench publishes official OpenClaw runs across 23 tasks and grades results with automated checks plus an LLM judge. BenchLM mirrors the public average-score view as a display-only benchmark. #### OpenHands Index (OpenHands Index) A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering. - Year: 2025 - Tasks: SWE-bench Verified, SWE-bench Multimodal, Commit0, SWT-bench Verified, and GAIA - Format: Macro-average across five coding-agent categories - Difficulty: Real-world software engineering agent tasks - Paper: OpenHands Index methodology (https://index.openhands.dev/about) - Authors: OpenHands BenchLM mirrors the official OpenHands Index REST API as a display-only agentic software-engineering benchmark. The source reports average agent score, cost, runtime, per-category scores, logs, and visualizations for each model and SDK version. #### SWE-Atlas Refactoring (SWE-Atlas Refactoring) A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks. - Year: 2026 - Tasks: SWE-Atlas refactoring tasks - Format: Refactoring score with confidence intervals - Difficulty: Real-world software-engineering agent tasks - Paper: SWE-Atlas (https://labs.scale.com/papers/sweatlas) - Authors: Scale AI BenchLM mirrors the public Scale SWE-Atlas Refactoring leaderboard as a display-only agentic software-engineering benchmark. The source compares model-agent combinations such as Claude Code, Codex, Gemini CLI, and Mini-SWE-Agent. 
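The OpenHands Index entry above lists its format as a macro-average across five categories. The sketch below shows what a macro-average does and how it differs from a task-weighted average. All scores and task counts are hypothetical placeholders, not values from the OpenHands leaderboard, and the category keys are illustrative names taken from the description above.

```python
from statistics import mean


def macro_average(category_scores: dict[str, float]) -> float:
    """Unweighted mean of per-category scores: every category counts equally."""
    return mean(category_scores.values())


def task_weighted_average(category_scores: dict[str, float],
                          task_counts: dict[str, int]) -> float:
    """Mean weighted by task count: large categories dominate the result."""
    total_tasks = sum(task_counts[name] for name in category_scores)
    weighted_sum = sum(score * task_counts[name]
                       for name, score in category_scores.items())
    return weighted_sum / total_tasks


# Hypothetical per-category scores (0-100) and task counts.
scores = {
    "issue_resolution": 70.0,
    "frontend": 40.0,
    "greenfield": 30.0,
    "testing": 60.0,
    "information_gathering": 50.0,
}
counts = {
    "issue_resolution": 500,
    "frontend": 100,
    "greenfield": 50,
    "testing": 250,
    "information_gathering": 100,
}

print(macro_average(scores))                  # 50.0
print(task_weighted_average(scores, counts))  # 60.5
```

With these placeholder numbers the macro-average is lower because the weak but small categories pull it down as much as the large ones lift it, which is the property a holistic index is likely to want.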
#### InferenceBench (InferenceBench) A benchmark for open-ended LLM inference optimization by AI agents. Agents receive a base model, one H100, and a fixed time budget to build a valid OpenAI-compatible inference server that improves serving speed. - Year: 2026 - Tasks: 4 inference-serving optimization scenarios - Format: Two-hour autonomous CLI agent run - Difficulty: Open-ended ML systems engineering - Paper: InferenceBench (https://inferencebench.ai/) - Authors: Jehyeok Yeon, Ben Rank, Maksym Andriushchenko BenchLM mirrors the public InferenceBench agent leaderboard as a display-only agentic systems-engineering benchmark. The primary score is aggregate geometric-mean speedup over a PyTorch baseline across prefill latency, decode latency, throughput, and all-in-one serving scenarios. #### BFCL v4 (Berkeley Function Calling Leaderboard v4) A function-calling benchmark for tool selection, schema adherence, and argument correctness. - Year: 2026 - Tasks: Function-calling tasks - Format: Tool invocation and schema evaluation - Difficulty: Advanced tool use - Paper: Trinity-Large-Thinking: Scaling an Open Source Frontier Agent (https://www.arcee.ai/blog/trinity-large-thinking) - Authors: Arcee AI BenchLM stores BFCL v4 as a display-only function-calling reference outside the current weighted core schema. #### MLE-Bench Lite (MLE-Bench Lite) A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings. - Year: 2026 - Tasks: Low-resource ML competitions - Format: Autonomous iterative ML optimization - Difficulty: Agentic machine learning - Paper: MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - Authors: MiniMax MiniMax reports MLE-Bench Lite results from autonomous multi-round optimization on low-resource machine-learning competitions, making it a useful signal for agentic ML workflows. 
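The InferenceBench entry above describes its primary score as an aggregate geometric-mean speedup across four scenarios. A minimal sketch of that aggregation follows, assuming each scenario has already been reduced to a single speedup ratio over the baseline (greater than 1 means faster). How InferenceBench derives each ratio is not specified here, and the scenario keys and values are hypothetical.

```python
import math


def geometric_mean_speedup(speedups: dict[str, float]) -> float:
    """Geometric mean of per-scenario speedup ratios, computed in log space."""
    if not speedups:
        raise ValueError("at least one scenario is required")
    if any(ratio <= 0 for ratio in speedups.values()):
        raise ValueError("speedup ratios must be positive")
    log_sum = sum(math.log(ratio) for ratio in speedups.values())
    return math.exp(log_sum / len(speedups))


# Hypothetical ratios for the four scenario types named in the entry.
example = {
    "prefill_latency": 2.0,
    "decode_latency": 4.0,
    "throughput": 1.0,
    "all_in_one": 2.0,
}
print(round(geometric_mean_speedup(example), 6))  # 2.0
```

A geometric mean is the usual choice for averaging ratios: a 4x gain in one scenario and no gain in another combine to 2x, whereas an arithmetic mean would report 2.5x and overstate a single large win.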
#### MM-ClawBench (MM-ClawBench) An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance. - Year: 2026 - Tasks: OpenClaw-style real-world tasks - Format: Agent workflow evaluation - Difficulty: Broad real-world agentic execution - Paper: MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - Authors: MiniMax MiniMax built MM-ClawBench from commonly used OpenClaw tasks to evaluate how well models handle broad real-world agent scenarios across work and personal productivity. #### Claw-Eval (Claw-Eval) A transparent real-world autonomous-agent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks. - Year: 2026 - Tasks: 300 tasks, 2,159 rubrics - Format: End-to-end autonomous-agent evaluation with Pass^3 scoring - Difficulty: Real-world general, multi-turn, and native multimodal agent execution - Paper: Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (https://arxiv.org/abs/2604.06132) - Authors: Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang Claw-Eval v1.1.0 evaluates autonomous agents on full-trajectory tasks audited for completion, safety, and robustness. Its primary Pass^3 metric requires a task to pass in all three independent trials, reducing lucky-run effects. BenchLM mirrors the official leaderboard as display-only because rows reflect benchmark harness execution as well as model capability. #### QwenClawBench (QwenClawBench) Qwen's internal OpenClaw-style benchmark for measuring broad real-world agent performance across practical productivity and research tasks. 
- Year: 2026 - Tasks: Real-world agent workflows - Format: End-to-end agent evaluation - Difficulty: Broad real-world agentic execution - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen QwenClawBench appears in the Qwen3.6 launch comparisons as an internal real-world agent benchmark. BenchLM tracks it separately rather than merging it into other Claw-style benchmarks because the task mix and exact protocol are Qwen-specific. #### QwenWebBench (QwenWebBench) A Qwen benchmark for artifact and webpage generation quality reported as an Elo-style rating. - Year: 2026 - Tasks: Web artifacts and interactive deliverables - Format: Elo-style artifact benchmark - Difficulty: Artifact generation - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen QwenWebBench measures how strong a model is at producing web artifacts and interactive deliverables, with scores reported as Elo ratings rather than percentages. BenchLM tracks it as a display-only benchmark because it is a provider-specific artifact benchmark rather than a standardized public core benchmark. #### TAU3-Bench (TAU3-Bench) A next-generation tool-use benchmark for complex, long-horizon agent workflows beyond the older tau2 telecom and airline task families. - Year: 2026 - Tasks: Long-horizon tool workflows - Format: Interactive tool-use evaluation - Difficulty: Advanced tool use - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen TAU3-Bench appears in the Qwen3.6 launch tables as a broader, more modern tool-use benchmark than the earlier tau2 slices. BenchLM tracks it separately because it is not the same benchmark as Tau2-Telecom or Tau2-Airline. #### VITA-Bench (VITA-Bench) An interactive real-world agent benchmark grounded in practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows. 
- Year: 2025 - Tasks: Interactive consumer-service agent tasks - Format: End-to-end interactive agent evaluation - Difficulty: Long-horizon real-world workflows - Paper: VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications (https://vitabench.github.io/) - Authors: Meituan LongCat Team VITA-Bench is built to test realistic interactive agent behavior rather than toy tool calls. It stresses long-horizon coordination, tool selection, changing user intent, and domain switching across daily-life applications. #### DeepPlanning (DeepPlanning) A long-horizon planning benchmark that tests whether agents can optimize under explicit time, budget, and feasibility constraints. - Year: 2026 - Tasks: Travel planning and constrained shopping - Format: Long-horizon planning benchmark - Difficulty: Constrained agent planning - Paper: DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints (https://arxiv.org/abs/2601.18137) - Authors: DeepPlanning authors DeepPlanning focuses on global constrained optimization rather than local next-step reasoning. It is useful because many agents can execute short actions but still fail when they must gather information and plan coherently over a long horizon under hard constraints. #### MCP-Tasks (MCP-Tasks) A Model Context Protocol task benchmark used in Qwen's launch tables to measure practical execution over MCP-style tools and integrations. - Year: 2026 - Tasks: MCP-integrated tool tasks - Format: Interactive tool-use evaluation - Difficulty: Advanced MCP workflows - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen MCP-Tasks is distinct from MCP Atlas in the Qwen3.6 comparisons. BenchLM keeps it separate until a fuller public benchmark specification is available because it appears to represent a different task protocol and score scale. 
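Several entries in this section report repeated-run metrics, and the Claw-Eval entry earlier describes Pass^3 as requiring a task to pass in all three independent trials. The sketch below contrasts that rule with a plain per-run pass rate. It is an illustration under stated assumptions, not the Claw-Eval harness: the function names, task identifiers, and trial outcomes are all hypothetical.

```python
def pass_all_k(trials_by_task: dict[str, list[bool]], k: int = 3) -> float:
    """Fraction of tasks that passed in every one of k independent trials."""
    if not trials_by_task:
        raise ValueError("at least one task is required")
    solved = 0
    for task_id, trials in trials_by_task.items():
        if len(trials) != k:
            raise ValueError(f"task {task_id!r} has {len(trials)} trials, expected {k}")
        if all(trials):
            solved += 1
    return solved / len(trials_by_task)


def mean_single_run_pass_rate(trials_by_task: dict[str, list[bool]]) -> float:
    """Fraction of individual trials that passed, pooled over all tasks."""
    total = sum(len(trials) for trials in trials_by_task.values())
    passed = sum(sum(trials) for trials in trials_by_task.values())
    return passed / total


# Hypothetical outcomes for four tasks, three trials each.
results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],   # flaky: counts toward per-run rate only
    "task_c": [False, False, False],
    "task_d": [True, True, True],
}
print(pass_all_k(results))                           # 0.5
print(round(mean_single_run_pass_rate(results), 4))  # 0.6667
```

The flaky task contributes two passes to the per-run rate but nothing to the all-trials score, which is the lucky-run effect the stricter metric is meant to remove.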
#### WideResearch (WideResearch) A broad research-agent benchmark for open-ended information gathering, synthesis, and answer construction across wide search spaces. - Year: 2026 - Tasks: Open-ended research tasks - Format: Multi-source research evaluation - Difficulty: Broad research-agent workflows - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen WideResearch evaluates whether a model can sustain a research process over multiple sources and branches rather than answering from shallow retrieval. BenchLM tracks it as a display-only browsing and synthesis benchmark. #### GAIA (General AI Assistants) GAIA evaluates AI models on real-world tasks that are conceptually simple for humans but require multi-step reasoning, web browsing, tool use, and multimodal understanding for AI. Tasks span three difficulty levels and test practical assistant capabilities rather than academic knowledge. - Year: 2024 - Tasks: 466 #### TAU-bench (Tool-Agent-User Benchmark) TAU-bench evaluates AI agents in realistic enterprise scenarios requiring multi-turn tool use, database interactions, and policy adherence. It tests across retail and airline domains, measuring an agent's ability to reliably complete customer service tasks while following complex business rules. - Year: 2024 - Tasks: 680 #### WebArena (WebArena Web Agent Benchmark) WebArena is a realistic web environment for evaluating autonomous AI agents on complex, multi-step browser tasks. Agents must navigate e-commerce sites, forums, content management systems, and code repositories to complete practical objectives like purchasing items, finding information, and managing accounts.
- Year: 2024 - Tasks: 812 #### MEWC (Multi-Environment Web Challenge) A benchmark that evaluates AI agents on multi-environment web challenges, testing navigation and task completion across diverse live web environments. - Year: 2026 - Tasks: Web-agent tasks - Format: Browser task completion - Difficulty: Open-web agent workflows - Paper: MiniMax M2.5 benchmark release surface (https://www.minimax.io/news/minimax-m25) - Authors: MiniMax / benchmark maintainers MEWC is useful as an agentic browsing benchmark because it focuses on open-web interaction and multi-environment task execution rather than single-site scripted browsing. #### Finance Agent v2 (Finance Agent v2) A Vals AI benchmark for realistic financial analyst agent tasks across qualitative analysis, quantitative analysis, market work, comparables, precedents, earnings, disclosure, and modeling. - Year: 2026 - Tasks: Financial analyst task categories - Format: Mean score across repeated runs - Difficulty: Professional expert-task agent workflow - Paper: Finance Agent v2 (https://www.vals.ai/benchmarks/fabv2) - Authors: Vals AI Vals reports Finance Agent v2 as a multi-category benchmark with severity-weighted partial credit and repeated runs per model. BenchLM mirrors the public Vals leaderboard as a display-only expert-task benchmark. #### GDPval rubrics (GDPval rubrics) A display-only provider-table GDPval rubric score for economically valuable work tasks. - Year: 2026 - Tasks: Economically valuable work tasks - Format: Rubric score - Difficulty: Professional agentic workflows - Paper: MiniMax M3 model card (https://huggingface.co/MiniMaxAI/MiniMax-M3) - Authors: MiniMax MiniMax reports GDPval rubrics as a percentage-style provider benchmark. BenchLM stores it separately from AA GDPval Elo and normalized GDPval rows.
#### BankerToolBench (BankerToolBench) A display-only provider benchmark for finance-oriented tool-use and agent workflows. - Year: 2026 - Tasks: Finance and banking tool-use tasks - Format: Task success rate - Difficulty: Professional finance-agent workflows - Paper: MiniMax M3 model card (https://huggingface.co/MiniMaxAI/MiniMax-M3) - Authors: MiniMax MiniMax reports BankerToolBench in the M3 comparison chart. BenchLM tracks it as display-only because it is not part of the weighted agentic schema. ### Multimodal & Grounded Benchmarks #### MMMU (Massive Multi-discipline Multimodal Understanding) A broad multimodal reasoning benchmark spanning charts, diagrams, tables, and academic visual question answering. - Year: 2024 - Tasks: Multimodal academic reasoning - Format: Image + text question answering - Difficulty: Frontier multimodal - Paper: MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (https://arxiv.org/abs/2311.16502) - Authors: MMMU authors MMMU is the base benchmark family behind later MMMU-Pro variants. It measures whether a model can answer expert-style questions that require combining visual understanding with domain knowledge and reasoning. #### MMMU-Pro (Massive Multi-discipline Multimodal Understanding Pro) A harder multimodal benchmark for frontier models that combines text with images, diagrams, charts, and academic visual reasoning tasks. - Year: 2024 - Tasks: Multimodal academic reasoning - Format: Image + text question answering - Difficulty: Frontier multimodal - Paper: MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark (https://arxiv.org/abs/2409.02813) - Authors: MMMU-Pro authors MMMU-Pro extends the original MMMU setup with more difficult multimodal questions and stronger separation at the top end of the model market. #### AA-MMMU-Pro (Artificial Analysis MMMU-Pro) A display-only Artificial Analysis MMMU-Pro score.
- Year: 2026 - Tasks: Multimodal academic reasoning - Format: Image + text question answering - Difficulty: Frontier multimodal - Paper: Artificial Analysis MMMU-Pro Benchmark Leaderboard (https://artificialanalysis.ai/evaluations/mmmu-pro) - Authors: Artificial Analysis BenchLM stores the Artificial Analysis MMMU-Pro result separately from the weighted MMMU-Pro lane so AA refreshes remain display-only. #### OCRBench V2 (OCRBench V2) A native OCR benchmark for reading text from images across multilingual scripts, low-quality scans, handwriting, structured layouts, charts, and screenshots. - Year: 2025 - Tasks: Image OCR tasks - Format: Accuracy - Difficulty: Native visual text understanding - Paper: OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning (https://arxiv.org/abs/2501.00321) - Authors: OCRBench authors OCRBench V2 evaluates whether multimodal models can extract visual text directly from images before downstream reasoning or structure extraction. BenchLM stores Interfaze's reported score as a display-only OCR row. #### olmOCR (olmOCR-Bench) An end-to-end document understanding benchmark over long, layout-rich PDFs with tables, equations, headers, footnotes, and multi-column flows. - Year: 2025 - Tasks: Layout-rich PDF understanding - Format: Mean accuracy - Difficulty: Complex document processing - Paper: olmOCR-Bench (https://github.com/allenai/olmocr/tree/main/olmocr/bench) - Authors: Allen Institute for AI olmOCR-Bench tests whether a system preserves reading order and document structure, not only character-level OCR. BenchLM tracks Interfaze's reported mean score as a display-only document AI benchmark. #### VoxPopuli WER (VoxPopuli-Cleaned-AA Word Error Rate) A speech-recognition benchmark on the cleaned Artificial Analysis VoxPopuli subset, reported as word error rate where lower is better. 
- Year: 2026 - Tasks: Speech-to-text transcription - Format: Word error rate - Difficulty: Audio speech recognition - Paper: VoxPopuli-Cleaned-AA (https://huggingface.co/datasets/ArtificialAnalysis/VoxPopuli-Cleaned-AA) - Authors: Artificial Analysis / VoxPopuli dataset authors VoxPopuli-Cleaned-AA measures transcription quality on multilingual European Parliament speech using Whisper-style text normalization. BenchLM stores Interfaze's WER as a display-only multimodal and audio row. #### Design Arena Website (Design Arena Website Elo) A display-only Design Arena website-generation Elo score surfaced on OpenRouter model benchmark pages. - Year: 2026 - Tasks: Website generation comparisons - Format: Elo - Difficulty: Design and website generation - Paper: OpenRouter Grok 4.3 benchmarks (https://openrouter.ai/x-ai/grok-4.3/benchmarks) - Authors: Design Arena OpenRouter's Grok 4.3 benchmark page reports Website at 1294 Elo, 56.5% win rate, 166.3s average generation time, and Top 13%. BenchLM stores the Elo as the benchmark value and keeps the supporting fields in the description. #### OfficeQA Pro (OfficeQA Pro) A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts. - Year: 2026 - Tasks: Document and spreadsheet tasks - Format: Grounded QA over office artifacts - Difficulty: Enterprise grounded reasoning - Paper: OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning (https://arxiv.org/abs/2603.08655) - Authors: OfficeQA Pro authors OfficeQA Pro is useful when choosing models for enterprise copilots because it measures whether they can reason correctly over real office content rather than generic chat prompts. #### MMMU-Pro w/ Python (MMMU-Pro with Python) Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning. 
- Year: 2026 - Tasks: Multimodal academic reasoning - Format: Image + text question answering with Python - Difficulty: Frontier multimodal - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI Useful for measuring multimodal reasoning when the model can combine visual understanding with computation. #### OmniDocBench 1.5 (OmniDocBench 1.5) A document understanding benchmark used in frontier-model comparison tables to measure extraction and grounded reasoning quality on complex documents. - Year: 2026 - Tasks: Document understanding tasks - Format: Document understanding benchmark - Difficulty: Grounded document reasoning - Paper: Introducing GPT-5.4 mini and nano (https://openai.com/index/introducing-gpt-5-4-mini-and-nano/) - Authors: OpenAI BenchLM stores OmniDocBench 1.5 as the higher-is-better score format used in current first-party comparison tables. Earlier low-is-better error-style rows are intentionally not mixed into this benchmark key. #### Liquid Extract JSON Validity (Liquid image-to-JSON extraction JSON validity) A display-only Liquid AI extraction metric measuring the share of image-to-JSON outputs that parse as strict JSON. - Year: 2026 - Tasks: Image-to-JSON extraction - Format: Strict JSON parseability rate - Difficulty: Structured visual extraction - Paper: LiquidAI LFM2.5-VL Extract model cards (https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-Extract) - Authors: Liquid AI Liquid evaluates Extract models on 2,000 image, schema, and JSON triples. BenchLM stores JSON Validity as a specialized display-only extraction signal rather than a weighted multimodal benchmark. #### Liquid Extract F1 (Liquid image-to-JSON extraction schema consistency F1) A display-only Liquid AI extraction metric measuring field-name agreement between requested schema fields and extracted JSON fields. 
- Year: 2026 - Tasks: Image-to-JSON extraction - Format: Schema field F1 - Difficulty: Structured visual extraction - Paper: LiquidAI LFM2.5-VL Extract model cards (https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-Extract) - Authors: Liquid AI Liquid reports Schema Consistency F1 as a macro-averaged set-level F1 over predicted versus requested field names. BenchLM keeps it separate from general OCR and VQA metrics. #### Liquid Extract VLM Judge (Liquid image-to-JSON extraction VLM judge score) A display-only Liquid AI extraction metric measuring judged agreement between extracted values and the source image. - Year: 2026 - Tasks: Image-to-JSON extraction - Format: VLM-judged extraction accuracy - Difficulty: Structured visual extraction - Paper: LiquidAI LFM2.5-VL Extract model cards (https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-Extract) - Authors: Liquid AI Liquid reports VLM Judge Score using a separate vision model to compare extracted JSON content against the image. BenchLM stores it as a specialized visual extraction quality signal. #### RealWorldQA (RealWorldQA) A grounded visual QA benchmark focused on answering practical questions about real-world images and scenes. - Year: 2026 - Tasks: Real-world visual question answering - Format: Image-grounded QA - Difficulty: General visual reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen RealWorldQA is useful because it emphasizes practical perception and grounded answering on realistic images rather than synthetic or purely academic multimodal tasks. #### Video-MME (with subtitle) (Video-MME with subtitle) A video understanding benchmark that allows subtitle access when answering multimodal questions about videos. 
- Year: 2026 - Tasks: Video understanding - Format: Video QA with subtitle context - Difficulty: Multimodal video reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen The subtitle-enabled Video-MME setting measures how well a model combines video perception with textual cues from subtitles rather than relying on frames alone. #### Video-MME (w/o subtitle) (Video-MME without subtitle) A stricter Video-MME setting that removes subtitle help and tests video understanding from visual and audio context alone. - Year: 2026 - Tasks: Video understanding - Format: Video QA without subtitle context - Difficulty: Multimodal video reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen This row isolates raw video understanding by removing subtitle cues. It is a better proxy for whether a model can parse action, scene changes, and temporal context from the media itself. #### Video-MME (Video-MME) A comprehensive benchmark for multimodal large language models on video understanding, covering temporal reasoning, perception, and question answering over videos. - Year: 2024 - Tasks: Video understanding - Format: Video QA and analysis - Difficulty: Broad multimodal video reasoning - Paper: Video-MME benchmark (https://mme-benchmark.github.io/) - Authors: Video-MME benchmark team BenchLM tracks the aggregate Video-MME row as a display-oriented video benchmark when providers publish a single overall score rather than separate with-subtitle and without-subtitle splits. #### MathVision (MathVision) A visual mathematics benchmark that tests whether a model can solve math problems grounded in diagrams, equations, figures, and other visual inputs. 
- Year: 2026 - Tasks: Visually grounded math problems - Format: Image + math reasoning - Difficulty: Advanced multimodal mathematics - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen MathVision matters because text-only math ability does not guarantee strong performance when the relevant information is embedded in images, geometry diagrams, or formatted equations. #### We-Math (We-Math) A multimodal math benchmark for visually grounded mathematical reasoning and answer generation. - Year: 2026 - Tasks: Visually grounded math problems - Format: Multimodal mathematical reasoning - Difficulty: Advanced multimodal mathematics - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen We-Math is useful as a visual-math stress test because it combines symbolic reasoning with figure understanding. It helps reveal whether a model's math strength transfers into multimodal settings. #### DynaMath (DynaMath) A multimodal benchmark for dynamic mathematical reasoning over visual and structured inputs. - Year: 2026 - Tasks: Dynamic visual math problems - Format: Multimodal mathematical reasoning - Difficulty: Advanced multimodal mathematics - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen DynaMath is intended to probe whether models can track changing mathematical structure in multimodal settings rather than solving static text-only equations. #### MStar (MStar) A general visual question-answering benchmark used in provider tables for real-image reasoning quality. - Year: 2026 - Tasks: Real-image visual QA - Format: Image-grounded QA - Difficulty: General visual reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen MStar sits between broad multimodal reasoning and grounded VQA. It is useful for checking whether a model can answer real-image questions without the stronger domain structure of office or academic benchmarks. 
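The Liquid Extract Schema F1 entry above describes Schema Consistency F1 as a macro-averaged set-level F1 over predicted versus requested field names. A minimal sketch of that calculation follows. It assumes "macro-averaged" means averaging one F1 value per example and that two empty field sets count as a perfect match; Liquid's exact procedure is not spelled out here. The function names `set_f1` and `macro_schema_f1` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def set_f1(predicted: Set[str], requested: Set[str]) -> float:
    """F1 between the predicted and requested sets of field names."""
    if not predicted and not requested:
        return 1.0  # assumption: two empty sets are treated as a perfect match
    true_positives = len(predicted & requested)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(requested)
    return 2 * precision * recall / (precision + recall)


def macro_schema_f1(examples: Iterable[Tuple[List[str], List[str]]]) -> float:
    """Average the per-example set-level F1 (assumed meaning of macro-averaging)."""
    scores = [set_f1(set(pred), set(req)) for pred, req in examples]
    if not scores:
        raise ValueError("at least one example is required")
    return sum(scores) / len(scores)


# Illustrative field names only; these are not taken from the Liquid model cards.
examples = [
    (["name", "date"], ["name", "date"]),                     # F1 = 1.0
    (["name", "total"], ["name", "date", "total", "vendor"]),  # F1 = 2/3
]
print(round(macro_schema_f1(examples), 4))  # 0.8333
```

Because the score compares field names only, it says nothing about whether the extracted values are right, which is consistent with the separate VLM Judge row for value agreement.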
#### ChatCVQA (ChatCVQA) A conversational visual QA benchmark that tests multi-turn grounded answering over images and documents. - Year: 2026 - Tasks: Conversational visual QA - Format: Multi-turn image-grounded QA - Difficulty: Conversational multimodal reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen ChatCVQA matters because many multimodal products are conversational rather than single-turn. It evaluates whether a model can sustain grounded image understanding across follow-up questions. #### MMLongBench-Doc (MMLongBench-Doc) A long-document multimodal benchmark for grounded reasoning over extended document contexts. - Year: 2026 - Tasks: Long document understanding - Format: Document-grounded reasoning - Difficulty: Long-context document reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen MMLongBench-Doc is designed to test whether a model can maintain grounded understanding across large document contexts rather than only short OCR-style snippets. #### CC-OCR (CC-OCR) An OCR-focused benchmark for reading and extracting text from visually complex documents and images. - Year: 2026 - Tasks: Optical character recognition - Format: Text extraction from images and documents - Difficulty: Document reading - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen CC-OCR is useful as a direct check on raw reading ability before higher-level reasoning. It highlights whether failures come from extraction quality or from later reasoning over the extracted content. #### AI2D_TEST (AI2D test split) A diagram understanding benchmark focused on scientific and educational visual question answering. 
- Year: 2026 - Tasks: Diagram understanding - Format: Diagram-grounded QA - Difficulty: Structured visual reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen AI2D-style tasks matter because diagrams compress structure differently from photos or office documents. They test whether a model can parse arrows, labels, and spatial relations in technical illustrations. #### CountBench (CountBench) A visual counting benchmark that tests whether a model can count objects and entities reliably in complex scenes. - Year: 2026 - Tasks: Visual counting tasks - Format: Image-grounded counting - Difficulty: Fine-grained visual perception - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen Counting failures are a common multimodal weakness even in otherwise strong models. CountBench isolates that skill and makes it easy to compare raw perception accuracy across models. #### RefCOCO (avg) (RefCOCO average) A referring-expression grounding benchmark averaged across RefCOCO variants to test whether a model can localize described objects correctly. - Year: 2026 - Tasks: Referring-expression grounding - Format: Grounded visual localization - Difficulty: Fine-grained visual grounding - Paper: RefCOCO referring expression datasets (https://github.com/lichengunc/refer) - Authors: RefCOCO dataset authors RefCOCO-style tasks matter for grounding-heavy assistants because they measure whether the model can map language to specific objects or regions instead of only answering abstract questions. BenchLM stores provider-reported aggregate RefCOCO values as a display-only grounding row. #### ODINW13 (ODINW13) A visual detection and grounding benchmark slice used to compare zero-shot object understanding across diverse domains. 
- Year: 2026 - Tasks: Out-of-distribution object understanding - Format: Detection and grounding - Difficulty: Robust visual grounding - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen ODINW-style scores matter because real deployment images often differ sharply from canonical internet photos. This benchmark checks whether object understanding transfers out of distribution. #### ERQA (ERQA) A grounded visual reasoning benchmark focused on evidence-based question answering over real images. - Year: 2026 - Tasks: Evidence-based visual QA - Format: Grounded image reasoning - Difficulty: Grounded multimodal reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen ERQA is useful as a grounded reasoning check because it emphasizes answer correctness tied to visual evidence rather than fluent but ungrounded descriptions. #### VideoMMMU (VideoMMMU) A video extension of MMMU-style multimodal reasoning over expert questions grounded in temporal media. - Year: 2026 - Tasks: Video-grounded expert reasoning - Format: Video + text reasoning - Difficulty: Frontier multimodal video reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen VideoMMMU tests whether multimodal reasoning skills extend from static images into temporal video understanding. It is useful for evaluating long-form visual reasoning rather than static scene recognition. #### MLVU (M-Avg) (MLVU mean average) A multi-task video understanding benchmark averaged across MLVU categories. - Year: 2026 - Tasks: General video understanding - Format: Video QA and understanding - Difficulty: Broad multimodal video reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen MLVU captures general-purpose video understanding rather than a single narrow skill. BenchLM tracks the mean-average summary row so provider comparison tables can be compared directly. 
#### MMVU (Multimodal Multi-disciplinary Video Understanding) A benchmark for evaluating multimodal models on video understanding tasks across multiple disciplines, emphasizing temporal reasoning and comprehension over video content. - Year: 2026 - Tasks: Video understanding - Format: Video reasoning benchmark - Difficulty: Multi-disciplinary multimodal video reasoning - Paper: Kimi K2.5 benchmark release surface (https://www.kimi.com/blog/kimi-k2-5.html) - Authors: MMVU benchmark maintainers MMVU is a useful video-understanding benchmark for BenchLM because it appears in frontier provider tables and complements Video-MME with a more discipline-oriented video reasoning slice. #### ScreenSpot Pro (ScreenSpot Pro) A high-resolution GUI grounding benchmark for professional computer-use environments. - Year: 2025 - Tasks: GUI grounding tasks - Format: Interface element localization - Difficulty: Professional GUI grounding - Paper: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use (https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf) - Authors: ScreenSpot-Pro authors ScreenSpot Pro matters for computer-use agents because it tests whether a model can find and ground the right UI target before it ever clicks or types. Strong GUI grounding is a prerequisite for reliable desktop agents. #### TIR-Bench (TIR-Bench) A visual agent benchmark for interface reasoning and task execution over screenshots or software surfaces. - Year: 2026 - Tasks: Visual agent and interface reasoning - Format: Screenshot-grounded task reasoning - Difficulty: Computer-use visual reasoning - Paper: Qwen3.6 launch benchmarks (https://qwen.ai/blog?id=qwen3.6) - Authors: Qwen TIR-Bench appears in Qwen's launch tables as a visual-agent benchmark with separate submetrics. BenchLM tracks it as a display-only row while preserving the exact values published by providers. 
#### GDPval-AA (GDPval-AA) An evaluation focused on professional domain expertise and task delivery quality in office-style knowledge work. - Year: 2026 - Tasks: Professional office delivery - Format: ELO-style office benchmark - Difficulty: Professional knowledge work - Paper: MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - Authors: MiniMax MiniMax describes GDPval-AA as an office-domain evaluation for professional expertise and delivery quality. BenchLM stores the published ELO-style score as a display-only benchmark reference. #### MedXpertQA (MM) (MedXpertQA Multimodal) A multimodal medical multiple-choice benchmark covering clinical images such as X-rays, histology, and dermatology. - Year: 2026 - Tasks: 2,000 multimodal medical questions - Format: Medical visual MCQ - Difficulty: Clinical multimodal reasoning - Paper: Muse Spark Eval Methodology (https://ai.meta.com/static-resource/muse-spark-eval-methodology) - Authors: Meta AI Meta describes the multimodal MedXpertQA variant as 2,000 clinically grounded medical questions with five answer choices. BenchLM stores it as a display-only health and multimodal reference. #### ZeroBench (ZeroBench) A multi-step visual reasoning benchmark with pass@5 reporting and optional tool use. - Year: 2026 - Tasks: 100 visual reasoning questions - Format: Multi-step visual reasoning - Difficulty: Tool-augmented visual reasoning - Paper: Muse Spark Eval Methodology (https://ai.meta.com/static-resource/muse-spark-eval-methodology) - Authors: Meta AI Meta evaluates ZeroBench on 100 main questions and reports pass@5, using an LLM judge to compare free-form answers against references. BenchLM stores it as a display-only multimodal reasoning benchmark. #### Design2Code (Design2Code) A multimodal coding benchmark for turning visual designs into working frontend implementations. 
- Year: 2026 - Tasks: Design-to-code tasks - Format: Visual input to frontend implementation - Difficulty: Multimodal coding - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM stores Design2Code as a display-only screenshot-to-code reference rather than a weighted multimodal ranking input. #### Flame-VLM-Code (Flame-VLM-Code) A vision-language coding benchmark for generating correct code from visual and multimodal inputs. - Year: 2026 - Tasks: Multimodal coding tasks - Format: Vision-language code generation - Difficulty: Multimodal coding - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM tracks Flame-VLM-Code as a display-only multimodal coding benchmark reference. #### Vision2Web (Vision2Web) A benchmark for converting visual references into functional web implementations. - Year: 2026 - Tasks: Screenshot-to-web tasks - Format: Visual reference to web implementation - Difficulty: Multimodal web generation - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM stores Vision2Web as a display-only screenshot-to-web benchmark reference outside the weighted core schema. #### ImageMining (ImageMining) A multimodal retrieval and extraction benchmark over image-heavy task settings. - Year: 2026 - Tasks: Visual retrieval tasks - Format: Image-grounded retrieval and extraction - Difficulty: Multimodal retrieval - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM tracks ImageMining as a display-only reference for visual retrieval and extraction performance. #### MMSearch (MMSearch) A multimodal search benchmark for retrieval and grounded answering across mixed-media inputs. 
- Year: 2026 - Tasks: Multimodal search tasks - Format: Mixed-media retrieval and grounded answering - Difficulty: Multimodal search - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM stores MMSearch as a display-only benchmark because it is not yet part of the weighted core schema. #### MMSearch-Plus (MMSearch-Plus) A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows. - Year: 2026 - Tasks: Hard multimodal search tasks - Format: Advanced mixed-media retrieval benchmark - Difficulty: Advanced multimodal search - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM tracks MMSearch-Plus as a display-only extension of multimodal search capability. #### SimpleVQA (SimpleVQA) A visual question answering benchmark focused on straightforward image-grounded understanding. - Year: 2026 - Tasks: Visual QA tasks - Format: Image-grounded question answering - Difficulty: General visual understanding - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM uses SimpleVQA as a display-only visual QA reference rather than a weighted multimodal ranking input. #### Facts-VLM (Facts-VLM) A grounded multimodal factuality benchmark for evidence-linked answer correctness. - Year: 2026 - Tasks: Grounded factuality tasks - Format: Evidence-linked multimodal factuality - Difficulty: Grounded multimodal factuality - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM stores Facts-VLM as a display-only benchmark reference when exact provider tables are available. #### V* (V*) A vision-centric benchmark for high-level multimodal reasoning and perception quality. 
- Year: 2026 - Tasks: Frontier multimodal reasoning tasks - Format: Vision-centric reasoning benchmark - Difficulty: Frontier multimodal - Paper: GLM-5V-Turbo (https://docs.z.ai/guides/vlm/glm-5v-turbo) - Authors: Z.AI BenchLM tracks V* as a display-only frontier multimodal benchmark reference outside the current weighted schema. #### CharXiv (CharXiv Reasoning) A scientific chart reasoning benchmark that tests whether models can understand, interpret, and reason about complex scientific visualizations including plots, diagrams, and data charts. - Year: 2024 - Tasks: Scientific chart reasoning - Format: Chart understanding and reasoning - Difficulty: Scientific visualization reasoning - Paper: CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (https://charxiv.github.io/) - Authors: CharXiv authors CharXiv evaluates a model's ability to reason about real-world scientific charts rather than simple visual QA. With-tools and without-tools variants isolate raw visual reasoning from tool-augmented performance. #### CharXiv w/o tools (CharXiv Reasoning without tools) Tool-free variant of CharXiv that isolates raw visual reasoning ability without code execution or tool augmentation. - Year: 2024 - Tasks: Scientific chart reasoning (tool-free) - Format: Chart understanding without tools - Difficulty: Scientific visualization reasoning - Paper: CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (https://charxiv.github.io/) - Authors: CharXiv authors The tool-free CharXiv variant measures pure multimodal reasoning. Mythos Preview scores 86.1% without tools vs 93.2% with tools, demonstrating strong baseline chart reasoning. #### SWE-bench Multimodal (SWE-bench Multimodal) A multimodal variant of SWE-bench that adds visual context (screenshots, design mockups) to software engineering issue descriptions, testing whether models can leverage visual information for code generation. 
- Year: 2025 - Tasks: Multimodal software engineering tasks - Format: Code patch generation with visual context - Difficulty: Frontier multimodal coding - Paper: SWE-bench Multimodal (https://www.swebench.com/multimodal) - Authors: SWE-bench team SWE-bench Multimodal is important because real-world software engineering increasingly involves visual inputs like UI mockups, error screenshots, and design specifications. Scores tend to be much lower than text-only SWE-bench variants. #### Blueprint-Bench 2 (Blueprint-Bench 2) An agentic spatial reasoning benchmark reported as a normalized score. - Year: 2026 - Tasks: Spatial reasoning from blueprints - Format: Normalized score - Difficulty: Agentic spatial reasoning - Paper: Gemini 3.5 Flash launch screenshots (https://x.com/GoogleDeepMind) - Authors: Google DeepMind Google reported Blueprint-Bench 2 in the Gemini 3.5 Flash launch comparison table. BenchLM stores it as a display-only multimodal and spatial-reasoning benchmark until Google publishes the full methodology page. ### Korean Benchmarks #### KMMLU (Korean Massive Multitask Language Understanding) Evaluates Korean expert-level knowledge across 45 subjects. 20% of questions require Korean cultural context. - Year: 2024 - Tasks: 35,030 questions - Format: Multiple choice questions - Difficulty: Elementary to professional level in Korean - Paper: KMMLU: Measuring Massive Multitask Language Understanding in Korean (https://arxiv.org/abs/2402.11548) - Authors: KMMLU Authors Tests human-level understanding and reasoning in the Korean language across diverse subjects. #### KMMLU-Hard (KMMLU-Hard) A filtered hard subset of KMMLU containing ~5,000 questions that most models get wrong.
- Year: 2025 - Tasks: ~5,000 questions - Format: Multiple choice questions - Difficulty: Advanced Korean reasoning - Paper: Evaluating LLMs on Hard Korean Queries (https://github.com/daekeun-ml/evaluate-llm-on-korean-dataset) - Authors: Daekeun ML Provides a strong signal for separating advanced frontier models on Korean reasoning. #### KMMLU-Redux (KMMLU-Redux) A cleaned version of KMMLU built from national technical qualification exams, decontaminated and deduplicated with erroneous items removed. - Tasks: ~3,500 questions - Format: Technical multiple choice - Difficulty: Industrial/technical #### KMMLU-Pro (KMMLU-Pro) Korean National Professional Licensure exams evaluating professional-grade knowledge. - Tasks: ~2,500 questions - Format: Professional licensure exams - Difficulty: Professional #### CLIcK (Cultural and Linguistic Intelligence in Korean) Evaluates knowledge of Korean culture and linguistics. - Tasks: 1,995 questions - Format: Cultural/linguistic QA - Difficulty: Korean cultural nuances #### KoBALT (Korean Benchmark for Advanced Linguistic Tasks) Evaluates advanced Korean linguistic competence. - Tasks: Linguistics questions - Format: Advanced linguistics - Difficulty: Advanced linguistic phenomena #### Korean CSAT (College Scholastic Ability Test (수능)) The Korean SAT exam. - Tasks: Multi-subject exam - Format: Standardized test - Difficulty: High school to college level #### HRM8K (HAE-RAE Math 8K) Korean mathematical reasoning (high-school to Olympiad level).
- Tasks: 8,011 instances - Format: Math word problems - Difficulty: Olympiad level ### External Benchmarks #### Vals Index (Vals Index v1.1) Vals AI composite benchmark across finance and coding tasks, including Finance Agent v2, CorpFin v2, SWE-bench, Terminal-Bench 2.0, and Vibe Code Bench. - Year: 2026 - Tasks: Finance and coding components - Format: Composite score - Difficulty: Private economic-work benchmark composite - Paper: Vals Index (https://www.vals.ai/benchmarks/vals_index) - Authors: Vals AI BenchLM mirrors Vals Index v1.1 as a display-only external composite. It is useful context for Vals' private benchmark suite, but it is not used as a BenchLM weighted ranking input. #### Vals Multimodal Index (Vals Multimodal Index v1.1) Vals AI multimodal composite across finance, coding, education, and mortgage-tax task families. - Year: 2026 - Tasks: Finance, coding, education, and mortgage-tax components - Format: Composite score - Difficulty: Private multimodal economic-work benchmark composite - Paper: Vals Multimodal Index (https://www.vals.ai/benchmarks/vals_multimodal_index) - Authors: Vals AI BenchLM mirrors the Vals Multimodal Index as a display-only external composite with task-level component scores preserved in the Vals snapshot. #### CorpFin v2 (Vals CorpFin v2) Vals AI private benchmark for understanding long-context credit agreements. - Year: 2026 - Tasks: Credit-agreement understanding tasks - Format: Accuracy score - Difficulty: Professional finance document reasoning - Paper: CorpFin v2 (https://www.vals.ai/benchmarks/corp_fin_v2) - Authors: Vals AI The Vals CorpFin v2 page reports overall, exact-page, max-fitting-context, and shared-max-context task views. BenchLM keeps it display only. #### MedCode (Vals MedCode) Vals AI healthcare benchmark for whether models can support the medical billing process.
- Year: 2026 - Tasks: Medical billing support tasks - Format: Accuracy score - Difficulty: Professional healthcare administration - Paper: MedCode (https://www.vals.ai/benchmarks/medcode) - Authors: Vals AI BenchLM mirrors the public Vals MedCode leaderboard as display-only healthcare evidence. #### MedScribe (Vals MedScribe) Vals AI healthcare benchmark for whether models can support doctors with administrative work. - Year: 2026 - Tasks: Medical administrative support tasks - Format: Accuracy score - Difficulty: Professional healthcare administration - Paper: MedScribe (https://www.vals.ai/benchmarks/medscribe) - Authors: Vals AI BenchLM mirrors the public Vals MedScribe leaderboard as display-only healthcare evidence. #### MortgageTax (Vals MortgageTax) Vals AI benchmark for mortgage and tax document reasoning, including semantic and numerical extraction task views. - Year: 2026 - Tasks: Mortgage and tax extraction tasks - Format: Accuracy score - Difficulty: Professional mortgage-tax document reasoning - Paper: MortgageTax (https://www.vals.ai/benchmarks/mortgage_tax) - Authors: Vals AI BenchLM mirrors Vals MortgageTax as a display-only finance and document-reasoning benchmark. #### ProofBench (Vals ProofBench) Vals AI automated theorem-proving benchmark. - Year: 2026 - Tasks: Automated theorem proving - Format: Accuracy score - Difficulty: Formal proof reasoning - Paper: ProofBench (https://www.vals.ai/benchmarks/proof_bench) - Authors: Vals AI BenchLM mirrors Vals ProofBench as a display-only math and proof benchmark. #### LegalBench (Vals LegalBench) Vals AI legal benchmark with issue, rule, conclusion, interpretation, and rhetoric task views. - Year: 2026 - Tasks: Legal reasoning task views - Format: Accuracy score - Difficulty: Professional legal reasoning - Paper: LegalBench (https://www.vals.ai/benchmarks/legal_bench) - Authors: Vals AI BenchLM mirrors Vals LegalBench as display-only legal-domain evidence and does not use it in weighted rankings. 
#### CaseLaw v2 (Vals CaseLaw v2) Vals AI private question-answer benchmark over Canadian court cases. - Year: 2026 - Tasks: Canadian case-law question answering - Format: Accuracy score - Difficulty: Professional legal retrieval and reasoning - Paper: CaseLaw v2 (https://www.vals.ai/benchmarks/case_law_v2) - Authors: Vals AI Vals marks CaseLaw v2 as archived. BenchLM mirrors the public leaderboard as display-only historical legal-domain context. #### DeepSWE (DeepSWE) A long-horizon software engineering benchmark from Datacurve for measuring frontier coding agents on original tasks drawn from active open-source repositories. - Year: 2026 - Tasks: 113 software engineering tasks across 91 repositories and 5 languages - Format: Pass@1 with confidence interval, cost, time, and token metadata - Difficulty: Long-horizon software engineering - Paper: DeepSWE benchmark blog (https://deepswe.datacurve.ai/blog) - Authors: Datacurve AI DeepSWE includes original tasks with isolated environments and program-based verifiers. BenchLM mirrors the public DeepSWE leaderboard JSON as display-only, using the best available mini-swe-agent configuration per model and preserving cost, time, token, and effort-level source metadata. Each row combines a model, agent harness, and reasoning-effort setting rather than a pure model-only benchmark score. #### SWE-Marathon (SWE-Marathon) A long-horizon software engineering benchmark from Abundant AI with multi-hour tasks spanning library reproductions, full-stack product clones, and ML engineering. - Year: 2026 - Tasks: 20 multi-hour software engineering tasks - Format: Task resolution and trajectory review - Difficulty: Ultra-long-horizon software engineering - Paper: SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work? (https://www.swe-marathon.org/) - Authors: Abundant AI and BenchFlow BenchLM tracks SWE-Marathon as a display-only external benchmark. 
The official v1.0 site reports 20 multi-hour tasks, 1,300 logged trials, task-level leaderboards, and replayable trajectory artifacts; BenchLM keeps it source-metadata-only until there is a stable public aggregate feed. #### ExploitBench (ExploitBench v8-bench) A cybersecurity benchmark for evaluating LLM agents on full-control V8 exploit synthesis using 16 measured exploit capability flags. - Year: 2026 - Tasks: V8 exploit synthesis runs - Format: Capability coverage percentage over 16 flags - Difficulty: Browser exploitation and cybersecurity - Paper: ExploitBench (https://exploitbench.ai/) - Authors: Seunghyun Lee, David Brumley, Carnegie Mellon University ExploitBench measures whether LLM agents can turn patched V8 bugs into progressively stronger exploit capabilities, from reaching vulnerable code to full control. BenchLM mirrors the official public leaderboard as display-only security-evaluation context. #### GBA-Eval (GBA-Eval) An agentic coding benchmark that asks models to build a Game Boy Advance emulator from scratch and grades emulator behavior against procedural, audio, and gameplay tests. - Year: 2026 - Tasks: 27 emulator test cases - Format: Overall emulator score - Difficulty: Long-horizon systems programming - Paper: GBA-Eval (https://gbaeval.com/) - Authors: Stephen Yang GBA-Eval evaluates long-horizon coding agents by having them implement a working GBA emulator. The public leaderboard reports overall scores across 27 test cases with token usage and checkpoints preserved in the source JSON feed. #### CAIS Text Leaderboard (CAIS AI Dashboard Text Capabilities Index) A Center for AI Safety dashboard view summarizing text capabilities across HLE, ARC-AGI-2, SWE-Bench Pro, and TextQuests. 
- Year: 2025 - Tasks: HLE, ARC-AGI-2, SWE-Bench Pro, and TextQuests - Format: Average component score - Difficulty: Composite frontier text capability - Paper: CAIS AI Dashboard (https://dashboard.safe.ai/) - Authors: Center for AI Safety BenchLM mirrors the text-capability portion of the CAIS AI Dashboard as a display-only composite. The displayed score is the average of the public HLE, ARC-AGI-2, SWE-Bench Pro, and TextQuests component scores. #### WeirdML (WeirdML v2) A machine-learning engineering benchmark that tests whether LLMs can train models on novel datasets, write PyTorch code, and improve through iterative feedback. - Year: 2026 - Tasks: 17 novel ML engineering tasks - Format: Average accuracy across tasks - Difficulty: Novel dataset modeling and iterative debugging - Paper: WeirdML (https://htihle.github.io/weirdml.html) - Authors: Havard Tveit Ihle WeirdML v2 evaluates models on 17 unusual ML tasks and reports average accuracy across tasks from the official CSV. BenchLM mirrors the top official rows as display-only ML-agent evidence. #### ALE-Bench (Agents Last Exam) A benchmark for agentic professional workflows with verifiable success criteria, reporting pass rates and partial scores for model plus agent-harness rows. - Year: 2026 - Tasks: 152 ALE-V1 professional workflow tasks across 13 top-level domains - Format: Pass rate, partial-credit score, cost, token, and duration metadata - Difficulty: Real-world agentic workflows - Paper: Agents Last Exam (https://agents-last-exam.org/leaderboard) - Authors: UC Berkeley RDI BenchLM mirrors the public Agents Last Exam full leaderboard API as ALE-Bench and links the June 2026 Agent Showdown analysis for domain, cost, speed, and failure-mode context. Rows combine base models with agent harnesses such as Codex, OpenClaw, Claude Code, Droid, Cursor CLI, and Gemini CLI, so the table remains display-only. The source notes that Claude Code plus Fable 5 may include upstream fallback to Opus 4.8 on refused tasks. 
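The CAIS Text Leaderboard entry above states that the displayed score is the average of the public HLE, ARC-AGI-2, SWE-Bench Pro, and TextQuests component scores. The sketch below shows that arithmetic as an unweighted mean. The helper name `cais_text_score` and the choice to return `None` when a component is missing are assumptions for illustration, not documented BenchLM or CAIS behavior, and the input numbers are placeholders rather than real results.

```python
from statistics import mean
from typing import Dict, Optional

# Component names as listed in the CAIS Text Leaderboard entry.
CAIS_COMPONENTS = ("HLE", "ARC-AGI-2", "SWE-Bench Pro", "TextQuests")


def cais_text_score(component_scores: Dict[str, float]) -> Optional[float]:
    """Unweighted mean of the four component scores.

    Returns None when any component is absent (an assumption: the source
    does not say how incomplete rows are handled).
    """
    if any(name not in component_scores for name in CAIS_COMPONENTS):
        return None
    return mean(component_scores[name] for name in CAIS_COMPONENTS)


# Placeholder values for illustration only.
placeholder_row = {"HLE": 40.0, "ARC-AGI-2": 20.0, "SWE-Bench Pro": 60.0, "TextQuests": 40.0}
print(cais_text_score(placeholder_row))  # 40.0
```

An unweighted mean lets one unusually low component pull the composite down as much as any other, which is worth keeping in mind when comparing rows with very uneven component scores.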
#### RuneScape-Bench (RuneBench / runescape-bench) An agentic coding benchmark where models use a TypeScript SDK to play a RuneScape-like environment and optimize skill-training performance. - Year: 2026 - Tasks: 16 RuneScape skill-training tasks - Format: Average log XP-rate score - Difficulty: Agentic gameplay automation - Paper: RuneBench (https://maxbittker.github.io/runebench/) - Authors: Max Bittker RuneBench evaluates gameplay automation and coding-agent strategy. BenchLM mirrors the public aggregate computed as average ln(1 + XP/min) across 16 skill-training tasks, while keeping the benchmark display-only because rows reflect agent harness and gameplay strategy. #### Toloka Arena (Toloka Arena) An independent agentic-intelligence evaluation from Toloka using private simulated workflows and a pass^5 metric. - Year: 2026 - Tasks: Private simulated enterprise workflows - Format: pass^5 arena score - Difficulty: Agentic workflow reliability - Paper: Toloka Arena (https://toloka.ai/arena) - Authors: Toloka Toloka Arena evaluates agents on private simulated workflows with tools, databases, policies, and multi-turn tasks. BenchLM tracks it as source metadata only until a stable public leaderboard data feed is available. #### Vals SWE-bench mirror (Vals-hosted SWE-bench mirror) Vals AI hosted SWE-bench view for solving production software engineering tasks. - Year: 2026 - Tasks: Software engineering issue-resolution tasks - Format: Accuracy score - Difficulty: Production software engineering - Paper: Vals SWE-bench (https://www.vals.ai/benchmarks/swebench) - Authors: Vals AI BenchLM keeps this separate from its canonical SWE-bench Verified page so Vals-hosted results remain secondary context rather than source-of-record data. #### Vals Terminal-Bench 2.0 mirror (Vals-hosted Terminal-Bench 2.0 mirror) Vals AI hosted Terminal-Bench 2.0 view with easy, medium, and hard task splits. 
- Year: 2026 - Tasks: Terminal task difficulty splits - Format: Accuracy score - Difficulty: Terminal-based agent execution - Paper: Vals Terminal-Bench 2.0 (https://www.vals.ai/benchmarks/terminal-bench-2) - Authors: Vals AI BenchLM mirrors this Vals-hosted Terminal-Bench view as display-only secondary context. #### Vals LiveCodeBench mirror (Vals-hosted LiveCodeBench mirror) Vals AI implementation of LiveCodeBench with easy, medium, and hard task splits. - Year: 2026 - Tasks: Coding problem difficulty splits - Format: Accuracy score - Difficulty: Contamination-resistant coding problems - Paper: Vals LiveCodeBench (https://www.vals.ai/benchmarks/lcb) - Authors: Vals AI BenchLM keeps this separate from its canonical LiveCodeBench rows because it is a Vals-hosted implementation snapshot. #### Vals GPQA Diamond mirror (Vals-hosted GPQA Diamond mirror) Vals AI hosted GPQA Diamond view with few-shot and zero-shot chain-of-thought task splits. - Year: 2026 - Tasks: GPQA Diamond task splits - Format: Accuracy score - Difficulty: Graduate science reasoning - Paper: Vals GPQA Diamond (https://www.vals.ai/benchmarks/gpqa) - Authors: Vals AI BenchLM keeps this Vals-hosted GPQA Diamond table separate from canonical GPQA source records. #### Vals MMLU-Pro mirror (Vals-hosted MMLU-Pro mirror) Vals AI hosted MMLU-Pro view with subject-level task splits. - Year: 2026 - Tasks: MMLU-Pro subject splits - Format: Accuracy score - Difficulty: Professional academic reasoning - Paper: Vals MMLU-Pro (https://www.vals.ai/benchmarks/mmlu_pro) - Authors: Vals AI BenchLM keeps this Vals-hosted MMLU-Pro table separate from canonical MMLU-Pro source records. 
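The RuneScape-Bench entry above describes its public aggregate as the average of ln(1 + XP/min) across 16 skill-training tasks. A minimal sketch of that formula follows. The function name `runebench_aggregate` is hypothetical, and the sample rates are chosen only to make the arithmetic easy to check; they are not leaderboard data.

```python
import math
from typing import Sequence


def runebench_aggregate(xp_per_minute: Sequence[float]) -> float:
    """Average of ln(1 + XP/min) over the supplied per-task XP rates."""
    if not xp_per_minute:
        raise ValueError("at least one task rate is required")
    if any(rate < 0 for rate in xp_per_minute):
        raise ValueError("XP rates must be non-negative")
    # math.log1p(x) computes ln(1 + x) accurately, including for small x.
    return sum(math.log1p(rate) for rate in xp_per_minute) / len(xp_per_minute)


# Two illustrative tasks: 0 XP/min contributes ln(1) = 0,
# and (e - 1) XP/min contributes ln(e) = 1, so the average is 0.5.
print(round(runebench_aggregate([0.0, math.e - 1]), 3))  # 0.5
```

The logarithm compresses large XP rates, so a single very fast skill cannot dominate the average, and a task with zero XP contributes exactly zero.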
## All Model Benchmark Scores ### #1 Claude Mythos 5 - Creator: Anthropic - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M+ - Overall Score: 99/100 - Family: Claude Mythos - Variant: restricted - Benchmarks Covered: 19 of 247 - Profile: https://benchlm.ai/models/claude-mythos-5 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. **Agentic**: Terminal-Bench 2.0: 88, OSWorld-Verified: 85, BrowseComp: 88, GDPval-AA: 1932 **Coding**: SWE-bench Verified: 95.5, SWE-bench Pro: 80.3, FrontierCode: 29.3, Terminal-Bench 2.0: 88 **Multimodal & Grounded**: MMMU-Pro: 92.7, SWE-bench Multimodal: 54.9, CharXiv: 93.5, CharXiv w/o tools: 88.9, Blueprint-Bench 2: 38.6 **Knowledge**: GPQA: 94.1, HLE: 64.5, HLE w/o tools: 59 **Multilingual**: SWE Multilingual: 92.2 **Mathematics**: USAMO 2026: 97.6 ### #2 Claude Fable 5 - Creator: Anthropic - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M+ - Overall Score: 97/100 - Family: Claude Fable - Variant: base - Benchmarks Covered: 19 of 247 - Profile: https://benchlm.ai/models/claude-fable-5 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 84.3, OSWorld-Verified: 85, BrowseComp: 86.9, GDPval-AA: 1932 **Coding**: SWE-bench Verified: 95, SWE-bench Pro: 80, FrontierCode: 29.3, Terminal-Bench 2.0: 84.3 **Multimodal & Grounded**: MMMU-Pro: 92.7, SWE-bench Multimodal: 59, CharXiv: 93.2, CharXiv w/o tools: 86.1, Blueprint-Bench 2: 38.6 **Knowledge**: GPQA: 94.5, HLE: 64.5, HLE w/o tools: 59 **Multilingual**: SWE Multilingual: 87.3 **Mathematics**: USAMO 2026: 97.6 ### #3 Claude Opus 4.8 - Creator: Anthropic - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M - Overall Score: 93/100 - Family: Claude Opus 4.8 - Variant: base - Benchmarks Covered: 41 of 247 - Profile: https://benchlm.ai/models/claude-opus-4-8 - Related Earlier Model: Claude Opus 4.7 (Adaptive) - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. **Agentic**: Terminal-Bench 2.0: 74.6, BrowseComp: 84.3, DeepSearchQA: 93.1, OSWorld-Verified: 83.4, Finance Agent v2: 53.9, GDPval-AA: 1890, MCP Atlas: 82.2, Toolathlon: 59.9, Gert Labs: 72.97, AA Agentic Index: 77.81, Tau2-Telecom: 94.4, GDPval-AA: 69.5 **Coding**: SWE-bench Verified: 88.6, SWE-bench Pro: 69.2, SWE Multilingual: 84.4, SWE Multimodal: 38.4, Terminal-Bench 2.0: 74.6, CursorBench v3.1: 58.4, AA Coding Index: 56.71, Terminal-Bench Hard: 58.3, AA-SciCode: 53.5 **Multimodal & Grounded**: OfficeQA Pro: 66.2, ScreenSpot Pro: 87.9, CharXiv: 89.9, CharXiv w/o tools: 80.5, Design Arena Website: 1284 **Reasoning**: AA-LCR: 67.7, CritPt: 20.9 **Knowledge**: GPQA: 93.6, GPQA-D: 93.6, HLE: 57.9, HLE w/o tools: 49.8, Artificial Analysis Intelligence Index: 61.44, AA-GPQA Diamond: 92, AA-HLE: 45.7, AA-Omniscience Index: 27.4, AA-Omniscience Accuracy: 46.6, AA-Omniscience Hallucination Rate: 35.9 **Instruction Following**: AA-IFBench: 62.2 **Multilingual**: INCLUDE: 87.6 **Mathematics**: USAMO 2026: 96.7 ### #4 Gemini 3.1 Pro - Creator: Google - Source Type: Proprietary - Reasoning: 
Non-Reasoning - Context Window: 1M - Overall Score: 91/100 - Family: Gemini 3.1 Pro - Variant: base - Benchmarks Covered: 38 of 247 - Profile: https://benchlm.ai/models/gemini-3-1-pro - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. **Agentic**: Claw-Eval: 57.8, DeepSearchQA: 69.7, Tau2-Telecom: 95.6, AA Agentic Index: 59.09, APEX-Agents-AA: 32, GDPval-AA: 40.7, GDPval-AA: 1314, Gert Labs: 56.87 **Coding**: LiveCodeBench Pro: 82.9, React Native Evals: 78.9, Vibe Code Bench: 32.034, AA Coding Index: 55.5, Terminal-Bench Hard: 53.8, AA-SciCode: 58.9 **Multimodal & Grounded**: MMMU-Pro: 83.9, CharXiv: 80.2, ERQA: 69.4, SimpleVQA: 72.4, ScreenSpot Pro: 84.4, ZeroBench: 29, MedXpertQA (MM): 81.3, GDPval-AA: 1320, AA-MMMU-Pro: 82.4, Design Arena Website: 1296 **Reasoning**: ARC-AGI-2: 77.1, AA-LCR: 72.7, CritPt: 17.7 **Knowledge**: GPQA-D: 94.3, HLE w/o tools: 45.4, HealthBench Hard: 20.6, MedXpertQA (Text): 71.5, Artificial Analysis Intelligence Index: 57.18, AA-GPQA Diamond: 94.1, AA-HLE: 44.7, AA-Omniscience Index: 32.9, AA-Omniscience Accuracy: 55.3, AA-Omniscience Hallucination Rate: 49.9 **Instruction Following**: AA-IFBench: 77.1 ### #5 Qwen3.7 Max - Creator: Alibaba - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M - Overall Score: 91/100 - Family: Qwen3.7 Max - Variant: base - Benchmarks Covered: 51 of 247 - Profile: https://benchlm.ai/models/qwen3-7-max - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 69.7, QwenClawBench: 64.3, QwenWebBench: 1568, Claw-Eval: 65.2, BFCL v4: 75, MCP Atlas: 76.4, VITA-Bench: 47.9, HLE w/ tools: 53.5, AA Agentic Index: 66.56, Tau2-Telecom: 94.7, GDPval-AA: 52.2, GDPval-AA: 1543, Gert Labs: 64.27 **Coding**: SWE-bench Verified: 80.4, SWE-bench Pro: 60.6, SWE Multilingual: 78.3, NL2Repo: 47.2, SciCode: 53.5, LiveCodeBench: 91.6, Terminal-Bench 2.0: 69.7, AA Coding Index: 50.12, Terminal-Bench Hard: 50.8, AA-SciCode: 48.8 **Multimodal & Grounded**: Design Arena Website: 1307 **Reasoning**: MRCRv2: 90.4, CritPt: 13.4, AA-LCR: 69 **Knowledge**: GPQA: 92.4, GPQA-D: 92.4, HLE: 41.4, MMLU-Pro: 89.6, MMLU-Redux: 95, SuperGPQA: 73.6, MMMLU: 90.3, Artificial Analysis Intelligence Index: 56.58, AA-GPQA Diamond: 92.3, AA-HLE: 38.1, AA-Omniscience Index: 14.1, AA-Omniscience Accuracy: 30.1, AA-Omniscience Hallucination Rate: 22.9 **Instruction Following**: IFEval: 94.3, IFBench: 79.1, AA-IFBench: 80.5 **Multilingual**: MMLU-ProX: 87, NOVA-63: 59, INCLUDE: 86.2, MAXIFE: 89.2, PolyMath: 86.5 **Mathematics**: HMMT Feb 2026: 97.1, IMOAnswerBench: 90, Apex: 44.5 ### #6 GPT-5.4 Pro - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1.05M - Overall Score: 90/100 - Family: GPT-5.4 - Variant: pro - Benchmarks Covered: 10 of 247 - Profile: https://benchlm.ai/models/gpt-5-4-pro - Sibling Models: GPT-5.4, GPT-5.4 mini, GPT-5.4 nano - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: BrowseComp: 89.3 **Multimodal & Grounded**: MMMU-Pro: 94 **Reasoning**: ARC-AGI-2: 83.3, CritPt: 30 **Knowledge**: HLE: 58.7, FrontierScience: 36.7, FrontierScience Research: 36.7, HLE w/o tools: 42.7 **Mathematics**: IPhO 2025 (Theory): 93.5, FrontierMath: 50 ### #7 GPT-5.5 - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M - Overall Score: 89/100 - Family: GPT-5.5 - Variant: base - Benchmarks Covered: 42 of 247 - Profile: https://benchlm.ai/models/gpt-5-5 - Sibling Models: GPT-5.5 Pro - Related Earlier Model: GPT-5.4 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. **Agentic**: Terminal-Bench 2.0: 82, CyberGym: 81.8, BrowseComp: 84.4, OSWorld-Verified: 78.7, MCP Atlas: 75.3, Toolathlon: 55.6, Tau2-Telecom: 93.9, AA Agentic Index: 74.12, APEX-Agents-AA: 37.7, GDPval-AA: 63.5, GDPval-AA: 1769, Gert Labs: 72.93 **Coding**: SWE-bench Pro: 58.6, Terminal-Bench 2.0: 82, Vibe Code Bench: 69.847, React Native Evals: 84.7, CursorBench v3.1: 59.2, AA Coding Index: 59.12, Terminal-Bench Hard: 60.6, AA-SciCode: 56.1 **Multimodal & Grounded**: MMMU-Pro: 81.2, MMMU-Pro w/ Python: 83.2, OfficeQA Pro: 54.1, AA-MMMU-Pro: 79.9, Design Arena Website: 1297 **Reasoning**: MRCR v2 64K-128K: 83.1, MRCR v2 128K-256K: 87.5, ARC-AGI-2: 85, AA-LCR: 74.3, CritPt: 27.1 **Knowledge**: GPQA: 93.6, GPQA-D: 93.6, HLE: 52.2, HLE w/o tools: 41.4, Artificial Analysis Intelligence Index: 60.24, AA-GPQA Diamond: 93.5, AA-HLE: 44.3, AA-Omniscience Index: 20.1, AA-Omniscience Accuracy: 56.9, AA-Omniscience Hallucination Rate: 85.5 **Instruction Following**: AA-IFBench: 75.9 **Mathematics**: FrontierMath: 51.7 ### #8 Gemini 3 Pro Deep Think - Creator: Google - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 2M - Overall Score: ~89/100 (estimated) - Family: Gemini 3 Pro - Variant: reasoning - Benchmarks Covered: 4 of 247 - Profile: 
https://benchlm.ai/models/gemini-3-pro-deep-think - Sibling Models: Gemini 3 Pro - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. **Agentic**: GDPval-AA: 41.2, GDPval-AA: 1324 **Reasoning**: ARC-AGI-2: 45.1, CritPt: 25.7 ### #9 Grok 4.1 - Creator: xAI - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 1M - Overall Score: ~89/100 (estimated) - Family: Grok 4.1 - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/grok-4-1 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. ### #10 GPT-5.4 - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1.05M - Overall Score: 88/100 - Family: GPT-5.4 - Variant: base - Benchmarks Covered: 48 of 247 - Profile: https://benchlm.ai/models/gpt-5-4 - Sibling Models: GPT-5.4 Pro, GPT-5.4 mini, GPT-5.4 nano - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 75.1, CyberGym: 79, BrowseComp: 82.7, OSWorld-Verified: 75, MCP Atlas: 70.6, Toolathlon: 54.6, Tau2-Telecom: 87.1, Claw-Eval: 60.3, DeepSearchQA: 73.6, AA Agentic Index: 67.96, APEX-Agents-AA: 33.3, GDPval-AA: 58.7, GDPval-AA: 1674, Gert Labs: 64.89 **Coding**: LiveCodeBench Pro: 87.5, SWE-bench Pro: 57.7, React Native Evals: 85.3, Vibe Code Bench: 67.421, AA Coding Index: 57.25, Terminal-Bench Hard: 57.6, AA-SciCode: 56.6 **Multimodal & Grounded**: MMMU-Pro: 81.2, OfficeQA Pro: 53.2, MMMU-Pro w/ Python: 82.1, CharXiv: 82.8, ERQA: 65.4, SimpleVQA: 61.1, ScreenSpot Pro: 85.4, ZeroBench: 41, MedXpertQA (MM): 77.1, GDPval-AA: 1672, AA-MMMU-Pro: 78.4, Design Arena Website: 1269 **Reasoning**: AA-LCR: 74, CritPt: 23.4 **Knowledge**: GPQA: 92.8, HLE: 52.1, HLE w/o tools: 39.8, GPQA-D: 92.8, HealthBench Hard: 40.1, MedXpertQA (Text): 59.6, Artificial Analysis Intelligence Index: 56.8, AA-GPQA Diamond: 92, AA-HLE: 41.6, AA-Omniscience Index: 5.7, AA-Omniscience Accuracy: 50, AA-Omniscience Hallucination Rate: 88.6 **Instruction Following**: AA-IFBench: 73.9 ### #11 Qwen3.7 Plus - Creator: Alibaba - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M - Overall Score: 88/100 - Family: Qwen3.7 Plus - Variant: base - Benchmarks Covered: 68 of 247 - Profile: https://benchlm.ai/models/qwen3-7-plus - Related Earlier Model: Qwen3.6 Plus - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 70.3, QwenClawBench: 61.8, QwenWebBench: 1536, Claw-Eval: 62.7, BFCL v4: 72.9, MCP Atlas: 73.2, VITA-Bench: 45.6, DeepPlanning: 62.3, OSWorld-Verified: 73.3, AndroidWorld: 81, AA Agentic Index: 65.13, APEX-Agents-AA: 22.4, Tau2-Telecom: 93, GDPval-AA: 50.9, GDPval-AA: 1518 **Coding**: Terminal-Bench 2.0: 70.3, SWE-bench Verified: 77.7, SWE-bench Pro: 57.6, SWE Multilingual: 75.8, NL2Repo: 41.1, SciCode: 51.3, LiveCodeBench: 89.6, AA Coding Index: 46.48, Terminal-Bench Hard: 47, AA-SciCode: 45.5 **Multimodal & Grounded**: MMMU-Pro: 79, MathVision: 90.3, CharXiv: 85.9, ERQA: 69.8, MedXpertQA (MM): 71, ScreenSpot Pro: 79, SimpleVQA: 81.7, MMSearch-Plus: 41.4, RealWorldQA: 86.9, OmniDocBench 1.5: 91.4, OCRBench V2: 70.7, ODINW13: 51.1, Video-MME (with subtitle): 88, VideoMMMU: 85.4, MLVU (M-Avg): 87.4, AA-MMMU-Pro: 44.8 **Reasoning**: CritPt: 9.1, MRCRv2: 91.7, AA-LCR: 65 **Knowledge**: GPQA: 90.3, GPQA-D: 90.3, HLE: 34.7, MMLU-Pro: 88.5, MMLU-Redux: 94.5, SuperGPQA: 71.4, MMMLU: 89, Artificial Analysis Intelligence Index: 53.25, AA-GPQA Diamond: 90, AA-HLE: 33.4, AA-Omniscience Index: 2.4, AA-Omniscience Accuracy: 22.2, AA-Omniscience Hallucination Rate: 25.5 **Instruction Following**: IFEval: 94.6, IFBench: 79.1, AA-IFBench: 78 **Multilingual**: MMLU-ProX: 85.4, NOVA-63: 58.8, INCLUDE: 83, MAXIFE: 88.8, PolyMath: 84 **Mathematics**: HMMT Feb 2026: 92.9, IMOAnswerBench: 86, Apex: 22.7 ### #12 Claude Opus 4.6 - Creator: Anthropic - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 1M - Overall Score: 86/100 - Family: Claude Opus 4.6 - Variant: base - Benchmarks Covered: 47 of 247 - Profile: https://benchlm.ai/models/claude-opus-4-6 - Sibling Models: Claude Opus 4.6 (Adaptive) - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 65.4, BrowseComp: 83.7, OSWorld-Verified: 72.7, Tau2-Telecom: 84.8, Claw-Eval: 70.4, DeepSearchQA: 73.7, CyberGym: 66.6, AA Agentic Index: 64.22, GDPval-AA: 54.5, GDPval-AA: 1589, Gert Labs: 61.85 **Coding**: SWE-bench Verified: 80.84, SWE-bench Verified*: 75.6, LiveCodeBench Pro: 70.7, SWE-bench Pro: 53.4, SWE-Rebench: 65.3, React Native Evals: 84.1, Vibe Code Bench: 57.573, AA Coding Index: 47.56, Terminal-Bench Hard: 48.5, AA-SciCode: 45.7 **Multimodal & Grounded**: MMMU-Pro: 77.3, ERQA: 51.6, ScreenSpot Pro: 83.1, MedXpertQA (MM): 64.8, GDPval-AA: 1606, AA-MMMU-Pro: 72.5, Design Arena Website: 1340 **Reasoning**: AA-LCR: 58.3, CritPt: 2.8 **Knowledge**: GPQA: 91.3, GPQA-D: 89.2, SuperGPQA: 95, MMLU-Pro: 82, MMLU-Pro (Arcee): 89.1, HLE: 53, HLE w/o tools: 40, HealthBench Hard: 14.8, MedXpertQA (Text): 52.1, Artificial Analysis Intelligence Index: 46.46, AA-GPQA Diamond: 84, AA-HLE: 18.6, AA-Omniscience Index: 3.5, AA-Omniscience Accuracy: 45.2, AA-Omniscience Hallucination Rate: 76 **Instruction Following**: AA-IFBench: 44.6 **Mathematics**: AIME25 (Arcee): 99.8 ### #13 Gemini 3.5 Flash - Creator: Google - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M - Overall Score: 86/100 - Family: Gemini 3.5 Flash - Variant: base - Benchmarks Covered: 40 of 247 - Profile: https://benchlm.ai/models/gemini-3-5-flash - Related Earlier Model: Gemini 3 Flash - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 76.2, MCP Atlas: 83.6, Toolathlon: 56.5, OSWorld-Verified: 78.4, Finance Agent v2: 57.861, GDPval-AA: 1656, Tau2-Telecom: 95.3, GDPval-AA: 57.8, AA Agentic Index: 70.3, APEX-Agents-AA: 47.1, Gert Labs: 61.85 **Coding**: Terminal-Bench 2.0: 76.2, Terminal-Bench Hard: 40.9, SWE-bench Pro: 55.1, SciCode: 53.1, Vibe Code Bench: 48.683, CursorBench v3.1: 49.8, AA Coding Index: 44.98, AA-SciCode: 53.1 **Multimodal & Grounded**: CharXiv: 84.2, MMMU-Pro: 83.6, Blueprint-Bench 2: 33.6, AA-MMMU-Pro: 84.3, Design Arena Website: 1292 **Reasoning**: MRCRv2: 77.3, MRCR 1M: 26.6, ARC-AGI-2: 72.1, AA-LCR: 69.3, CritPt: 13.1 **Knowledge**: Artificial Analysis Intelligence Index: 55.33, GPQA: 92.2, GPQA-D: 92.676, HLE: 40.2, AA-Omniscience Accuracy: 51.9, AA-Omniscience Hallucination Rate: 60.7, AA-GPQA Diamond: 92.2, AA-HLE: 41, AA-Omniscience Index: 22.7 **Instruction Following**: IFBench: 76.3, AA-IFBench: 76.3 ### #14 DeepSeek V4 Pro (Max) - Creator: DeepSeek - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 1M - Overall Score: 86/100 - Family: DeepSeek V4 - Variant: pro-reasoning (max) - Benchmarks Covered: 42 of 247 - Profile: https://benchlm.ai/models/deepseek-v4-pro-max - Sibling Models: DeepSeek V4 Pro (High), DeepSeek V4 Flash (Max), DeepSeek V4 Flash (High), DeepSeek V4 Pro, DeepSeek V4 Flash, DeepSeek V4 Pro Base, DeepSeek V4 Flash Base - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 67.9, BrowseComp: 83.4, HLE w/ tools: 48.2, MCP Atlas: 73.6, GDPval-AA: 1554, Toolathlon: 51.8, AA Agentic Index: 67.19, APEX-Agents-AA: 24.3, Tau2-Telecom: 96.2, GDPval-AA: 52.7 **Coding**: LiveCodeBench: 93.5, Codeforces: 3206, SWE-bench Verified: 80.6, SWE-bench Pro: 55.4, SWE Multilingual: 76.2, Terminal-Bench 2.0: 67.9, Vibe Code Bench: 49.931, AA Coding Index: 47.47, Terminal-Bench Hard: 46.2, AA-SciCode: 50 **Multimodal & Grounded**: Design Arena Website: 1286 **Reasoning**: MRCR 1M: 83.5, CorpusQA 1M: 62, AA-LCR: 66.3, CritPt: 12.9 **Knowledge**: MMLU-Pro: 87.5, SimpleQA: 57.9, Chinese-SimpleQA: 84.4, GPQA: 90.1, GPQA-D: 90.1, HLE: 37.7, Artificial Analysis Intelligence Index: 51.51, AA-GPQA Diamond: 88.8, AA-HLE: 35.9, AA-Omniscience Index: -10, AA-Omniscience Accuracy: 43.3, AA-Omniscience Hallucination Rate: 94 **Instruction Following**: AA-IFBench: 76.5 **Mathematics**: HMMT Feb 2026: 95.2, IMOAnswerBench: 89.8, Apex: 38.3, Apex Shortlist: 90.2 ### #15 GPT-5.3 Codex - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 400K - Overall Score: ~85/100 (estimated) - Family: GPT-5.3 Codex - Variant: base - Benchmarks Covered: 25 of 247 - Profile: https://benchlm.ai/models/gpt-5-3-codex - Sibling Models: GPT-5.3-Codex-Spark - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 77.3, OSWorld-Verified: 64.7, AA Agentic Index: 60.54, Tau2-Telecom: 86, GDPval-AA: 49, GDPval-AA: 1480, Gert Labs: 57.47 **Coding**: SWE-bench Verified: 85, SWE-bench Pro: 56.8, SWE-Rebench: 58.2, Vibe Code Bench: 61.767, AA Coding Index: 53.1, Terminal-Bench Hard: 53, AA-SciCode: 53.2 **Multimodal & Grounded**: AA-MMMU-Pro: 78.5, Design Arena Website: 1208 **Reasoning**: AA-LCR: 74, CritPt: 16.9 **Knowledge**: Artificial Analysis Intelligence Index: 53.56, AA-GPQA Diamond: 91.5, AA-HLE: 39.9, AA-Omniscience Index: 9.9, AA-Omniscience Accuracy: 51.8, AA-Omniscience Hallucination Rate: 86.9 **Instruction Following**: AA-IFBench: 75.4 ### #16 Claude Opus 4.7 (Adaptive) - Creator: Anthropic - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M - Overall Score: 84/100 - Family: Claude Opus 4.7 - Variant: reasoning (adaptive) - Benchmarks Covered: 36 of 247 - Profile: https://benchlm.ai/models/claude-opus-4-7-adaptive - Sibling Models: Claude Opus 4.7 - Related Earlier Model: Claude Opus 4.6 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 69.4, BrowseComp: 79.3, MCP Atlas: 77.3, OSWorld-Verified: 78, CyberGym: 73.1, AA Agentic Index: 71.29, Tau2-Telecom: 88.6, GDPval-AA: 62.6, GDPval-AA: 1753 **Coding**: SWE-bench Verified: 87.6, SWE-bench Pro: 64.3, Terminal-Bench 2.0: 69.4, AA Coding Index: 52.51, Terminal-Bench Hard: 51.5, AA-SciCode: 54.5 **Multimodal & Grounded**: OfficeQA Pro: 43.6, CharXiv: 91, CharXiv w/o tools: 82.1, AA-MMMU-Pro: 78.8, Design Arena Website: 1338 **Reasoning**: MRCR v2 128K-256K: 59.2, ARC-AGI-2: 75.8, AA-LCR: 70.3, CritPt: 12 **Knowledge**: GPQA: 94.2, GPQA-D: 94.2, HLE: 54.7, HLE w/o tools: 46.9, Artificial Analysis Intelligence Index: 57.28, AA-GPQA Diamond: 91.4, AA-HLE: 39.6, AA-Omniscience Index: 26.2, AA-Omniscience Accuracy: 45.8, AA-Omniscience Hallucination Rate: 36.2 **Instruction Following**: AA-IFBench: 58.6 **Mathematics**: FrontierMath: 43.8 ### #17 GLM-5.1 - Creator: Z.AI - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 203K - Overall Score: 82/100 - Family: GLM-5 - Variant: snapshot (5.1) - Benchmarks Covered: 33 of 247 - Profile: https://benchlm.ai/models/glm-5-1 - Sibling Models: GLM-5 (Reasoning), GLM-5, GLM-5V-Turbo, GLM-5-Turbo - Related Earlier Model: GLM-5 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 63.5, BrowseComp: 68, TAU3-Bench: 70.6, MCP Atlas: 71.8, CyberGym: 68.7, Claw-Eval: 62.3, AA Agentic Index: 67.05, Tau2-Telecom: 97.7, GDPval-AA: 51.8, Gert Labs: 60.11 **Coding**: SWE-bench Pro: 58.4, NL2Repo: 42.7, SWE-Rebench: 62.7, Vibe Code Bench: 31.456, AA Coding Index: 43.37, Terminal-Bench Hard: 43.2, AA-SciCode: 43.8 **Multimodal & Grounded**: Design Arena Website: 1315 **Reasoning**: AA-LCR: 62.3, CritPt: 4.6 **Knowledge**: GPQA-D: 86.2, HLE: 52.3, Artificial Analysis Intelligence Index: 51.41, AA-GPQA Diamond: 86.8, AA-HLE: 28, AA-Omniscience Index: 1.9, AA-Omniscience Accuracy: 24.2, AA-Omniscience Hallucination Rate: 29.4 **Instruction Following**: AA-IFBench: 76.3 **Mathematics**: AIME26: 95.3, HMMT Nov 2025: 94, HMMT Feb 2026: 82.6, MMAnswerBench: 83.8 ### #18 Claude Sonnet 4.6 - Creator: Anthropic - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 200K - Overall Score: 82/100 - Family: Claude Sonnet 4.6 - Variant: base - Benchmarks Covered: 33 of 247 - Profile: https://benchlm.ai/models/claude-sonnet-4-6 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 59.1, OSWorld-Verified: 72.1, Claw-Eval: 67.8, CyberGym: 65.2, AA Agentic Index: 61.62, Tau2-Telecom: 79.5, GDPval-AA: 54.8, GDPval-AA: 1596, Gert Labs: 62.92 **Coding**: SWE-bench Verified: 79.6, SWE-Rebench: 60.7, React Native Evals: 80.6, Vibe Code Bench: 51.476, CursorBench v3.1: 48.8, AA Coding Index: 46.43, Terminal-Bench Hard: 46.2, AA-SciCode: 46.9 **Multimodal & Grounded**: CharXiv: 77.4, AA-MMMU-Pro: 70.6, Design Arena Website: 1327 **Reasoning**: AA-LCR: 57.7, CritPt: 0.9 **Knowledge**: GPQA: 89.9, SuperGPQA: 95, MMLU-Pro: 79.2, HLE: 49, Artificial Analysis Intelligence Index: 44.38, AA-GPQA Diamond: 79.9, AA-HLE: 13.2, AA-Omniscience Index: -2.9, AA-Omniscience Accuracy: 38, AA-Omniscience Hallucination Rate: 65.9 **Instruction Following**: AA-IFBench: 41.2 ### #19 DeepSeek V4 Pro (High) - Creator: DeepSeek - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 1M - Overall Score: 82/100 - Family: DeepSeek V4 - Variant: pro-reasoning (high) - Benchmarks Covered: 40 of 247 - Profile: https://benchlm.ai/models/deepseek-v4-pro-high - Sibling Models: DeepSeek V4 Pro (Max), DeepSeek V4 Flash (Max), DeepSeek V4 Flash (High), DeepSeek V4 Pro, DeepSeek V4 Flash, DeepSeek V4 Pro Base, DeepSeek V4 Flash Base - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 63.3, BrowseComp: 80.4, HLE w/ tools: 44.7, MCP Atlas: 74.2, Toolathlon: 49, AA Agentic Index: 66.65, Tau2-Telecom: 94.2, GDPval-AA: 52.9, GDPval-AA: 1558 **Coding**: LiveCodeBench: 89.8, Codeforces: 2919, SWE-bench Verified: 79.4, SWE-bench Pro: 54.4, SWE Multilingual: 74.1, Terminal-Bench 2.0: 63.3, AA Coding Index: 43.25, Terminal-Bench Hard: 41.7, AA-SciCode: 46.4 **Multimodal & Grounded**: Design Arena Website: 1286 **Reasoning**: MRCR 1M: 83.3, CorpusQA 1M: 56.5, AA-LCR: 65, CritPt: 10 **Knowledge**: MMLU-Pro: 87.1, SimpleQA: 46.2, Chinese-SimpleQA: 77.7, GPQA: 89.1, GPQA-D: 89.1, HLE: 34.5, Artificial Analysis Intelligence Index: 49.79, AA-GPQA Diamond: 90.5, AA-HLE: 33.5, AA-Omniscience Index: -9.7, AA-Omniscience Accuracy: 41.8, AA-Omniscience Hallucination Rate: 88.6 **Instruction Following**: AA-IFBench: 71.3 **Mathematics**: HMMT Feb 2026: 94, IMOAnswerBench: 88, Apex: 27.4, Apex Shortlist: 85.5 ### #20 o1-preview - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 200K - Overall Score: ~82/100 (estimated) - Family: o1 - Variant: snapshot (preview) - Benchmarks Covered: 2 of 247 - Profile: https://benchlm.ai/models/o1-preview - Sibling Models: o1, o1-pro - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. **Coding**: AA Coding Index: 34.05 **Knowledge**: Artificial Analysis Intelligence Index: 23.74 ### #21 Kimi K2.6 - Creator: Moonshot AI - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 256K - Overall Score: 81/100 - Family: Kimi K2.6 - Variant: base - Benchmarks Covered: 48 of 247 - Profile: https://benchlm.ai/models/kimi-k2-6 - Related Earlier Model: Kimi K2.5 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 66.7, BrowseComp: 83.2, OSWorld-Verified: 73.1, Toolathlon: 50, MCP Atlas: 55.9, Claw-Eval: 62.3, DeepSearchQA: 92.5, WideResearch: 80.8, AA Agentic Index: 65.97, Tau2-Telecom: 95.9, GDPval-AA: 49.1, GDPval-AA: 1481, APEX-Agents-AA: 28.5, Gert Labs: 56.82 **Coding**: SWE-bench Verified: 80.2, LiveCodeBench: 89.6, LiveCodeBench v6: 89.6, SWE-bench Pro: 58.6, SWE Multilingual: 76.7, SciCode: 52.2, Terminal-Bench 2.0: 66.7, Vibe Code Bench: 37.891, CursorBench v3.1: 47.6, AA Coding Index: 47.12, Terminal-Bench Hard: 43.9, AA-SciCode: 53.5 **Multimodal & Grounded**: MMMU-Pro: 79.4, MMMU-Pro w/ Python: 80.1, CharXiv: 80.4, MathVision: 87.4, V*: 96.9, AA-MMMU-Pro: 79.4, Design Arena Website: 1322 **Reasoning**: AA-LCR: 69.7, CritPt: 8 **Knowledge**: GPQA: 90.5, GPQA-D: 90.5, HLE: 34.7, Artificial Analysis Intelligence Index: 53.9, AA-GPQA Diamond: 91.1, AA-HLE: 35.9, AA-Omniscience Index: 6.4, AA-Omniscience Accuracy: 32.8, AA-Omniscience Hallucination Rate: 39.3 **Instruction Following**: AA-IFBench: 76 **Mathematics**: AIME26: 96.4, HMMT Feb 2026: 92.7, MMAnswerBench: 86 ### #22 Gemini 3 Pro - Creator: Google - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 2M - Overall Score: 80/100 - Family: Gemini 3 Pro - Variant: base - Benchmarks Covered: 26 of 247 - Profile: https://benchlm.ai/models/gemini-3-pro - Sibling Models: Gemini 3 Pro Deep Think - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: AA Agentic Index: 51.98, Tau2-Telecom: 87.1, GDPval-AA: 34.2, GDPval-AA: 1184, Gert Labs: 63.23 **Coding**: Vibe Code Bench: 14.3, AA Coding Index: 46.49, Terminal-Bench Hard: 41.7, AA-SciCode: 56.1 **Multimodal & Grounded**: MMMU-Pro: 81, MathVision: 86.6, VideoMMMU: 87.6, ScreenSpot Pro: 72.7, CharXiv: 81.4, V*: 88, AA-MMMU-Pro: 80.2 **Reasoning**: ARC-AGI-2: 31.1, AA-LCR: 70.7, CritPt: 9.1 **Knowledge**: Artificial Analysis Intelligence Index: 48.39, AA-GPQA Diamond: 90.8, AA-HLE: 37.2, AA-Omniscience Index: 15.8, AA-Omniscience Accuracy: 55.9, AA-Omniscience Hallucination Rate: 90.9 **Instruction Following**: AA-IFBench: 70.4 ### #23 MiniMax M3 - Creator: MiniMax - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 1M - Overall Score: 79/100 - Family: MiniMax M3 - Variant: base - Benchmarks Covered: 38 of 247 - Profile: https://benchlm.ai/models/minimax-m3 - Related Earlier Model: MiniMax M2.7 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 66, BrowseComp: 83.52, OSWorld-Verified: 70.06, MCP Atlas: 74.2, Claw-Eval: 74.5, AA Agentic Index: 68.62, Tau2-Telecom: 88.9, GDPval-AA: 58.5, GDPval-AA: 1670, GDPval rubrics: 74.7, BankerToolBench: 76.1 **Coding**: SWE-bench Verified: 80.5, SWE-bench Pro: 59, Terminal-Bench 2.0: 66, NL2Repo: 42.13, AA Coding Index: 43.41, Terminal-Bench Hard: 42.4, AA-SciCode: 45.4, VIBE V2: 50.1, SVG-Bench: 63.7, KernelBench Hard: 28.8 **Multimodal & Grounded**: OfficeQA Pro: 45.1, OmniDocBench 1.5: 91.6, MMMU-Pro: 78.1, VideoMMMU: 84.6, Video-MME (with subtitle): 85.4, AA-MMMU-Pro: 79.9, Design Arena Website: 1312 **Reasoning**: AA-LCR: 74, CritPt: 3.7 **Knowledge**: Artificial Analysis Intelligence Index: 54.67, AA-GPQA Diamond: 92.9, AA-HLE: 37.1, AA-Omniscience Index: 1.4, AA-Omniscience Accuracy: 15, AA-Omniscience Hallucination Rate: 16.1 **Instruction Following**: AA-IFBench: 82.9 **Mathematics**: USAMO 2026: 85.71 ### #24 GLM-5 (Reasoning) - Creator: Z.AI - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 200K - Overall Score: ~79/100 (estimated) - Family: GLM-5 - Variant: reasoning - Benchmarks Covered: 2 of 247 - Profile: https://benchlm.ai/models/glm-5-reasoning - Sibling Models: GLM-5.1, GLM-5, GLM-5V-Turbo, GLM-5-Turbo - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. **Coding**: Vibe Code Bench: 23.359 **Multimodal & Grounded**: Design Arena Website: 1292 ### #25 GPT-5.2 - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 400K - Overall Score: 78/100 - Family: GPT-5.2 - Variant: thinking - Benchmarks Covered: 29 of 247 - Profile: https://benchlm.ai/models/gpt-5-2 - Sibling Models: GPT-5.2 Instant, GPT-5.2 Pro - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: BrowseComp: 65.8, OSWorld-Verified: 47.3, AA Agentic Index: 60.2, Tau2-Telecom: 84.8, GDPval-AA: 48.3, GDPval-AA: 1467, Gert Labs: 46.54 **Coding**: SWE-bench Verified: 80, SWE-bench Pro: 55.6, Vibe Code Bench: 53.499, AA Coding Index: 48.67, Terminal-Bench Hard: 47, AA-SciCode: 52.1 **Multimodal & Grounded**: MMMU-Pro: 79.5, MathVision: 83, CharXiv: 82.1, V*: 75.9, Design Arena Website: 1240 **Reasoning**: ARC-AGI-2: 52.9, AA-LCR: 72.7, CritPt: 11.6 **Knowledge**: GPQA: 92.4, Artificial Analysis Intelligence Index: 51.28, AA-GPQA Diamond: 90.3, AA-HLE: 35.4, AA-Omniscience Index: -1, AA-Omniscience Accuracy: 43.8, AA-Omniscience Hallucination Rate: 79.7 **Instruction Following**: AA-IFBench: 75.4 ### #26 Qwen3.5 397B (Reasoning) - Creator: Alibaba - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 128K - Overall Score: ~77/100 (estimated) - Family: Qwen3.5 397B - Variant: reasoning - Benchmarks Covered: 18 of 247 - Profile: https://benchlm.ai/models/qwen3-5-397b-reasoning - Sibling Models: Qwen3.5 397B - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: AA Agentic Index: 55.83, APEX-Agents-AA: 15.3, Tau2-Telecom: 95.6, GDPval-AA: 34.5, GDPval-AA: 1190 **Coding**: AA Coding Index: 41.28, Terminal-Bench Hard: 40.9, AA-SciCode: 42 **Multimodal & Grounded**: AA-MMMU-Pro: 77.3 **Reasoning**: AA-LCR: 65.7, CritPt: 1.7 **Knowledge**: Artificial Analysis Intelligence Index: 45.05, AA-GPQA Diamond: 89.3, AA-HLE: 27.3, AA-Omniscience Index: -29.8, AA-Omniscience Accuracy: 31.4, AA-Omniscience Hallucination Rate: 89.1 **Instruction Following**: AA-IFBench: 78.8 ### #27 GPT-5.1 - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 200K - Overall Score: ~77/100 (estimated) - Family: GPT-5.1 - Variant: base - Benchmarks Covered: 20 of 247 - Profile: https://benchlm.ai/models/gpt-5-1 - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. **Agentic**: AA Agentic Index: 51.26, Tau2-Telecom: 81.9, GDPval-AA: 36.4, GDPval-AA: 1227, Gert Labs: 41.24 **Coding**: Vibe Code Bench: 24.606, AA Coding Index: 44.73, Terminal-Bench Hard: 45.5, AA-SciCode: 43.3 **Multimodal & Grounded**: AA-MMMU-Pro: 75.5, Design Arena Website: 1233 **Reasoning**: AA-LCR: 75, CritPt: 4.9 **Knowledge**: Artificial Analysis Intelligence Index: 47.7, AA-GPQA Diamond: 87.3, AA-HLE: 26.5, AA-Omniscience Index: 5.6, AA-Omniscience Accuracy: 37.6, AA-Omniscience Hallucination Rate: 51.3 **Instruction Following**: AA-IFBench: 72.9 ### #28 Claude Opus 4.5 - Creator: Anthropic - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 200K - Overall Score: 76/100 - Family: Claude Opus 4.5 - Variant: base - Benchmarks Covered: 60 of 247 - Profile: https://benchlm.ai/models/claude-opus-4-5 - Sibling Models: Claude Opus 4.5 Thinking - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: Terminal-Bench 2.0: 59.3, OSWorld-Verified: 66.3, OSWorld: 66.3, Claw-Eval: 59.6, QwenClawBench: 52.3, TAU3-Bench: 70.2, VITA-Bench: 23.3, DeepPlanning: 26.4, Toolathlon: 43.5, MCP Atlas: 42.3, MCP-Tasks: 71.8, WideResearch: 76.4, CyberGym: 50.6, AA Agentic Index: 59.22, Tau2-Telecom: 86.3, GDPval-AA: 45.9, GDPval-AA: 1418, Gert Labs: 64.23 **Coding**: SWE-bench Verified: 80.9, LiveCodeBench v6: 84.8, SWE-bench Pro: 57.1, SWE Multilingual: 77.5, NL2Repo: 43.2, AA Coding Index: 42.94, Terminal-Bench Hard: 40.9, AA-SciCode: 47 **Multimodal & Grounded**: MMMU-Pro: 70.6, MathVision: 74.3, CharXiv: 68.5, VideoMMMU: 84.4, ScreenSpot Pro: 45.7, V*: 67, AA-MMMU-Pro: 71.2, Design Arena Website: 1292 **Reasoning**: LongBench v2: 64.4, AI-Needle: 74, AA-LCR: 65.3, CritPt: 0.3 **Knowledge**: GPQA: 87, SuperGPQA: 70.6, MMLU-Pro: 89.5, MMLU-Redux: 96.6, C-Eval: 92.2, HLE: 30.8, Artificial Analysis Intelligence Index: 43.09, AA-GPQA Diamond: 81, AA-HLE: 12.9, AA-Omniscience Index: -3.9, AA-Omniscience Accuracy: 40.7, AA-Omniscience Hallucination Rate: 75.4 **Instruction Following**: IFEval: 90.9, IFBench: 58, AA-IFBench: 43 **Multilingual**: MMLU-ProX: 85.7, NOVA-63: 56.7 **Mathematics**: AIME26: 95.1, HMMT Feb 2025: 92.9, HMMT Nov 2025: 93.3, HMMT Feb 2026: 85.3, MMAnswerBench: 84 ### #29 GPT-5 (high) - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 128K - Overall Score: ~76/100 (estimated) - Family: GPT-5 - Variant: reasoning (high) - Benchmarks Covered: 19 of 247 - Profile: https://benchlm.ai/models/gpt-5-high - Sibling Models: GPT-5 (medium), GPT-5 mini, GPT-5 nano - Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added. 
**Agentic**: AA Agentic Index: 54.65, Tau2-Telecom: 84.8, GDPval-AA: 39.6, GDPval-AA: 1292
**Coding**: Vibe Code Bench: 20.088, AA Coding Index: 36.03, Terminal-Bench Hard: 32.6, AA-SciCode: 42.9
**Multimodal & Grounded**: AA-MMMU-Pro: 74.2, Design Arena Website: 1230
**Reasoning**: AA-LCR: 75.6, CritPt: 5.7
**Knowledge**: Artificial Analysis Intelligence Index: 44.63, AA-GPQA Diamond: 85.4, AA-HLE: 26.5, AA-Omniscience Index: -8.1, AA-Omniscience Accuracy: 40.7, AA-Omniscience Hallucination Rate: 82.1
**Instruction Following**: AA-IFBench: 73.1

### #30 GPT-5.2-Codex
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 400K
- Overall Score: ~76/100 (estimated)
- Family: GPT-5.2-Codex
- Variant: base
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/gpt-5-2-codex
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 56.52, Tau2-Telecom: 92.1, GDPval-AA: 39.4, GDPval-AA: 1288, Gert Labs: 51.79
**Coding**: Vibe Code Bench: 37.912, AA Coding Index: 42.96, Terminal-Bench Hard: 37.1, AA-SciCode: 54.6
**Multimodal & Grounded**: AA-MMMU-Pro: 76.3
**Reasoning**: AA-LCR: 75.7, CritPt: 8.7
**Knowledge**: Artificial Analysis Intelligence Index: 49.03, AA-GPQA Diamond: 89.9, AA-HLE: 33.5, AA-Omniscience Index: -2.5, AA-Omniscience Accuracy: 40.7, AA-Omniscience Hallucination Rate: 72.8
**Instruction Following**: AA-IFBench: 77.6

### #31 Kimi K2.5 (Reasoning)
- Creator: Moonshot AI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: 75/100
- Family: Kimi K2.5
- Variant: reasoning
- Benchmarks Covered: 28 of 247
- Profile: https://benchlm.ai/models/kimi-k2-5-reasoning
- Sibling Models: Kimi K2.5
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 50.8, BrowseComp: 60.6, AA Agentic Index: 58.94, APEX-Agents-AA: 11.5, Tau2-Telecom: 95.9, GDPval-AA: 39.2, GDPval-AA: 1284, Gert Labs: 32.58
**Coding**: SWE-bench Verified: 76.8, Vibe Code Bench: 17.536, AA Coding Index: 39.55, Terminal-Bench Hard: 34.8, AA-SciCode: 49
**Multimodal & Grounded**: MMMU-Pro: 78.5, AA-MMMU-Pro: 75.4, Design Arena Website: 1294
**Reasoning**: AA-LCR: 65.3, CritPt: 3.1
**Knowledge**: GPQA: 87.6, MMLU-Pro: 87.1, Artificial Analysis Intelligence Index: 46.81, AA-GPQA Diamond: 87.9, AA-HLE: 29.4, AA-Omniscience Index: -8.1, AA-Omniscience Accuracy: 34.3, AA-Omniscience Hallucination Rate: 64.6
**Instruction Following**: AA-IFBench: 70.2
**Mathematics**: AIME 2025: 96.1

### #32 GPT-5.1-Codex-Max
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 400K
- Overall Score: ~75/100 (estimated)
- Family: GPT-5.1-Codex-Max
- Variant: base
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/gpt-5-1-codex-max
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 50.68, Tau2-Telecom: 83, GDPval-AA: 34.5, GDPval-AA: 1191
**Coding**: Vibe Code Bench: 22.168, AA Coding Index: 36.62, Terminal-Bench Hard: 34.8, AA-SciCode: 40.2
**Multimodal & Grounded**: AA-MMMU-Pro: 72.5
**Reasoning**: AA-LCR: 67.3, CritPt: 5.7
**Knowledge**: Artificial Analysis Intelligence Index: 43.11, AA-GPQA Diamond: 86, AA-HLE: 23.4, AA-Omniscience Index: -6, AA-Omniscience Accuracy: 39.2, AA-Omniscience Hallucination Rate: 74.4
**Instruction Following**: AA-IFBench: 70

### #33 DeepSeek V4 Flash (Max)
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: 74/100
- Family: DeepSeek V4
- Variant: flash-reasoning (max)
- Benchmarks Covered: 40 of 247
- Profile: https://benchlm.ai/models/deepseek-v4-flash-max
- Sibling Models: DeepSeek V4 Pro (Max), DeepSeek V4 Pro (High), DeepSeek V4 Flash (High), DeepSeek V4 Pro, DeepSeek V4 Flash, DeepSeek V4 Pro Base, DeepSeek V4 Flash Base
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 56.9, BrowseComp: 73.2, HLE w/ tools: 45.1, MCP Atlas: 69, GDPval-AA: 1388, Toolathlon: 47.8, AA Agentic Index: 61.28, Tau2-Telecom: 95, GDPval-AA: 44.4
**Coding**: LiveCodeBench: 91.6, Codeforces: 3052, SWE-bench Verified: 79, SWE-bench Pro: 52.6, SWE Multilingual: 73.3, Terminal-Bench 2.0: 56.9, AA Coding Index: 38.71, Terminal-Bench Hard: 35.6, AA-SciCode: 44.9
**Multimodal & Grounded**: Design Arena Website: 1259
**Reasoning**: MRCR 1M: 78.7, CorpusQA 1M: 60.5, AA-LCR: 63, CritPt: 7.1
**Knowledge**: MMLU-Pro: 86.2, SimpleQA: 34.1, Chinese-SimpleQA: 78.9, GPQA: 88.1, GPQA-D: 88.1, HLE: 34.8, Artificial Analysis Intelligence Index: 46.52, AA-GPQA Diamond: 89.4, AA-HLE: 32.1, AA-Omniscience Index: -22.9, AA-Omniscience Accuracy: 37.2, AA-Omniscience Hallucination Rate: 95.8
**Instruction Following**: AA-IFBench: 79.2
**Mathematics**: HMMT Feb 2026: 94.8, IMOAnswerBench: 88.4, Apex: 33, Apex Shortlist: 85.7

### #34 Qwen3.6-27B
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 262K
- Overall Score: 72/100
- Family: Qwen3.6-27B
- Variant: base
- Benchmarks Covered: 55 of 247
- Profile: https://benchlm.ai/models/qwen3-6-27b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 59.3, Claw-Eval: 72.4, QwenClawBench: 53.4, QwenWebBench: 1487, AndroidWorld: 70.3, AA Agentic Index: 62.85, Tau2-Telecom: 94.2, GDPval-AA: 45.2, GDPval-AA: 1403, Gert Labs: 54.84
**Coding**: SWE-bench Verified: 77.2, SWE Multilingual: 71.3, SWE-bench Pro: 53.5, Terminal-Bench 2.0: 59.3, LiveCodeBench: 83.9, NL2Repo: 36.2, AA Coding Index: 36.5, Terminal-Bench Hard: 34.8, AA-SciCode: 39.8
**Multimodal & Grounded**: MMMU: 82.9, MMMU-Pro: 75.8, RealWorldQA: 84.1, DynaMath: 85.6, MStar: 81.4, SimpleVQA: 56.1, CharXiv: 78.4, CC-OCR: 81.2, CountBench: 97.8, RefCOCO (avg): 92.5, ERQA: 62.5, Video-MME (with subtitle): 87.7, VideoMMMU: 84.4, MLVU (M-Avg): 86.6, V*: 94.7, AA-MMMU-Pro: 74.6
**Reasoning**: AA-LCR: 68.7, CritPt: 1.1
**Knowledge**: MMLU-Pro: 86.2, MMLU-Redux: 93.5, SuperGPQA: 66, C-Eval: 91.4, GPQA: 87.8, HLE: 24, Artificial Analysis Intelligence Index: 45.82, AA-GPQA Diamond: 84.2, AA-HLE: 21.6, AA-Omniscience Index: -19.8, AA-Omniscience Accuracy: 19.2, AA-Omniscience Hallucination Rate: 48.3
**Instruction Following**: AA-IFBench: 67.6
**Mathematics**: HMMT Feb 2025: 93.8, HMMT Nov 2025: 90.7, HMMT Feb 2026: 84.3, MMAnswerBench: 80.8, AIME26: 94.1

### #35 Grok 4.20
- Creator: xAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 2M
- Overall Score: 71/100
- Family: Grok 4.20
- Variant: reasoning
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/grok-4-20-beta
- Sibling Models: Grok 4.20 Multi-agent
- Related Earlier Model: Grok 4.1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 47.1, DeepSearchQA: 62.8, Gert Labs: 38.36
**Coding**: LiveCodeBench Pro: 74.2, SWE-bench Verified: 76.7, SWE-bench Pro: 51.8, Vibe Code Bench: 4.064
**Multimodal & Grounded**: MMMU-Pro: 75.2, CharXiv: 60.9, ERQA: 54.1, SimpleVQA: 57.4, MedXpertQA (MM): 65.8, GDPval-AA: 1055
**Reasoning**: ARC-AGI-2: 53.3
**Knowledge**: GPQA-D: 88.5, HLE w/o tools: 31.6, HealthBench Hard: 20.3, MedXpertQA (Text): 50.2

### #36 DeepSeek V4 Flash (High)
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: 71/100
- Family: DeepSeek V4
- Variant: flash-reasoning (high)
- Benchmarks Covered: 40 of 247
- Profile: https://benchlm.ai/models/deepseek-v4-flash-high
- Sibling Models: DeepSeek V4 Pro (Max), DeepSeek V4 Pro (High), DeepSeek V4 Flash (Max), DeepSeek V4 Pro, DeepSeek V4 Flash, DeepSeek V4 Pro Base, DeepSeek V4 Flash Base
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 56.6, BrowseComp: 53.5, HLE w/ tools: 40.3, MCP Atlas: 67.4, Toolathlon: 43.5, AA Agentic Index: 62.33, Tau2-Telecom: 95.6, GDPval-AA: 45.7, GDPval-AA: 1414
**Coding**: LiveCodeBench: 88.4, Codeforces: 2816, SWE-bench Verified: 78.6, SWE-bench Pro: 52.3, SWE Multilingual: 70.2, Terminal-Bench 2.0: 56.6, AA Coding Index: 39.76, Terminal-Bench Hard: 38.6, AA-SciCode: 42
**Multimodal & Grounded**: Design Arena Website: 1259
**Reasoning**: MRCR 1M: 76.9, CorpusQA 1M: 59.3, AA-LCR: 62.7, CritPt: 3.4
**Knowledge**: MMLU-Pro: 86.4, SimpleQA: 28.9, Chinese-SimpleQA: 73.2, GPQA: 87.4, GPQA-D: 87.4, HLE: 29.4, Artificial Analysis Intelligence Index: 46, AA-GPQA Diamond: 86.7, AA-HLE: 27.8, AA-Omniscience Index: -22.3, AA-Omniscience Accuracy: 35.5, AA-Omniscience Hallucination Rate: 89.7
**Instruction Following**: AA-IFBench: 73.5
**Mathematics**: HMMT Feb 2026: 91.9, IMOAnswerBench: 85.1, Apex: 19.1, Apex Shortlist: 72.1

### #37 GPT-5 (medium)
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: ~70/100 (estimated)
- Family: GPT-5
- Variant: reasoning (medium)
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/gpt-5-medium
- Sibling Models: GPT-5 (high), GPT-5 mini, GPT-5 nano
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 45.83, Tau2-Telecom: 86.5, GDPval-AA: 25.1, GDPval-AA: 1001
**Coding**: AA Coding Index: 38.95, Terminal-Bench Hard: 37.9, AA-SciCode: 41.1
**Multimodal & Grounded**: AA-MMMU-Pro: 74.3, Design Arena Website: 1230
**Reasoning**: AA-LCR: 72.8, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 42.03, AA-GPQA Diamond: 84.2, AA-HLE: 23.5, AA-Omniscience Index: -10.1, AA-Omniscience Accuracy: 38.9, AA-Omniscience Hallucination Rate: 80.1
**Instruction Following**: AA-IFBench: 70.6

### #38 Nemotron 3 Ultra
- Creator: NVIDIA
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: 68/100
- Family: Nemotron 3 Ultra
- Variant: base
- Benchmarks Covered: 34 of 247
- Profile: https://benchlm.ai/models/nemotron-3-ultra
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 56.4, PinchBench: 90, BrowseComp: 44.4, TAU3-Bench: 70.9, GDPval-AA: 44, HLE w/ tools: 37.4, AA Agentic Index: 57.06, Tau2-Telecom: 83.3, GDPval-AA: 1379
**Coding**: SWE-bench Verified: 71.9, SWE Multilingual: 67.7, LiveCodeBench: 89, SciCode: 44.6, Terminal-Bench 2.0: 56.4, AA Coding Index: 37.55, Terminal-Bench Hard: 36.4, AA-SciCode: 39.9
**Reasoning**: AA-LCR: 67, CritPt: 3.1, LongBench v2: 61.9
**Knowledge**: GPQA: 87, GPQA-D: 87, HLE: 26.7, HLE w/o tools: 26.7, MMLU-Pro: 86.8, AA-Omniscience Accuracy: 21.6, Artificial Analysis Intelligence Index: 47.67, AA-GPQA Diamond: 86.7, AA-HLE: 26.6, AA-Omniscience Index: -0.8, AA-Omniscience Hallucination Rate: 28.5
**Instruction Following**: IFBench: 81.7, AA-IFBench: 81.4
**Multilingual**: MMLU-ProX: 83

### #39 DeepSeek V4 Pro
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: 68/100
- Family: DeepSeek V4
- Variant: pro (non-think)
- Benchmarks Covered: 23 of 247
- Profile: https://benchlm.ai/models/deepseek-v4-pro
- Sibling Models: DeepSeek V4 Pro (Max), DeepSeek V4 Pro (High), DeepSeek V4 Flash (Max), DeepSeek V4 Flash (High), DeepSeek V4 Flash, DeepSeek V4 Pro Base, DeepSeek V4 Flash Base
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 59.1, MCP Atlas: 69.4, Toolathlon: 46.3, Claw-Eval: 59.8, Gert Labs: 50.28
**Coding**: LiveCodeBench: 56.8, SWE-bench Verified: 73.6, SWE-bench Pro: 52.1, SWE Multilingual: 69.8, Terminal-Bench 2.0: 59.1
**Multimodal & Grounded**: Design Arena Website: 1286
**Reasoning**: MRCR 1M: 44.7, CorpusQA 1M: 35.6
**Knowledge**: MMLU-Pro: 82.9, SimpleQA: 45, Chinese-SimpleQA: 75.8, GPQA: 72.9, GPQA-D: 72.9, HLE: 7.7
**Mathematics**: HMMT Feb 2026: 31.7, IMOAnswerBench: 35.3, Apex: 0.4, Apex Shortlist: 9.2

### #40 GLM-4.7
- Creator: Z.AI
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: ~68/100 (estimated)
- Family: GLM-4.7
- Variant: base
- Benchmarks Covered: 28 of 247
- Profile: https://benchlm.ai/models/glm-4-7
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 41, BrowseComp: 52, VITA-Bench: 15.5, AA Agentic Index: 55.01, Tau2-Telecom: 95.9, GDPval-AA: 34.1, GDPval-AA: 1183, Gert Labs: 39.95
**Coding**: SWE-bench Verified: 73.8, LiveCodeBench: 84.9, SWE-Rebench: 58.7, AA Coding Index: 36.26, Terminal-Bench Hard: 31.8, AA-SciCode: 45.1
**Multimodal & Grounded**: Design Arena Website: 1272
**Reasoning**: AA-LCR: 64, CritPt: 1.7
**Knowledge**: GPQA: 85.7, MMLU-Pro: 84.3, HLE: 24.8, Artificial Analysis Intelligence Index: 42.11, AA-GPQA Diamond: 85.9, AA-HLE: 25.1, AA-Omniscience Index: -34.6, AA-Omniscience Accuracy: 29.3, AA-Omniscience Hallucination Rate: 90.3
**Instruction Following**: AA-IFBench: 67.9
**Mathematics**: AIME 2025: 95.7

### #41 Grok 4.1 Fast
- Creator: xAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~68/100 (estimated)
- Family: Grok 4.1 Fast
- Variant: base
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/grok-4-1-fast
- Sibling Models: Grok 4.1 Fast (Reasoning)
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 32.95, Tau2-Telecom: 63.7, GDPval-AA: 14.1, GDPval-AA: 781, Gert Labs: 47.32
**Coding**: AA Coding Index: 19.47, Terminal-Bench Hard: 14.4, AA-SciCode: 29.6
**Multimodal & Grounded**: AA-MMMU-Pro: 48.4
**Reasoning**: AA-LCR: 22, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 23.56, AA-GPQA Diamond: 63.7, AA-HLE: 5, AA-Omniscience Index: -50.9, AA-Omniscience Accuracy: 17, AA-Omniscience Hallucination Rate: 81.8
**Instruction Following**: AA-IFBench: 36.5

### #42 GLM-5
- Creator: Z.AI
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: 67/100
- Family: GLM-5
- Variant: base
- Benchmarks Covered: 52 of 247
- Profile: https://benchlm.ai/models/glm-5
- Sibling Models: GLM-5.1, GLM-5 (Reasoning), GLM-5V-Turbo, GLM-5-Turbo
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 56.2, Claw-Eval: 57.7, QwenClawBench: 54.1, TAU3-Bench: 65.6, DeepPlanning: 14.6, Toolathlon: 38, MCP Atlas: 31.1, MCP-Tasks: 60.8, WideResearch: 69.8, Tau2-Telecom: 98.2, CyberGym: 43.2, AA Agentic Index: 63.14, APEX-Agents-AA: 14.5, GDPval-AA: 44.6, GDPval-AA: 1391, Gert Labs: 50.99
**Coding**: SWE-bench Verified: 77.8, SWE-bench Verified*: 72.8, SWE-bench Pro: 55.1, SWE Multilingual: 73.3, SWE-Rebench: 62.8, React Native Evals: 74.8, AA Coding Index: 44.18, Terminal-Bench Hard: 43.2, AA-SciCode: 46.2
**Multimodal & Grounded**: Design Arena Website: 1292
**Reasoning**: LongBench v2: 60.8, AI-Needle: 63.3, AA-LCR: 63.3, CritPt: 2
**Knowledge**: GPQA: 86, GPQA-D: 86, SuperGPQA: 66.8, MMLU-Pro: 85.7, MMLU-Pro (Arcee): 85.8, HLE: 50.4, Artificial Analysis Intelligence Index: 49.77, AA-GPQA Diamond: 82, AA-HLE: 27.2, AA-Omniscience Index: 2, AA-Omniscience Accuracy: 26.9, AA-Omniscience Hallucination Rate: 34
**Instruction Following**: IFEval: 92.6, AA-IFBench: 72.3
**Multilingual**: MMLU-ProX: 83.1, NOVA-63: 55.1
**Mathematics**: AIME26: 95.8, AIME25 (Arcee): 93.3, HMMT Feb 2025: 97.5, HMMT Nov 2025: 96.9, HMMT Feb 2026: 86.4, MMAnswerBench: 82.5

### #43 Qwen3.6 Plus
- Creator: Alibaba
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: 66/100
- Family: Qwen3.6 Plus
- Variant: base
- Benchmarks Covered: 58 of 247
- Profile: https://benchlm.ai/models/qwen3-6-plus
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 61.6, Claw-Eval: 58.8, QwenClawBench: 57.2, TAU3-Bench: 70.7, VITA-Bench: 44.3, DeepPlanning: 41.5, Toolathlon: 39.8, MCP Atlas: 48.2, MCP-Tasks: 74.1, WideResearch: 74.3, AA Agentic Index: 61.67, Tau2-Telecom: 97.7, GDPval-AA: 42.5, GDPval-AA: 1350, Gert Labs: 50.6
**Coding**: SWE-bench Verified: 78.8, SWE-bench Pro: 56.6, SWE Multilingual: 73.8, LiveCodeBench v6: 87.1, Vibe Code Bench: 25.564, AA Coding Index: 42.87, Terminal-Bench Hard: 43.9, AA-SciCode: 40.7
**Multimodal & Grounded**: MMMU: 86, MMMU-Pro: 78.8, MathVision: 88, VideoMMMU: 84, ScreenSpot Pro: 68.2, CharXiv: 81.5, V*: 96.9, AA-MMMU-Pro: 78, Design Arena Website: 1264
**Reasoning**: AI-Needle: 68.3, LongBench v2: 62, AA-LCR: 69.7, CritPt: 2.9
**Knowledge**: GPQA: 90.4, SuperGPQA: 71.6, MMLU-Pro: 88.5, MMLU-Redux: 94.5, C-Eval: 93.3, HLE: 28.8, Artificial Analysis Intelligence Index: 49.98, AA-GPQA Diamond: 88.2, AA-HLE: 25.7, AA-Omniscience Index: 2.7, AA-Omniscience Accuracy: 26.2, AA-Omniscience Hallucination Rate: 32
**Instruction Following**: IFEval: 94.3, IFBench: 75.8, AA-IFBench: 75.2
**Multilingual**: MMLU-ProX: 84.7, NOVA-63: 57.9
**Mathematics**: AIME26: 95.3, HMMT Feb 2025: 96.7, HMMT Nov 2025: 94.6, HMMT Feb 2026: 87.8, MMAnswerBench: 83.8

### #44 MAI-Thinking-1
- Creator: Microsoft
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: 65/100
- Family: MAI-Thinking
- Variant: 1
- Benchmarks Covered: 14 of 247
- Profile: https://benchlm.ai/models/mai-thinking-1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 46
**Coding**: LiveCodeBench: 87.7, SWE-bench Verified: 73.5, SWE-bench Pro: 52.8, Terminal-Bench 2.0: 46
**Reasoning**: Graphwalks BFS 128K: 90
**Knowledge**: GPQA: 84.2, GPQA-D: 84.2, MMLU-Pro: 85, SimpleQA: 31
**Instruction Following**: IFBench: 85
**Mathematics**: AIME 2025: 97, AIME26: 94.5, HMMT Feb 2026: 84.9

### #45 Qwen3.6-35B-A3B
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 262K
- Overall Score: 65/100
- Family: Qwen3.6-35B-A3B
- Variant: base
- Benchmarks Covered: 58 of 247
- Profile: https://benchlm.ai/models/qwen3-6-35b-a3b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 51.5, Claw-Eval: 68.7, QwenClawBench: 52.6, QwenWebBench: 1397, TAU3-Bench: 67.2, VITA-Bench: 35.6, DeepPlanning: 25.9, Toolathlon: 26.9, MCP Atlas: 62.8, WideResearch: 60.1, AA Agentic Index: 58.34, Tau2-Telecom: 95.3, GDPval-AA: 39.9, GDPval-AA: 1298, Gert Labs: 42.65
**Coding**: SWE-bench Verified: 73.4, SWE Multilingual: 67.2, SWE-bench Pro: 49.5, Terminal-Bench 2.0: 51.5, LiveCodeBench: 80.4, NL2Repo: 29.4, AA Coding Index: 35.15, Terminal-Bench Hard: 34.8, AA-SciCode: 35.8
**Multimodal & Grounded**: MMMU: 81.7, MMMU-Pro: 75.3, RealWorldQA: 85.3, OmniDocBench 1.5: 89.9, CharXiv: 78, SimpleVQA: 58.9, CC-OCR: 81.9, AI2D_TEST: 92.7, RefCOCO (avg): 92, ODINW13: 50.8, Video-MME (with subtitle): 86.6, Video-MME (w/o subtitle): 82.5, VideoMMMU: 83.7, MLVU (M-Avg): 86.2, AA-MMMU-Pro: 75
**Reasoning**: AA-LCR: 63.7, CritPt: 0.3
**Knowledge**: MMLU-Pro: 85.2, SuperGPQA: 64.7, C-Eval: 90, GPQA: 86, HLE: 21.4, Artificial Analysis Intelligence Index: 43.49, AA-GPQA Diamond: 84.1, AA-HLE: 20.2, AA-Omniscience Index: -21.4, AA-Omniscience Accuracy: 18.9, AA-Omniscience Hallucination Rate: 49.7
**Instruction Following**: AA-IFBench: 64.4
**Mathematics**: HMMT Feb 2025: 90.7, HMMT Nov 2025: 89.1, HMMT Feb 2026: 83.6, MMAnswerBench: 78.9, AIME26: 92.7

### #46 Claude Sonnet 4.5
- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: ~64/100 (estimated)
- Family: Claude Sonnet 4.5
- Variant: base
- Benchmarks Covered: 9 of 247
- Profile: https://benchlm.ai/models/claude-sonnet-4-5
- Sibling Models: Claude Sonnet 4.5 Thinking
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 50, OSWorld-Verified: 61.4, VITA-Bench: 17, Gert Labs: 48.51
**Coding**: SWE-bench Verified: 77.2
**Multimodal & Grounded**: Design Arena Website: 1235
**Reasoning**: ARC-AGI-2: 13.6
**Knowledge**: GPQA: 83.4
**Mathematics**: AIME 2025: 87

### #47 Kimi K2.5
- Creator: Moonshot AI
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 256K
- Overall Score: 63/100
- Family: Kimi K2.5
- Variant: base
- Benchmarks Covered: 61 of 247
- Profile: https://benchlm.ai/models/kimi-k2-5
- Sibling Models: Kimi K2.5 (Reasoning)
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 50.8, BrowseComp: 60.6, Claw-Eval: 52.3, QwenClawBench: 54.3, TAU3-Bench: 65.7, DeepSearchQA: 77.1, DeepPlanning: 14.4, Toolathlon: 27.8, MCP Atlas: 29.5, MCP-Tasks: 59.1, WideResearch: 72.7, Tau2-Telecom: 95.9, AA Agentic Index: 58.94, APEX-Agents-AA: 11.5, GDPval-AA: 39.2, GDPval-AA: 1284, Gert Labs: 45.88
**Coding**: SWE-bench Verified: 76.8, SWE-bench Verified*: 70.8, LiveCodeBench: 85, LiveCodeBench v6: 85, SWE-bench Pro: 50.7, SWE Multilingual: 73, SWE-Rebench: 58.5, React Native Evals: 77.2, SciCode: 48.7, AA Coding Index: 39.55, Terminal-Bench Hard: 34.8, AA-SciCode: 49
**Multimodal & Grounded**: MMMU-Pro: 78.5, Video-MME: 87.4, MMVU: 80.4, VideoMMMU: 86.6, AA-MMMU-Pro: 75.4, Design Arena Website: 1294
**Reasoning**: LongBench v2: 61, AA-LCR: 65.3, CritPt: 3.1
**Knowledge**: GPQA: 87.6, GPQA-D: 87.6, SuperGPQA: 69.2, MMLU-Pro: 87.1, MMLU-Pro (Arcee): 87.1, HLE: 30.1, Artificial Analysis Intelligence Index: 46.81, AA-GPQA Diamond: 87.9, AA-HLE: 29.4, AA-Omniscience Index: -8.1, AA-Omniscience Accuracy: 34.3, AA-Omniscience Hallucination Rate: 64.6
**Instruction Following**: IFEval: 93.9, AA-IFBench: 70.2
**Multilingual**: MMLU-ProX: 82.3, NOVA-63: 56
**Mathematics**: AIME 2025: 96.1, AIME26: 95.8, AIME25 (Arcee): 96.3, HMMT Feb 2025: 95.4, HMMT Nov 2025: 91.1, HMMT Feb 2026: 87.1, MMAnswerBench: 81.8

### #48 Qwen3.5-122B-A10B
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 262K
- Overall Score: 63/100
- Family: Qwen3.5-122B-A10B
- Variant: base
- Benchmarks Covered: 32 of 247
- Profile: https://benchlm.ai/models/qwen3-5-122b-a10b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 49.4, BrowseComp: 63.8, OSWorld-Verified: 58, Tau2-Telecom: 93.6, AA Agentic Index: 53, GDPval-AA: 30.7, GDPval-AA: 1115
**Coding**: SWE-bench Verified: 72, AA Coding Index: 34.71, Terminal-Bench Hard: 31.1, AA-SciCode: 42
**Multimodal & Grounded**: MMMU: 83.9, MMVU: 74.7, MathVision: 86.2, CharXiv: 77.2, V*: 93.2, AA-MMMU-Pro: 75
**Reasoning**: LongBench v2: 60.2, AA-LCR: 66.7, CritPt: 0.6
**Knowledge**: MMLU-Pro: 86.7, SuperGPQA: 67.1, GPQA: 86.6, Artificial Analysis Intelligence Index: 41.6, AA-GPQA Diamond: 85.7, AA-HLE: 23.4, AA-Omniscience Index: -39.6, AA-Omniscience Accuracy: 24.7, AA-Omniscience Hallucination Rate: 85.5
**Instruction Following**: IFEval: 93.4, AA-IFBench: 75.7
**Multilingual**: MMLU-ProX: 82.2

### #49 Gemini 2.5 Pro
- Creator: Google
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~63/100 (estimated)
- Family: Gemini 2.5 Pro
- Variant: base
- Benchmarks Covered: 23 of 247
- Profile: https://benchlm.ai/models/gemini-2-5-pro
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 32.68, Tau2-Telecom: 54.1, GDPval-AA: 20.9, GDPval-AA: 918, Gert Labs: 42.01
**Coding**: SWE-bench Verified: 63.8, Vibe Code Bench: 0.4, AA Coding Index: 31.95, Terminal-Bench Hard: 26.5, AA-SciCode: 42.8
**Multimodal & Grounded**: AA-MMMU-Pro: 74.9, Design Arena Website: 1212
**Reasoning**: AA-LCR: 66, CritPt: 2.6
**Knowledge**: GPQA: 83, HLE: 18.8, Artificial Analysis Intelligence Index: 34.63, AA-GPQA Diamond: 84.4, AA-HLE: 21.1, AA-Omniscience Index: -14.3, AA-Omniscience Accuracy: 39, AA-Omniscience Hallucination Rate: 87.4
**Instruction Following**: AA-IFBench: 48.7

### #50 Grok 4
- Creator: xAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~63/100 (estimated)
- Family: Grok 4
- Variant: base
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/grok-4
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 41.5, Tau2-Telecom: 74.9, GDPval-AA: 24.6, GDPval-AA: 991, Gert Labs: 42.34
**Coding**: React Native Evals: 72.6, AA Coding Index: 40.49, Terminal-Bench Hard: 37.9, AA-SciCode: 45.7
**Multimodal & Grounded**: AA-MMMU-Pro: 68.8
**Reasoning**: AA-LCR: 68, CritPt: 2
**Knowledge**: Artificial Analysis Intelligence Index: 41.52, AA-GPQA Diamond: 87.7, AA-HLE: 23.9, AA-Omniscience Index: 3.8, AA-Omniscience Accuracy: 41.4, AA-Omniscience Hallucination Rate: 64.2
**Instruction Following**: AA-IFBench: 53.7

### #51 Qwen3.5 397B
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: 62/100
- Family: Qwen3.5 397B
- Variant: base
- Benchmarks Covered: 54 of 247
- Profile: https://benchlm.ai/models/qwen3-5-397b
- Sibling Models: Qwen3.5 397B (Reasoning)
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 52.5, BrowseComp: 62, Claw-Eval: 56.8, QwenClawBench: 51.8, TAU3-Bench: 68.4, VITA-Bench: 43.7, DeepPlanning: 37.6, Toolathlon: 36.3, MCP Atlas: 46.1, MCP-Tasks: 74.2, WideResearch: 74, AA Agentic Index: 53.32, Tau2-Telecom: 83.9, GDPval-AA: 35.8, GDPval-AA: 1217, Gert Labs: 46.76
**Coding**: SWE-bench Verified: 76.2, LiveCodeBench v6: 83.6, SWE-bench Pro: 50.9, AA Coding Index: 37.43, Terminal-Bench Hard: 35.6, AA-SciCode: 41.1
**Multimodal & Grounded**: MMMU-Pro: 79, MathVision: 88.6, CharXiv: 80.8, VideoMMMU: 84.7, ScreenSpot Pro: 65.6, V*: 95.8, AA-MMMU-Pro: 52.7
**Reasoning**: LongBench v2: 63.2, AI-Needle: 68.7, AA-LCR: 58, CritPt: 0.9
**Knowledge**: GPQA: 88.4, SuperGPQA: 70.4, MMLU-Pro: 87.8, MMLU-Redux: 94.9, C-Eval: 93, HLE: 28.7, Artificial Analysis Intelligence Index: 40.1, AA-GPQA Diamond: 86.1, AA-HLE: 18.8, AA-Omniscience Index: -36.1, AA-Omniscience Accuracy: 24.3, AA-Omniscience Hallucination Rate: 79.8
**Instruction Following**: IFEval: 92.6, AA-IFBench: 51.6
**Multilingual**: MMLU-ProX: 84.7, NOVA-63: 59.1
**Mathematics**: AIME26: 93.3, HMMT Feb 2025: 94.8, HMMT Nov 2025: 92.7, HMMT Feb 2026: 87.9, MMAnswerBench: 80.9

### #52 Qwen3.5-27B
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 262K
- Overall Score: 61/100
- Family: Qwen3.5-27B
- Variant: base
- Benchmarks Covered: 33 of 247
- Profile: https://benchlm.ai/models/qwen3-5-27b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 41.6, BrowseComp: 61, OSWorld-Verified: 56.2, Tau2-Telecom: 93.9, AA Agentic Index: 54.61, GDPval-AA: 33, GDPval-AA: 1160, Gert Labs: 39.41
**Coding**: SWE-bench Verified: 72.4, SWE-Rebench: 58.9, AA Coding Index: 34.87, Terminal-Bench Hard: 32.6, AA-SciCode: 39.5
**Multimodal & Grounded**: MMMU: 82.3, MMVU: 73.3, MathVision: 86, V*: 93.7, AA-MMMU-Pro: 75
**Reasoning**: LongBench v2: 60.6, AA-LCR: 67.3, CritPt: 0.9
**Knowledge**: MMLU-Pro: 86.1, SuperGPQA: 65.6, GPQA: 85.5, Artificial Analysis Intelligence Index: 42.07, AA-GPQA Diamond: 85.8, AA-HLE: 22.2, AA-Omniscience Index: -42, AA-Omniscience Accuracy: 21, AA-Omniscience Hallucination Rate: 79.7
**Instruction Following**: IFEval: 95, AA-IFBench: 75.6
**Multilingual**: MMLU-ProX: 82.2

### #53 DeepSeek V3.2 (Thinking)
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: ~60/100 (estimated)
- Family: DeepSeek V3.2
- Variant: reasoning
- Benchmarks Covered: 2 of 247
- Profile: https://benchlm.ai/models/deepseek-v3-2-thinking
- Sibling Models: DeepSeek V3.2
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: Vibe Code Bench: 5.108
**Multimodal & Grounded**: Design Arena Website: 1222

### #54 MiMo-V2-Flash
- Creator: Xiaomi
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: ~59/100 (estimated)
- Family: MiMo-V2-Flash
- Variant: base
- Benchmarks Covered: 21 of 247
- Profile: https://benchlm.ai/models/mimo-v2-flash
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 47.34, Tau2-Telecom: 83.9, GDPval-AA: 28, GDPval-AA: 1059
**Coding**: SWE-bench Verified: 73.4, AA Coding Index: 25.81, Terminal-Bench Hard: 25.8, AA-SciCode: 25.9
**Multimodal & Grounded**: Design Arena Website: 1212
**Reasoning**: AA-LCR: 31.3, CritPt: 0
**Knowledge**: GPQA: 83.7, MMLU-Pro: 84.9, Artificial Analysis Intelligence Index: 30.35, AA-GPQA Diamond: 65.6, AA-HLE: 8, AA-Omniscience Index: -48.5, AA-Omniscience Accuracy: 15.2, AA-Omniscience Hallucination Rate: 75.1
**Instruction Following**: AA-IFBench: 39.9
**Mathematics**: AIME 2025: 94.1

### #55 DeepSeek V4 Flash
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: 57/100
- Family: DeepSeek V4
- Variant: flash (non-think)
- Benchmarks Covered: 23 of 247
- Profile: https://benchlm.ai/models/deepseek-v4-flash
- Sibling Models: DeepSeek V4 Pro (Max), DeepSeek V4 Pro (High), DeepSeek V4 Flash (Max), DeepSeek V4 Flash (High), DeepSeek V4 Pro, DeepSeek V4 Pro Base, DeepSeek V4 Flash Base
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 49.1, MCP Atlas: 64, Toolathlon: 40.7, Claw-Eval: 57.8, Gert Labs: 54.35
**Coding**: LiveCodeBench: 55.2, SWE-bench Verified: 73.7, SWE-bench Pro: 49.1, SWE Multilingual: 69.7, Terminal-Bench 2.0: 49.1
**Multimodal & Grounded**: Design Arena Website: 1259
**Reasoning**: MRCR 1M: 37.5, CorpusQA 1M: 15.5
**Knowledge**: MMLU-Pro: 83, SimpleQA: 23.1, Chinese-SimpleQA: 71.5, GPQA: 71.2, GPQA-D: 71.2, HLE: 8.1
**Mathematics**: HMMT Feb 2026: 40.8, IMOAnswerBench: 41.9, Apex: 1, Apex Shortlist: 9.3

### #56 GPT-4.1
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~57/100 (estimated)
- Family: GPT-4.1
- Variant: base
- Benchmarks Covered: 23 of 247
- Profile: https://benchlm.ai/models/gpt-4-1
- Sibling Models: GPT-4.1 mini, GPT-4.1 nano
- Related Earlier Model: GPT-4o
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 27.26, Tau2-Telecom: 47.1, GDPval-AA: 13.8, GDPval-AA: 777, Gert Labs: 25.65
**Coding**: SWE-bench Verified: 54.6, AA Coding Index: 21.78, Terminal-Bench Hard: 13.6, AA-SciCode: 38.1
**Multimodal & Grounded**: AA-MMMU-Pro: 61.2, Design Arena Website: 1084
**Reasoning**: AA-LCR: 61, CritPt: 0
**Knowledge**: MMLU: 90.2, GPQA: 66.3, Artificial Analysis Intelligence Index: 26.28, AA-GPQA Diamond: 66.6, AA-HLE: 4.6, AA-Omniscience Index: -36.2, AA-Omniscience Accuracy: 24.2, AA-Omniscience Hallucination Rate: 79.6
**Instruction Following**: IFEval: 87.4, AA-IFBench: 43

### #57 o3-pro
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: ~57/100 (estimated)
- Family: o3
- Variant: pro
- Benchmarks Covered: 2 of 247
- Profile: https://benchlm.ai/models/o3-pro
- Sibling Models: o3, o3-mini
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.


**Knowledge**: Artificial Analysis Intelligence Index: 40.69, AA-GPQA Diamond: 84.5

### #58 o1
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: ~57/100 (estimated)
- Family: o1
- Variant: base
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/o1
- Sibling Models: o1-preview, o1-pro
- Related Earlier Model: o1-preview
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 31.08, Tau2-Telecom: 62.6, GDPval-AA: 11.5, GDPval-AA: 730
**Coding**: AA Coding Index: 20.51, Terminal-Bench Hard: 12.9, AA-SciCode: 35.8
**Reasoning**: AA-LCR: 59.3, CritPt: 0.3
**Knowledge**: MMLU: 91.8, GPQA: 75.7, Artificial Analysis Intelligence Index: 30.75, AA-GPQA Diamond: 74.7, AA-HLE: 7.7, AA-Omniscience Index: -10.5, AA-Omniscience Accuracy: 34.7, AA-Omniscience Hallucination Rate: 69.3
**Instruction Following**: IFEval: 92.2, AA-IFBench: 70.3

### #59 DeepSeek V3.2
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~56/100 (estimated)
- Family: DeepSeek V3.2
- Variant: base
- Benchmarks Covered: 22 of 247
- Profile: https://benchlm.ai/models/deepseek-v3-2
- Sibling Models: DeepSeek V3.2 (Thinking)
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Claw-Eval: 40.2, VITA-Bench: 18.5, AA Agentic Index: 39.82, Tau2-Telecom: 78.9, GDPval-AA: 18.8, GDPval-AA: 876, Gert Labs: 29.57
**Coding**: SWE-Rebench: 60.9, React Native Evals: 71.5, AA Coding Index: 34.6, Terminal-Bench Hard: 32.6, AA-SciCode: 38.7
**Multimodal & Grounded**: Design Arena Website: 1222
**Reasoning**: AA-LCR: 39, CritPt: 0.9
**Knowledge**: Artificial Analysis Intelligence Index: 32.09, AA-GPQA Diamond: 75.1, AA-HLE: 10.5, AA-Omniscience Index: -46.7, AA-Omniscience Accuracy: 24.2, AA-Omniscience Hallucination Rate: 93.5
**Instruction Following**: AA-IFBench: 49

### #60 Claude Haiku 4.5
- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: ~56/100 (estimated)
- Family: Claude Haiku 4.5
- Variant: base
- Benchmarks Covered: 2 of 247
- Profile: https://benchlm.ai/models/claude-haiku-4-5
- Sibling Models: Claude Haiku 4.5 Thinking
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: SWE-bench Verified: 73.3
**Multimodal & Grounded**: Design Arena Website: 1167

### #61 o3
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: ~56/100 (estimated)
- Family: o3
- Variant: base
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/o3
- Sibling Models: o3-pro, o3-mini
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 36.09, Tau2-Telecom: 80.7, GDPval-AA: 12.8, GDPval-AA: 757
**Coding**: AA Coding Index: 38.4, Terminal-Bench Hard: 37.1, AA-SciCode: 41
**Multimodal & Grounded**: AA-MMMU-Pro: 70.1, Design Arena Website: 1082
**Reasoning**: AA-LCR: 69.3, CritPt: 1.1
**Knowledge**: Artificial Analysis Intelligence Index: 38.37, AA-GPQA Diamond: 82.7, AA-HLE: 20, AA-Omniscience Index: -15.3, AA-Omniscience Accuracy: 38.4, AA-Omniscience Hallucination Rate: 87.1
**Instruction Following**: AA-IFBench: 71.4

### #62 Qwen3.5-35B-A3B
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 262K
- Overall Score: 55/100
- Family: Qwen3.5-35B-A3B
- Variant: base
- Benchmarks Covered: 33 of 247
- Profile: https://benchlm.ai/models/qwen3-5-35b-a3b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 40.5, BrowseComp: 61, OSWorld-Verified: 54.5, Tau2-Telecom: 89.2, AA Agentic Index: 44.11, GDPval-AA: 20.3, GDPval-AA: 905, Gert Labs: 28.96
**Coding**: SWE-bench Verified: 69.2, SWE-Rebench: 53.7, AA Coding Index: 30.25, Terminal-Bench Hard: 26.5, AA-SciCode: 37.7
**Multimodal & Grounded**: MMMU: 81.4, MMVU: 72.3, MathVision: 83.9, V*: 92.7, AA-MMMU-Pro: 72.7
**Reasoning**: LongBench v2: 59, AA-LCR: 62.7, CritPt: 0.9
**Knowledge**: MMLU-Pro: 85.3, SuperGPQA: 63.4, GPQA: 84.2, Artificial Analysis Intelligence Index: 37.12, AA-GPQA Diamond: 84.5, AA-HLE: 19.7, AA-Omniscience Index: -46.4, AA-Omniscience Accuracy: 20.5, AA-Omniscience Hallucination Rate: 84
**Instruction Following**: IFEval: 91.9, AA-IFBench: 72.5
**Multilingual**: MMLU-ProX: 81

### #63 Gemini 3 Flash
- Creator: Google
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~55/100 (estimated)
- Family: Gemini 3 Flash
- Variant: base
- Benchmarks Covered: 21 of 247
- Profile: https://benchlm.ai/models/gemini-3-flash
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Claw-Eval: 49.2, AA Agentic Index: 35.01, Tau2-Telecom: 43.3, GDPval-AA: 30.7, GDPval-AA: 1114, Gert Labs: 56.63
**Coding**: Vibe Code Bench: 20.204, AA Coding Index: 37.84, Terminal-Bench Hard: 31.8, AA-SciCode: 49.9
**Multimodal & Grounded**: AA-MMMU-Pro: 78.6, Design Arena Website: 1241
**Reasoning**: AA-LCR: 48, CritPt: 1.4
**Knowledge**: Artificial Analysis Intelligence Index: 35.05, AA-GPQA Diamond: 81.2, AA-HLE: 14.1, AA-Omniscience Index: -3.6, AA-Omniscience Accuracy: 45.5, AA-Omniscience Hallucination Rate: 90.2
**Instruction Following**: AA-IFBench: 55.1

### #64 o3-mini
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: ~55/100 (estimated)
- Family: o3
- Variant: mini
- Benchmarks Covered: 12 of 247
- Profile: https://benchlm.ai/models/o3-mini
- Sibling Models: o3-pro, o3
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Tau2-Telecom: 28.7
**Coding**: SWE-bench Verified: 49.3, AA Coding Index: 17.86, Terminal-Bench Hard: 6.8, AA-SciCode: 39.9
**Knowledge**: MMLU: 86.9, GPQA: 77.2, Artificial Analysis Intelligence Index: 25.86, AA-GPQA Diamond: 74.8, AA-HLE: 8.7
**Instruction Following**: IFEval: 93.9
**Mathematics**: AIME 2024: 87.3

### #65 MiniMax M2.7
- Creator: MiniMax
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: 53/100
- Family: MiniMax M2.7
- Variant: base
- Benchmarks Covered: 37 of 247
- Profile: https://benchlm.ai/models/minimax-m2-7
- Related Earlier Model: MiniMax M2.5
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Terminal-Bench 2.0: 57, Tau2-Telecom: 84.8, Toolathlon: 46.3, MLE-Bench Lite: 66.6, MM-ClawBench: 62.7, Claw-Eval: 48.7, AA Agentic Index: 61.49, APEX-Agents-AA: 10.6, GDPval-AA: 50.2, GDPval-AA: 1505, Gert Labs: 40.4
**Coding**: SWE-bench Verified*: 75.4, SWE-bench Pro: 56.2, SWE-Rebench: 51.9, SWE Multilingual: 76.5, Multi-SWE Bench: 52.7, VIBE-Pro: 55.6, NL2Repo: 39.8, Vibe Code Bench: 27.037, React Native Evals: 71.4, AA Coding Index: 41.93, Terminal-Bench Hard: 39.4, AA-SciCode: 47
**Multimodal & Grounded**: GDPval-AA: 1495, Design Arena Website: 1287
**Reasoning**: AA-LCR: 68.7, CritPt: 0.6
**Knowledge**: GPQA-D: 87, MMLU-Pro (Arcee): 80.8, Artificial Analysis Intelligence Index: 49.62, AA-GPQA Diamond: 87.4, AA-HLE: 28.1, AA-Omniscience Index: 0.7, AA-Omniscience Accuracy: 26.1, AA-Omniscience Hallucination Rate: 34.4
**Instruction Following**: AA-IFBench: 75.7
**Mathematics**: AIME25 (Arcee): 80

### #66 DeepSeek Coder 2.0
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~51/100 (estimated)
- Family: DeepSeek Coder 2.0
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/deepseek-coder-2-0
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #67 Claude 4.1 Opus
- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: ~51/100 (estimated)
- Family: Claude 4.1 Opus
- Variant: base
- Benchmarks Covered: 3 of 247
- Profile: https://benchlm.ai/models/claude-4-1-opus
- Sibling Models: Claude 4.1 Opus Thinking
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: SWE-bench Verified: 74.5
**Multimodal & Grounded**: Design Arena Website: 1222
**Knowledge**: Artificial Analysis Intelligence Index: 36

### #68 DeepSeek LLM 2.0
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~50/100 (estimated)
- Family: DeepSeek LLM 2.0
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/deepseek-llm-2-0
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #69 Qwen2.5-1M
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~50/100 (estimated)
- Family: Qwen2.5-1M
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/qwen2-5-1m
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #70 Claude 4 Sonnet
- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: ~50/100 (estimated)
- Family: Claude 4 Sonnet
- Variant: base
- Benchmarks Covered: 20 of 247
- Profile: https://benchlm.ai/models/claude-4-sonnet
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 39.21, Tau2-Telecom: 52.3, GDPval-AA: 31.2, GDPval-AA: 1123, Gert Labs: 39.66
**Coding**: SWE-bench Verified: 72.7, AA Coding Index: 30.6, Terminal-Bench Hard: 27.3, AA-SciCode: 37.3
**Multimodal & Grounded**: AA-MMMU-Pro: 62.4, Design Arena Website: 1191
**Reasoning**: AA-LCR: 44.3, CritPt: 1.1
**Knowledge**: Artificial Analysis Intelligence Index: 33, AA-GPQA Diamond: 68.3, AA-HLE: 4, AA-Omniscience Index: -9.2, AA-Omniscience Accuracy: 22.4, AA-Omniscience Hallucination Rate: 40.8
**Instruction Following**: AA-IFBench: 45.4

### #71 GPT-4o mini
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~49/100 (estimated)
- Family: GPT-4o
- Variant: mini
- Benchmarks Covered: 6 of 247
- Profile: https://benchlm.ai/models/gpt-4o-mini
- Sibling Models: GPT-4o
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: AA-SciCode: 22.9
**Multimodal & Grounded**: AA-MMMU-Pro: 41.5
**Knowledge**: Artificial Analysis Intelligence Index: 12.65, AA-GPQA Diamond: 42.6, AA-HLE: 4
**Instruction Following**: AA-IFBench: 31

### #72 Qwen2.5-72B
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~49/100 (estimated)
- Family: Qwen2.5-72B
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/qwen2-5-72b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #73 DeepSeekMath V2
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: ~49/100 (estimated)
- Family: DeepSeekMath
- Variant: snapshot (V2)
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/deepseekmath-v2
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.
### #74 Mistral Large 3
- Creator: Mistral
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~48/100 (estimated)
- Family: Mistral Large 3
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/mistral-large-3
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 21.7, Tau2-Telecom: 24.6, GDPval-AA: 18.2, GDPval-AA: 864
**Coding**: AA Coding Index: 22.68, Terminal-Bench Hard: 15.9, AA-SciCode: 36.2
**Multimodal & Grounded**: AA-MMMU-Pro: 55.7
**Reasoning**: AA-LCR: 34.7, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 22.8, AA-GPQA Diamond: 68, AA-HLE: 4.1, AA-Omniscience Index: -39.4, AA-Omniscience Accuracy: 24.1, AA-Omniscience Hallucination Rate: 83.7
**Instruction Following**: AA-IFBench: 36.2

### #75 Gemini 3.1 Flash-Lite
- Creator: Google
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~47/100 (estimated)
- Family: Gemini 3.1 Flash-Lite
- Variant: base
- Benchmarks Covered: 21 of 247
- Profile: https://benchlm.ai/models/gemini-3-1-flash-lite
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 25.67, APEX-Agents-AA: 12.2, Tau2-Telecom: 31.3, GDPval-AA: 21.3, GDPval-AA: 926, Gert Labs: 38.46
**Coding**: Vibe Code Bench: 0, AA Coding Index: 30.13, Terminal-Bench Hard: 24.2, AA-SciCode: 41.9
**Multimodal & Grounded**: CharXiv: 73.2, AA-MMMU-Pro: 75.5
**Reasoning**: AA-LCR: 65.3, CritPt: 1.1
**Knowledge**: Artificial Analysis Intelligence Index: 33.52, AA-GPQA Diamond: 82.2, AA-HLE: 16.2, AA-Omniscience Index: -15.5, AA-Omniscience Accuracy: 36.4, AA-Omniscience Hallucination Rate: 81.6
**Instruction Following**: AA-IFBench: 77.2

### #76 Qwen3 235B 2507 (Reasoning)
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: ~45/100 (estimated)
- Family: Qwen3 235B 2507
- Variant: reasoning (2507)
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/qwen3-235b-2507-reasoning
- Sibling Models: Qwen3 235B 2507
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #77 GPT-4.1 mini
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~45/100 (estimated)
- Family: GPT-4.1
- Variant: mini
- Benchmarks Covered: 22 of 247
- Profile: https://benchlm.ai/models/gpt-4-1-mini
- Sibling Models: GPT-4.1, GPT-4.1 nano
- Related Earlier Model: GPT-4o mini
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 25.15, Tau2-Telecom: 52.9, GDPval-AA: 6, GDPval-AA: 619
**Coding**: SWE-bench Verified: 23.6, AA Coding Index: 18.52, Terminal-Bench Hard: 7.6, AA-SciCode: 40.4
**Multimodal & Grounded**: AA-MMMU-Pro: 58.7, Design Arena Website: 1043
**Reasoning**: AA-LCR: 42.3, CritPt: 0
**Knowledge**: MMLU: 87.5, GPQA: 64.2, Artificial Analysis Intelligence Index: 22.9, AA-GPQA Diamond: 66.4, AA-HLE: 4.6, AA-Omniscience Index: -50.1, AA-Omniscience Accuracy: 17.5, AA-Omniscience Hallucination Rate: 82
**Instruction Following**: IFEval: 88.5, AA-IFBench: 38.3

### #78 Nemotron 3 Super 100B
- Creator: NVIDIA
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~43/100 (estimated)
- Family: Nemotron 3 Super 100B
- Variant: base
- Benchmarks Covered: 1 of 247
- Profile: https://benchlm.ai/models/nemotron-3-super-100b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Claw-Eval: 5.5

### #79 o4-mini (high)
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: ~43/100 (estimated)
- Family: o4-mini
- Variant: reasoning (high)
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/o4-mini-high
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #80 Claude 4.1 Opus Thinking
- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: ~43/100 (estimated)
- Family: Claude 4.1 Opus
- Variant: reasoning
- Benchmarks Covered: 11 of 247
- Profile: https://benchlm.ai/models/claude-4-1-opus-thinking
- Sibling Models: Claude 4.1 Opus
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Tau2-Telecom: 71.4
**Coding**: AA Coding Index: 36.52, Terminal-Bench Hard: 34.3, AA-SciCode: 40.9
**Multimodal & Grounded**: AA-MMMU-Pro: 67.9
**Reasoning**: AA-LCR: 66.3, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 42, AA-GPQA Diamond: 80.9, AA-HLE: 11.9
**Instruction Following**: AA-IFBench: 55.4

### #81 GPT-4o
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~42/100 (estimated)
- Family: GPT-4o
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/gpt-4o
- Sibling Models: GPT-4o mini
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 8.38, Tau2-Telecom: 25.1, GDPval-AA: 0, GDPval-AA: 348
**Coding**: AA Coding Index: 16.67, Terminal-Bench Hard: 8.3, AA-SciCode: 33.3
**Multimodal & Grounded**: Design Arena Website: 876
**Reasoning**: AA-LCR: 0, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 17.32, AA-GPQA Diamond: 54.3, AA-HLE: 3.3, AA-Omniscience Index: -10.7, AA-Omniscience Accuracy: 19.7, AA-Omniscience Hallucination Rate: 37.9
**Instruction Following**: AA-IFBench: 34.3

### #82 Kimi K2
- Creator: Moonshot AI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~41/100 (estimated)
- Family: Kimi K2
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/kimi-k2
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: Tau2-Telecom: 61.1, AA Agentic Index: 24.27, GDPval-AA: 1.2, GDPval-AA: 525
**Coding**: AA Coding Index: 22.1, Terminal-Bench Hard: 15.9, AA-SciCode: 34.5
**Multimodal & Grounded**: Design Arena Website: 1096
**Reasoning**: AA-LCR: 51, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 26.32, AA-GPQA Diamond: 76.6, AA-HLE: 7, AA-Omniscience Index: -27.5, AA-Omniscience Accuracy: 26.8, AA-Omniscience Hallucination Rate: 74.2
**Instruction Following**: AA-IFBench: 41.5

### #83 Llama 3.1 405B
- Creator: Meta
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~40/100 (estimated)
- Family: Llama 3.1 405B
- Variant: base
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/llama-3-1-405b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 6.34, Tau2-Telecom: 19, GDPval-AA: 0, GDPval-AA: 255
**Coding**: AA Coding Index: 14.5, Terminal-Bench Hard: 6.8, AA-SciCode: 29.9
**Reasoning**: AA-LCR: 24.3, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 17.38, AA-GPQA Diamond: 51.5, AA-HLE: 4.2, AA-Omniscience Index: -17.3, AA-Omniscience Accuracy: 22.3, AA-Omniscience Hallucination Rate: 51
**Instruction Following**: AA-IFBench: 39

### #84 Claude 3.5 Sonnet
- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: ~40/100 (estimated)
- Family: Claude 3.5 Sonnet
- Variant: base
- Benchmarks Covered: 2 of 247
- Profile: https://benchlm.ai/models/claude-3-5-sonnet
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: SWE-bench Verified: 49
**Knowledge**: GPQA: 59.4

### #85 Grok Code Fast 1
- Creator: xAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 256K
- Overall Score: ~39/100 (estimated)
- Family: Grok Code Fast 1
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/grok-code-fast-1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 35.63, Tau2-Telecom: 75.7, GDPval-AA: 13.1, GDPval-AA: 763
**Coding**: SWE-bench Verified: 70.8, AA Coding Index: 23.69, Terminal-Bench Hard: 17.4, AA-SciCode: 36.2
**Reasoning**: AA-LCR: 48.3, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 28.74, AA-GPQA Diamond: 72.7, AA-HLE: 7.5, AA-Omniscience Index: -36, AA-Omniscience Accuracy: 23.8, AA-Omniscience Hallucination Rate: 78.5
**Instruction Following**: AA-IFBench: 41.4

### #86 Sarvam 105B
- Creator: Sarvam
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: ~39/100 (estimated)
- Family: Sarvam 105B
- Variant: base
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/sarvam-105b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 24.69, Tau2-Telecom: 46.8, GDPval-AA: 11.9, GDPval-AA: 738
**Coding**: AA Coding Index: 9.81, Terminal-Bench Hard: 1.5, AA-SciCode: 26.4
**Reasoning**: AA-LCR: 0, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 18.16, AA-GPQA Diamond: 73.8, AA-HLE: 10.1, AA-Omniscience Index: -59.5, AA-Omniscience Accuracy: 17.6, AA-Omniscience Hallucination Rate: 93.5
**Instruction Following**: AA-IFBench: 34.4

### #87 Mistral Large 2
- Creator: Mistral
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~38/100 (estimated)
- Family: Mistral Large 2
- Variant: base
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/mistral-large-2
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 10.23, Tau2-Telecom: 30.7, GDPval-AA: 0, GDPval-AA: 323
**Coding**: AA Coding Index: 13.76, Terminal-Bench Hard: 6.1, AA-SciCode: 29.2
**Reasoning**: AA-LCR: 5.3, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 15.09, AA-GPQA Diamond: 48.6, AA-HLE: 4, AA-Omniscience Index: -34, AA-Omniscience Accuracy: 20.1, AA-Omniscience Hallucination Rate: 67.8
**Instruction Following**: AA-IFBench: 31.2

### #88 Gemini 2.5 Flash
- Creator: Google
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~37/100 (estimated)
- Family: Gemini 2.5 Flash
- Variant: base
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/gemini-2-5-flash
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 15.01, Tau2-Telecom: 14.9, GDPval-AA: 11.9, GDPval-AA: 739
**Coding**: AA Coding Index: 17.76, Terminal-Bench Hard: 12.1, AA-SciCode: 29.1
**Multimodal & Grounded**: AA-MMMU-Pro: 65.5, Design Arena Website: 1160
**Reasoning**: AA-LCR: 45.9, CritPt: 1.4
**Knowledge**: Artificial Analysis Intelligence Index: 20.56, AA-GPQA Diamond: 68.3, AA-HLE: 5.1, AA-Omniscience Index: -42, AA-Omniscience Accuracy: 26.5, AA-Omniscience Hallucination Rate: 93.3
**Instruction Following**: AA-IFBench: 39

### #89 Gemini 1.5 Pro
- Creator: Google
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 2M
- Overall Score: ~35/100 (estimated)
- Family: Gemini 1.5 Pro
- Variant: base
- Benchmarks Covered: 6 of 247
- Profile: https://benchlm.ai/models/gemini-1-5-pro
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: AA Coding Index: 23.63, AA-SciCode: 29.5
**Multimodal & Grounded**: AA-MMMU-Pro: 55
**Knowledge**: Artificial Analysis Intelligence Index: 15.99, AA-GPQA Diamond: 58.9, AA-HLE: 4.9

### #90 DeepSeek V3
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~34/100 (estimated)
- Family: DeepSeek
- Variant: snapshot (V3)
- Benchmarks Covered: 22 of 247
- Profile: https://benchlm.ai/models/deepseek-v3
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 8.83, Tau2-Telecom: 22.8, GDPval-AA: 0, GDPval-AA: 409
**Coding**: LiveCodeBench: 37.6, SWE-bench Verified: 42, AA Coding Index: 16.35, Terminal-Bench Hard: 6.8, AA-SciCode: 35.4
**Multimodal & Grounded**: Design Arena Website: 1165
**Reasoning**: AA-LCR: 29, CritPt: 0
**Knowledge**: GPQA: 59.1, MMLU-Pro: 75.9, Artificial Analysis Intelligence Index: 16.46, AA-GPQA Diamond: 55.7, AA-HLE: 3.6, AA-Omniscience Index: -41.3, AA-Omniscience Accuracy: 25.4, AA-Omniscience Hallucination Rate: 89.4
**Instruction Following**: IFEval: 86.1, AA-IFBench: 34.8

### #91 GPT-OSS 120B
- Creator: OpenAI
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~34/100 (estimated)
- Family: GPT-OSS
- Variant: base
- Benchmarks Covered: 20 of 247
- Profile: https://benchlm.ai/models/gpt-oss-120b
- Sibling Models: GPT-OSS 20B
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 37.87, APEX-Agents-AA: 3.1, Tau2-Telecom: 65.8, GDPval-AA: 22.4, GDPval-AA: 947, Gert Labs: 29.61
**Coding**: React Native Evals: 71.6, AA Coding Index: 28.62, Terminal-Bench Hard: 23.5, AA-SciCode: 38.9
**Multimodal & Grounded**: Design Arena Website: 1013
**Reasoning**: AA-LCR: 50.7, CritPt: 1.1
**Knowledge**: Artificial Analysis Intelligence Index: 33.27, AA-GPQA Diamond: 78.2, AA-HLE: 18.5, AA-Omniscience Index: -50, AA-Omniscience Accuracy: 21.5, AA-Omniscience Hallucination Rate: 91.2
**Instruction Following**: AA-IFBench: 69

### #92 Claude 3 Opus
- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: ~34/100 (estimated)
- Family: Claude 3 Opus
- Variant: base
- Benchmarks Covered: 5 of 247
- Profile: https://benchlm.ai/models/claude-3-opus
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: AA Coding Index: 19.53, AA-SciCode: 23.3
**Knowledge**: Artificial Analysis Intelligence Index: 18, AA-GPQA Diamond: 48.9, AA-HLE: 3.1

### #93 MiniCPM5-1B
- Creator: OpenBMB
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 131K
- Overall Score: ~34/100 (estimated)
- Family: MiniCPM5
- Variant: 1b
- Benchmarks Covered: 14 of 247
- Profile: https://benchlm.ai/models/minicpm5-1b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: BFCL v4: 25.15
**Coding**: LiveCodeBench Pro: 22.68, LiveCodeBench v6: 33.52
**Reasoning**: BBH: 71.89
**Knowledge**: MMLU-Pro: 48.85, MMLU-Redux: 70.06, GPQA-D: 26.26, SuperGPQA: 23.14
**Instruction Following**: IFBench: 46.67, IFEval: 80.41
**Mathematics**: AIME 2025: 40.42, AIME26: 40.42, HMMT Feb 2026: 25.76, MATH-500: 91.6

### #94 DeepSeek-R1
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: ~32/100 (estimated)
- Family: DeepSeek-R1
- Variant: base
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/deepseek-r1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 20.84, Tau2-Telecom: 36.5, GDPval-AA: 9, GDPval-AA: 680
**Coding**: AA Coding Index: 24.03, Terminal-Bench Hard: 15.9, AA-SciCode: 40.3
**Reasoning**: AA-LCR: 54.7, CritPt: 1.4
**Knowledge**: Artificial Analysis Intelligence Index: 27.07, AA-GPQA Diamond: 81.3, AA-HLE: 14.9, AA-Omniscience Index: -27.1, AA-Omniscience Accuracy: 31, AA-Omniscience Hallucination Rate: 84
**Instruction Following**: AA-IFBench: 39.6

### #95 Qwen3 235B 2507
- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~32/100 (estimated)
- Family: Qwen3 235B 2507
- Variant: base (2507)
- Benchmarks Covered: 4 of 247
- Profile: https://benchlm.ai/models/qwen3-235b-2507
- Sibling Models: Qwen3 235B 2507 (Reasoning)
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Knowledge**: GPQA: 77.5, SuperGPQA: 62.6, MMLU-Pro: 83
**Multilingual**: MMLU-ProX: 79.4

### #96 DBRX Instruct
- Creator: Databricks
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~32/100 (estimated)
- Family: DBRX
- Variant: instruct
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/dbrx-instruct
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #97 Grok 3 [Beta]
- Creator: xAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~30/100 (estimated)
- Family: Grok 3
- Variant: snapshot (beta)
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/grok-3-beta
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.
### #98 DeepSeek V3.1 (Reasoning)
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: ~29/100 (estimated)
- Family: DeepSeek V3.1
- Variant: reasoning
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/deepseek-v3-1-reasoning
- Sibling Models: DeepSeek V3.1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 18.85, Tau2-Telecom: 37.4, GDPval-AA: 5.6, GDPval-AA: 612
**Coding**: AA Coding Index: 29.71, Terminal-Bench Hard: 25, AA-SciCode: 39.1
**Multimodal & Grounded**: Design Arena Website: 1168
**Reasoning**: AA-LCR: 53.3, CritPt: 2
**Knowledge**: Artificial Analysis Intelligence Index: 27.71, AA-GPQA Diamond: 77.9, AA-HLE: 13, AA-Omniscience Index: -28.4, AA-Omniscience Accuracy: 28.8, AA-Omniscience Hallucination Rate: 80.3
**Instruction Following**: AA-IFBench: 41.5

### #99 o1-pro
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: ~28/100 (estimated)
- Family: o1
- Variant: pro
- Benchmarks Covered: 2 of 247
- Profile: https://benchlm.ai/models/o1-pro
- Sibling Models: o1-preview, o1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Knowledge**: GPQA: 79, Artificial Analysis Intelligence Index: 25.76

### #100 Phi-4
- Creator: Microsoft
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 16K
- Overall Score: ~27/100 (estimated)
- Family: Phi-4
- Variant: base
- Benchmarks Covered: 14 of 247
- Profile: https://benchlm.ai/models/phi-4
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 0, Tau2-Telecom: 0
**Coding**: AA Coding Index: 11.21, Terminal-Bench Hard: 3.8, AA-SciCode: 26
**Reasoning**: AA-LCR: 0, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 10.41, AA-GPQA Diamond: 57.5, AA-HLE: 4.1, AA-Omniscience Index: -56.7, AA-Omniscience Accuracy: 13.2, AA-Omniscience Hallucination Rate: 80.5
**Instruction Following**: AA-IFBench: 23.5

### #101 GPT-4.1 nano
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~27/100 (estimated)
- Family: GPT-4.1
- Variant: nano
- Benchmarks Covered: 21 of 247
- Profile: https://benchlm.ai/models/gpt-4-1-nano
- Sibling Models: GPT-4.1, GPT-4.1 mini
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 5.75, Tau2-Telecom: 17.3, GDPval-AA: 0, GDPval-AA: 318
**Coding**: AA Coding Index: 11.17, Terminal-Bench Hard: 3.8, AA-SciCode: 25.9
**Multimodal & Grounded**: AA-MMMU-Pro: 40.1, Design Arena Website: 1018
**Reasoning**: AA-LCR: 17, CritPt: 0
**Knowledge**: MMLU: 80.1, GPQA: 50.3, Artificial Analysis Intelligence Index: 13.04, AA-GPQA Diamond: 51.2, AA-HLE: 3.9, AA-Omniscience Index: -56.4, AA-Omniscience Accuracy: 13.3, AA-Omniscience Hallucination Rate: 80.4
**Instruction Following**: IFEval: 83.2, AA-IFBench: 32

### #102 GLM-4.5
- Creator: Z.AI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~25/100 (estimated)
- Family: GLM-4.5
- Variant: base
- Benchmarks Covered: 1 of 247
- Profile: https://benchlm.ai/models/glm-4-5
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Multimodal & Grounded**: Design Arena Website: 1215

### #103 Llama 4 Scout
- Creator: Meta
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 10M
- Overall Score: ~25/100 (estimated)
- Family: Llama 4 Scout
- Variant: base
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/llama-4-scout
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 5.17, Tau2-Telecom: 15.5, GDPval-AA: 0, GDPval-AA: 269
**Coding**: AA Coding Index: 6.68, Terminal-Bench Hard: 1.5, AA-SciCode: 17
**Multimodal & Grounded**: AA-MMMU-Pro: 52.9, Design Arena Website: 796
**Reasoning**: AA-LCR: 25.8, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 13.52, AA-GPQA Diamond: 58.7, AA-HLE: 4.3, AA-Omniscience Index: -52.4, AA-Omniscience Accuracy: 14.6, AA-Omniscience Hallucination Rate: 78.3
**Instruction Following**: AA-IFBench: 39.5

### #104 Nemotron 3 Nano 30B
- Creator: NVIDIA
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~25/100 (estimated)
- Family: Nemotron 3 Nano 30B
- Variant: base
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/nemotron-3-nano-30b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 8.48, Tau2-Telecom: 25.4, GDPval-AA: 0, GDPval-AA: 347
**Coding**: AA Coding Index: 15.76, Terminal-Bench Hard: 12.1, AA-SciCode: 23
**Reasoning**: AA-LCR: 6.7, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 13.17, AA-GPQA Diamond: 39.9, AA-HLE: 4.6, AA-Omniscience Index: -69.2, AA-Omniscience Accuracy: 11.4, AA-Omniscience Hallucination Rate: 90.9
**Instruction Following**: AA-IFBench: 37.5

### #105 Llama 3 70B
- Creator: Meta
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~25/100 (estimated)
- Family: Llama 3 70B
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/llama-3-70b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #106 DeepSeek V3.1
- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~24/100 (estimated)
- Family: DeepSeek V3.1
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/deepseek-v3-1
- Sibling Models: DeepSeek V3.1 (Reasoning)
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 31.94, Tau2-Telecom: 34.8, GDPval-AA: 28.7, GDPval-AA: 1075
**Coding**: AA Coding Index: 28.39, Terminal-Bench Hard: 24.2, AA-SciCode: 36.7
**Multimodal & Grounded**: Design Arena Website: 1168
**Reasoning**: AA-LCR: 45, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 28.13, AA-GPQA Diamond: 73.5, AA-HLE: 6.3, AA-Omniscience Index: -41.1, AA-Omniscience Accuracy: 23.1, AA-Omniscience Hallucination Rate: 83.5
**Instruction Following**: AA-IFBench: 37.8

### #107 GPT-4 Turbo
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~24/100 (estimated)
- Family: GPT-4 Turbo
- Variant: base
- Benchmarks Covered: 4 of 247
- Profile: https://benchlm.ai/models/gpt-4-turbo
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: AA Coding Index: 21.49, AA-SciCode: 31.9
**Knowledge**: Artificial Analysis Intelligence Index: 13.72, AA-HLE: 3.3

### #108 Gemini 1.0 Pro
- Creator: Google
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~24/100 (estimated)
- Family: Gemini 1.0 Pro
- Variant: base
- Benchmarks Covered: 4 of 247
- Profile: https://benchlm.ai/models/gemini-1-0-pro
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Coding**: AA-SciCode: 11.7
**Knowledge**: Artificial Analysis Intelligence Index: 8.5, AA-GPQA Diamond: 27.7, AA-HLE: 4.6

### #109 Z-1
- Creator: Z
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~23/100 (estimated)
- Family: Z-1
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/z-1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.
### #110 Mistral 8x7B
- Creator: Mistral
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~23/100 (estimated)
- Family: Mistral 8x7B
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/mistral-8x7b
- Sibling Models: Mistral 8x7B v0.2
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #111 Claude 3 Haiku
- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 200K
- Overall Score: ~23/100 (estimated)
- Family: Claude 3 Haiku
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/claude-3-haiku
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 7.02, Tau2-Telecom: 21.1, GDPval-AA: 0, GDPval-AA: 378
**Coding**: AA Coding Index: 6.72, Terminal-Bench Hard: 0.8, AA-SciCode: 18.6
**Multimodal & Grounded**: AA-MMMU-Pro: 30.8
**Reasoning**: AA-LCR: 21, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 12.26, AA-GPQA Diamond: 37.4, AA-HLE: 3.9, AA-Omniscience Index: -47.6, AA-Omniscience Accuracy: 17.2, AA-Omniscience Hallucination Rate: 78.2
**Instruction Following**: AA-IFBench: 36.1

### #112 Mixtral 8x22B Instruct v0.1
- Creator: Mistral
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: ~22/100 (estimated)
- Family: Mixtral 8x22B
- Variant: instruct (v0.1)
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/mixtral-8x22b-instruct-v0-1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.
### #113 Nemotron-4 15B
- Creator: NVIDIA
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~22/100 (estimated)
- Family: Nemotron-4 15B
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/nemotron-4-15b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #114 Moonshot v1
- Creator: Moonshot AI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~22/100 (estimated)
- Family: Moonshot
- Variant: snapshot (v1)
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/moonshot-v1
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #115 Nemotron Ultra 253B
- Creator: NVIDIA
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 32K
- Overall Score: ~22/100 (estimated)
- Family: Nemotron Ultra 253B
- Variant: base
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/nemotron-ultra-253b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 3.8, Tau2-Telecom: 11.4, GDPval-AA: 0, GDPval-AA: 238
**Coding**: AA Coding Index: 13.09, Terminal-Bench Hard: 2.3, AA-SciCode: 34.7
**Reasoning**: AA-LCR: 7.3, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 15.02, AA-GPQA Diamond: 72.8, AA-HLE: 8.1, AA-Omniscience Index: -45.5, AA-Omniscience Accuracy: 19.9, AA-Omniscience Hallucination Rate: 81.7
**Instruction Following**: AA-IFBench: 38.2

### #116 GLM-4.5-Air
- Creator: Z.AI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~18/100 (estimated)
- Family: GLM-4.5-Air
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/glm-4-5-air
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 21.01, Tau2-Telecom: 46.5, GDPval-AA: 3, GDPval-AA: 560
**Coding**: AA Coding Index: 23.82, Terminal-Bench Hard: 20.5, AA-SciCode: 30.6
**Multimodal & Grounded**: Design Arena Website: 1192
**Reasoning**: AA-LCR: 43.7, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 23.17, AA-GPQA Diamond: 73.3, AA-HLE: 6.8, AA-Omniscience Index: -62.5, AA-Omniscience Accuracy: 15.5, AA-Omniscience Hallucination Rate: 92.3
**Instruction Following**: AA-IFBench: 37.6

### #117 Llama 4 Maverick
- Creator: Meta
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: ~17/100 (estimated)
- Family: Llama 4 Maverick
- Variant: base
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/llama-4-maverick
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 7.22, Tau2-Telecom: 17.8, GDPval-AA: 0, GDPval-AA: 436
**Coding**: AA Coding Index: 15.58, Terminal-Bench Hard: 6.8, AA-SciCode: 33.1
**Multimodal & Grounded**: AA-MMMU-Pro: 62.1, Design Arena Website: 916
**Reasoning**: AA-LCR: 46, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 18.36, AA-GPQA Diamond: 67.1, AA-HLE: 4.8, AA-Omniscience Index: -41.8, AA-Omniscience Accuracy: 24.3, AA-Omniscience Hallucination Rate: 87.3
**Instruction Following**: AA-IFBench: 43

### #118 Gemma 3 27B
- Creator: Google
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~16/100 (estimated)
- Family: Gemma 3 27B
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/gemma-3-27b
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 3.51, Tau2-Telecom: 10.5, GDPval-AA: 0, GDPval-AA: 283
**Coding**: AA Coding Index: 9.59, Terminal-Bench Hard: 3.8, AA-SciCode: 21.2
**Multimodal & Grounded**: AA-MMMU-Pro: 48
**Reasoning**: AA-LCR: 5.7, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 10.31, AA-GPQA Diamond: 42.8, AA-HLE: 4.7, AA-Omniscience Index: -65.9, AA-Omniscience Accuracy: 12.5, AA-Omniscience Hallucination Rate: 89.5
**Instruction Following**: AA-IFBench: 31.8

### #119 GPT-OSS 20B
- Creator: OpenAI
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~16/100 (estimated)
- Family: GPT-OSS
- Variant: mini
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/gpt-oss-20b
- Sibling Models: GPT-OSS 120B
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 27.6, APEX-Agents-AA: 0.7, Tau2-Telecom: 60.2, GDPval-AA: 7.4, GDPval-AA: 647
**Coding**: React Native Evals: 71, AA Coding Index: 18.53, Terminal-Bench Hard: 10.6, AA-SciCode: 34.4
**Multimodal & Grounded**: Design Arena Website: 898
**Reasoning**: AA-LCR: 30.7, CritPt: 1.4
**Knowledge**: Artificial Analysis Intelligence Index: 24.47, AA-GPQA Diamond: 68.8, AA-HLE: 9.8, AA-Omniscience Index: -63.9, AA-Omniscience Accuracy: 15.5, AA-Omniscience Hallucination Rate: 94.1
**Instruction Following**: AA-IFBench: 65.1

### #120 Llama 4 Behemoth
- Creator: Meta
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~11/100 (estimated)
- Family: Llama 4 Behemoth
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/llama-4-behemoth
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #121 Nova Pro
- Creator: Amazon
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: ~10/100 (estimated)
- Family: Nova Pro
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/nova-pro
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

**Agentic**: AA Agentic Index: 4.68, Tau2-Telecom: 14, GDPval-AA: 0, GDPval-AA: 386
**Coding**: AA Coding Index: 10.98, Terminal-Bench Hard: 6.1, AA-SciCode: 20.8
**Multimodal & Grounded**: AA-MMMU-Pro: 44.3
**Reasoning**: AA-LCR: 19, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 13.48, AA-GPQA Diamond: 49.9, AA-HLE: 3.4, AA-Omniscience Index: -47.6, AA-Omniscience Accuracy: 17, AA-Omniscience Hallucination Rate: 77.9
**Instruction Following**: AA-IFBench: 38.1

### #122 Mistral 7B v0.3
- Creator: Mistral
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~4/100 (estimated)
- Family: Mistral 7B
- Variant: snapshot (v0.3)
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/mistral-7b-v0-3
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #123 Mistral 8x7B v0.2
- Creator: Mistral
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: ~1/100 (estimated)
- Family: Mistral 8x7B
- Variant: snapshot (v0.2)
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/mistral-8x7b-v0-2
- Sibling Models: Mistral 8x7B
- Coverage Note: Partial benchmark coverage; overall score is conservative until more sourced results are added.

### #124 GPT-5.5 Pro
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: GPT-5.5
- Variant: pro
- Benchmarks Covered: 4 of 247
- Profile: https://benchlm.ai/models/gpt-5-5-pro
- Sibling Models: GPT-5.5
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: BrowseComp: 90.1
**Knowledge**: HLE: 57.2, HLE w/o tools: 43.1
**Mathematics**: FrontierMath: 52.4

### #125 Holo3-35B-A3B
- Creator: H Company
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: Holo3
- Variant: 35b-a3b
- Benchmarks Covered: 1 of 247
- Profile: https://benchlm.ai/models/holo3-35b-a3b
- Sibling Models: Holo3-122B-A10B
- Related Earlier Model: Holo2-30B-A3B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: OSWorld-Verified: 82.56

### #126 Holo3-122B-A10B
- Creator: H Company
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: Holo3
- Variant: 122b-a10b
- Benchmarks Covered: 1 of 247
- Profile: https://benchlm.ai/models/holo3-122b-a10b
- Sibling Models: Holo3-35B-A3B
- Related Earlier Model: Holo2-235B-A22B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: OSWorld-Verified: 78.85

### #127 MiMo-V2.5-Pro
- Creator: Xiaomi
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: MiMo-V2.5
- Variant: pro
- Benchmarks Covered: 26 of 247
- Profile: https://benchlm.ai/models/mimo-v2-5-pro
- Sibling Models: MiMo-V2.5
- Related Earlier Model: MiMo-V2-Pro
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Claw-Eval: 63.8, GDPval-AA: 1571, TAU3-Bench: 72.9, Terminal-Bench 2.0: 68.4, AA Agentic Index: 67.44, Tau2-Telecom: 94.2, GDPval-AA: 53.6, APEX-Agents-AA: 2.4, Gert Labs: 62.7
**Coding**: SWE-bench Pro: 57.2, Terminal-Bench 2.0: 68.4, AA Coding Index: 45.53, Terminal-Bench Hard: 43.2, AA-SciCode: 50.2
**Multimodal & Grounded**: Design Arena Website: 1312
**Reasoning**: AA-LCR: 73.3, CritPt: 4
**Knowledge**: HLE: 48, HLE w/o tools: 34, Artificial Analysis Intelligence Index: 53.83, AA-GPQA Diamond: 86.6, AA-HLE: 33.8, AA-Omniscience Index: 3.6, AA-Omniscience Accuracy: 22.6, AA-Omniscience Hallucination Rate: 24.5
**Instruction Following**: AA-IFBench: 79.9

### #128 MiMo-V2-Pro
- Creator: Xiaomi
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: MiMo-V2-Pro
- Variant: base
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/mimo-v2-pro
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Claw-Eval: 57.8, AA Agentic Index: 62.8, Tau2-Telecom: 95, GDPval-AA: 45.3, GDPval-AA: 1405, Gert Labs: 36.68
**Coding**: SWE-bench Verified: 78, AA Coding Index: 41.43, Terminal-Bench Hard: 40.9, AA-SciCode: 42.5
**Reasoning**: AA-LCR: 60.7, CritPt: 0.3
**Knowledge**: Artificial Analysis Intelligence Index: 49.2, AA-GPQA Diamond: 87, AA-HLE: 28.3, AA-Omniscience Index: 4.9, AA-Omniscience Accuracy: 26.8, AA-Omniscience Hallucination Rate: 29.9
**Instruction Following**: AA-IFBench: 68.8

### #129 MiMo-V2-Omni
- Creator: Xiaomi
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 262K
- Overall Score: Not ranked yet
- Family: MiMo-V2-Omni
- Variant: base
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/mimo-v2-omni
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Claw-Eval: 45.2, AA Agentic Index: 58.56, Tau2-Telecom: 91.2, GDPval-AA: 40.9, GDPval-AA: 1317
**Coding**: SWE-bench Verified: 74.8, AA Coding Index: 35.46, Terminal-Bench Hard: 34.8, AA-SciCode: 36.7
**Multimodal & Grounded**: AA-MMMU-Pro: 69.9
**Reasoning**: AA-LCR: 66.7, CritPt: 1.1
**Knowledge**: Artificial Analysis Intelligence Index: 43.4, AA-GPQA Diamond: 82.8, AA-HLE: 19.9, AA-Omniscience Index: -17.4, AA-Omniscience Accuracy: 18.7, AA-Omniscience Hallucination Rate: 44.4
**Instruction Following**: AA-IFBench: 53.5

### #130 Composer 2.5
- Creator: Cursor
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: Not ranked yet
- Family: Composer
- Variant: base
- Benchmarks Covered: 4 of 247
- Profile: https://benchlm.ai/models/composer-2-5
- Sibling Models: Composer 2, Composer 2 Fast
- Related Earlier Model: Composer 2
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 69.3
**Coding**: Terminal-Bench 2.0: 69.3, SWE Multilingual: 79.8, CursorBench v3.1: 63.2

### #131 Muse Spark
- Creator: Meta
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 262K
- Overall Score: Not ranked yet
- Family: Muse Spark
- Variant: base
- Benchmarks Covered: 39 of 247
- Profile: https://benchlm.ai/models/muse-spark
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 59, Tau2-Telecom: 91.5, DeepSearchQA: 74.8, CyberGym: 43.5, Claw-Eval: 63.8, AA Agentic Index: 61.99, GDPval-AA: 45.9, GDPval-AA: 1417
**Coding**: SWE-bench Verified: 77.4, SWE-bench Pro: 52.4, LiveCodeBench Pro: 80, Vibe Code Bench: 19.674, AA Coding Index: 47.47, Terminal-Bench Hard: 45.5, AA-SciCode: 51.5
**Multimodal & Grounded**: CharXiv: 86.4, MMMU-Pro: 80.4, ERQA: 64.7, SimpleVQA: 71.3, ScreenSpot Pro: 84.1, ZeroBench: 33, MedXpertQA (MM): 78.4, GDPval-AA: 1444, AA-MMMU-Pro: 80.5
**Reasoning**: ARC-AGI-2: 42.5, AA-LCR: 69.7, CritPt: 11.3
**Knowledge**: GPQA-D: 89.5, HLE: 50.4, HLE w/o tools: 42.8, HealthBench Hard: 42.8, MedXpertQA (Text): 52.6, Artificial Analysis Intelligence Index: 52.15, AA-GPQA Diamond: 88.4, AA-HLE: 39.9, AA-Omniscience Index: 4.1, AA-Omniscience Accuracy: 44.6, AA-Omniscience Hallucination Rate: 73.2
**Instruction Following**: AA-IFBench: 75.9

### #132 Qwen 3.6 Max (preview)
- Creator: Alibaba
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Qwen 3.6 Max
- Variant: preview
- Benchmarks Covered: 24 of 247
- Profile: https://benchlm.ai/models/qwen-3-6-max-preview
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 65.4, QwenClawBench: 59, QwenWebBench: 1532, AA Agentic Index: 64.83, Tau2-Telecom: 95.9, GDPval-AA: 50.2, GDPval-AA: 1504
**Coding**: SWE-bench Pro: 57.3, SciCode: 47, NL2Repo: 42.9, Terminal-Bench 2.0: 65.4, AA Coding Index: 44.92, Terminal-Bench Hard: 43.9, AA-SciCode: 46.9
**Reasoning**: AA-LCR: 69.7, CritPt: 3.7
**Knowledge**: SuperGPQA: 73.9, Artificial Analysis Intelligence Index: 51.81, AA-GPQA Diamond: 88.8, AA-HLE: 28.9, AA-Omniscience Index: 10.2, AA-Omniscience Accuracy: 37.7, AA-Omniscience Hallucination Rate: 44.2
**Instruction Following**: AA-IFBench: 76.6

### #133 Mistral Medium 3.5 128B
- Creator: Mistral
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Mistral Medium 3.5
- Variant: 128b
- Benchmarks Covered: 20 of 247
- Profile: https://benchlm.ai/models/mistral-medium-3-5-128b
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: TAU3-Bench: 91.4, AA Agentic Index: 53.16, Tau2-Telecom: 94.2, GDPval-AA: 33.4, GDPval-AA: 1168, Gert Labs: 39.1
**Coding**: SWE-bench Verified: 77.6, AA Coding Index: 35.42, Terminal-Bench Hard: 33.3, AA-SciCode: 39.6
**Multimodal & Grounded**: AA-MMMU-Pro: 64.9
**Reasoning**: AA-LCR: 61, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 39.23, AA-GPQA Diamond: 74.8, AA-HLE: 12.8, AA-Omniscience Index: -36.3, AA-Omniscience Accuracy: 25.1, AA-Omniscience Hallucination Rate: 82
**Instruction Following**: AA-IFBench: 68.8

### #134 Interfaze Beta
- Creator: Interfaze
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: Interfaze
- Variant: beta
- Benchmarks Covered: 10 of 247
- Profile: https://benchlm.ai/models/interfaze-beta
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Coding**: Spider 2.0-Lite: 52.9
**Multimodal & Grounded**: OCRBench V2: 70.7, olmOCR: 85.7, RefCOCO (avg): 82.1, VoxPopuli WER: 2.4, MMMU-Pro: 71.1
**Knowledge**: GPQA: 89.9, GPQA-D: 89.9, MMMLU: 90.9
**Instruction Following**: SOB Value Acc: 79.5

### #135 Grok 4.3
- Creator: xAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: Grok 4.3
- Variant: base
- Benchmarks Covered: 25 of 247
- Profile: https://benchlm.ai/models/grok-4-3
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Tau2-Telecom: 97.7, GDPval-AA: 49.8, AA Agentic Index: 65.89, APEX-Agents-AA: 17, GDPval-AA: 1495, Gert Labs: 43.86
**Coding**: SciCode: 47.3, Terminal-Bench Hard: 37.9, AA Coding Index: 41.03, AA-SciCode: 47.3
**Multimodal & Grounded**: MMMU-Pro: 78.1, Design Arena Website: 1252, AA-MMMU-Pro: 78.1
**Reasoning**: AA-LCR: 64.3, CritPt: 8
**Knowledge**: Artificial Analysis Intelligence Index: 53.2, GPQA: 90.1, HLE: 35, AA-Omniscience Accuracy: 34.6, AA-Omniscience Hallucination Rate: 25, AA-GPQA Diamond: 90.1, AA-HLE: 35, AA-Omniscience Index: 18.3
**Instruction Following**: IFBench: 81.3, AA-IFBench: 81.3

### #136 Composer 2
- Creator: Cursor
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: Not ranked yet
- Family: Composer
- Variant: base
- Benchmarks Covered: 5 of 247
- Profile: https://benchlm.ai/models/composer-2
- Sibling Models: Composer 2.5, Composer 2 Fast
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 61.7
**Coding**: SWE Multilingual: 73.7, SWE-Rebench: 58, React Native Evals: 96.1, Terminal-Bench 2.0: 61.7

### #137 MiMo-V2.5
- Creator: Xiaomi
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: MiMo-V2.5
- Variant: base
- Benchmarks Covered: 10 of 247
- Profile: https://benchlm.ai/models/mimo-v2-5
- Sibling Models: MiMo-V2.5-Pro
- Related Earlier Model: MiMo-V2-Omni
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Claw-Eval: 62.3, MM-ClawBench: 23.8, Terminal-Bench 2.0: 65.8, Gert Labs: 46.89
**Coding**: SWE-bench Pro: 56.1, Terminal-Bench 2.0: 65.8
**Multimodal & Grounded**: Video-MME (with subtitle): 87.7, CharXiv: 81, MMMU-Pro: 77.9, Design Arena Website: 1306

### #138 Step 3.7 Flash
- Creator: StepFun
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Step 3.7 Flash
- Variant: base
- Benchmarks Covered: 29 of 247
- Profile: https://benchlm.ai/models/step-3-7-flash
- Related Earlier Model: Step 3.5 Flash
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 59.5, BrowseComp: 75.82, DeepSearchQA: 92.82, GDPval-AA: 40, Toolathlon: 49.5, Claw-Eval: 67.1, HLE w/ tools: 47.2, Gert Labs: 51.57, AA Agentic Index: 59.53, Tau2-Telecom: 98.5, GDPval-AA: 1300
**Coding**: SWE-bench Pro: 56.3, Terminal-Bench 2.0: 59.5, AA Coding Index: 37.09, Terminal-Bench Hard: 35.6, AA-SciCode: 40
**Multimodal & Grounded**: SimpleVQA: 79.2, V*: 95.3, AA-MMMU-Pro: 75.3, Design Arena Website: 1227
**Reasoning**: AA-LCR: 63.7, CritPt: 2.3
**Knowledge**: Artificial Analysis Intelligence Index: 42.59, AA-GPQA Diamond: 80.9, AA-HLE: 19.9, AA-Omniscience Index: -37.5, AA-Omniscience Accuracy: 25.4, AA-Omniscience Hallucination Rate: 84.4
**Instruction Following**: AA-IFBench: 67.3

### #139 Grok 4.20 Multi-agent
- Creator: xAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 2M
- Overall Score: Not ranked yet
- Family: Grok 4.20
- Variant: multi-agent
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/grok-4-20-multi-agent-beta
- Sibling Models: Grok 4.20
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
### #140 GPT-5.4 mini
- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 400K
- Overall Score: Not ranked yet
- Family: GPT-5.4
- Variant: mini
- Benchmarks Covered: 28 of 247
- Profile: https://benchlm.ai/models/gpt-5-4-mini
- Sibling Models: GPT-5.4 Pro, GPT-5.4, GPT-5.4 nano
- Related Earlier Model: GPT-5 mini
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 60, OSWorld-Verified: 72.1, MCP Atlas: 57.7, Toolathlon: 42.9, Tau2-Telecom: 83.3, AA Agentic Index: 58.88, APEX-Agents-AA: 28.2, GDPval-AA: 46.9, GDPval-AA: 1438
**Coding**: Vibe Code Bench: 47.969, AA Coding Index: 51.48, Terminal-Bench Hard: 52.3, AA-SciCode: 49.9
**Multimodal & Grounded**: MMMU-Pro: 76.6, MMMU-Pro w/ Python: 78, AA-MMMU-Pro: 73.3
**Reasoning**: AA-LCR: 69.3, CritPt: 10
**Knowledge**: GPQA: 88, HLE: 41.5, HLE w/o tools: 28.2, Artificial Analysis Intelligence Index: 48.9, AA-GPQA Diamond: 87.5, AA-HLE: 26.6, AA-Omniscience Index: -18.7, AA-Omniscience Accuracy: 37.5, AA-Omniscience Hallucination Rate: 89.8
**Instruction Following**: AA-IFBench: 73.3

### #141 Gemma 4 31B
- Creator: Google
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Gemma 4
- Variant: 31b
- Benchmarks Covered: 25 of 247
- Profile: https://benchlm.ai/models/gemma-4-31b
- Sibling Models: Gemma 4 26B A4B, Gemma 4 12B, Gemma 4 E4B, Gemma 4 E2B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: AA Agentic Index: 40.94, Tau2-Telecom: 59.9, GDPval-AA: 30.7, GDPval-AA: 1113, Gert Labs: 35.26
**Coding**: SWE-Rebench: 41.6, React Native Evals: 75.2, AA Coding Index: 38.71, Terminal-Bench Hard: 36.4, AA-SciCode: 43.4
**Multimodal & Grounded**: MMMU-Pro: 76.9, AA-MMMU-Pro: 73.4
**Reasoning**: AA-LCR: 62, CritPt: 1.4
**Knowledge**: GPQA: 84.3, MMLU-Pro: 85.2, HLE: 26.5, HLE w/o tools: 19.5, Artificial Analysis Intelligence Index: 39.18, AA-GPQA Diamond: 85.7, AA-HLE: 22.7, AA-Omniscience Index: -45.4, AA-Omniscience Accuracy: 19.9, AA-Omniscience Hallucination Rate: 81.6
**Instruction Following**: AA-IFBench: 75.6

### #142 Exaone 4.0 32B
- Creator: LG AI Research
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Exaone 4.0
- Variant: 32b
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/exaone-4-0-32b
- Sibling Models: Exaone 4.0 1.2B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 1.36, Tau2-Telecom: 4.1, GDPval-AA: 0, GDPval-AA: 328 **Coding**: AA Coding Index: 9.42, Terminal-Bench Hard: 1.5, AA-SciCode: 25.2 **Reasoning**: AA-LCR: 8, CritPt: 0 **Knowledge**: MMLU-Pro: 81.8, Artificial Analysis Intelligence Index: 11.66, AA-GPQA Diamond: 62.8, AA-HLE: 4.9, AA-Omniscience Index: -62.3, AA-Omniscience Accuracy: 10.4, AA-Omniscience Hallucination Rate: 81 **Instruction Following**: AA-IFBench: 33.5 **Mathematics**: AIME 2025: 85.3 ### #143 GLM-5V-Turbo - Creator: Z.AI - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 200K - Overall Score: Not ranked yet - Family: GLM-5 - Variant: vision-turbo - Benchmarks Covered: 19 of 247 - Profile: https://benchlm.ai/models/glm-5v-turbo - Sibling Models: GLM-5.1, GLM-5 (Reasoning), GLM-5, GLM-5-Turbo - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: Claw-Eval: 53.8, AA Agentic Index: 61.07, Tau2-Telecom: 98.5, GDPval-AA: 41.4, GDPval-AA: 1328, Gert Labs: 30.76 **Coding**: AA Coding Index: 36.22, Terminal-Bench Hard: 32.6, AA-SciCode: 43.5 **Multimodal & Grounded**: AA-MMMU-Pro: 72.8 **Reasoning**: AA-LCR: 61, CritPt: 0.6 **Knowledge**: Artificial Analysis Intelligence Index: 42.85, AA-GPQA Diamond: 80.9, AA-HLE: 15.8, AA-Omniscience Index: -19, AA-Omniscience Accuracy: 29.1, AA-Omniscience Hallucination Rate: 67.9 **Instruction Following**: AA-IFBench: 61.1 ### #144 GPT-5.4 nano - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 400K - Overall Score: Not ranked yet - Family: GPT-5.4 - Variant: nano - Benchmarks Covered: 28 of 247 - Profile: https://benchlm.ai/models/gpt-5-4-nano - Sibling Models: GPT-5.4 Pro, GPT-5.4, GPT-5.4 mini - Related Earlier Model: GPT-5 nano - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise 
excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 46.3, OSWorld-Verified: 39, MCP Atlas: 56.1, Toolathlon: 35.5, Tau2-Telecom: 76, AA Agentic Index: 47.6, APEX-Agents-AA: 24.9, GDPval-AA: 34.8, GDPval-AA: 1195
**Coding**: Vibe Code Bench: 26.097, AA Coding Index: 43.91, Terminal-Bench Hard: 42.4, AA-SciCode: 46.9
**Multimodal & Grounded**: MMMU-Pro: 66.1, MMMU-Pro w/ Python: 69.5, AA-MMMU-Pro: 65.4
**Reasoning**: AA-LCR: 66, CritPt: 9.3
**Knowledge**: GPQA: 82.8, HLE: 37.7, HLE w/o tools: 24.3, Artificial Analysis Intelligence Index: 43.98, AA-GPQA Diamond: 81.7, AA-HLE: 26.5, AA-Omniscience Index: -29.5, AA-Omniscience Accuracy: 25.4, AA-Omniscience Hallucination Rate: 73.6
**Instruction Following**: AA-IFBench: 75.9

### #145 Mellum2-12B-A2.5B-Thinking

- Creator: JetBrains
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Mellum2 12B-A2.5B
- Variant: thinking
- Benchmarks Covered: 6 of 247
- Profile: https://benchlm.ai/models/mellum2-12b-a2-5b-thinking
- Sibling Models: Mellum2-12B-A2.5B-Instruct
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: BFCL v4: 45.6
**Coding**: LiveCodeBench: 69.9
**Knowledge**: MMLU-Redux: 86.2, GPQA: 57.6, GPQA-D: 57.6
**Instruction Following**: IFEval: 76.5

### #146 Hy3 Preview

- Creator: Tencent
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Hy3
- Variant: preview (preview)
- Benchmarks Covered: 25 of 247
- Profile: https://benchlm.ai/models/hy3-preview
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: Terminal-Bench 2.0: 54.4, Tau2-Telecom: 92.7, GDPval-AA: 36.8, AA Agentic Index: 55.67, GDPval-AA: 1236, Gert Labs: 36.91
**Coding**: SWE-bench Verified: 74.4, Terminal-Bench 2.0: 54.4, Terminal-Bench Hard: 34.1, SciCode: 41.2, AA Coding Index: 36.46, AA-SciCode: 41.2
**Reasoning**: AA-LCR: 54.7, CritPt: 4.6
**Knowledge**: Artificial Analysis Intelligence Index: 41.85, GPQA: 87.2, GPQA-D: 87.2, HLE: 25.5, AA-Omniscience Accuracy: 28, AA-Omniscience Hallucination Rate: 86.9, AA-GPQA Diamond: 86.7, AA-HLE: 25.5, AA-Omniscience Index: -34.6
**Instruction Following**: IFBench: 63.1, AA-IFBench: 63.1

### #147 ZAYA1-8B

- Creator: Zyphra
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 131K
- Overall Score: Not ranked yet
- Family: ZAYA1
- Variant: 8b
- Benchmarks Covered: 11 of 247
- Profile: https://benchlm.ai/models/zaya1-8b
- Sibling Models: ZAYA1-74B-Preview
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: BFCL v4: 39.22
**Coding**: LiveCodeBench v6: 65.8
**Knowledge**: GPQA: 71, GPQA-D: 71, MMLU-Pro: 74.2
**Instruction Following**: IFEval: 85.58, IFBench: 52.56
**Mathematics**: AIME26: 89.1, HMMT Feb 2026: 71.6, IMOAnswerBench: 59.3, Apex: 32.2

### #148 Gemma 4 26B A4B

- Creator: Google
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Gemma 4
- Variant: 26b-a4b
- Benchmarks Covered: 21 of 247
- Profile: https://benchlm.ai/models/gemma-4-26b-a4b
- Sibling Models: Gemma 4 31B, Gemma 4 12B, Gemma 4 E4B, Gemma 4 E2B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 32.15, Tau2-Telecom: 43.6, GDPval-AA: 25.7, GDPval-AA: 1014
**Coding**: AA Coding Index: 22.44, Terminal-Bench Hard: 13.6, AA-SciCode: 40
**Multimodal & Grounded**: MMMU-Pro: 73.8, AA-MMMU-Pro: 69.2
**Reasoning**: AA-LCR: 55.7, CritPt: 0
**Knowledge**: MMLU-Pro: 82.6, HLE: 17.2, HLE w/o tools: 8.7, Artificial Analysis Intelligence Index: 31.21, AA-GPQA Diamond: 79.2, AA-HLE: 18.3, AA-Omniscience Index: -48.1, AA-Omniscience Accuracy: 18.2, AA-Omniscience Hallucination Rate: 80.9
**Instruction Following**: AA-IFBench: 72.4

### #149 ZAYA1-74B-Preview

- Creator: Zyphra
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: ZAYA1
- Variant: 74b-preview (preview)
- Benchmarks Covered: 7 of 247
- Profile: https://benchlm.ai/models/zaya1-74b-preview
- Sibling Models: ZAYA1-8B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Tau2-Airline: 56.1
**Coding**: LiveCodeBench v6: 65.7, SWE-bench Verified: 53.2
**Knowledge**: MMLU-Pro: 68.1, GPQA: 57.3, GPQA-D: 57.3
**Mathematics**: AIME26: 76.4

### #150 Mistral Small 4 (Reasoning)

- Creator: Mistral
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Mistral Small 4
- Variant: reasoning
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/mistral-small-4-reasoning
- Sibling Models: Mistral Small 4
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 25.87, Tau2-Telecom: 41.2, GDPval-AA: 18, GDPval-AA: 859
**Coding**: AA Coding Index: 24.27, Terminal-Bench Hard: 17.4, AA-SciCode: 38
**Multimodal & Grounded**: AA-MMMU-Pro: 56.8
**Reasoning**: AA-LCR: 44.7, CritPt: 0.3
**Knowledge**: Artificial Analysis Intelligence Index: 27.8, AA-GPQA Diamond: 76.9, AA-HLE: 9.5, AA-Omniscience Index: -29.9, AA-Omniscience Accuracy: 22.1, AA-Omniscience Hallucination Rate: 66.8
**Instruction Following**: AA-IFBench: 48.2

### #151 Laguna M.1

- Creator: Poolside
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 131K
- Overall Score: Not ranked yet
- Family: Laguna
- Variant: m-1
- Benchmarks Covered: 5 of 247
- Profile: https://benchlm.ai/models/laguna-m-1
- Sibling Models: Laguna XS.2
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 45.8
**Coding**: SWE-bench Verified: 74.6, SWE Multilingual: 63.1, SWE-bench Pro: 49.2, Terminal-Bench 2.0: 45.8

### #152 K-Exaone

- Creator: LG AI Research
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: K-Exaone
- Variant: base
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/k-exaone
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 38.14, Tau2-Telecom: 74.3, GDPval-AA: 16.2, GDPval-AA: 824
**Coding**: AA Coding Index: 27.03, Terminal-Bench Hard: 22.7, AA-SciCode: 35.6
**Reasoning**: AA-LCR: 55.7, CritPt: 1.1
**Knowledge**: Artificial Analysis Intelligence Index: 32.12, AA-GPQA Diamond: 78.3, AA-HLE: 13.1, AA-Omniscience Index: -57.9, AA-Omniscience Accuracy: 16.5, AA-Omniscience Hallucination Rate: 89.1
**Instruction Following**: AA-IFBench: 64.7

### #153 Nemotron 3 Nano Omni 30B A3B

- Creator: NVIDIA
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Nemotron 3 Nano Omni
- Variant: 30b-a3b
- Benchmarks Covered: 32 of 247
- Profile: https://benchlm.ai/models/nemotron-3-nano-omni-30b-a3b
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: OSWorld: 47.4, Tau2-Telecom: 45.3, AA Agentic Index: 23.87, GDPval-AA: 13.1, GDPval-AA: 762
**Coding**: LiveCodeBench: 63.2, SciCode: 32, AA Coding Index: 14.81, Terminal-Bench Hard: 8.3, AA-SciCode: 27.8
**Multimodal & Grounded**: MMMU: 70.8, MMLongBench-Doc: 57.5, CharXiv: 76.25, ScreenSpot Pro: 57.8, Video-MME (w/o subtitle): 72.2, AI2D_TEST: 88.5, RefCOCO (avg): 90.5, AA-MMMU-Pro: 53.2
**Reasoning**: AA-LCR: 35.7, CritPt: 0
**Knowledge**: MMLU-Pro: 77.3, GPQA: 72.2, GPQA-D: 72.2, Artificial Analysis Intelligence Index: 21.43, AA-GPQA Diamond: 46.9, AA-HLE: 5.3, AA-Omniscience Index: -56, AA-Omniscience Accuracy: 14.8, AA-Omniscience Hallucination Rate: 83.1
**Instruction Following**: IFBench: 74.2, AA-IFBench: 63.2
**Mathematics**: AIME 2025: 82.1

### #154 Gemma 4 12B

- Creator: Google
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Gemma 4
- Variant: 12b
- Benchmarks Covered: 29 of 247
- Profile: https://benchlm.ai/models/gemma-4-12b
- Sibling Models: Gemma 4 31B, Gemma 4
26B A4B, Gemma 4 E4B, Gemma 4 E2B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: AA Agentic Index: 24.63, Tau2-Telecom: 36.3, GDPval-AA: 18.8, GDPval-AA: 875
**Coding**: LiveCodeBench: 72, AA Coding Index: 24.85, Terminal-Bench Hard: 18.2, AA-SciCode: 38.2
**Multimodal & Grounded**: MMMU-Pro: 69.1, MathVision: 79.7, MedXpertQA (MM): 48.7, AA-MMMU-Pro: 69.7
**Reasoning**: BBH: 53, MRCRv2: 43.4, AA-LCR: 55.3, CritPt: 0
**Knowledge**: GPQA: 78.8, GPQA-D: 78.8, MMLU-Pro: 77.2, HLE w/o tools: 5.2, MMMLU: 83.4, Artificial Analysis Intelligence Index: 29.17, AA-GPQA Diamond: 75.3, AA-HLE: 14.8, AA-Omniscience Index: -51.9, AA-Omniscience Accuracy: 16, AA-Omniscience Hallucination Rate: 80.8
**Instruction Following**: AA-IFBench: 73.5
**Mathematics**: AIME26: 77.5

### #155 Ternary Bonsai 8B

- Creator: Prism ML
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: Ternary Bonsai
- Variant: 8b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/ternary-bonsai-8b
- Sibling Models: Ternary Bonsai 1.7B, Ternary Bonsai 4B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #156 LFM2.5-8B-A1B

- Creator: LiquidAI
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: LFM2.5-8B-A1B
- Variant: reasoning
- Benchmarks Covered: 22 of 247
- Profile: https://benchlm.ai/models/lfm2-5-8b-a1b
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: BFCL v4: 49.73, GDPval-AA: 0, GDPval-AA: 255, AA Agentic Index: 5.36, Tau2-Telecom: 16.1
**Coding**: AA Coding Index: 5.62, Terminal-Bench Hard: 4.5, AA-SciCode: 7.8
**Reasoning**: AA-LCR: 0, CritPt: 0
**Knowledge**: AA-GPQA Diamond: 51.3, AA-HLE: 6.9, AA-Omniscience Index: -33.3, AA-Omniscience Accuracy: 9.4, AA-Omniscience Hallucination Rate: 47, Artificial Analysis Intelligence Index: 14.19
**Instruction Following**: IFEval: 91.84, IFBench: 56.47, AA-IFBench: 55.6
**Mathematics**: MATH-500: 88.76, AIME 2025: 42.53, AIME26: 50

### #157 Mistral Medium 3

- Creator: Mistral
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Mistral Medium 3
- Variant: base
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/mistral-medium-3
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: AA Agentic Index: 13.74, Tau2-Telecom: 24.3, GDPval-AA: 4.2, GDPval-AA: 585
**Coding**: AA Coding Index: 13.56, Terminal-Bench Hard: 3.8, AA-SciCode: 33.1
**Multimodal & Grounded**: AA-MMMU-Pro: 53, Design Arena Website: 1124
**Reasoning**: AA-LCR: 28, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 18.76, AA-GPQA Diamond: 57.8, AA-HLE: 4.3, AA-Omniscience Index: -31.5, AA-Omniscience Accuracy: 18.3, AA-Omniscience Hallucination Rate: 60.9
**Instruction Following**: AA-IFBench: 39.3

### #158 DeepSeek V4 Pro Base

- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: DeepSeek V4
- Variant: base (pro-base)
- Benchmarks Covered: 24 of 247
- Profile: https://benchlm.ai/models/deepseek-v4-pro-base
- Sibling Models: DeepSeek V4 Pro (Max), DeepSeek V4 Pro (High), DeepSeek V4 Flash (Max), DeepSeek V4 Flash (High), DeepSeek V4 Pro, DeepSeek V4 Flash, DeepSeek V4 Flash Base
- Coverage
Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Coding**: BigCodeBench: 59.2, HumanEval: 76.8
**Reasoning**: BBH: 87.5, DROP: 88.7, HellaSwag: 88, WinoGrande: 81.5, CLUEWSC: 85.2, LongBench v2: 51.5
**Knowledge**: AGIEval: 83.1, MMLU: 90.1, MMLU-Redux: 90.8, MMLU-Pro: 73.5, MMMLU: 90.3, C-Eval: 93.1, CMMLU: 90.8, MultiLoKo: 51.1, SimpleQA: 55.2, SuperGPQA: 53.9, FACTS Parametric: 62.6, TriviaQA: 85.6
**Multilingual**: MGSM: 84.4
**Mathematics**: GSM8K: 92.6, MATH: 64.5, CMath: 90.9

### #159 Mistral Small 4

- Creator: Mistral
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: Mistral Small 4
- Variant: base
- Benchmarks Covered: 17 of 247
- Profile: https://benchlm.ai/models/mistral-small-4
- Sibling Models: Mistral Small 4 (Reasoning)
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: AA Agentic Index: 25.87, Tau2-Telecom: 41.2, GDPval-AA: 18, GDPval-AA: 859
**Coding**: AA Coding Index: 24.27, Terminal-Bench Hard: 17.4, AA-SciCode: 38
**Multimodal & Grounded**: AA-MMMU-Pro: 56.8
**Reasoning**: AA-LCR: 44.7, CritPt: 0.3
**Knowledge**: Artificial Analysis Intelligence Index: 27.8, AA-GPQA Diamond: 76.9, AA-HLE: 9.5, AA-Omniscience Index: -29.9, AA-Omniscience Accuracy: 22.1, AA-Omniscience Hallucination Rate: 66.8
**Instruction Following**: AA-IFBench: 48.2

### #160 Grok 3 Mini

- Creator: xAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Grok 3 Mini
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/grok-3-mini
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
### #161 Sarvam 30B

- Creator: Sarvam
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: Sarvam 30B
- Variant: base
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/sarvam-30b
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: AA Agentic Index: 11.5, Tau2-Telecom: 34.5, GDPval-AA: 0, GDPval-AA: 359
**Coding**: AA Coding Index: 7.92, Terminal-Bench Hard: 2.3, AA-SciCode: 19.2
**Reasoning**: AA-LCR: 0, CritPt: 0.3
**Knowledge**: Artificial Analysis Intelligence Index: 12.34, AA-GPQA Diamond: 63.3, AA-HLE: 7, AA-Omniscience Index: -72, AA-Omniscience Accuracy: 12.7, AA-Omniscience Hallucination Rate: 97
**Instruction Following**: AA-IFBench: 26.5

### #162 Command A+

- Creator: Cohere
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Command A
- Variant: plus
- Benchmarks Covered: 20 of 247
- Profile: https://benchlm.ai/models/command-a
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: Tau2-Telecom: 80.7, AA Agentic Index: 40.9, GDPval-AA: 20.9, GDPval-AA: 919
**Coding**: Terminal-Bench Hard: 25, AA Coding Index: 29.28, AA-SciCode: 37.8
**Multimodal & Grounded**: MMMU: 75.1, MMMU-Pro: 63, CharXiv: 52.7, AA-MMMU-Pro: 63.2
**Reasoning**: AA-LCR: 46, CritPt: 0.3
**Knowledge**: Artificial Analysis Intelligence Index: 37.16, AA-GPQA Diamond: 76.1, AA-HLE: 11.4, AA-Omniscience Index: -4, AA-Omniscience Accuracy: 8.9, AA-Omniscience Hallucination Rate: 14.1
**Instruction Following**: AA-IFBench: 73.9

### #163 Laguna XS.2

- Creator: Poolside
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 131K
- Overall Score: Not ranked yet
- Family: Laguna
- Variant: xs-2
- Benchmarks Covered: 5 of 247
- Profile: https://benchlm.ai/models/laguna-xs-2
- Sibling Models: Laguna M.1
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Terminal-Bench 2.0: 35.7
**Coding**: SWE-bench Verified: 69.9, SWE Multilingual: 57.7, SWE-bench Pro: 46.3, Terminal-Bench 2.0: 35.7

### #164 Gemma 4 E4B

- Creator: Google
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Gemma 4
- Variant: e4b
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/gemma-4-e4b
- Sibling Models: Gemma 4 31B, Gemma 4 26B A4B, Gemma 4 12B, Gemma 4 E2B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 6.92, Tau2-Telecom: 20.8, GDPval-AA: 0, GDPval-AA: 303
**Coding**: AA Coding Index: 13.7, Terminal-Bench Hard: 8.3, AA-SciCode: 24.4
**Multimodal & Grounded**: AA-MMMU-Pro: 51.4
**Reasoning**: AA-LCR: 30.7, CritPt: 0.6
**Knowledge**: GPQA: 58.6, MMLU-Pro: 69.4, Artificial Analysis Intelligence Index: 18.76, AA-GPQA Diamond: 57.6, AA-HLE: 3.7, AA-Omniscience Index: -20, AA-Omniscience Accuracy: 8.6, AA-Omniscience Hallucination Rate: 31.3
**Instruction Following**: AA-IFBench: 44.2

### #165 Ling 2.6 Flash

- Creator: InclusionAI
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 262K
- Overall Score: Not ranked yet
- Family: Ling 2.6
- Variant: flash
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/ling-2-6-flash
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: Tau2-Telecom: 86, GDPval-AA: 785, AA Agentic Index: 38.06, GDPval-AA: 14.2
**Coding**: SciCode: 27, AA Coding Index: 23.17, Terminal-Bench Hard: 21.2, AA-SciCode: 27.1
**Reasoning**: AA-LCR: 25, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 26.16, GPQA: 59, AA-GPQA Diamond: 59.3, AA-HLE: 6.2, AA-Omniscience Index: -65.7, AA-Omniscience Accuracy: 15.4, AA-Omniscience Hallucination Rate: 95.8
**Instruction Following**: IFBench: 57, AA-IFBench: 57.4

### #166 Granite-4.0-1B

- Creator: IBM
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Granite 4.0 1B
- Variant: dense
- Benchmarks Covered: 16 of 247
- Profile: https://benchlm.ai/models/granite-4-0-1b
- Sibling Models: Granite-4.0-H-1B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 7.6, Tau2-Telecom: 22.8, GDPval-AA: 0, GDPval-AA: 255
**Coding**: AA Coding Index: 2.89, Terminal-Bench Hard: 0, AA-SciCode: 8.7
**Reasoning**: AA-LCR: 4, CritPt: 0
**Knowledge**: Artificial Analysis Intelligence Index: 7.34, AA-GPQA Diamond: 28.1, AA-HLE: 5.1, AA-Omniscience Index: -81.8, AA-Omniscience Accuracy: 6.1, AA-Omniscience Hallucination Rate: 93.5
**Instruction Following**: AA-IFBench: 20.5

### #167 DeepSeek V4 Flash Base

- Creator: DeepSeek
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: DeepSeek V4
- Variant: base (flash-base)
- Benchmarks Covered: 24 of 247
- Profile: https://benchlm.ai/models/deepseek-v4-flash-base
- Sibling Models: DeepSeek V4 Pro (Max), DeepSeek V4 Pro (High), DeepSeek V4 Flash (Max), DeepSeek V4 Flash (High), DeepSeek V4 Pro, DeepSeek V4 Flash, DeepSeek V4 Pro Base
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Coding**: BigCodeBench: 56.8, HumanEval: 69.5
**Reasoning**: BBH: 86.9, DROP: 88.6, HellaSwag: 85.7, WinoGrande: 79.5, CLUEWSC: 82.2, LongBench v2: 44.7
**Knowledge**: AGIEval: 82.6, MMLU: 88.7, MMLU-Redux: 89.4, MMLU-Pro: 68.3, MMMLU: 88.8, C-Eval: 92.1, CMMLU: 90.4, MultiLoKo: 42.2, SimpleQA: 30.1, SuperGPQA: 46.5, FACTS Parametric: 33.9, TriviaQA: 82.8
**Multilingual**: MGSM: 85.7
**Mathematics**: GSM8K: 90.8, MATH: 57.4, CMath: 93.6

### #168 Qwen3.5 Flash

- Creator: Alibaba
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: Qwen3.5 Flash
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/qwen3-5-flash
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
### #169 Ternary Bonsai 1.7B

- Creator: Prism ML
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: Not ranked yet
- Family: Ternary Bonsai
- Variant: 1-7b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/ternary-bonsai-1-7b
- Sibling Models: Ternary Bonsai 8B, Ternary Bonsai 4B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #170 Mellum2-12B-A2.5B-Instruct

- Creator: JetBrains
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Mellum2 12B-A2.5B
- Variant: instruct
- Benchmarks Covered: 6 of 247
- Profile: https://benchlm.ai/models/mellum2-12b-a2-5b-instruct
- Sibling Models: Mellum2-12B-A2.5B-Thinking
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: BFCL v4: 44.2
**Coding**: LiveCodeBench: 37.2
**Knowledge**: MMLU-Redux: 78.1, GPQA: 40.9, GPQA-D: 40.9
**Instruction Following**: IFEval: 75.8

### #171 Claude Opus 4.6 (Adaptive)

- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: Claude Opus 4.6
- Variant: reasoning (adaptive)
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/claude-opus-4-6-adaptive
- Sibling Models: Claude Opus 4.6
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 67.58, APEX-Agents-AA: 33, Tau2-Telecom: 92.1, GDPval-AA: 55.9, GDPval-AA: 1619
**Coding**: Vibe Code Bench: 53.498, AA Coding Index: 48.09, Terminal-Bench Hard: 46.2, AA-SciCode: 51.9
**Multimodal & Grounded**: AA-MMMU-Pro: 75.4
**Reasoning**: AA-LCR: 70.7, CritPt: 12.6
**Knowledge**: Artificial Analysis Intelligence Index: 52.95, AA-GPQA Diamond: 89.6, AA-HLE: 36.7, AA-Omniscience Index: 13.5, AA-Omniscience Accuracy: 46.4, AA-Omniscience Hallucination Rate: 61.3
**Instruction Following**: AA-IFBench: 53.1

### #172 Qwen2.5-VL-32B

- Creator: Alibaba
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: Not ranked yet
- Family: Qwen2.5-VL-32B
- Variant: base
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/qwen2-5-vl-32b
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #173 Gemma 4 E2B

- Creator: Google
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Gemma 4
- Variant: e2b
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/gemma-4-e2b
- Sibling Models: Gemma 4 31B, Gemma 4 26B A4B, Gemma 4 12B, Gemma 4 E4B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 6.92, Tau2-Telecom: 20.8, GDPval-AA: 0, GDPval-AA: 270
**Coding**: AA Coding Index: 9, Terminal-Bench Hard: 3, AA-SciCode: 20.9
**Multimodal & Grounded**: AA-MMMU-Pro: 44.6
**Reasoning**: AA-LCR: 15, CritPt: 0
**Knowledge**: GPQA: 43.4, MMLU-Pro: 60, Artificial Analysis Intelligence Index: 15.21, AA-GPQA Diamond: 43.3, AA-HLE: 4.8, AA-Omniscience Index: -24, AA-Omniscience Accuracy: 6.7, AA-Omniscience Hallucination Rate: 32.9
**Instruction Following**: AA-IFBench: 38

### #174 1-bit Bonsai 1.7B

- Creator: Prism ML
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: Not ranked yet
- Family: 1-bit Bonsai
- Variant: 1-7b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/1-bit-bonsai-1-7b
- Sibling Models: 1-bit Bonsai 8B, 1-bit Bonsai 4B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #175 Claude Opus 4.7

- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 1M
- Overall Score: Not ranked yet
- Family: Claude Opus 4.7
- Variant: base
- Benchmarks Covered: 21 of 247
- Profile: https://benchlm.ai/models/claude-opus-4-7
- Sibling Models: Claude Opus 4.7 (Adaptive)
- Related Earlier Model: Claude Opus 4.6
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: AA Agentic Index: 64.64, Tau2-Telecom: 74, GDPval-AA: 58.6, GDPval-AA: 1672, Gert Labs: 65.59
**Coding**: Vibe Code Bench: 71.003, React Native Evals: 82.8, AA Coding Index: 53.07, Terminal-Bench Hard: 54.5, AA-SciCode: 50.1
**Multimodal & Grounded**: AA-MMMU-Pro: 76.4, Design Arena Website: 1338
**Reasoning**: AA-LCR: 67, CritPt: 5.1
**Knowledge**: Artificial Analysis Intelligence Index: 51.82, AA-GPQA Diamond: 88.5, AA-HLE: 31.2, AA-Omniscience Index: 14.2, AA-Omniscience Accuracy: 43.5, AA-Omniscience Hallucination Rate: 51.9
**Instruction Following**: AA-IFBench: 43.6

### #176 Ternary Bonsai 4B

- Creator: Prism ML
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: Not ranked yet
- Family: Ternary Bonsai
- Variant: 4b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/ternary-bonsai-4b
- Sibling Models: Ternary Bonsai 8B, Ternary Bonsai 1.7B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #177 1-bit Bonsai 8B

- Creator: Prism ML
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: 1-bit Bonsai
- Variant: 8b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/1-bit-bonsai-8b
- Sibling Models: 1-bit Bonsai 1.7B, 1-bit Bonsai 4B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
### #178 Claude Opus 4.5 Thinking

- Creator: Anthropic
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: Not ranked yet
- Family: Claude Opus 4.5
- Variant: reasoning (thinking)
- Benchmarks Covered: 19 of 247
- Profile: https://benchlm.ai/models/claude-opus-4-5-thinking
- Sibling Models: Claude Opus 4.5
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: AA Agentic Index: 59.64, Tau2-Telecom: 89.5, GDPval-AA: 47.3, GDPval-AA: 1446
**Coding**: Vibe Code Bench: 20.63, AA Coding Index: 47.83, Terminal-Bench Hard: 47, AA-SciCode: 49.5
**Multimodal & Grounded**: AA-MMMU-Pro: 74, Design Arena Website: 1292
**Reasoning**: AA-LCR: 74, CritPt: 4.6
**Knowledge**: Artificial Analysis Intelligence Index: 49.73, AA-GPQA Diamond: 86.6, AA-HLE: 28.4, AA-Omniscience Index: 13.3, AA-Omniscience Accuracy: 45.7, AA-Omniscience Hallucination Rate: 59.8
**Instruction Following**: AA-IFBench: 58

### #179 GLM-5-Turbo

- Creator: Z.AI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 200K
- Overall Score: Not ranked yet
- Family: GLM-5
- Variant: turbo
- Benchmarks Covered: 18 of 247
- Profile: https://benchlm.ai/models/glm-5-turbo
- Sibling Models: GLM-5.1, GLM-5 (Reasoning), GLM-5, GLM-5V-Turbo
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
**Agentic**: Claw-Eval: 55.8, AA Agentic Index: 66.13, Tau2-Telecom: 98.5, GDPval-AA: 49.7, GDPval-AA: 1493
**Coding**: AA Coding Index: 36.77, Terminal-Bench Hard: 33.3, AA-SciCode: 43.6
**Multimodal & Grounded**: Design Arena Website: 1322
**Reasoning**: AA-LCR: 60.7, CritPt: 0.3
**Knowledge**: Artificial Analysis Intelligence Index: 46.76, AA-GPQA Diamond: 84.7, AA-HLE: 25.4, AA-Omniscience Index: -15.1, AA-Omniscience Accuracy: 29, AA-Omniscience Hallucination Rate: 62.2
**Instruction Following**: AA-IFBench: 73.2

### #180 GPT-5.2 Instant

- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: GPT-5.2
- Variant: instant
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/gpt-5-2-instant
- Sibling Models: GPT-5.2, GPT-5.2 Pro
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #181 GPT-5.2 Pro

- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 400K
- Overall Score: Not ranked yet
- Family: GPT-5.2
- Variant: pro
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/gpt-5-2-pro
- Sibling Models: GPT-5.2, GPT-5.2 Instant
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #182 GPT-5.3 Instant

- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 400K
- Overall Score: Not ranked yet
- Family: GPT-5.3
- Variant: instant
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/gpt-5-3-instant
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
### #183 GPT-5.3-Codex-Spark

- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 256K
- Overall Score: Not ranked yet
- Family: GPT-5.3 Codex
- Variant: spark
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/gpt-5-3-codex-spark
- Sibling Models: GPT-5.3 Codex
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #184 GPT-5.1-Codex

- Creator: OpenAI
- Source Type: Proprietary
- Reasoning: Reasoning
- Context Window: 400K
- Overall Score: Not ranked yet
- Family: GPT-5.1-Codex
- Variant: base
- Benchmarks Covered: 20 of 247
- Profile: https://benchlm.ai/models/gpt-5-1-codex
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

**Agentic**: AA Agentic Index: 50.68, Tau2-Telecom: 83, GDPval-AA: 34.5, GDPval-AA: 1191, Gert Labs: 49.68
**Coding**: Vibe Code Bench: 13.115, AA Coding Index: 36.62, Terminal-Bench Hard: 34.8, AA-SciCode: 40.2
**Multimodal & Grounded**: AA-MMMU-Pro: 72.5, Design Arena Website: 1206
**Reasoning**: AA-LCR: 67.3, CritPt: 5.7
**Knowledge**: Artificial Analysis Intelligence Index: 43.11, AA-GPQA Diamond: 86, AA-HLE: 23.4, AA-Omniscience Index: -6, AA-Omniscience Accuracy: 39.2, AA-Omniscience Hallucination Rate: 74.4
**Instruction Following**: AA-IFBench: 70

### #185 1-bit Bonsai 4B

- Creator: Prism ML
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: Not ranked yet
- Family: 1-bit Bonsai
- Variant: 4b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/1-bit-bonsai-4b
- Sibling Models: 1-bit Bonsai 1.7B, 1-bit Bonsai 8B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.
### #186 Grok 4.1 Fast (Reasoning) - Creator: xAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 2M - Overall Score: Not ranked yet - Family: Grok 4.1 Fast - Variant: reasoning - Benchmarks Covered: 18 of 247 - Profile: https://benchlm.ai/models/grok-4-1-fast-reasoning - Sibling Models: Grok 4.1 Fast - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: AA Agentic Index: 49.31, Tau2-Telecom: 93.3, GDPval-AA: 27.3, GDPval-AA: 1045 **Coding**: Vibe Code Bench: 1.2, AA Coding Index: 30.9, Terminal-Bench Hard: 24.2, AA-SciCode: 44.2 **Multimodal & Grounded**: AA-MMMU-Pro: 63.3 **Reasoning**: AA-LCR: 68, CritPt: 2.9 **Knowledge**: Artificial Analysis Intelligence Index: 38.61, AA-GPQA Diamond: 85.3, AA-HLE: 17.6, AA-Omniscience Index: -28.7, AA-Omniscience Accuracy: 25.3, AA-Omniscience Hallucination Rate: 72.4 **Instruction Following**: AA-IFBench: 52.7 ### #187 GLM-4.6 - Creator: Z.AI - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 200K - Overall Score: Not ranked yet - Family: GLM-4.6 - Variant: base - Benchmarks Covered: 17 of 247 - Profile: https://benchlm.ai/models/glm-4-6 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: AA Agentic Index: 42.89, Tau2-Telecom: 76.9, GDPval-AA: 24.3, GDPval-AA: 985 **Coding**: Vibe Code Bench: 3.09, AA Coding Index: 30.23, Terminal-Bench Hard: 28.8, AA-SciCode: 33.1 **Reasoning**: AA-LCR: 26.3, CritPt: 0 **Knowledge**: Artificial Analysis Intelligence Index: 30.24, AA-GPQA Diamond: 63.2, AA-HLE: 5.2, AA-Omniscience Index: -31.6, AA-Omniscience Accuracy: 20.8, AA-Omniscience Hallucination Rate: 66.1 **Instruction Following**: AA-IFBench: 36.7 ### #188 Grok 4 Fast (Reasoning) - Creator: xAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 2M - Overall Score: Not ranked yet - Family: Grok 4 Fast - Variant: reasoning - Benchmarks Covered: 18 of 247 - Profile: https://benchlm.ai/models/grok-4-fast-reasoning - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: AA Agentic Index: 39.51, Tau2-Telecom: 65.8, GDPval-AA: 25.7, GDPval-AA: 1015 **Coding**: Vibe Code Bench: 0, AA Coding Index: 27.36, Terminal-Bench Hard: 18.9, AA-SciCode: 44.2 **Multimodal & Grounded**: AA-MMMU-Pro: 61.8 **Reasoning**: AA-LCR: 64.7, CritPt: 2.9 **Knowledge**: Artificial Analysis Intelligence Index: 35.06, AA-GPQA Diamond: 84.7, AA-HLE: 17, AA-Omniscience Index: -28.4, AA-Omniscience Accuracy: 22.6, AA-Omniscience Hallucination Rate: 66 **Instruction Following**: AA-IFBench: 50.5 ### #189 Trinity-Large-Preview - Creator: Arcee AI - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 512K - Overall Score: Not ranked yet - Family: Trinity Large - Variant: preview - Benchmarks Covered: 21 of 247 - Profile: https://benchlm.ai/models/trinity-large-preview - Sibling Models: Trinity-Large-Thinking - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: AA Agentic Index: 42.61, Tau2-Telecom: 90.1, GDPval-AA: 18.2, GDPval-AA: 864 **Coding**: AA Coding Index: 27.19, Terminal-Bench Hard: 22.7, AA-SciCode: 36.1 **Multimodal & Grounded**: Design Arena Website: 1181 **Reasoning**: AA-LCR: 33, CritPt: 0.9 **Knowledge**: MMLU: 87.2, MMLU-Pro (Arcee): 75.2, GPQA-D: 63.3, Artificial Analysis Intelligence Index: 31.87, AA-GPQA Diamond: 75.2, AA-HLE: 14.7, AA-Omniscience Index: -44.2, AA-Omniscience Accuracy: 22.8, AA-Omniscience Hallucination Rate: 86.6 **Instruction Following**: AA-IFBench: 56.3 **Mathematics**: AIME25 (Arcee): 24 ### #190 Trinity-Large-Thinking - Creator: Arcee AI - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 512K - Overall Score: Not ranked yet - Family: Trinity Large - Variant: thinking - Benchmarks Covered: 22 of 247 - Profile: https://benchlm.ai/models/trinity-large-thinking - Sibling Models: Trinity-Large-Preview - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: Tau2-Telecom: 90.1, AA Agentic Index: 42.61, GDPval-AA: 18.2, GDPval-AA: 864, Gert Labs: 32.55 **Coding**: SWE-bench Verified*: 63.2, AA Coding Index: 27.19, Terminal-Bench Hard: 22.7, AA-SciCode: 36.1 **Multimodal & Grounded**: Design Arena Website: 1181 **Reasoning**: AA-LCR: 33, CritPt: 0.9 **Knowledge**: GPQA-D: 76.3, MMLU-Pro (Arcee): 83.4, Artificial Analysis Intelligence Index: 31.87, AA-GPQA Diamond: 75.2, AA-HLE: 14.7, AA-Omniscience Index: -44.2, AA-Omniscience Accuracy: 22.8, AA-Omniscience Hallucination Rate: 86.6 **Instruction Following**: AA-IFBench: 56.3 **Mathematics**: AIME25 (Arcee): 96.3 ### #191 Qwen3 Max - Creator: Alibaba - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M - Overall Score: Not ranked yet - Family: Qwen3 Max - Variant: base - Benchmarks Covered: 19 of 247 - Profile: https://benchlm.ai/models/qwen3-max - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: AA Agentic Index: 43.01, Tau2-Telecom: 74.3, GDPval-AA: 26.8, GDPval-AA: 1037, Gert Labs: 43.74 **Coding**: Vibe Code Bench: 3.506, AA Coding Index: 26.41, Terminal-Bench Hard: 20.5, AA-SciCode: 38.3 **Multimodal & Grounded**: Design Arena Website: 1164 **Reasoning**: AA-LCR: 46.7, CritPt: 0 **Knowledge**: Artificial Analysis Intelligence Index: 31.38, AA-GPQA Diamond: 76.4, AA-HLE: 11.1, AA-Omniscience Index: -43.1, AA-Omniscience Accuracy: 24.4, AA-Omniscience Hallucination Rate: 89.4 **Instruction Following**: AA-IFBench: 44.1 ### #192 GLM-4.7-Flash - Creator: Z.AI - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 200K - Overall Score: Not ranked yet - Family: GLM-4.7-Flash - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/glm-4-7-flash - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #193 Mercury 2 - Creator: Inception - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Mercury 2 - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/mercury-2 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #194 LFM2.5-350M - Creator: LiquidAI - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 32K - Overall Score: Not ranked yet - Family: LFM2.5-350M - Variant: instruct - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/lfm2-5-350m - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
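The score lines in the profiles above pack several categories into one run of text, and some labels repeat within a category (for example, `GDPval-AA` appears twice with different values). The following is a minimal Python sketch for reading such a line, assuming the `**Category**: name: value, ...` layout seen here holds throughout. `parse_benchmark_line` is a hypothetical helper written for illustration, not part of any BenchLM tooling. It keeps scores as ordered pairs so that a repeated label is not silently overwritten, as it would be in a plain dictionary.

```python
import re

# Matches a bold category label such as "**Agentic**:".
CATEGORY_RE = re.compile(r"\*\*(.+?)\*\*:")
# Matches one "name: number" pair, ended by a comma or the end of the text.
SCORE_RE = re.compile(r"\s*(.+?):\s*(-?\d+(?:\.\d+)?)\s*(?:,|$)")


def parse_benchmark_line(text):
    """Return {category: [(benchmark, score), ...]} for one profile's scores."""
    # Assumption: the next profile, if present, starts with " ### ".
    text = text.split(" ### ")[0]
    parts = CATEGORY_RE.split(text)
    # parts looks like [prefix, category1, body1, category2, body2, ...]
    scores = {}
    for category, body in zip(parts[1::2], parts[2::2]):
        scores[category] = [
            (name.strip(), float(value)) for name, value in SCORE_RE.findall(body)
        ]
    return scores


sample = (
    "**Agentic**: AA Agentic Index: 43.01, GDPval-AA: 26.8, GDPval-AA: 1037 "
    "**Reasoning**: AA-LCR: 46.7, CritPt: 0 ### #192 GLM-4.7-Flash"
)
parsed = parse_benchmark_line(sample)
print(parsed["Agentic"])
```

Returning a list of pairs rather than a name-to-score mapping is a deliberate choice here: both `GDPval-AA` values survive, and the caller decides how to tell them apart.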
### #195 Nemotron 3 Super 120B A12B - Creator: NVIDIA - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Nemotron 3 Super 120B A12B - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/nemotron-3-super-120b-a12b - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #196 Granite-4.0-H-1B - Creator: IBM - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Granite 4.0 1B - Variant: hybrid - Benchmarks Covered: 16 of 247 - Profile: https://benchlm.ai/models/granite-4-0-h-1b - Sibling Models: Granite-4.0-1B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: AA Agentic Index: 6.53, Tau2-Telecom: 19.6, GDPval-AA: 0, GDPval-AA: 268 **Coding**: AA Coding Index: 2.74, Terminal-Bench Hard: 0, AA-SciCode: 8.2 **Reasoning**: AA-LCR: 6.3, CritPt: 0 **Knowledge**: Artificial Analysis Intelligence Index: 7.99, AA-GPQA Diamond: 26.3, AA-HLE: 5, AA-Omniscience Index: -73.6, AA-Omniscience Accuracy: 5.3, AA-Omniscience Hallucination Rate: 83.4 **Instruction Following**: AA-IFBench: 26.2 ### #197 Seed 1.6 - Creator: ByteDance - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Seed 1.6 - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/seed-1-6 - Sibling Models: Seed 1.6 Flash - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
### #198 Qwen2.5 Coder 32B Instruct - Creator: Alibaba - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Qwen2.5 Coder - Variant: 32b-instruct - Benchmarks Covered: 4 of 247 - Profile: https://benchlm.ai/models/qwen2-5-coder-32b-instruct - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Coding**: AA-SciCode: 27.1 **Knowledge**: Artificial Analysis Intelligence Index: 12.87, AA-GPQA Diamond: 41.7, AA-HLE: 3.8 ### #199 Seed-2.0-Lite - Creator: ByteDance - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Seed 2.0 - Variant: lite - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/seed-2-0-lite - Sibling Models: Seed-2.0-Mini - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #200 DeepSeek R1 Distill Qwen 32B - Creator: DeepSeek - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: DeepSeek R1 Distill - Variant: qwen-32b - Benchmarks Covered: 6 of 247 - Profile: https://benchlm.ai/models/deepseek-r1-distill-qwen-32b - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Coding**: AA-SciCode: 37.6 **Reasoning**: AA-LCR: 9.7 **Knowledge**: Artificial Analysis Intelligence Index: 17.17, AA-GPQA Diamond: 61.5, AA-HLE: 5.5 **Instruction Following**: AA-IFBench: 22.9 ### #201 Ministral 3 14B (Reasoning) - Creator: Mistral - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Ministral 3 14B - Variant: reasoning - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/ministral-3-14b-reasoning - Sibling Models: Ministral 3 14B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #202 Ministral 3 14B - Creator: Mistral - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Ministral 3 14B - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/ministral-3-14b - Sibling Models: Ministral 3 14B (Reasoning) - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #203 Aion-2.0 - Creator: Aion Labs - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Aion-2.0 - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/aion-2-0 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
### #204 Seed 1.6 Flash - Creator: ByteDance - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Seed 1.6 - Variant: flash - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/seed-1-6-flash - Sibling Models: Seed 1.6 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #205 MiniMax M1 80k - Creator: MiniMax - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 80K - Overall Score: Not ranked yet - Family: MiniMax M1 80k - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/minimax-m1-80k - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #206 Solar Pro 2 - Creator: Upstage - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Solar - Benchmarks Covered: 16 of 247 - Profile: https://benchlm.ai/models/solar-pro-2 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: AA Agentic Index: 12.71, Tau2-Telecom: 31.9, GDPval-AA: 0, GDPval-AA: 443 **Coding**: AA Coding Index: 11.29, Terminal-Bench Hard: 4.5, AA-SciCode: 24.8 **Reasoning**: AA-LCR: 0, CritPt: 0 **Knowledge**: Artificial Analysis Intelligence Index: 13.59, AA-GPQA Diamond: 56.1, AA-HLE: 3.8, AA-Omniscience Index: -61.7, AA-Omniscience Accuracy: 15.6, AA-Omniscience Hallucination Rate: 91.5 **Instruction Following**: AA-IFBench: 33.7 ### #207 Seed-2.0-Mini - Creator: ByteDance - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Seed 2.0 - Variant: mini - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/seed-2-0-mini - Sibling Models: Seed-2.0-Lite - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #208 Ministral 3 8B (Reasoning) - Creator: Mistral - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Ministral 3 8B - Variant: reasoning - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/ministral-3-8b-reasoning - Sibling Models: Ministral 3 8B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #209 Ministral 3 8B - Creator: Mistral - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Ministral 3 8B - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/ministral-3-8b - Sibling Models: Ministral 3 8B (Reasoning) - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
### #210 LFM2-24B-A2B - Creator: LiquidAI - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 32K - Overall Score: Not ranked yet - Family: LFM2-24B-A2B - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/lfm2-24b-a2b - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #211 LFM2.5-1.2B-Thinking - Creator: LiquidAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 32K - Overall Score: Not ranked yet - Family: LFM2.5 1.2B - Variant: reasoning - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/lfm2-5-1-2b-thinking - Sibling Models: LFM2.5-1.2B-Instruct, LFM2.5-1.2B-JP-202606 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #212 Ministral 3 3B (Reasoning) - Creator: Mistral - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Ministral 3 3B - Variant: reasoning - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/ministral-3-3b-reasoning - Sibling Models: Ministral 3 3B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #213 Ministral 3 3B - Creator: Mistral - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Ministral 3 3B - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/ministral-3-3b - Sibling Models: Ministral 3 3B (Reasoning) - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
### #214 Exaone 4.0 1.2B - Creator: LG AI Research - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: Exaone 4.0 - Variant: 1-2b - Benchmarks Covered: 16 of 247 - Profile: https://benchlm.ai/models/exaone-4-0-1-2b - Sibling Models: Exaone 4.0 32B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: AA Agentic Index: 6.82, Tau2-Telecom: 20.5, GDPval-AA: 0, GDPval-AA: 293 **Coding**: AA Coding Index: 2.47, Terminal-Bench Hard: 0, AA-SciCode: 7.4 **Reasoning**: AA-LCR: 0, CritPt: 0 **Knowledge**: Artificial Analysis Intelligence Index: 8.11, AA-GPQA Diamond: 42.4, AA-HLE: 5.8, AA-Omniscience Index: -82.6, AA-Omniscience Accuracy: 4.7, AA-Omniscience Hallucination Rate: 91.5 **Instruction Following**: AA-IFBench: 25.3 ### #215 Step 3.5 Flash - Creator: StepFun - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Step 3.5 Flash - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/step-3-5-flash - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #216 MiniMax M2.5 - Creator: MiniMax - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: MiniMax M2.5 - Variant: base - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/minimax-m2-5 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Coding**: Vibe Code Bench: 14.852 ### #217 GPT-5 mini - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: GPT-5 - Variant: mini - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/gpt-5-mini - Sibling Models: GPT-5 (high), GPT-5 (medium), GPT-5 nano - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Coding**: Vibe Code Bench: 14.171 ### #218 LFM2.5-1.2B-Instruct - Creator: LiquidAI - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 32K - Overall Score: Not ranked yet - Family: LFM2.5 1.2B - Variant: instruct - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/lfm2-5-1-2b-instruct - Sibling Models: LFM2.5-1.2B-Thinking, LFM2.5-1.2B-JP-202606 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #219 LFM2.5-VL-450M - Creator: LiquidAI - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: LFM2.5-VL-450M - Variant: vl - Benchmarks Covered: 7 of 247 - Profile: https://benchlm.ai/models/lfm2-5-vl-450m - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: BFCL v4: 21.08 **Multimodal & Grounded**: MMMU: 32.67, RealWorldQA: 58.43, CountBench: 73.31 **Knowledge**: GPQA: 25.66, MMLU-Pro: 19.32 **Instruction Following**: IFEval: 61.16 ### #220 LFM2.5-VL-1.6B-Extract - Creator: LiquidAI - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: LFM2.5-VL Extract - Variant: 1-6b - Benchmarks Covered: 20 of 247 - Profile: https://benchlm.ai/models/lfm2-5-vl-1-6b-extract - Sibling Models: LFM2.5-VL-450M-Extract - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: AA Agentic Index: 2.83, Tau2-Telecom: 8.5, GDPval-AA: 0, GDPval-AA: 232 **Coding**: AA Coding Index: 1, Terminal-Bench Hard: 0, AA-SciCode: 3 **Multimodal & Grounded**: Liquid Extract JSON Validity: 99.6, Liquid Extract F1: 99.6, Liquid Extract VLM Judge: 90.6, AA-MMMU-Pro: 26.5 **Reasoning**: AA-LCR: 0, CritPt: 0 **Knowledge**: Artificial Analysis Intelligence Index: 6.18, AA-GPQA Diamond: 28.9, AA-HLE: 5.1, AA-Omniscience Index: -83.9, AA-Omniscience Accuracy: 5.2, AA-Omniscience Hallucination Rate: 94 **Instruction Following**: AA-IFBench: 33.1 ### #221 LFM2.5-VL-450M-Extract - Creator: LiquidAI - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 128K - Overall Score: Not ranked yet - Family: LFM2.5-VL Extract - Variant: 450m - Benchmarks Covered: 3 of 247 - Profile: https://benchlm.ai/models/lfm2-5-vl-450m-extract - Sibling Models: LFM2.5-VL-1.6B-Extract - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Multimodal & Grounded**: Liquid Extract JSON Validity: 98.9, Liquid Extract F1: 98.8, Liquid Extract VLM Judge: 84.5 ### #222 Kimi K2.7 Code - Creator: Moonshot AI - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Kimi K2.7 Code - Variant: code - Benchmarks Covered: 6 of 247 - Profile: https://benchlm.ai/models/kimi-k2-7-code - Related Earlier Model: Kimi K2.6 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: Kimi Claw 24/7: 46.9, MCP Atlas: 76, MCP Mark Verified: 81.1 **Coding**: Kimi Code Bench v2: 62, ProgramBench: 53.6, MLS-Bench Lite: 35.1 ### #223 Holo3.1-35B-A3B - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo3.1 - Variant: 35b-a3b - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/holo3-1-35b-a3b - Sibling Models: Holo3.1-4B, Holo3.1-9B, Holo3.1-0.8B, Holo3.1-35B-A3B-FP8, Holo3.1-35B-A3B-GGUF, Holo3.1-35B-A3B-NVFP4 - Related Earlier Model: Holo3-35B-A3B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: AndroidWorld: 79.3 ### #224 Holo3.1-4B - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo3.1 - Variant: 4b - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/holo3-1-4b - Sibling Models: Holo3.1-35B-A3B, Holo3.1-9B, Holo3.1-0.8B, Holo3.1-35B-A3B-FP8, Holo3.1-35B-A3B-GGUF, Holo3.1-35B-A3B-NVFP4 - Related Earlier Model: Holo2-4B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: AndroidWorld: 71 ### #225 Holo3.1-9B - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo3.1 - Variant: 9b - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/holo3-1-9b - Sibling Models: Holo3.1-35B-A3B, Holo3.1-4B, Holo3.1-0.8B, Holo3.1-35B-A3B-FP8, Holo3.1-35B-A3B-GGUF, Holo3.1-35B-A3B-NVFP4 - Related Earlier Model: Holo2-8B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: AndroidWorld: 71 ### #226 Composer 2 Fast - Creator: Cursor - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 200K - Overall Score: Not ranked yet - Family: Composer - Variant: fast - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/composer-2-fast - Sibling Models: Composer 2.5, Composer 2 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Coding**: React Native Evals: 94.9 ### #227 Qwen3.5 Plus - Creator: Alibaba - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 1M - Overall Score: Not ranked yet - Family: Qwen3.5 Plus - Variant: base - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/qwen3-5-plus - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Coding**: Vibe Code Bench: 15.738 ### #228 Claude Haiku 4.5 Thinking - Creator: Anthropic - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 200K - Overall Score: Not ranked yet - Family: Claude Haiku 4.5 - Variant: reasoning (thinking) - Benchmarks Covered: 2 of 247 - Profile: https://benchlm.ai/models/claude-haiku-4-5-thinking - Sibling Models: Claude Haiku 4.5 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Coding**: Vibe Code Bench: 11.393 **Multimodal & Grounded**: Design Arena Website: 1167 ### #229 Claude Sonnet 4.5 Thinking - Creator: Anthropic - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 200K - Overall Score: Not ranked yet - Family: Claude Sonnet 4.5 - Variant: reasoning (thinking) - Benchmarks Covered: 2 of 247 - Profile: https://benchlm.ai/models/claude-sonnet-4-5-thinking - Sibling Models: Claude Sonnet 4.5 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Coding**: Vibe Code Bench: 22.621 **Multimodal & Grounded**: Design Arena Website: 1235 ### #230 Holo2-235B-A22B - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo2 - Variant: 235b-a22b - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/holo2-235b-a22b - Sibling Models: Holo2-30B-A3B, Holo2-4B, Holo2-8B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Multimodal & Grounded**: ScreenSpot Pro: 70.6 ### #231 Holo2-30B-A3B - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo2 - Variant: 30b-a3b - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/holo2-30b-a3b - Sibling Models: Holo2-235B-A22B, Holo2-4B, Holo2-8B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Multimodal & Grounded**: ScreenSpot Pro: 66.1 ### #232 Holo2-4B - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo2 - Variant: 4b - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/holo2-4b - Sibling Models: Holo2-235B-A22B, Holo2-30B-A3B, Holo2-8B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Multimodal & Grounded**: ScreenSpot Pro: 57.2 ### #233 Holo2-8B - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo2 - Variant: 8b - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/holo2-8b - Sibling Models: Holo2-235B-A22B, Holo2-30B-A3B, Holo2-4B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Multimodal & Grounded**: ScreenSpot Pro: 58.9 ### #234 Holo3.1-0.8B - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo3.1 - Variant: 0-8b - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/holo3-1-0-8b - Sibling Models: Holo3.1-35B-A3B, Holo3.1-4B, Holo3.1-9B, Holo3.1-35B-A3B-FP8, Holo3.1-35B-A3B-GGUF, Holo3.1-35B-A3B-NVFP4 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #235 Holo3.1-35B-A3B-FP8 - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo3.1 - Variant: 35b-a3b-fp8 - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/holo3-1-35b-a3b-fp8 - Sibling Models: Holo3.1-35B-A3B, Holo3.1-4B, Holo3.1-9B, Holo3.1-0.8B, Holo3.1-35B-A3B-GGUF, Holo3.1-35B-A3B-NVFP4 - Related Earlier Model: Holo3.1-35B-A3B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #236 Holo3.1-35B-A3B-GGUF - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo3.1 - Variant: 35b-a3b-gguf - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/holo3-1-35b-a3b-gguf - Sibling Models: Holo3.1-35B-A3B, Holo3.1-4B, Holo3.1-9B, Holo3.1-0.8B, Holo3.1-35B-A3B-FP8, Holo3.1-35B-A3B-NVFP4 - Related Earlier Model: Holo3.1-35B-A3B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
### #237 Holo3.1-35B-A3B-NVFP4 - Creator: H Company - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Holo3.1 - Variant: 35b-a3b-nvfp4 - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/holo3-1-35b-a3b-nvfp4 - Sibling Models: Holo3.1-35B-A3B, Holo3.1-4B, Holo3.1-9B, Holo3.1-0.8B, Holo3.1-35B-A3B-FP8, Holo3.1-35B-A3B-GGUF - Related Earlier Model: Holo3.1-35B-A3B - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #238 LFM2.5-1.2B-JP-202606 - Creator: LiquidAI - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 32K - Overall Score: Not ranked yet - Family: LFM2.5 1.2B - Variant: jp-202606 - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/lfm2-5-1-2b-jp-202606 - Sibling Models: LFM2.5-1.2B-Thinking, LFM2.5-1.2B-Instruct - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #239 Grok Build 0.1 - Creator: xAI - Source Type: Proprietary - Reasoning: Non-Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Grok Build - Variant: base - Benchmarks Covered: 1 of 247 - Profile: https://benchlm.ai/models/grok-build-0-1 - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: Gert Labs: 49.15 ### #240 Hy-MT1.5-1.8B-1.25bit - Creator: Tencent Hunyuan - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 262K - Overall Score: Not ranked yet - Family: Hy-MT1.5-1.8B - Variant: 1.25bit - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/hy-mt1-5-1-8b-1-25bit - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #241 Leanstral - Creator: Mistral - Source Type: Open Weight - Reasoning: Reasoning - Context Window: 256K - Overall Score: Not ranked yet - Family: Leanstral - Variant: base - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/leanstral - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. ### #242 Granite-4.0-350M - Creator: IBM - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 32K - Overall Score: Not ranked yet - Family: Granite 4.0 350M - Variant: dense - Benchmarks Covered: 16 of 247 - Profile: https://benchlm.ai/models/granite-4-0-350m - Sibling Models: Granite-4.0-H-350M - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 
**Agentic**: AA Agentic Index: 4.39, Tau2-Telecom: 13.2, GDPval-AA: 0, GDPval-AA: 268 **Coding**: AA Coding Index: 0.31, Terminal-Bench Hard: 0, AA-SciCode: 0.9 **Reasoning**: AA-LCR: 0, CritPt: 0 **Knowledge**: Artificial Analysis Intelligence Index: 6.1, AA-GPQA Diamond: 26.1, AA-HLE: 5.7, AA-Omniscience Index: -72.1, AA-Omniscience Accuracy: 3.2, AA-Omniscience Hallucination Rate: 77.8 **Instruction Following**: AA-IFBench: 15.9 ### #243 Granite-4.0-H-350M - Creator: IBM - Source Type: Open Weight - Reasoning: Non-Reasoning - Context Window: 32K - Overall Score: Not ranked yet - Family: Granite 4.0 350M - Variant: hybrid - Benchmarks Covered: 16 of 247 - Profile: https://benchlm.ai/models/granite-4-0-h-350m - Sibling Models: Granite-4.0-350M - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. **Agentic**: AA Agentic Index: 4.87, Tau2-Telecom: 14.6, GDPval-AA: 0, GDPval-AA: 289 **Coding**: AA Coding Index: 0.58, Terminal-Bench Hard: 0, AA-SciCode: 1.7 **Reasoning**: AA-LCR: 0, CritPt: 0 **Knowledge**: Artificial Analysis Intelligence Index: 5.44, AA-GPQA Diamond: 25.7, AA-HLE: 6.4, AA-Omniscience Index: -87.2, AA-Omniscience Accuracy: 3.7, AA-Omniscience Hallucination Rate: 94.4 **Instruction Following**: AA-IFBench: 17.6 ### #244 GPT-5 nano - Creator: OpenAI - Source Type: Proprietary - Reasoning: Reasoning - Context Window: 400K - Overall Score: Not ranked yet - Family: GPT-5 - Variant: nano - Benchmarks Covered: 0 of 247 - Profile: https://benchlm.ai/models/gpt-5-nano - Sibling Models: GPT-5 (high), GPT-5 (medium), GPT-5 mini - Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring. 

### #245 A.X series
- Creator: SK Telecom
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: A.X
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/a-x-series
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #246 DNA 1.0 8B
- Creator: Community
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: Not ranked yet
- Family: DNA 1.0
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/dna-1-0-8b
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #247 Holotron-12B
- Creator: H Company
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: Holotron
- Variant: 12b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/holotron-12b
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #248 HyperClova X Dash
- Creator: Naver Cloud
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: HyperClova X
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/hyperclova-x-dash
- Sibling Models: HyperClova X Think 32B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #249 HyperClova X Think 32B
- Creator: Naver Cloud
- Source Type: Open Weight
- Reasoning: Reasoning
- Context Window: 128K
- Overall Score: Not ranked yet
- Family: HyperClova X
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/hyperclova-x-think-32b
- Sibling Models: HyperClova X Dash
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #250 Kanana Essence
- Creator: Kakao
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: Kanana
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/kanana-essence
- Sibling Models: Kanana Flag, Kanana Nano
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #251 Kanana Flag
- Creator: Kakao
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: Kanana
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/kanana-flag
- Sibling Models: Kanana Essence, Kanana Nano
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #252 Kanana Nano
- Creator: Kakao
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: Kanana
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/kanana-nano
- Sibling Models: Kanana Essence, Kanana Flag
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #253 OriOn-Mistral-24B
- Creator: LightOn
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 344K
- Overall Score: Not ranked yet
- Family: OriOn
- Variant: mistral-24b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/orion-mistral-24b
- Sibling Models: OriOn-Qwen-32B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #254 OriOn-Qwen-32B
- Creator: LightOn
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 262K
- Overall Score: Not ranked yet
- Family: OriOn
- Variant: qwen-32b
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/orion-qwen-32b
- Sibling Models: OriOn-Mistral-24B
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #255 Pharia-1-LLM-7B-control
- Creator: Aleph Alpha
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 8K
- Overall Score: Not ranked yet
- Family: Pharia-1-LLM-7B
- Variant: control
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/pharia-1-llm-7b-control
- Sibling Models: Pharia-1-LLM-7B-control-aligned
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #256 Pharia-1-LLM-7B-control-aligned
- Creator: Aleph Alpha
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 8K
- Overall Score: Not ranked yet
- Family: Pharia-1-LLM-7B
- Variant: control-aligned
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/pharia-1-llm-7b-control-aligned
- Sibling Models: Pharia-1-LLM-7B-control
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #257 Thunder-LLM 8B
- Creator: Academic
- Source Type: Open Weight
- Reasoning: Non-Reasoning
- Context Window: 32K
- Overall Score: Not ranked yet
- Family: Thunder-LLM
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/thunder-llm-8b
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

### #258 Varco
- Creator: NC AI
- Source Type: Proprietary
- Reasoning: Non-Reasoning
- Context Window: 64K
- Overall Score: Not ranked yet
- Family: Varco
- Benchmarks Covered: 0 of 247
- Profile: https://benchlm.ai/models/varco
- Coverage Note: Tracked on BenchLM, but not currently ranked because the remaining benchmark coverage is generated or otherwise excluded from leaderboard scoring.

## Overall Leaderboard

| Rank | Model | Creator | Type | Context | Score |
|------|-------|---------|------|---------|-------|
| 1 | Claude Mythos 5 | Anthropic | Proprietary | 1M+ | 99 |
| 2 | Claude Fable 5 | Anthropic | Proprietary | 1M+ | 97 |
| 3 | Claude Opus 4.8 | Anthropic | Proprietary | 1M | 93 |
| 4 | Gemini 3.1 Pro | Google | Proprietary | 1M | 91 |
| 5 | Qwen3.7 Max | Alibaba | Proprietary | 1M | 91 |
| 6 | GPT-5.4 Pro | OpenAI | Proprietary | 1.05M | 90 |
| 7 | GPT-5.5 | OpenAI | Proprietary | 1M | 89 |
| 8 | Gemini 3 Pro Deep Think | Google | Proprietary | 2M | 89 |
| 9 | Grok 4.1 | xAI | Proprietary | 1M | 89 |
| 10 | GPT-5.4 | OpenAI | Proprietary | 1.05M | 88 |
| 11 | Qwen3.7 Plus | Alibaba | Proprietary | 1M | 88 |
| 12 | Claude Opus 4.6 | Anthropic | Proprietary | 1M | 86 |
| 13 | Gemini 3.5 Flash | Google | Proprietary | 1M | 86 |
| 14 | DeepSeek V4 Pro (Max) | DeepSeek | Open Weight | 1M | 86 |
| 15 | GPT-5.3 Codex | OpenAI | Proprietary | 400K | 85 |
| 16 | Claude Opus 4.7 (Adaptive) | Anthropic | Proprietary | 1M | 84 |
| 17 | GLM-5.1 | Z.AI | Open Weight | 203K | 82 |
| 18 | Claude Sonnet 4.6 | Anthropic | Proprietary | 200K | 82 |
| 19 | DeepSeek V4 Pro (High) | DeepSeek | Open Weight | 1M | 82 |
| 20 | o1-preview | OpenAI | Proprietary | 200K | 82 |
| 21 | Kimi K2.6 | Moonshot AI | Open Weight | 256K | 81 |
| 22 | Gemini 3 Pro | Google | Proprietary | 2M | 80 |
| 23 | MiniMax M3 | MiniMax | Open Weight | 1M | 79 |
| 24 | GLM-5 (Reasoning) | Z.AI | Open Weight | 200K | 79 |
| 25 | GPT-5.2 | OpenAI | Proprietary | 400K | 78 |
| 26 | Qwen3.5 397B (Reasoning) | Alibaba | Open Weight | 128K | 77 |
| 27 | GPT-5.1 | OpenAI | Proprietary | 200K | 77 |
| 28 | Claude Opus 4.5 | Anthropic | Proprietary | 200K | 76 |
| 29 | GPT-5 (high) | OpenAI | Proprietary | 128K | 76 |
| 30 | GPT-5.2-Codex | OpenAI | Proprietary | 400K | 76 |
| 31 | Kimi K2.5 (Reasoning) | Moonshot AI | Proprietary | 128K | 75 |
| 32 | GPT-5.1-Codex-Max | OpenAI | Proprietary | 400K | 75 |
| 33 | DeepSeek V4 Flash (Max) | DeepSeek | Open Weight | 1M | 74 |
| 34 | Qwen3.6-27B | Alibaba | Open Weight | 262K | 72 |
| 35 | Grok 4.20 | xAI | Proprietary | 2M | 71 |
| 36 | DeepSeek V4 Flash (High) | DeepSeek | Open Weight | 1M | 71 |
| 37 | GPT-5 (medium) | OpenAI | Proprietary | 128K | 70 |
| 38 | Nemotron 3 Ultra | NVIDIA | Open Weight | 1M | 68 |
| 39 | DeepSeek V4 Pro | DeepSeek | Open Weight | 1M | 68 |
| 40 | GLM-4.7 | Z.AI | Open Weight | 200K | 68 |
| 41 | Grok 4.1 Fast | xAI | Proprietary | 1M | 68 |
| 42 | GLM-5 | Z.AI | Open Weight | 200K | 67 |
| 43 | Qwen3.6 Plus | Alibaba | Proprietary | 1M | 66 |
| 44 | MAI-Thinking-1 | Microsoft | Proprietary | 256K | 65 |
| 45 | Qwen3.6-35B-A3B | Alibaba | Open Weight | 262K | 65 |
| 46 | Claude Sonnet 4.5 | Anthropic | Proprietary | 200K | 64 |
| 47 | Kimi K2.5 | Moonshot AI | Open Weight | 256K | 63 |
| 48 | Qwen3.5-122B-A10B | Alibaba | Open Weight | 262K | 63 |
| 49 | Gemini 2.5 Pro | Google | Proprietary | 1M | 63 |
| 50 | Grok 4 | xAI | Proprietary | 128K | 63 |
| 51 | Qwen3.5 397B | Alibaba | Open Weight | 128K | 62 |
| 52 | Qwen3.5-27B | Alibaba | Open Weight | 262K | 61 |
| 53 | DeepSeek V3.2 (Thinking) | DeepSeek | Open Weight | 128K | 60 |
| 54 | MiMo-V2-Flash | Xiaomi | Open Weight | 256K | 59 |
| 55 | DeepSeek V4 Flash | DeepSeek | Open Weight | 1M | 57 |
| 56 | GPT-4.1 | OpenAI | Proprietary | 1M | 57 |
| 57 | o3-pro | OpenAI | Proprietary | 200K | 57 |
| 58 | o1 | OpenAI | Proprietary | 200K | 57 |
| 59 | DeepSeek V3.2 | DeepSeek | Open Weight | 128K | 56 |
| 60 | Claude Haiku 4.5 | Anthropic | Proprietary | 200K | 56 |
| 61 | o3 | OpenAI | Proprietary | 200K | 56 |
| 62 | Qwen3.5-35B-A3B | Alibaba | Open Weight | 262K | 55 |
| 63 | Gemini 3 Flash | Google | Proprietary | 1M | 55 |
| 64 | o3-mini | OpenAI | Proprietary | 200K | 55 |
| 65 | MiniMax M2.7 | MiniMax | Open Weight | 200K | 53 |
| 66 | DeepSeek Coder 2.0 | DeepSeek | Open Weight | 128K | 51 |
| 67 | Claude 4.1 Opus | Anthropic | Proprietary | 200K | 51 |
| 68 | DeepSeek LLM 2.0 | DeepSeek | Open Weight | 128K | 50 |
| 69 | Qwen2.5-1M | Alibaba | Open Weight | 1M | 50 |
| 70 | Claude 4 Sonnet | Anthropic | Proprietary | 200K | 50 |
| 71 | GPT-4o mini | OpenAI | Proprietary | 128K | 49 |
| 72 | Qwen2.5-72B | Alibaba | Open Weight | 128K | 49 |
| 73 | DeepSeekMath V2 | DeepSeek | Open Weight | 128K | 49 |
| 74 | Mistral Large 3 | Mistral | Proprietary | 128K | 48 |
| 75 | Gemini 3.1 Flash-Lite | Google | Proprietary | 1M | 47 |
| 76 | Qwen3 235B 2507 (Reasoning) | Alibaba | Open Weight | 128K | 45 |
| 77 | GPT-4.1 mini | OpenAI | Proprietary | 1M | 45 |
| 78 | Nemotron 3 Super 100B | NVIDIA | Open Weight | 1M | 43 |
| 79 | o4-mini (high) | OpenAI | Proprietary | 200K | 43 |
| 80 | Claude 4.1 Opus Thinking | Anthropic | Proprietary | 200K | 43 |
| 81 | GPT-4o | OpenAI | Proprietary | 128K | 42 |
| 82 | Kimi K2 | Moonshot AI | Proprietary | 128K | 41 |
| 83 | Llama 3.1 405B | Meta | Open Weight | 128K | 40 |
| 84 | Claude 3.5 Sonnet | Anthropic | Proprietary | 200K | 40 |
| 85 | Grok Code Fast 1 | xAI | Proprietary | 256K | 39 |
| 86 | Sarvam 105B | Sarvam | Open Weight | 128K | 39 |
| 87 | Mistral Large 2 | Mistral | Proprietary | 128K | 38 |
| 88 | Gemini 2.5 Flash | Google | Proprietary | 1M | 37 |
| 89 | Gemini 1.5 Pro | Google | Proprietary | 2M | 35 |
| 90 | DeepSeek V3 | DeepSeek | Open Weight | 128K | 34 |
| 91 | GPT-OSS 120B | OpenAI | Open Weight | 128K | 34 |
| 92 | Claude 3 Opus | Anthropic | Proprietary | 200K | 34 |
| 93 | MiniCPM5-1B | OpenBMB | Open Weight | 131K | 34 |
| 94 | DeepSeek-R1 | DeepSeek | Open Weight | 128K | 32 |
| 95 | Qwen3 235B 2507 | Alibaba | Open Weight | 128K | 32 |
| 96 | DBRX Instruct | Databricks | Open Weight | 32K | 32 |
| 97 | Grok 3 [Beta] | xAI | Proprietary | 128K | 30 |
| 98 | DeepSeek V3.1 (Reasoning) | DeepSeek | Open Weight | 128K | 29 |
| 99 | o1-pro | OpenAI | Proprietary | 200K | 28 |
| 100 | Phi-4 | Microsoft | Open Weight | 16K | 27 |
| 101 | GPT-4.1 nano | OpenAI | Proprietary | 1M | 27 |
| 102 | GLM-4.5 | Z.AI | Proprietary | 128K | 25 |
| 103 | Llama 4 Scout | Meta | Open Weight | 10M | 25 |
| 104 | Nemotron 3 Nano 30B | NVIDIA | Open Weight | 32K | 25 |
| 105 | Llama 3 70B | Meta | Open Weight | 128K | 25 |
| 106 | DeepSeek V3.1 | DeepSeek | Open Weight | 128K | 24 |
| 107 | GPT-4 Turbo | OpenAI | Proprietary | 128K | 24 |
| 108 | Gemini 1.0 Pro | Google | Proprietary | 32K | 24 |
| 109 | Z-1 | Z | Proprietary | 128K | 23 |
| 110 | Mistral 8x7B | Mistral | Open Weight | 32K | 23 |
| 111 | Claude 3 Haiku | Anthropic | Proprietary | 200K | 23 |
| 112 | Mixtral 8x22B Instruct v0.1 | Mistral | Open Weight | 64K | 22 |
| 113 | Nemotron-4 15B | NVIDIA | Open Weight | 32K | 22 |
| 114 | Moonshot v1 | Moonshot AI | Proprietary | 128K | 22 |
| 115 | Nemotron Ultra 253B | NVIDIA | Open Weight | 32K | 22 |
| 116 | GLM-4.5-Air | Z.AI | Proprietary | 128K | 18 |
| 117 | Llama 4 Maverick | Meta | Open Weight | 1M | 17 |
| 118 | Gemma 3 27B | Google | Open Weight | 32K | 16 |
| 119 | GPT-OSS 20B | OpenAI | Open Weight | 128K | 16 |
| 120 | Llama 4 Behemoth | Meta | Open Weight | 32K | 11 |
| 121 | Nova Pro | Amazon | Proprietary | 128K | 10 |
| 122 | Mistral 7B v0.3 | Mistral | Open Weight | 32K | 4 |
| 123 | Mistral 8x7B v0.2 | Mistral | Open Weight | 32K | 1 |
| 124 | GPT-5.5 Pro | OpenAI | Proprietary | 1M | — |
| 125 | Holo3-35B-A3B | H Company | Open Weight | 64K | — |
| 126 | Holo3-122B-A10B | H Company | Proprietary | 64K | — |
| 127 | MiMo-V2.5-Pro | Xiaomi | Proprietary | 1M | — |
| 128 | MiMo-V2-Pro | Xiaomi | Proprietary | 1M | — |
| 129 | MiMo-V2-Omni | Xiaomi | Proprietary | 262K | — |
| 130 | Composer 2.5 | Cursor | Proprietary | 200K | — |
| 131 | Muse Spark | Meta | Proprietary | 262K | — |
| 132 | Qwen 3.6 Max (preview) | Alibaba | Proprietary | 256K | — |
| 133 | Mistral Medium 3.5 128B | Mistral | Open Weight | 256K | — |
| 134 | Interfaze Beta | Interfaze | Proprietary | 1M | — |
| 135 | Grok 4.3 | xAI | Proprietary | 1M | — |
| 136 | Composer 2 | Cursor | Proprietary | 200K | — |
| 137 | MiMo-V2.5 | Xiaomi | Proprietary | 1M | — |
| 138 | Step 3.7 Flash | StepFun | Open Weight | 256K | — |
| 139 | Grok 4.20 Multi-agent | xAI | Proprietary | 2M | — |
| 140 | GPT-5.4 mini | OpenAI | Proprietary | 400K | — |
| 141 | Gemma 4 31B | Google | Open Weight | 256K | — |
| 142 | Exaone 4.0 32B | LG AI Research | Open Weight | 128K | — |
| 143 | GLM-5V-Turbo | Z.AI | Proprietary | 200K | — |
| 144 | GPT-5.4 nano | OpenAI | Proprietary | 400K | — |
| 145 | Mellum2-12B-A2.5B-Thinking | JetBrains | Open Weight | 128K | — |
| 146 | Hy3 Preview | Tencent | Open Weight | 256K | — |
| 147 | ZAYA1-8B | Zyphra | Open Weight | 131K | — |
| 148 | Gemma 4 26B A4B | Google | Open Weight | 256K | — |
| 149 | ZAYA1-74B-Preview | Zyphra | Open Weight | 256K | — |
| 150 | Mistral Small 4 (Reasoning) | Mistral | Open Weight | 256K | — |
| 151 | Laguna M.1 | Poolside | Proprietary | 131K | — |
| 152 | K-Exaone | LG AI Research | Proprietary | 256K | — |
| 153 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | Open Weight | 256K | — |
| 154 | Gemma 4 12B | Google | Open Weight | 256K | — |
| 155 | Ternary Bonsai 8B | Prism ML | Open Weight | 64K | — |
| 156 | LFM2.5-8B-A1B | LiquidAI | Open Weight | 128K | — |
| 157 | Mistral Medium 3 | Mistral | Proprietary | 128K | — |
| 158 | DeepSeek V4 Pro Base | DeepSeek | Open Weight | 1M | — |
| 159 | Mistral Small 4 | Mistral | Open Weight | 256K | — |
| 160 | Grok 3 Mini | xAI | Proprietary | 128K | — |
| 161 | Sarvam 30B | Sarvam | Open Weight | 64K | — |
| 162 | Command A+ | Cohere | Open Weight | 128K | — |
| 163 | Laguna XS.2 | Poolside | Open Weight | 131K | — |
| 164 | Gemma 4 E4B | Google | Open Weight | 128K | — |
| 165 | Ling 2.6 Flash | InclusionAI | Open Weight | 262K | — |
| 166 | Granite-4.0-1B | IBM | Open Weight | 128K | — |
| 167 | DeepSeek V4 Flash Base | DeepSeek | Open Weight | 1M | — |
| 168 | Qwen3.5 Flash | Alibaba | Proprietary | 1M | — |
| 169 | Ternary Bonsai 1.7B | Prism ML | Open Weight | 32K | — |
| 170 | Mellum2-12B-A2.5B-Instruct | JetBrains | Open Weight | 128K | — |
| 171 | Claude Opus 4.6 (Adaptive) | Anthropic | Proprietary | 1M | — |
| 172 | Qwen2.5-VL-32B | Alibaba | Open Weight | 32K | — |
| 173 | Gemma 4 E2B | Google | Open Weight | 128K | — |
| 174 | 1-bit Bonsai 1.7B | Prism ML | Open Weight | 32K | — |
| 175 | Claude Opus 4.7 | Anthropic | Proprietary | 1M | — |
| 176 | Ternary Bonsai 4B | Prism ML | Open Weight | 32K | — |
| 177 | 1-bit Bonsai 8B | Prism ML | Open Weight | 64K | — |
| 178 | Claude Opus 4.5 Thinking | Anthropic | Proprietary | 200K | — |
| 179 | GLM-5-Turbo | Z.AI | Proprietary | 200K | — |
| 180 | GPT-5.2 Instant | OpenAI | Proprietary | 128K | — |
| 181 | GPT-5.2 Pro | OpenAI | Proprietary | 400K | — |
| 182 | GPT-5.3 Instant | OpenAI | Proprietary | 400K | — |
| 183 | GPT-5.3-Codex-Spark | OpenAI | Proprietary | 256K | — |
| 184 | GPT-5.1-Codex | OpenAI | Proprietary | 400K | — |
| 185 | 1-bit Bonsai 4B | Prism ML | Open Weight | 32K | — |
| 186 | Grok 4.1 Fast (Reasoning) | xAI | Proprietary | 2M | — |
| 187 | GLM-4.6 | Z.AI | Open Weight | 200K | — |
| 188 | Grok 4 Fast (Reasoning) | xAI | Proprietary | 2M | — |
| 189 | Trinity-Large-Preview | Arcee AI | Open Weight | 512K | — |
| 190 | Trinity-Large-Thinking | Arcee AI | Open Weight | 512K | — |
| 191 | Qwen3 Max | Alibaba | Proprietary | 1M | — |
| 192 | GLM-4.7-Flash | Z.AI | Open Weight | 200K | — |
| 193 | Mercury 2 | Inception | Proprietary | 128K | — |
| 194 | LFM2.5-350M | LiquidAI | Open Weight | 32K | — |
| 195 | Nemotron 3 Super 120B A12B | NVIDIA | Open Weight | 256K | — |
| 196 | Granite-4.0-H-1B | IBM | Open Weight | 128K | — |
| 197 | Seed 1.6 | ByteDance | Proprietary | 256K | — |
| 198 | Qwen2.5 Coder 32B Instruct | Alibaba | Open Weight | 128K | — |
| 199 | Seed-2.0-Lite | ByteDance | Proprietary | 256K | — |
| 200 | DeepSeek R1 Distill Qwen 32B | DeepSeek | Open Weight | 128K | — |
| 201 | Ministral 3 14B (Reasoning) | Mistral | Open Weight | 128K | — |
| 202 | Ministral 3 14B | Mistral | Open Weight | 128K | — |
| 203 | Aion-2.0 | Aion Labs | Proprietary | 128K | — |
| 204 | Seed 1.6 Flash | ByteDance | Proprietary | 256K | — |
| 205 | MiniMax M1 80k | MiniMax | Proprietary | 80K | — |
| 206 | Solar Pro 2 | Upstage | Proprietary | 128K | — |
| 207 | Seed-2.0-Mini | ByteDance | Proprietary | 256K | — |
| 208 | Ministral 3 8B (Reasoning) | Mistral | Open Weight | 128K | — |
| 209 | Ministral 3 8B | Mistral | Open Weight | 128K | — |
| 210 | LFM2-24B-A2B | LiquidAI | Proprietary | 32K | — |
| 211 | LFM2.5-1.2B-Thinking | LiquidAI | Proprietary | 32K | — |
| 212 | Ministral 3 3B (Reasoning) | Mistral | Open Weight | 128K | — |
| 213 | Ministral 3 3B | Mistral | Open Weight | 128K | — |
| 214 | Exaone 4.0 1.2B | LG AI Research | Open Weight | 128K | — |
| 215 | Step 3.5 Flash | StepFun | Open Weight | 256K | — |
| 216 | MiniMax M2.5 | MiniMax | Proprietary | 128K | — |
| 217 | GPT-5 mini | OpenAI | Proprietary | 128K | — |
| 218 | LFM2.5-1.2B-Instruct | LiquidAI | Proprietary | 32K | — |
| 219 | LFM2.5-VL-450M | LiquidAI | Open Weight | 128K | — |
| 220 | LFM2.5-VL-1.6B-Extract | LiquidAI | Open Weight | 128K | — |
| 221 | LFM2.5-VL-450M-Extract | LiquidAI | Open Weight | 128K | — |
| 222 | Kimi K2.7 Code | Moonshot AI | Open Weight | 256K | — |
| 223 | Holo3.1-35B-A3B | H Company | Open Weight | 262K | — |
| 224 | Holo3.1-4B | H Company | Open Weight | 262K | — |
| 225 | Holo3.1-9B | H Company | Open Weight | 262K | — |
| 226 | Composer 2 Fast | Cursor | Proprietary | 200K | — |
| 227 | Qwen3.5 Plus | Alibaba | Proprietary | 1M | — |
| 228 | Claude Haiku 4.5 Thinking | Anthropic | Proprietary | 200K | — |
| 229 | Claude Sonnet 4.5 Thinking | Anthropic | Proprietary | 200K | — |
| 230 | Holo2-235B-A22B | H Company | Open Weight | 262K | — |
| 231 | Holo2-30B-A3B | H Company | Open Weight | 262K | — |
| 232 | Holo2-4B | H Company | Open Weight | 262K | — |
| 233 | Holo2-8B | H Company | Open Weight | 262K | — |
| 234 | Holo3.1-0.8B | H Company | Open Weight | 262K | — |
| 235 | Holo3.1-35B-A3B-FP8 | H Company | Open Weight | 262K | — |
| 236 | Holo3.1-35B-A3B-GGUF | H Company | Open Weight | 262K | — |
| 237 | Holo3.1-35B-A3B-NVFP4 | H Company | Open Weight | 262K | — |
| 238 | LFM2.5-1.2B-JP-202606 | LiquidAI | Open Weight | 32K | — |
| 239 | Grok Build 0.1 | xAI | Proprietary | 256K | — |
| 240 | Hy-MT1.5-1.8B-1.25bit | Tencent Hunyuan | Open Weight | 262K | — |
| 241 | Leanstral | Mistral | Open Weight | 256K | — |
| 242 | Granite-4.0-350M | IBM | Open Weight | 32K | — |
| 243 | Granite-4.0-H-350M | IBM | Open Weight | 32K | — |
| 244 | GPT-5 nano | OpenAI | Proprietary | 400K | — |
| 245 | A.X series | SK Telecom | Proprietary | 64K | — |
| 246 | DNA 1.0 8B | Community | Open Weight | 32K | — |
| 247 | Holotron-12B | H Company | Open Weight | 128K | — |
| 248 | HyperClova X Dash | Naver Cloud | Proprietary | 128K | — |
| 249 | HyperClova X Think 32B | Naver Cloud | Open Weight | 128K | — |
| 250 | Kanana Essence | Kakao | Proprietary | 64K | — |
| 251 | Kanana Flag | Kakao | Proprietary | 64K | — |
| 252 | Kanana Nano | Kakao | Proprietary | 64K | — |
| 253 | OriOn-Mistral-24B | LightOn | Open Weight | 344K | — |
| 254 | OriOn-Qwen-32B | LightOn | Open Weight | 262K | — |
| 255 | Pharia-1-LLM-7B-control | Aleph Alpha | Open Weight | 8K | — |
| 256 | Pharia-1-LLM-7B-control-aligned | Aleph Alpha | Open Weight | 8K | — |
| 257 | Thunder-LLM 8B | Academic | Open Weight | 32K | — |
| 258 | Varco | NC AI | Proprietary | 64K | — |

## Agentic Leaderboard

| Rank | Model | Creator | Avg Score |
|------|-------|---------|----------|
| 1 | GPT-5.5 Pro | OpenAI | 90.1 |
| 2 | GPT-5.4 Pro | OpenAI | 89.3 |
| 3 | Claude Mythos 5 | Anthropic | 87 |
| 4 | Claude Fable 5 | Anthropic | 85.2 |
| 5 | Holo3-35B-A3B | H Company | 82.6 |
| 6 | GPT-5.5 | OpenAI | 81.5 |
| 7 | Claude Opus 4.8 | Anthropic | 80.1 |
| 8 | Holo3-122B-A10B | H Company | 78.9 |
| 9 | Gemini 3.5 Flash | Google | 77.2 |
| 10 | GPT-5.4 | OpenAI | 77 |
| 11 | Claude Opus 4.7 (Adaptive) | Anthropic | 74.9 |
| 12 | DeepSeek V4 Pro (Max) | DeepSeek | 74 |
| 13 | Kimi K2.6 | Moonshot AI | 73.1 |
| 14 | Claude Opus 4.6 | Anthropic | 72.6 |
| 15 | MiniMax M3 | MiniMax | 71.9 |
| 16 | Qwen3.7 Plus | Alibaba | 71.7 |
| 17 | GPT-5.3 Codex | OpenAI | 71.5 |
| 18 | DeepSeek V4 Pro (High) | DeepSeek | 70 |
| 19 | Qwen3.7 Max | Alibaba | 69.7 |
| 20 | Composer 2.5 | Cursor | 69.3 |
| 21 | MiMo-V2.5-Pro | Xiaomi | 68.4 |
| 22 | Step 3.7 Flash | StepFun | 65.9 |
| 23 | MiMo-V2.5 | Xiaomi | 65.8 |
| 24 | GPT-5.4 mini | OpenAI | 65.6 |
| 25 | Qwen 3.6 Max (preview) | Alibaba | 65.4 |
| 26 | GLM-5.1 | Z.AI | 65.3 |
| 27 | Claude Sonnet 4.6 | Anthropic | 65.1 |
| 28 | DeepSeek V4 Flash (Max) | DeepSeek | 63.3 |
| 29 | Claude Opus 4.5 | Anthropic | 62.5 |
| 30 | Composer 2 | Cursor | 61.7 |
| 31 | Qwen3.6 Plus | Alibaba | 61.6 |
| 32 | Qwen3.6-27B | Alibaba | 59.3 |
| 33 | DeepSeek V4 Pro | DeepSeek | 59.1 |
| 34 | Muse Spark | Meta | 59 |
| 35 | MiniMax M2.7 | MiniMax | 57 |
| 36 | GLM-5 | Z.AI | 56.2 |
| 37 | Qwen3.5 397B | Alibaba | 56.2 |
| 38 | Qwen3.5-122B-A10B | Alibaba | 56.1 |
| 39 | DeepSeek V4 Flash (High) | DeepSeek | 55.4 |
| 40 | Claude Sonnet 4.5 | Anthropic | 55.3 |
| 41 | GPT-5.2 | OpenAI | 55.2 |
| 42 | Kimi K2.5 (Reasoning) | Moonshot AI | 54.6 |
| 43 | Kimi K2.5 | Moonshot AI | 54.6 |
| 44 | Hy3 Preview | Tencent | 54.4 |
| 45 | Nemotron 3 Ultra | NVIDIA | 51.7 |
| 46 | Qwen3.5-27B | Alibaba | 51.6 |
| 47 | Qwen3.6-35B-A3B | Alibaba | 51.5 |
| 48 | Qwen3.5-35B-A3B | Alibaba | 50.6 |
| 49 | DeepSeek V4 Flash | DeepSeek | 49.1 |
| 50 | Grok 4.20 | xAI | 47.1 |
| 51 | MAI-Thinking-1 | Microsoft | 46 |
| 52 | Laguna M.1 | Poolside | 45.8 |
| 53 | GLM-4.7 | Z.AI | 45.3 |
| 54 | GPT-5.4 nano | OpenAI | 42.9 |
| 55 | Laguna XS.2 | Poolside | 35.7 |
| 56 | Gemini 3.1 Pro | Google | 0 |
| 57 | Gemini 3 Pro Deep Think | Google | 0 |
| 58 | Gemini 3 Pro | Google | 0 |
| 59 | Qwen3.5 397B (Reasoning) | Alibaba | 0 |
| 60 | GPT-5.1 | OpenAI | 0 |
| 61 | GPT-5 (high) | OpenAI | 0 |
| 62 | GPT-5.2-Codex | OpenAI | 0 |
| 63 | GPT-5.1-Codex-Max | OpenAI | 0 |
| 64 | GPT-5 (medium) | OpenAI | 0 |
| 65 | Grok 4.1 Fast | xAI | 0 |
| 66 | Gemini 2.5 Pro | Google | 0 |
| 67 | Grok 4 | xAI | 0 |
| 68 | MiMo-V2-Flash | Xiaomi | 0 |
| 69 | GPT-4.1 | OpenAI | 0 |
| 70 | o1 | OpenAI | 0 |
| 71 | DeepSeek V3.2 | DeepSeek | 0 |
| 72 | o3 | OpenAI | 0 |
| 73 | Gemini 3 Flash | Google | 0 |
| 74 | o3-mini | OpenAI | 0 |
| 75 | Claude 4 Sonnet | Anthropic | 0 |
| 76 | Mistral Large 3 | Mistral | 0 |
| 77 | Gemini 3.1 Flash-Lite | Google | 0 |
| 78 | GPT-4.1 mini | OpenAI | 0 |
| 79 | Nemotron 3 Super 100B | NVIDIA | 0 |
| 80 | Claude 4.1 Opus Thinking | Anthropic | 0 |
| 81 | GPT-4o | OpenAI | 0 |
| 82 | Kimi K2 | Moonshot AI | 0 |
| 83 | Llama 3.1 405B | Meta | 0 |
| 84 | Grok Code Fast 1 | xAI | 0 |
| 85 | Sarvam 105B | Sarvam | 0 |
| 86 | Mistral Large 2 | Mistral | 0 |
| 87 | Gemini 2.5 Flash | Google | 0 |
| 88 | DeepSeek V3 | DeepSeek | 0 |
| 89 | GPT-OSS 120B | OpenAI | 0 |
| 90 | MiniCPM5-1B | OpenBMB | 0 |
| 91 | DeepSeek-R1 | DeepSeek | 0 |
| 92 | DeepSeek V3.1 (Reasoning) | DeepSeek | 0 |
| 93 | Phi-4 | Microsoft | 0 |
| 94 | GPT-4.1 nano | OpenAI | 0 |
| 95 | Llama 4 Scout | Meta | 0 |
| 96 | Nemotron 3 Nano 30B | NVIDIA | 0 |
| 97 | DeepSeek V3.1 | DeepSeek | 0 |
| 98 | Claude 3 Haiku | Anthropic | 0 |
| 99 | Nemotron Ultra 253B | NVIDIA | 0 |
| 100 | GLM-4.5-Air | Z.AI | 0 |
| 101 | Llama 4 Maverick | Meta | 0 |
| 102 | Gemma 3 27B | Google | 0 |
| 103 | GPT-OSS 20B | OpenAI | 0 |
| 104 | Nova Pro | Amazon | 0 |
| 105 | MiMo-V2-Pro | Xiaomi | 0 |
| 106 | MiMo-V2-Omni | Xiaomi | 0 |
| 107 | Mistral Medium 3.5 128B | Mistral | 0 |
| 108 | Grok 4.3 | xAI | 0 |
| 109 | Gemma 4 31B | Google | 0 |
| 110 | Exaone 4.0 32B | LG AI Research | 0 |
| 111 | GLM-5V-Turbo | Z.AI | 0 |
| 112 | Mellum2-12B-A2.5B-Thinking | JetBrains | 0 |
| 113 | ZAYA1-8B | Zyphra | 0 |
| 114 | Gemma 4 26B A4B | Google | 0 |
| 115 | ZAYA1-74B-Preview | Zyphra | 0 |
| 116 | Mistral Small 4 (Reasoning) | Mistral | 0 |
| 117 | K-Exaone | LG AI Research | 0 |
| 118 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | 0 |
| 119 | Gemma 4 12B | Google | 0 |
| 120 | LFM2.5-8B-A1B | LiquidAI | 0 |
| 121 | Mistral Medium 3 | Mistral | 0 |
| 122 | Mistral Small 4 | Mistral | 0 |
| 123 | Sarvam 30B | Sarvam | 0 |
| 124 | Command A+ | Cohere | 0 |
| 125 | Gemma 4 E4B | Google | 0 |
| 126 | Ling 2.6 Flash | InclusionAI | 0 |
| 127 | Granite-4.0-1B | IBM | 0 |
| 128 | Mellum2-12B-A2.5B-Instruct | JetBrains | 0 |
| 129 | Claude Opus 4.6 (Adaptive) | Anthropic | 0 |
| 130 | Gemma 4 E2B | Google | 0 |
| 131 | Claude Opus 4.7 | Anthropic | 0 |
| 132 | Claude Opus 4.5 Thinking | Anthropic | 0 |
| 133 | GLM-5-Turbo | Z.AI | 0 |
| 134 | GPT-5.1-Codex | OpenAI | 0 |
| 135 | Grok 4.1 Fast (Reasoning) | xAI | 0 |
| 136 | GLM-4.6 | Z.AI | 0 |
| 137 | Grok 4 Fast (Reasoning) | xAI | 0 |
| 138 | Trinity-Large-Preview | Arcee AI | 0 |
| 139 | Trinity-Large-Thinking | Arcee AI | 0 |
| 140 | Qwen3 Max | Alibaba | 0 |
| 141 | Granite-4.0-H-1B | IBM | 0 |
| 142 | Solar Pro 2 | Upstage | 0 |
| 143 | Exaone 4.0 1.2B | LG AI Research | 0 |
| 144 | LFM2.5-VL-450M | LiquidAI | 0 |
| 145 | LFM2.5-VL-1.6B-Extract | LiquidAI | 0 |
| 146 | Kimi K2.7 Code | Moonshot AI | 0 |
| 147 | Holo3.1-35B-A3B | H Company | 0 |
| 148 | Holo3.1-4B | H Company | 0 |
| 149 | Holo3.1-9B | H Company | 0 |
| 150 | Grok Build 0.1 | xAI | 0 |
| 151 | Granite-4.0-350M | IBM | 0 |
| 152 | Granite-4.0-H-350M | IBM | 0 |

## Coding Leaderboard

| Rank | Model | Creator | Avg Score |
|------|-------|---------|----------|
| 1 | Claude Mythos 5 | Anthropic | 85.9 |
| 2 | Claude Fable 5 | Anthropic | 85.6 |
| 3 | MiMo-V2-Pro | Xiaomi | 78 |
| 4 | Mistral Medium 3.5 128B | Mistral | 77.6 |
| 5 | Claude Sonnet 4.5 | Anthropic | 77.2 |
| 6 | Kimi K2.5 (Reasoning) | Moonshot AI | 76.8 |
| 7 | Claude Opus 4.8 | Anthropic | 76.4 |
| 8 | DeepSeek V4 Pro (Max) | DeepSeek | 75.9 |
| 9 | MiMo-V2-Omni | Xiaomi | 74.8 |
| 10 | Claude 4.1 Opus | Anthropic | 74.5 |
| 11 | Nemotron 3 Ultra | NVIDIA | 74.2 |
| 12 | DeepSeek V4 Pro (High) | DeepSeek | 73.8 |
| 13 | DeepSeek V4 Flash (Max) | DeepSeek | 73.7 |
| 14 | Qwen3.7 Max | Alibaba | 73.6 |
| 15 | MiMo-V2-Flash | Xiaomi | 73.4 |
| 16 | Claude Haiku 4.5 | Anthropic | 73.3 |
| 17 | Claude Opus 4.7 (Adaptive) | Anthropic | 72.9 |
| 18 | Claude 4 Sonnet | Anthropic | 72.7 |
| 19 | DeepSeek V4 Flash (High) | DeepSeek | 72.2 |
| 20 | Kimi K2.6 | Moonshot AI | 72 |
| 21 | Qwen3.5-122B-A10B | Alibaba | 72 |
| 22 | Gemma 4 12B | Google | 72 |
| 23 | Qwen3.7 Plus | Alibaba | 71.1 |
| 24 | MAI-Thinking-1 | Microsoft | 71 |
| 25 | Grok Code Fast 1 | xAI | 70.8 |
| 26 | Qwen3.6-27B | Alibaba | 70.6 |
| 27 | GLM-4.7 | Z.AI | 70.6 |
| 28 | Mellum2-12B-A2.5B-Thinking | JetBrains | 69.9 |
| 29 | MiniMax M3 | MiniMax | 67 |
| 30 | Qwen3.6-35B-A3B | Alibaba | 66.9 |
| 31 | Claude Sonnet 4.6 | Anthropic | 66.4 |
| 32 | Claude Opus 4.5 | Anthropic | 65.9 |
| 33 | Qwen3.6 Plus | Alibaba | 64.8 |
| 34 | GPT-5.2 | OpenAI | 64.7 |
| 35 | Claude Opus 4.6 | Anthropic | 64.4 |
| 36 | Kimi K2.5 | Moonshot AI | 64.2 |
| 37 | Gemini 2.5 Pro | Google | 63.8 |
| 38 | GLM-5 | Z.AI | 63.2 |
| 39 | GPT-5.3 Codex | OpenAI | 63.1 |
| 40 | Qwen3.5-27B | Alibaba | 63 |
| 41 | Muse Spark | Meta | 61.7 |
| 42 | Grok 4.20 | xAI | 61 |
| 43 | GLM-5.1 | Z.AI | 60.9 |
| 44 | DeepSeek V3.2 | DeepSeek | 60.9 |
| 45 | Qwen3.5 397B | Alibaba | 60.3 |
| 46 | Hy3 Preview | Tencent | 60 |
| 47 | DeepSeek V4 Pro | DeepSeek | 58.8 |
| 48 | GPT-5.5 | OpenAI | 58.6 |
| 49 | Laguna M.1 | Poolside | 58.6 |
| 50 | Qwen3.5-35B-A3B | Alibaba | 58.4 |
| 51 | Composer 2 | Cursor | 58 |
| 52 | GPT-5.4 | OpenAI | 57.7 |
| 53 | MiMo-V2.5-Pro | Xiaomi | 57.2 |
| 54 | DeepSeek V4 Flash | DeepSeek | 57.1 |
| 55 | Step 3.7 Flash | StepFun | 56.3 |
| 56 | MiMo-V2.5 | Xiaomi | 56.1 |
| 57 | Laguna XS.2 | Poolside | 55.1 |
| 58 | GPT-4.1 | OpenAI | 54.6 |
| 59 | Gemini 3.5 Flash | Google | 54.5 |
| 60 | Qwen 3.6 Max (preview) | Alibaba | 54.1 |
| 61 | MiniMax M2.7 | MiniMax | 53.7 |
| 62 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | 53.5 |
| 63 | ZAYA1-74B-Preview | Zyphra | 53.2 |
| 64 | o3-mini | OpenAI | 49.3 |
| 65 | Claude 3.5 Sonnet | Anthropic | 49 |
| 66 | Grok 4.3 | xAI | 47.3 |
| 67 | Gemma 4 31B | Google | 41.6 |
| 68 | DeepSeek V3 | DeepSeek | 39.2 |
| 69 | Mellum2-12B-A2.5B-Instruct | JetBrains | 37.2 |
| 70 | Ling 2.6 Flash | InclusionAI | 27 |
| 71 | GPT-4.1 mini | OpenAI | 23.6 |
| 72 | Gemini 3.1 Pro | Google | 0 |
| 73 | o1-preview | OpenAI | 0 |
| 74 | Gemini 3 Pro | Google | 0 |
| 75 | GLM-5 (Reasoning) | Z.AI | 0 |
| 76 | Qwen3.5 397B (Reasoning) | Alibaba | 0 |
| 77 | GPT-5.1 | OpenAI | 0 |
| 78 | GPT-5 (high) | OpenAI | 0 |
| 79 | GPT-5.2-Codex | OpenAI | 0 |
| 80 | GPT-5.1-Codex-Max | OpenAI | 0 |
| 81 | GPT-5 (medium) | OpenAI | 0 |
| 82 | Grok 4.1 Fast | xAI | 0 |
| 83 | Grok 4 | xAI | 0 |
| 84 | DeepSeek V3.2
(Thinking) | DeepSeek | 0 | | 85 | o1 | OpenAI | 0 | | 86 | o3 | OpenAI | 0 | | 87 | Gemini 3 Flash | Google | 0 | | 88 | GPT-4o mini | OpenAI | 0 | | 89 | Mistral Large 3 | Mistral | 0 | | 90 | Gemini 3.1 Flash-Lite | Google | 0 | | 91 | Claude 4.1 Opus Thinking | Anthropic | 0 | | 92 | GPT-4o | OpenAI | 0 | | 93 | Kimi K2 | Moonshot AI | 0 | | 94 | Llama 3.1 405B | Meta | 0 | | 95 | Sarvam 105B | Sarvam | 0 | | 96 | Mistral Large 2 | Mistral | 0 | | 97 | Gemini 2.5 Flash | Google | 0 | | 98 | Gemini 1.5 Pro | Google | 0 | | 99 | GPT-OSS 120B | OpenAI | 0 | | 100 | Claude 3 Opus | Anthropic | 0 | | 101 | MiniCPM5-1B | OpenBMB | 0 | | 102 | DeepSeek-R1 | DeepSeek | 0 | | 103 | DeepSeek V3.1 (Reasoning) | DeepSeek | 0 | | 104 | Phi-4 | Microsoft | 0 | | 105 | GPT-4.1 nano | OpenAI | 0 | | 106 | Llama 4 Scout | Meta | 0 | | 107 | Nemotron 3 Nano 30B | NVIDIA | 0 | | 108 | DeepSeek V3.1 | DeepSeek | 0 | | 109 | GPT-4 Turbo | OpenAI | 0 | | 110 | Gemini 1.0 Pro | Google | 0 | | 111 | Claude 3 Haiku | Anthropic | 0 | | 112 | Nemotron Ultra 253B | NVIDIA | 0 | | 113 | GLM-4.5-Air | Z.AI | 0 | | 114 | Llama 4 Maverick | Meta | 0 | | 115 | Gemma 3 27B | Google | 0 | | 116 | GPT-OSS 20B | OpenAI | 0 | | 117 | Nova Pro | Amazon | 0 | | 118 | Composer 2.5 | Cursor | 0 | | 119 | Interfaze Beta | Interfaze | 0 | | 120 | GPT-5.4 mini | OpenAI | 0 | | 121 | Exaone 4.0 32B | LG AI Research | 0 | | 122 | GLM-5V-Turbo | Z.AI | 0 | | 123 | GPT-5.4 nano | OpenAI | 0 | | 124 | ZAYA1-8B | Zyphra | 0 | | 125 | Gemma 4 26B A4B | Google | 0 | | 126 | Mistral Small 4 (Reasoning) | Mistral | 0 | | 127 | K-Exaone | LG AI Research | 0 | | 128 | LFM2.5-8B-A1B | LiquidAI | 0 | | 129 | Mistral Medium 3 | Mistral | 0 | | 130 | DeepSeek V4 Pro Base | DeepSeek | 0 | | 131 | Mistral Small 4 | Mistral | 0 | | 132 | Sarvam 30B | Sarvam | 0 | | 133 | Command A+ | Cohere | 0 | | 134 | Gemma 4 E4B | Google | 0 | | 135 | Granite-4.0-1B | IBM | 0 | | 136 | DeepSeek V4 Flash Base | DeepSeek | 0 | | 137 | 
Claude Opus 4.6 (Adaptive) | Anthropic | 0 | | 138 | Gemma 4 E2B | Google | 0 | | 139 | Claude Opus 4.7 | Anthropic | 0 | | 140 | Claude Opus 4.5 Thinking | Anthropic | 0 | | 141 | GLM-5-Turbo | Z.AI | 0 | | 142 | GPT-5.1-Codex | OpenAI | 0 | | 143 | Grok 4.1 Fast (Reasoning) | xAI | 0 | | 144 | GLM-4.6 | Z.AI | 0 | | 145 | Grok 4 Fast (Reasoning) | xAI | 0 | | 146 | Trinity-Large-Preview | Arcee AI | 0 | | 147 | Trinity-Large-Thinking | Arcee AI | 0 | | 148 | Qwen3 Max | Alibaba | 0 | | 149 | Granite-4.0-H-1B | IBM | 0 | | 150 | Qwen2.5 Coder 32B Instruct | Alibaba | 0 | | 151 | DeepSeek R1 Distill Qwen 32B | DeepSeek | 0 | | 152 | Solar Pro 2 | Upstage | 0 | | 153 | Exaone 4.0 1.2B | LG AI Research | 0 | | 154 | MiniMax M2.5 | MiniMax | 0 | | 155 | GPT-5 mini | OpenAI | 0 | | 156 | LFM2.5-VL-1.6B-Extract | LiquidAI | 0 | | 157 | Kimi K2.7 Code | Moonshot AI | 0 | | 158 | Composer 2 Fast | Cursor | 0 | | 159 | Qwen3.5 Plus | Alibaba | 0 | | 160 | Claude Haiku 4.5 Thinking | Anthropic | 0 | | 161 | Claude Sonnet 4.5 Thinking | Anthropic | 0 | | 162 | Granite-4.0-350M | IBM | 0 | | 163 | Granite-4.0-H-350M | IBM | 0 | ## Multimodal & Grounded Leaderboard | Rank | Model | Creator | Avg Score | |------|-------|---------|----------| | 1 | GPT-5.4 Pro | OpenAI | 94 | | 2 | Claude Mythos 5 | Anthropic | 92.7 | | 3 | Claude Fable 5 | Anthropic | 92.4 | | 4 | Gemini 3.5 Flash | Google | 83.8 | | 5 | Gemini 3.1 Pro | Google | 82.8 | | 6 | Muse Spark | Meta | 82.2 | | 7 | Qwen3.7 Plus | Alibaba | 81.1 | | 8 | Gemini 3 Pro | Google | 81.1 | | 9 | GPT-5.2 | OpenAI | 80.3 | | 10 | Kimi K2.6 | Moonshot AI | 79.7 | | 11 | Qwen3.6 Plus | Alibaba | 79.6 | | 12 | Qwen3.5 397B | Alibaba | 79.6 | | 13 | MiMo-V2.5 | Xiaomi | 78.9 | | 14 | Kimi K2.5 (Reasoning) | Moonshot AI | 78.5 | | 15 | Kimi K2.5 | Moonshot AI | 78.5 | | 16 | Grok 4.3 | xAI | 78.1 | | 17 | Claude Sonnet 4.6 | Anthropic | 77.4 | | 18 | Claude Opus 4.6 | Anthropic | 77.3 | | 19 | Qwen3.5-122B-A10B | Alibaba | 77.2 | | 
20 | Gemma 4 31B | Google | 76.9 | | 21 | Qwen3.6-27B | Alibaba | 76.6 | | 22 | GPT-5.4 mini | OpenAI | 76.6 | | 23 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | 76.3 | | 24 | Claude Opus 4.8 | Anthropic | 76.1 | | 25 | Qwen3.6-35B-A3B | Alibaba | 76.1 | | 26 | Gemma 4 26B A4B | Google | 73.8 | | 27 | Gemini 3.1 Flash-Lite | Google | 73.2 | | 28 | GPT-5.4 | OpenAI | 72.7 | | 29 | Interfaze Beta | Interfaze | 71.1 | | 30 | Grok 4.20 | xAI | 70.8 | | 31 | GPT-5.5 | OpenAI | 70.4 | | 32 | Claude Opus 4.5 | Anthropic | 70 | | 33 | Gemma 4 12B | Google | 69.1 | | 34 | GPT-5.4 nano | OpenAI | 66.1 | | 35 | MiniMax M3 | MiniMax | 64.9 | | 36 | Claude Opus 4.7 (Adaptive) | Anthropic | 64.3 | | 37 | Command A+ | Cohere | 59.8 | | 38 | Qwen3.7 Max | Alibaba | 0 | | 39 | DeepSeek V4 Pro (Max) | DeepSeek | 0 | | 40 | GPT-5.3 Codex | OpenAI | 0 | | 41 | GLM-5.1 | Z.AI | 0 | | 42 | DeepSeek V4 Pro (High) | DeepSeek | 0 | | 43 | GLM-5 (Reasoning) | Z.AI | 0 | | 44 | Qwen3.5 397B (Reasoning) | Alibaba | 0 | | 45 | GPT-5.1 | OpenAI | 0 | | 46 | GPT-5 (high) | OpenAI | 0 | | 47 | GPT-5.2-Codex | OpenAI | 0 | | 48 | GPT-5.1-Codex-Max | OpenAI | 0 | | 49 | DeepSeek V4 Flash (Max) | DeepSeek | 0 | | 50 | DeepSeek V4 Flash (High) | DeepSeek | 0 | | 51 | GPT-5 (medium) | OpenAI | 0 | | 52 | DeepSeek V4 Pro | DeepSeek | 0 | | 53 | GLM-4.7 | Z.AI | 0 | | 54 | Grok 4.1 Fast | xAI | 0 | | 55 | GLM-5 | Z.AI | 0 | | 56 | Claude Sonnet 4.5 | Anthropic | 0 | | 57 | Gemini 2.5 Pro | Google | 0 | | 58 | Grok 4 | xAI | 0 | | 59 | Qwen3.5-27B | Alibaba | 0 | | 60 | DeepSeek V3.2 (Thinking) | DeepSeek | 0 | | 61 | MiMo-V2-Flash | Xiaomi | 0 | | 62 | DeepSeek V4 Flash | DeepSeek | 0 | | 63 | GPT-4.1 | OpenAI | 0 | | 64 | DeepSeek V3.2 | DeepSeek | 0 | | 65 | Claude Haiku 4.5 | Anthropic | 0 | | 66 | o3 | OpenAI | 0 | | 67 | Qwen3.5-35B-A3B | Alibaba | 0 | | 68 | Gemini 3 Flash | Google | 0 | | 69 | MiniMax M2.7 | MiniMax | 0 | | 70 | Claude 4.1 Opus | Anthropic | 0 | | 71 | Claude 4 Sonnet | Anthropic | 0 
| | 72 | GPT-4o mini | OpenAI | 0 | | 73 | Mistral Large 3 | Mistral | 0 | | 74 | GPT-4.1 mini | OpenAI | 0 | | 75 | Claude 4.1 Opus Thinking | Anthropic | 0 | | 76 | GPT-4o | OpenAI | 0 | | 77 | Kimi K2 | Moonshot AI | 0 | | 78 | Gemini 2.5 Flash | Google | 0 | | 79 | Gemini 1.5 Pro | Google | 0 | | 80 | DeepSeek V3 | DeepSeek | 0 | | 81 | GPT-OSS 120B | OpenAI | 0 | | 82 | DeepSeek V3.1 (Reasoning) | DeepSeek | 0 | | 83 | GPT-4.1 nano | OpenAI | 0 | | 84 | GLM-4.5 | Z.AI | 0 | | 85 | Llama 4 Scout | Meta | 0 | | 86 | DeepSeek V3.1 | DeepSeek | 0 | | 87 | Claude 3 Haiku | Anthropic | 0 | | 88 | GLM-4.5-Air | Z.AI | 0 | | 89 | Llama 4 Maverick | Meta | 0 | | 90 | Gemma 3 27B | Google | 0 | | 91 | GPT-OSS 20B | OpenAI | 0 | | 92 | Nova Pro | Amazon | 0 | | 93 | MiMo-V2.5-Pro | Xiaomi | 0 | | 94 | MiMo-V2-Omni | Xiaomi | 0 | | 95 | Mistral Medium 3.5 128B | Mistral | 0 | | 96 | Step 3.7 Flash | StepFun | 0 | | 97 | GLM-5V-Turbo | Z.AI | 0 | | 98 | Mistral Small 4 (Reasoning) | Mistral | 0 | | 99 | Mistral Medium 3 | Mistral | 0 | | 100 | Mistral Small 4 | Mistral | 0 | | 101 | Gemma 4 E4B | Google | 0 | | 102 | Claude Opus 4.6 (Adaptive) | Anthropic | 0 | | 103 | Gemma 4 E2B | Google | 0 | | 104 | Claude Opus 4.7 | Anthropic | 0 | | 105 | Claude Opus 4.5 Thinking | Anthropic | 0 | | 106 | GLM-5-Turbo | Z.AI | 0 | | 107 | GPT-5.1-Codex | OpenAI | 0 | | 108 | Grok 4.1 Fast (Reasoning) | xAI | 0 | | 109 | Grok 4 Fast (Reasoning) | xAI | 0 | | 110 | Trinity-Large-Preview | Arcee AI | 0 | | 111 | Trinity-Large-Thinking | Arcee AI | 0 | | 112 | Qwen3 Max | Alibaba | 0 | | 113 | LFM2.5-VL-450M | LiquidAI | 0 | | 114 | LFM2.5-VL-1.6B-Extract | LiquidAI | 0 | | 115 | LFM2.5-VL-450M-Extract | LiquidAI | 0 | | 116 | Claude Haiku 4.5 Thinking | Anthropic | 0 | | 117 | Claude Sonnet 4.5 Thinking | Anthropic | 0 | | 118 | Holo2-235B-A22B | H Company | 0 | | 119 | Holo2-30B-A3B | H Company | 0 | | 120 | Holo2-4B | H Company | 0 | | 121 | Holo2-8B | H Company | 0 | ## Reasoning 
Leaderboard | Rank | Model | Creator | Avg Score | |------|-------|---------|----------| | 1 | Qwen3.7 Plus | Alibaba | 91.7 | | 2 | Qwen3.7 Max | Alibaba | 90.4 | | 3 | GPT-5.5 | OpenAI | 85 | | 4 | GPT-5.4 Pro | OpenAI | 83.3 | | 5 | Gemini 3.1 Pro | Google | 77.1 | | 6 | Claude Opus 4.7 (Adaptive) | Anthropic | 75.8 | | 7 | Gemini 3.5 Flash | Google | 74.7 | | 8 | Claude Opus 4.5 | Anthropic | 64.4 | | 9 | Qwen3.5 397B | Alibaba | 63.2 | | 10 | Qwen3.6 Plus | Alibaba | 62 | | 11 | Nemotron 3 Ultra | NVIDIA | 61.9 | | 12 | Kimi K2.5 | Moonshot AI | 61 | | 13 | GLM-5 | Z.AI | 60.8 | | 14 | Qwen3.5-27B | Alibaba | 60.6 | | 15 | Qwen3.5-122B-A10B | Alibaba | 60.2 | | 16 | Qwen3.5-35B-A3B | Alibaba | 59 | | 17 | Grok 4.20 | xAI | 53.3 | | 18 | GPT-5.2 | OpenAI | 52.9 | | 19 | DeepSeek V4 Pro Base | DeepSeek | 51.5 | | 20 | Gemini 3 Pro Deep Think | Google | 45.1 | | 21 | DeepSeek V4 Flash Base | DeepSeek | 44.7 | | 22 | Gemma 4 12B | Google | 43.4 | | 23 | Muse Spark | Meta | 42.5 | | 24 | Gemini 3 Pro | Google | 31.1 | | 25 | Claude Sonnet 4.5 | Anthropic | 13.6 | | 26 | Claude Opus 4.8 | Anthropic | 0 | | 27 | GPT-5.4 | OpenAI | 0 | | 28 | Claude Opus 4.6 | Anthropic | 0 | | 29 | DeepSeek V4 Pro (Max) | DeepSeek | 0 | | 30 | GPT-5.3 Codex | OpenAI | 0 | | 31 | GLM-5.1 | Z.AI | 0 | | 32 | Claude Sonnet 4.6 | Anthropic | 0 | | 33 | DeepSeek V4 Pro (High) | DeepSeek | 0 | | 34 | Kimi K2.6 | Moonshot AI | 0 | | 35 | MiniMax M3 | MiniMax | 0 | | 36 | Qwen3.5 397B (Reasoning) | Alibaba | 0 | | 37 | GPT-5.1 | OpenAI | 0 | | 38 | GPT-5 (high) | OpenAI | 0 | | 39 | GPT-5.2-Codex | OpenAI | 0 | | 40 | Kimi K2.5 (Reasoning) | Moonshot AI | 0 | | 41 | GPT-5.1-Codex-Max | OpenAI | 0 | | 42 | DeepSeek V4 Flash (Max) | DeepSeek | 0 | | 43 | Qwen3.6-27B | Alibaba | 0 | | 44 | DeepSeek V4 Flash (High) | DeepSeek | 0 | | 45 | GPT-5 (medium) | OpenAI | 0 | | 46 | DeepSeek V4 Pro | DeepSeek | 0 | | 47 | GLM-4.7 | Z.AI | 0 | | 48 | Grok 4.1 Fast | xAI | 0 | | 49 | MAI-Thinking-1 | 
Microsoft | 0 | | 50 | Qwen3.6-35B-A3B | Alibaba | 0 | | 51 | Gemini 2.5 Pro | Google | 0 | | 52 | Grok 4 | xAI | 0 | | 53 | MiMo-V2-Flash | Xiaomi | 0 | | 54 | DeepSeek V4 Flash | DeepSeek | 0 | | 55 | GPT-4.1 | OpenAI | 0 | | 56 | o1 | OpenAI | 0 | | 57 | DeepSeek V3.2 | DeepSeek | 0 | | 58 | o3 | OpenAI | 0 | | 59 | Gemini 3 Flash | Google | 0 | | 60 | MiniMax M2.7 | MiniMax | 0 | | 61 | Claude 4 Sonnet | Anthropic | 0 | | 62 | Mistral Large 3 | Mistral | 0 | | 63 | Gemini 3.1 Flash-Lite | Google | 0 | | 64 | GPT-4.1 mini | OpenAI | 0 | | 65 | Claude 4.1 Opus Thinking | Anthropic | 0 | | 66 | GPT-4o | OpenAI | 0 | | 67 | Kimi K2 | Moonshot AI | 0 | | 68 | Llama 3.1 405B | Meta | 0 | | 69 | Grok Code Fast 1 | xAI | 0 | | 70 | Sarvam 105B | Sarvam | 0 | | 71 | Mistral Large 2 | Mistral | 0 | | 72 | Gemini 2.5 Flash | Google | 0 | | 73 | DeepSeek V3 | DeepSeek | 0 | | 74 | GPT-OSS 120B | OpenAI | 0 | | 75 | MiniCPM5-1B | OpenBMB | 0 | | 76 | DeepSeek-R1 | DeepSeek | 0 | | 77 | DeepSeek V3.1 (Reasoning) | DeepSeek | 0 | | 78 | Phi-4 | Microsoft | 0 | | 79 | GPT-4.1 nano | OpenAI | 0 | | 80 | Llama 4 Scout | Meta | 0 | | 81 | Nemotron 3 Nano 30B | NVIDIA | 0 | | 82 | DeepSeek V3.1 | DeepSeek | 0 | | 83 | Claude 3 Haiku | Anthropic | 0 | | 84 | Nemotron Ultra 253B | NVIDIA | 0 | | 85 | GLM-4.5-Air | Z.AI | 0 | | 86 | Llama 4 Maverick | Meta | 0 | | 87 | Gemma 3 27B | Google | 0 | | 88 | GPT-OSS 20B | OpenAI | 0 | | 89 | Nova Pro | Amazon | 0 | | 90 | MiMo-V2.5-Pro | Xiaomi | 0 | | 91 | MiMo-V2-Pro | Xiaomi | 0 | | 92 | MiMo-V2-Omni | Xiaomi | 0 | | 93 | Qwen 3.6 Max (preview) | Alibaba | 0 | | 94 | Mistral Medium 3.5 128B | Mistral | 0 | | 95 | Grok 4.3 | xAI | 0 | | 96 | Step 3.7 Flash | StepFun | 0 | | 97 | GPT-5.4 mini | OpenAI | 0 | | 98 | Gemma 4 31B | Google | 0 | | 99 | Exaone 4.0 32B | LG AI Research | 0 | | 100 | GLM-5V-Turbo | Z.AI | 0 | | 101 | GPT-5.4 nano | OpenAI | 0 | | 102 | Hy3 Preview | Tencent | 0 | | 103 | Gemma 4 26B A4B | Google | 0 | | 104 | 
Mistral Small 4 (Reasoning) | Mistral | 0 | | 105 | K-Exaone | LG AI Research | 0 | | 106 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | 0 | | 107 | LFM2.5-8B-A1B | LiquidAI | 0 | | 108 | Mistral Medium 3 | Mistral | 0 | | 109 | Mistral Small 4 | Mistral | 0 | | 110 | Sarvam 30B | Sarvam | 0 | | 111 | Command A+ | Cohere | 0 | | 112 | Gemma 4 E4B | Google | 0 | | 113 | Ling 2.6 Flash | InclusionAI | 0 | | 114 | Granite-4.0-1B | IBM | 0 | | 115 | Claude Opus 4.6 (Adaptive) | Anthropic | 0 | | 116 | Gemma 4 E2B | Google | 0 | | 117 | Claude Opus 4.7 | Anthropic | 0 | | 118 | Claude Opus 4.5 Thinking | Anthropic | 0 | | 119 | GLM-5-Turbo | Z.AI | 0 | | 120 | GPT-5.1-Codex | OpenAI | 0 | | 121 | Grok 4.1 Fast (Reasoning) | xAI | 0 | | 122 | GLM-4.6 | Z.AI | 0 | | 123 | Grok 4 Fast (Reasoning) | xAI | 0 | | 124 | Trinity-Large-Preview | Arcee AI | 0 | | 125 | Trinity-Large-Thinking | Arcee AI | 0 | | 126 | Qwen3 Max | Alibaba | 0 | | 127 | Granite-4.0-H-1B | IBM | 0 | | 128 | DeepSeek R1 Distill Qwen 32B | DeepSeek | 0 | | 129 | Solar Pro 2 | Upstage | 0 | | 130 | Exaone 4.0 1.2B | LG AI Research | 0 | | 131 | LFM2.5-VL-1.6B-Extract | LiquidAI | 0 | | 132 | Granite-4.0-350M | IBM | 0 | | 133 | Granite-4.0-H-350M | IBM | 0 | ## Knowledge Leaderboard | Rank | Model | Creator | Avg Score | |------|-------|---------|----------| | 1 | GPT-5.2 | OpenAI | 92.4 | | 2 | Interfaze Beta | Interfaze | 89.9 | | 3 | Kimi K2.5 (Reasoning) | Moonshot AI | 87.3 | | 4 | MiMo-V2-Flash | Xiaomi | 84.5 | | 5 | Claude Sonnet 4.5 | Anthropic | 83.4 | | 6 | Exaone 4.0 32B | LG AI Research | 81.8 | | 7 | Qwen3.5-122B-A10B | Alibaba | 81.6 | | 8 | Qwen3.5-27B | Alibaba | 80.6 | | 9 | Qwen3.5-35B-A3B | Alibaba | 79.3 | | 10 | o1-pro | OpenAI | 79 | | 11 | Gemma 4 12B | Google | 77.8 | | 12 | o3-mini | OpenAI | 77.2 | | 13 | Claude Opus 4.6 | Anthropic | 76.2 | | 14 | Qwen3 235B 2507 | Alibaba | 76.2 | | 15 | o1 | OpenAI | 75.7 | | 16 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | 75.5 | | 17 | Claude Fable 
5 | Anthropic | 74.8 | | 18 | Claude Mythos 5 | Anthropic | 74.6 | | 19 | Qwen 3.6 Max (preview) | Alibaba | 73.9 | | 20 | Claude Sonnet 4.6 | Anthropic | 73.7 | | 21 | ZAYA1-8B | Zyphra | 73.1 | | 22 | Qwen3.7 Max | Alibaba | 71.2 | | 23 | GLM-5 | Z.AI | 70.7 | | 24 | Claude Opus 4.8 | Anthropic | 70.1 | | 25 | DeepSeek V3 | DeepSeek | 70 | | 26 | MAI-Thinking-1 | Microsoft | 69.9 | | 27 | Claude Opus 4.7 (Adaptive) | Anthropic | 68.2 | | 28 | Qwen3.7 Plus | Alibaba | 67.9 | | 29 | GPT-5.5 | OpenAI | 66.4 | | 30 | GPT-4.1 | OpenAI | 66.3 | | 31 | Claude Opus 4.5 | Anthropic | 66.2 | | 32 | GPT-5.4 | OpenAI | 66.1 | | 33 | DeepSeek V4 Pro (Max) | DeepSeek | 66.1 | | 34 | Qwen3.6 Plus | Alibaba | 66 | | 35 | Gemma 4 E4B | Google | 65.6 | | 36 | Qwen3.5 397B | Alibaba | 65.2 | | 37 | Kimi K2.5 | Moonshot AI | 65.1 | | 38 | ZAYA1-74B-Preview | Zyphra | 64.3 | | 39 | GPT-4.1 mini | OpenAI | 64.2 | | 40 | DeepSeek V4 Pro Base | DeepSeek | 63.4 | | 41 | DeepSeek V4 Pro (High) | DeepSeek | 62.6 | | 42 | Nemotron 3 Ultra | NVIDIA | 62.6 | | 43 | Qwen3.6-27B | Alibaba | 62.2 | | 44 | Gemma 4 31B | Google | 61.3 | | 45 | GLM-4.7 | Z.AI | 60.6 | | 46 | Qwen3.6-35B-A3B | Alibaba | 60.5 | | 47 | DeepSeek V4 Flash (Max) | DeepSeek | 60 | | 48 | Claude 3.5 Sonnet | Anthropic | 59.4 | | 49 | Ling 2.6 Flash | InclusionAI | 59 | | 50 | Gemini 3.5 Flash | Google | 58 | | 51 | Mellum2-12B-A2.5B-Thinking | JetBrains | 57.6 | | 52 | GPT-5.4 mini | OpenAI | 57.4 | | 53 | DeepSeek V4 Flash (High) | DeepSeek | 57.2 | | 54 | GPT-5.5 Pro | OpenAI | 57.2 | | 55 | Gemma 4 E2B | Google | 54.1 | | 56 | Grok 4.3 | xAI | 53.9 | | 57 | Kimi K2.6 | Moonshot AI | 53.8 | | 58 | GPT-5.4 nano | OpenAI | 53.2 | | 59 | GLM-5.1 | Z.AI | 52.3 | | 60 | DeepSeek V4 Flash Base | DeepSeek | 52.2 | | 61 | Muse Spark | Meta | 50.4 | | 62 | GPT-4.1 nano | OpenAI | 50.3 | | 63 | DeepSeek V4 Pro | DeepSeek | 49.4 | | 64 | Gemma 4 26B A4B | Google | 49.2 | | 65 | GPT-5.4 Pro | OpenAI | 49 | | 66 | MiMo-V2.5-Pro | 
Xiaomi | 48 | | 67 | Hy3 Preview | Tencent | 46.7 | | 68 | DeepSeek V4 Flash | DeepSeek | 45.2 | | 69 | Mellum2-12B-A2.5B-Instruct | JetBrains | 40.9 | | 70 | Gemini 2.5 Pro | Google | 40.8 | | 71 | MiniCPM5-1B | OpenBMB | 39.8 | | 72 | LFM2.5-VL-450M | LiquidAI | 21.6 | | 73 | Gemini 3.1 Pro | Google | 0 | | 74 | GPT-5.3 Codex | OpenAI | 0 | | 75 | o1-preview | OpenAI | 0 | | 76 | Gemini 3 Pro | Google | 0 | | 77 | MiniMax M3 | MiniMax | 0 | | 78 | Qwen3.5 397B (Reasoning) | Alibaba | 0 | | 79 | GPT-5.1 | OpenAI | 0 | | 80 | GPT-5 (high) | OpenAI | 0 | | 81 | GPT-5.2-Codex | OpenAI | 0 | | 82 | GPT-5.1-Codex-Max | OpenAI | 0 | | 83 | Grok 4.20 | xAI | 0 | | 84 | GPT-5 (medium) | OpenAI | 0 | | 85 | Grok 4.1 Fast | xAI | 0 | | 86 | Grok 4 | xAI | 0 | | 87 | o3-pro | OpenAI | 0 | | 88 | DeepSeek V3.2 | DeepSeek | 0 | | 89 | o3 | OpenAI | 0 | | 90 | Gemini 3 Flash | Google | 0 | | 91 | MiniMax M2.7 | MiniMax | 0 | | 92 | Claude 4.1 Opus | Anthropic | 0 | | 93 | Claude 4 Sonnet | Anthropic | 0 | | 94 | GPT-4o mini | OpenAI | 0 | | 95 | Mistral Large 3 | Mistral | 0 | | 96 | Gemini 3.1 Flash-Lite | Google | 0 | | 97 | Claude 4.1 Opus Thinking | Anthropic | 0 | | 98 | GPT-4o | OpenAI | 0 | | 99 | Kimi K2 | Moonshot AI | 0 | | 100 | Llama 3.1 405B | Meta | 0 | | 101 | Grok Code Fast 1 | xAI | 0 | | 102 | Sarvam 105B | Sarvam | 0 | | 103 | Mistral Large 2 | Mistral | 0 | | 104 | Gemini 2.5 Flash | Google | 0 | | 105 | Gemini 1.5 Pro | Google | 0 | | 106 | GPT-OSS 120B | OpenAI | 0 | | 107 | Claude 3 Opus | Anthropic | 0 | | 108 | DeepSeek-R1 | DeepSeek | 0 | | 109 | DeepSeek V3.1 (Reasoning) | DeepSeek | 0 | | 110 | Phi-4 | Microsoft | 0 | | 111 | Llama 4 Scout | Meta | 0 | | 112 | Nemotron 3 Nano 30B | NVIDIA | 0 | | 113 | DeepSeek V3.1 | DeepSeek | 0 | | 114 | GPT-4 Turbo | OpenAI | 0 | | 115 | Gemini 1.0 Pro | Google | 0 | | 116 | Claude 3 Haiku | Anthropic | 0 | | 117 | Nemotron Ultra 253B | NVIDIA | 0 | | 118 | GLM-4.5-Air | Z.AI | 0 | | 119 | Llama 4 Maverick | Meta 
| 0 | | 120 | Gemma 3 27B | Google | 0 | | 121 | GPT-OSS 20B | OpenAI | 0 | | 122 | Nova Pro | Amazon | 0 | | 123 | MiMo-V2-Pro | Xiaomi | 0 | | 124 | MiMo-V2-Omni | Xiaomi | 0 | | 125 | Mistral Medium 3.5 128B | Mistral | 0 | | 126 | Step 3.7 Flash | StepFun | 0 | | 127 | GLM-5V-Turbo | Z.AI | 0 | | 128 | Mistral Small 4 (Reasoning) | Mistral | 0 | | 129 | K-Exaone | LG AI Research | 0 | | 130 | LFM2.5-8B-A1B | LiquidAI | 0 | | 131 | Mistral Medium 3 | Mistral | 0 | | 132 | Mistral Small 4 | Mistral | 0 | | 133 | Sarvam 30B | Sarvam | 0 | | 134 | Command A+ | Cohere | 0 | | 135 | Granite-4.0-1B | IBM | 0 | | 136 | Claude Opus 4.6 (Adaptive) | Anthropic | 0 | | 137 | Claude Opus 4.7 | Anthropic | 0 | | 138 | Claude Opus 4.5 Thinking | Anthropic | 0 | | 139 | GLM-5-Turbo | Z.AI | 0 | | 140 | GPT-5.1-Codex | OpenAI | 0 | | 141 | Grok 4.1 Fast (Reasoning) | xAI | 0 | | 142 | GLM-4.6 | Z.AI | 0 | | 143 | Grok 4 Fast (Reasoning) | xAI | 0 | | 144 | Trinity-Large-Preview | Arcee AI | 0 | | 145 | Trinity-Large-Thinking | Arcee AI | 0 | | 146 | Qwen3 Max | Alibaba | 0 | | 147 | Granite-4.0-H-1B | IBM | 0 | | 148 | Qwen2.5 Coder 32B Instruct | Alibaba | 0 | | 149 | DeepSeek R1 Distill Qwen 32B | DeepSeek | 0 | | 150 | Solar Pro 2 | Upstage | 0 | | 151 | Exaone 4.0 1.2B | LG AI Research | 0 | | 152 | LFM2.5-VL-1.6B-Extract | LiquidAI | 0 | | 153 | Granite-4.0-350M | IBM | 0 | | 154 | Granite-4.0-H-350M | IBM | 0 | ## Instruction Following Leaderboard | Rank | Model | Creator | Avg Score | |------|-------|---------|----------| | 1 | Qwen3.5-27B | Alibaba | 95 | | 2 | Kimi K2.5 | Moonshot AI | 93.9 | | 3 | o3-mini | OpenAI | 93.9 | | 4 | Qwen3.5-122B-A10B | Alibaba | 93.4 | | 5 | GLM-5 | Z.AI | 92.6 | | 6 | Qwen3.5 397B | Alibaba | 92.6 | | 7 | o1 | OpenAI | 92.2 | | 8 | Qwen3.5-35B-A3B | Alibaba | 91.9 | | 9 | Qwen3.7 Plus | Alibaba | 89.2 | | 10 | Qwen3.7 Max | Alibaba | 89 | | 11 | GPT-4.1 mini | OpenAI | 88.5 | | 12 | Qwen3.6 Plus | Alibaba | 87.8 | | 13 | GPT-4.1 | OpenAI 
| 87.4 | | 14 | DeepSeek V3 | DeepSeek | 86.1 | | 15 | MAI-Thinking-1 | Microsoft | 85 | | 16 | GPT-4.1 nano | OpenAI | 83.2 | | 17 | Nemotron 3 Ultra | NVIDIA | 81.7 | | 18 | Grok 4.3 | xAI | 81.3 | | 19 | LFM2.5-8B-A1B | LiquidAI | 79.5 | | 20 | Claude Opus 4.5 | Anthropic | 79.4 | | 21 | Mellum2-12B-A2.5B-Thinking | JetBrains | 76.5 | | 22 | Gemini 3.5 Flash | Google | 76.3 | | 23 | Mellum2-12B-A2.5B-Instruct | JetBrains | 75.8 | | 24 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | 74.2 | | 25 | ZAYA1-8B | Zyphra | 74 | | 26 | MiniCPM5-1B | OpenBMB | 68.6 | | 27 | Hy3 Preview | Tencent | 63.1 | | 28 | LFM2.5-VL-450M | LiquidAI | 61.2 | | 29 | Ling 2.6 Flash | InclusionAI | 57 | | 30 | Claude Opus 4.8 | Anthropic | 0 | | 31 | Gemini 3.1 Pro | Google | 0 | | 32 | GPT-5.5 | OpenAI | 0 | | 33 | GPT-5.4 | OpenAI | 0 | | 34 | Claude Opus 4.6 | Anthropic | 0 | | 35 | DeepSeek V4 Pro (Max) | DeepSeek | 0 | | 36 | GPT-5.3 Codex | OpenAI | 0 | | 37 | Claude Opus 4.7 (Adaptive) | Anthropic | 0 | | 38 | GLM-5.1 | Z.AI | 0 | | 39 | Claude Sonnet 4.6 | Anthropic | 0 | | 40 | DeepSeek V4 Pro (High) | DeepSeek | 0 | | 41 | Kimi K2.6 | Moonshot AI | 0 | | 42 | Gemini 3 Pro | Google | 0 | | 43 | MiniMax M3 | MiniMax | 0 | | 44 | GPT-5.2 | OpenAI | 0 | | 45 | Qwen3.5 397B (Reasoning) | Alibaba | 0 | | 46 | GPT-5.1 | OpenAI | 0 | | 47 | GPT-5 (high) | OpenAI | 0 | | 48 | GPT-5.2-Codex | OpenAI | 0 | | 49 | Kimi K2.5 (Reasoning) | Moonshot AI | 0 | | 50 | GPT-5.1-Codex-Max | OpenAI | 0 | | 51 | DeepSeek V4 Flash (Max) | DeepSeek | 0 | | 52 | Qwen3.6-27B | Alibaba | 0 | | 53 | DeepSeek V4 Flash (High) | DeepSeek | 0 | | 54 | GPT-5 (medium) | OpenAI | 0 | | 55 | GLM-4.7 | Z.AI | 0 | | 56 | Grok 4.1 Fast | xAI | 0 | | 57 | Qwen3.6-35B-A3B | Alibaba | 0 | | 58 | Gemini 2.5 Pro | Google | 0 | | 59 | Grok 4 | xAI | 0 | | 60 | MiMo-V2-Flash | Xiaomi | 0 | | 61 | DeepSeek V3.2 | DeepSeek | 0 | | 62 | o3 | OpenAI | 0 | | 63 | Gemini 3 Flash | Google | 0 | | 64 | MiniMax M2.7 | MiniMax | 0 | | 65 | 
Claude 4 Sonnet | Anthropic | 0 | | 66 | GPT-4o mini | OpenAI | 0 | | 67 | Mistral Large 3 | Mistral | 0 | | 68 | Gemini 3.1 Flash-Lite | Google | 0 | | 69 | Claude 4.1 Opus Thinking | Anthropic | 0 | | 70 | GPT-4o | OpenAI | 0 | | 71 | Kimi K2 | Moonshot AI | 0 | | 72 | Llama 3.1 405B | Meta | 0 | | 73 | Grok Code Fast 1 | xAI | 0 | | 74 | Sarvam 105B | Sarvam | 0 | | 75 | Mistral Large 2 | Mistral | 0 | | 76 | Gemini 2.5 Flash | Google | 0 | | 77 | GPT-OSS 120B | OpenAI | 0 | | 78 | DeepSeek-R1 | DeepSeek | 0 | | 79 | DeepSeek V3.1 (Reasoning) | DeepSeek | 0 | | 80 | Phi-4 | Microsoft | 0 | | 81 | Llama 4 Scout | Meta | 0 | | 82 | Nemotron 3 Nano 30B | NVIDIA | 0 | | 83 | DeepSeek V3.1 | DeepSeek | 0 | | 84 | Claude 3 Haiku | Anthropic | 0 | | 85 | Nemotron Ultra 253B | NVIDIA | 0 | | 86 | GLM-4.5-Air | Z.AI | 0 | | 87 | Llama 4 Maverick | Meta | 0 | | 88 | Gemma 3 27B | Google | 0 | | 89 | GPT-OSS 20B | OpenAI | 0 | | 90 | Nova Pro | Amazon | 0 | | 91 | MiMo-V2.5-Pro | Xiaomi | 0 | | 92 | MiMo-V2-Pro | Xiaomi | 0 | | 93 | MiMo-V2-Omni | Xiaomi | 0 | | 94 | Muse Spark | Meta | 0 | | 95 | Qwen 3.6 Max (preview) | Alibaba | 0 | | 96 | Mistral Medium 3.5 128B | Mistral | 0 | | 97 | Interfaze Beta | Interfaze | 0 | | 98 | Step 3.7 Flash | StepFun | 0 | | 99 | GPT-5.4 mini | OpenAI | 0 | | 100 | Gemma 4 31B | Google | 0 | | 101 | Exaone 4.0 32B | LG AI Research | 0 | | 102 | GLM-5V-Turbo | Z.AI | 0 | | 103 | GPT-5.4 nano | OpenAI | 0 | | 104 | Gemma 4 26B A4B | Google | 0 | | 105 | Mistral Small 4 (Reasoning) | Mistral | 0 | | 106 | K-Exaone | LG AI Research | 0 | | 107 | Gemma 4 12B | Google | 0 | | 108 | Mistral Medium 3 | Mistral | 0 | | 109 | Mistral Small 4 | Mistral | 0 | | 110 | Sarvam 30B | Sarvam | 0 | | 111 | Command A+ | Cohere | 0 | | 112 | Gemma 4 E4B | Google | 0 | | 113 | Granite-4.0-1B | IBM | 0 | | 114 | Claude Opus 4.6 (Adaptive) | Anthropic | 0 | | 115 | Gemma 4 E2B | Google | 0 | | 116 | Claude Opus 4.7 | Anthropic | 0 | | 117 | Claude Opus 4.5 
Thinking | Anthropic | 0 | | 118 | GLM-5-Turbo | Z.AI | 0 | | 119 | GPT-5.1-Codex | OpenAI | 0 | | 120 | Grok 4.1 Fast (Reasoning) | xAI | 0 | | 121 | GLM-4.6 | Z.AI | 0 | | 122 | Grok 4 Fast (Reasoning) | xAI | 0 | | 123 | Trinity-Large-Preview | Arcee AI | 0 | | 124 | Trinity-Large-Thinking | Arcee AI | 0 | | 125 | Qwen3 Max | Alibaba | 0 | | 126 | Granite-4.0-H-1B | IBM | 0 | | 127 | DeepSeek R1 Distill Qwen 32B | DeepSeek | 0 | | 128 | Solar Pro 2 | Upstage | 0 | | 129 | Exaone 4.0 1.2B | LG AI Research | 0 | | 130 | LFM2.5-VL-1.6B-Extract | LiquidAI | 0 | | 131 | Granite-4.0-350M | IBM | 0 | | 132 | Granite-4.0-H-350M | IBM | 0 | ## Multilingual Leaderboard | Rank | Model | Creator | Avg Score | |------|-------|---------|----------| | 1 | Qwen3.7 Max | Alibaba | 87 | | 2 | Claude Opus 4.5 | Anthropic | 85.7 | | 3 | DeepSeek V4 Flash Base | DeepSeek | 85.7 | | 4 | Qwen3.7 Plus | Alibaba | 85.4 | | 5 | Qwen3.6 Plus | Alibaba | 84.7 | | 6 | Qwen3.5 397B | Alibaba | 84.7 | | 7 | DeepSeek V4 Pro Base | DeepSeek | 84.4 | | 8 | GLM-5 | Z.AI | 83.1 | | 9 | Nemotron 3 Ultra | NVIDIA | 83 | | 10 | Kimi K2.5 | Moonshot AI | 82.3 | | 11 | Qwen3.5-122B-A10B | Alibaba | 82.2 | | 12 | Qwen3.5-27B | Alibaba | 82.2 | | 13 | Qwen3.5-35B-A3B | Alibaba | 81 | | 14 | Qwen3 235B 2507 | Alibaba | 79.4 | | 15 | Claude Mythos 5 | Anthropic | 0 | | 16 | Claude Fable 5 | Anthropic | 0 | | 17 | Claude Opus 4.8 | Anthropic | 0 | ## Mathematics Leaderboard | Rank | Model | Creator | Avg Score | |------|-------|---------|----------| | 1 | MAI-Thinking-1 | Microsoft | 97 | | 2 | Kimi K2.5 (Reasoning) | Moonshot AI | 96.1 | | 3 | Kimi K2.5 | Moonshot AI | 96.1 | | 4 | GLM-4.7 | Z.AI | 95.7 | | 5 | MiMo-V2-Flash | Xiaomi | 94.1 | | 6 | Claude Sonnet 4.5 | Anthropic | 87 | | 7 | Exaone 4.0 32B | LG AI Research | 85.3 | | 8 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | 82.1 | | 9 | LFM2.5-8B-A1B | LiquidAI | 59.9 | | 10 | MiniCPM5-1B | OpenBMB | 59.6 | | 11 | GPT-5.5 Pro | OpenAI | 52.4 | | 12 | 
GPT-5.5 | OpenAI | 51.7 | | 13 | GPT-5.4 Pro | OpenAI | 50 | | 14 | Claude Opus 4.7 (Adaptive) | Anthropic | 43.8 | | 15 | Claude Mythos 5 | Anthropic | 0 | | 16 | Claude Fable 5 | Anthropic | 0 | | 17 | Claude Opus 4.8 | Anthropic | 0 | | 18 | Qwen3.7 Max | Alibaba | 0 | | 19 | Qwen3.7 Plus | Alibaba | 0 | | 20 | Claude Opus 4.6 | Anthropic | 0 | | 21 | DeepSeek V4 Pro (Max) | DeepSeek | 0 | | 22 | GLM-5.1 | Z.AI | 0 | | 23 | DeepSeek V4 Pro (High) | DeepSeek | 0 | | 24 | Kimi K2.6 | Moonshot AI | 0 | | 25 | MiniMax M3 | MiniMax | 0 | | 26 | Claude Opus 4.5 | Anthropic | 0 | | 27 | DeepSeek V4 Flash (Max) | DeepSeek | 0 | | 28 | Qwen3.6-27B | Alibaba | 0 | | 29 | DeepSeek V4 Flash (High) | DeepSeek | 0 | | 30 | DeepSeek V4 Pro | DeepSeek | 0 | | 31 | GLM-5 | Z.AI | 0 | | 32 | Qwen3.6 Plus | Alibaba | 0 | | 33 | Qwen3.6-35B-A3B | Alibaba | 0 | | 34 | Qwen3.5 397B | Alibaba | 0 | | 35 | DeepSeek V4 Flash | DeepSeek | 0 | | 36 | o3-mini | OpenAI | 0 | | 37 | MiniMax M2.7 | MiniMax | 0 | | 38 | ZAYA1-8B | Zyphra | 0 | | 39 | ZAYA1-74B-Preview | Zyphra | 0 | | 40 | Gemma 4 12B | Google | 0 | | 41 | DeepSeek V4 Pro Base | DeepSeek | 0 | | 42 | DeepSeek V4 Flash Base | DeepSeek | 0 | | 43 | Trinity-Large-Preview | Arcee AI | 0 | | 44 | Trinity-Large-Thinking | Arcee AI | 0 | ## LLM Pricing Comparison Last updated: June 12, 2026 | Model | Creator | Input $/1M | Output $/1M | Context | |-------|---------|-----------|------------|--------| | 1-bit Bonsai 1.7B | Prism ML | Free* | Free* | 32K | | 1-bit Bonsai 4B | Prism ML | Free* | Free* | 32K | | 1-bit Bonsai 8B | Prism ML | Free* | Free* | 64K | | Aion-2.0 | Aion Labs | $0.80 | $1.60 | 128K | | Claude 3 Haiku | Anthropic | $0.25 | $1.25 | 200K | | Claude 3 Opus | Anthropic | $15.00 | $75.00 | 200K | | Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | 200K | | Claude 4 Sonnet | Anthropic | $3.00 | $15.00 | 200K | | Claude 4.1 Opus | Anthropic | $15.00 | $75.00 | 200K | | Claude 4.1 Opus Thinking | Anthropic | Pricing 
unavailable | Pricing unavailable | 200K |
| Claude Fable 5 | Anthropic | $10.00 | $50.00 | 1M+ |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K |
| Claude Mythos 5 | Anthropic | $10.00 | $50.00 | 1M+ |
| Claude Opus 4.5 | Anthropic | $5.00 | $25.00 | 200K |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | 1M |
| Claude Opus 4.7 | Anthropic | $5.00 | $25.00 | 1M |
| Claude Opus 4.7 (Adaptive) | Anthropic | $5.00 | $25.00 | 1M |
| Claude Opus 4.8 | Anthropic | $5.00 | $25.00 | 1M |
| Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | 200K |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K |
| Command A+ | Cohere | $2.50 | $10.00 | 128K |
| Composer 2 | Cursor | $0.50 | $2.50 | 200K |
| Composer 2.5 | Cursor | $0.50 | $2.50 | 200K |
| DBRX Instruct | Databricks | Free* | Free* | 32K |
| DeepSeek Coder 2.0 | DeepSeek | Pricing unavailable | Pricing unavailable | 128K |
| DeepSeek LLM 2.0 | DeepSeek | Free* | Free* | 128K |
| DeepSeek R1 | DeepSeek | $0.55 | $2.19 | 128K |
| DeepSeek R1 Distill Qwen 32B | DeepSeek | Free* | Free* | 128K |
| DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 128K |
| DeepSeek V3.1 | DeepSeek | Free* | Free* | 128K |
| DeepSeek V3.1 (Reasoning) | DeepSeek | Free* | Free* | 128K |
| DeepSeek V3.2 | DeepSeek | $0.28 | $0.42 | 128K |
| DeepSeek V3.2 (Thinking) | DeepSeek | $0.55 | $2.19 | 128K |
| DeepSeek V4 Flash | DeepSeek | $0.14 | $0.28 | 1M |
| DeepSeek V4 Flash (High) | DeepSeek | $0.14 | $0.28 | 1M |
| DeepSeek V4 Flash (Max) | DeepSeek | $0.14 | $0.28 | 1M |
| DeepSeek V4 Flash Base | DeepSeek | Pricing unavailable | Pricing unavailable | 1M |
| DeepSeek V4 Pro | DeepSeek | $1.74 | $3.48 | 1M |
| DeepSeek V4 Pro (High) | DeepSeek | $1.74 | $3.48 | 1M |
| DeepSeek V4 Pro (Max) | DeepSeek | $1.74 | $3.48 | 1M |
| DeepSeek V4 Pro Base | DeepSeek | Pricing unavailable | Pricing unavailable | 1M |
| DeepSeekMath V2 | DeepSeek | Free* | Free* | 128K |
| Gemini 1.0 Pro | Google | Pricing unavailable | Pricing unavailable | 32K |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | 1M |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | 1M |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | 1M |
| Gemini 3 Flash | Google | $0.50 | $3.00 | 1M |
| Gemini 3 Pro | Google | $2.00 | $12.00 | 2M |
| Gemini 3 Pro Deep Think | Google | Pricing unavailable | Pricing unavailable | 2M |
| Gemini 3.1 Flash-Lite | Google | $0.25 | $1.50 | 1M |
| Gemini 3.1 Pro | Google | $2.00 | $12.00 | 1M |
| Gemini 3.5 Flash | Google | $1.50 | $9.00 | 1M |
| Gemma 3 27B | Google | Free* | Free* | 32K |
| Gemma 4 26B A4B | Google | Free* | Free* | 256K |
| Gemma 4 31B | Google | Free* | Free* | 256K |
| Gemma 4 E2B | Google | Free* | Free* | 128K |
| Gemma 4 E4B | Google | Free* | Free* | 128K |
| GLM-4.5 | Z.AI | $0.60 | $2.20 | 128K |
| GLM-4.5-Air | Z.AI | $0.20 | $1.10 | 128K |
| GLM-4.7 | Z.AI | Free* | Free* | 200K |
| GLM-4.7-Flash | Z.AI | Free* | Free* | 200K |
| GLM-5 | Z.AI | $1.00 | $3.20 | 200K |
| GLM-5 (Reasoning) | Z.AI | $1.00 | $3.20 | 200K |
| GLM-5-Turbo | Z.AI | $1.20 | $4.00 | 200K |
| GLM-5.1 | Z.AI | $1.40 | $4.40 | 203K |
| GLM-5V-Turbo | Z.AI | $1.20 | $4.00 | 200K |
| GPT-4 Turbo | OpenAI | $10.00 | $30.00 | 128K |
| GPT-4.1 | OpenAI | $2.00 | $8.00 | 1M |
| GPT-4.1 mini | OpenAI | $0.40 | $1.60 | 1M |
| GPT-4.1 nano | OpenAI | $0.10 | $0.40 | 1M |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K |
| GPT-5 (high) | OpenAI | $1.25 | $10.00 | 400K |
| GPT-5 (medium) | OpenAI | Pricing unavailable | Pricing unavailable | 128K |
| GPT-5 mini | OpenAI | $0.25 | $2.00 | 128K |
| GPT-5 nano | OpenAI | $0.05 | $0.40 | 400K |
| GPT-5.1 | OpenAI | $1.25 | $10.00 | 400K |
| GPT-5.1-Codex-Max | OpenAI | $1.25 | $10.00 | 400K |
| GPT-5.2 | OpenAI | $1.75 | $14.00 | 400K |
| GPT-5.2 Instant | OpenAI | $1.50 | $6.00 | 128K |
| GPT-5.2 Pro | OpenAI | $25.00 | $150.00 | 400K |
| GPT-5.2-Codex | OpenAI | $1.75 | $14.00 | 400K |
| GPT-5.3 Codex | OpenAI | $1.75 | $14.00 | 400K |
| GPT-5.3 Instant | OpenAI | $1.75 | $14.00 | 128K |
| GPT-5.3-Codex-Spark | OpenAI | Pricing unavailable | Pricing unavailable | 256K |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1.05M |
| GPT-5.4 mini | OpenAI | $0.75 | $4.50 | 400K |
| GPT-5.4 nano | OpenAI | $0.20 | $1.25 | 400K |
| GPT-5.4 Pro | OpenAI | $30.00 | $180.00 | 1.05M |
| GPT-5.5 | OpenAI | $5.00 | $30.00 | 1M |
| GPT-5.5 Pro | OpenAI | $30.00 | $180.00 | 1M |
| GPT-OSS 120B | OpenAI | Free* | Free* | 128K |
| GPT-OSS 20B | OpenAI | Free* | Free* | 128K |
| Granite-4.0-1B | IBM | Free* | Free* | 128K |
| Granite-4.0-350M | IBM | Free* | Free* | 32K |
| Granite-4.0-H-1B | IBM | Free* | Free* | 128K |
| Granite-4.0-H-350M | IBM | Free* | Free* | 32K |
| Grok 3 [Beta] | xAI | Pricing unavailable | Pricing unavailable | 128K |
| Grok 3 Mini | xAI | $0.30 | $0.50 | 128K |
| Grok 4 | xAI | Pricing unavailable | Pricing unavailable | 128K |
| Grok 4.1 | xAI | Pricing unavailable | Pricing unavailable | 1M |
| Grok 4.1 Fast | xAI | $0.20 | $0.50 | 2M |
| Grok 4.20 | xAI | $2.00 | $6.00 | 2M |
| Grok 4.20 Multi-agent | xAI | Pricing unavailable | Pricing unavailable | 2M |
| Grok 4.3 | xAI | $1.25 | $2.50 | 1M |
| Grok Build 0.1 | xAI | $1.00 | $2.00 | 256K |
| Grok Code Fast 1 | xAI | $0.20 | $1.50 | 256K |
| Holo3-122B-A10B | H Company | $0.40 | $3.00 | 64K |
| Holo3-35B-A3B | H Company | Pricing unavailable | Pricing unavailable | 64K |
| Holo3.1-0.8B | H Company | Free* | Free* | 262K |
| Holo3.1-35B-A3B | H Company | $0.25 | $1.80 | 64K |
| Holo3.1-35B-A3B-FP8 | H Company | Free* | Free* | 262K |
| Holo3.1-35B-A3B-GGUF | H Company | Free* | Free* | 262K |
| Holo3.1-35B-A3B-NVFP4 | H Company | Free* | Free* | 262K |
| Holo3.1-4B | H Company | Free* | Free* | 262K |
| Holo3.1-9B | H Company | Free* | Free* | 262K |
| Hy3 Preview | Tencent | Free* | Free* | 256K |
| Interfaze Beta | Interfaze | $1.50 | $3.50 | 1M |
| Kimi 2.6 | Moonshot AI | $0.95 | $4.00 | 256K |
| Kimi K2 | Moonshot AI | $0.60 | $2.50 | 128K |
| Kimi K2.5 | Moonshot AI | $0.60 | $3.00 | 256K |
| Kimi K2.5 (Reasoning) | Moonshot AI | $0.60 | $3.00 | 256K |
| Kimi K2.7 Code | Moonshot AI | $0.95 | $4.00 | 256K |
| Laguna M.1 | Poolside | Free* | Free* | 131K |
| Laguna XS.2 | Poolside | Free* | Free* | 131K |
| Leanstral | Mistral | Free* | Free* | 256K |
| LFM2-24B-A2B | LiquidAI | Free* | Free* | 32K |
| LFM2.5-1.2B-Instruct | LiquidAI | Free* | Free* | 32K |
| LFM2.5-1.2B-Thinking | LiquidAI | Free* | Free* | 32K |
| LFM2.5-350M | LiquidAI | Free* | Free* | 32K |
| LFM2.5-8B-A1B | LiquidAI | Free* | Free* | 128K |
| LFM2.5-VL-450M | LiquidAI | Free* | Free* | 128K |
| Ling 2.6 Flash | InclusionAI | Pricing unavailable | Pricing unavailable | 262K |
| Llama 3 70B | Meta | Free* | Free* | 128K |
| Llama 3.1 405B | Meta | Free* | Free* | 128K |
| Llama 4 Behemoth | Meta | Free* | Free* | 32K |
| Llama 4 Maverick | Meta | Free* | Free* | 1M |
| Llama 4 Scout | Meta | Free* | Free* | 10M |
| Mercury 2 | Inception | $0.25 | $0.75 | 128K |
| MiMo-V2-Flash | Xiaomi | Free* | Free* | 256K |
| MiMo-V2.5 | Xiaomi | Pricing unavailable | Pricing unavailable | 1M |
| MiMo-V2.5-Pro | Xiaomi | Pricing unavailable | Pricing unavailable | 1M |
| MiniMax M1 80k | MiniMax | Pricing unavailable | Pricing unavailable | 80K |
| MiniMax M2.5 | MiniMax | $0.30 | $1.20 | 128K |
| MiniMax M2.7 | MiniMax | $0.30 | $1.20 | 200K |
| MiniMax M3 | MiniMax | $0.30 | $1.20 | 1M |
| Ministral 3 14B | Mistral | $0.20 | $0.20 | 256K |
| Ministral 3 14B (Reasoning) | Mistral | $0.20 | $0.20 | 256K |
| Ministral 3 3B | Mistral | $0.10 | $0.10 | 256K |
| Ministral 3 3B (Reasoning) | Mistral | $0.10 | $0.10 | 256K |
| Ministral 3 8B | Mistral | $0.15 | $0.15 | 256K |
| Ministral 3 8B (Reasoning) | Mistral | $0.15 | $0.15 | 256K |
| Mistral 7B v0.3 | Mistral | Free* | Free* | 32K |
| Mistral 8x7B | Mistral | Free* | Free* | 32K |
| Mistral 8x7B v0.2 | Mistral | Free* | Free* | 32K |
| Mistral Large 2 | Mistral | Pricing unavailable | Pricing unavailable | 128K |
| Mistral Large 3 | Mistral | $0.50 | $1.50 | 256K |
| Mistral Medium 3 | Mistral | $0.40 | $2.00 | 128K |
| Mistral Medium 3.5 128B | Mistral | $1.50 | $7.50 | 256K |
| Mistral Small 4 | Mistral | $0.15 | $0.60 | 256K |
| Mistral Small 4 (Reasoning) | Mistral | $0.15 | $0.60 | 256K |
| Mixtral 8x22B Instruct v0.1 | Mistral | Free* | Free* | 64K |
| Moonshot v1 | Moonshot AI | Pricing unavailable | Pricing unavailable | 128K |
| Nemotron 3 Nano 30B | NVIDIA | Free* | Free* | 32K |
| Nemotron 3 Nano Omni 30B A3B | NVIDIA | Free* | Free* | 256K |
| Nemotron 3 Super 100B | NVIDIA | Free* | Free* | 1M |
| Nemotron 3 Super 120B A12B | NVIDIA | Free* | Free* | 256K |
| Nemotron 3 Ultra | NVIDIA | Free* | Free* | 1M |
| Nemotron Ultra 253B | NVIDIA | Free* | Free* | 32K |
| Nemotron-4 15B | NVIDIA | Free* | Free* | 32K |
| Nova Pro | Amazon | Pricing unavailable | Pricing unavailable | 128K |
| o1 | OpenAI | $15.00 | $60.00 | 200K |
| o1-preview | OpenAI | $15.00 | $60.00 | 200K |
| o1-pro | OpenAI | $150.00 | $600.00 | 200K |
| o3 | OpenAI | $2.00 | $8.00 | 200K |
| o3-mini | OpenAI | $1.10 | $4.40 | 200K |
| o3-pro | OpenAI | $20.00 | $80.00 | 200K |
| o4-mini | OpenAI | $1.10 | $4.40 | 200K |
| o4-mini (high) | OpenAI | Pricing unavailable | Pricing unavailable | 200K |
| Phi-4 | Microsoft | Free* | Free* | 16K |
| Qwen2.5 Coder 32B Instruct | Alibaba | Free* | Free* | 128K |
| Qwen2.5-1M | Alibaba | Free* | Free* | 1M |
| Qwen2.5-72B | Alibaba | Free* | Free* | 128K |
| Qwen2.5-VL-32B | Alibaba | Free* | Free* | 32K |
| Qwen3 235B 2507 | Alibaba | Free* | Free* | 128K |
| Qwen3 235B 2507 (Reasoning) | Alibaba | Free* | Free* | 128K |
| Qwen3.5 397B | Alibaba | $0.60 | $3.60 | 128K |
| Qwen3.5 397B (Reasoning) | Alibaba | $0.60 | $3.60 | 128K |
| Qwen3.5 Flash | Alibaba | $0.10 | $0.40 | 1M |
| Qwen3.5 Plus | Alibaba | $0.40 | $2.40 | 1M |
| Qwen3.5-122B-A10B | Alibaba | Free* | Free* | 262K |
| Qwen3.5-27B | Alibaba | Free* | Free* | 262K |
| Qwen3.5-35B-A3B | Alibaba | Free* | Free* | 262K |
| Qwen3.6 Plus | Alibaba | Pricing unavailable | Pricing unavailable | 1M |
| Qwen3.6-27B | Alibaba | Free* | Free* | 262K |
| Qwen3.7 Max | Alibaba | Pricing unavailable | Pricing unavailable | 1M |
| Sarvam 105B | Sarvam | Free* | Free* | 128K |
| Sarvam 30B | Sarvam | Free* | Free* | 64K |
| Seed 1.6 | ByteDance | Pricing unavailable | Pricing unavailable | 256K |
| Seed 1.6 Flash | ByteDance | Pricing unavailable | Pricing unavailable | 256K |
| Seed-2.0-Lite | ByteDance | Pricing unavailable | Pricing unavailable | 256K |
| Seed-2.0-Mini | ByteDance | Pricing unavailable | Pricing unavailable | 256K |
| Step 3.5 Flash | StepFun | $0.10 | $0.30 | 256K |
| Step 3.7 Flash | StepFun | $0.20 | $1.15 | 256K |
| Ternary Bonsai 1.7B | Prism ML | Free* | Free* | 32K |
| Ternary Bonsai 4B | Prism ML | Free* | Free* | 32K |
| Ternary Bonsai 8B | Prism ML | Free* | Free* | 64K |
| Trinity-Large-Preview | Arcee AI | $0.25 | $1.00 | 512K |
| Trinity-Large-Thinking | Arcee AI | $0.25 | $0.90 | 512K |
| Z-1 | Z | Pricing unavailable | Pricing unavailable | 128K |
| ZAYA1-74B-Preview | Zyphra | Free* | Free* | 256K |
| ZAYA1-8B | Zyphra | Free* | Free* | 131K |

*Open-weight models are free to download but require self-hosted infrastructure.*

## Tools

### LLM Selector Quiz
URL: https://benchlm.ai/tools/llm-selector
Answer 5 questions (use case, budget, context needs, open-source preference, speed) and get a personalized model recommendation based on benchmark data.

### Cost Calculator
URL: https://benchlm.ai/tools/cost-calculator
Estimate AI cost per blog post, web page, documentation article, PRD, or shipped feature. Converts real workloads into token estimates using words, context, and revision assumptions.
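
The words-to-tokens-to-cost conversion described for the Cost Calculator can be sketched roughly as follows. This is not BenchLM's actual formula: the 1.3 tokens-per-word ratio, the "each revision re-sends the context and regenerates the output" model, and the `Price` and `estimate_cost` names are all illustrative assumptions. Prices are assumed to be USD per million tokens, and the example uses the Claude Sonnet 4.6 row from the table ($3.00 input, $15.00 output).

```python
from dataclasses import dataclass

# Assumption: rough tokens-per-word ratio for English text; real tokenizers vary.
TOKENS_PER_WORD = 1.3


@dataclass
class Price:
    """Token prices, assumed to be USD per one million tokens."""
    input_per_m: float
    output_per_m: float


def estimate_cost(price: Price, output_words: int, context_words: int,
                  revisions: int = 1) -> float:
    """Estimate the USD cost of producing one piece of content.

    Hypothetical simplification: every revision is one request that
    re-sends the full context and regenerates the full output.
    """
    input_tokens = context_words * TOKENS_PER_WORD * revisions
    output_tokens = output_words * TOKENS_PER_WORD * revisions
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


# Claude Sonnet 4.6 prices as listed in the table above.
sonnet = Price(input_per_m=3.00, output_per_m=15.00)

# A 1,500-word blog post written against 4,000 words of context, revised 3 times.
cost = estimate_cost(sonnet, output_words=1500, context_words=4000, revisions=3)
print(f"Estimated cost: ${cost:.4f}")
```

Because output tokens are priced several times higher than input tokens for most rows in the table, the output length and revision count usually dominate the estimate, even when the context is the larger share of the words.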

### Alternative Finder
URL: https://benchlm.ai/tools/alternative-finder
Find the best replacement for ChatGPT, Claude, Google Gemini, or the OpenAI API using BenchLM benchmark scores, pricing, context window size, and open-weight filters.

## Alternative Landing Pages

### Best ChatGPT Alternatives in 2026
URL: https://benchlm.ai/alternatives/chatgpt
Benchmark-backed ChatGPT alternatives ranked by performance, price, context window, and open-weight availability.
Search intents: chatgpt alternatives, best chatgpt alternatives, best alternative to chatgpt

### Best Claude Alternatives in 2026
URL: https://benchlm.ai/alternatives/claude
Claude alternatives ranked by benchmark performance, coding strength, token cost, and long-context support.
Search intents: claude alternative, best claude alternative, cheaper alternative to claude

### Best Google Gemini Alternatives in 2026
URL: https://benchlm.ai/alternatives/google-gemini
Google Gemini alternatives ranked by benchmark quality, long-context support, pricing, and deployment model.
Search intents: google gemini alternative, best google gemini alternative, gemini alternative

### Best OpenAI API Alternatives in 2026
URL: https://benchlm.ai/alternatives/openai-api
OpenAI API alternatives ranked for teams that want lower cost, different model behavior, or non-OpenAI providers.
Search intents: openai api alternative, best openai api alternatives, cheaper openai api alternative

### Best GLM Alternatives in 2026
URL: https://benchlm.ai/alternatives/glm
GLM and Z.AI alternatives ranked by benchmark quality, pricing, context window, and deployment model.
Search intents: glm alternative, best glm alternative, z.ai alternative

### Best Kimi Alternatives in 2026
URL: https://benchlm.ai/alternatives/kimi
Kimi alternatives ranked by benchmark performance, cost, long-context support, and open-weight availability.
Search intents: kimi alternative, best kimi alternative, moonshot kimi alternative

### Best Free ChatGPT Alternatives in 2026
URL: https://benchlm.ai/alternatives/chatgpt/free
Free and self-hostable ChatGPT alternatives ranked by benchmark quality, open-weight availability, and context window.
Search intents: free chatgpt alternative, best free chatgpt alternative, chatgpt alternative free

### Best Open Source ChatGPT Alternatives in 2026
URL: https://benchlm.ai/alternatives/chatgpt/open-source
Open-source and open-weight ChatGPT alternatives ranked by benchmark performance, coding strength, and deployment flexibility.
Search intents: open source chatgpt alternative, best open source chatgpt alternative, open-weight chatgpt alternative

### Best Claude Alternatives for Coding in 2026
URL: https://benchlm.ai/alternatives/claude/coding
Coding-focused Claude alternatives ranked by BenchLM coding, agentic, and reasoning scores.
Search intents: claude alternative for coding, best claude alternative for coding, claude code alternative

## Blog Posts

### Claude Fable 5 and Mythos 5: The Future of AI Is Gated Intelligence
URL: https://benchlm.ai/blog/posts/claude-fable-5-mythos-5-future-of-ai
Anthropic's Claude Fable 5 brings Mythos-class capability to public users, while Claude Mythos 5 remains trusted-access. The benchmark story is strong, but the real shift is capability-gated deployment.

### Perceptron Mk1 and Frontier Video Models: The Complete Guide to Video Understanding AI
URL: https://benchlm.ai/blog/posts/perceptron-mk1-frontier-video-models
A complete guide to Perceptron Mk1, frontier video understanding models, video AI benchmarks, and where video-language models are headed next.

### Best LLM for Math 2026: AIME, HMMT & MATH-500 Rankings
URL: https://benchlm.ai/blog/posts/best-llm-math
Best LLM for math 2026: GPT-5.4 leads AIME 2025, MATH-500, and BRUMO. Compare Claude, Gemini, DeepSeek-R1, GPT-5.5, and value picks by use case.

### ProgramBench Benchmark Explained: Can LLMs Rebuild Programs From Binaries?
URL: https://benchlm.ai/blog/posts/programbench-cleanroom-coding-benchmark
ProgramBench is a new LLM coding benchmark where agents rebuild full programs from a compiled binary and documentation. See scores, how it differs from SWE-bench, and why all public models are 0% resolved.

### ARC-AGI-2 Explained: The Hardest Public Reasoning Benchmark
URL: https://benchlm.ai/blog/posts/arc-agi-2-explained
ARC-AGI-2 measures fluid intelligence through visual grid puzzles that can't be solved by memorization. Here's how it works, what scores mean, and where current frontier models stand.

### DeepSeek V4 Pro vs Claude Opus 4.7 vs GPT-5.5: The Frontier in April 2026
URL: https://benchlm.ai/blog/posts/deepseek-v4-vs-claude-opus-4-7-vs-gpt-5-5
Three frontier flagships launched in eight days. DeepSeek V4 Pro undercuts GPT-5.5 by ~9x on output price under MIT license. Here's how they compare on benchmarks, cost, and real use.

### LLM Context Window Comparison 2026: Advertised vs Effective, Input vs Output
URL: https://benchlm.ai/blog/posts/context-window-comparison
Four frontier LLMs now advertise 1M+ tokens. DeepSeek V4 Pro's 384K output changes generation workflows. Gemini leads effective-context evals. Here's the real comparison.

### OpenAI API Pricing: GPT-5.4, GPT-5.2, and GPT-5.1 (April 2026)
URL: https://benchlm.ai/blog/posts/openai-api-pricing
Current OpenAI API pricing from official docs: GPT-5.4, GPT-5.2, GPT-5.1, cached input rates, Batch API discounts, and the pricing details that actually matter.

### Gemini API Pricing: Current Flash, Flash-Lite, and Pro Rates (April 2026)
URL: https://benchlm.ai/blog/posts/gemini-api-pricing
Current Gemini API pricing from Google's official docs: 3.1 Pro Preview, 3.1 Flash-Lite Preview, 3 Flash Preview, 2.5 Flash, 2.5 Pro, plus Batch and Flex pricing.

### DeepSeek API Pricing: deepseek-chat vs deepseek-reasoner (April 2026)
URL: https://benchlm.ai/blog/posts/deepseek-api-pricing
Current DeepSeek API pricing from the official docs: deepseek-chat and deepseek-reasoner, cache-hit vs cache-miss pricing, output pricing, and the current V3.2 endpoint mapping.

### Claude API Pricing: Haiku 4.5, Sonnet 4.6, and Opus 4.7 (April 2026)
URL: https://benchlm.ai/blog/posts/claude-api-pricing
Current Anthropic Claude API pricing from official model pages and the Claude Opus 4.7 launch announcement, including prompt caching, batch discounts, and current long-context notes.

### GPT-5 vs Gemini in 2026: Full Benchmark Breakdown
URL: https://benchlm.ai/blog/posts/gpt5-vs-gemini-2026
GPT-5.4 and Gemini 3.1 Pro are one point apart on BenchLM's leaderboard. We compare coding, math, reasoning, multimodal, agentic, pricing, and context window — with Gemini Deep Think as the reasoning wildcard.

### Mythos Preview is the first frontier model Anthropic decided not to ship. The benchmarks show why.
URL: https://benchlm.ai/blog/posts/mythos-preview-anthropic-not-shipping
Claude Mythos Preview beats Opus 4.6 by double digits on every coding benchmark Anthropic released. Then they shelved it. Here's what the numbers actually show, and why the shipping decision matters more than the launch.

### Best LLM for Writing in 2026: AI Models Ranked for Content Creation
URL: https://benchlm.ai/blog/posts/best-llm-writing
Which AI model is best for writing in 2026? We rank Claude, GPT, Gemini, and open source LLMs by creative writing Arena scores, instruction-following benchmarks, and real-world content quality — with pricing for every budget.

### Best LLM for RAG in 2026: Top Models Ranked for Retrieval-Augmented Generation
URL: https://benchlm.ai/blog/posts/best-llm-rag
We ranked LLMs for RAG by IFEval, knowledge benchmarks, context window, long-context retrieval accuracy, and pricing. Here are the top models for retrieval-augmented generation in 2026.

### How to Choose an LLM in 2026: Which AI Model Is Best for Your Use Case
URL: https://benchlm.ai/blog/posts/which-llm-to-use
A step-by-step framework for choosing the right LLM in 2026. We compare Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4, DeepSeek, and open source models by use case, budget, and deployment needs — backed by benchmark data.

### Best Open Source LLM in 2026: Rankings, Benchmarks, and the Models Worth Running
URL: https://benchlm.ai/blog/posts/best-open-source-llm
Which open source LLM is best in 2026? We rank the top open weight models by real benchmark data — DeepSeek V4, Kimi K2.6, GLM-5, Qwen3.5, Gemma 4, Llama — and compare them to proprietary leaders.

### ChatGPT vs Claude vs Gemini in 2026: The Definitive Comparison
URL: https://benchlm.ai/blog/posts/chatgpt-vs-claude-vs-gemini-2026
The best AI model depends on your use case. We compare Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.4 across coding, writing, reasoning, multimodal, price, and speed using current benchmark data.

### How LLM Token Pricing Works: A Complete Guide to API Costs in 2026
URL: https://benchlm.ai/blog/posts/llm-token-pricing
Learn how LLM API pricing works — from tokens, input/output costs, and reasoning tokens to vision, embedding, and fine-tuning pricing. Includes real cost examples, free tiers, and 6 strategies to cut your AI spend.

### React Native Evals: The Mobile App Coding Benchmark Explained
URL: https://benchlm.ai/blog/posts/react-native-evals-mobile-benchmark
React Native Evals measures whether AI coding models can complete real React Native implementation tasks across navigation, animation, and async state. Here's what it tests, why it matters, and how it differs from SWE-bench and LiveCodeBench.

### Best Chinese LLMs in 2026: DeepSeek V4, Kimi K2.6, GLM-5, Qwen, and Every Model Ranked
URL: https://benchlm.ai/blog/posts/best-chinese-llm
Which Chinese LLM is best in 2026? We rank DeepSeek V4, Kimi K2.6, GLM-5, GLM-5.1, Qwen3.5, MiMo, and more using current BenchLM data across coding, math, reasoning, and agentic work.

### State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed
URL: https://benchlm.ai/blog/posts/state-of-llm-benchmarks-2026
State of LLM benchmarks in 2026: current BenchLM rankings, category leaders, benchmark trends, open vs closed performance, and what still matters after the latest scoring changes.

### Best Budget LLMs in 2026: GPT-5.4 Mini, Nano, MiniMax M2.7, and Every Cheap Model Ranked
URL: https://benchlm.ai/blog/posts/best-budget-llms-2026
Which budget LLM should you use in 2026? We rank GPT-5.4 mini, GPT-5.4 nano, MiniMax M2.7, Claude Haiku 4.5, Gemini Flash, DeepSeek, and more by benchmarks and price.

### Are AI Benchmarks Reliable? The Data Contamination Problem
URL: https://benchlm.ai/blog/posts/benchmark-reliability
AI benchmarks are useful but flawed. Data contamination inflates scores when models train on test questions. Here's how it works, which benchmarks resist it, and how BenchLM accounts for reliability.

### What Do LLM Benchmarks Actually Measure?
URL: https://benchlm.ai/blog/posts/what-benchmarks-measure
LLM benchmarks don't measure intelligence. They measure specific, narrow abilities under controlled conditions. Here's what each benchmark type actually tests — and what it misses.

### Terminal-Bench 2.0 Explained: How We Measure Agentic Coding
URL: https://benchlm.ai/blog/posts/terminal-bench-2-agentic-benchmark
Terminal-Bench 2.0 measures whether AI models can work through real terminal-based coding and ops workflows instead of just answering in chat.

### OSWorld-Verified Explained: How We Measure Computer-Use Models
URL: https://benchlm.ai/blog/posts/osworld-verified-computer-use-benchmark
OSWorld-Verified measures whether AI models can operate software interfaces and complete multi-step computer tasks with reliability.

### LLM API Pricing Comparison 2026: Every Major Model, Ranked by Cost
URL: https://benchlm.ai/blog/posts/llm-pricing-2026
Full LLM API pricing comparison for 2026 — input/output token costs for GPT-5, Claude, Gemini, DeepSeek, Grok, and more. Find the cheapest model for your use case.

### Claude Opus 4.6 vs GPT-5.4: Full Benchmark Breakdown (2026)
URL: https://benchlm.ai/blog/posts/claude-opus-vs-gpt-5
Claude Opus 4.6 vs GPT-5.4 head-to-head: current benchmark scores, pricing, and where each model actually wins. GPT-5.4 now leads overall, while Claude stays extremely close and still has real workflow-specific advantages.

### BrowseComp Explained: How We Measure Web Research Agents
URL: https://benchlm.ai/blog/posts/browsecomp-browsing-benchmark
BrowseComp evaluates whether AI models can search the web, gather evidence, and answer research questions instead of relying only on latent knowledge.

### Best LLM for Coding in 2026: Ranked by SWE-bench, LCB, and Real-World Performance
URL: https://benchlm.ai/blog/posts/best-llm-coding
Which AI model is best for coding in 2026? We rank major LLMs by BenchLM's verified coding score — SWE-Rebench, SWE-bench Pro, LiveCodeBench, and SWE-bench Verified — with pricing and task-specific picks.

### What Is HumanEval? The Coding Benchmark Explained
URL: https://benchlm.ai/blog/posts/what-is-humaneval-coding-benchmark
HumanEval tests whether AI models can generate correct Python functions from docstrings. Here's what it measures, why it's nearly saturated, and which benchmarks matter more in 2026.

### SWE-bench Explained: How We Measure Real-World Coding
URL: https://benchlm.ai/blog/posts/swe-bench-explained
SWE-bench Verified tests AI models on resolving real GitHub issues from Django, Flask, and scikit-learn. Here's how it works, why it matters, and which models score highest.

### MMLU vs MMLU-Pro: What Changed and Why It Matters
URL: https://benchlm.ai/blog/posts/mmlu-vs-mmlu-pro
MMLU and MMLU-Pro are the most cited knowledge benchmarks in AI. Here's what each measures, why MMLU is saturated, and why MMLU-Pro is the better discriminator in 2026.

### LiveCodeBench: Why Static Coding Benchmarks Aren't Enough
URL: https://benchlm.ai/blog/posts/livecodebench-contamination-free
LiveCodeBench uses fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to prevent data contamination. Here's why it matters and which models lead.

### HLE (Humanity's Last Exam): The Hardest Benchmark
URL: https://benchlm.ai/blog/posts/hle-humanitys-last-exam
Humanity's Last Exam is crowdsourced from thousands of domain experts and designed to probe the absolute frontier of AI. Top models score only 10-46%. Here's why HLE matters.

### GPQA Diamond: The PhD-Level Science Benchmark
URL: https://benchlm.ai/blog/posts/gpqa-diamond-science-benchmark
GPQA tests AI models with graduate-level questions in biology, physics, and chemistry that are 'Google-proof' — even skilled non-experts with internet access can't answer them. Here's how it works.

### Claude Opus 4.6 vs GPT-5.4: Where Each Model Actually Wins
URL: https://benchlm.ai/blog/posts/claude-opus-4-6-vs-gpt-5-4
A direct benchmark comparison of Claude Opus 4.6 and GPT-5.4 on current BenchLM data. GPT-5.4 now leads overall, while Claude remains highly competitive on coding and still wins on some workflow-specific factors.

### What Is Chatbot Arena Elo? How Human Preference Drives Rankings
URL: https://benchlm.ai/blog/posts/chatbot-arena-elo-explained
Chatbot Arena ranks AI models using Elo ratings from blind human preference votes. Here's how the system works, what the scores mean, and how Elo compares to standardized benchmarks.

### AIME & HMMT: Can AI Models Do Competition Math?
URL: https://benchlm.ai/blog/posts/aime-hmmt-competition-math
AIME and HMMT are high school math olympiad competitions now used to benchmark AI. Frontier models score 95-99% — competition math is effectively solved. Here's what that means.

### How to Interpret LLM Benchmark Results: A Practical Guide
URL: https://benchlm.ai/blog/posts/interpreting-llm-benchmark-results
How to read LLM benchmark scores correctly — what differences are meaningful, what to ignore, common misinterpretations, and how to translate benchmark data into model selection decisions.

### The Complete Guide to LLM Benchmarking: Everything You Need to Know
URL: https://benchlm.ai/blog/posts/complete-guide-llm-benchmarking
Everything you need to know about LLM benchmarking — what benchmarks measure, how to choose the right ones, common pitfalls, and how to interpret results for real-world model selection.

### Building Your Own LLM Benchmark: A Practical Guide
URL: https://benchlm.ai/blog/posts/building-custom-llm-benchmark
How to create a custom LLM benchmark for your specific use case — from defining tasks and building datasets to scoring models and avoiding common pitfalls.
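
As a rough illustration of the "scoring models" step mentioned in the custom-benchmark entry above, the sketch below computes exact-match accuracy over a tiny dataset. It is not taken from the linked post or from BenchLM's scoring pipeline: the `normalize` and `exact_match_accuracy` helpers and the three-item dataset are hypothetical, and real benchmarks often need task-specific graders instead of string equality.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical reference answers and model outputs, for illustration only.
references = ["Paris", "4", "H2O"]
predictions = ["paris", "5", " h2o "]

accuracy = exact_match_accuracy(predictions, references)
print(f"accuracy = {accuracy:.2f}")
```

Normalizing before comparison keeps the score from penalizing casing or stray whitespace, but it is a design choice: for code or math tasks, stricter or execution-based checks are usually more appropriate.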