# BenchLM AI

> BenchLM AI is a comprehensive AI benchmarking platform that evaluates and compares 121 large language models across 32 benchmarks in 8 categories: Agentic, Coding, Multimodal & Grounded, Reasoning, Knowledge, Instruction Following, Multilingual, and Mathematics. The platform provides real-time leaderboard data, detailed model profiles, and educational content about LLM evaluation methodologies.

## Main Pages

- [Homepage](https://benchlm.ai/): AI model leaderboard with all benchmark scores, filtering, and sorting
- [Knowledge Benchmarks](https://benchlm.ai/knowledge): MMLU, GPQA, SuperGPQA, OpenBookQA evaluations
- [Coding Benchmarks](https://benchlm.ai/coding): HumanEval, SWE-bench Pro, SWE-bench Verified, LiveCodeBench evaluations
- [Math Benchmarks](https://benchlm.ai/math): AIME 2023-2025, HMMT 2023-2025, BRUMO 2025, MATH-500 evaluations
- [Reasoning Benchmarks](https://benchlm.ai/reasoning): SimpleQA, MuSR, BBH, LongBench v2, MRCRv2 evaluations
- [Agentic Benchmarks](https://benchlm.ai/agentic): Terminal-Bench 2.0, BrowseComp, OSWorld-Verified evaluations
- [Multimodal & Grounded](https://benchlm.ai/multimodal-grounded): MMMU-Pro and OfficeQA Pro evaluations
- [Instruction Following](https://benchlm.ai/instruction-following): IFEval benchmark scores
- [Multilingual Benchmarks](https://benchlm.ai/multilingual): MGSM, MMLU-ProX evaluations
- [Models Directory](https://benchlm.ai/models): Browse all 121 AI models with benchmark scores
- [Blog](https://benchlm.ai/blog): Articles on LLM benchmarking methodology and analysis

## Top Models Snapshot

Current top models by overall score in BenchLM.ai's March 2026 data.

- **GPT-5.4 Pro** — OpenAI. Score: 91. Context: 1.05M tokens. Reasoning. Proprietary. Price: $30 / $180 per million input/output tokens.
- **GPT-5.4** — OpenAI. Score: 90. Context: 1.05M tokens. Reasoning. Proprietary. Price: $2.50 / $15 per million input/output tokens.
- **GPT-5.2 Pro** — OpenAI. Score: 90. Context: 400K tokens. Reasoning. Proprietary. Price: $25 / $150 per million input/output tokens.
- **GPT-5.3 Codex** — OpenAI. Score: 89. Context: 400K tokens. Reasoning. Proprietary. Price: $2.50 / $10 per million input/output tokens.
- **GPT-5.2** — OpenAI. Score: 88. Context: 400K tokens. Reasoning. Proprietary. Price: $2 / $8 per million input/output tokens.
- **GPT-5.3 Instant** — OpenAI. Score: 87. Context: 128K tokens. Reasoning. Proprietary. Price: $1.75 / $14 per million input/output tokens.
- **GPT-5.3-Codex-Spark** — OpenAI. Score: 87. Context: 256K tokens. Reasoning. Proprietary. Price: $2 / $8 per million input/output tokens.
- **Claude Opus 4.6** — Anthropic. Score: 85. Context: 1M tokens. Non-Reasoning. Proprietary. Price: $15 / $75 per million input/output tokens.
- **GPT-5.2 Instant** — OpenAI. Score: 85. Context: 128K tokens. Reasoning. Proprietary. Price: $1.50 / $6 per million input/output tokens.
- **GPT-5.2-Codex** — OpenAI. Score: 85. Context: 400K tokens. Reasoning. Proprietary. Price: $2 / $8 per million input/output tokens.

See the full current leaderboard at https://benchlm.ai/best/overall

## Benchmark Definitions

Detailed definitions for all 32 benchmarks tracked by BenchLM.ai.

### Agentic Benchmarks

**Terminal-Bench 2.0** — Tests whether AI models can complete coding and systems tasks in a terminal environment: writing and debugging scripts, managing files, running builds, using command-line tools. Evaluates multi-step agentic execution rather than single-shot code generation. A high score predicts reliable performance in coding agent loops and DevOps automation workflows.
URL: https://benchlm.ai/benchmarks/terminalBench2

**BrowseComp** — Measures web research ability: whether a model can navigate multiple web sources, synthesize evidence, and answer questions that require gathering information across pages rather than from a single source. Tests real browsing and research workflows. High scores indicate the model can function as a reliable research agent.
URL: https://benchlm.ai/benchmarks/browseComp

**OSWorld-Verified** — Evaluates computer-use reliability: whether a model can operate real software interfaces, understand screen state, take sequential actions, maintain state across many steps, and complete workflows without destructive errors. Tests multi-app automation, document tasks, QA testing, and operations workflows. One of the most informative agentic benchmarks in 2026.
URL: https://benchlm.ai/benchmarks/osWorldVerified

### Coding Benchmarks

**HumanEval** — 164-problem Python function generation benchmark from OpenAI (2021). Models are given a function signature and docstring and must generate the function body. Scored by test execution (pass@1). Saturated in 2026 — six frontier models score 91%+ and it no longer differentiates them. Still shown for reference but excluded from BenchLM.ai's scoring formula.
URL: https://benchlm.ai/benchmarks/humaneval

**SWE-bench Verified** — Real-world GitHub bug-fixing benchmark. Models receive a repository and an issue description, and must produce a patch that makes failing tests pass. The "Verified" subset has been manually checked for correctness. Scores range 70-85% among frontier models — still discriminative.
URL: https://benchlm.ai/benchmarks/sweVerified

**SWE-bench Pro** — Harder version of SWE-bench using more complex, longer-horizon software engineering tasks. The primary coding signal in BenchLM.ai's scoring formula for 2026. Larger spread between models than SWE-bench Verified.
URL: https://benchlm.ai/benchmarks/swePro

**LiveCodeBench** — Competitive programming benchmark that continuously sources new problems to prevent data contamination. Problems are pulled from Codeforces, LeetCode, and AtCoder after model training cutoffs, making memorization ineffective. The spread is 55-85% — wide enough to clearly differentiate models. Considered the most trustworthy coding signal in 2026.
URL: https://benchlm.ai/benchmarks/liveCodeBench

### Knowledge Benchmarks

**MMLU** — Massive Multitask Language Understanding. 14,042 multiple-choice questions across 57 subjects (STEM, humanities, social science, professional domains) at undergraduate level. Saturated in 2026 — frontier models score 97-99%. Still tracked for mid-tier and open-weight model comparison but excluded from BenchLM.ai's scoring formula.
URL: https://benchlm.ai/benchmarks/mmlu

**MMLU-Pro** — Enhanced version of MMLU with harder, more reasoning-intensive questions and 10 answer choices instead of 4. Less saturated than MMLU — useful for distinguishing frontier models. Scores range 87-92% among top models.
URL: https://benchlm.ai/benchmarks/mmluPro

**GPQA Diamond** — Graduate-Level Google-Proof Q&A. 198 PhD-level multiple-choice science questions in biology, chemistry, and physics, written by domain experts to be resistant to Google search. Very hard — frontier models score 87-97%. Scores below 70% indicate a model struggling with expert-level science.
URL: https://benchlm.ai/benchmarks/gpqa

**SuperGPQA** — Broader version of GPQA covering 285 research-level science domains. Harder and wider than GPQA. Scores range 55-95% — meaningful spread at the frontier.
URL: https://benchlm.ai/benchmarks/superGpqa

**OpenBookQA** — Science QA benchmark testing elementary science knowledge and commonsense reasoning. Nearly saturated — displayed for reference but excluded from scoring.
URL: https://benchlm.ai/benchmarks/openBookQa

**HLE (Humanity's Last Exam)** — 2,500 expert-level questions across 100+ academic domains, written by subject matter experts specifically to stump frontier AI models. Scores range 10-50% — the largest spread of any knowledge benchmark in 2026. GPT-5.4 Pro currently leads at 50%, while many mid-tier models remain in the 20-30% range. The most informative frontier knowledge benchmark available.
URL: https://benchlm.ai/benchmarks/hle

**FrontierScience** — Research-level science benchmark testing whether models can answer questions that require understanding of recent scientific literature and methods beyond what's in textbooks.
URL: https://benchlm.ai/benchmarks/frontierScience

### Reasoning Benchmarks

**SimpleQA** — Short-form factual accuracy benchmark. Models answer brief factual questions; answers are judged for correctness without partial credit. Tests precision of factual recall rather than fluency. High scores indicate a model that doesn't hallucinate on simple factual queries.
URL: https://benchlm.ai/benchmarks/simpleQa

**MuSR** — Multistep Soft Reasoning. Tests multi-step reasoning over long paragraphs of context, requiring models to chain multiple inferences across a document before arriving at an answer. Reasoning models significantly outperform standard models here.
URL: https://benchlm.ai/benchmarks/musr

**BBH (BIG-Bench Hard)** — 23 especially challenging tasks selected from the 204-task BIG-Bench suite. Covers algorithmic reasoning, causal reasoning, and formal logic. Historical baseline — still included for reference, but frontier models score 93-96%.
URL: https://benchlm.ai/benchmarks/bbh

**LongBench v2** — Long-context understanding benchmark testing whether models can accurately answer questions that require reading and reasoning over documents of 10K-100K+ tokens. Tests whether the advertised context window is actually usable.
URL: https://benchlm.ai/benchmarks/longBenchV2

**MRCRv2** — Multi-hop Reasoning and Context Retrieval benchmark v2. Tests whether models can retrieve and combine information from multiple locations within a long context to answer questions that require cross-referencing.
URL: https://benchlm.ai/benchmarks/mrcrv2

### Math Benchmarks

**AIME 2025** — American Invitational Mathematics Examination 2025. Competition-level math problems requiring creative problem-solving and formal mathematics. Frontier models now score 97-99% — saturated at the top.
URL: https://benchlm.ai/benchmarks/aime2025

**HMMT 2025** — Harvard-MIT Mathematics Tournament 2025. Competition math at a similar difficulty to AIME. Frontier models score 95-98%. Saturated among top models.
URL: https://benchlm.ai/benchmarks/hmmt2025

**BRUMO 2025** — Brown University Math Olympiad 2025. Slightly harder than AIME/HMMT for AI models, providing more spread among frontier models.
URL: https://benchlm.ai/benchmarks/brumo2025

**MATH-500** — 500 problems from the MATH dataset covering 5 difficulty levels and 7 math subjects. Broader spread than AIME — useful for comparing mid-tier models. Frontier models score 97-99%.
URL: https://benchlm.ai/benchmarks/math500

### Instruction Following Benchmarks

**IFEval** — Instruction Following Evaluation. Tests whether models precisely follow verifiable formatting and content constraints: word count limits, specific keywords required or forbidden, casing rules, output length constraints, JSON formatting requirements. Scored at both prompt level and instruction level. Scores range 70-95% — meaningful spread.
URL: https://benchlm.ai/benchmarks/ifeval

### Multilingual Benchmarks

**MGSM** — Multilingual Grade School Math. Tests mathematical reasoning across 10 languages including Chinese, German, French, Japanese, Spanish, Russian, and others. Reveals how well models' math capabilities transfer to non-English languages.
URL: https://benchlm.ai/benchmarks/mgsm

**MMLU-ProX** — Multilingual version of MMLU-Pro. Professional-level knowledge assessment across multiple non-English languages. Tests whether models have internalized expert knowledge across languages, not just English.
URL: https://benchlm.ai/benchmarks/mmluProX

### Multimodal & Grounded Benchmarks

**MMMU-Pro** — Massive Multidiscipline Multimodal Understanding Pro. Tests whether models can answer questions that require reasoning over images, charts, diagrams, and text together.
Expert-level visual reasoning across 30+ disciplines.
URL: https://benchlm.ai/benchmarks/mmmuPro

**OfficeQA Pro** — Grounded benchmark for enterprise document tasks: reading spreadsheets, interpreting PDFs, extracting data from office documents, and answering questions about visual business artifacts. Tests models for enterprise copilot use cases.
URL: https://benchlm.ai/benchmarks/officeQaPro

## Scoring Methodology

BenchLM.ai's overall score is a weighted average across 8 benchmark categories:

| Category | Weight | Primary Benchmarks |
|---|---|---|
| Agentic | 22% | Terminal-Bench 2.0, BrowseComp, OSWorld-Verified |
| Coding | 20% | SWE-bench Pro, LiveCodeBench, SWE-bench Verified |
| Reasoning | 17% | SimpleQA, MuSR, LongBench v2, MRCRv2, BBH |
| Knowledge | 12% | GPQA, SuperGPQA, MMLU-Pro, HLE, FrontierScience |
| Multimodal & Grounded | 12% | MMMU-Pro, OfficeQA Pro |
| Instruction Following | 5% | IFEval |
| Multilingual | 7% | MGSM, MMLU-ProX |
| Math | 5% | AIME 2025, HMMT 2025, BRUMO 2025, MATH-500 |

**Saturation policy:** Benchmarks where frontier models cluster at 95-99% (MMLU, HumanEval, AIME 2023/2024, HMMT 2023/2024) are excluded from the scoring formula because score differences are within noise range. They are still displayed in model profiles for reference.

**Within-category weighting:** More discriminative, harder, and less contaminated benchmarks carry more weight within each category. For coding, SWE-bench Pro and LiveCodeBench outweigh HumanEval. For knowledge, HLE and FrontierScience outweigh MMLU.

**Normalization:** All scores are on a 0-100 scale. Benchmark scores that are reported differently by original authors are converted to this scale.

**Data source:** Benchmark scores are collected from official model announcements, academic papers, and the OpenBench open-source evaluation infrastructure.
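The weighted average described above can be sketched in a few lines. The category weights are the published BenchLM.ai figures; the example category scores are illustrative placeholders, not real leaderboard data:

```python
# Overall score = weighted average of the 8 category scores (0-100 scale),
# using the published BenchLM.ai category weights.

WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17, "knowledge": 0.12,
    "multimodal_grounded": 0.12, "multilingual": 0.07,
    "instruction_following": 0.05, "math": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights cover 100%

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average over the 8 categories, rounded to one decimal."""
    return round(sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS), 1)

# Illustrative only: a model scoring 90 in every category scores 90 overall.
example = {c: 90.0 for c in WEIGHTS}
print(overall_score(example))  # 90.0
```

Because the weights sum to 1.0, no renormalization step is needed; saturated benchmarks are simply absent from each category's input score.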
Methodology details: https://benchlm.ai/#methodology

## Model Profile Pages

Individual benchmark analysis pages for each of the 121 tracked AI models:

- [GPT-5.4 Pro](https://benchlm.ai/models/gpt-5-4-pro): OpenAI, Score: 91, Proprietary, 1.05M context
- [GPT-5.4](https://benchlm.ai/models/gpt-5-4): OpenAI, Score: 90, Proprietary, 1.05M context
- [GPT-5.2 Pro](https://benchlm.ai/models/gpt-5-2-pro): OpenAI, Score: 90, Proprietary, 400K context
- [GPT-5.3 Codex](https://benchlm.ai/models/gpt-5-3-codex): OpenAI, Score: 89, Proprietary, 400K context
- [GPT-5.2](https://benchlm.ai/models/gpt-5-2): OpenAI, Score: 88, Proprietary, 400K context
- [GPT-5.3 Instant](https://benchlm.ai/models/gpt-5-3-instant): OpenAI, Score: 87, Proprietary, 128K context
- [GPT-5.3-Codex-Spark](https://benchlm.ai/models/gpt-5-3-codex-spark): OpenAI, Score: 87, Proprietary, 256K context
- [Claude Opus 4.6](https://benchlm.ai/models/claude-opus-4-6): Anthropic, Score: 85, Proprietary, 1M context
- [GPT-5.2 Instant](https://benchlm.ai/models/gpt-5-2-instant): OpenAI, Score: 85, Proprietary, 128K context
- [GPT-5.2-Codex](https://benchlm.ai/models/gpt-5-2-codex): OpenAI, Score: 85, Proprietary, 400K context
- [Full models directory](https://benchlm.ai/models): All 121 models with scores and rankings

## Comparison Pages

- [Model vs Model comparisons](https://benchlm.ai/compare): Side-by-side benchmark comparison for any two models
- Example: [GPT-5.4 Pro vs GPT-5.4](https://benchlm.ai/compare/gpt-5-4-vs-gpt-5-4-pro)
- 7,260 total comparison pages available

## Benchmark Detail Pages

- [MMLU](https://benchlm.ai/benchmarks/mmlu): Massive Multitask Language Understanding (saturated, reference only)
- [MMLU-Pro](https://benchlm.ai/benchmarks/mmluPro): Enhanced MMLU with harder questions (10 choices)
- [GPQA](https://benchlm.ai/benchmarks/gpqa): Graduate-Level Google-Proof Q&A (198 PhD-level questions)
- [SuperGPQA](https://benchlm.ai/benchmarks/superGpqa): Research-level science across 285 domains
- [OpenBookQA](https://benchlm.ai/benchmarks/openBookQa): Elementary science knowledge benchmark (reference only)
- [HLE](https://benchlm.ai/benchmarks/hle): Humanity's Last Exam — 2,500 expert-level questions
- [FrontierScience](https://benchlm.ai/benchmarks/frontierScience): Research-level science benchmark
- [HumanEval](https://benchlm.ai/benchmarks/humaneval): Python function generation (saturated, reference only)
- [SWE-bench Verified](https://benchlm.ai/benchmarks/sweVerified): Real GitHub bug fixing benchmark
- [SWE-bench Pro](https://benchlm.ai/benchmarks/swePro): Harder, longer-horizon software engineering tasks
- [LiveCodeBench](https://benchlm.ai/benchmarks/liveCodeBench): Contamination-resistant competitive programming
- [AIME 2023-2025](https://benchlm.ai/benchmarks/aime2025): American Invitational Mathematics Examination
- [HMMT 2023-2025](https://benchlm.ai/benchmarks/hmmt2025): Harvard-MIT Mathematics Tournament
- [BRUMO 2025](https://benchlm.ai/benchmarks/brumo2025): Brown University Math Olympiad
- [MATH-500](https://benchlm.ai/benchmarks/math500): 500-problem math benchmark across difficulty levels
- [SimpleQA](https://benchlm.ai/benchmarks/simpleQa): Short-form factual accuracy
- [MuSR](https://benchlm.ai/benchmarks/musr): Multistep soft reasoning over long context
- [BBH](https://benchlm.ai/benchmarks/bbh): BIG-Bench Hard — 23 challenging reasoning tasks
- [LongBench v2](https://benchlm.ai/benchmarks/longBenchV2): Long-context reasoning benchmark
- [MRCRv2](https://benchlm.ai/benchmarks/mrcrv2): Multi-hop long-context retrieval benchmark
- [IFEval](https://benchlm.ai/benchmarks/ifeval): Instruction Following Evaluation (verifiable constraints)
- [MGSM](https://benchlm.ai/benchmarks/mgsm): Multilingual math reasoning across 10 languages
- [MMLU-ProX](https://benchlm.ai/benchmarks/mmluProX): Multilingual professional knowledge benchmark
- [Terminal-Bench 2.0](https://benchlm.ai/benchmarks/terminalBench2): Terminal-based agentic evaluation
- [BrowseComp](https://benchlm.ai/benchmarks/browseComp): Web research and evidence gathering
- [OSWorld-Verified](https://benchlm.ai/benchmarks/osWorldVerified): Computer-use workflow benchmark
- [MMMU-Pro](https://benchlm.ai/benchmarks/mmmuPro): Multimodal reasoning across images and charts
- [OfficeQA Pro](https://benchlm.ai/benchmarks/officeQaPro): Enterprise document and spreadsheet reasoning

## Best LLM Rankings

- [Best AI Models Overall](https://benchlm.ai/best/overall): Weighted score across all 8 categories
- [Best LLMs for Coding](https://benchlm.ai/best/coding): SWE-bench Pro, LiveCodeBench leaders
- [Best LLMs for Math](https://benchlm.ai/best/math): Competition math benchmark rankings
- [Best LLMs for Knowledge](https://benchlm.ai/best/knowledge): HLE, GPQA, MMLU-Pro leaders
- [Best LLMs for Reasoning](https://benchlm.ai/best/reasoning): SimpleQA, MuSR, LongBench leaders
- [Best Agentic AI Models](https://benchlm.ai/best/agentic): Terminal-Bench, BrowseComp, OSWorld leaders
- [Best Multimodal & Grounded AI Models](https://benchlm.ai/best/multimodal-grounded): MMMU-Pro leaders
- [Best LLMs for Instruction Following](https://benchlm.ai/best/instruction-following): IFEval leaders
- [Best Multilingual LLMs](https://benchlm.ai/best/multilingual): MGSM, MMLU-ProX leaders
- [Best Open Source LLMs](https://benchlm.ai/best/open-source): Top open-weight models
- [Best Proprietary LLMs](https://benchlm.ai/best/proprietary): Top closed-source models
- [Best Reasoning Models](https://benchlm.ai/best/reasoning-models): Chain-of-thought models only
- [Best Non-Reasoning LLMs](https://benchlm.ai/best/non-reasoning-models): Standard (no CoT) models
- [Best Large Context Window LLMs](https://benchlm.ai/best/large-context-window): 200K+ token models
- [Best Chinese AI Models](https://benchlm.ai/best/chinese-models): DeepSeek, Qwen, GLM, Kimi leaders
- [Best OpenAI Models](https://benchlm.ai/best/openai-models)
- [Best Anthropic Models](https://benchlm.ai/best/anthropic-models)
- [Best Google Models](https://benchlm.ai/best/google-models)
- [Best Meta Models](https://benchlm.ai/best/meta-models)
- [Best DeepSeek Models](https://benchlm.ai/best/deepseek-models)
- [Best Mistral Models](https://benchlm.ai/best/mistral-models)
- [Best xAI Grok Models](https://benchlm.ai/best/xai-models)
- [Best Alibaba Qwen Models](https://benchlm.ai/best/alibaba-models)

## Tools & Resources

- [LLM Pricing Comparison](https://benchlm.ai/pricing): Compare API pricing for every major LLM — input/output token costs, price-performance ratios
- [AI Cost Calculator](https://benchlm.ai/tools/ai-cost-calculator): Estimate cost per blog post, web page, documentation article, PRD, or shipped feature
- [LLM Selector Quiz](https://benchlm.ai/tools/llm-selector): Answer 5 questions, get a personalized model recommendation based on benchmark data
- [Cost Calculator](https://benchlm.ai/tools/cost-calculator): Estimate monthly AI API spending based on usage patterns

## Blog Posts

- [What Do LLM Benchmarks Actually Measure?](https://benchlm.ai/blog/posts/what-benchmarks-measure): Pillar guide — what benchmarks test, what they miss, and how to use them
- [Complete Guide to LLM Benchmarking](https://benchlm.ai/blog/posts/complete-guide-llm-benchmarking): Full methodology overview, benchmark taxonomy, category weights
- [How to Interpret LLM Benchmark Results](https://benchlm.ai/blog/posts/interpreting-llm-benchmark-results): Signal vs noise, saturation, statistical significance
- [Building Custom LLM Benchmarks](https://benchlm.ai/blog/posts/building-custom-llm-benchmark): How to evaluate LLMs on your specific tasks
- [What Is HumanEval?](https://benchlm.ai/blog/posts/what-is-humaneval-coding-benchmark): The coding benchmark explained — and why it's saturated
- [SWE-bench Explained](https://benchlm.ai/blog/posts/swe-bench-explained): How we measure real-world coding ability
- [MMLU vs MMLU-Pro](https://benchlm.ai/blog/posts/mmlu-vs-mmlu-pro): What changed and why it matters
- [GPQA Diamond Explained](https://benchlm.ai/blog/posts/gpqa-diamond-science-benchmark): The PhD-level science benchmark
- [LiveCodeBench Explained](https://benchlm.ai/blog/posts/livecodebench-contamination-free): Why static coding benchmarks aren't enough
- [AIME & HMMT: Can AI Do Competition Math?](https://benchlm.ai/blog/posts/aime-hmmt-competition-math): Competition math is effectively solved — what now?
- [HLE: Humanity's Last Exam](https://benchlm.ai/blog/posts/hle-humanitys-last-exam): The hardest AI benchmark and what it reveals
- [Chatbot Arena Elo Explained](https://benchlm.ai/blog/posts/chatbot-arena-elo-explained): How human preference differs from benchmark accuracy
- [Best LLM for Coding in 2026](https://benchlm.ai/blog/posts/best-llm-for-coding): What the benchmarks actually show
- [Claude Opus 4.6 vs GPT-5.4](https://benchlm.ai/blog/posts/claude-opus-4-6-vs-gpt-5-4): Head-to-head across 22 benchmarks
- [BrowseComp Explained](https://benchlm.ai/blog/posts/browsecomp-browsing-benchmark): How we measure web research ability
- [OSWorld-Verified Explained](https://benchlm.ai/blog/posts/osworld-verified-computer-use-benchmark): Computer-use benchmark for AI agents
- [Terminal-Bench 2.0 Explained](https://benchlm.ai/blog/posts/terminal-bench-2-agentic-benchmark): The agentic terminal task benchmark

## Frequently Asked Questions

Q: What is the best AI model overall?
A: As of March 2026, GPT-5.4 Pro leads with an overall score of 91, followed by GPT-5.4 (90), GPT-5.2 Pro (90), GPT-5.3 Codex (89), and GPT-5.2 (88). The top of the table is tight, with 3 points separating the top 5. See the full ranking at https://benchlm.ai/best/overall

Q: What is the best LLM for coding?
A: GPT-5.4 Pro currently leads the weighted coding table at 87.4, with GPT-5.3 Codex essentially tied at 87.3 and still the strongest coding-specific value model. Among general-purpose options, GPT-5.4 is clearly ahead of Claude Opus 4.6 on the current coding mix. See https://benchlm.ai/best/coding

Q: What is the best open source LLM?
A: The top open-weight models are GLM-5 (Reasoning), Kimi K2.5 (Reasoning), and Qwen3.5 397B (Reasoning). Open-weight models now score within 10-15 points of the top proprietary models on most benchmarks. The gap is largest on agentic benchmarks. See https://benchlm.ai/best/open-source

Q: How is the overall score calculated?
A: Each model's overall score is a weighted average across 8 categories: Agentic (22%), Coding (20%), Reasoning (17%), Knowledge (12%), Multimodal & Grounded (12%), Multilingual (7%), Instruction Following (5%), and Math (5%). Saturated benchmarks (MMLU, HumanEval, older competition math exams) are excluded from the formula. See https://benchlm.ai/#methodology

Q: What benchmarks does BenchLM track?
A: 32 benchmarks across 8 categories. Agentic: Terminal-Bench 2.0, BrowseComp, OSWorld-Verified. Coding: HumanEval, SWE-bench Verified, SWE-bench Pro, LiveCodeBench. Multimodal: MMMU-Pro, OfficeQA Pro. Reasoning: SimpleQA, MuSR, BBH, LongBench v2, MRCRv2. Knowledge: MMLU, MMLU-Pro, GPQA, SuperGPQA, OpenBookQA, HLE, FrontierScience. Instruction Following: IFEval. Multilingual: MGSM, MMLU-ProX. Math: AIME 2023-2025, HMMT 2023-2025, BRUMO 2025, MATH-500.

Q: Claude vs GPT — which is better?
A: GPT-5.4 scores 90 versus Claude Opus 4.6 at 85 overall in the current BenchLM.ai data. GPT-5.4 leads on HLE (48 vs 38), coding, and agentic benchmarks while also costing much less ($2.50/$15 vs $15/$75 per million tokens). Claude Opus 4.6 still offers a faster non-reasoning profile and may be preferable for teams that prioritize writing feel or Anthropic-native workflows. See https://benchlm.ai/compare/claude-opus-4-6-vs-gpt-5-4

Q: What is the best AI model for math?
A: Competition math benchmarks (AIME, HMMT) are saturated at the frontier — top models all score 95-99% and differences are noise. GPT-5.4 Pro leads narrowly. For math, the choice between top models rarely matters — focus on coding, reasoning, or agentic performance instead. See https://benchlm.ai/best/math

Q: What is the best Chinese AI model?
A: DeepSeek, Alibaba Qwen, Zhipu GLM, Moonshot Kimi, and ByteDance Seed all compete at the frontier level. GLM-5 leads open-weight coding. Chinese labs are especially strong in math, reasoning, and open-weight model performance. See https://benchlm.ai/best/chinese-models

Q: What is the difference between a reasoning model and a non-reasoning model?
A: Reasoning models (like GPT-5.4, GPT-5.3 Codex, and DeepSeek-V4) use chain-of-thought inference — they think through the problem before answering, adding latency but improving accuracy on hard math, logic, and reasoning tasks. Non-reasoning models (like Claude Opus 4.6, Gemini 3.1 Pro, Grok 4.1) respond directly without a visible reasoning step, making them faster and more predictable. For latency-sensitive applications, non-reasoning models are often preferred even at a small benchmark cost.

Q: What makes a benchmark saturated?
A: A benchmark is saturated when frontier models score 95-99% and differences between top models are 1-2 points — within statistical noise. MMLU, HumanEval, and older AIME exams are saturated. BenchLM.ai excludes saturated benchmarks from its scoring formula and highlights non-saturated alternatives (HLE, SWE-bench Pro, LiveCodeBench) that have meaningful spread between models.

Q: How should I pick an LLM for my use case?
A: Start with the category most relevant to your task: coding, reasoning, knowledge, agentic, etc. Filter to non-saturated benchmarks in that category — they show real model differences. Focus on 5+ point gaps rather than 1-2 point differences. Test the top 2-3 candidates on a sample of your actual tasks before committing.
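The selection heuristic above (ignore sub-5-point gaps, shortlist the tied leaders for hands-on testing) can be sketched as a small filter over leaderboard rows. The model names and scores below are illustrative placeholders, not BenchLM.ai data:

```python
# Shortlist models from one relevant, non-saturated benchmark:
# treat gaps smaller than 5 points as a tie with the leader, then
# keep at most the top 2-3 candidates for testing on real tasks.
# Scores are illustrative placeholders, not BenchLM.ai data.

def shortlist(scores: dict[str, float], tie_margin: float = 5.0,
              max_candidates: int = 3) -> list[str]:
    """Return up to `max_candidates` models within `tie_margin` of the leader."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    tied = [name for name, score in ranked if best - score < tie_margin]
    return tied[:max_candidates]

example = {"model-a": 84.0, "model-b": 81.5, "model-c": 74.0, "model-d": 69.9}
print(shortlist(example))  # ['model-a', 'model-b']
```

Here model-b (2.5 points behind) survives the filter as a statistical tie, while model-c (10 points behind) is dropped — the same reasoning the FAQ applies to leaderboard gaps.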
Use the LLM Selector at https://benchlm.ai/tools/llm-selector for a guided recommendation.

## Data & Methodology

- Benchmark data sourced from the OpenBench open-source evaluation infrastructure, official model announcements, and academic papers
- 121 models from 23 creators (OpenAI, Anthropic, Google, Meta, DeepSeek, xAI, Alibaba, Mistral, ByteDance, StepFun, NVIDIA, Inception, LiquidAI, Zhipu AI, Moonshot AI, and others)
- Models categorized by: source type (Proprietary/Open Weight), reasoning capability (Reasoning/Non-Reasoning), context window size, creator
- Data last updated: March 12, 2026
- Scores normalized to a 0-100 scale across all benchmarks
- Overall score: weighted average across 8 categories (Agentic 22%, Coding 20%, Reasoning 17%, Knowledge 12%, Multimodal 12%, Multilingual 7%, Instruction Following 5%, Math 5%)
- About page: https://benchlm.ai/about
- Methodology: https://benchlm.ai/#methodology

## Markdown Versions (for LLM crawlers)

- [Full content file](https://benchlm.ai/llms-full.txt): All site content in a single file
- [Homepage (md)](https://benchlm.ai/md/index.md): Leaderboard table in markdown
- [Models directory (md)](https://benchlm.ai/md/models.md): All models grouped by creator
- [Benchmarks reference (md)](https://benchlm.ai/md/benchmarks.md): All benchmark descriptions
- [Knowledge (md)](https://benchlm.ai/md/knowledge.md): Knowledge benchmark rankings
- [Coding (md)](https://benchlm.ai/md/coding.md): Coding benchmark rankings
- [Math (md)](https://benchlm.ai/md/math.md): Math benchmark rankings
- [Reasoning (md)](https://benchlm.ai/md/reasoning.md): Reasoning benchmark rankings
- [Pricing (md)](https://benchlm.ai/md/pricing.md): LLM API pricing comparison table
- [AI Cost Calculator (md)](https://benchlm.ai/md/tools/ai-cost-calculator.md): Task-based AI budgeting by deliverable
- [LLM Selector (md)](https://benchlm.ai/md/tools/llm-selector.md): Model recommendation tool
- [Cost Calculator (md)](https://benchlm.ai/md/tools/cost-calculator.md): Monthly API cost estimates
- Individual model pages available at: `https://benchlm.ai/md/models/[slug].md`
- Blog posts available at: `https://benchlm.ai/md/blog/[slug].md`

## Technical Details

- Built with Next.js 14 and React 18, statically generated
- Site: https://benchlm.ai
- Sitemap: https://benchlm.ai/sitemap.xml
- RSS Feed: https://benchlm.ai/rss.xml
- LLMs Full: https://benchlm.ai/llms-full.txt
- Author: [@glevd](https://x.com/glevd)
- License: MIT