# BenchLM AI

> BenchLM AI is a comprehensive AI benchmarking platform that evaluates and compares 88 large language models across 22 benchmarks in 6 categories: Knowledge, Coding, Mathematics, Reasoning, Instruction Following, and Multilingual. The platform provides real-time leaderboard data, detailed model profiles, and educational content about LLM evaluation methodologies.

## Main Pages

- [Homepage](https://benchlm.ai/): AI model leaderboard with all benchmark scores, filtering, and sorting
- [Knowledge Benchmarks](https://benchlm.ai/knowledge): MMLU, GPQA, SuperGPQA, and OpenBookQA evaluations
- [Coding Benchmarks](https://benchlm.ai/coding): HumanEval, SWE-bench Verified, and LiveCodeBench code generation evaluations
- [Math Benchmarks](https://benchlm.ai/math): AIME 2023-2025, HMMT 2023-2025, and BRUMO 2025 evaluations
- [Reasoning Benchmarks](https://benchlm.ai/reasoning): SimpleQA and MuSR multi-step reasoning evaluations
- [Models Directory](https://benchlm.ai/models): Browse all 88 AI models with benchmark scores
- [Blog](https://benchlm.ai/blog): Articles on LLM benchmarking methodology and analysis

## Model Profile Pages

Individual benchmark analysis pages for each of the 88 tracked AI models:

- [GPT-5.4](https://benchlm.ai/models/gpt-5-4): OpenAI, Score: 88, Proprietary, 1M context
- [Gemini 3.1 Pro](https://benchlm.ai/models/gemini-3-1-pro): Google, Score: 87, Proprietary, 1M context
- [Claude Opus 4.6](https://benchlm.ai/models/claude-opus-4-6): Anthropic, Score: 86, Proprietary, 1M context
- [GPT-5.3 Codex](https://benchlm.ai/models/gpt-5-3-codex): OpenAI, Score: 85, Proprietary, 400K context
- [Grok 4.1](https://benchlm.ai/models/grok-4-1): xAI, Score: 84, Proprietary, 128K context
- [GPT-5.2](https://benchlm.ai/models/gpt-5-2): OpenAI, Score: 83, Proprietary, 400K context
- [Gemini 3 Pro Deep Think](https://benchlm.ai/models/gemini-3-pro-deep-think): Google, Score: 81, Proprietary, 2M context
- [Claude Sonnet 4.6](https://benchlm.ai/models/claude-sonnet-4-6): Anthropic, Score: 80, Proprietary, 1M context
- [Claude Opus 4.5](https://benchlm.ai/models/claude-opus-4-5): Anthropic, Score: 79, Proprietary, 200K context
- [Gemini 3 Pro](https://benchlm.ai/models/gemini-3-pro): Google, Score: 78, Proprietary, 2M context
- [Full models directory](https://benchlm.ai/models): All 88 models with scores and rankings

## Comparison Pages

- [Model vs Model comparisons](https://benchlm.ai/compare): Side-by-side benchmark comparison for any two models
- Example: [GPT-5.4 vs Claude Opus 4.6](https://benchlm.ai/compare/claude-opus-4-6-vs-gpt-5-4)
- 3,828 total comparison pages available, one per unordered pair of the 88 tracked models (see the URL sketch below)
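The comparison URL pattern is not documented explicitly, but the example slug reverses the display order (claude-opus-4-6 before gpt-5-4), which suggests the two model slugs are joined alphabetically. A minimal sketch in Python, treating that ordering rule as an assumption inferred from the single documented example:

```python
import math

def compare_url(slug_a: str, slug_b: str) -> str:
    """Build a comparison URL from two model slugs.

    Assumption: slugs are joined with '-vs-' in alphabetical order,
    inferred from the one documented example, where GPT-5.4 vs
    Claude Opus 4.6 lives at /compare/claude-opus-4-6-vs-gpt-5-4.
    """
    first, second = sorted((slug_a, slug_b))
    return f"https://benchlm.ai/compare/{first}-vs-{second}"

print(compare_url("gpt-5-4", "claude-opus-4-6"))
# -> https://benchlm.ai/compare/claude-opus-4-6-vs-gpt-5-4

# The documented page count matches one page per unordered pair of 88 models:
assert math.comb(88, 2) == 3828
```

Sorting the slugs makes each pair's URL canonical, which is consistent with the 3,828 page count covering every pair exactly once.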
## Benchmark Detail Pages

- [MMLU](https://benchlm.ai/benchmarks/mmlu): Massive Multitask Language Understanding
- [GPQA](https://benchlm.ai/benchmarks/gpqa): Graduate-Level Google-Proof Q&A
- [SuperGPQA](https://benchlm.ai/benchmarks/superGpqa): Super Graduate-Level Q&A
- [OpenBookQA](https://benchlm.ai/benchmarks/openBookQa): Open Book Question Answering
- [HumanEval](https://benchlm.ai/benchmarks/humaneval): Code Generation Evaluation
- [AIME 2023-2025](https://benchlm.ai/benchmarks/aime2025): American Invitational Mathematics Examination
- [HMMT 2023-2025](https://benchlm.ai/benchmarks/hmmt2025): Harvard-MIT Mathematics Tournament
- [BRUMO 2025](https://benchlm.ai/benchmarks/brumo2025): Brown University Math Olympiad
- [SimpleQA](https://benchlm.ai/benchmarks/simpleQa): Short-form Factual Accuracy
- [MuSR](https://benchlm.ai/benchmarks/musr): Multistep Soft Reasoning

## Best LLM Rankings

- [Best LLMs for Coding](https://benchlm.ai/best/coding)
- [Best LLMs for Math](https://benchlm.ai/best/math)
- [Best LLMs for Knowledge](https://benchlm.ai/best/knowledge)
- [Best LLMs for Reasoning](https://benchlm.ai/best/reasoning)
- [Best Open Source LLMs](https://benchlm.ai/best/open-source)
- [Best Overall](https://benchlm.ai/best/overall)
- [Best Chinese AI Models](https://benchlm.ai/best/chinese-models)
- [Best Non-Reasoning LLMs](https://benchlm.ai/best/non-reasoning-models)
- [Best Large Context Window LLMs](https://benchlm.ai/best/large-context-window)
- [Best Reasoning Models](https://benchlm.ai/best/reasoning-models)
- [Best OpenAI Models](https://benchlm.ai/best/openai-models)
- [Best Anthropic Models](https://benchlm.ai/best/anthropic-models)
- [Best Google Models](https://benchlm.ai/best/google-models)
- [Best DeepSeek Models](https://benchlm.ai/best/deepseek-models)
- [Best Meta Models](https://benchlm.ai/best/meta-models)
- [Best Mistral Models](https://benchlm.ai/best/mistral-models)
- [Best xAI Grok Models](https://benchlm.ai/best/xai-models)
- [Best Alibaba Qwen Models](https://benchlm.ai/best/alibaba-models)

## Blog Posts

- [Complete Guide to LLM Benchmarking](https://benchlm.ai/blog/posts/complete-guide-llm-benchmarking): Comprehensive methodology overview
- [Building Custom LLM Benchmarks](https://benchlm.ai/blog/posts/building-custom-llm-benchmark): Step-by-step implementation guide
- [Interpreting LLM Benchmark Results](https://benchlm.ai/blog/posts/interpreting-llm-benchmark-results): Result analysis and interpretation

## Frequently Asked Questions

Q: What is the best AI model overall?
A: As of March 2026, GPT-5.3 Codex leads with a score of 92, followed by GPT-5.4 (91), GPT-5.2 (91), and Claude Opus 4.6 (90). The top models are separated by just 1-2 points. See the full ranking at https://benchlm.ai/best/overall

Q: What is the best LLM for coding?
A: GPT-5.3 Codex leads coding benchmarks with an 88.3 average (HumanEval 95, SWE-bench Verified 85, LiveCodeBench 85). Among general-purpose models, GPT-5.4 and Claude Opus 4.6 are nearly tied. See https://benchlm.ai/best/coding

Q: What is the best open source LLM?
A: The top open weight models include DeepSeek, Alibaba Qwen, and Meta Llama models. Open weight models now score within 5-10 points of the best proprietary models. See https://benchlm.ai/best/open-source

Q: How is the overall score calculated?
A: Each model's overall score is a weighted average of category averages: Coding (25%), Knowledge (20%), Math (20%), Reasoning (20%), Instruction Following (10%), Multilingual (5%). Within each category, all benchmark scores are averaged equally. See https://benchlm.ai/#methodology
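A minimal sketch of this weighting scheme in Python; the category weights come from the answer above, while the example benchmark scores are hypothetical placeholders, not real leaderboard data:

```python
# Category weights from the methodology answer above (sum to 1.0).
WEIGHTS = {
    "coding": 0.25,
    "knowledge": 0.20,
    "math": 0.20,
    "reasoning": 0.20,
    "instruction_following": 0.10,
    "multilingual": 0.05,
}

def overall_score(category_scores: dict[str, list[float]]) -> float:
    """Weighted average of per-category means.

    `category_scores` maps each category to its normalized (0-100)
    benchmark scores. Within a category all benchmarks are averaged
    equally; categories are then combined using WEIGHTS.
    """
    return sum(
        WEIGHTS[category] * (sum(scores) / len(scores))
        for category, scores in category_scores.items()
    )

# Placeholder numbers for illustration only, not real leaderboard data.
example = {
    "coding": [95.0, 85.0, 85.0],  # HumanEval, SWE-bench Verified, LiveCodeBench
    "knowledge": [88.0, 80.0, 75.0, 92.0, 81.0, 40.0],
    "math": [96.0, 97.0, 95.0, 94.0, 93.0, 96.0, 92.0, 90.0],
    "reasoning": [60.0, 78.0, 85.0],
    "instruction_following": [90.0],
    "multilingual": [89.0],
}
print(round(overall_score(example), 1))  # weighted overall score for the placeholders
```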
Q: What benchmarks does BenchLM track?
A: 22 benchmarks across 6 categories: Coding (HumanEval, SWE-bench Verified, LiveCodeBench), Knowledge (MMLU, GPQA, SuperGPQA, OpenBookQA, MMLU-Pro, HLE), Math (AIME 2023-2025, HMMT 2023-2025, BRUMO 2025, MATH-500), Reasoning (SimpleQA, MuSR, BBH), Instruction Following (IFEval), Multilingual (MGSM).

Q: Which is better, Claude or GPT?
A: It depends on the task. GPT-5.4 scores 91 overall vs Claude Opus 4.6 at 90. GPT-5.4 leads on HLE (hard knowledge) by 8 points, while Claude Opus 4.6 matches GPT-5.4 on coding without using chain-of-thought reasoning. For a detailed comparison, see https://benchlm.ai/compare/claude-opus-4-6-vs-gpt-5-4

Q: What is the best AI model for math?
A: Competition math benchmarks (AIME, HMMT) are saturated at the top: the top 10 models all score above 95%. MATH-500 shows the most differentiation. GPT-5.3 Codex and GPT-5.4 lead slightly. See https://benchlm.ai/best/math

Q: What is the best Chinese AI model?
A: DeepSeek, Alibaba Qwen, Zhipu GLM, and Moonshot Kimi models all compete at the frontier level. Chinese labs have been especially strong on math and reasoning benchmarks. See https://benchlm.ai/best/chinese-models

## Data & Methodology

- Benchmark data sourced from the OpenBench open-source evaluation infrastructure
- 88 models from 16 creators (OpenAI, Anthropic, Google, Meta, DeepSeek, xAI, Alibaba, Mistral, NVIDIA, and others)
- Models categorized by source type (Proprietary/Open Weight), reasoning capability, and context window size
- Data last updated: March 7, 2026
- Scores normalized to a 0-100 scale across all benchmarks
- About page: https://benchlm.ai/about

## Markdown Versions (for LLM crawlers)

- [Full content file](https://benchlm.ai/llms-full.txt): All site content in a single file
- [Homepage (md)](https://benchlm.ai/md/index.md): Leaderboard table in markdown
- [Models directory (md)](https://benchlm.ai/md/models.md): All models grouped by creator
- [Benchmarks reference (md)](https://benchlm.ai/md/benchmarks.md): All benchmark descriptions
- [Knowledge (md)](https://benchlm.ai/md/knowledge.md): Knowledge benchmark rankings
- [Coding (md)](https://benchlm.ai/md/coding.md): Coding benchmark rankings
- [Math (md)](https://benchlm.ai/md/math.md): Math benchmark rankings
- [Reasoning (md)](https://benchlm.ai/md/reasoning.md): Reasoning benchmark rankings
- Individual model pages available at `https://benchlm.ai/md/models/[slug].md` (see the fetch sketch at the end of this file)

## Technical Details

- Built with Next.js 14 and React 18, statically generated
- Site: https://benchlm.ai
- Sitemap: https://benchlm.ai/sitemap.xml
- RSS Feed: https://benchlm.ai/rss.xml
- LLMs Full: https://benchlm.ai/llms-full.txt
- Author: [@glevd](https://x.com/glevd)
- License: MIT
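For crawlers consuming the markdown endpoints listed under "Markdown Versions" above, a minimal fetch sketch in Python. It assumes those endpoints serve plain UTF-8 text at the documented paths; the `gpt-5-4` slug is taken from the model profile URLs above:

```python
import urllib.request

BASE = "https://benchlm.ai"

def fetch_markdown(path: str) -> str:
    """Fetch one of the documented markdown endpoints as UTF-8 text."""
    with urllib.request.urlopen(f"{BASE}{path}") as resp:
        return resp.read().decode("utf-8")

# Site-wide content in one file, as documented above:
full_content = fetch_markdown("/llms-full.txt")

# Per-model pages follow the documented /md/models/[slug].md pattern;
# the 'gpt-5-4' slug comes from the model profile URLs above.
model_page = fetch_markdown("/md/models/gpt-5-4.md")
print(model_page[:200])
```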