Best AI model rankings

Browse BenchLM ranking surfaces by benchmark category, workflow, provider, license, and value.

Core leaderboards

Best LLMs for Coding

Top AI models ranked by coding benchmark performance including HumanEval, SWE-bench Verified, SWE-bench Pro, SWE-Rebench, LiveCodeBench, FLTEval, and ProgramBench.

Best LLMs for Math

Top AI models ranked by mathematics benchmark performance including AIME, HMMT, BRUMO, and MATH-500.

Best LLMs for Knowledge

Top AI models ranked by knowledge benchmarks including MMLU, GPQA, SuperGPQA, MMLU-Pro, HLE, FrontierScience, and SimpleQA.

Best LLMs for Reasoning

Top AI models ranked by reasoning benchmark performance including ARC-AGI-2, MuSR, LongBench v2, MRCRv2, BBH, and LisanBench.

Best Agentic AI Models

Top AI models ranked by agentic benchmark performance including Terminal-Bench 2.0, BrowseComp, and OSWorld-Verified.

Best Multimodal & Grounded AI Models

Top AI models ranked by multimodal and grounded benchmark performance including MMMU-Pro and OfficeQA Pro.

Best LLMs for Instruction Following

Top AI models ranked by instruction-following benchmark performance including IFEval.

Best Multilingual LLMs

Top AI models ranked by multilingual benchmark performance including MGSM and MMLU-ProX.

Use cases

Best Long Context AI Models

AI models ranked on sourced long-context and memory benchmarks including LongBench v2, MRCRv2, AI-Needle, Graphwalks, and MMLongBench-Doc.

Best Tool Use & Function Calling Models

AI models ranked on sourced tool-use benchmarks including BFCL v4, MCP Atlas, Toolathlon, Tau Bench, and related tool-calling evaluations.

Best AI Models for Web Research

AI models ranked on sourced browsing and research benchmarks including BrowseComp, WebArena, and WebVoyager.

Best Computer Use AI Models

AI models ranked on sourced computer-use and GUI benchmarks including OSWorld-Verified, ScreenSpot Pro, WebArena, WebVoyager, and Vision2Web.

Best Document AI Models

AI models ranked on sourced document AI and OCR benchmarks including OfficeQA Pro, OmniDocBench 1.5, CC-OCR, and MMLongBench-Doc.

Best Image Understanding Models

AI models ranked on sourced image-understanding benchmarks including MMMU-Pro, RealWorldQA, AI2D, CountBench, RefCOCO, and related grounding evaluations.

Best Frontend & App Dev Models

AI models ranked on sourced frontend and app-development benchmarks including React Native Evals, Design2Code, Vision2Web, and related web-task evaluations.

Best Factuality AI Models

AI models ranked on sourced factuality and hallucination-adjacent benchmarks including SimpleQA, HLE without tools, and Facts-VLM.

Model groups

Best Open Source LLMs

Top open-weight AI models you can download and run locally, ranked by benchmark performance.

Best Proprietary LLMs

Top proprietary/closed-source AI models ranked by benchmark performance.

Best Reasoning AI Models

Top AI models with dedicated reasoning capabilities, ranked by benchmark performance.

Best OpenAI Models

All OpenAI models ranked by benchmark performance — GPT-5, GPT-4o, o1, o3, and more.

Best Anthropic Models

All Anthropic Claude models ranked by benchmark performance.

Best Google AI Models

All Google Gemini and Gemma models ranked by benchmark performance.

Best Meta AI Models

All Meta Llama models ranked by benchmark performance.

Best DeepSeek Models

All DeepSeek models ranked by benchmark performance.

Best AI Models Overall

The top AI models ranked by overall benchmark performance across all categories.

Best Large Context Window LLMs

AI models with the largest context windows (200K+ tokens), ranked by benchmark performance.

Best Chinese AI Models

Top AI models from Chinese labs — DeepSeek, Alibaba Qwen, Zhipu GLM, Moonshot Kimi, and more — ranked by benchmark performance.

European AI Models

European AI models from Mistral, H Company, LightOn, and Aleph Alpha. Ranked models are listed first, followed by tracked models with sparse benchmark coverage.

Best Non-Reasoning LLMs

Top standard AI models (no chain-of-thought reasoning) ranked by benchmark performance. These are typically faster and cheaper than reasoning models.

Best Mistral Models

All Mistral AI models ranked by benchmark performance — Mistral Large, Mixtral, and more.

Best xAI Grok Models

All xAI Grok models ranked by benchmark performance.

Best Alibaba Qwen Models

All Alibaba Qwen models ranked by benchmark performance.

Value rankings

Best Value LLM for Coding

Top AI models ranked by coding benchmark performance per dollar. Find the most cost-effective LLM for coding, measured on SWE-bench, LiveCodeBench, and more.

Best Value Agentic AI Model

Top AI models ranked by agentic benchmark performance per dollar. Find the most cost-effective model for AI agents, tool use, and multi-step workflows.

Best Value LLM for Reasoning

Top AI models ranked by reasoning benchmark performance per dollar. Cost-adjusted rankings using ARC-AGI-2, LongBench v2, MRCRv2, and MuSR.

Best Value LLM for Knowledge

Top AI models ranked by knowledge benchmark performance per dollar. Cost-adjusted rankings using GPQA, MMLU-Pro, HLE, and more.

Best Value LLM for Math

Top AI models ranked by math benchmark performance per dollar. Cost-adjusted rankings using AIME 2025, BRUMO, and MATH-500.

Best Value Multimodal AI Model

Top AI models ranked by multimodal benchmark performance per dollar. Cost-adjusted rankings using MMMU-Pro and OfficeQA Pro.

Best Value LLM Overall

Top AI models ranked by overall benchmark performance per dollar. Find the most cost-effective all-around LLM across all categories.
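
The value rankings above describe "performance per dollar" without stating a formula. As a rough illustration only, the sketch below divides a benchmark score by a price and sorts the results. The model names, scores, prices, and the choice of price per million tokens as the cost unit are all hypothetical assumptions, not BenchLM's actual method or data.

```python
from dataclasses import dataclass


@dataclass
class ModelEntry:
    name: str                  # hypothetical model identifier
    score: float               # hypothetical benchmark score (0-100)
    usd_per_million_tokens: float  # hypothetical blended price


def value_score(entry: ModelEntry) -> float:
    """Score points per dollar; assumes a strictly positive price."""
    if entry.usd_per_million_tokens <= 0:
        raise ValueError(f"price must be positive for {entry.name}")
    return entry.score / entry.usd_per_million_tokens


def rank_by_value(entries: list[ModelEntry]) -> list[ModelEntry]:
    """Return entries sorted from best to worst value."""
    return sorted(entries, key=value_score, reverse=True)


# Made-up example data, for illustration only.
models = [
    ModelEntry("model-a", score=80.0, usd_per_million_tokens=10.0),
    ModelEntry("model-b", score=60.0, usd_per_million_tokens=2.0),
    ModelEntry("model-c", score=70.0, usd_per_million_tokens=5.0),
]

ranked = rank_by_value(models)
for position, entry in enumerate(ranked, start=1):
    print(position, entry.name, value_score(entry))
```

With this made-up data, the cheapest model ranks first despite having the lowest raw score, which is the kind of reordering a cost-adjusted ranking is meant to surface. A real ranking might normalize scores or weight input and output prices differently.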