GPT-4.1 Benchmark Scores & Performance

Benchmark analysis of GPT-4.1 by OpenAI across 5 tests.

According to BenchLM.ai, GPT-4.1 ranks #74 out of 100 models with an overall score of 43/100. While not a frontier model, its category-level results vary widely, so its usefulness depends heavily on the task.

GPT-4.1 is a proprietary model with a 1M token context window. It processes queries without explicit chain-of-thought reasoning, offering faster response times and lower token usage.

GPT-4.1 sits inside the GPT-4.1 family alongside GPT-4.1 mini and GPT-4.1 nano. BenchLM links it directly to GPT-4o as the earlier related model in that lineage. This profile currently covers 5 of the 22 tracked benchmarks, so the overall score is conservative until the rest of the suite is filled in.

Its strongest category is Knowledge (#22), while its weakest is Mathematics (#94). This performance profile makes it particularly effective for knowledge-intensive tasks like research, analysis, and factual Q&A.

Creator

OpenAI

Source Type

Proprietary

Reasoning

Non-Reasoning

Context Window

1M

Overall Score

43 (#74 of 100)

Family & Lineage

Family

GPT-4.1

Base entry

Related Earlier Model

GPT-4o

Knowledge Benchmarks

MMLU
90.2
GPQA
66.3

Coding Benchmarks

SWE-bench Verified
54.6

Mathematics Benchmarks

AIME 2024
26.4

Instruction Following Benchmarks

IFEval
87.4
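
The per-category averages quoted in the FAQ below can be reproduced from these five scores. A minimal sketch, assuming a plain arithmetic mean per category (BenchLM's exact aggregation formula is not published here; the grouping follows the category headings above):

```python
# Per-category averages computed from the five sourced benchmark scores above.
# Assumption: a simple arithmetic mean per category; BenchLM's actual
# aggregation method is not stated on this page.
scores = {
    "Knowledge": [90.2, 66.3],            # MMLU, GPQA
    "Coding": [54.6],                     # SWE-bench Verified
    "Mathematics": [26.4],                # AIME 2024
    "Instruction Following": [87.4],      # IFEval
}
averages = {cat: sum(vals) / len(vals) for cat, vals in scores.items()}
for cat, avg in averages.items():
    print(f"{cat}: {avg:.2f}")
```

The Knowledge mean works out to 78.25, which matches the FAQ's reported 78.3 after rounding; single-benchmark categories simply pass their one score through.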

Frequently Asked Questions

How does GPT-4.1 perform overall in AI benchmarks?

GPT-4.1 ranks #74 out of 100 models with an overall score of 43. It is created by OpenAI and features a 1M context window.

Is GPT-4.1 good for knowledge and understanding?

GPT-4.1 ranks #22 out of 100 models in knowledge and understanding benchmarks with an average score of 78.3. There are stronger options in this category.

Is GPT-4.1 good for coding and programming?

GPT-4.1 ranks #34 out of 100 models in coding and programming benchmarks with an average score of 54.6. There are stronger options in this category.

Is GPT-4.1 good for mathematics?

GPT-4.1 ranks #94 out of 100 models in mathematics benchmarks with an average score of 26.4. There are stronger options in this category.

Is GPT-4.1 good for instruction following?

GPT-4.1 ranks #26 out of 100 models in instruction following benchmarks with an average score of 87.4. There are stronger options in this category.

Which sibling models are related to GPT-4.1?

GPT-4.1 belongs to the GPT-4.1 family. Related variants on BenchLM include GPT-4.1 mini and GPT-4.1 nano.

Does GPT-4.1 have full benchmark coverage on BenchLM?

Not yet. GPT-4.1 currently has 5 sourced benchmark scores out of the 22 benchmarks BenchLM tracks, so its overall score is intentionally conservative until more results are added.
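
To see why the overall score reads as conservative, compare it with a naive mean of the five sourced scores. This is an illustration only; BenchLM's actual weighting is not stated here:

```python
# Naive mean of the five sourced scores vs. the published overall score of 43.
sourced = [90.2, 66.3, 54.6, 26.4, 87.4]  # MMLU, GPQA, SWE-bench Verified,
                                          # AIME 2024, IFEval
naive_mean = sum(sourced) / len(sourced)
print(f"naive mean of sourced scores: {naive_mean:.1f}")
# The published overall of 43 sits well below this naive mean, consistent
# with a conservative adjustment for the 17 missing benchmarks.
```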

What is the context window size of GPT-4.1?

GPT-4.1 has a context window of 1M tokens, which determines how much text it can process in a single interaction.
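
For a rough sense of scale, the window size can be converted into characters and pages. The characters-per-token ratio below is a common heuristic for English prose, not an OpenAI figure:

```python
# Back-of-envelope capacity of a 1M-token context window.
# Assumption: ~4 characters per token for English text (rough heuristic).
context_tokens = 1_000_000
approx_chars = context_tokens * 4
approx_pages = approx_chars // 1800   # ~1,800 characters per printed page (rough)
print(f"~{approx_chars:,} characters, on the order of {approx_pages:,} pages")
```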

Last updated: March 9, 2026
