Side-by-side benchmark comparison across agentic, coding, multimodal, knowledge, reasoning, and math workflows.
GPT-OSS 120B has the stronger overall profile here, scoring 25 overall versus 23 for Grok 3 Mini. The lead is real but narrow, so category-level strengths matter more than the headline number.
GPT-OSS 120B's sharpest advantage is in reasoning, where it averages 47.9 against 21.7. The single biggest benchmark swing on the page is SimpleQA, 49% to 21.7%. Grok 3 Mini does hit back in knowledge, where it averages 74.1 against 49, so the answer changes if that is the part of the workload you care about most.
Pick GPT-OSS 120B if you want the stronger benchmark profile. Grok 3 Mini only becomes the better choice if knowledge is the priority.
Category averages (GPT-OSS 120B vs. Grok 3 Mini):
Coding: 43 vs. 41.5
Reasoning: 47.9 vs. 21.7
Knowledge: 49 vs. 74.1
Math: 50 vs. 39.7
Benchmark data for the remaining categories is coming soon.
GPT-OSS 120B is ahead overall, 25 to 23. The biggest single separator in this matchup is SimpleQA, where the scores are 49% and 21.7%.
Grok 3 Mini has the edge for knowledge tasks in this comparison, averaging 74.1 versus 49. Inside this category, GPQA is the benchmark that creates the most daylight between them.
GPT-OSS 120B has the edge for coding in this comparison, averaging 43 versus 41.5. Grok 3 Mini stays close enough that the answer can still flip depending on your workload.
GPT-OSS 120B has the edge for math in this comparison, averaging 50 versus 39.7. Inside this category, AIME 2024 is the benchmark that creates the most daylight between them.
GPT-OSS 120B has the edge for reasoning in this comparison, averaging 47.9 versus 21.7. Inside this category, SimpleQA is the benchmark that creates the most daylight between them.
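The category-by-category reading above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming only the four category averages quoted on this page; the `scores` dictionary and the `category_leaders` helper are illustrative names, not part of any published tooling. The page does not say how the overall scores of 25 and 23 are derived, so the sketch does not attempt to recompute them.

```python
# Category averages as quoted in this comparison, in the order
# (GPT-OSS 120B, Grok 3 Mini). Categories without data are omitted.
scores = {
    "coding": (43.0, 41.5),
    "knowledge": (49.0, 74.1),
    "math": (50.0, 39.7),
    "reasoning": (47.9, 21.7),
}


def category_leaders(scores):
    """Return {category: (leader, margin)}, where margin is the absolute gap in points."""
    result = {}
    for category, (gpt_oss, grok) in scores.items():
        margin = round(abs(gpt_oss - grok), 1)
        if gpt_oss > grok:
            leader = "GPT-OSS 120B"
        elif grok > gpt_oss:
            leader = "Grok 3 Mini"
        else:
            leader = "tie"
        result[category] = (leader, margin)
    return result


leaders = category_leaders(scores)

# List categories from the widest gap to the narrowest.
for category, (leader, margin) in sorted(
    leaders.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"{category:<10} {leader:<13} +{margin}")
```

Sorting by margin makes it easy to see which categories are decisive (reasoning and knowledge) and which are close enough to flip with workload (coding).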