Mercury 2 vs Nemotron 3 Ultra 500B

Side-by-side benchmark comparison across agentic, coding, multimodal, knowledge, reasoning, and math workflows.

Mercury 2 finishes one point ahead overall, 65 to 64. That margin is enough to call a winner, but not enough to treat as a blowout. This matchup comes down to a few meaningful edges rather than one model dominating the board.

Mercury 2's sharpest advantage is in mathematics, where it averages 80.9 against 78. The single biggest benchmark swing on the page is MuSR, where Mercury 2 scores 82 to Nemotron 3 Ultra 500B's 69. Nemotron 3 Ultra 500B does hit back in coding, so the answer changes if that is the part of the workload you care about most.

Nemotron 3 Ultra 500B gives you the larger context window at 10M, compared with 128K for Mercury 2.

Quick Verdict

Pick Mercury 2 if you want the stronger benchmark profile. Nemotron 3 Ultra 500B only becomes the better choice if coding is the priority or you need the larger 10M context window.

Agentic

Winner: Mercury 2 (average 63.7 vs 62.8)

Benchmark             Mercury 2   Nemotron 3 Ultra 500B
Terminal-Bench 2.0    63          63
BrowseComp            67          69
OSWorld-Verified      62          58

Coding

Winner: Nemotron 3 Ultra 500B (average 43.8 vs 41.1)

Benchmark             Mercury 2   Nemotron 3 Ultra 500B
HumanEval             75          66
SWE-bench Verified    46          42
LiveCodeBench         38          41
SWE-bench Pro         43          47

Multimodal & Grounded

Winner: Mercury 2 (average 68.3 vs 66.9)

Benchmark             Mercury 2   Nemotron 3 Ultra 500B
MMMU-Pro              66          61
OfficeQA Pro          71          74

Reasoning

Winner: Mercury 2 (average 80.1 vs 77.2)

Benchmark             Mercury 2   Nemotron 3 Ultra 500B
SimpleQA              82          71
MuSR                  82          69
BBH                   87          85
LongBench v2          77          81
MRCRv2                76          85

Knowledge

Winner: Mercury 2 (average 57.2 vs 57)

Benchmark             Mercury 2   Nemotron 3 Ultra 500B
MMLU                  78          74
GPQA                  78          73
SuperGPQA             76          71
OpenBookQA            74          69
MMLU-Pro              72          73
HLE                   9           15
FrontierScience       69          67

Instruction Following

Tie (both average 84)

Benchmark             Mercury 2   Nemotron 3 Ultra 500B
IFEval                84          84

Multilingual

Tie (both average 79.7)

Benchmark             Mercury 2   Nemotron 3 Ultra 500B
MGSM                  81          81
MMLU-ProX             79          79

Mathematics

Winner: Mercury 2 (average 80.9 vs 78)

Benchmark             Mercury 2   Nemotron 3 Ultra 500B
AIME 2023             81          74
AIME 2024             83          76
AIME 2025             82          75
HMMT Feb 2023         77          70
HMMT Feb 2024         79          72
HMMT Feb 2025         78          71
BRUMO 2025            80          73
MATH-500              82          84
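If you want to sanity-check the category numbers yourself, here is a minimal Python sketch that recomputes an unweighted mean over the Mathematics scores as listed on this page. Note that the plain mean of the listed scores (80.2 vs 74.4) does not exactly reproduce the page's stated category averages (80.9 vs 78), so the site's averages are presumably weighted or include runs not shown here; the function name and data layout below are illustrative, not part of the site's methodology.

```python
from statistics import mean

# Per-benchmark scores exactly as listed on this page:
# (Mercury 2, Nemotron 3 Ultra 500B)
math_scores = {
    "AIME 2023": (81, 74),
    "AIME 2024": (83, 76),
    "AIME 2025": (82, 75),
    "HMMT Feb 2023": (77, 70),
    "HMMT Feb 2024": (79, 72),
    "HMMT Feb 2025": (78, 71),
    "BRUMO 2025": (80, 73),
    "MATH-500": (82, 84),
}

def category_average(scores: dict) -> tuple:
    """Unweighted mean for each model across one category's benchmarks."""
    mercury = round(mean(s[0] for s in scores.values()), 1)
    nemotron = round(mean(s[1] for s in scores.values()), 1)
    return mercury, nemotron

print(category_average(math_scores))
```

The same function works for any category table above once its scores are keyed the same way; the gap between these recomputed means and the published averages is a reminder to treat single headline numbers with some skepticism.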

Frequently Asked Questions

Which is better, Mercury 2 or Nemotron 3 Ultra 500B?

Mercury 2 is ahead overall, 65 to 64. The biggest single separator in this matchup is MuSR, where Mercury 2 scores 82 to Nemotron 3 Ultra 500B's 69.

Which is better for knowledge tasks, Mercury 2 or Nemotron 3 Ultra 500B?

Mercury 2 has the edge for knowledge tasks in this comparison, averaging 57.2 versus 57. Inside this category, HLE is the benchmark that creates the most daylight between them.

Which is better for coding, Mercury 2 or Nemotron 3 Ultra 500B?

Nemotron 3 Ultra 500B has the edge for coding in this comparison, averaging 43.8 versus 41.1. Inside this category, HumanEval is the benchmark that creates the most daylight between them.

Which is better for math, Mercury 2 or Nemotron 3 Ultra 500B?

Mercury 2 has the edge for math in this comparison, averaging 80.9 versus 78. Inside this category, AIME 2023 is the benchmark that creates the most daylight between them.

Which is better for reasoning, Mercury 2 or Nemotron 3 Ultra 500B?

Mercury 2 has the edge for reasoning in this comparison, averaging 80.1 versus 77.2. Inside this category, MuSR is the benchmark that creates the most daylight between them.

Which is better for agentic tasks, Mercury 2 or Nemotron 3 Ultra 500B?

Mercury 2 has the edge for agentic tasks in this comparison, averaging 63.7 versus 62.8. Inside this category, OSWorld-Verified is the benchmark that creates the most daylight between them.

Which is better for multimodal and grounded tasks, Mercury 2 or Nemotron 3 Ultra 500B?

Mercury 2 has the edge for multimodal and grounded tasks in this comparison, averaging 68.3 versus 66.9. Inside this category, MMMU-Pro is the benchmark that creates the most daylight between them.

Which is better for instruction following, Mercury 2 or Nemotron 3 Ultra 500B?

Mercury 2 and Nemotron 3 Ultra 500B are effectively tied for instruction following here, both landing at 84 on average.

Which is better for multilingual tasks, Mercury 2 or Nemotron 3 Ultra 500B?

Mercury 2 and Nemotron 3 Ultra 500B are effectively tied for multilingual tasks here, both landing at 79.7 on average.

Last updated: March 12, 2026
