Side-by-side benchmark comparison across agentic, coding, multimodal, knowledge, reasoning, and math workflows.
Gemma 3 27B and LFM2-24B-A2B finish on the same overall score, so this is less about a single winner and more about where the edge shows up. The headline says tie; the benchmark table is where the real choice happens.
LFM2-24B-A2B's sharpest advantage is in coding, where it averages 18 against Gemma 3 27B's 16. The single biggest benchmark swing on the page is HumanEval, 37 to 42 in LFM2-24B-A2B's favor. Gemma 3 27B does hit back in agentic, so the answer changes if that is the part of the workload you care about most.
Treat this as a split decision. Gemma 3 27B makes more sense if agentic is the priority; LFM2-24B-A2B is the better fit if coding is the priority.
| Category | Gemma 3 27B | LFM2-24B-A2B |
| --- | --- | --- |
| Agentic | 34.4 | 33.4 |
| Coding | 16 | 18 |
| Multimodal & grounded | 41.7 | 41.7 |
| Reasoning | 45.6 | 46.6 |
| Knowledge | 34.6 | 35.6 |
| Instruction following | 67 | 68 |
| Multilingual | 61.4 | 61.4 |
| Math | 49.4 | 50.4 |
Gemma 3 27B and LFM2-24B-A2B are tied on overall score, so the right pick depends on which category matters most for your use case.
LFM2-24B-A2B has the edge for knowledge tasks in this comparison, averaging 35.6 versus 34.6. Inside this category, MMLU is the benchmark that creates the most daylight between them.
LFM2-24B-A2B has the edge for coding in this comparison, averaging 18 versus 16. Inside this category, HumanEval is the benchmark that creates the most daylight between them.
LFM2-24B-A2B has the edge for math in this comparison, averaging 50.4 versus 49.4. Inside this category, AIME 2023 is the benchmark that creates the most daylight between them.
LFM2-24B-A2B has the edge for reasoning in this comparison, averaging 46.6 versus 45.6. Inside this category, SimpleQA is the benchmark that creates the most daylight between them.
Gemma 3 27B has the edge for agentic tasks in this comparison, averaging 34.4 versus 33.4. Inside this category, BrowseComp is the benchmark that creates the most daylight between them.
Gemma 3 27B and LFM2-24B-A2B are effectively tied for multimodal and grounded tasks here, both landing at 41.7 on average.
LFM2-24B-A2B has the edge for instruction following in this comparison, averaging 68 versus 67. Inside this category, IFEval is the benchmark that creates the most daylight between them.
Gemma 3 27B and LFM2-24B-A2B are effectively tied for multilingual tasks here, both landing at 61.4 on average.
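Because the overall scores tie, the practical way to break the tie is to weight the category averages above by how much each category matters to your workload. A minimal sketch of that idea, using the category averages from this comparison (the weighting scheme itself is illustrative, not part of the leaderboard's methodology):

```python
# Category averages taken from the comparison table above.
scores = {
    "agentic":               {"Gemma 3 27B": 34.4, "LFM2-24B-A2B": 33.4},
    "coding":                {"Gemma 3 27B": 16.0, "LFM2-24B-A2B": 18.0},
    "multimodal & grounded": {"Gemma 3 27B": 41.7, "LFM2-24B-A2B": 41.7},
    "reasoning":             {"Gemma 3 27B": 45.6, "LFM2-24B-A2B": 46.6},
    "knowledge":             {"Gemma 3 27B": 34.6, "LFM2-24B-A2B": 35.6},
    "instruction following": {"Gemma 3 27B": 67.0, "LFM2-24B-A2B": 68.0},
    "multilingual":          {"Gemma 3 27B": 61.4, "LFM2-24B-A2B": 61.4},
    "math":                  {"Gemma 3 27B": 49.4, "LFM2-24B-A2B": 50.4},
}

def pick(weights):
    """Return the model with the higher weighted category score.

    `weights` maps category name -> importance; categories not
    listed count as weight 0. This is an illustrative heuristic.
    """
    totals = {"Gemma 3 27B": 0.0, "LFM2-24B-A2B": 0.0}
    for category, weight in weights.items():
        for model, score in scores[category].items():
            totals[model] += weight * score
    return max(totals, key=totals.get)

# A coding-heavy workload tips toward LFM2-24B-A2B;
# an agentic-heavy one tips toward Gemma 3 27B.
print(pick({"coding": 2, "agentic": 1}))   # LFM2-24B-A2B
print(pick({"agentic": 3, "coding": 1}))   # Gemma 3 27B
```

With margins this thin, small changes in the weights flip the pick, which is exactly why the category table matters more than the headline tie.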