Llama 4 Behemoth Benchmark Scores & Performance

Benchmark analysis of Llama 4 Behemoth by Meta across 14 tests.

Creator: Meta

Source Type: Open Weight

Reasoning: Non-Reasoning

Context Window: 32K

Overall Score: 39 (#69 of 88)

Knowledge Benchmarks

MMLU: 48
GPQA: 47
SuperGPQA: 45
OpenBookQA: 43

Coding Benchmarks

HumanEval: 40

Mathematics Benchmarks

AIME 2023: 48
AIME 2024: 50
AIME 2025: 49
HMMT Feb 2023: 44
HMMT Feb 2024: 46
HMMT Feb 2025: 45
BRUMO 2025: 47

Reasoning Benchmarks

SimpleQA: 46
MuSR: 44

Frequently Asked Questions

How does Llama 4 Behemoth perform overall in AI benchmarks?

Llama 4 Behemoth ranks #69 out of 88 models with an overall score of 39. It was created by Meta and has a 32K-token context window.

Is Llama 4 Behemoth good for knowledge and understanding?

Llama 4 Behemoth ranks #69 out of 88 models in knowledge and understanding benchmarks, with an average score of 45.8 across MMLU, GPQA, SuperGPQA, and OpenBookQA. With 68 models ranked above it, there are stronger options in this category.

Is Llama 4 Behemoth good for coding and programming?

Llama 4 Behemoth ranks #69 out of 88 models in coding and programming benchmarks, with a score of 40 on HumanEval, the only coding test listed. With 68 models ranked above it, there are stronger options in this category.

Is Llama 4 Behemoth good for mathematics?

Llama 4 Behemoth ranks #69 out of 88 models in mathematics benchmarks, with an average score of 47 across the seven AIME, HMMT, and BRUMO tests. With 68 models ranked above it, there are stronger options in this category.

Is Llama 4 Behemoth good for reasoning and logic?

Llama 4 Behemoth ranks #69 out of 88 models in reasoning and logic benchmarks, with an average score of 45 across SimpleQA and MuSR. With 68 models ranked above it, there are stronger options in this category.

Is Llama 4 Behemoth open source?

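Llama 4 Behemoth is listed as an open weight model created by Meta. Open weight means the trained weights are made available to download, run locally, or fine-tune for specific use cases; it does not necessarily mean the training code and data are open source.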

What is the context window size of Llama 4 Behemoth?

Llama 4 Behemoth has a context window of 32K tokens, which determines how much text it can process in a single interaction.
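
The category averages quoted in the answers above can be reproduced from the individual scores listed on this page. The following is a minimal sketch, assuming each category average is a simple unweighted mean of its tests; the page does not state how the averages or the overall score of 39 are actually computed, and the names `scores` and `category_averages` are illustrative only.

```python
from statistics import mean

# Scores copied from the benchmark lists on this page.
scores = {
    "Knowledge": {"MMLU": 48, "GPQA": 47, "SuperGPQA": 45, "OpenBookQA": 43},
    "Coding": {"HumanEval": 40},
    "Mathematics": {
        "AIME 2023": 48, "AIME 2024": 50, "AIME 2025": 49,
        "HMMT Feb 2023": 44, "HMMT Feb 2024": 46, "HMMT Feb 2025": 45,
        "BRUMO 2025": 47,
    },
    "Reasoning": {"SimpleQA": 46, "MuSR": 44},
}


def category_averages(scores):
    """Return the unweighted mean score per category (assumed method)."""
    return {category: mean(tests.values()) for category, tests in scores.items()}


averages = category_averages(scores)
total_tests = sum(len(tests) for tests in scores.values())

for category, avg in averages.items():
    print(f"{category}: {avg:.2f}")
print(f"Tests counted: {total_tests}")
```

Under that assumption the means come out to 45.75 for knowledge (shown on the page as 45.8), 40 for coding, 47 for mathematics, and 45 for reasoning, over 14 tests in total, which agrees with the figures quoted above.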