Trinity-Large-Thinking Benchmark Scores & Performance

BenchLM is tracking Trinity-Large-Thinking by Arcee AI. This profile is currently excluded from the public leaderboard because it does not yet have enough verified benchmark coverage to rank safely; only verified public benchmark rows appear below.

Trinity-Large-Thinking is an open-weight model with a 512K-token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.

This profile currently has verified scores for 9 of the 83 benchmarks BenchLM tracks. BenchLM only exposes verified benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.

Provider

Arcee AI

Source Type

Open Weight

Reasoning

Reasoning

Context Window

512K

Model Status

Tracked

Overall Score

Unranked

Pricing

$0.25 / $0.90

Input / output per 1M
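At the listed rates, per-request cost is simple arithmetic. A minimal sketch, assuming the listed $0.25 input / $0.90 output per-1M-token pricing; the helper name and example token counts are illustrative, and note that for a reasoning model the chain-of-thought tokens typically bill as output:

```python
# Estimate request cost from token counts at the listed per-1M-token rates.
INPUT_RATE_PER_M = 0.25   # USD per 1M input tokens (listed rate)
OUTPUT_RATE_PER_M = 0.90  # USD per 1M output tokens (listed rate)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens * INPUT_RATE_PER_M
            + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000

# Example: a 20K-token prompt with a 4K-token response (reasoning tokens
# count toward output, so chain-of-thought inflates the output side).
print(f"${request_cost(20_000, 4_000):.4f}")  # → $0.0086
```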

Runtime

N/A

Latency unavailable

Rankings Overview

BenchLM still lacks enough verified benchmark coverage to rank this model on the public leaderboard. Only verified public benchmark rows are shown below.

Knowledge Benchmarks

GPQA-D
76.3%

GPQA-D 2026 · Quarterly refresh · updated April 1, 2026

MMLU-Pro (Arcee)
83.4%

MMLU-Pro (Arcee) 2026 · Quarterly refresh · updated April 1, 2026

Coding Benchmarks

SWE-bench Verified*
63.2%

SWE-bench Verified* 2026 · Quarterly refresh · updated April 1, 2026

Mathematics Benchmarks

AIME25 (Arcee)
96.3%

AIME25 (Arcee) 2026 · Quarterly refresh · updated April 1, 2026

Agentic Benchmarks

Tau2-Airline
88.0%

Tau2-Airline 2026 · Quarterly refresh · updated April 1, 2026

Tau2-Telecom
94.7%

Tau2-Telecom 2026 · Quarterly refresh · updated April 1, 2026

PinchBench
91.9%

PinchBench 2026 · Quarterly refresh · updated April 1, 2026

BFCL v4
70.1%

BFCL v4 2026 · Quarterly refresh · updated April 1, 2026

Instruction Following Benchmarks

IFBench
52.3%

IFBench 2026 · Quarterly refresh · updated April 1, 2026

Frequently Asked Questions

How does Trinity-Large-Thinking perform overall in AI benchmarks?

Trinity-Large-Thinking has 9 verified benchmark scores on BenchLM, but it does not yet have enough coverage to receive a global overall rank.

Is Trinity-Large-Thinking good for knowledge and understanding?

Trinity-Large-Thinking has visible benchmark coverage in knowledge and understanding, but BenchLM does not currently assign it a global category rank there.

Is Trinity-Large-Thinking good for coding and programming?

Trinity-Large-Thinking has visible benchmark coverage in coding and programming, but BenchLM does not currently assign it a global category rank there.

Is Trinity-Large-Thinking good for mathematics?

Trinity-Large-Thinking has visible benchmark coverage in mathematics, but BenchLM does not currently assign it a global category rank there.

Is Trinity-Large-Thinking good for agentic tool use and computer tasks?

Trinity-Large-Thinking has visible benchmark coverage in agentic tool use and computer tasks, but BenchLM does not currently assign it a global category rank there.

Is Trinity-Large-Thinking good for instruction following?

Trinity-Large-Thinking has visible benchmark coverage in instruction following, but BenchLM does not currently assign it a global category rank there.

Is Trinity-Large-Thinking open source?

Trinity-Large-Thinking is an open-weight model created by Arcee AI: its weights can be downloaded and run locally or fine-tuned for specific use cases. Note that open weights are not necessarily the same as a fully open-source release, which would also cover training code and data.

Does Trinity-Large-Thinking have full benchmark coverage on BenchLM?

Not yet. Trinity-Large-Thinking currently has 9 verified benchmark scores out of the 83 benchmarks BenchLM tracks. BenchLM only exposes verified public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of Trinity-Large-Thinking?

Trinity-Large-Thinking has a context window of 512K tokens, which determines how much text it can process in a single interaction.
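As a rough illustration of what a 512K-token window means in practice, here is a minimal sketch, assuming K = 1024 and the common ~4 characters-per-token heuristic (actual counts depend on the model's tokenizer):

```python
# Rough check of whether a document fits in a 512K-token context window.
CONTEXT_WINDOW_TOKENS = 512 * 1024  # 512K tokens (assuming K = 1024)
CHARS_PER_TOKEN = 4                 # common heuristic for English text, not exact

def fits_in_context(text: str) -> bool:
    """Estimate whether `text` fits in the context window."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens <= CONTEXT_WINDOW_TOKENS

# ~2M characters ≈ 500K tokens, just under the window.
print(fits_in_context("x" * 2_000_000))  # → True
print(fits_in_context("x" * 3_000_000))  # → False
```

For real workloads, the model's own tokenizer should be used instead of the heuristic, and the budget must also leave room for the generated output and any chain-of-thought tokens.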

Last updated: April 1, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.

Weekly LLM Updates

New model releases, benchmark scores, and leaderboard changes. Every Friday.

Free. Your signup is stored with a derived country code for compliance routing.