React Native Evals measures whether AI coding models can complete real React Native implementation tasks across navigation, animation, and async state. Here's what it tests, why it matters, and how it differs from SWE-bench and LiveCodeBench.
React Native Evals is one of the clearest examples of where AI coding benchmarks are heading next: less abstract algorithm work, more framework-specific product implementation. It is an open benchmark from Callstack focused on real React Native tasks, not generic Python patches or contest problems.
That makes it useful for a very specific reason. Benchmarks like SWE-bench Verified, SWE-bench Pro, and LiveCodeBench tell you a lot about general coding strength. They do not tell you enough about whether a model understands the quirks of a production mobile stack.
The public React Native Evals dashboard describes itself as an evaluation framework for AI coding agents on React Native code generation tasks. It emphasizes three things: realistic task areas such as navigation, animation, and async state; repeated runs per task; and operational metrics like token usage and cost. That reporting makes it more operational than many older benchmark pages.
That is important because React Native work is rarely about one isolated function. It usually involves lifecycle behavior, state hydration, platform-friendly patterns, and library-specific integrations that are easy to get almost right but still ship broken UX.
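To make that "almost right but still broken UX" failure mode concrete, here is a minimal sketch of a classic async-state bug the benchmark's async-state task group targets: a slow response from an earlier request overwriting state set by a newer one. This is plain TypeScript with no React Native APIs; `fetchUser`, `makeUserLoader`, and the `User` shape are illustrative names, not part of the benchmark.

```typescript
// Sketch of a stale-response guard, a common async-state pattern in
// React Native screens. Without the token check, a slow response for a
// previous screen can clobber the state of the current one.

type User = { id: number; name: string };

function makeUserLoader(fetchUser: (id: number) => Promise<User>) {
  let latestRequest = 0;        // token identifying the most recent request
  let current: User | null = null;

  return {
    // Load a user; ignore the response if a newer load started meanwhile.
    async load(id: number): Promise<void> {
      const token = ++latestRequest;
      const user = await fetchUser(id);
      if (token === latestRequest) {
        current = user;         // only the newest request may write state
      }
    },
    get(): User | null {
      return current;
    },
  };
}
```

Code like this compiles and renders either way; only the guarded version behaves correctly when responses arrive out of order, which is exactly the kind of distinction a unit-level benchmark can miss and a framework benchmark can catch.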
As of the public March 24, 2026 overview snapshot, the top React Native Evals rows are:
| Model | Overall score |
|---|---|
| Composer 2 | 96.2 |
| Claude Opus 4.6 | 84.4 |
| GPT-5.4 | 82.6 |
| GPT-5.3 Codex | 80.9 |
| Gemini 3.1 Pro | 78.9 |
| Claude Sonnet 4.6 | 77.9 |
| Kimi K2.5 | 74.9 |
| GLM-5 | 74.2 |
| Grok 4 | 70.1 |
| DeepSeek V3.2 | 69.0 |
| GPT-OSS 120B | 66.4 |
| GPT-OSS 20B | 64.3 |
| Qwen2.5 Coder 32B Instruct | 42.7 |
| DeepSeek R1 Distill Qwen 32B | 31.8 |
That snapshot is useful because it makes the benchmark concrete. React Native Evals is not just another abstract coding score: it already separates frontier models sharply on actual mobile implementation tasks, with Composer 2 opening a gap of nearly 12 points over the next-best visible model.
Generic coding benchmarks still matter: SWE-bench Verified and SWE-bench Pro measure real repository repair, and LiveCodeBench measures fresh algorithmic reasoning.
But none of those are designed around React Native-specific implementation quality. A model can look strong on repository repair or algorithmic reasoning and still make poor choices in app state, navigation, or mobile UI behavior.
React Native Evals is more like a framework benchmark than a general coding benchmark. That makes it narrower, but also more predictive if your actual product work lives inside the React Native ecosystem.
| Benchmark | Best for | What it misses |
|---|---|---|
| SWE-bench | Real repository bug-fixing | Framework-specific product behavior |
| LiveCodeBench | Fresh algorithmic and reasoning signal | Product architecture and mobile integration |
| React Native Evals | React Native app implementation | Broad cross-language software engineering coverage |
That means React Native Evals should not replace the main coding benchmarks on BenchLM. It should sit beside them.
If you are choosing a model for a general coding assistant, the weighted coding leaderboard is still the right first stop. If you are choosing a model for a React Native team, React Native Evals becomes a valuable second filter.
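One way to picture that "second filter" idea is a two-stage shortlist: rank by a general coding score first, then keep only models clearing a React Native Evals floor. This is a sketch of the reading strategy described above, not BenchLM's actual weighting formula; the `generalCoding` values and the 75-point floor below are hypothetical, while the `rnEvals` values come from the snapshot table.

```typescript
// Illustrative two-stage model selection for a React Native team.
type ModelScore = { model: string; generalCoding: number; rnEvals: number };

function shortlist(models: ModelScore[], rnFloor: number): string[] {
  return models
    .filter((m) => m.rnEvals >= rnFloor)                // second filter: RN Evals floor
    .sort((a, b) => b.generalCoding - a.generalCoding)  // first stop: general coding rank
    .map((m) => m.model);
}
```

The design point is the order of operations: the general leaderboard does the ranking, and the specialist benchmark only prunes, which keeps a narrow benchmark from dominating a broad decision.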
BenchLM currently tracks React Native Evals as a display benchmark, not a weighted coding input. That is the right posture for now.
Why: the benchmark is narrow by design, covering a single framework rather than the broad cross-language engineering signal the weighted coding inputs are meant to capture.
In practice, that means you should read React Native Evals as a specialist benchmark. It is not the answer to "what is the best coding model overall?" It is closer to the answer for "which model is strongest for React Native implementation work?"
React Native Evals matters because it measures something mainstream coding benchmarks underweight: framework-specific mobile delivery. If your team ships React Native apps, this is exactly the kind of benchmark you should want next to the usual SWE-bench and LiveCodeBench signals.
Use the coding leaderboard for the broad picture. Use React Native Evals when mobile app implementation quality is part of the decision.
→ See the coding leaderboard · Benchmark page · What benchmarks actually measure
**What is React Native Evals?** React Native Evals is an open benchmark from Callstack that evaluates AI coding agents on real React Native implementation tasks. It focuses on app behavior, architecture, and constraint adherence.

**What does React Native Evals measure?** It measures framework-specific mobile development ability across task groups like navigation, animation, and async state, with repeated runs and cost tracking on the public dashboard.

**How is it different from SWE-bench and LiveCodeBench?** SWE-bench measures repo bug-fixing, LiveCodeBench measures fresh coding problems, and React Native Evals measures framework-specific React Native implementation. They are complementary.

**Does React Native Evals change BenchLM's coding rankings?** Not yet. BenchLM tracks it as a display benchmark under coding, but it is not currently part of the weighted coding formula.

**Why does React Native Evals matter?** Because mobile product work depends on framework-specific patterns that general coding benchmarks often miss. It provides a more relevant signal for teams building in React Native.
Source benchmark materials from React Native Evals, Callstack's announcement, and the project repository.