A 30-task subset of MLS-Bench that evaluates whether AI systems can invent generalizable and scalable machine-learning methods.
BenchLM mirrors the published score view for MLS-Bench Lite. Kimi K2.7 Code leads the public snapshot at 35.1%. BenchLM does not use these results to rank models overall.
Year: 2026
Tasks: 30 machine-learning research tasks
Format: Agentic ML task evaluation
Difficulty: ML research and systems engineering
Moonshot reports MLS-Bench Lite as a coding-agent result for Kimi K2.7 Code. BenchLM stores the exact provider-reported value separately from its weighted coding benchmarks because MLS-Bench Lite is a newly reported benchmark variant with sparse public model coverage.
Version: MLS-Bench Lite 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Kimi K2.7 Code by Moonshot AI currently leads with a score of 35.1% on MLS-Bench Lite.
One AI model has been evaluated on MLS-Bench Lite on BenchLM.