Large Reasoning Models Leaderboard

Evaluation of Open R1 models across a diverse range of benchmarks from LightEval. All scores are reported as accuracy.

Aggregation

How to aggregate results for each model
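
As a rough illustration of what per-model aggregation could look like, the sketch below collapses several evaluation runs of the same model into one score per benchmark. The run records, the benchmark names (`bench_a`, `bench_b`), the `aggregate` helper, and the `max`/`mean`/`latest` modes are all hypothetical and are not taken from the leaderboard itself. Missing scores (shown as null in the table) are skipped rather than counted as zero.

```python
from statistics import mean

# Hypothetical run records: one entry per evaluation run of a model.
# A score of None stands for a benchmark with no result (null).
RUNS = [
    {"model": "model-x", "timestamp": "2025-03-13T08-29-31",
     "scores": {"bench_a": 0.50, "bench_b": None}},
    {"model": "model-x", "timestamp": "2025-03-14T10-00-00",
     "scores": {"bench_a": 0.75, "bench_b": 0.25}},
]


def aggregate(runs, how="max"):
    """Collapse runs into {model: {benchmark: score}} using the chosen mode."""
    reducers = {
        "max": max,                       # best score across runs
        "mean": mean,                     # average score across runs
        "latest": lambda vals: vals[-1],  # score from the most recent run
    }
    if how not in reducers:
        raise ValueError(f"unknown aggregation mode: {how!r}")

    grouped = {}
    # Sorting by timestamp keeps each score list in chronological order,
    # which is what the "latest" mode relies on.
    for run in sorted(runs, key=lambda r: r["timestamp"]):
        per_model = grouped.setdefault(run["model"], {})
        for bench, score in run["scores"].items():
            values = per_model.setdefault(bench, [])
            if score is not None:
                values.append(score)

    reduce_fn = reducers[how]
    return {
        model: {
            bench: (reduce_fn(vals) if vals else None)
            for bench, vals in benches.items()
        }
        for model, benches in grouped.items()
    }


if __name__ == "__main__":
    print(aggregate(RUNS, how="max"))
```

A benchmark with no valid score in any run stays `None` in the output, so an empty cell is never mistaken for an accuracy of 0.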

Select columns to display

1 | deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B_6393b7559e403fd1d80bfead361586fd6f630a4d | 2025-03-13T08-29-31.101 | 0.6531 | 0.0008 | 0.3169 | 0.142 | null | 0.6414 | 0.1954 | 0.914 | 0.6095