Challenge the newest AI models with your hardest PhD-level exercises

— and learn how to use AI in your math research —

Project Benchmarks

This benchmark is based on 140 submissions, each of which stumps at least one active model; visit the Samples Page for several sample submissions. 56 research mathematicians have contributed to the benchmark. It includes the new models Gemini 3 Pro, GPT-5.1, and Claude Opus 4.5, each performing significantly better than its predecessor. Every submission was attempted once by every model, with failed attempts rerun. All models were queried via the API, using the strongest available version of each. Contact: contact@science-bench.ai
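As a concrete illustration of the protocol above, here is a minimal Python sketch of the evaluation loop, assuming a hypothetical API client and reading "failed attempts rerun" as retrying transient API errors only; the model list, submission format, and fake_api_call function are illustrative stand-ins, not the benchmark's actual code.

```python
import random

# Illustrative stand-ins only: MODELS, SUBMISSIONS, and fake_api_call
# are hypothetical, not the benchmark's actual code or API.
MODELS = ["Gemini 3 Pro", "GPT-5.1", "Claude Opus 4.5"]
SUBMISSIONS = [{"id": i, "problem": f"exercise {i}"} for i in range(3)]

def fake_api_call(model: str, problem: str) -> str:
    """Stand-in for a real API client; occasionally fails transiently."""
    if random.random() < 0.1:
        raise ConnectionError("transient API failure")
    return f"{model} answer to {problem}"

def attempt(model: str, problem: str, max_retries: int = 3):
    """One attempt per (model, submission); only failed API calls are rerun."""
    for _ in range(max_retries):
        try:
            return fake_api_call(model, problem)
        except ConnectionError:
            continue  # transient failure: rerun this attempt
    return None  # give up after repeated API failures

# Every model attempts every submission exactly once.
results = {
    (m, s["id"]): attempt(m, s["problem"])
    for m in MODELS
    for s in SUBMISSIONS
}
```

On this reading, a wrong answer is never retried, so each percentage below reflects a single graded attempt per submission.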
Based on 140 submissions that stump at least 1 active model.
Model Name          Active    Correct Answer
Gemini 3 Pro        active    46%
GPT-5.1             active    41%
GPT-5                         35%
Grok-4              active    23%
o3                  active    23%
Claude Opus 4.5     active    20%
DeepSeek-V3.1       active    20%
Gemini 2.5 Pro                20%
o3-mini                       19%
DeepSeek R1                   18%
Claude Opus 4.1               11%
Gemini 2.5 Flash              10%
Claude Sonnet 4                8%
Based on 130 submissions that stump at least 2 active models.
Model Name          Active    Correct Answer
Gemini 3 Pro        active    40%
GPT-5.1             active    33%
GPT-5                         29%
o3                  active    17%
Grok-4              active    16%
DeepSeek R1                   15%
Gemini 2.5 Pro                15%
o3-mini                       15%
DeepSeek-V3.1       active    14%
Claude Opus 4.5     active    13%
Claude Opus 4.1                7%
Claude Sonnet 4                7%
Gemini 2.5 Flash               6%
Based on 110 submissions that stump at least 3 active models.
Model Name          Active    Correct Answer
Gemini 3 Pro        active    34%
GPT-5.1             active    28%
GPT-5                         22%
Grok-4              active    12%
o3-mini                       12%
Gemini 2.5 Pro                11%
o3                  active    11%
DeepSeek R1                   10%
DeepSeek-V3.1       active     9%
Claude Opus 4.5     active     6%
Claude Sonnet 4                6%
Claude Opus 4.1                5%
Gemini 2.5 Flash               5%
Based on 90 submissions that stump at least 4 active models.
Model Name          Active    Correct Answer
Gemini 3 Pro        active    24%
GPT-5                         17%
GPT-5.1             active    15%
o3-mini                       10%
Claude Sonnet 4                8%
DeepSeek R1                    7%
Gemini 2.5 Pro                 7%
Grok-4              active     7%
Claude Opus 4.1                4%
o3                  active     4%
Claude Opus 4.5     active     3%
DeepSeek-V3.1       active     3%
Gemini 2.5 Flash               3%
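For readers who want to reproduce the cuts above from raw graded results, the following sketch shows one way to compute a "stump at least k active models" leaderboard; the graded data shape, the leaderboard function, and the ACTIVE_MODELS set are assumptions for illustration, not the site's actual pipeline.

```python
from typing import Dict, Tuple

# Assumed data shape: graded[(model, submission_id)] is True iff that
# model answered that submission correctly, with every model having
# attempted every submission. ACTIVE_MODELS mirrors the "active"
# column above and is likewise an assumption.
ACTIVE_MODELS = {"Gemini 3 Pro", "GPT-5.1", "Grok-4", "o3",
                 "Claude Opus 4.5", "DeepSeek-V3.1"}

def leaderboard(graded: Dict[Tuple[str, int], bool], k: int) -> Dict[str, float]:
    """Correct-answer rate per model over submissions that stump
    (i.e., are missed by) at least k active models."""
    models = {m for m, _ in graded}
    subs = {s for _, s in graded}
    # Keep only submissions answered incorrectly by >= k active models.
    hard = [
        s for s in subs
        if sum(not graded[(m, s)] for m in ACTIVE_MODELS & models) >= k
    ]
    if not hard:
        return {}
    return {m: sum(graded[(m, s)] for s in hard) / len(hard) for m in models}

# Example usage: leaderboard(graded, 2) would reproduce the
# 130-submission cut, one percentage per model.
```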