Leaderboards
Rigorous benchmarks, not cherry-picked results.
We design custom evaluations that measure the model capabilities you specify.
Collaborate With Us
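As an illustration of what a custom evaluation boils down to, here is a minimal sketch that scores a model as a simple pass rate, the same kind of percentage most of the leaderboards below report. This is a hypothetical example only: the Task, run_eval, and toy_model names are illustrative and not part of any benchmark listed here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str    # input shown to the model
    expected: str  # reference answer used for grading

def run_eval(tasks: list[Task], model: Callable[[str], str]) -> float:
    """Return the fraction of tasks the model answers exactly right."""
    passed = sum(1 for t in tasks if model(t.prompt).strip() == t.expected)
    return passed / len(tasks)

# Toy stand-ins, purely for illustration; a real harness would call a model
# API and use task-appropriate graders rather than exact string match.
tasks = [
    Task(prompt="2 + 2 =", expected="4"),
    Task(prompt="Capital of France?", expected="Paris"),
]

def toy_model(prompt: str) -> str:
    return {"2 + 2 =": "4", "Capital of France?": "Rome"}[prompt]

print(f"Pass rate: {run_eval(tasks, toy_model):.1%}")  # Pass rate: 50.0%
```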
IDE-Bench: Evaluating AI Agents on Software Engineering
1. Claude Sonnet 4 (Anthropic): 82.5%
2. Claude Sonnet 4 Think (Anthropic): 78.75%
3. o3 (OpenAI): 76.25%
4. Gemini 2.5 Pro (Google): 72.5%
5. Claude Code (Anthropic): 71.25%
6. o4-mini (OpenAI): 62.5%
7. GPT-4.1 (OpenAI): 58.75%
8. Qwen 3 235B (Qwen): 57.5%
Market-Bench: Introductory Quantitative Trading
1. Grok 4: 443
2. GPT-5.2: 969
3. Gemini 3 Pro Preview: 1,744
4. GPT-5.1 Codex Max: 4,243
5. DeepSeek V3.2: 4,576
6. Claude Sonnet 4.5: 5,127
7. Claude Opus 4.5: 6,040
8. Command A: 6,562
App-Bench: AI Web App Generation
1. Orchids: 76.8%
2. Claude Code (Opus 4.5): 67.5%
3. v0: 64.9%
4. Bolt: 53.6%
5. Google AI Studio (Gemini 3 Pro Preview): 50.3%
6. Codex (gpt-5.1-codex-max): 38.4%
7. Replit: 35.1%
8. Cursor (Composer 1): 27.8%
FinanceArena: FinanceQA, Assumption-Based
1. OpenAI o3: 21.7%
2. Anthropic Claude Opus 4: 13.0%
3. xAI Grok 4: 10.9%
4. Qwen QwQ-32B: 10.9%
5. OpenAI GPT-4o mini: 10.9%
6. Meta Llama 4 Maverick: 8.7%
7. xAI Grok 3: 8.7%
8. Google DeepMind Gemini 2.5 Pro: 6.5%
VADER: Vulnerability Assessment, Detection, Explanation, and Remediation
1. OpenAI o3: 54.6%
2. Gemini 2.5 Pro: 53.6%
3. Claude 3.7: 52.3%
4. Grok 3 Beta: 52.0%
5. GPT-4.1: 50.0%
6. GPT-4.5: 49.2%
LeetBench: A Benchmark for Competitive Programming & Algorithmic Reasoning
1. OpenAI o3: 46.0%
2. Anthropic Opus 4: 37.6%
3. DeepMind Gemini 2.5 Pro: 23.4%
4. Amazon Nova Premier: 10.0%
5. Mistral Magistral Medium: 9.4%
6. xAI Grok 3: 8.4%
7. NVIDIA Llama-3.1-Nemotron-Ultra-253B-v1: 8.4%
8. Meta Llama 4 Maverick: 6.4%
9. Microsoft Phi 4: 4.4%
Ready to build better AI?
Contact Us