IDE Bench
Assessing AI agents across real-world software engineering workflows—measuring how models navigate, reason, and execute complex development tasks.
Jan 20, 2026
Market Bench
Evaluating AI models on real-world market scenarios—measuring how they reason, predict, and make decisions under dynamic conditions.
Dec 13, 2025
App Bench
A benchmark for evaluating how well AI coding agents can generate real web apps from a single natural language prompt. One-shot generations. Zero human edits.
Oct 25, 2025
Finance Arena
Evaluating AI models on real-world financial analysis tasks—measuring how they reason, interpret data, and make decisions under uncertainty.
Jan 30, 2025
Vader Bench
A comprehensive human-evaluated benchmark for assessing LLM performance in software security.
May 26, 2025
Leet Bench
A benchmark testing models on novel programming challenges that require deep understanding of algorithms, data structures, and efficient code implementation.
Jul 21, 2025