A comprehensive benchmark designed to evaluate AI models' capabilities in competitive programming, algorithmic problem-solving, and software engineering tasks. This benchmark tests models on novel programming challenges that require deep understanding of algorithms, data structures, and efficient code implementation.
Our benchmark evaluates AI models across multiple dimensions of programming and algorithmic reasoning
Complex problem-solving requiring deep understanding of algorithms and data structures
All questions were written by our expert annotators, ensuring the dataset consists entirely of novel problems
Programming challenges similar to those found in coding competitions
Addresses the need for challenging, uncontaminated data
Comprehensive coverage of algorithmic problem types and difficulty levels
Problems that require breaking a complex task into simpler subproblems and storing their solutions to avoid redundant computation. Tests understanding of optimal substructure and overlapping subproblems (see the sketch after this list).
Problems that require making locally optimal choices at each step to find a globally optimal solution. Tests ability to identify when greedy strategies are applicable and effective.
Problems involving graph traversal, pathfinding, and graph analysis. Tests understanding of graph representations, search algorithms, and graph theory concepts.
Problems that require systematically exploring all possible solutions by building candidates incrementally and abandoning partial solutions that cannot be completed.
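To make the dynamic programming category concrete, here is a minimal illustrative sketch of the technique on a classic coin-change task (not one of the benchmark problems; the function name min_coins is ours):

from functools import lru_cache

def min_coins(coins: tuple, amount: int) -> int:
    # Fewest coins summing to `amount`, or -1 if impossible.
    @lru_cache(maxsize=None)
    def best(remaining: int) -> float:
        if remaining == 0:
            return 0
        if remaining < 0:
            return float("inf")
        # Optimal substructure: the answer for `remaining` is built from the
        # overlapping subproblems `remaining - coin`; lru_cache stores each
        # subproblem's solution so it is computed only once.
        return 1 + min(best(remaining - coin) for coin in coins)

    result = best(amount)
    return -1 if result == float("inf") else int(result)

assert min_coins((1, 5, 11), 15) == 3   # 5 + 5 + 5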
See an example of the types of algorithmic challenges our benchmark evaluates
You are managing a factory that produces products in batches. Each batch contains some number of products that must be assembled within some amount of time. The factory has a fixed number of workers, each capable of assembling a given number of products per unit of time. However, whenever a worker finishes an assigned batch and a new batch arrives, the worker must take a break and then prepare the new batch, which introduces a delay.
The goal is to determine the minimum number of worker-hours required to assemble all products, taking the break between batches into account. If it is impossible to assemble all products within the given time, return -1.
from typing import List, Tuple

def minimum_worker_hours(N: int, max_capacity: int, batches: List[Tuple[int, int, int]]) -> int:
assert minimum_worker_hours(10, 89, [(1, 2, 82), (2, 2, 31), (3, 4, 63), (6, 7, 18), (9, 9, 44), (10, 11, 95), (13, 13, 52), (13, 15, 39), (15, 16, 70), (17, 18, 54)]) == 571
assert minimum_worker_hours(10, 79, [(2, 2, 70), (2, 10, 35), (10, 10, 76), (11, 11, 66), (12, 12, 75), (12, 14, 88), (15, 16, 76), (17, 18, 97), (19, 20, 105), (20, 20, 46)]) == 862
assert minimum_worker_hours(4, 10, [(1, 3, 1), (3, 3, 10), (5, 6, 15), (7, 8, 1)]) == 36
assert minimum_worker_hours(4, 7, [(2, 4, 9), (7, 8, 13), (8, 8, 7), (9, 9, 5)]) == -1
try:
    minimum_worker_hours(5, 10, [(1, 3, 1), (3, 3, 10), (5, 6, 15), (7, 8, 1)])
    assert False
except ValueError as e:
    assert str(e) == "not valid input"

try:
    minimum_worker_hours(4, 10, [(1, 3, 1), (3, 2, 10), (5, 6, 15), (7, 8, 1)])
    assert False
except ValueError as e:
    assert str(e) == "not valid input"

try:
    minimum_worker_hours(-1, -1, [])
    assert False
except ValueError as e:
    assert str(e) == "not valid input"
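The last three cases above test input validation rather than scheduling. A minimal sketch of the checks they imply follows; it assumes N is the number of batches and that each batch tuple's second value may not precede its first (the helper name _validate_input is illustrative, not part of the task):

from typing import List, Tuple

def _validate_input(N: int, max_capacity: int, batches: List[Tuple[int, int, int]]) -> None:
    # Checks inferred from the failing cases above: the batch count must match N,
    # no value may be negative, and each tuple's second entry must be >= its first.
    if N < 0 or max_capacity < 0 or N != len(batches):
        raise ValueError("not valid input")
    for first, second, products in batches:
        if first < 0 or second < first or products < 0:
            raise ValueError("not valid input")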
This problem requires understanding of:
Comprehensive evaluation framework for assessing AI algorithmic reasoning capabilities
Our benchmark comprises 500 novel, original coding problems specifically designed to test algorithmic reasoning and competitive programming skills. These problems were:
To ensure the integrity of our evaluation:
All models were evaluated using the following standardized prompt:
"You are an expert Python programmer, and here is your task:\n{problem_description}\n\nYour code should pass these tests:\n{test_cases}"
We utilized the Pass@1 (single-shot prompt) accuracy metric, which measures the fraction of problems for which a model's single generated solution passes all provided test cases.
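In this single-sample setting the metric reduces to a simple fraction of problems solved; a sketch (function name ours):

def pass_at_1(results: list) -> float:
    # results[i] is True when the model's single generated solution for
    # problem i passes all of that problem's test cases.
    return sum(results) / len(results) if results else 0.0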
For each model:
Our benchmark reveals critical insights:
Interested in evaluating your models on our complete benchmark or need a larger dataset for model training?
We offer access to our complete evaluation set and a by-request collection of more than 20,000 similar problems. Perfect for researchers looking to benchmark their models thoroughly or acquire training datasets.
Our research findings are advancing foundation model capabilities through specialized, human-generated datasets.