Research
Data quality makes all the difference.
We're driven by the conviction that model performance is fundamentally bounded by training data quality. Through expert collaboration, rigorous curation methodologies, and deep domain expertise, we research datasets that power tomorrow's models.

Human expertise, reimagined
Spencer M.
·
Read blog

How AfterQuery Expert Data Drives Model Performance on τ²-bench
Michael E.
Spencer M.
Arya F.
·
Read blog

How We Improved Terminal-Bench 2.0 Scores by Over 5x Using Tinker and Harbor
Spencer M.
Michael E.
Carlos G.
·
Read blog

Solving the Last Mile Problem in Partnership with The Raine Group
Carlos G.
Sam J.
·
Read blog

IDE Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks
Spencer M.
Jeff Y.
Tiana C.
·
Read paper

Market-Bench: Evaluating LLMs on Introductory Quantitative Trading
Abhay S.
Sam J.
Spencer M.
·
Read paper

App-Bench: Evaluating Coding Agents on Generating Economically Useful Web-Apps
Andrew Z.
Sam J.
Spencer M.
·
Read paper

The AfterQuery Thesis
Spencer M.
·
Read blog

UI-Bench: A Benchmark for Evaluating User Interface Understanding
Sam J.
Agustin G.
Spencer M.
·
Read paper
