We release 67,000+ trajectories from 3,800 resolved issues across 1,800+ Python repos, about 3x more successful trajectories and 1.5x more repos than our previous dataset. Trajectories are long: 64 turns on average, up to 100 turns, with context lengths up to 131k tokens.
> RFT on this data improves SWE-bench Verified Pass@1: Qwen3-30B-Instruct 25.7% → 50.3%, Qwen3-235B-Instruct 46.2% → 61.7%. We also see strong gains on SWE-rebench (September).
> We also ran extensive evals: OpenHands with 100- and 500-turn limits, comparing models under both, on SWE-bench Verified and several months of SWE-rebench.
> We also validate the tests written by the models: how often a generated test is correct, and how often the final patch passes the tests written alongside it. This yields a pool of tests for verifiers and automatic graders.
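As a rough illustration of the kind of check this implies (not our actual pipeline), here is a minimal sketch of a fail-to-pass check for a model-written test: the test should fail on the unpatched checkout and pass once the model's final patch is applied. The `run_pytest` and `apply_patch` helpers, the plain-git checkout, and pytest as the test runner are all assumptions for the sketch; in practice each check would run in a fresh, isolated environment.

```python
import subprocess
from pathlib import Path


def run_pytest(repo: Path, test_path: str) -> bool:
    """Run one test file/node with pytest and report whether it passed."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-q", test_path],
        cwd=repo, capture_output=True, text=True,
    )
    return proc.returncode == 0


def apply_patch(repo: Path, patch_file: Path) -> bool:
    """Apply the model's final patch (a unified diff) to the checkout."""
    proc = subprocess.run(
        ["git", "apply", str(patch_file)],
        cwd=repo, capture_output=True, text=True,
    )
    return proc.returncode == 0


def validate_generated_test(repo: Path, test_path: str, patch_file: Path) -> bool:
    """Count a model-written test as correct only if it is fail-to-pass:
    it fails before the final patch is applied and passes afterwards.
    Note: this mutates the checkout, so each call assumes a fresh clone."""
    fails_before = not run_pytest(repo, test_path)
    if not apply_patch(repo, patch_file):
        return False
    passes_after = run_pytest(repo, test_path)
    return fails_before and passes_after
```

Tests that clear this kind of check are what make up the pool usable by verifiers and automatic graders.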