Luisa Crawford
Jul 02, 2025 17:58
DeepSWE-Preview, a sophisticated coding agent, sets new benchmarks in open-source AI with a 59% success rate on SWE-Bench-Verified, showcasing state-of-the-art performance using reinforcement learning.
In a significant advancement for AI-driven software development, DeepSWE-Preview has emerged as a groundbreaking open-source coding agent. Developed through a collaboration between the Agentica team and Together AI, the agent uses reinforcement learning (RL) to achieve a 59% pass rate on the SWE-Bench-Verified benchmark, according to Together AI.
Revolutionizing Software Engineering
DeepSWE-Preview is built on the Qwen3-32B model, using only RL to enhance its capabilities. This approach allows the agent to outperform other open-weight coding agents, achieving a Pass@1 rate of 42.2% and a Pass@16 rate of 71.0%. The model was trained over six days on 64 H100 GPUs, tackling 4,500 real-world software engineering tasks sourced from the R2E-Gym training environments.
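For readers unfamiliar with the metric, Pass@k is the probability that at least one of k sampled attempts solves a task, averaged over all tasks. The article does not say how the DeepSWE-Preview figures were computed; the sketch below assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n attempts were sampled per task and c of them passed. The function names and the sample data are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled attempts, c of which passed.

    Uses the common unbiased estimator 1 - C(n - c, k) / C(n, k).
    This is an assumption about methodology, not a detail from the article.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n_attempts, n_passed)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up example: three tasks, 16 attempts each.
example = [(16, 4), (16, 0), (16, 16)]
print(mean_pass_at_k(example, k=1))
print(mean_pass_at_k(example, k=16))
```

With k=1 the estimator reduces to the plain fraction of passing attempts, which is why Pass@1 is always the lowest figure and Pass@16 the more forgiving one.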
Harnessing the Power of rLLM
DeepSWE-Preview was trained with rLLM, Agentica’s framework for post-training language agents. The datasets, code, and training logs are being open-sourced to encourage collaborative efforts to scale and improve agents using RL. The full training recipe for turning a 32B model into an intelligent coding agent is now available to the public, promoting transparency and innovation.
Emergent Behaviors and Performance
DeepSWE-Preview demonstrated emergent behaviors during training, such as anticipating edge cases and running thorough regression tests. These capabilities are crucial for complex software engineering tasks, which require navigating extensive codebases and ensuring compatibility with existing functionality.
Test-Time Scaling and Further Developments
DeepSWE-Preview employs test-time scaling (TTS) to enhance its performance, combining execution-free and execution-based verification methods. This hybrid scaling strategy lifts its SWE-Bench-Verified result well above the 42.2% single-attempt Pass@1 rate, to the 59% headline figure, setting it apart from other models. Future research aims to explore larger models and extend these capabilities to other domains, including web agents.
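The article does not describe how the two kinds of verification are combined, so the following is only a minimal sketch of one plausible best-of-n scheme, not the DeepSWE-Preview method. It assumes each candidate patch carries an execution-free score (for example, a probability from a learned verifier that reads the patch without running it) and an execution-based score (the fraction of regression tests the patch passes), and it picks the candidate with the highest weighted blend. The names `Candidate`, `hybrid_score`, `select_best`, and the `weight` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One sampled patch with both kinds of verification signal (hypothetical)."""
    patch_id: str
    verifier_score: float  # execution-free: score in [0, 1] from reading the patch
    tests_passed: int      # execution-based: regression tests that passed
    tests_total: int       # execution-based: regression tests that were run


def hybrid_score(c: Candidate, weight: float = 0.5) -> float:
    """Blend the two signals; weight is the share given to the execution-free score."""
    # Treat a candidate with no runnable tests as having no execution evidence.
    exec_score = c.tests_passed / c.tests_total if c.tests_total else 0.0
    return weight * c.verifier_score + (1.0 - weight) * exec_score


def select_best(candidates: list[Candidate], weight: float = 0.5) -> Candidate:
    """Return the candidate with the highest blended score (best-of-n selection)."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda c: hybrid_score(c, weight))


# Made-up candidates for a single task.
pool = [
    Candidate("a", verifier_score=0.9, tests_passed=1, tests_total=4),
    Candidate("b", verifier_score=0.6, tests_passed=4, tests_total=4),
    Candidate("c", verifier_score=0.7, tests_passed=0, tests_total=0),
]
print(select_best(pool).patch_id)
```

The point of blending is that each signal covers the other's blind spot: a verifier can be fooled by a plausible-looking patch, while tests can only vouch for the behavior they happen to exercise.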
DeepSWE-Preview represents a pivotal step in democratizing AI development, showcasing the potential of reinforcement learning to tackle long-horizon, multi-step challenges in software engineering. With its open-source nature, it invites the global research community to contribute to and build upon its successes.