Luisa Crawford
Feb 02, 2026 19:30
Together AI demonstrates that fine-tuned open-source LLMs can outperform GPT-5.2 as evaluation judges using just 5,400 preference pairs, slashing costs dramatically.
Fine-tuned open-source large language models can now outperform OpenAI’s GPT-5.2 at evaluating AI outputs, and at a fraction of the cost. Together AI released research showing their GPT-OSS 120B model achieved 62.63% accuracy on human preference alignment after Direct Preference Optimization training, surpassing GPT-5.2’s 61.62% baseline while running 14x faster and costing 15x less per token.
The findings matter for any organization running AI evaluation pipelines at scale. GPT-5.2 currently charges $1.75 per million input tokens and $14 per million output tokens. The fine-tuned GPT-OSS 120B? Just $0.15 and $0.60 respectively.
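To make the pricing gap concrete, here is a quick back-of-the-envelope comparison. The per-million-token rates are the ones quoted above; the workload volumes are purely illustrative:

```python
# Per-million-token rates quoted above (USD)
GPT52 = {"input": 1.75, "output": 14.00}
GPT_OSS_120B = {"input": 0.15, "output": 0.60}

def monthly_cost(rates, input_tokens_m, output_tokens_m):
    """Cost in USD for a workload expressed in millions of tokens."""
    return rates["input"] * input_tokens_m + rates["output"] * output_tokens_m

# Hypothetical judging workload: 500M input tokens, 50M output tokens per month
workload = (500, 50)
print(monthly_cost(GPT52, *workload))        # 1575.0
print(monthly_cost(GPT_OSS_120B, *workload)) # 105.0  (15x cheaper on this mix)
```

On this input-heavy mix the gap works out to exactly 15x; the precise ratio depends on how a given pipeline splits input and output tokens.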
The Training Approach
Together AI used DPO, a technique introduced in late 2023 that bypasses the complex reinforcement learning loops of traditional RLHF. Instead of training a separate reward model, DPO directly adjusts the language model’s weights using preference pairs: one preferred response and one rejected response for each prompt.
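The DPO objective for a single preference pair can be sketched in plain Python. This is the published loss formula in miniature, not Together AI’s training code; the β value and log-probability inputs are placeholders:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy being trained and a frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen
    # response more strongly than the reference model does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response more than the reference: loss drops
# below log(2), the no-preference starting point.
assert dpo_loss(-10.0, -12.0, -11.0, -11.0) < math.log(2)
```

Because the reward model is implicit in this margin term, training reduces to ordinary gradient descent on pairs, which is what makes the short wall-clock times below possible.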
The training data came from RewardBench 2, a benchmark containing examples with human-labeled preferred and rejected responses across six categories: safety, factuality, math, precise instruction following, focus, and ties. From roughly 1,500 training examples, the team generated 5,407 preference pairs.
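One common way a few thousand labeled examples expand into several times as many pairs is to cross every chosen response with every rejected response for the same prompt. That expansion step is an assumption about the pipeline, not a detail confirmed in the research:

```python
def build_preference_pairs(examples):
    """Expand labeled examples into (prompt, chosen, rejected) pairs.

    Assumes each example carries one or more chosen and rejected
    responses; every chosen/rejected combination yields one pair.
    """
    pairs = []
    for ex in examples:
        for chosen in ex["chosen"]:
            for rejected in ex["rejected"]:
                pairs.append((ex["prompt"], chosen, rejected))
    return pairs

examples = [
    {"prompt": "2+2?", "chosen": ["4"], "rejected": ["5", "22"]},
    {"prompt": "Capital of France?", "chosen": ["Paris"], "rejected": ["Lyon"]},
]
print(len(build_preference_pairs(examples)))  # 3
```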
Training took just 1.5 hours for GPT-OSS 120B using LoRA (Low-Rank Adaptation) with a learning rate of 5e-6 over three epochs.
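Gathered into one place, the reported hyperparameters look roughly like the sketch below. The field names are illustrative; the actual cookbook may spell them differently:

```python
# Hyperparameters reported in the article; keys are illustrative, not the
# cookbook's actual schema.
dpo_training_config = {
    "base_model": "gpt-oss-120b",
    "method": "dpo",           # Direct Preference Optimization
    "adapter": "lora",         # Low-Rank Adaptation: trains small low-rank matrices
    "learning_rate": 5e-6,
    "epochs": 3,
    "preference_pairs": 5407,
    "reported_hours": 1.5,
}
```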
Where Open Models Excel
The category-level breakdown reveals where fine-tuning delivered the biggest wins. GPT-OSS 120B after DPO beat GPT-5.2 on math evaluation by 10.3 percentage points and on focus (response quality assessment) by 6.3 points.
Safety evaluation proved easiest across all models, averaging 91.32% accuracy, which is unsurprising given these models undergo extensive safety training. Factuality detection hit 85.23%. The hardest category? Focus, where models averaged just 10.13% accuracy, highlighting how challenging subjective quality judgments remain.
One wrinkle: Qwen3 235B, which already beat GPT-5.2 out of the box at 62.63%, actually regressed slightly to 61.28% after fine-tuning. Not every model benefits from additional training, reinforcing that validation remains essential.
The Broader Implications
The “LLM-as-a-judge” paradigm has become standard for evaluating AI outputs at scale because judging is fundamentally simpler than generating. A model producing a response must juggle context, follow multi-step instructions, and synthesize information. Evaluating that response is a focused classification task.
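In practice, a pairwise judge reduces to a constrained classification call: format both candidates into one prompt and parse a single-letter verdict. The template and parser below are a minimal illustration and assume no particular model API:

```python
def judge_prompt(question, response_a, response_b):
    """Build a pairwise-comparison prompt for an LLM judge (illustrative template)."""
    return (
        "You are grading two answers to the same question.\n"
        f"Question: {question}\n"
        f"Answer A: {response_a}\n"
        f"Answer B: {response_b}\n"
        "Reply with exactly one letter, A or B, naming the better answer."
    )

def parse_verdict(model_output):
    """Map the judge's raw text to a label; None if it did not comply."""
    verdict = model_output.strip().upper()[:1]
    return verdict if verdict in ("A", "B") else None

print(parse_verdict(" b \n"))  # B
```

Constraining the output to a single token is what makes judge accuracy easy to score against human preference labels, as in the RewardBench 2 evaluation above.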
This research suggests organizations can build evaluation pipelines using open-source models they control entirely: no API dependencies, full visibility into model behavior, and the ability to fine-tune for specific domains. The cost savings at production scale are substantial.
Together AI published the full methodology in a cookbook notebook for teams looking to replicate the process with their own preference data.
Picture supply: Shutterstock

