Rongchai Wang
Mar 09, 2026 21:14
Claude Code now deploys teams of AI agents to review every pull request, catching bugs human reviewers miss. Available for Team and Enterprise at $15-25 per review.
Anthropic launched Code Review for Claude Code on March 9, deploying multiple AI agents to analyze pull requests with a depth the company claims catches bugs that quick human scans often miss. The feature enters research preview for Team and Enterprise customers.
The timing addresses a real bottleneck. Anthropic reports code output per engineer jumped 200% over the past year, straining review capacity. Before Code Review, just 16% of the company's internal PRs received substantive comments. That figure now sits at 54%.
How the System Operates
When developers open a pull request, Code Review spawns a team of agents working in parallel. These agents hunt for bugs independently, cross-verify findings to filter false positives, then rank issues by severity. The output lands as a single review comment plus inline annotations for specific problems.
Review depth scales automatically. Large, complex changes get more agents and longer analysis; trivial updates get a quick pass. Average review time runs around 20 minutes, according to Anthropic.
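Anthropic hasn't published the pipeline's internals, but the behavior described above maps onto a familiar fan-out, cross-verify, and rank pattern. Here is a minimal Python sketch of that shape; the agent stub, line-count thresholds, and quorum rule are all assumptions for illustration, not Anthropic's implementation:

```python
import concurrent.futures
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: int  # higher means more severe
    note: str

def agent_review(agent_id: int, diff: str) -> set[Finding]:
    # Stand-in for one independent reviewer agent; a real system would
    # prompt an LLM with the diff here. Returns an empty set so the
    # sketch runs end to end.
    return set()

def num_agents(diff: str) -> int:
    # Scale review depth with change size, mirroring the behavior the
    # article describes. The thresholds are invented for illustration.
    changed = sum(1 for l in diff.splitlines() if l.startswith(("+", "-")))
    if changed < 50:
        return 1    # trivial updates get a quick pass
    if changed < 1000:
        return 3
    return 6        # large, complex changes get more agents

def review_pr(diff: str, quorum: int = 2) -> list[Finding]:
    n = num_agents(diff)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        per_agent = list(pool.map(lambda i: agent_review(i, diff), range(n)))
    votes = Counter(f for findings in per_agent for f in findings)
    # Cross-verification: keep findings at least `quorum` agents agree on,
    # then rank by severity for the single summary comment.
    confirmed = [f for f, c in votes.items() if c >= min(quorum, n)]
    return sorted(confirmed, key=lambda f: -f.severity)
```

Requiring agreement between independent agents is one plausible way to filter false positives; whatever Anthropic actually does, the low dispute rate it reports suggests some such verification step.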
The agents won't approve PRs; that remains a human decision. But the system aims to ensure reviewers aren't rubber-stamping code they haven't actually examined.
Internal Results Tell the Story
Anthropic's internal testing shows clear patterns. On PRs exceeding 1,000 changed lines, 84% receive findings, averaging 7.5 issues flagged. Smaller PRs under 50 lines see findings on just 31%, averaging half an issue. Engineers dispute fewer than 1% of findings as incorrect.
One case stood out: a single-line change to a production service, the kind of diff that typically gets waved through, would have broken authentication entirely. Code Review flagged it as critical before merge. The engineer admitted they wouldn't have caught it manually.
Early access customers report similar catches. On a ZFS encryption refactor in TrueNAS's open-source middleware, the system spotted a pre-existing bug in adjacent code: a type mismatch silently wiping the encryption key cache on every sync. That's the kind of latent issue hiding in code a PR happens to touch, invisible to reviewers scanning changesets.
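The TrueNAS report doesn't include the offending code, but the bug class is easy to reproduce. A minimal Python sketch, assuming a cache keyed by integer dataset IDs and a sync path that passes strings:

```python
# Hypothetical reconstruction of the bug class, not TrueNAS's actual code.
# The cache keys are ints, but the sync path receives IDs as strings, so
# the membership test never matches and every entry is evicted on sync.
key_cache: dict[int, bytes] = {42: b"...key material..."}

def sync_key_cache(live_dataset_ids: list[str]) -> None:
    live = set(live_dataset_ids)       # {"42"}: strings, not ints
    for ds_id in list(key_cache):
        if ds_id not in live:          # 42 != "42", so this is always True
            del key_cache[ds_id]       # silently wipes the cached key

sync_key_cache(["42"])
assert not key_cache  # key is gone even though dataset 42 is still live
```

Nothing crashes and nothing logs, which is exactly why a reviewer scanning the changeset would never see it.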
Pricing and Controls
This isn't cheap. Reviews bill on token usage, averaging $15-25 per PR depending on size and complexity. That's significantly pricier than Anthropic's existing open-source GitHub Action, which remains available for lighter-weight checks.
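For teams budgeting adoption, token-based billing is straightforward to model. A quick sketch, with hypothetical per-token rates standing in for whatever rates actually apply:

```python
# Back-of-envelope budgeting for token-billed reviews. The per-token rates
# here are assumptions for illustration, not Anthropic's published pricing.
INPUT_RATE = 15 / 1_000_000    # assumed dollars per input token
OUTPUT_RATE = 75 / 1_000_000   # assumed dollars per output token

def review_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A large PR fanned out to several agents might consume on the order of a
# million input tokens; the result lands inside the quoted $15-25 band.
print(f"${review_cost(1_000_000, 60_000):.2f}")  # $19.50
```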
Admins get spending controls: monthly organization caps, repository-level toggles, and an analytics dashboard tracking review counts, acceptance rates, and costs. Once enabled, reviews trigger automatically on new PRs with no developer configuration required.
The release follows Claude Code Security's limited preview launch on February 20, which scans codebases for vulnerabilities. Together, these features position Claude Code as increasingly comprehensive infrastructure for enterprise development teams willing to pay for depth over speed.
Image source: Shutterstock

