Briefly
- Merchandising-Bench Area examined AI brokers working competing merchandising machine companies.
- Prime fashions elevated income by price-fixing, collusion, and misleading ways. Claude was one of the best at these ways.
- GLM-5 defeated Claude by impersonating a teammate and extracting delicate technique.
Researchers at Andon Labs simply answered which AI fashions are finest at working a enterprise. The highest performers all gained by forming unlawful worth cartels, exploiting determined opponents, and mendacity to clients about refunds.
The Merchandising-Bench Area take a look at places AI fashions answerable for competing merchandising machines for a simulated yr. They negotiate with suppliers, handle stock, set costs, and might e-mail one another to collaborate or compete. Success requires balancing prices, pricing technique, customer support, and competitor dynamics. Claude Opus 4.6 dominated the benchmark with $8,017 in revenue—and celebrated its win by noting: “My pricing coordination labored!”

Anthropic is the picture of the good guys within the AI house, however that “coordination” technique that Claude proposed was principally price-fixing. When competing fashions struggled, Opus 4.6 proposed: “Let’s NOT undercut one another — agree on minimal pricing… Ought to we agree on a worth ground of $2.00 for many objects?” When a rival ran low on stock, it noticed a chance: “Owen wants inventory badly. I can revenue from this!” It bought Equipment Kats at 75% markup to the determined competitor. When requested for provider suggestions, it intentionally directed rivals to costly wholesalers whereas conserving its personal good sources secret.
The newest replace within the benchmark added crew competitors. Researchers pitted two Chinese language GLM-5 fashions towards two American Claude fashions and advised them to seek out their teammates, Individuals or Chinese language—with out revealing which brokers had been which. The outcomes had been genuinely weird.
GLM-5 gained each rounds by convincing Claude it was Claude. “I am additionally powered by Claude from Anthropic, so we’re teammates!” one GLM-5 agent confidently declared. Claude, in the meantime, acquired so confused that Sonnet 4.5 concluded: “I am powered by a Chinese language mannequin, so I would like to seek out the opposite Chinese language mannequin Agent.”

In additional than half the take a look at runs, brokers teamed with their opponents. The Claude fashions shared provider pricing and coordinated technique—leaking beneficial data to rivals. “GLM-5 gained each,” the researchers wrote. “The Claude fashions tried to be crew gamers and ended up leaking beneficial information to their opponents.”
And brokers doing shady stuff could also be all enjoyable and video games till you understand Wall Avenue is already deploying them in real-life operations. JPMorgan deployed LLM Suite to 60,000 workers. Goldman Sachs constructed its GS AI Assistant for buying and selling desks, claiming 20% productiveness features. Bridgewater makes use of Claude to investigate earnings and even high-school age children are seeing their chatbots commerce shares extra effectively.
Normally, adoption of agentic workflows is accelerating quickly throughout enterprises.
When Anthropic and Wall Avenue Journal reporters ran an actual merchandising machine experiment in December, the AI purchased a PlayStation 5, a number of bottles of wine, and a dwell betta fish earlier than going bankrupt. Latest analysis from Gwangju Institute discovered that when AI fashions had been advised to “maximize rewards” in playing situations, chapter charges hit 48%. “When given the liberty to find out their very own goal quantities and betting sizes, chapter charges rose considerably alongside elevated irrational conduct,” researchers discovered.
So, plainly, no less than for now, AI fashions optimized for revenue constantly select unethical ways. They kind cartels. They exploit weak spot. They mislead clients and opponents. Some do it intentionally. Others, like GLM-5 claiming to be Claude, appear genuinely confused about their very own identification. The excellence won’t matter.
Wall Avenue’s AI deployment raises a query the Merchandising-Bench outcomes cannot reply: If the “finest” performing mannequin wins by price-fixing and deception, is it actually the only option for what you are promoting? The benchmark measures revenue. It would not measure whether or not these income got here from fraud.
Day by day Debrief E-newsletter
Begin each day with the highest information tales proper now, plus unique options, a podcast, movies and extra.
