First comparative benchmark of AI agents for API bug detection shows strong performance on simple checks, but major gaps on cross-field and business-logic failures SAN FRANCISCO, June 3, 2026 ...
A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve ...