@levie: We've been running Anthropic's Claude Sonnet 5 through the Box AI Complex Work Eval, our agentic benchmark that puts mo…
Summary
Box ran Claude Sonnet 5 through its agentic benchmark, finding it surpasses Sonnet 4.6 in complex enterprise tasks like due diligence and cost analysis. Sonnet 5 will soon be available in Box AI Studio.
Cached at: 07/01/26, 08:14 PM
We’ve been running Anthropic’s Claude Sonnet 5 through the Box AI Complex Work Eval, our agentic benchmark that puts models through real enterprise document work end-to-end.
Sonnet 5 holds frontier-class quality on complex multi-step work and pulls ahead of Sonnet 4.6 in several core enterprise domains, including Energy (+4.7pp), Retail (+4.4pp), and Professional Services (+2.6pp), as well as other areas where the unstructured data is especially complex.
Here are a few examples where Sonnet 5 beat Sonnet 4.6, which give a sense of its more advanced reasoning capabilities:
- Financing due diligence: It computed the company’s liquidity and leverage ratios from the raw balance sheet, and caught that the source report’s own stated debt-to-equity figure understated the leverage, flagging all three loan covenants as violated, not just the ones the document admitted.
- Overhaul cost analysis: It scoped “total cost” to the company’s own KPI definitions, correctly separating out Lost Production Cost because the guidance said to track it separately rather than naively summing every number on the sheet. It also caught and handled a broken reference cell in the spreadsheet.
- SKU revenue analysis: On segmented sales data, it computed each product’s contribution against the correct subcategory denominator, sidestepping the easy mistake of dividing by the category total, and flagged why no Pet-category SKU cracked the top 9.
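To make the arithmetic behind these three tasks concrete, here is a minimal Python sketch of each check. It is not Box's eval harness and not output from Sonnet 5; the field names, the choice of three covenants and their thresholds, the "#REF!" placeholder for a broken cell, the skip-and-report policy for that cell, and all figures are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BalanceSheet:
    # Hypothetical raw line items; names and figures are illustrative only.
    current_assets: float
    inventory: float
    current_liabilities: float
    total_debt: float
    total_equity: float


def ratios(bs: BalanceSheet) -> dict[str, float]:
    """Recompute liquidity and leverage ratios from the raw line items
    rather than trusting any ratio the report states about itself."""
    return {
        "current_ratio": bs.current_assets / bs.current_liabilities,
        "quick_ratio": (bs.current_assets - bs.inventory) / bs.current_liabilities,
        "debt_to_equity": bs.total_debt / bs.total_equity,
    }


def violated_covenants(bs: BalanceSheet, covenants: dict[str, tuple[str, float]]) -> list[str]:
    """covenants maps a ratio name to ("min" or "max", threshold)."""
    computed = ratios(bs)
    violated = []
    for name, (kind, threshold) in covenants.items():
        value = computed[name]
        if (kind == "min" and value < threshold) or (kind == "max" and value > threshold):
            violated.append(name)
    return violated


def kpi_total_cost(line_items: dict[str, object], tracked_separately: set[str]) -> tuple[float, list[str]]:
    """Sum only the cost lines the KPI definition counts as "total cost".
    Lines the guidance tracks separately are excluded; non-numeric cells
    (e.g. a broken '#REF!' reference) are skipped and reported, an assumed policy."""
    total = 0.0
    broken = []
    for name, value in line_items.items():
        if name in tracked_separately:
            continue
        if not isinstance(value, (int, float)):
            broken.append(name)
            continue
        total += value
    return total, broken


def subcategory_shares(rows: list[tuple[str, str, str, float]]) -> dict[str, float]:
    """rows are (sku, category, subcategory, revenue); each SKU's share uses its
    own (category, subcategory) revenue total as the denominator."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for _, category, subcategory, revenue in rows:
        totals[(category, subcategory)] += revenue
    return {sku: revenue / totals[(category, subcategory)]
            for sku, category, subcategory, revenue in rows}


if __name__ == "__main__":
    sheet = BalanceSheet(current_assets=500.0, inventory=200.0,
                         current_liabilities=400.0, total_debt=900.0, total_equity=400.0)
    covenants = {"current_ratio": ("min", 1.5), "quick_ratio": ("min", 1.0),
                 "debt_to_equity": ("max", 2.0)}
    print(violated_covenants(sheet, covenants))  # ['current_ratio', 'quick_ratio', 'debt_to_equity']

    costs = {"Parts": 120_000.0, "Labor": 80_000.0,
             "Contractor fees": "#REF!", "Lost Production Cost": 250_000.0}
    print(kpi_total_cost(costs, {"Lost Production Cost"}))  # (200000.0, ['Contractor fees'])

    sales = [("A1", "Pet", "Food", 300.0), ("A2", "Pet", "Food", 100.0),
             ("B1", "Pet", "Toys", 50.0), ("B2", "Pet", "Toys", 50.0)]
    print(subcategory_shares(sales))  # {'A1': 0.75, 'A2': 0.25, 'B1': 0.5, 'B2': 0.5}
```

With these made-up numbers, the recomputed debt-to-equity of 2.25 breaches a 2.0 cap regardless of what figure a report states, and SKU A1 accounts for 75% of its Food subcategory rather than the 60% you would get by dividing by the Pet category total.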
Sonnet 5 will be available shortly in Box AI Studio, where customers can use it to build custom agents.
Claude (@claudeai): Introducing Claude Sonnet 5, our most agentic Sonnet yet.
It makes plans, uses tools like browsers and terminals, and runs autonomously at a level that just a few months ago required larger and more expensive models.