@levie: We've been running Anthropic's Claude Sonnet 5 through the Box AI Complex Work Eval, our agentic benchmark that puts mo…

Summary

Box ran Claude Sonnet 5 through its agentic benchmark, finding it surpasses Sonnet 4.6 in complex enterprise tasks like due diligence and cost analysis. Sonnet 5 will soon be available in Box AI Studio.

Cached at: 07/01/26, 08:14 PM

We’ve been running Anthropic’s Claude Sonnet 5 through the Box AI Complex Work Eval, our agentic benchmark that puts models through real enterprise document work end-to-end.

Sonnet 5 holds frontier-class quality on complex multi-step work and pulls ahead of Sonnet 4.6 in several core enterprise domains, including Energy (+4.7pp), Retail (+4.4pp), and Professional Services (+2.6pp), as well as other areas where the unstructured data is highly complex.

Here are a few wins over Sonnet 4.6 that give a sense of the more advanced reasoning capabilities in Sonnet 5:

  • Financing due diligence: It computed the company’s liquidity and leverage ratios from the raw balance sheet, and caught that the source report’s own stated debt-to-equity figure understated the leverage, flagging all three loan covenants as violated, not just the ones the document admitted.

  • Overhaul cost analysis: It scoped “total cost” to the company’s own KPI definitions, correctly separating out Lost Production Cost because the guidance said to track it separately rather than naively summing every number on the sheet. It also caught and handled a broken reference cell in the spreadsheet.

  • SKU revenue analysis: On segmented sales data, it computed each product’s contribution against the correct subcategory denominator, sidestepping the easy mistake of dividing by the category total, and flagged why no Pet-category SKU cracked the top 9.
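
To make two of the checks described above concrete, here is a minimal Python sketch. It is not the Box eval's code or data: the SKU rows, the balance sheet figures, and the helper names (group_key, revenue_share, debt_to_equity) are all hypothetical, and debt-to-equity is taken as total liabilities over shareholders' equity, which is one common definition rather than necessarily the one used in the source report.

```python
from collections import defaultdict

# Hypothetical sales rows: (sku, category, subcategory, revenue).
SALES = [
    ("A1", "Home", "Kitchen", 600.0),
    ("A2", "Home", "Kitchen", 400.0),
    ("B1", "Home", "Bedding", 1000.0),
    ("P1", "Pet", "Toys", 50.0),
]


def group_key(row, level):
    """Key identifying the group whose total serves as the denominator."""
    _sku, category, subcategory, _revenue = row
    return (category,) if level == "category" else (category, subcategory)


def revenue_share(rows, level="subcategory"):
    """Each SKU's revenue divided by the total of its group at `level`."""
    totals = defaultdict(float)
    for row in rows:
        totals[group_key(row, level)] += row[3]
    return {row[0]: row[3] / totals[group_key(row, level)] for row in rows}


def debt_to_equity(total_liabilities, shareholders_equity):
    """Assumed definition: total liabilities / shareholders' equity."""
    return total_liabilities / shareholders_equity


by_subcategory = revenue_share(SALES, "subcategory")  # intended denominator
by_category = revenue_share(SALES, "category")        # the easy mistake

# Hypothetical balance sheet figures and a hypothetical stated ratio.
recomputed = debt_to_equity(900.0, 300.0)
stated_in_report = 2.0
report_understates_leverage = recomputed > stated_in_report
```

With these made-up numbers, SKU A1 has a 60% share of its subcategory but only 30% of its category, which is the size of error the wrong denominator introduces. Likewise, the recomputed ratio of 3.0 exceeds the stated 2.0, so a covenant capped anywhere between the two would look satisfied in the report and violated on the raw figures.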

Sonnet 5 will be available in Box AI Studio shortly, so customers can use it to build custom agents.

Claude (@claudeai): Introducing Claude Sonnet 5, our most agentic Sonnet yet.

It makes plans, uses tools like browsers and terminals, and runs autonomously at a level that just a few months ago required larger and more expensive models.
