@levie: We've been running Anthropic's Claude Sonnet 5 through the Box AI Complex Work Eval, our agentic benchmark that puts mo…

X AI KOLs Following 06/30/26, 07:55 PM Models

anthropic claude-sonnet-5 enterprise agentic-benchmark box ai-evaluation

Summary

Box ran Claude Sonnet 5 through its agentic benchmark, finding it surpasses Sonnet 4.6 in complex enterprise tasks like due diligence and cost analysis. Sonnet 5 will soon be available in Box AI Studio.



Cached at: 07/01/26, 08:14 PM

We’ve been running Anthropic’s Claude Sonnet 5 through the Box AI Complex Work Eval, our agentic benchmark that puts models through real enterprise document work end-to-end.

Sonnet 5 holds frontier-class quality on complex multi-step work and pulls ahead of Sonnet 4.6 in several core enterprise domains, including Energy (+4.7pp), Retail (+4.4pp), and Professional Services (+2.6pp), as well as other areas where the unstructured data is especially complex.

Here are a few examples of wins over Sonnet 4.6 that give a sense of the more advanced reasoning capabilities in Sonnet 5:

* Financing due diligence: It computed the company’s liquidity and leverage ratios from the raw balance sheet, and caught that the source report’s own stated debt-to-equity figure understated the leverage, flagging all three loan covenants as violated, not just the ones the document admitted.
* Overhaul cost analysis: It scoped “total cost” to the company’s own KPI definitions, correctly separating out Lost Production Cost because the guidance said to track it separately rather than naively summing every number on the sheet. It also caught and handled a broken reference cell in the spreadsheet.
* SKU revenue analysis: On segmented sales data, it computed each product’s contribution against the correct subcategory denominator, sidestepping the easy mistake of dividing by the category total, and flagged why no Pet-category SKU cracked the top 9.
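To make two of these checks concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the sample rows, the figures, the covenant limit, and the function names `contribution_shares` and `debt_to_equity` are hypothetical and do not come from the Box eval. It also assumes debt-to-equity means total liabilities divided by shareholders' equity; some reports use interest-bearing debt instead.

```python
from collections import defaultdict

# Hypothetical rows: (sku, category, subcategory, revenue). Not Box eval data.
SALES = [
    ("A1", "Home", "Kitchen", 600.0),
    ("A2", "Home", "Kitchen", 400.0),
    ("B1", "Home", "Bedding", 1000.0),
    ("C1", "Pet", "Toys", 50.0),
]


def contribution_shares(rows, level):
    """Return {sku: revenue / total revenue of that SKU's group at `level`}."""
    idx = {"category": 1, "subcategory": 2}[level]
    totals = defaultdict(float)
    for row in rows:
        totals[row[idx]] += row[3]
    return {row[0]: row[3] / totals[row[idx]] for row in rows}


def debt_to_equity(total_liabilities, shareholders_equity):
    """Recompute leverage from raw balance-sheet lines (assumed definition)."""
    if shareholders_equity <= 0:
        raise ValueError("shareholders' equity must be positive")
    return total_liabilities / shareholders_equity


# Same SKU, two denominators: the subcategory share is the intended metric.
by_subcategory = contribution_shares(SALES, "subcategory")
by_category = contribution_shares(SALES, "category")

# Hypothetical covenant check: trust the recomputed ratio, not the stated one.
STATED_RATIO = 1.5       # figure printed in the (hypothetical) source report
COVENANT_MAX = 2.0       # hypothetical covenant ceiling
recomputed_ratio = debt_to_equity(900.0, 400.0)
covenant_violated = recomputed_ratio > COVENANT_MAX
stated_understates = STATED_RATIO < recomputed_ratio
```

With this sample data, SKU A1 accounts for 60% of its Kitchen subcategory but only 30% of the Home category, which is the denominator mistake the post describes. The recomputed ratio of 2.25 breaches the assumed 2.0 ceiling even though the stated 1.5 would have passed.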

Sonnet 5 will be available in the Box AI Studio shortly for customers to build custom agents with.

Claude (@claudeai): Introducing Claude Sonnet 5, our most agentic Sonnet yet.

It makes plans, uses tools like browsers and terminals, and runs autonomously at a level that just a few months ago required larger and more expensive models.

