Did you see it when Salesforce ran their own AI agents benchmark?

Reddit r/ArtificialInteligence 06/01/26, 07:22 AM News

salesforce ai-agents benchmark crm multi-turn context-window production-tips

Summary

Discussion of Salesforce's CRMArena-Pro benchmark showing agent success drops from 58% on single-turn to 35% on multi-turn tasks, plus practical advice for splitting agent workflows into narrow stages to reduce error compounding.

So Salesforce built CRMArena-Pro as part of a research study and launched it this time last year (Salesforce Research, arXiv 2505.18878). They tested leading agents on real CRM work across 4,280 queries and 9 top models. The results were ~58% success on single-turn tasks, dropping to ~35% the moment the task went multi-turn.

I know this seems like irrelevant news considering its age and how fast the AI space is moving, but I believe today's agents would score the same or only marginally better, which means companies running AI agents are still paying for the errors their agents make.

The expensive realization is that it's not really the model, it's the length of the autonomous runs. The longer one agent runs, the more early context it loses and the more small errors compound. Leading AI players are combating this by expanding context windows to handle up to 1-2 million tokens, introducing various memory-saving frameworks, and adding techniques like RAG. The problem still persists, especially for longer-running agents and workflows that need to run 24/7.

I'm not selling anything here, but what's working for me in production is pretty boring and reliable. I don't hand one agent the whole job. Instead I break the overall job into narrow stages, one job per stage, with a plain-text handoff between stages and a checkpoint where I can verify outputs before the next stage runs. A stage that researches doesn't also write. A stage that writes doesn't also send. The agent never has to run long enough to drift, and at every stage I can catch bad outputs before they compound.

Has anyone else found that scoping context per stage beats prompt-engineering one mega-agent? Where's the line for you between "split into stages" and "this can run end-to-end"?
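For anyone who wants something concrete, here is a minimal sketch of the staged setup described above. It is an illustration under assumptions, not my production code: the names `Stage`, `run_pipeline`, and `CheckpointFailed` are made up for this example, the three stage functions are placeholders standing in for real agent calls, and the checkpoints are simple automated checks where a human review could sit instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One narrow job. All names here are hypothetical, for illustration only."""
    name: str
    run: Callable[[str], str]     # takes the previous plain-text handoff, returns a new one
    check: Callable[[str], bool]  # checkpoint: True if the output is safe to pass on


class CheckpointFailed(Exception):
    """Raised when a stage's output fails its checkpoint."""


def run_pipeline(stages: List[Stage], task: str) -> str:
    """Run stages in order, passing only plain text between them."""
    handoff = task
    for stage in stages:
        output = stage.run(handoff)
        # Stop here instead of letting a bad output flow into later stages.
        if not stage.check(output):
            raise CheckpointFailed(f"stage '{stage.name}' produced a bad output")
        handoff = output
    return handoff


# Placeholder stages: in practice each would be a separate, narrowly
# prompted agent call that sees only the handoff text, not the full history.
def research(task: str) -> str:
    return f"NOTES: key facts about {task}"


def write(notes: str) -> str:
    return f"DRAFT: email based on [{notes}]"


def send(draft: str) -> str:
    return f"SENT: {draft}"


stages = [
    Stage("research", research, lambda out: out.startswith("NOTES:")),
    Stage("write", write, lambda out: out.startswith("DRAFT:") and len(out) < 2000),
    Stage("send", send, lambda out: out.startswith("SENT:")),
]

if __name__ == "__main__":
    print(run_pipeline(stages, "renewal follow-up"))
```

The point of the design is that each stage only ever sees the handoff string, so its context stays small, and a failed checkpoint halts the run at the stage that caused the problem rather than three stages later.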


Similar Articles

@alexxubyte: Salesforce deployed 20,000 enterprise AI agents. The biggest lesson? The work is inverted! Traditional software → 90% o…

Where AI agents actually break in real workflows (not demos)

@HowToAI_: Microsoft Research + Salesforce has published a paper that should scare every single AI builder right now. It’s called …

everyone's focused on whether their agent works. almost nobody asks if it's actually getting better over time

A CEO built his own AI agent with Claude MCP + NetSuite. It worked. Then it didn't scale.
