Just had to rewrite my entire agent infrastructure for reliability, anyone else doing the same?

Reddit r/AI_Agents 05/31/26, 02:00 AM News

ai-agents agent-infrastructure reliability durable-execution engineering discussion

Summary

The author describes rewriting their AI agent infrastructure for reliability using DBOS durable execution after facing cascading failures, and asks the community about similar experiences, tool choices, and build-vs-buy decisions.

For a bit of context, I'm currently building a team of AI agents at work that generates reports by fanning out into a large number of sub-agents to process a large volume of transcript data. When the analysis fails midway because of an individual step (for example, an API call returns an error or the machine runs out of memory), it causes cascading errors that break the entire generation with almost no visibility.

I've just spent the past month rewriting the individual jobs as durable execution jobs on DBOS, but I'm wondering whether there are better solutions out there and whether others have run into similar issues. Then there's the problem of reflecting progress back to users, which I've honestly just been coding ad hoc...

When an agent fails at step 9 of 12, how do you handle that?

Roughly how many engineer-weeks have you sunk into agent infrastructure (durability, monitoring, human-in-the-loop, live UI) vs. the actual agent logic? Curious if my ratio is normal.

For those who built this stuff in-house: was it ever a build-vs-buy conversation? What would a tool have had to do for you to buy instead of build?

Do you currently pay for anything in your agent stack (LangSmith, Temporal, Braintrust, etc.)? What made that one worth a line item when others weren't, and should I look into it too?
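To make the "fails at step 9 of 12" question concrete, here is a minimal sketch of the checkpoint-and-resume idea behind durable execution. It is not DBOS's API and not my actual code: it uses only Python's standard library, and `open_store`, `run_durable`, `progress`, and the `step_results` table are hypothetical names made up for illustration. It also assumes each step's output is JSON-serializable and that steps are safe to retry.

```python
import json
import sqlite3
from typing import Any, Callable, List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the checkpoint store. Hypothetical schema for illustration."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS step_results ("
        "run_id TEXT, step_index INTEGER, output TEXT, "
        "PRIMARY KEY (run_id, step_index))"
    )
    return conn


def run_durable(
    conn: sqlite3.Connection,
    run_id: str,
    steps: List[Callable[[Any], Any]],
    initial: Any,
) -> Any:
    """Run steps in order, reusing any output already checkpointed for this run."""
    value = initial
    for index, step in enumerate(steps):
        row = conn.execute(
            "SELECT output FROM step_results WHERE run_id = ? AND step_index = ?",
            (run_id, index),
        ).fetchone()
        if row is not None:
            # Step finished in an earlier attempt: replay its saved output.
            value = json.loads(row[0])
            continue
        # If this raises, nothing is recorded and a retry resumes right here.
        value = step(value)
        conn.execute(
            "INSERT INTO step_results VALUES (?, ?, ?)",
            (run_id, index, json.dumps(value)),
        )
        conn.commit()
    return value


def progress(conn: sqlite3.Connection, run_id: str, total_steps: int) -> Tuple[int, int]:
    """Return (completed_steps, total_steps), e.g. for a progress bar in the UI."""
    done = conn.execute(
        "SELECT COUNT(*) FROM step_results WHERE run_id = ?", (run_id,)
    ).fetchone()[0]
    return done, total_steps
```

With something like this, calling `run_durable` again with the same `run_id` after a failure skips the steps that already completed and retries only the failed one, and `progress` gives the UI something to poll. A real durable execution engine presumably handles much more (concurrency, fan-out, timeouts, process crashes mid-write), which is exactly the part I'd rather not maintain myself.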

Similar Articles

Which platform is your company using for ai agent observability and reliability needs?

At what point does adding another agent actually hurt your system? Asking because my 6-agent pipeline is slower and less reliable than my old 2-agent one

AI agent development

@djfarrelly: https://x.com/djfarrelly/status/2052779234234380479

What broke first when you went from one AI agent to several?
