We hit the retry problem hard enough that we open-sourced a fix

Reddit r/AI_Agents 06/11/26, 07:58 PM Tools

retry idempotency open-source agents reliability npm circuit-breaker

Summary

Replaysafe is an open-source npm library that ensures idempotent retries by fingerprinting operations, preventing duplicate side effects in AI agent workflows. It integrates with popular frameworks like LangGraph and CrewAI.

If you've been running agents in production, you know the drill: the agent crashes mid-task, you retry, and suddenly the customer has two charges, two welcome emails, two CRM entries. The hard part isn't retrying. It's knowing what **already** happened.

We are building a small library that wraps any non-idempotent call (charging a card, sending an email, hitting an API) and fingerprints it as `hash(type + target + input)`. Before executing, it checks whether that exact operation was already done. If yes, it returns the cached result. If not, it runs the call and remembers the result.

It has circuit breakers for retry storms and rollback hooks for partial failures. It works with LangGraph, CrewAI, Inngest, n8n, Airflow, or whichever framework you're using.

It's called Replaysafe. Open source (AGPL), just an npm package, no infrastructure needed.

Curious what recovery patterns are working for others here. This is still early and we're learning from what people actually need.
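For anyone who wants the fingerprint-then-check flow in concrete terms, here's a rough sketch of the idea. To be clear, this is not Replaysafe's actual API: `Operation`, `canonicalize`, `fingerprint`, and `IdempotentRunner` are hypothetical names made up for illustration, and SHA-256 over key-sorted JSON is just one assumed way to implement `hash(type + target + input)`. The in-memory `Map` is also a stand-in; a real store would have to survive the crash you're recovering from.

```typescript
import { createHash } from "node:crypto";

// Hypothetical operation shape for illustration; not Replaysafe's real types.
interface Operation<I> {
  type: string;   // e.g. "charge_card"
  target: string; // e.g. a customer ID
  input: I;       // assumed to be JSON-serializable
}

// Serialize with sorted object keys so that logically equal inputs
// produce the same string regardless of property order.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Assumed realization of hash(type + target + input): SHA-256 over the
// canonical form of the three fields.
function fingerprint<I>(op: Operation<I>): string {
  return createHash("sha256")
    .update(canonicalize([op.type, op.target, op.input]))
    .digest("hex");
}

class IdempotentRunner {
  // In-memory only (assumption for the sketch); lost on process restart.
  private results = new Map<string, unknown>();

  async run<I, R>(
    op: Operation<I>,
    execute: (input: I) => Promise<R>,
  ): Promise<R> {
    const key = fingerprint(op);
    if (this.results.has(key)) {
      // Same operation already completed: return the remembered result.
      return this.results.get(key) as R;
    }
    const result = await execute(op.input);
    this.results.set(key, result);
    return result;
  }
}

// Usage: the retry does not trigger a second charge.
async function demo(): Promise<number> {
  const runner = new IdempotentRunner();
  let charges = 0;
  const op = {
    type: "charge_card",
    target: "cus_123",
    input: { amountCents: 500 },
  };
  const charge = async () => {
    charges += 1;
    return { chargeId: `ch_${charges}` };
  };
  await runner.run(op, charge);
  await runner.run(op, charge); // simulated retry, served from the cache
  return charges;
}
```

Two gaps in this sketch are worth calling out. It only records a result after the call returns, so a crash between the side effect and the `set` would still lead to a duplicate on retry, and two concurrent calls with the same fingerprint could both execute. A durable store with an atomic "claim" step is the usual way to close both.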


Similar Articles

built a small open source tool to stop AI agents from regressing after changes

I built a replay layer for sandboxed agent runs on GitHub repos

Show r/AI_Agents: Stop your agents from breaking tool calls in production — we built a reliability layer for 2,000+ APIs

I kept rebuilding checkpointing, retries, and run tracking for agents. So I built an open-source runtime around them.

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
