@sydneyrunkle: https://x.com/sydneyrunkle/status/2062588423295111208
Summary
This article explains how to add fault tolerance to LangGraph agents using RetryPolicy, TimeoutPolicy, and error handlers, covering retries with backoff, timeouts, and compensation logic for production reliability.
Fault Tolerance in LangGraph: Retries, Timeouts, and Error Handlers
In the real world, agents hit errors that prototypes never see: network failures, tool call errors, and LLM rate limits.
Imagine you have a task that’s been running for hours or days that hits an unrecoverable error halfway through. What do you do? Abandon the run and completely start over? This isn’t a sustainable way to run production agents.
Writing the happy path is usually the easy part. The error handling boilerplate that makes it survive in production (retries, timeouts, fallbacks) is often longer than the business logic itself.
LangGraph models your agent as a set of discrete steps (nodes), organized as a graph. For a typical agent, that’s a node that calls the model, a node that runs any tool calls it returns, and any deterministic logic you want to wrap around that loop. Because LangGraph controls execution, it’s also where you handle what happens when any of those steps fails.
This post walks through the three primitives LangGraph gives you for fault tolerance, how they compose, and why having them inside the workflow engine matters once you start thinking about compensation logic.
The three primitives are:
- RetryPolicy: automatic retries with backoff/jitter for transient errors.
- TimeoutPolicy: a wall-clock or progress-based cap on a node attempt.
- error_handler: a node that runs after retries are exhausted, with the failure context attached.
In LangGraph, you define your agent by adding nodes and edges to a StateGraph. All three primitives attach directly to a node via add_node, so your fault tolerance config lives right next to the logic it protects. (If you want to configure defaults once, see set_node_defaults.)
Starting from retries
Transient failures are the most common kind of failure in any non-trivial graph: an LLM provider returns a 5xx, a vector store hits a connection reset, a downstream HTTP service is briefly unavailable. Every one of these is fundamentally a “try again in a moment and it’ll probably work” kind of error.
Without first-class support you end up writing the same wrapper inside every node:
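A minimal sketch of that kind of hand-rolled wrapper, in plain Python with no LangGraph imports. The names call_with_retries, flaky_search, and fetch_documents_node are hypothetical and exist only for illustration:

```python
import random
import time


def call_with_retries(fn, *, max_attempts=3, initial_interval=0.5,
                      backoff_factor=2.0, max_interval=8.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call fn(), retrying transient errors with exponential backoff and jitter."""
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            # Wait for the current interval plus up to 100% random jitter.
            sleep(interval + random.uniform(0, interval))
            interval = min(interval * backoff_factor, max_interval)


def flaky_search(query):
    """Hypothetical downstream call that may raise ConnectionError."""
    return [f"result for {query}"]


def fetch_documents_node(state):
    """Hypothetical node: one line of real work, wrapped in retry plumbing."""
    docs = call_with_retries(lambda: flaky_search(state["query"]))
    return {"docs": docs}
```

Every node that talks to the outside world ends up needing its own copy of this plumbing, with its own choice of intervals and retryable exceptions.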
LangGraph’s RetryPolicy removes that boilerplate. It applies per node attempt, with exponential backoff, optional jitter, and a configurable predicate for which exceptions count as retryable:
The default retry_on is intentionally conservative: it retries ConnectionError, 5xx responses from httpx/requests, and a few generic transient categories.
By default it does not retry ValueError, TypeError, RuntimeError, etc., which are almost always programming bugs.
The retry_on spec can be a collection of error types or a callable that checks an error at runtime to see if it matches retry criteria.
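The two forms can be sketched as follows. This is an illustrative, standalone snippet: HTTPStatusError is a hypothetical stand-in for an HTTP client's status exception, and treating 429 as retryable is an assumption made for the example, not a description of LangGraph's default.

```python
class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP client's status error."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


# Form 1: a collection of exception types that count as retryable.
RETRYABLE_TYPES = (ConnectionError, TimeoutError)


# Form 2: a callable that inspects the exception at runtime.
def should_retry(exc: BaseException) -> bool:
    if isinstance(exc, RETRYABLE_TYPES):
        return True
    if isinstance(exc, HTTPStatusError):
        # Rate limits (429) and server errors (5xx) may succeed later;
        # other 4xx responses will fail the same way on every attempt.
        return exc.status_code == 429 or 500 <= exc.status_code < 600
    return False  # ValueError, TypeError, etc. are treated as bugs
```

Either RETRYABLE_TYPES or should_retry is the kind of value a retry_on setting would accept. The callable form is the one to reach for when the decision depends on data carried by the exception, such as a status code.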
Timeout: a special case of “transient failure”
A timeout is really just “the attempt is treated as a transient failure because it’s been hanging too long.” Without an explicit timeout, a stuck HTTP call or a frozen subprocess can hang a graph run indefinitely.
LangGraph’s TimeoutPolicy supports two types of timeouts:
- run_timeout is a hard wall-clock cap on a single attempt. Useful when you simply do not care to ever wait more than N seconds for a node.
- idle_timeout resets on every “progress” signal: channel writes, streamed chunks (emitted automatically by LangChain LLM models), child task events, and LangChain callback events. Long-running but actively streaming work doesn’t trip it, but a truly hung call does. Internally, each of these signals is treated as a “heartbeat”. If you control the work and emit your own progress beats, you can switch to refresh_on="heartbeat" and explicitly call runtime.heartbeat() from inside the node.
When a timeout fires, the node attempt is cancelled and a NodeTimeoutError is raised.
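The difference between the two timeouts can be sketched with plain asyncio. This is a simplified model of the idea, not LangGraph's implementation: NodeTimeout, Heartbeat, run_attempt, and the two example workers are hypothetical names, and the polling loop is an assumption chosen to keep the sketch short.

```python
import asyncio


class NodeTimeout(Exception):
    """Stand-in for the runtime's timeout error (not LangGraph's class)."""


class Heartbeat:
    """Tracks when the running work last reported progress."""

    def __init__(self):
        self.beat()

    def beat(self):
        self.last_beat = asyncio.get_running_loop().time()


async def run_attempt(work, *, run_timeout=None, idle_timeout=None, poll=0.01):
    """Run work(heartbeat) once; cancel it when either limit is exceeded."""
    loop = asyncio.get_running_loop()
    heartbeat = Heartbeat()
    started = loop.time()
    task = asyncio.create_task(work(heartbeat))
    while True:
        done, _ = await asyncio.wait({task}, timeout=poll)
        if done:
            return task.result()
        now = loop.time()
        if run_timeout is not None and now - started > run_timeout:
            reason = "run_timeout"  # hard wall-clock cap on the attempt
        elif idle_timeout is not None and now - heartbeat.last_beat > idle_timeout:
            reason = "idle_timeout"  # no progress signal for too long
        else:
            continue
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)  # let cancellation finish
        raise NodeTimeout(reason)


async def streaming_work(heartbeat):
    """Slow overall (about 0.3 s), but reports progress every 50 ms."""
    for _ in range(6):
        await asyncio.sleep(0.05)
        heartbeat.beat()
    return "done"


async def hung_work(heartbeat):
    """Never reports progress."""
    await asyncio.sleep(60)
```

With idle_timeout=0.2, streaming_work finishes even though it runs longer than 0.2 seconds in total, because each beat resets the idle clock, while hung_work is cancelled. With run_timeout=0.1, streaming_work is cancelled regardless of how much progress it reports.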
Error handlers: when retries aren’t enough
Retries handle “this will probably work in 5 seconds.” They don’t handle the case where retries are exhausted and you still need to run some logic. For example: “we’ve tried six times, the payment provider is still down, and now we need to:
- mark the order as failed and notify the customer, or
- roll back the partial side effects we already committed, or
- publish a payment.failed event for the rest of the system to react to.”
There are many uses for error handlers after retry exhaustion: cleanup, alerting, dead-letter writes, fallback paths to a cheaper model, or just routing to a “we apologize” message.
In LangGraph, this is now supported naturally (docs: Error handling):
A few things to notice about how this is wired:
It only fires after retries are exhausted. This is the property that makes the feature actually useful: if you wanted logic to run on every exception, you could just write a try/except inside the node.
The failure context is injected. The handler can accept a NodeError parameter to get the failing node’s name plus the exception (error.node, error.error).
The transition is atomic. When the original node fails, its ERROR write is committed to the checkpoint, and the handler is scheduled as a new task in the same step. This matters for critical processes that must not return to the regular steps once they have entered the error-handling path. If the host process crashes mid-handler, resuming the run reschedules the handler, not the original failing node.
The error handler runs in the same execution cycle.* When a node fails, the error handler is scheduled immediately alongside any other nodes that were already running in that step. It doesn’t wait for them to finish, and they don’t wait for it.
*In LangGraph, an “execution cycle” is called a “superstep”, if you’re familiar with the runtime.
You can set a default handler for every node. set_node_defaults applies to every regular node that doesn’t specify its own, but a per-node error_handler= always wins.
You can’t attach an error handler to an error handler, so handlers can’t recurse indefinitely.
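The first two properties (the handler fires only once retries are exhausted, and it receives the failing node's name and exception) can be modeled in a few lines of plain Python. This is a sketch of the behavior, not LangGraph's API: NodeFailure, run_node, charge_card, and payment_failed are hypothetical names, and the sketch retries every exception and skips backoff to stay short.

```python
from dataclasses import dataclass


@dataclass
class NodeFailure:
    """Stand-in for the injected failure context (node name plus exception)."""
    node: str
    error: BaseException


def run_node(name, fn, state, *, max_attempts=3, error_handler=None):
    """Run fn(state); call error_handler only once every attempt has failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(state)
        except Exception as exc:
            if attempt < max_attempts:
                continue  # a real runtime would back off before retrying
            if error_handler is None:
                raise
            return error_handler(state, NodeFailure(node=name, error=exc))


attempts = []  # records each call so the retry count is visible


def charge_card(state):
    """Hypothetical node whose provider is down for every attempt."""
    attempts.append(state["order_id"])
    raise ConnectionError("payment provider unavailable")


def payment_failed(state, failure):
    """Hypothetical handler: mark the order failed and keep the reason."""
    return {"status": "failed", "failed_node": failure.node,
            "reason": str(failure.error)}


result = run_node("charge_card", charge_card, {"order_id": 42},
                  error_handler=payment_failed)
```

Here charge_card runs three times before payment_failed is called once, which is the distinction from a try/except inside the node that would run on every attempt.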
Putting it together: fault tolerant flight booking
The three primitives above compose naturally, but their real power shows up in workflows that involve side effects: operations that change real-world state. Consider a flight booking: it’s not one action, it’s a sequence. Reserve a seat, process payment, issue a ticket. Each step talks to an external system. Any of them can fail.
The naive approach (just retry the whole thing) breaks down fast. If the seat reservation went through but payment or ticket issuance fails, the reservation is stuck in a bad state. What you actually need is to retry each step individually and, if a step exhausts its retries, undo only the steps that already ran (including the failed one, because it’s unknown whether its side effect took hold).
This is called the saga pattern, and it’s a standard way to handle failures in distributed systems where you can’t wrap everything in a single database transaction.
Here’s what that looks like in LangGraph:
What this gives you:
- Per-step backoff retries with the configured policy
- An atomic transition into compensate once any step’s retries are exhausted
- Persistent state tracking which steps actually completed, so compensate only undoes what needs to revert
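The same control flow can be sketched in plain Python, independent of LangGraph, to make the bookkeeping explicit. Everything here is hypothetical and simplified: run_saga, compensate, the step names, and the log list standing in for external systems. Backoff between attempts is omitted, and the state lives in memory rather than in a checkpoint.

```python
def run_saga(steps, state, *, max_attempts=3):
    """Run (name, action, undo) steps in order; compensate on final failure."""
    state = {**state, "attempted": [], "status": "pending"}
    for name, action, _undo in steps:
        # Record the step before running it: if it ultimately fails we cannot
        # know whether its side effect landed, so it gets compensated as well.
        state["attempted"].append(name)
        for attempt in range(1, max_attempts + 1):
            try:
                action(state)
                break  # step succeeded, move on to the next one
            except ConnectionError:
                if attempt == max_attempts:
                    return compensate(steps, state)
    state["status"] = "booked"
    return state


def compensate(steps, state):
    """Undo every attempted step, most recent first."""
    undo_by_name = {name: undo for name, _action, undo in steps}
    for name in reversed(state["attempted"]):
        undo_by_name[name](state)
    state["status"] = "rolled_back"
    return state


log = []  # stands in for calls to external systems


def issue_ticket(state):
    log.append("issue_ticket")
    raise ConnectionError("ticketing service unavailable")


steps = [
    ("reserve_seat", lambda s: log.append("reserve_seat"),
     lambda s: log.append("release_seat")),
    ("process_payment", lambda s: log.append("process_payment"),
     lambda s: log.append("refund_payment")),
    ("issue_ticket", issue_ticket, lambda s: log.append("void_ticket")),
]

result = run_saga(steps, {"flight": "XY123"})
```

Because the failed step is compensated too, each undo action should be safe to call when its forward action never took effect, for example by making it idempotent.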
Final words
Agents are taking on more autonomy, and with that comes more power to act. They’re booking flights, filing tickets, executing payments, calling internal services. The actions they take are increasingly high-consequence and difficult to reverse.
That raises the bar for reliability. A 1% transient failure rate is a minor inconvenience in a demo. In a production agent with dozens of steps and real-world consequences, it compounds quickly.
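A quick calculation shows how fast that compounding happens. It assumes each step fails independently with the same probability and that nothing is retried; both are simplifications made for illustration, and run_failure_probability is a hypothetical helper name.

```python
def run_failure_probability(per_step_failure: float, steps: int) -> float:
    """Chance that at least one of `steps` independent steps fails."""
    return 1 - (1 - per_step_failure) ** steps


# Print the whole-run failure rate for a 1% per-step failure rate.
for steps in (1, 12, 24, 36):
    print(f"{steps:>2} steps: {run_failure_probability(0.01, steps):.1%}")
```

Under those assumptions, a run with two dozen steps at 1% each fails roughly one time in five.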
RetryPolicy, TimeoutPolicy, and error_handler are built into LangGraph so that it’s easy to build an agent that’s resilient to all sorts of errors. All you have to do is define policies that make sense for your use case, and the LangGraph agent runtime handles the rest.
Get started: configure per-node retries, timeouts, and error handlers with the official Fault tolerance docs.
Acknowledgements
This blog post was written by @quanzhenglong and @sydneyrunkle and initially appeared on the LangChain blog.
Thanks to @huntlovell, @bromann, and @veryboldbagel for thoughtful reviews.