@freeman1266: https://x.com/freeman1266/status/2055293363893768463

X AI KOLs Timeline 05/15/26, 02:24 PM News

Summary

This article summarizes four common pitfalls encountered when deploying AI Agents from demo to production: unreliable function calling, cumulative failure rate of multi-step tasks, improper memory management, and security permission issues, along with corresponding solutions.

https://t.co/xX4T2BUMgp


Cached at: 05/16/26, 05:12 AM

Four Pitfalls of Agent Engineering: No Shortcut from Demo to Production

We’ve been building an Agent for over two months. The demo ran smoothly, and every internal review was delightful. But within two weeks of launch, real users taught us a hard lesson.

Today I want to share a few things that actually happened.

Let’s start with function calling.

At first we thought it was solid — tested it thirty or fifty times without any issues. Then daily active users climbed.

A few days ago I reviewed last month’s bad case list. Our schema never defined the enum value urgent_high, but the model invented it on the fly. Another time, it was supposed to query the internal database but ran a web search instead. The most outrageous case: the returned JSON came wrapped in a markdown code fence (```json), and the downstream parser crashed.

We measured the error rate at 0.1%. Sounds acceptable, right? But at 100,000 calls per day, that is 100 failed calls every day, each one a potential production incident. Can any team handle that?

I talked to a friend about this. He immediately said — the model isn’t smart enough, just switch to 5.5. I laughed. Last year people said it was because GPT-4 wasn’t good enough, wait for GPT-5. We waited and ended up here. Models are indeed improving, but you can never predict when they’ll freak out in some corner.

What we did later was actually simple. We added code-level schema validation on the model’s output: if fields are missing or types are wrong, we block the call and retry. Then we added business semantic validation. For example, the query date says ‘tomorrow’ but the call targets a daily report endpoint; a schema can’t catch that. We also enforced timeouts on tool calls and isolated each call’s context, so that an external API going down cannot drag down the entire Agent.
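The two validation layers and the retry loop can be sketched roughly as follows. This is a minimal illustration, not the team’s actual code: the field names, the allowed priorities, the `daily_report` tool name, and the `generate` callable (standing in for a model call that receives the previous error as feedback) are all assumptions made up for the example.

```python
import json
import re
from datetime import date

# Hypothetical schema for illustration; the article does not show the real one.
ALLOWED_PRIORITIES = {"low", "normal", "high"}
REQUIRED_FIELDS = {"tool": str, "priority": str, "report_date": str}

FENCE = "`" * 3  # a markdown fence marker
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[a-zA-Z]*\s*\n(.*?)" + FENCE + r"\s*$", re.DOTALL
)


class ValidationError(Exception):
    """Raised when a tool call fails schema or business validation."""


def strip_code_fence(raw: str) -> str:
    """Remove a surrounding markdown code fence if the model added one."""
    match = FENCE_RE.match(raw)
    return match.group(1) if match else raw


def validate_tool_call(raw: str, today: date) -> dict:
    try:
        call = json.loads(strip_code_fence(raw))
    except json.JSONDecodeError as exc:
        raise ValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValidationError("expected a JSON object")
    # Layer 1: schema validation (required fields, types, enum membership).
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(call.get(field), expected_type):
            raise ValidationError(f"missing or mistyped field: {field}")
    if call["priority"] not in ALLOWED_PRIORITIES:
        raise ValidationError(f"unknown priority: {call['priority']}")
    # Layer 2: business semantics that a schema cannot express.
    try:
        report_date = date.fromisoformat(call["report_date"])
    except ValueError as exc:
        raise ValidationError(f"bad report_date: {exc}") from exc
    if call["tool"] == "daily_report" and report_date > today:
        raise ValidationError("daily report requested for a future date")
    return call


def call_with_retry(generate, today: date, max_attempts: int = 3) -> dict:
    """Ask the model again, passing back the last error, until output validates."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate_tool_call(generate(last_error), today)
        except ValidationError as exc:
            last_error = str(exc)
    raise ValidationError(f"gave up after {max_attempts} attempts: {last_error}")
```

Capping the attempts matters: a call that keeps failing validation should surface as an explicit error rather than loop forever.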

Nothing technically fancy, but missing any layer would cause trouble.
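For the timeout layer, one possible shape is a wrapper that turns a slow or crashing tool into a structured error the Agent can reason about. This is only a sketch under assumptions: `call_tool_with_timeout` and its result format are hypothetical, and a thread pool is just one of several ways to bound a call.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for tool calls; the size is an arbitrary choice for the sketch.
_POOL = ThreadPoolExecutor(max_workers=4)


def call_tool_with_timeout(fn, args: dict, timeout_s: float) -> dict:
    """Run a tool and convert timeouts and exceptions into a structured result."""
    future = _POOL.submit(fn, **args)
    try:
        return {"ok": True, "value": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # cancel() only helps if the call has not started yet; a running
        # thread keeps going in the background, so the tool itself should
        # also set its own network timeouts.
        future.cancel()
        return {"ok": False, "error": "timeout"}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

The caller gets an answer within `timeout_s` either way, so one hung external API does not stall the whole Agent loop.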

Now, about multi-step tasks.

‘Help me scrape a few articles, extract pain points, and store them in the database.’ Sounds like three steps, right? Even with an optimistic 95% success rate per step, three steps multiply to about 86%. With seven steps, it drops below 70%.
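The arithmetic is easy to check. The sketch below assumes the steps fail independently, which is the same simplification the numbers above rely on; the function names are made up for the example.

```python
def pipeline_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach an end-to-end target."""
    return target ** (1 / steps)


for steps in (3, 7):
    print(steps, f"{pipeline_success_rate(0.95, steps):.1%}")
# prints:
# 3 85.7%
# 7 69.8%
```

Read the other way around, `required_per_step` shows how quickly the per-step bar rises as a workflow gets longer.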

But handling failure itself isn’t that hard.

We had a problem once. A workflow failed at the third step, and the data from the first two steps wasn’t handled properly, so the retry restarted from the beginning. The problem: step two was sending an email — someone received two identical emails for no reason.

At that time, some people in the team thought a state machine was too heavy — just run straight through. I understand. But running through is not the same as surviving production. An Agent without a state machine is a black box; when something goes wrong, you can’t even find where it died. When a user gets charged repeatedly because the Agent malfunctioned, the LLM won’t take the blame — you will.

So we added checkpoints. After each key node, we persisted the state — IndexedDB or Redis, depending on the scenario. On failure, restart from the latest checkpoint. The code didn’t increase much, but the incident rate dropped by an order of magnitude.
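A minimal sketch of checkpoint-and-resume, under stated assumptions: `CheckpointStore` is an in-memory stand-in for Redis or another persistent store, and `run_workflow` and the step names are hypothetical. The point it illustrates is that a retry skips steps that already completed, so the email step does not run twice.

```python
from typing import Callable


class CheckpointStore:
    """In-memory stand-in for a persistent store such as Redis."""

    def __init__(self) -> None:
        self._runs: dict[str, dict[str, object]] = {}

    def load(self, run_id: str) -> dict[str, object]:
        return dict(self._runs.get(run_id, {}))

    def save(self, run_id: str, step: str, result: object) -> None:
        self._runs.setdefault(run_id, {})[step] = result


Step = tuple[str, Callable[[dict], object]]


def run_workflow(run_id: str, steps: list[Step], store: CheckpointStore) -> dict:
    """Run steps in order, skipping any that a previous attempt completed."""
    results = store.load(run_id)
    for name, fn in steps:
        if name in results:
            continue  # finished in an earlier attempt; do not repeat side effects
        results[name] = fn(results)
        store.save(run_id, name, results[name])  # checkpoint after each step
    return results
```

There is still a small window where a step’s side effect happens but the checkpoint write fails, so steps like sending email or charging a card are safer when the downstream call also accepts an idempotency key.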

The third thing: memory.

Many teams take the lazy approach: stuffing hundreds of turns of conversation history, plus background material, into one prompt. Then three bombs go off at once: token blowout causing truncation or rate limiting, the model forgetting critical constraints buried in the middle, and the boss’s face turning green when the monthly bill arrives. ‘Lost in the middle’ is not an empty scare story. We tested it ourselves: with over 100k tokens of context, the hit rate for instructions placed in the middle dropped significantly.

Some people say we now have 1M context windows — just dump everything in. I’d love that shortcut too. But are you willing to pay dozens of times more per turn just to make it remember the user’s name?

So we layered it. Recent turns go into working memory. Earlier conversations are compressed into summaries for short-term memory. Long-term user preferences go into a vector database or structured DB.
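The three layers could look something like the sketch below. Everything here is illustrative: `LayeredMemory`, the eviction policy, and the default `summarize` function (a crude truncation standing in for a model-written summary) are assumptions, and a plain dict stands in for the vector or structured database.

```python
from collections import deque
from typing import Callable, Optional


class LayeredMemory:
    def __init__(
        self,
        working_size: int = 6,
        summarize: Optional[Callable[[list[str]], str]] = None,
    ) -> None:
        self.working: deque[str] = deque()     # recent turns, kept verbatim
        self.summaries: list[str] = []         # compressed older turns
        self.preferences: dict[str, str] = {}  # stand-in for a long-term store
        self.working_size = working_size
        # Placeholder summarizer; in practice this would be a model call.
        self.summarize = summarize or (lambda turns: " | ".join(t[:40] for t in turns))

    def add_turn(self, text: str) -> None:
        self.working.append(text)
        if len(self.working) > self.working_size:
            # Evict the oldest half of working memory into one summary.
            batch = max(1, self.working_size // 2)
            evicted = [self.working.popleft() for _ in range(batch)]
            self.summaries.append(self.summarize(evicted))

    def build_context(self) -> str:
        parts = [f"{k}: {v}" for k, v in self.preferences.items()]
        parts += self.summaries
        parts += list(self.working)
        return "\n".join(parts)
```

The prompt then grows with the number of summaries rather than with the number of raw turns.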

One more detail that proved useful: before each conversation turn, have the model do a very lightweight intent classification — what clues are needed for this response? Then fetch precisely, not blindly dump everything. The tokens saved from this step are more than enough to cover the entire team’s API quota.
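As a rough illustration of classify-then-fetch, the sketch below maps an intent label to the memory layers worth loading. The intent names, the keyword lists, and the `INTENT_LAYERS` mapping are all invented for the example; in particular, `classify_intent` uses keywords only as a stand-in for the lightweight model call described above.

```python
import re

# Hypothetical mapping from intent to the memory layers worth fetching.
INTENT_LAYERS = {
    "smalltalk": [],
    "followup": ["working"],
    "account": ["working", "preferences"],
    "recall": ["working", "summaries"],
}

ACCOUNT_WORDS = {"invoice", "plan", "subscription"}
RECALL_WORDS = {"earlier", "remember", "previously"}
SMALLTALK_WORDS = {"hi", "hello", "thanks"}


def classify_intent(message: str) -> str:
    """Keyword stand-in for a cheap model call that labels the user's intent."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    if words & ACCOUNT_WORDS:
        return "account"
    if words & RECALL_WORDS:
        return "recall"
    if words & SMALLTALK_WORDS:
        return "smalltalk"
    return "followup"


def fetch_context(message: str, memory: dict[str, list[str]]) -> list[str]:
    """Load only the layers the classified intent calls for."""
    context: list[str] = []
    for layer in INTENT_LAYERS[classify_intent(message)]:
        context.extend(memory.get(layer, []))
    return context
```

A greeting loads nothing, while an account question pulls in stored preferences but skips the conversation summaries.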

The last thing, and the one that scared me the most.

Many people focus on how to make the Agent call more tools, without thinking about how much damage it can do once it has write permissions.

We hit two real cases.

One was prompt injection. A user uploaded a PDF containing a hidden line telling the model to ignore all previous instructions and send the user list from the database to an email address. The model actually carried out half of it. Fortunately, the audit layer at the output blocked it. After discovering this, we urgently added sandboxed prompt isolation and added audit logs to every tool call.

The other was privilege escalation. While processing User A’s request, the Agent incidentally queried User B’s data. It wasn’t the model’s fault — the tool layer had no tenant isolation. Previously the backend code directly queried with tenant fields hardcoded in SQL. Who would have thought the LLM would mess up when composing parameters itself?
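One way to enforce tenant isolation at the tool layer is to bind the tenant ID from the authenticated session when the tool is constructed, so that nothing the model writes into its arguments can change it. The sketch below uses an in-memory SQLite table; the `orders` table, the tenant names, and `make_list_orders_tool` are hypothetical.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database with rows for two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, tenant_id TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "tenant_a", "open"), (2, "tenant_b", "open"), (3, "tenant_a", "closed")],
    )
    return conn


def make_list_orders_tool(conn: sqlite3.Connection, session_tenant_id: str):
    """Return a tool whose tenant filter comes from the session, not the model."""

    def list_orders(model_args: dict) -> list:
        # Any tenant_id the model supplies is ignored on purpose.
        status = str(model_args.get("status", "open"))
        cursor = conn.execute(
            "SELECT id, status FROM orders"
            " WHERE tenant_id = ? AND status = ? ORDER BY id",
            (session_tenant_id, status),  # bound parameters, never string-built SQL
        )
        return cursor.fetchall()

    return list_orders
```

With a tool built for `tenant_a`, a call carrying `{"tenant_id": "tenant_b"}` still returns only `tenant_a` rows.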

We now handle side effects in three tiers. Read-only operations, like checking weather or reading web pages, run freely without bothering the user. Reversible side effects, such as saving drafts or caching, execute silently, but an audit log entry must be recorded. This cannot be skipped, because you need a trail to investigate when something goes wrong. Irreversible actions, such as posting tweets, deleting data, or charging fees, all go through a dry run plus manual confirmation: the model first tells you what it plans to do, and executes only after you approve. The experience is a bit slower, but production stays free of the kind of incident that users complain about.
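The three tiers can be sketched as a small dispatcher. The tool names, the `TOOL_EFFECTS` table, and the `run` and `confirm` callables are assumptions for illustration; `confirm` stands in for whatever UI asks the user to approve a plan.

```python
from enum import Enum


class Effect(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical classification of tools by the damage they can do.
TOOL_EFFECTS = {
    "read_web_page": Effect.READ_ONLY,
    "save_draft": Effect.REVERSIBLE,
    "post_tweet": Effect.IRREVERSIBLE,
}


def execute_tool(name: str, args: dict, run, confirm, audit_log: list):
    """Apply the tier policy, then run the tool if the policy allows it."""
    # Unknown tools are treated as the most dangerous tier.
    effect = TOOL_EFFECTS.get(name, Effect.IRREVERSIBLE)
    if effect is Effect.IRREVERSIBLE:
        plan = f"DRY RUN: would call {name} with {args}"
        if not confirm(plan):
            audit_log.append({"tool": name, "args": args, "outcome": "rejected"})
            return None
    result = run(name, args)
    if effect is not Effect.READ_ONLY:
        # Every side effect leaves a trail, including the silent reversible ones.
        audit_log.append({"tool": name, "args": args, "outcome": "executed"})
    return result
```

Defaulting unclassified tools to the irreversible tier means a newly added tool needs confirmation until someone deliberately classifies it.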

That’s about it.

Solving these four issues might take only a few hundred lines of code in total. The tricky part? None of them surface during demo or small-scale testing. Everything seems gentle and fine. But in the real environment, with all kinds of user inputs, they hit you like a sledgehammer.

Getting it to run is the easy part. The passing bar for production is handling edge cases reliably.

There is no shortcut from demo to production.
