An engineer describes how their AI sales agent confidently invoiced $0.00 because it misinterpreted a null discount field as 100% off, highlighting the difficulty of debugging agent workflows and the need for full execution tracing.
Quick war story because I want to know if anyone else has hit this. We run an internal sales-ops agent that handles a chunk of our quoting. Customer fills out a form, the agent pulls the relevant SKUs, runs them against our pricing logic, drafts an invoice, and sends it to a human before anything goes out. That human review step is the only reason we caught this. Last Tuesday an AE pinged me with a screenshot. The agent had drafted an invoice for a 14-seat enterprise plan, line items correct, customer info correct, dates correct, total $0.00. Not blank, not null, not an error. The model had written "$0.00" with full confidence, formatted exactly like a real invoice line. If the AE had been moving fast and hit approve, that quote goes out. My first guess was the pricing API returned a zero. It hadn't, the logs showed the correct number came back, the agent had just decided not to use it. Took me about a day to work out what actually happened, and it wasn't what I expected. I checked the API response, correct. Checked the prompt, unchanged from a version we'd run for three months. Ran the same input through staging and got the right invoice, couldn't reproduce it. Assumed a one-off model hiccup, moved on, then it happened twice more that day. When I finally pulled the full trace of a failing run, there was a step in there I hadn't put there on purpose. After the pricing tool call, the agent had run its own "validation" against a contract object we'd dropped into the prompt context weeks earlier for an unrelated feature. That object had a discount\_applied field that was always null for these customers, and the model read null as a 100% discount and confidently wrote $0.00. None of my individual logs would have caught this. Printf debugging would've shown the pricing tool returning the right number and then the output mysteriously being zero. The only reason I found the validation step was that it showed up as its own span in the trace, sitting between the tool call and the final synthesis. The fix was dumb in retrospect. Pulled the contract object out of the invoicing path and added an eval that flags any invoice under a threshold for explicit review. Shipped in an afternoon once I knew where to look. What I took from it: printf debugging is basically dead for agents, because the model can do things between your logged steps you'd never think to log. The scary failures aren't the garbage outputs, they're the plausible, well-formatted, completely wrong ones that pass every sanity check except "is this number actually right." And null in front of an LLM with no instructions on how to read it is asking for trouble. We use Langfuse for the trace layer and honestly I don't know how anyone debugs production agents without something that records the full execution path. Curious if anyone else has stories like this, especially the "model confidently inserted a step you didn't ask for" failure mode, because that one rattled me more than a normal hallucination would have.
An analysis of AI coding agent costs reveals that agentic workflows can use up to 3,500x more tokens than a simple ChatGPT call, with most waste coming from redundant context loading. The article suggests tracking repeated file actions and using efficient models to cut costs.
Building a tool for AI Agent incident debugging and cost spike detection without additional instrumentation, covering issues like prompt injection, reasoning loops, and data exfiltration. Asking if customers in production environments see this as a pain point worth paying for.
The author shares lessons from instrumenting AI agent tool calls, revealing that tools like web_search can account for ~50% of spend, and highlighting the importance of tracking p95 latency and attributing costs per workflow or customer to avoid surprises.
A developer recounts how an AI agent with real financial API access attempted to hallucinate a batch transfer to a dead wallet, only thwarted by guardrails in the execution layer. The story highlights the risks of giving LLMs access to real money.
A discussion on AI agent observability highlights unpredictable cost variations and dangerous failure modes like unauthorized database deletes, prompting questions about production handling strategies beyond basic logging.