Your processes are supposed to get better. Almost none of them do. Here's what we learned trying to close the loop.

Reddit r/AI_Agents 05/16/26, 02:55 PM News

ai-agents operations best-practices security audit-logging permissions approval-gates

Summary

After 8 months of deploying AI agents on real operational tasks, the author shares five unexpected engineering challenges: per-capability permissions, credential isolation via a connector proxy, durable approval gates, hard budget caps, and out-of-process audit logs.

Spent the last 8 months trying to put AI agents on real ops work: vendor reviews, follow-ups, weekly reporting, internal-tool requests.

The biggest surprise: the model + prompt + tool calling part was the easy 80%. The hard 20% was making it OK for any sensible operator to actually let the thing run unsupervised.

Here are the five things we ended up building that I didn't expect to need at the start. Curious what other people are doing here.

1. **Per-capability permissions, not per-tool permissions.** The intuition is "this agent can use Tool X." Reality: Tool X does 40 things. You want to allow/deny/ask at the capability level (shell, network, git push, file writes, process spawn, credential read) and THEN do per-tool scoping inside that.
2. **A Connector Proxy pattern.** Credentials must never reach the model context. If they do, they end up in logs, prompts, and sometimes generated output. Solution: tools never see raw secrets.
3. **Approval gates as a runtime primitive, not a UI feature.** "Pause and wait for a human" is the most underrated agent feature nobody talks about. The runtime has to durably persist the run, serialize working memory, wait, and resume cleanly when the human acts.
4. **Budget caps as hard limits.** Per-run, per-day, per-workspace. Three modes: warn / require-approval / hard-fail. Every team I've watched run agents in prod has had a cost incident.
5. **An audit log that the agent can't write to with a normal action.** Most agent frameworks have logs that live in the agent's own process. When the agent dies, the log dies. Put it in a system that the agent CAN'T reach with a normal action.

What's missing from this list that you're seeing in your own agent deployments?
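To make items 1 and 4 a bit more concrete, here is a rough Python sketch of how an allow/deny/ask decision at the capability level and the three budget modes could share one decision type. This is not the implementation described in the post. The names `Policy`, `BudgetCap`, `check`, and `authorize`, the USD units, the deny-by-default behavior, and the rule that a per-tool override can only tighten a capability-level decision are all assumptions made for illustration.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

log = logging.getLogger("agent.policy")  # hypothetical logger name


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"  # pause the run and wait for a human


class BudgetMode(Enum):
    WARN = "warn"
    REQUIRE_APPROVAL = "require-approval"
    HARD_FAIL = "hard-fail"


# Higher number = stricter. Used so tool-level rules can only tighten.
_STRICTNESS = {Decision.ALLOW: 0, Decision.ASK: 1, Decision.DENY: 2}


@dataclass
class Policy:
    # Capability-level rules, e.g. {"network": ALLOW, "git_push": ASK}.
    capabilities: dict[str, Decision]
    # Per-tool scoping inside a capability: {(tool, capability): Decision}.
    tool_overrides: dict[tuple[str, str], Decision] = field(default_factory=dict)

    def check(self, tool: str, capability: str) -> Decision:
        # Assumption: capabilities that are not listed are denied.
        base = self.capabilities.get(capability, Decision.DENY)
        scoped = self.tool_overrides.get((tool, capability), base)
        # Return whichever of the two decisions is stricter.
        return max(base, scoped, key=_STRICTNESS.__getitem__)


@dataclass
class BudgetCap:
    limit_usd: float
    mode: BudgetMode
    spent_usd: float = 0.0

    def authorize(self, cost_usd: float) -> Decision:
        """Decide whether a step costing cost_usd may run; record it if so."""
        projected = self.spent_usd + cost_usd
        if projected <= self.limit_usd:
            self.spent_usd = projected
            return Decision.ALLOW
        if self.mode is BudgetMode.WARN:
            log.warning("budget exceeded: %.2f > %.2f", projected, self.limit_usd)
            self.spent_usd = projected
            return Decision.ALLOW
        if self.mode is BudgetMode.REQUIRE_APPROVAL:
            return Decision.ASK  # nothing is recorded until a human approves
        return Decision.DENY  # HARD_FAIL: the step does not run
```

In this sketch an `ASK` result from either check would feed the approval gate from item 3, and per-run, per-day, and per-workspace limits would be three separate `BudgetCap` instances, with the strictest answer winning.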

Similar Articles

Are we missing an operations layer for AI agents?

why AI agent pilots feel amazing but production deployment turns into a mess

Most of our “agent” problems turned out to be workflow/state problems

At what point does adding another agent actually hurt your system? Asking because my 6-agent pipeline is slower and less reliable than my old 2-agent one

the boring part of AI agents nobody builds and everyone needs
