How are teams handling prompt QA at scale?
Summary
A practitioner at a company handling ~40k conversations/month describes the bottleneck of manual prompt QA and asks how teams are using automated systems to detect regressions and user frustration in production.
Similar Articles
One line system prompt change dropped model quality from 84% to 52%. How are people monitoring semantic quality in production?
A developer shares their experience of a single system prompt change degrading LLM response quality without triggering traditional monitoring alerts, and describes internal tooling they built to monitor semantic quality in production LLM applications.
How are you letting non-engineer teammates edit prompts in production?
The author discusses challenges in allowing subject matter experts to edit prompts in production for AI agents in regulated domains, and shares a solution using a prompt editor with GitHub backend for version control.
I keep abandoning multi-agent setups because I can't verify the code they ship. How are you handling this?
A developer shares their frustration with multi-agent coding setups where verifying the output of parallel PRs is impractical, and describes building an AI QA agent that uses a real browser (via Browserbase) to automatically click through preview deploys and fail PRs that don't work as expected.
@russelljkaplan: Having to manually prompt your coding agent to do work will soon feel like a UX bug. The future is setting up automatio…
Cognition has released Devin Auto-Triage, a feature that automates the monitoring and triaging of bugs, alerts, and incidents, allowing the AI coding agent to proactively work before the user logs on.
Anyone actually running AI agents in production with real users - not demos, not 10 beta testers. What's your stack? And has anyone moved back to traditional code after trying agents in prod - why?
A discussion prompt asking about real-world AI agent deployments with 100+ users, covering tech stacks and scaling issues, plus experiences of moving back to traditional code.