production-agent

#production-agent

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

arXiv cs.CL · 6d ago

This paper introduces layer-isolated evaluation for LLM agents: it decomposes a production agent into architectural layers, each tested with a deterministic, no-LLM harness. It demonstrates that per-slice baseline testing localizes regressions that aggregate metrics mask, and validates this with controlled regression injections across multiple tenants.

#production-agent

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

arXiv cs.CL · 2026-06-10

This paper studies a deployed LLM-as-judge system for evaluating multi-turn conversational agents and finds that it catches far fewer defects than human review. The analysis yields a structured taxonomy of the judge's blind spots and identifies routing failures.

#production-agent

Prompt injection took down a production agent last week — here's what our post-mortem found

Reddit r/AI_Agents · 2026-06-05

A production AI support agent was compromised via prompt injection, exposing other customers' data. The post-mortem found a lack of enforcement layers, audit trails that proved useless, and no kill switch, pointing to systemic security gaps in how AI agents are deployed.
