Tag
The author explains why they often reject AI-generated code even when it works, citing an inability to explain the approach, overly large diffs, premature abstractions, and a reduced ability to reason about the system, and argues for mandatory human review.
This paper studies a deployed LLM-as-judge system for evaluating multi-turn conversational agents and finds it catches far fewer defects than human review, revealing a structured blind-spot taxonomy and routing failures.
This piece argues that traditional metrics such as accuracy and click-through rate are inadequate for evaluating AI agent recommendations, and proposes a more holistic, long-term evaluation that covers user understanding, trade-offs, and real-world problem-solving.
MindForge Guard is a CLI-first evidence layer that generates deterministic reports for single-agent AI workflows, enabling human review before trusting agent actions.