built an agent where the LLM is structurally forbidden from writing the final output. looking for feedback + people willing to break it

Reddit r/AI_Agents Tools

Summary

The author describes an AI agent designed to reproduce production Python crashes using LangGraph, featuring a unique architecture where the LLM plans actions but deterministic Python functions generate the final test code to ensure reliability.

Posting here because the constraint I landed on feels weird, and I want to know if anyone else has done something similar or thinks I'm wrong about it.

**Context:** I built an agent that reproduces production Python crashes. You give it a Sentry URL; the agent reads the stacktrace plus frame locals, decides which tools to call (repo introspection, dependency preparation, sandbox execution, etc.), and runs everything in a Docker sandbox. It either ends with a deterministic failing pytest you can paste into your repo, or a structured investigation report if it can't fully reproduce.

**The weird part:** The LLM is structurally not allowed to write the final test code or the audit artifact. Those bytes come from a pure, deterministic Python function that takes only the captured frame locals as input. The agent can plan, call tools, recover from dead ends, and reason about races, but when it's time to emit the actual test/artifact, a non-LLM codepath runs. The artifact always has `llm_in_evidence_path: false`.

The architecture is a LangGraph supervisor plus 11 tools. The agent gets graded on the deterministic output, not just the reasoning. Is this split worth the extra complexity, or am I over-engineering it? I have around 800 unit tests but no real external eval harness yet, which I know is the actual gap.

If you build agents and have thoughts on this architecture, I'd genuinely appreciate any feedback. Also: if you have a Python Sentry issue sitting unresolved (especially Django/FastAPI/Celery/SQLAlchemy), I'd love to run it through and see what breaks. Frame locals are the gold, so anything with the default Python SDK settings should work. DM or comment, whatever is easiest.
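To make the split concrete, here is a minimal sketch of the non-LLM codepath described above. All names (`build_failing_test`, `build_audit_artifact`, the parameters) are hypothetical illustrations, not the author's actual implementation: the point is that the final bytes are a pure function of the captured frame locals, so the same crash evidence always yields the same test, and the artifact can truthfully record `llm_in_evidence_path: false`.

```python
def build_failing_test(frame_locals: dict, exc_type: str,
                       func_name: str, module: str) -> str:
    """Pure and deterministic: same inputs -> byte-identical test.
    No LLM output ever flows through this function (hypothetical sketch)."""
    # Sort locals so argument order is stable across runs.
    args = ", ".join(f"{k}={v!r}" for k, v in sorted(frame_locals.items()))
    return (
        "import pytest\n"
        f"from {module} import {func_name}\n\n"
        "def test_reproduces_crash():\n"
        f"    with pytest.raises({exc_type}):\n"
        f"        {func_name}({args})\n"
    )

def build_audit_artifact(frame_locals: dict) -> dict:
    # The artifact records that the evidence path contained no LLM output.
    return {
        "frame_locals": frame_locals,
        "llm_in_evidence_path": False,
    }
```

The LLM's planning decides *which* frames and locals to capture, but once they are captured, everything downstream is this kind of deterministic template, which is what makes the output gradeable.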
