★ Follow-up to "Blaming the model won't fix your workflow": the paper is now a preprint. The real learnings: composable domains, a verification ratchet, and tool naming.

Reddit r/artificial 07/05/26, 07:07 AM Papers

agentic-systems verification composable-domains tool-naming common-lisp open-source preprint

Summary

The author announces a preprint on building an agentic system with composable domains, a verification ratchet, and careful tool naming, sharing practical lessons and an open-source Common Lisp implementation.

A month ago I posted the very rough beginnings of a paper. That rough version did not survive: it got pulled apart and rebuilt by the very process it describes, and what came out the other side is now a proper preprint with a DOI: https://doi.org/10.5281/zenodo.21139628.

Short version: the core claim held. The artifacts (specs, plans, executable graphs) and the verification gates wrapped around them have proven out on real work. Agents produce the work, the gates catch the defects, and a milestone only closes when the evidence is real, not when the model announces it is done.

Honestly, though, the headline result was not the most valuable thing I got out of building it. What I actually want to pass on is three things I learned making it work.

The first was composable domains. A "domain" in my setup is a bundle of instructions, skills, and tool access you hand an agent for a class of task. I built the first few as one-offs. Once I redesigned them to compose (stack cleanly, assume nothing about each other) they started turning up useful in places I had not planned for. A domain written for one workflow dropped straight into two others unchanged, and the same pattern is now carrying an entirely separate application build. Designing for composition instead of single use is the thing I would do first next time.

The second was the ratchet, and it needs a concrete example. An agent once delivered a load test asserting the record count was greater than or equal to zero. Green forever, catches nothing, and it looks completely normal in review. So the loop now runs like this: acceptance criteria are written before the code exists, the coding agent never writes tests at all, a fresh session verifies the code against those criteria, only then does another session derive regression tests from them, and a final step breaks the code on purpose to confirm each test can actually fail. A test that survives that is frozen, and later work runs against it and cannot silently undo it.
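The last step of that loop, breaking the code on purpose, can be sketched in a few lines. This is a minimal illustration in Python, not the Common Lisp implementation from the repository. Every name in it (`load_records`, `broken_load_records`, `can_be_frozen`, the sample rows) is hypothetical and chosen only to mirror the record-count example above.

```python
from typing import Callable, Iterable, List

Loader = Callable[[Iterable[str]], List[str]]

# Hypothetical sample input: three real rows and one blank row.
SAMPLE = ["alice", "bob", "", "carol"]


def load_records(rows: Iterable[str]) -> List[str]:
    """Hypothetical code under test: keep every non-blank row."""
    return [row.strip() for row in rows if row.strip()]


def broken_load_records(rows: Iterable[str]) -> List[str]:
    """Deliberately broken variant, used only to probe the tests."""
    return []


def vacuous_test(loader: Loader) -> None:
    # Mirrors the example in the post: a length is never negative,
    # so this assertion holds for any implementation at all.
    assert len(loader(SAMPLE)) >= 0


def derived_test(loader: Loader) -> None:
    # Derived from an assumed acceptance criterion written up front:
    # "blank rows are skipped, every other row is loaded".
    assert loader(SAMPLE) == ["alice", "bob", "carol"]


def passes(test: Callable[[Loader], None], loader: Loader) -> bool:
    """Run one test against one implementation and report the outcome."""
    try:
        test(loader)
    except AssertionError:
        return False
    return True


def can_be_frozen(test: Callable[[Loader], None],
                  good: Loader, broken: Loader) -> bool:
    """A test qualifies only if it passes on the real code
    and fails on the deliberately broken code."""
    return passes(test, good) and not passes(test, broken)


if __name__ == "__main__":
    for test in (vacuous_test, derived_test):
        verdict = can_be_frozen(test, load_records, broken_load_records)
        print(test.__name__, "freeze" if verdict else "reject")
```

The vacuous test is rejected because it stays green against the broken loader, while the derived test is accepted because it turns red. A real setup would presumably break the code in more than one way; a single hand-written broken variant is the simplest case that shows the idea.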
Standards move one way only. That killed a whole class of "looks done, isn't."

The third was dumber and more surprising: tool naming matters far more than it should. An agent routes off a tool's name, and the name drags the model's training priors with it. What fixed things was never cleverness: borrow names from tools the model already knows, mirror the built-in parameter vocabulary exactly (renaming one parameter from `code` to `content` ended a whole class of thrashing), and never let a familiar name lie about what the tool does. The kicker: a strong model absorbs a bad interface and hides the problem from you, so test your tool surface with the weakest model that can still do the work.

Everything above runs as an open reference implementation: the orchestrator, the verification cycle, the composable domains. To set expectations, this is not another 180-line agent loop. It is the third generation of a design that got ground out until it was useful rather than until it was postable, and it has only recently earned daily-driver status. It also passes the dogfood test, since the system's own development runs through its own gates, and the deepest bugs it ever caught were in itself. Fair warning before you click: it is Common Lisp. https://gitlab.com/naive-x/experimental/cl-naive-full-stack-agentic-system

Preprint is here if you want the formal version: https://doi.org/10.5281/zenodo.21139628.

Happy to take questions. And one worth asking of any agent-written suite: when did a test last fail because it caught wrong code? I could not answer that for mine, and that is where all of this started.
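On the tool-naming point, one cheap guard is a lint pass over your tool definitions that flags parameter names drifting from the vocabulary the model already knows. The sketch below is an illustration in Python, not something taken from the repository. The vocabulary set, the synonym table, and the `write_file` tool name are all assumptions; the real vocabulary would have to come from the built-in tool schemas of whatever model you run.

```python
from typing import Dict, List, Set

# Assumed vocabulary: parameter names the model's built-in tools use.
# Illustrative only; replace with the names from your model's own tools.
BUILTIN_PARAMS: Set[str] = {"path", "content", "command", "pattern"}

# Hypothetical table of near-miss names and the built-in name they shadow.
SYNONYMS: Dict[str, str] = {
    "code": "content",
    "text": "content",
    "file": "path",
    "filename": "path",
    "cmd": "command",
}


def lint_tool(name: str, params: List[str]) -> List[str]:
    """Return one warning per parameter that shadows a built-in name.

    Parameters that are neither built-in nor a known near-miss are
    left alone, since they may be legitimately domain-specific.
    """
    warnings: List[str] = []
    for param in params:
        if param in BUILTIN_PARAMS:
            continue
        if param in SYNONYMS:
            warnings.append(
                f"{name}: rename parameter '{param}' to '{SYNONYMS[param]}'"
            )
    return warnings


if __name__ == "__main__":
    for warning in lint_tool("write_file", ["path", "code"]):
        print(warning)
```

A check like this only catches the parameter-vocabulary half of the problem. Whether a familiar tool name lies about what the tool does still has to be caught by running the tool surface against a weaker model, as described above.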

Similar Articles

Blaming the model won't fix your workflow — a white paper on structural enforcement for AI agents

Learning to Construct Practical Agentic Systems

@tli104: New paper: "Self-Compacting Language Model Agents" LM agents build up long traces of reasoning and tool calls. As the t…

@yoheinakajima: in arxiv paper #2, i tackle the last topic from paper #1: @activegraphai as an architectural affordance for self-improv…

@Zephyr_hg: https://x.com/Zephyr_hg/status/2062176187384807488
