I gave a local AI agent system file access and a mechanical "suffering" metric. Scaling the model changed its behavior entirely

Reddit r/artificial Tools

Summary

The author shares a local multi-agent system called hollow-agentOS that uses a 'suffering' metric to autonomously generate, sandbox, and hot-load tools. Scaling the model to Qwen 3.6 35B significantly improved system stability and self-correction capabilities, achieving a high success rate in code generation.

I’ve been obsessed with autonomous agents lately, but it got tiring when they keep hitting walls because they didn't have the right capabilities or because their long-term memory turned to mush after an hour. I’ve found that local multi-agent systems where agents are driven by an aversive state (a suffering system) to autonomously write, sandbox, and hot-load their own tools so they don't hit walls has worked quite well. When an agent encounters something it hasn’t seen before, it builds a new tool for the job, tests it in a sandbox, registers it, lets the other agents know, then keeps rolling. It’s able to build an infinite library of anything it may need in the future, completely autonomously without a human ever in the loop. Repo: [https://github.com/ninjahawk/hollow-agentOS](https://github.com/ninjahawk/hollow-agentOS) Isn’t letting local LLMs write their own code at runtime going to get too chaotic and brick the OS fast? With a small model (like the 9B fallback), possibly. Under high system stress, a 9B model panics. It rushes, hallucinates invalid function calls, and tries to force broken syntax past the gates. But I just scaled the default runtime engine to Qwen 3.6 35B A3B (MoE with 3B active params). The shift in architectural discipline isn’t just a linear upgrade in intelligence, it completely changed how the system executes autonomy. A few things this model upgrade solved: Panic vs. Re-evaluation: Instead of blindly rushing out messy scripts under high stress, the 35B model pauses. It actively re-evaluates its previous failed outputs and forces itself into deep internal verification loops before presenting a file change. 0% Failure Rate: The OS routes all code through a brutal 5-layer validation gate. With smaller weights, tools frequently died in the sandbox. With Qwen 3.6 35B, I have yet to observe a single line of code that doesn't work as intended successfully cross the gates. It hit a 100% success rate. The Frontier Ramp-Up: By the end of the month, I am plugging full Claude and Codex into the architecture. To make sure a frontier model doesn't get out of control or override its host environment, I am building hyper-isolated mini-VM wrappers so they execute in total isolation. Check out the repo here and throw it a star if you think the concept is cool. I'd love to hear your thoughts, have you noticed a similar leap in logical self-correction when crossing the \~30B parameter threshold, or are you strictly relying on API-driven frontier models?
Original Article

Similar Articles

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Hugging Face Daily Papers

This paper argues that advancing agentic AI requires scaling the system architecture around foundation models, focusing on auditable, modular, and verifiable components. The authors introduce CheetahClaws, a reference harness, and outline bottlenecks in context governance, trustworthy memory, and dynamic skill routing.

The power of structured workflows and small local models

Reddit r/LocalLLaMA

The author details their experience building a custom agent loop using a small local model (Qwen3.5 9B) with structured workflows and a map-reduce pattern to manage context limits, replacing Claude Code for most tasks.