Five labs, five minds: building a multi-model finance drama on small models
Summary
A field report on building a multi-model finance drama game where each agent runs on a different lab's small model, demonstrating the engineering challenges and benefits of model heterogeneity.
Cached at: 06/08/26, 03:13 PM
Source: https://huggingface.co/blog/build-small-hackathon/thousand-token-wood-sim-v2
- Heterogeneity is the product, not a constraint
- Information asymmetry needs a firewall
- Memory is cheap drama if you bound it
- What actually happened
- Takeaways for building with small models
A second Build Small Hackathon field report: what happens when each agent in an emergent economy runs on a different lab’s small model, and the player becomes the financier pulling the strings.
The first version of Thousand Token Wood was a weather-god sandbox: five woodland creatures on one fine-tuned 0.5B model traded goods, and you poked the world with shocks and watched bubbles and crashes emerge. It was a nice toy. It was also something you watched rather than played.
v2 rebuilt it into a game you operate. You are the Patron of the Wood, a shadow financier: you lend at interest, whisper tips that may be true or planted, short the market, bribe, and broker alliances, while a magistrate hunts you for trading on what you should not know. The creatures remember how you treated them and scheme back. And the biggest change is under the hood: every creature now thinks with a different lab’s small model. This is the engineering report.
Heterogeneity is the product, not a constraint
The obvious way to run a council of agents is one model, many prompts. v2 runs four: gpt-oss-20b (OpenAI), MiniCPM3-4B (OpenBMB), Nemotron-Mini-4B (NVIDIA), and a fine-tuned Qwen 0.5B of my own. The point is not novelty for its own sake. A market is interesting when the participants genuinely differ, and four labs’ models trained on different data with different post-training are about as different as small models get. The owl hoards differently than the fox speculates. The council is a live argument, not a script.
Standing four distinct models up on one platform surfaced the real lesson: the friction is almost entirely at the serving layer, not the modeling layer.
- Current vLLM (0.22.1) JIT-compiles kernels at load and needs the CUDA toolkit (nvcc) present. A lean base image does not ship it, so all four models failed identically with “could not find nvcc” until I based them on a CUDA devel image. This was not a gpt-oss quirk; it was universal to the vLLM version. One image fix unblocked everything.
- gpt-oss-20b runs in its native MXFP4 quantization and fits a 24GB L4 with room to spare; no high-end GPU needed. It also speaks a channel format that wraps the answer in an analysis preamble, so the consumer has to extract the final channel.
- MiniCPM3 needed trust_remote_code; Nemotron loaded clean. Per-model footguns, each a one-line config.
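The final-channel extraction mentioned for gpt-oss can be sketched roughly as follows. This is a minimal sketch, not the project's code: it assumes the raw completion still carries harmony-style markers such as `<|channel|>final<|message|>`, and the function name `extract_final` is hypothetical. Check the markers against what your serving stack actually returns, since some servers strip them for you.

```python
import re

# Assumed harmony-style markers; verify against the raw text your server returns.
_FINAL = re.compile(
    r"<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|<\|end\|>|$)",
    re.DOTALL,
)


def extract_final(raw: str) -> str:
    """Return the text of the last final-channel message.

    Falls back to the stripped raw text when no marker is present, so
    models that do not use the channel format pass through unchanged.
    """
    matches = _FINAL.findall(raw)
    if matches:
        return matches[-1].strip()
    return raw.strip()
```

Because the fallback is a pass-through, the same function can sit in front of every model in the council rather than only the one that needs it.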
The thing that made four heterogeneous models tractable was the same primitive that made one model tractable in v1: a tolerant JSON parse-and-repair layer that every model’s output flows through. Different tokenizers and formatting habits produce different malformations; the parser drops what it cannot salvage and the simulation never crashes. Build that layer once and adding a model is a config entry, not a refactor.
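A tolerant parse-and-repair layer of this kind might look like the sketch below. It is an illustration under assumptions, not the parser used in the project: the function name `parse_action`, the required `action` key, and the specific repairs (slicing out the object, closing a truncated brace, dropping trailing commas) are all hypothetical choices.

```python
import json
import re
from typing import Optional


def parse_action(raw: str, required: tuple = ("action",)) -> Optional[dict]:
    """Best-effort parse of one JSON object out of noisy model output.

    Returns None when nothing usable can be salvaged, so the caller can
    drop the turn instead of crashing the simulation.
    """
    start = raw.find("{")
    if start == -1:
        return None  # no object at all, e.g. pure prose
    candidate = raw[start:]
    end = candidate.rfind("}")
    if end != -1:
        candidate = candidate[: end + 1]  # cut trailing chatter or code fences
    else:
        candidate += "}"  # output was truncated; try closing the object
    # Drop trailing commas before a closing brace or bracket.
    # Crude: this would also touch ",}" inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or any(key not in obj for key in required):
        return None
    return obj
```

The design choice that matters is the return type: every failure mode collapses to None, so the simulation loop has exactly one case to handle regardless of which model produced the text.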
Information asymmetry needs a firewall
The dramatic core of v2 is the insider tip. You can whisper a tip to a creature that is true (a real forecast of the next market mania the deck will draw, your genuine edge) or false (bait). Acting on a true tip and profiting raises your heat; cross a threshold and the magistrate opens an investigation that ends in a fine, frozen assets, or exile.
For that to be a real game, the truth of a tip must be hidden from the creatures. They see the rumor text; they must never see the flag. This is a security property, not a UI nicety, and small-model agents make it sharp: a model can only repeat back what you put in its prompt. So the hidden flag lives off-prompt entirely (on the player’s ledger), it is stripped from the public event record at construction, and the only thing the narrator ever summarizes is public events. A single test scans every creature’s full prompt, every turn, for the banned tokens. That test is the most important one in the suite. When you give an agent secret information, assume it will leak unless a test proves it cannot.
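The shape of that firewall and its test can be sketched as below. All names here are hypothetical (`Tip`, `public_event`, `build_prompt`, `find_leaks`, and the banned-token spellings); the sketch only illustrates the data flow described above, where the flag stays on the ledger object and is never copied into anything a prompt is built from.

```python
from dataclasses import dataclass

# Hypothetical spellings of the hidden flag that must never reach a prompt.
BANNED = ("is_true", "is_bait", "truth_flag")


@dataclass(frozen=True)
class Tip:
    text: str
    is_true: bool  # stays on the player's ledger only


def public_event(tip: Tip, target: str) -> dict:
    """Build the public record; the flag is not copied in at construction."""
    return {"kind": "rumor", "target": target, "text": tip.text}


def build_prompt(creature: str, events: list) -> str:
    """Assemble a creature's prompt from public events only."""
    lines = [f"You are {creature}."]
    lines += [f"Rumor heard: {e['text']}" for e in events if e["target"] == creature]
    return "\n".join(lines)


def find_leaks(prompts: list, banned: tuple = BANNED) -> list:
    """Return (prompt_index, token) for every banned token found in any prompt."""
    return [
        (i, token)
        for i, prompt in enumerate(prompts)
        for token in banned
        if token.lower() in prompt.lower()
    ]
```

In a test suite, the scan would run over every creature's full prompt on every turn of a seeded run and assert that `find_leaks` returns an empty list.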
Memory is cheap drama if you bound it
Creatures carry persistent relationships: a signed sentiment toward the Patron and toward each other, nudged by events (you shorted my crop, you repaid your loan, you allied me with a rival). A creature that turns hostile refuses your loans and quotes you worse; allied creatures stop undercutting each other and behave like a cartel.
The trap is prompt inflation. Raw history grows without bound and a small model drowns in it. The fix is to never put history in the prompt: the model sees a one-line bucketed summary (“you feel warmly toward Oona, wary of the Patron”), capped to the few strongest feelings, derived from integer sentiment. Notes are kept for traces but bounded and never shown. The behavioral bias is part emergent (the summary nudges the model) and part mechanical (a strongly hostile creature deterministically refuses), so it is observable and testable rather than a hope.
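A bounded summary of this kind could be derived along these lines. The thresholds, the cap of two feelings, and the names `bucket`, `summarize`, and `refuses_loan` are assumptions made for illustration; the post does not give the actual values.

```python
from typing import Dict, Optional

MAX_FEELINGS = 2        # assumed cap on feelings shown in the prompt
HOSTILE_REFUSAL = -6    # assumed score at which refusal becomes deterministic


def bucket(score: int) -> Optional[str]:
    """Map an integer sentiment to a coarse phrase, or None if too weak to mention."""
    if score >= 3:
        return "warmly toward"
    if score <= HOSTILE_REFUSAL:
        return "hostile toward"
    if score <= -3:
        return "wary of"
    return None


def summarize(sentiment: Dict[str, int], cap: int = MAX_FEELINGS) -> str:
    """One-line summary of the strongest feelings; raw history never enters the prompt."""
    strongest = sorted(sentiment.items(), key=lambda kv: abs(kv[1]), reverse=True)
    phrases = []
    for name, score in strongest:
        label = bucket(score)
        if label is not None:
            phrases.append(f"{label} {name}")
        if len(phrases) == cap:
            break
    return "you feel " + ", ".join(phrases) if phrases else ""


def refuses_loan(sentiment: Dict[str, int], lender: str = "the Patron") -> bool:
    """Mechanical half of the bias: strong hostility refuses regardless of the model."""
    return sentiment.get(lender, 0) <= HOSTILE_REFUSAL
```

Keeping `refuses_loan` outside the model is what makes the behavior testable: the summary can nudge, but the refusal does not depend on the model picking up the hint.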
What actually happened
A representative council run, with the full v2 mechanics live:
| Lever | Result |
| --- | --- |
| Models in the council | 4 labs, all under the 32B cap, served on Modal |
| Fine-tuned 0.5B reliability | 0% self-buys, 100% valid offers (beats its 3B teacher) |
| Truth firewall | 0 leaks of a tip’s hidden flag across every prompt scanned |
| Insider tip edge | a true-tip pre-position settles a positive P&L; a false tip does not |
| Heat to investigation | two clean suspicious wins cross the magistrate’s line |
| Ruin | a margin call and a loan default banish a creature, who returns a chapter later |

A single seeded run exercising the Patron, the information war, relationships, and leverage end to end.
Takeaways for building with small models
A small model is a reliable format generator and an unreliable reasoner; you close the gap with structure, prompting, and a small fine-tune, not with scale. A heterogeneous council is more interesting than a homogeneous one and costs you only config once the serving layer is solid. Secret information given to an agent is a firewall problem, and the firewall belongs in the data flow, proven by a test, not in a prompt instruction. And persistent memory is the cheapest way to make agents feel alive, as long as the prompt only ever sees a bounded summary.
Small models, big adventures. The whole council is open, and so are the traces.