Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Hacker News Top 05/19/26, 12:23 PM Tools

guardrails agentic-tasks llm-tool-calling self-hosted open-source python reliability

Summary

Forge is a reliability layer for self-hosted LLM tool-calling that uses guardrails and context management to dramatically improve performance on multi-step agentic tasks, lifting an 8B local model from 53% to 99% accuracy.

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.What it does:- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it- Ships with an eval harness and interactive dashboard so you can reproduce every numberI wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.Demo video: <a href="https://youtu.be/MzRgJoJAXGc" rel="nofollow">https://youtu.be/MzRgJoJAXGc</a> (side-by-side: same model, same task, with and without Forge guardrails)The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.How to try it:- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.Repo: <a href="https://github.com/antoinezambelli/forge" rel="nofollow">https://github.com/antoinezambelli/forge</a>Paper: <a href="https://www.caisconf.org/program/2026/demos/forge-agentic-reliability/" rel="nofollow">https://www.caisconf.org/program/2026/demos/forge-agentic-re...</a> <a href="https://github.com/antoinezambelli/forge/blob/main/docs/forge_ieee_preprint.pdf" rel="nofollow">https://github.com/antoinezambelli/forge/blob/main/docs/forg...</a>Dashboard: <a href="https://github.com/antoinezambelli/forge/docs/results/dashboard.html" rel="nofollow">https://github.com/antoinezambelli/forge/docs/results/dashbo...</a>

Original Article

View Cached Full Text

Cached at: 05/19/26, 10:05 PM

antoinezambelli/forge

Source: https://github.com/antoinezambelli/forge

forge

A reliability layer for self-hosted LLM tool-calling. Forge lifts an 8B local model to the top of its class on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction). The current top self-hosted config (Ministral-3 8B Instruct Q8 on llama-server) scores 86.5% across forge’s 26-scenario eval suite — and 76% on the hardest tier.

Three ways to use it:

WorkflowRunner — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. SlotWorker adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you’re building on forge directly.
Guardrails middleware — Use forge’s reliability stack (composable middleware) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.
Proxy server — Drop-in OpenAI-compatible proxy (python -m forge.proxy) that sits between any client (opencode, Continue, aider, etc.) and a local model server. Applies guardrails transparently — the client thinks it’s talking to a smarter model.

Supports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends.

Requirements

Python 3.12+
A running LLM backend (see below)

Install

pip install forge-guardrails                # core only
pip install "forge-guardrails[anthropic]"   # + Anthropic client

For development:

git clone https://github.com/antoinezambelli/forge.git
cd forge
pip install -e ".[dev]"

Backend setup (pick one)

llama-server (recommended — top 10 eval configs all run on llama-server):

# Install from https://github.com/ggml-org/llama.cpp/releases
llama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080

Ollama (alternative — easier setup, slightly weaker on harder workloads):

# Install from https://ollama.com/download
ollama pull ministral-3:8b-instruct-2512-q4_K_M

Anthropic (API, no local GPU needed):

pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY=sk-...

See Backend Setup for full instructions and Model Guide for which model fits your hardware.

Quick Start

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

For multi-step workflows, multi-turn conversations, and backend auto-management, see the User Guide. If you’re building a long-running session (CLI, chat server, voice assistant), see the long-running session advisory for important guidance on filtering transient messages.

Proxy Server

Drop-in replacement for a local model server. Point any OpenAI-compatible client at the proxy and get forge’s guardrails for free.

# External mode — you manage llama-server, forge proxies it
python -m forge.proxy --backend-url http://localhost:8080 --port 8081

# Managed mode — forge starts llama-server and the proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081

Then configure your client to use http://localhost:8081/v1 as the API base URL.

Note: The proxy automatically injects a synthetic respond tool when tools are present in the request. The model calls respond(message="...") instead of producing bare text, keeping it in tool-calling mode where forge’s full guardrail stack applies. The respond call is stripped from the outbound response — the client sees a normal text response (finish_reason: "stop") and never knows the tool exists. This is essential for small local models (~8B), which cannot be trusted to choose correctly between text and tool calls — guiding them to a tool is a must. See ADR-013 for the full analysis.

Backends

Backend	Best for	Native FC?
Ollama	Easiest setup, model management built-in	Yes
llama-server	Best performance, full control	Yes (with `--jinja`)
Llamafile	Single binary, zero dependencies	No (prompt-injected)
Anthropic	Frontier baseline, hybrid workflows	Yes

See Backend Setup for installation and Model Guide for which model to pick.

Running Tests

python -m pytest tests/ -v --tb=short

python -m pytest tests/ --cov=forge --cov-report=term-missing

Eval Harness

26 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows — split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end separation. See Eval Guide for full CLI reference.

# llama-server (start in another terminal first; see Eval Guide)
python -m tests.eval.eval_runner --backend llamafile --llamafile-mode prompt --gguf "path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf" --runs 10 --stream --verbose

# Batch eval (JSONL output, automatic resume)
python -m tests.eval.batch_eval --config all --runs 50

# Reports (ASCII table, HTML dashboard, markdown views)
python -m tests.eval.report eval_results.jsonl

Project Structure

src/forge/
  __init__.py          # Public API exports
  errors.py            # ForgeError hierarchy
  server.py            # setup_backend(), ServerManager, BudgetMode
  core/
    messages.py        # Message, MessageRole, MessageType, MessageMeta
    workflow.py        # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow
    inference.py       # run_inference() — shared front half (compact, fold, validate, retry)
    runner.py          # WorkflowRunner — the agentic loop
    slot_worker.py     # SlotWorker — priority-queued slot access
    steps.py           # StepTracker
  guardrails/
    nudge.py           # Nudge dataclass
    response_validator.py  # ResponseValidator, ValidationResult
    step_enforcer.py   # StepEnforcer, StepCheck
    error_tracker.py   # ErrorTracker
  clients/
    base.py            # ChunkType, StreamChunk, LLMClient protocol
    ollama.py          # OllamaClient (native FC)
    llamafile.py       # LlamafileClient (native FC or prompt-injected)
    anthropic.py       # AnthropicClient (frontier baseline)
  context/
    manager.py         # ContextManager, CompactEvent
    strategies.py      # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact
    hardware.py        # HardwareProfile, detect_hardware()
  prompts/
    templates.py       # Tool prompt builders (prompt-injected path)
    nudges.py          # Retry and step-enforcement nudge templates
  tools/
    respond.py         # Synthetic respond tool (respond_tool(), respond_spec())
  proxy/
    proxy.py           # ProxyServer — programmatic start/stop API
    server.py          # Raw asyncio HTTP server, SSE streaming
    handler.py         # Request handler — bridge between HTTP and run_inference
    convert.py         # OpenAI messages ↔ forge Messages conversion
tests/
  unit/                # 865 deterministic tests — no LLM backend required
  eval/                # Eval harness — model qualification against real backends

Documentation

User Guide — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory
Model Guide — Which model and backend for your hardware
Backend Setup — Backend installation and server setup
Eval Guide — Eval harness CLI reference, batch eval
Architecture — Full design document
Workflow Internals — Workflow design and runner internals
Contributing — How to set up, test, and add new backends or scenarios

Paper

The forge guardrail framework and ablation study are published as:

Zambelli, A. Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling. https://doi.org/10.1145/3786335.3813193

A pre-publication preprint is also available at docs/forge_ieee_preprint.pdf — kept as a historical artifact. Cite the published version above; the DOI link may not resolve immediately depending on the publisher’s release timing.

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

antoinezambelli/forge

forge

Requirements

Install

Backend setup (pick one)

Quick Start

Proxy Server

Backends

Running Tests

Eval Harness

Project Structure

Documentation

Paper

License

Similar Articles

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

How should teams review AI-assisted work before trusting it?

Robust and Efficient Guardrails with Latent Reasoning

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Submit Feedback

Similar Articles

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

How should teams review AI-assisted work before trusting it?

Robust and Efficient Guardrails with Latent Reasoning

@FeitengLi: #面壁智能 Opensources #ForgeTrain, a pretraining framework autonomously written by an AI Agent, even the CUDA kernels were written by itself. On H100, MiniCPM4-0.5B reaches 44% MFU, higher than Megatron (NVidia's main push for GPT implementation) baseline by about 10%. Starting AI self-evolution iteration

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast