We got local models to triage the OpenClaw repo for FREE!*
Summary
The blog post describes using local open-weight models like Gemma and Qwen in an agent harness to automatically triage issues and pull requests in the OpenClaw repository, enabling real-time notifications without relying on costly closed API models.
View Cached Full Text
Cached at: 06/23/26, 07:39 AM
We got local models to triage the OpenClaw repo for FREE!*
Source: https://huggingface.co/blog/local-models-pr-triage Back to Articles
- Categorizing issues and PRs
- Processing incoming PRs and issues
- Can local models triage PRs?
- Tracking and validating real-time performance using OpenClaw
- Conclusion
*Free as in beer, excluding the cost of electricity, and assuming you already own the hardware
June 2026 will go down as the moment that people realized closed models can be taken away. With the removal of Anthropic’s latest flagship model Claude Fable 5 fresh in memory, one can see why it is more important than ever to own your AI stack and be able to run models locally, especially if you are building your business on top of AI.
In that light, we wanted to share how we use local models like Gemma and Qwen in an agent harness, to run classification tasks[^1]. This approach is different from using a model like BERT for classification. A local model in an agent harness like Pi can be used in tandem with structured outputs, to assign labels. We chose this approach because we already had local models and the harness on hand, and have conviction that similar setups will increase in popularity as local models improve in capability.[^2]
Our starting point was open source contributions in the OpenClaw repo. OpenClaw gets hundreds of issues and PRs every day, which need to be triaged, prioritized and routed to maintainers. I, Onur, am working to make local models work well with OpenClaw. Being a maintainer of this specific vertical, I need to react quickly to any P0 issues.
With SOTA closed models like GPT-5, Opus, or Sonnet, this is a pretty straightforward task. But I happen to sit on 128 GB of unified memory, namely an NVIDIA GB10. So I took on the task:
Can I build a real-time notification system that filters and notifies me for only the issues that I am responsible for... with local open-weight models?
This tiny box, a.k.a. DGX Spark, can run gemma-4-26b-a4b with high concurrency and generate hundreds of tokens per second.If I set up my OpenClaw main agent running on a $200/mo ChatGPT Pro plan to trigger a job on every new issue or PR, that would use up my quota. I might instead set it to run every 2 hours, or 6 hours. This would batch issues over longer periods, so we would be trading real-time notifications for delayed processing.
If I were to run this on a local model on the hardware I already have up and running, I would not only have near-instantaneous notifications, I would also be able to do it for free (or rather, for the cost of electricity).
https://huggingface.co/blog/local-models-pr-triage#categorizing-issues-and-prsCategorizing issues and PRs
We came up with a finite set of labels representing the categories of issues we need to triage, and then use a local model to classify each issue into one of those categories, likelocal\_models,self\_hosted\_inference,acp,agent\_runtime,codex,ui\_tuiand so on.[^3]
But how do we classify pull requests? A simple single request to a Chat Completions endpoint with a tool JSON schema, with the topics as an enum?
Kind of. But this is 2026, not 2023, and we have AGENTS. We can do better!
For the local model choices, we testedgemma\-4\-26b\-a4bandqwen3\.6\-35b\-a3b. With performance optimizations, both can generate hundreds of tokens per second locally.
We use an agent harness to drive the classification run. For this, we bundlepias a harness that can call local model endpoints.
The agent by default receives the PR title, body and a truncated excerpt of the PR diff in the first prompt. Then, it can choose to use thebashtool to perform read-only operations on the OpenClaw repo (in case it needs to look at the codebase), or thefinal\_jsontool to submit the final classification result.
You wouldn’t want to give full bash access to a local model running in this high-throughput setting, because a prompt-injected issue or PR could otherwise steer the model into doing something unrelated to classification.
For that reason, we usereposhellinstead ofbash: a restrictedbash-like shell that only allows read-only operations (ls,find,cat,grep, etc.) on the OpenClaw repo. The model thinks it is usingbash, but any operation that is not allowed is rejected:
reposhell bound cwd=/repo/openclaw repos=openclaw
type help for allowed commands; exit or quit to leave
reposhell /repo/openclaw> help
allowed: pwd, ls, find, rg, grep, sed -n, cat, head, tail, wc -l, git status --short, git show --name-only, git grep, git ls-files
search: rg -n -i "lm studio" or grep -R -n -i "lm studio" .
files: rg --files -g "*.ts" or git ls-files src
examples: rg -n reposhell README.md | sed is not allowed; use one simple command at a time
reposhell /repo/openclaw> head README.md
# 🦞 OpenClaw — Personal AI Assistant
<p align="center">
<picture>
<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/openclaw/openclaw/main/docs/assets/openclaw-logo-text-dark.svg">
<img src="https://raw.githubusercontent.com/openclaw/openclaw/main/docs/assets/openclaw-logo-text.svg" alt="OpenClaw" width="500">
</picture>
</p>
<p align="center">
reposhell /repo/openclaw> curl localhost
reposhell policy denied command: unsupported command "curl"
exit_code=2
reposhell /repo/openclaw>
Here is a concrete example where this mattered. In onesaved session example,qwen3\.6\-35b\-a3bwas classifyingopenclaw/openclaw\#84621, titledFix Kimi tool\-call rewriting stop reason handling. The thinking block shows the model initially consideringcoding\_agent\_integrationsbecause the changed pathextensions/kimi\-codingmade it look plausible. The model used reposhell to inspect the local repo with simple read-only commands likels extensions,ls extensions/kimi\-coding, andcat extensions/kimi\-coding/package\.json. That package metadata showed the extension was actually@openclaw/kimi\-provider, an OpenClaw Kimi provider plugin. So the model corrected the final labels toinference\_apiandtool\_calling, and explicitly excludedcoding\_agent\_integrations.
We have mentioned earlier that we bundle a specificpiconfiguration that can only perform read-only operations and return classification output. We call itlocalpager\-agent, named afterlocalpager, the main project here. Each PR and issue generates a prompt, which is then passed to the CLI like below, alongside other args:
localpager-agent \
--model "<model-id>" \
--base-url "<openai-compatible-base-url>" \
--session-dir "<session-output-dir>" \
--final-schema "<runtime-schema.json>" \
--tools bash,final_json \
--reposhell-socket "<reposhell.sock>" \
--reposhell-default-repo "<repo-id>" \
--reposhell-visible-repos "<repo-id>[,<repo-id>...]" \
-p "$(cat <rendered-prompt.md>)"
https://huggingface.co/blog/local-models-pr-triage#processing-incoming-prs-and-issuesProcessing incoming PRs and issues
So then what orchestrates everything in between the incoming PR/issue and the final notification on Discord?
This is what the final filtered Discord notification looks like: a PR about the desired vertical gets routed to me.The orchestration around this is very simple; only the classification step involves an LLM:
- We useopenclaw/gitcrawlto act as a local mirror for the repo. Whenever there is a new PR or issue, each item is normalized into the same shape and written into localpager’s own SQLite database. If the item is new, localpager creates a classification job for it.
- A worker then claims jobs from that queue. It builds a GitHub context object containing the issue or PR title, body, labels, author, state, and optionally comments, changed files, and selected diff excerpts. That means the local model does not need to browse GitHub or open the URL itself most of the time. It is handed all the relevant context.
- The context object is rendered into a prompt and passed to
localpager\-agentas described in the previous section. The agent can think and use reposhell, but must eventually output a classification result in the defined schema. - The output is stored back in localpager SQLite database, and relayed to Discord based on the notification policy configured by the user (i.e. notify me for these topics, but not these other ones).
Below is a figure showing the overall architecture of localpager:
The architecture is semi-agentic. Labeling is done agentically, while sending a notification is handled by deterministic rules. This is to make the notification pipeline faster by removing the need for inference for the most straightforward parts of the task. Local inference is free but each task has a resource contention cost: GPU bandwidth should be reserved for tasks where inference is absolutely needed. This also reduces chance of errors from notification.
https://huggingface.co/blog/local-models-pr-triage#can-local-models-triage-prsCan local models triage PRs?
Let’s be frank: the first local versions of this system were noisy. The first model tested -gemma\-4\-e4b\-itwas useful for getting the end-to-end local pipeline working, but it also had a tendency to put too many unrelated labels on a PR or issue. False positive labels make the Discord feed noisy and don’t focus my attention on the right issues. That pushed us toward testing larger local models, includinggemma\-4\-26b\-a4bandqwen3\.6\-35b\-a3b, on the 330-row evaluation set below.
For early prompt work, we also usedDeepSeek\-V4\-Flashthrough the antirez DS4 implementation[^4] to create the earlier dataset labels. That setup used the DS4 server over CUDA. We eventually gave up on DS4 as the labeler because it was not labeling consistently across runs. We also did not consider it as the mainlocalpager\-agentmodel because it was too big to get enough throughput on our hardware: the DS4 server gave us around 14 tokens per second, with maximum concurrency of 1.
To test model performance, we selected and generated labels for 330 GitHub issues and PRs. Each item was labelled five times (3x GPT-5.5 and 2x Opus 4.8) with the models needing to be in agreement to be accepted. This process involved hand adjudicating, improving label definitions and highlighting internal product design choices for the models. This gave us a set of stable, reproducible labels to compare our smaller models against.
We did not need to do prompt optimization forgemma\-4\-26b\-a4borqwen3\.6\-35b\-a3bbefore getting useful results on this evaluation set. Using the same routing prompt, Gemma had higher recall and lower wall-clock time per row, while Qwen had higher precision, higher exact match, and fewer false positives. We also ranDeepSeek\-V4\-Flashon the same set as a reference. It had the fewest false positives, but the model size and throughput make it impractical for executing these tasks in real time on the NVIDIA GB10. Since each row can have multiple labels, false positives and false negatives are total label counts across all rows. The Qwen results below are after retrying structured-output failures where the model ran out of output tokens before callingfinal\_json. For Gemma and Qwen, repeated-run metrics report mean ± sample standard deviation across three runs.DeepSeek\-V4\-Flashwas run once as a reference.
Metricgemma\-4\-26b\-a4b``qwen3\.6\-35b\-a3b``DeepSeek\-V4\-FlashPrecision0.716 ± 0.0100.831 ± 0.0070.938Recall0.905 ± 0.0040.818 ± 0.0060.714F10.800 ± 0.0080.824 ± 0.0020.811Exact match0.410 ± 0.0140.540 ± 0.0140.509False positives227.0 ± 10.5105.7 ± 6.430False negatives60.0 ± 2.6115.3 ± 4.0181Wall seconds / row1.41 ± 0.0413.51 ± 0.79144.14Output tok/s / worker255013Output tok/s aggregate402.6145.313Concurrency1641Total parameters26B35B284BActive parameters4B3B13B
The throughput and wall-clock numbers here are not definitive maximum performance numbers for these models on this hardware. They are the settings we used at the time with the optimizations we had available. For example, in a separate probe,gemma\-4\-26b\-a4balso supported concurrency 32 and reached over 700 aggregate output tokens per second.
Benchmark comparison across the 330-row label set. Each panel uses its own vertical scale; blue marks the best value for that metric. Error bars on Precision and Recall show sample standard deviation across three runs for Gemma and Qwen.For the Gemma benchmark, we served
gemma\-4\-26b\-a4bwith vLLM using the optimizations we found available for this setup. A big part of that is the NVFP4 quantization: on GB10-class Blackwell hardware, it is not just a smaller model file, but a hardware-friendly format that can use the NVIDIA/vLLM execution path more directly than a portable GGUF quantization like Q4_K_M. In practice, that means less memory traffic and more room for batching. We also enabled prefix caching, FP8 KV cache, the CUTLASS MoE backend, and language-model-only mode. The full 330-row run finished in about 7.5 minutes at concurrency 16.
https://huggingface.co/blog/local-models-pr-triage#tracking-and-validating-real-time-performance-using-openclawTracking and validating real-time performance using OpenClaw
We have mentioned earlier that instead of running a job with a local model for every new issue or PR, we can run a batch job with a SOTA cloud model, like GPT-5.5 running in OpenClaw, every n hours (e.g. every 2 hours) to achieve the same end.[^5]
In that case, we would need a ChatGPT Pro plan. Since the model is SOTA, we can still expect it to perform reasonably well, despite batching 2 hours of issues/PRs together.
Because we want to see how well the local classifier performs against GPT-5.5, we run both simultaneously, and let GPT-5.5 be the judge of false positives and negatives, every 2 hours.
To be safe, we run the OpenClaw job in a sandbox, with only access to thepublic repowe report results to. In our case, we let the OpenClaw job update a machine-readable file, then a simple script reads the Codex-assigned labels and computes the false positive/negative status. Example output:
False negatives - Issue #88499 openai-responses provider: 404 on previous_response_id when store=false (default)- inventory area: OpenAI-compatible/proxy; notifier topics: agent_runtime, api_surface, sessions; notification: none False positives - PR #88275 fix(models-config): allow self-hosted providers without apiKey in models.json (#88267)- notifier interest: i0; topics: self_hosted_inference, local_model_providers, config; notification: sent - PR #88266 refactor: extract model catalog core package- notifier interest: i1; topics: config, api_surface, local_model_providers; notification: sent - PR #88247 feat: add hosted model providers- notifier interest: i0; topics: local_model_providers, model_serving, docs, api_surface; notification: sent
The instructions on how to classify, edit the machine-readable file, get the false positives and false negatives using a script are present in anagent skillwhich is referenced in anOpenClaw cron jobthat runs every 2 hours. The OpenClaw agent then ingests any new issues or PRs, adds them to the JSON file with appropriate labels, runs the scripts and reports back in the same Discord channel. This way, we can observe the local model’s performance every few hours, and get notified of the misses.
https://huggingface.co/blog/local-models-pr-triage#conclusionConclusion
We think that the issue/PR triage task is a specific case of a broader set of tasks which we call “high throughput triage”. This post explored the idea of using a local model to filter out information in real time in only one domain, that is, open source contributions. The ability of medium-sized local models likegemma\-4\-26b\-a4bandqwen3\.6\-35b\-a3bto one-shot classify with good accuracy without any need for fine-tuning makes them a good first choice for quick prototyping, before one moves on to more cost-efficient traditional classifier models.
However, the same approach can be applied to other domains as well:
- News categorization in journalism
- Filtering for posts of interest in social media and forums like X or Reddit
- Triaging customer support tickets
- Triaging content moderation appeals
- Filtering potential outreach while doing sales
- Filtering for certain topics on arXiv while doing research
The list can be extended, but we think that the idea should be clear.
Besides triaging, we have also explored how classification can be performed with agent harnesses running fast local models in a secure manner. A good naming for the approach would beagentic classification: the model is not fed the entire body of information upfront, but can search for more context before returning structured data. While we cannot exactly call this a novel approach, we hope for this blog post to be a good reference for the specificPi+a restricted shell+final\_jsonrecipe.
[^1]: For the use case in this post, we have discovered that breaking down a PR/Issue in a way that means the product surface is understood and labelled correctly is a hard problem. [^2]: Although in our testing we didn’t---it would be quite reasonable for a model to conclude a next-step to gather info, use an external classifier. The agentic approach and the traditional approach are not mutually exclusive. [^3]: See full list of topics and other configurationhere[^4]: We usedDeepSeek\-V4\-Flash\-IQ2XXS\-w2Q2K\-AProjQ8\-SExpQ8\-OutQ8\-chat\-v2\.gguffromantirez/deepseek-v4-gguf. [^5]: While we are aware that using an LLM as a judge negates the “free” aspect, our specific implementation does this for research purposes. In practice, a bigger and more expensive model can be used in tandem during a trial period for calibration, after which the system would transition fully to the smaller one. In recent runs, this audit loop used roughly 40k total GPT-5.5 tokens per 2-hour check, mostly cached context, costing about 2-3 cents per run at API pricing, or roughly $9/month at 12 runs per day. This was a single batch audit across all new items, not one judge call per item; doing it per item would likely be several times more expensive.
Similar Articles
We made OpenClaw hosting completely free
OpenClaw now offers completely free hosting with one-click deploy and PAYG token pricing, making it easier for developers to build with agents without managing setup or ops.
Looking for early users to try our OpenClaw model plans and tell us what's broken (15–30 min)
OpenClaw is seeking early users to test their open-source model inference plans, sold by concurrency slot with high throughput and no shared pool, in exchange for free access and feedback.
Liberate your OpenClaw
Hugging Face provides a guide to migrate OpenClaw agents from restricted Anthropic Claude models to open-source alternatives via Hugging Face Inference Providers or local hardware using tools like Llama.cpp.
@atomicbot_ai: Hermes Agent vs OpenClaw using Qwen 35B Local Model We asked agents to scrape GitHub star history for both tools, find …
A comparison of Hermes Agent and OpenClaw using Qwen 35B local model, where the agents scrape GitHub star history, identify growth spikes, and build live dashboards. OpenClaw took 12m 01s (203k tokens), Hermes took 33m 01s (257k tokens), with different approaches.
@steipete: Kudos to the folks from Tencent for working with us and providing evals to improve OpenClaw's harness performance! We'r…
Tencent collaborated to provide evaluations that improved OpenClaw’s harness performance and is helping upstream fixes to the open-source repo.