Cactus Hybrid Router: Gemma4-2B can match Gemini-3.1-Flash-Lite by routing 15-55% of tasks to Gemini and running the rest locally.

Reddit r/LocalLLaMA 05/26/26, 10:20 PM Tools

hybrid-router edge-computing routing efficiency open-source gemini gemma

Summary

Cactus Hybrid Router is a 65k parameter model that dynamically routes tasks between local edge models (like Gemma4-2B) and frontier cloud models (like Gemini-3.1-Flash-Lite) to optimize cost and performance, with adjustable edge-cloud ratios and support for text, vision, and audio prompts.

Last week, we announced the “Simple Attention Network” and trained Needle, a 26M function-calling model that beats models 10-25x its size. Some LocalLLaMA Redditors asked if we could make a router model. We have now built “Cactus Hybrid Router”, a 65k parameter model that decides on the fly whether to complete a task with the edge model or route it to a frontier cloud model.

https://preview.redd.it/jm23ff7r1k3h1.png?width=1453&format=png&auto=webp&s=2091ec952216beb2d987d536b08df3aec58fec94

1. Robust router performance, even when you quantize the edge model. This is with Cactus Quants though; our uniform 4-bit is naturally close to fp16.

https://preview.redd.it/4ri8bkuw1k3h1.png?width=2048&format=png&auto=webp&s=415e8165d5421d509634c165a3fb9feb2f83c209

2. Adjustable edge-cloud ratio for optimized resource allocation, because why run "what is the capital of France?" through a trillion-parameter frontier model on expensive infra?

https://preview.redd.it/dwtg7noc2k3h1.png?width=904&format=png&auto=webp&s=0ecde47c439e7a29af3dca441a9098c98ca38e29

3. Same 64k router handles text-only, vision and audio prompts.

We'd love to hear your thoughts on this. What are we not thinking about? Live AI and coding require a lot of inference, which puts a lot of pressure on cloud infra. Why not run rudimentary tasks locally and only escalate to the cloud when needed, as a step towards edge?

https://github.com/cactus-compute/cactus
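For readers wondering what an adjustable edge-cloud ratio could look like in practice, here is a minimal Python sketch. It is not the Cactus implementation or API. It assumes the router emits one score per prompt, where a higher score means the edge model is more likely to fail, and that the ratio is set by picking a score cutoff on a calibration set. The names `threshold_for_cloud_ratio`, `route`, and the calibration scores are all hypothetical.

```python
from typing import Sequence


def threshold_for_cloud_ratio(scores: Sequence[float], cloud_ratio: float) -> float:
    """Pick a cutoff so that about `cloud_ratio` of `scores` lie strictly above it.

    `scores` are hypothetical router outputs on a calibration set of prompts,
    where a higher score means "the edge model is more likely to fail".
    """
    if not 0.0 <= cloud_ratio <= 1.0:
        raise ValueError("cloud_ratio must be between 0 and 1")
    ordered = sorted(scores)
    n_cloud = round(cloud_ratio * len(ordered))  # prompts we allow to escalate
    if n_cloud == 0:
        return float("inf")   # nothing exceeds +inf, so everything stays local
    if n_cloud == len(ordered):
        return float("-inf")  # everything exceeds -inf, so everything escalates
    # The cutoff is the largest calibration score that still stays local.
    return ordered[len(ordered) - n_cloud - 1]


def route(score: float, threshold: float) -> str:
    """Send a prompt to the cloud only when its score exceeds the cutoff."""
    return "cloud" if score > threshold else "edge"


# Made-up calibration scores, purely for illustration.
calibration_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90, 0.95]
threshold = threshold_for_cloud_ratio(calibration_scores, cloud_ratio=0.3)
decisions = [route(s, threshold) for s in calibration_scores]
print(threshold, decisions.count("cloud"), decisions.count("edge"))
```

Raising `cloud_ratio` lowers the cutoff and escalates more prompts; lowering it keeps more work on the edge model. With tied scores the realized ratio can come out slightly below the target, since ties at the cutoff stay local.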
