Cactus Hybrid Router: Gemma4-2B can match Gemini-3.1-Flash-Lite by routing 15-55% of tasks to Gemini And Running The Rest Locally.

Reddit r/LocalLLaMA Tools

Summary

Cactus Hybrid Router is a 65k parameter model that dynamically routes tasks between local edge models (like Gemma4-2B) and frontier cloud models (like Gemini-3.1-Flash-Lite) to optimize cost and performance, with adjustable edge-cloud ratios and support for text, vision, and audio prompts.

Last week, we announced the “Simple Attention Network” and trained Needle, a 26m function call model that beats models 10-25x its size. Some LocalLlama Redditors asked if we could use make a router model. We now built “Cactus Hybrid Router”, a 65k parameter model that decodes on the fly when to complete a task with the edge model or route to frontier cloud. https://preview.redd.it/jm23ff7r1k3h1.png?width=1453&format=png&auto=webp&s=2091ec952216beb2d987d536b08df3aec58fec94 1. Robust router performance, even when you quantize the edge model. This is Cactus Quants though, our 4bit uniform nears fp16 naturally. https://preview.redd.it/4ri8bkuw1k3h1.png?width=2048&format=png&auto=webp&s=415e8165d5421d509634c165a3fb9feb2f83c209 2. Adjustable edge-cloud ratio for optimized resource allocation, cause why run "what is the capital of France?" through a trillion-parameter frontier model on expensive infra? https://preview.redd.it/dwtg7noc2k3h1.png?width=904&format=png&auto=webp&s=0ecde47c439e7a29af3dca441a9098c98ca38e29 3. Same 64k router handles text-only, vision and audio prompts. We'd love to hear your thoughts on this, what are we not thinking about? Live AI and coding require a lot of inference, hence much pressure on the cloud infra. Why not run rudimentary tasks locally and only escalate to cloud as a step towards edge? [https://github.com/cactus-compute/cactus](https://github.com/cactus-compute/cactus)
Original Article

Similar Articles

Gemini 3.5: frontier intelligence with action

Google DeepMind Blog

Google announces Gemini 3.5, a new family of AI models focused on agentic workflows and coding, starting with 3.5 Flash which delivers frontier performance at high speed.

Gemini 3 Flash: frontier intelligence built for speed

Google DeepMind Blog

Google has released Gemini 3 Flash, a fast, cost-effective AI model that combines Pro-grade reasoning with Flash-level speed for tasks like coding, complex analysis, and agentic workflows.

Gemini 3.1 Flash-Lite

Product Hunt

Google releases Gemini 3.1 Flash-Lite, a lightweight version of the Gemini model designed for high-volume AI pipelines.