Needle: We Distilled Gemini Tool Calling Into a 26M Model

Reddit r/LocalLLaMA 05/12/26, 05:56 PM Models

Summary

Cactus-Compute released Needle, a 26M parameter open-source model distilled from Gemini for efficient on-device function calling using a novel Simple Attention Network architecture without MLPs.

We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We have long been frustrated by how little effort goes into building agentic models that run on budget phones, so we investigated and arrived at one observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.

Simple Attention Networks: the entire model is just attention and gating, with no MLPs anywhere. Needle is an experimental run at single-shot function calling for consumer devices (phones, watches, glasses...).

Training:

- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: [https://github.com/cactus-compute/needle](https://github.com/cactus-compute/needle)

The full writeup on the architecture is here: [https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md)

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (retrieval-augmented generation, tool use). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results are still to be published.

While Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, those models have more scope and capacity, and they excel in conversational settings.

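
As a rough sanity check on the figures above, the quoted token counts, wall-clock times, and tok/s rates can be turned into aggregate training throughput and an estimated per-call latency. This is only back-of-the-envelope arithmetic: the prompt and output sizes (600 and 60 tokens) are hypothetical, the quoted rates are assumed to be sustained, and tokenization and runtime overhead are ignored. The post does not say which hardware was used for post-training, so only an aggregate rate is computed for that stage.

```python
# Figures quoted in the post.
PRETRAIN_TOKENS = 200e9      # 200B tokens
PRETRAIN_HOURS = 27
NUM_CHIPS = 16               # TPU v6e chips used for pretraining
POSTTRAIN_TOKENS = 2e9       # 2B tokens
POSTTRAIN_MINUTES = 45
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200


def throughput(tokens, seconds, chips=1):
    """Average tokens processed per second (per chip if chips > 1)."""
    return tokens / seconds / chips


def call_latency_s(prompt_tokens, output_tokens):
    """Estimated time for one call: prefill the prompt, then decode the output."""
    return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S


pretrain_total = throughput(PRETRAIN_TOKENS, PRETRAIN_HOURS * 3600)
pretrain_per_chip = throughput(PRETRAIN_TOKENS, PRETRAIN_HOURS * 3600, NUM_CHIPS)
posttrain_total = throughput(POSTTRAIN_TOKENS, POSTTRAIN_MINUTES * 60)

# Hypothetical request: 600 tokens of tool schemas plus query, 60 tokens of JSON out.
example_latency = call_latency_s(prompt_tokens=600, output_tokens=60)

print(f"pretraining:   {pretrain_total:,.0f} tok/s total")        # 2,057,613
print(f"               {pretrain_per_chip:,.0f} tok/s per chip")  # 128,601
print(f"post-training: {posttrain_total:,.0f} tok/s total")       # 740,741
print(f"example call:  {example_latency:.3f} s")                  # 0.150
```

Under these assumptions a single tool call with a 600-token prompt would take roughly 0.15 s, split as 0.10 s of prefill and 0.05 s of decode.
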
We encourage you to test on your own tools via the playground and finetune accordingly.

Needle is part of a broader effort to make on-device AI practical. We also build Cactus (https://github.com/cactus-compute/cactus), an open-source inference engine for mobile and wearables. Everything is MIT licensed.

Weights: [https://huggingface.co/Cactus-Compute/needle](https://huggingface.co/Cactus-Compute/needle)

GitHub: [https://github.com/cactus-compute/needle](https://github.com/cactus-compute/needle)
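
To make the "retrieval-and-assembly" framing of tool calling concrete, here is a toy sketch of the three steps named in the post: match the query to a tool name, extract argument values, emit JSON. It is not Needle's code and not how the model works internally; Needle learns these steps with attention, while this stand-in uses hand-written word overlap and regular expressions. The tool schema, tool names, and patterns below are all hypothetical.

```python
import json
import re

# Hypothetical tool registry: each parameter maps to a regex that captures its value.
TOOLS = [
    {
        "name": "set_timer",
        "description": "set a timer for a duration in minutes",
        "parameters": {"minutes": r"(\d+)\s*(?:min|minute|minutes)\b"},
    },
    {
        "name": "send_message",
        "description": "send a text message to a contact",
        "parameters": {"recipient": r"\bto\s+(\w+)", "body": r"saying\s+(.+)$"},
    },
]


def tokenize(text):
    """Lowercase alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tool(query, tools):
    """Step 1 (retrieval): pick the tool whose name and description share the most words with the query."""
    query_words = tokenize(query)

    def overlap(tool):
        tool_text = tool["name"].replace("_", " ") + " " + tool["description"]
        return len(query_words & tokenize(tool_text))

    return max(tools, key=overlap)


def extract_arguments(query, tool):
    """Step 2 (extraction): copy argument values out of the query text."""
    arguments = {}
    for name, pattern in tool["parameters"].items():
        match = re.search(pattern, query, flags=re.IGNORECASE)
        if match:
            arguments[name] = match.group(1)
    return arguments


def call_tool(query, tools=TOOLS):
    """Step 3 (assembly): emit the call as a JSON string."""
    tool = select_tool(query, tools)
    return json.dumps({"name": tool["name"], "arguments": extract_arguments(query, tool)})


print(call_tool("Set a timer for 10 minutes"))
print(call_tool("Send a message to Alice saying running late"))
```

Every value in the output is copied from the input (the tool name from the registry, the arguments from the query), which is the sense in which the task needs no memorized facts. A learned model replaces the brittle overlap score and regexes with attention over the tool definitions and the query.
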

