mobile-inference

#mobile-inference

Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup

Reddit r/LocalLLaMA · 2026-05-15

A user compares running Gemma 4 with LiteRT-LM on mobile devices against their previous llama.cpp setup, reporting lower memory usage (1.5-2 GB vs 4-5 GB) and faster inference (2-4 seconds vs 7-10 seconds) on smartphones such as the Samsung S25 Ultra and iPhone 13 Pro Max.

#mobile-inference

Needle: We Distilled Gemini Tool Calling Into a 26M Model

Reddit r/LocalLLaMA · 2026-05-12

Cactus-Compute released Needle, a 26M-parameter open-source model distilled from Gemini for efficient on-device function calling. It uses a novel Simple Attention Network architecture that has no MLPs.
