Google Chrome is automatically downloading a roughly 4GB weights file for its Gemini Nano model to users' devices to power on-device AI features like scam detection and writing assistance, often without clear notification about the storage requirement. Users can disable the On-Device AI toggle in Chrome settings to remove the file and prevent re-downloads.
Garry Tan highlights a model with a 1M token context window and coding agent capabilities running locally on a 128GB MacBook Pro, expressing excitement about the milestone.
atomic.chat has optimized Gemma 4 26B inference in LLaMA.cpp, achieving ~40% faster token generation on MacBook Pro M5 Max using Multi-Token Prediction (MTP) speculative decoding. This is a notable win for local AI users running desktop apps, coding agents, and private on-device assistants.
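Speculative decoding gains like this come from drafting several future tokens with a cheap predictor and verifying them in a single pass with the full model, so accepted tokens cost almost nothing. A minimal greedy sketch of the accept/reject loop, with toy stand-in models rather than LLaMA.cpp's actual MTP implementation:

```python
from typing import Callable, List

def speculative_decode(
    target: Callable[[List[int]], int],   # expensive model: next token for a prefix
    draft: Callable[[List[int]], int],    # cheap draft head: next token for a prefix
    prompt: List[int],
    max_new: int = 8,
    k: int = 3,                           # tokens drafted per verification step
) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the longest prefix
    the target model agrees with, then take one correction/bonus token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft k candidate tokens with the cheap model.
        cand, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            cand.append(t)
            ctx.append(t)
        # 2. Verify: accept drafted tokens while the target agrees.
        for t in cand:
            if len(out) - len(prompt) >= max_new:
                break
            if target(out) == t:
                out.append(t)              # accepted almost for free
            else:
                out.append(target(out))    # target's correction; restart drafting
                break
        else:
            # All k drafts accepted; the target still yields one bonus token.
            if len(out) - len(prompt) < max_new:
                out.append(target(out))
    return out
```

Because every emitted token is checked against the target model, the output is identical to greedy decoding with the target alone; the speedup comes purely from verifying drafted tokens in batches.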
MIT researchers developed a new framework called FTTE that accelerates privacy-preserving federated learning by 81%, enabling efficient AI training on resource-constrained edge devices like smartwatches and sensors.
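For context, the baseline such frameworks accelerate is federated averaging: each device trains locally and only parameter updates are aggregated, so raw data never leaves the device. A minimal sketch of the vanilla FedAvg aggregation step (not FTTE itself, whose details are not described here):

```python
from typing import List

def fedavg(client_weights: List[List[float]],
           client_sizes: List[int]) -> List[float]:
    """Server-side FedAvg step: average client model parameters,
    weighted by each client's local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

On edge hardware like smartwatches, the cost of this loop is dominated by local training and update transmission, which is where acceleration efforts typically focus.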
Tencent's AngelSlim team released Hy-MT1.5-1.8B-1.25bit, a highly compressed 1.25-bit machine translation model supporting 33 languages that fits in 440MB for on-device use. It utilizes the Sherry quantization algorithm to achieve world-class translation quality comparable to much larger models.
Google DeepMind releases Gemma 4, a family of open-weights multimodal models featuring Multi-Token Prediction (MTP) for up to 2x decoding speedups, supporting text, image, video, and audio with enhanced reasoning and coding capabilities.
A user reflects on why more apps don’t run local LLMs directly on phones, noting Gemma 2-4B models already work offline and could eliminate server costs while maintaining near-GPT-4o quality.
Apple is betting that AI’s future hinges on custom hardware and on-device inference via the iPhone’s advanced processors rather than cloud-based LLMs.
K2.6 autonomously downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, implementing and optimizing inference in the niche Zig language to demonstrate the agent model's ability to generalize. Over 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times, raising throughput from ~15 tokens/s to ~193 tokens/s and ultimately achieving inference 20% faster than LM Studio.
Researchers introduce 8M-30M parameter micro language models that instantly generate the first few words on-device before cloud models complete responses, enabling responsive AI on ultra-constrained devices like smartwatches.
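The pattern described, surface the tiny on-device model's opening words immediately while the cloud model finishes the answer, can be sketched with two concurrent tasks. Function names and model interfaces below are illustrative assumptions, not the paper's API:

```python
import asyncio

async def respond(prompt, micro_lm, cloud_lm):
    """Race an on-device micro LM against a cloud LM: show the micro
    model's first words instantly, then return the cloud's full answer.
    micro_lm/cloud_lm are hypothetical async callables returning text."""
    cloud_task = asyncio.create_task(cloud_lm(prompt))  # start cloud call early
    draft = await micro_lm(prompt)        # tiny model: first words in milliseconds
    print(draft, end="", flush=True)      # user perceives an instant response
    full = await cloud_task               # cloud model completes the sentence
    return draft, full
```

In a real system the cloud response would need to be reconciled with the already-displayed draft, e.g. by constraining the cloud model to continue the micro model's prefix.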
A 31B parameter model runs locally on a laptop via Hermes agent at 15 tok/s, using 22.8 GB VRAM and 94 W power, highlighting fully autonomous, private AI inference without cloud dependencies.
A user shares their experience running Qwen3-35B-A3B quantized model on an M2 MacBook Pro with 32GB RAM for coding tasks via opencode and llama.cpp, finding that the 32K context window limit causes critical memory loss during compaction, making complex coding tasks impractical. They conclude that meaningful agentic coding with this model likely requires at least 128K context, exceeding what their hardware can support.
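The memory wall here is largely the KV cache, which grows linearly with context length on top of the quantized weights. A back-of-the-envelope estimator (the architecture numbers in the test are illustrative, not Qwen3's actual configuration):

```python
def kv_cache_gib(ctx_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elt: int = 2) -> float:
    """Approximate KV-cache size in GiB: one K and one V tensor per layer,
    each of shape (ctx_len, n_kv_heads, head_dim), at bytes_per_elt (fp16=2)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * ctx_len / 2**30
```

Quadrupling the context from 32K to 128K quadruples this figure, which, added to the weights, quickly exceeds a 32GB machine's unified memory.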
Overworld releases Waypoint-1.5, a real-time video world model designed for everyday GPUs, featuring improved visual fidelity and new 360p and 720p tiers for broader hardware accessibility.
Google DeepMind introduces Gemma 4, its most capable family of open models to date, designed for advanced reasoning and agentic workflows with high intelligence-per-parameter efficiency across multiple sizes.
Google DeepMind releases Gemma 4, a frontier multimodal model family available on Hugging Face with Apache 2 licensing, optimized for on-device deployment and supported by various inference libraries.
Unsloth releases GGUF-quantized versions of Google DeepMind's Gemma 4 26B A4B instruction-tuned model, enabling efficient local inference with support for tool-calling and fine-tuning via Unsloth Studio. Gemma 4 is a multimodal MoE model with a 256K context window, supporting text, image, video, and audio inputs.
Google DeepMind announces the full release of Gemma 3n, a mobile-first multimodal AI model optimized for on-device efficiency with MatFormer architecture. The release includes E2B and E4B variants designed for low memory usage while delivering strong performance in reasoning, coding, and multilingual tasks.
Google DeepMind introduces Gemini Robotics On-Device, an efficient VLA model optimized to run locally on robotic devices, enabling low-latency operation and offline capability while maintaining strong dexterous manipulation and task generalization. The model can be fine-tuned with as few as 50-100 demonstrations and comes with an SDK for developers.