This paper presents llada.cpp, an NPU-aware inference framework for accelerating diffusion large language models (dLLMs) on smartphones. It introduces three techniques (Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime) to align dLLM inference with mobile NPU characteristics, achieving a 17-42x latency reduction over a CPU baseline.
Running the Gemma 12B model on a Google Pixel 10 Pro with llama.cpp achieves 6.5 tokens per second for prompt processing and 1.3 tokens per second for generation at under 10 watts of power consumption, demonstrating efficient on-device AI inference.
Trace is a Mac app that transcribes meetings locally without uploading audio, allowing users to flag moments mid-call and get clean markdown transcripts.
A collection of 7 hands-on, open-source webinars from the past 7 months focused on building with local AI and small language models, all running on-device.
Revi is a voice dictation app that runs on-device without needing cloud services or an account.
This article argues that local-first software, like the Harper grammar checker, avoids scaling issues by running code on-device, making it easier to handle traffic spikes without additional server costs.
Atomic Chat is now available as a Local App on Hugging Face, allowing users to run 200,000+ open-weight models privately and locally on their devices.
Synopsule is a product that provides on-device, private AI meeting transcripts, ensuring data stays local.
VTT for Mac is a voice-to-text tool for macOS that offers a fully on-device option for privacy.
A user attempted to benchmark Google's new on-device dictation app Eloquent, which uses proprietary models, and found that it frequently drops words or returns incomplete transcripts; its accuracy is competitive only when the transcript is complete. The author theorizes that the underlying chat-style model sometimes refuses to transcribe.
Apple released Core AI, a new framework that runs AI models entirely on Apple silicon devices (iPhone, iPad, Mac, Vision Pro) with zero server calls. It includes a memory-safe Swift API, model export recipes for PyTorch, an optimizer, and debugging tools, supporting models like Qwen, Mistral, and SAM3.
ColibotAI is an on-device AI tool that translates, summarizes, and explains any text without needing an internet connection.
At WWDC, Apple announced CoreAI, a new on-device inference engine for Apple Silicon that replaces CoreML and supports larger models, up to 20B parameters, via optimized inference, with a focus on phones and tablets.
Apple shipped a 20B-parameter on-device model that uses a MoE variant which selects experts once per query so that the model fits in NAND, enabling inference despite RAM constraints.
Apple introduces the Core AI framework, a new tool for on-device machine learning.
Apple announces the next generation of Apple Intelligence and Siri, featuring on-device AI processing, privacy-focused enhancements like Private Cloud Compute, and new capabilities such as Genmoji and smarter home integration.
A developer tested how small edge models (LFM2.5, Gemma variants) retain a single fact across conversation turns, finding that models often confidently deny knowing information that remains in context, posing a trust issue for agent architectures and suggesting a trade-off between memory and format discipline.
OpenMed 1.5.5 ships with on-device batch PII redaction in 9 languages. It is open source under Apache 2.0 and has surpassed 6 million PyPI downloads in under a year.
Signal Recorder SR-7 is an on-device voice recorder that transcribes audio and exports Markdown files.
Google releases Gemma 4 12B, a compact AI model optimized to run locally on laptops with only 16GB of RAM, featuring multi-token prediction and streamlined multimodal capabilities for text, audio, and images.