Tag
A year after its inception, OpenMed has achieved 340 million model downloads, offering over 1,500 open medical models under Apache 2.0, with 650+ capable of running on-device on iPhones.
Describes a 'distill-on-idle' pipeline that enables a 4B-parameter local model to run effectively as an on-device memory assistant, demonstrating a practical use of small models.
Introduces AnySimLite, a lightweight similarity encoder for on-device speech-adjacent classification tasks, achieving state-of-the-art or competitive performance while using less than 1/250th the model size of the qLLaMA-LoRA-7B baseline.
Liquid AI releases LFM2.5-230M, a lightweight foundation model that runs on devices from cloud GPUs to CPUs and Raspberry Pi, with strong performance on tool use and data extraction tasks.
A tweet argues that continual learning will run locally, and that major AI labs avoid discussing it because the data stays on device.
Eyal Toledano built an LLM inference engine in pure WebGPU/WGSL that runs on-device in the browser and in Node without API keys, and is seeking peer review.
Liquid AI's LFM2.5-230M model demonstrates multi-step tool-calling capabilities on a Unitree G1 robot, running entirely on-device on an NVIDIA Jetson Orin, acting as a skill-selection layer.
Liquid AI releases LFM2.5-230M, a small 230M-parameter model optimized for fast inference on CPUs, NPUs, and GPUs, targeting agentic tasks on devices like phones and robots.
Proposes a lightweight neural architecture search performed directly on the deployment device for near-sensor computing, validated on sEMG sign language and fault diagnosis datasets, achieving improved accuracy and reduced RAM occupancy.
Liquid AI released LFM2.5-230M, a compact 230M-parameter hybrid model optimized for on-device deployment with fast edge inference speeds (213 tok/s on Galaxy S25 Ultra) and built for agentic tasks via reinforcement learning.
A benchmark study comparing traditional machine learning methods (Random Forest, XGBoost, SVM, Logistic Regression) against lightweight transformer variants (DistilBERT, TinyBERT, MobileBERT) for on-device fault detection across three public datasets. Traditional ML offers competitive accuracy at far smaller resource footprints, while TinyBERT-4L is the most deployment-friendly transformer.
FUTO Swipe is a product offering open models for on-device swipe typing, enabling privacy-focused keyboard input.
Update on running a non-quantized DeepSeek-v4-Flash model at 11 tok/s on a single DGX Spark using sglang inference and a custom mega-kernel, progressing towards GLM-5.2.
A collection of 650+ Apache-2.0 licensed biomedical NER and de-identification models that run on-device via MLX, achieving 30-40x faster inference than PyTorch-CPU on an M3 Max with identical outputs.
This paper introduces a conversational voice agent system that uses a lightweight on-device 'Talker' model to start responding immediately, then incorporates knowledge from a frontier LLM 'Reasoner' as it becomes available, achieving 7-19x faster time-to-first-response while approaching frontier-level performance on a laptop.
PhoneClaw is an open-source project that runs an AI agent entirely locally on the iPhone, using models such as Gemma 4 and MiniCPM-V, with no internet connection or data upload needed. It supports on-device operations involving voice, calendar, and health data, which keeps data private and responses fast.
OpenJarvis is an open-source framework for building personal AI agents that run locally on devices, with support for local LLMs and a focus on energy efficiency and privacy.
Explains reasons to run local AI agents instead of cloud-based alternatives, highlighting privacy and control benefits.
This paper presents llada.cpp, an NPU-aware inference framework for accelerating diffusion large language models (dLLMs) on smartphones. It introduces three techniques—Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime—to align dLLM inference with mobile NPU characteristics, achieving 17-42x latency reduction over CPU baseline.
Running a Gemma 12B model on a Google Pixel 10 Pro with llama.cpp achieves 6.5 tokens per second for prompt processing and 1.3 tokens per second for generation at under 10 watts of power consumption, demonstrating efficient on-device AI inference.