Tag
The article highlights the ability to run Qwen3-35B-A3B locally on a laptop for free using llama.cpp and Unsloth 4-bit quantization.
A developer successfully runs a quantized TinyStories transformer model locally on a stock Game Boy Color using custom ROM and fixed-point math.
This article introduces ExecuTorch, a unified PyTorch-native deployment framework designed to run AI models on diverse edge devices without requiring model conversion or reimplementation.
This study reveals a 'Smart Pruning Paradox' where activation-aware pruning methods like Wanda preserve perplexity but significantly amplify bias in Large Language Models deployed on edge devices.
OpenBMB released MiniCPM-V 4.6, a 1.3B parameter multimodal model. Using high-resolution visual processing and efficient compression, it achieves fast inference on consumer hardware and mobile phones, outperforming larger models. It is fully open-source and supports multiple inference and quantization frameworks.
Demonstrates running Gemma 4 offline in the browser using WebGPU and Transformers.js to control a Reachy Mini robot via WebSerial.
This paper introduces a neuro-symbolic pipeline using 2.5-D decomposition to improve LLM-based spatial construction accuracy by offloading vertical coordinate calculation to a deterministic executor, achieving high accuracy on benchmarks and edge hardware.
A new 20B+ parameter MoE model from OpenAI, quantized to 3-bit via TurboQuant and optimized with MLX, allows for high-performance local LLM inference on standard 16GB MacBooks.
Google released Gemma 4, an open-source AI model optimized for local execution on standard laptops, offering 3x faster performance and a 256k context window for free under an Apache 2.0 license.
A discussion post exploring where edge AI will have the greatest impact: autonomy and robotics, low-power vision systems, private local LLMs, or bandwidth-constrained industrial deployments.
The article discusses the growing viability of local AI models for everyday tasks, suggesting a shift toward hybrid architectures that optimize for cost and latency rather than relying solely on frontier cloud models.
MiniCPM-o 4.5 is a 9B parameter multimodal model featuring Omni-Flow, a framework enabling real-time full-duplex interaction where the model can simultaneously perceive and respond proactively. It achieves state-of-the-art open-source performance comparable to Gemini 2.5 Flash and runs on edge devices with less than 12GB RAM.
Tencent's AngelSlim team released Hy-MT1.5-1.8B-1.25bit, a highly compressed 1.25-bit machine translation model supporting 33 languages that fits in 440MB for on-device use. It utilizes the Sherry quantization algorithm to achieve world-class translation quality comparable to much larger models.
Anker unveiled its custom Thus AI chip using compute-in-memory architecture to enable local AI on tiny devices, starting with upcoming Soundcore flagship earbuds for superior call noise cancellation.
NVIDIA and Hugging Face publish a hands-on demo showing Gemma 4 running as a vision-language-action model entirely on the Jetson Orin Nano Super, using local STT/TTS and webcam input.
Soul Player C64 implements a real 2-layer decoder-only transformer with ~25,000 int8 parameters in hand-written 6502/6510 assembly, running entirely on an unmodified 1 MHz Commodore 64 loaded from a floppy disk. The project includes training scripts to build and quantize custom models, assemble C64 binaries, and run inference at roughly 60 seconds per token.
Empirical study shows small language models achieve 100% adversarial robustness with System 1 intuition but collapse under System 2 reasoning when used as edge-native governance firewalls in decentralized autonomous organizations.
Cactus-Compute releases Needle, a 26M parameter distilled model from Gemini 3.1, using a pure attention architecture optimized for on-device inference and local fine-tuning.
This blog post details how to set up Frigate with a Hailo AI coprocessor on a Raspberry Pi for object detection, including steps to fix a PCIe descriptor page size error. The setup works with the cheaper Hailo-8L and achieves low inference times.
Google introduces Gemma 3 270M, a compact 270-million parameter model designed for efficient on-device AI with strong instruction-following capabilities and extreme energy efficiency (0.75% battery for 25 conversations on Pixel 9 Pro).