YOLO26 is a multi-task computer vision model family released in January 2026. It performs end-to-end detection without Non-Maximum Suppression (NMS), which lowers latency, and it is optimized for edge deployment with faster CPU inference and a compact design.
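For context, the post-processing step that YOLO26 drops is Non-Maximum Suppression (NMS). A minimal greedy version is sketched below in Python. It is a generic textbook implementation, not Ultralytics code, and it assumes boxes are (x1, y1, x2, y2) tuples with one confidence score each. Its sequential, data-dependent loop is one reason NMS can add latency on CPUs and edge devices.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    # Visit candidates from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two heavily overlapping boxes (IoU 0.81) collapse to the higher-scoring one while a distant box survives: `nms([(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)], [0.9, 0.8, 0.7])` returns `[0, 2]`.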
Microsoft open-sourced bitnet.cpp, an inference framework for 1-bit LLMs such as BitNet b1.58, which uses ternary weights. It can run a 100B-parameter BitNet b1.58 model on a single local CPU without a GPU, and it reports speedups of up to 6.17x and energy reductions of up to 82.2% on x86 CPUs.
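The "1-bit" label refers to ternary weights in {-1, 0, +1} (about 1.58 bits each), as in BitNet b1.58. The Python sketch below shows the absmean rule described for BitNet b1.58 (divide by the mean absolute weight, then round and clip to [-1, 1]) and why a ternary matrix-vector product needs no weight multiplications. It is an illustration only: BitNet models are trained or fine-tuned under this constraint rather than simply quantized afterwards, and bitnet.cpp's speedups come from optimized CPU kernels, not from a loop like this. The function names are hypothetical.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with one per-tensor scale (absmean rule)."""
    flat = [w for row in weights for w in row]
    gamma = sum(abs(w) for w in flat) / len(flat)  # mean absolute weight
    q = [[max(-1, min(1, round(w / (gamma + eps)))) for w in row] for row in weights]
    return q, gamma


def ternary_matvec(q, gamma, x):
    """Multiply-free matvec: each ternary weight adds, subtracts, or skips an input."""
    out = []
    for row in q:
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
        out.append(gamma * acc)  # a single rescale per output element
    return out
```

With these rules, `[[0.9, -0.05, -1.1], [0.02, 0.6, -0.4]]` quantizes to `[[1, 0, -1], [0, 1, -1]]`, so each output is just a signed sum of inputs times one shared scale.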
Reverse engineering Qualcomm's NPU compiler reveals several undocumented mechanisms: VTCM (Vector Tightly Coupled Memory) management, placement based on mixed-integer linear programming (MILP), automatic precision changes, and a hidden analytical performance simulator (Hextimate), all relevant to optimizing edge deployments.
Running DeepSeek-V4-Flash, a 284B-parameter MoE model with 13B active parameters, locally on an AMD Ryzen AI Max+ 395 system with 128GB of RAM achieves about 15 tokens per second (TPS). The system costs about $3,000, versus $25,000+ for a datacenter setup, which shows that large models can feasibly run on consumer hardware.
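A rough, bandwidth-bound estimate helps explain why a 284B MoE model is workable here: to avoid streaming from disk, all experts' weights should fit in RAM, but each decoded token only has to read the ~13B active parameters. The Python sketch below is back-of-the-envelope arithmetic under loudly hypothetical inputs: an average of 3 bits per weight (the quantization actually used is not stated) and about 256 GB/s of memory bandwidth (an assumed figure for this platform's LPDDR5X). It ignores KV cache, attention cost, and kernel overheads.

```python
def weights_gb(total_params_b, bits_per_weight):
    """Memory needed just to hold the weights (ignores KV cache and runtime overhead)."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9


def decode_tps_upper_bound(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound ceiling on decode speed: each token reads all active weights once."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical inputs: 3 bits/weight on average, ~256 GB/s memory bandwidth.
print(weights_gb(284, 3))                   # GB of weights for all experts
print(decode_tps_upper_bound(13, 3, 256))   # theoretical tokens/s ceiling
```

Under these assumptions the weights take about 106.5 GB and the decode ceiling is about 52.5 tok/s, so the reported ~15 TPS sits well below the theoretical limit, as expected once real-world overheads are included.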
A demonstration shows 'Le Gros Chaton' (likely a lightweight AI model) running on the radio of a 1984 Toyota Corolla, showcasing edge AI on vintage hardware.
Harvard University open-sourced the textbook "Machine Learning Systems," which systematically covers practical topics such as ML system design, data engineering, model deployment, MLOps, and edge AI, aiming to help bring AI from research into production. It is freely available on GitHub.
APEX proposes a three-layer self-evolution framework for production AI agents that simultaneously optimizes the harness, behavioural principles, and workflow topology. Experiments on a production agent show significant improvements in health score and workflow quality with minimal LLM calls.
A satellite called Yam-9 used Google DeepMind's Gemma 3 vision-language model in orbit to autonomously identify areas of interest based on natural language queries, marking the first reported use of a VLM in space and signaling a shift toward more autonomous satellite operations.
PrintGuard 2.0 is a major rewrite of a few-shot fault detector for FDM 3D printers, built on a ShuffleNetV2 backbone and a prototypical network. It now has a single Python engine that, through a platform abstraction layer, runs unmodified both on CPython and in the browser via Pyodide, and it supports per-printer sensitivity tuning and fair inference scheduling.
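A prototypical network classifies a query by comparing its embedding with one prototype per class: the mean embedding of that class's few labeled support examples. The Python sketch below shows that step on precomputed embeddings. In PrintGuard the embeddings come from the ShuffleNetV2 backbone; here they are toy 2-D vectors, and the labels "ok" and "failure" plus the threshold-based `is_failure` sensitivity knob are hypothetical, not PrintGuard's API.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypes(support):
    """Average each class's support embeddings into a single prototype vector."""
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in support.items()
    }


def class_probs(embedding, protos):
    """Softmax over negative squared distances to each prototype."""
    labels = list(protos)
    logits = [-sq_dist(embedding, protos[label]) for label in labels]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


def is_failure(embedding, protos, threshold=0.5):
    """Hypothetical sensitivity knob: a lower threshold flags failures more readily."""
    return class_probs(embedding, protos)["failure"] >= threshold
```

Because prototypes are just means, adding a new printer or a few extra labeled frames only requires recomputing averages, with no retraining of the backbone.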
D2H-AD is a novel anomaly detection framework based on Hyperdimensional Computing (HDC) that combines distance-based and density-aware encoding. It outperforms five baselines across multiple benchmarks while remaining lightweight, interpretable, and efficient enough for edge AI and IoT.
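Hyperdimensional computing represents data as very long random vectors (hypervectors) combined by binding (elementwise multiplication) and bundling (addition). The Python sketch below is a generic, distance-based HDC anomaly detector, not D2H-AD itself; its density-aware encoding is not reproduced. Features, assumed scaled to [0, 1], are encoded by binding a random ID hypervector per feature with a level hypervector for the quantized value; normal samples are bundled into a prototype; and the anomaly score is one minus cosine similarity to that prototype. The class name and parameters (`levels`, `dim`, `seed`) are illustrative.

```python
import random


def random_hv(dim, rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def level_hvs(levels, dim, rng):
    """Level hypervectors where neighboring levels share most components.

    Each step flips a fresh slice of positions, so the first and last levels
    differ in about half of their components (roughly orthogonal).
    """
    order = list(range(dim))
    rng.shuffle(order)
    flips_per_step = dim // (2 * (levels - 1))
    hvs = [random_hv(dim, rng)]
    for step in range(1, levels):
        hv = hvs[-1][:]
        for idx in order[(step - 1) * flips_per_step: step * flips_per_step]:
            hv[idx] = -hv[idx]
        hvs.append(hv)
    return hvs


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


class HDCAnomalyDetector:
    def __init__(self, n_features, levels=10, dim=2000, seed=0):
        if levels < 2:
            raise ValueError("need at least two levels")
        rng = random.Random(seed)
        self.levels = levels
        self.ids = [random_hv(dim, rng) for _ in range(n_features)]
        self.level_vectors = level_hvs(levels, dim, rng)
        self.prototype = None

    def encode(self, x):
        """Bind each feature's ID vector with its value's level vector, then bundle by summing."""
        acc = [0] * len(self.ids[0])
        for id_hv, value in zip(self.ids, x):
            # Values are assumed pre-scaled to [0, 1]; clamp and quantize to a level index.
            level = min(self.levels - 1, max(0, int(value * (self.levels - 1) + 0.5)))
            lv = self.level_vectors[level]
            for d in range(len(acc)):
                acc[d] += id_hv[d] * lv[d]
        return acc

    def fit(self, normal_samples):
        """Bundle the encodings of normal samples into a single prototype."""
        proto = [0] * len(self.ids[0])
        for x in normal_samples:
            for d, v in enumerate(self.encode(x)):
                proto[d] += v
        self.prototype = proto
        return self

    def score(self, x):
        """Anomaly score: 1 minus cosine similarity to the normal prototype."""
        return 1.0 - cosine(self.encode(x), self.prototype)
```

Every operation is an integer add or sign flip over fixed-length vectors, which is what makes HDC attractive on microcontroller-class IoT hardware.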
PaddleOCR released PP-OCRv6, a new OCR model series ranging from 1.5M to 34.5M parameters. It claims improved accuracy and faster inference, supports 50 languages, adds new scenarios such as printed circuit boards (PCBs) and CAD drawings, and is open-sourced under the Apache 2.0 license.
Sigma-Branch restructures pretrained dense networks into a hierarchical binary tree with a shared backbone, routers, and specialized leaves, reducing per-inference active parameters by 58–60% while staying within 1.72 percentage points of baseline accuracy on CIFAR-100, ImageNet-1K, and ModelNet40.
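In a hard-routed tree like this, the saving in active parameters comes from evaluating only one root-to-leaf path per input. The Python sketch below does that bookkeeping for a complete binary tree of a given depth and shows hard routing, where each internal node's router picks a child. The parameter sizes, the complete-tree shape, and the `route` and `param_counts` helpers are hypothetical, not Sigma-Branch's actual configuration or code.

```python
def param_counts(depth, backbone, router, leaf):
    """Total vs per-inference active parameters for a complete binary routing tree."""
    internal_nodes = 2 ** depth - 1
    leaves = 2 ** depth
    total = backbone + internal_nodes * router + leaves * leaf
    active = backbone + depth * router + leaf  # shared backbone + one root-to-leaf path
    return total, active


def route(features, routers, depth):
    """Hard routing: at each internal node, a router returns True to go right."""
    node = 0  # heap indexing: the children of node i are 2i + 1 and 2i + 2
    for _ in range(depth):
        go_right = routers[node](features)
        node = 2 * node + 1 + int(go_right)
    return node - (2 ** depth - 1)  # leaf index in [0, 2**depth)
```

For example, `param_counts(2, 100, 1, 50)` gives 303 total but only 152 active parameters; deeper trees with larger leaves push the active fraction lower, at the cost of each leaf seeing fewer training examples.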
A developer ran DeepSeek-V4-Flash on a Raspberry Pi 5 by streaming model weights from an NVMe SSD, achieving 1.3 tokens/second at 8 watts, demonstrating the feasibility of frontier-adjacent open-weight models on low-cost, offline hardware.
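Streaming weights from an SSD typically relies on the operating system paging in only the parts of a file that are actually touched; llama.cpp, for example, memory-maps model files by default. The Python sketch below shows the idea with the standard `mmap` module. It assumes a hypothetical flat file of little-endian float32 values where each tensor is located by a byte offset. It is not the developer's code, and real formats such as GGUF add headers, alignment, and quantized types.

```python
import mmap
import struct


def read_tensor_slice(path, offset, count):
    """Read `count` little-endian float32 values starting at byte `offset`.

    The file is memory-mapped, so the OS reads only the pages this slice touches
    instead of loading the whole (possibly larger-than-RAM) file.
    """
    nbytes = 4 * count
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        raw = mm[offset:offset + nbytes]  # slicing copies just these bytes
    if len(raw) != nbytes:
        raise ValueError("slice runs past the end of the file")
    return list(struct.unpack(f"<{count}f", raw))
```

With an MoE model this pattern matters because each token touches only the selected experts, so most of the file stays on the SSD; throughput is then bounded by SSD read speed, consistent with low single-digit tokens per second.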
A detailed build log and benchmark of a Jetson Orin NX system running Hermes Agent reports 14.65 tok/s at 8k context and 10.21 tok/s at 60k context with a quantized Gemma 4 26B model.
This paper presents a two-stage methodology for end-to-end LLM deployment on spatial NPUs, progressing from human-guided development to an autonomous agent skill system. The system achieves speedups of 2.2x on prefill and 4.0x on decode for a reference model, and it autonomously deploys eight additional LLMs on the AMD XDNA 2 NPU with minimal human guidance.
A quiet revolution is making powerful AI models runnable on consumer hardware without expensive GPUs, thanks to advances in quantization and optimized implementations such as llama.cpp's support for Gemma 4 multi-token prediction (MTP), opening access to hobbyists, small businesses, and edge computing.
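Much of the memory saving that makes this possible comes from low-bit weight formats. The Python sketch below shows generic symmetric blockwise 4-bit quantization: each block of 32 weights shares one scale (its absolute maximum divided by 7), and each weight becomes an integer in [-7, 7]. It is a simplified illustration, not the exact layout of llama.cpp's Q4 formats, which differ in scale encoding, offsets, and bit packing; this sketch does not pack bits at all.

```python
def quantize_q4_blocks(values, block_size=32):
    """Symmetric 4-bit quantization with one float scale per block of weights."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / 7 if amax > 0 else 1.0  # map the largest magnitude to +/-7
        q = [max(-7, min(7, round(v / scale))) for v in block]
        blocks.append((scale, q))
    return blocks


def dequantize(blocks):
    """Reconstruct approximate weights as scale * integer code."""
    return [scale * qi for scale, q in blocks for qi in q]
```

If each scale were stored in 16 bits, a block would cost (32 × 4 + 16) / 32 = 4.5 bits per weight instead of 16 bits for FP16, and the per-weight error stays within half a quantization step of its block.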
The author announces a new blog post on clustering three Jetson Orin Nano Super boards for distributed training and inference, continuing a series that helps people build small compute clusters from accessible hardware.
A developer argues that the edge AI community overlooks small, specialized models that can run locally on devices such as smartphones, citing a self-built offline Morse code recognition feature as an example. The project uses a sub-5 MB model built with TensorFlow/Keras and run with LiteRT, and the entire pipeline, from data generation to mobile integration, was custom-built.
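The post does not describe how its training data was generated. One plausible first step, sketched below in Python under that assumption, is converting text into an on/off keying sequence with standard Morse timing (dot 1 unit, dash 3 units, gap inside a letter 1 unit, gap between letters 3 units, gap between words 7 units). Such sequences could then be rendered as audio or light pulses and paired with the source text as labels. `to_keying` and its `unit` parameter are illustrative names, only the letters A to Z are covered, and a realistic generator would also add timing jitter and noise.

```python
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
}


def to_keying(text, unit=1):
    """Convert text to an on (1) / off (0) keying sequence using standard Morse timing."""
    signal = []
    for w_idx, word in enumerate(text.upper().split()):
        if w_idx:
            signal += [0] * (7 * unit)  # gap between words
        for c_idx, ch in enumerate(word):
            if c_idx:
                signal += [0] * (3 * unit)  # gap between letters
            for s_idx, sym in enumerate(MORSE[ch]):
                if s_idx:
                    signal += [0] * unit  # gap between dots/dashes in a letter
                signal += [1] * ((1 if sym == "." else 3) * unit)
    return signal
```

For example, `to_keying("ET")` yields `[1, 0, 0, 0, 1, 1, 1]`, and "SOS" spans 27 units; varying `unit` per sample is one simple way to simulate different sending speeds.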
Google released Gemma 4 12B, an open-source multimodal AI model under Apache 2.0 that runs locally on laptops with 16GB RAM, targeting enterprise edge deployment.
The author uses the Qwen3.6-35B-A3B model with the oMLX tool on a new local machine for daily tasks and finds that both speed and quality far exceed expectations, even outperforming remote LLMs for personal-assistant (PA) and coding work, which points to a significant improvement in on-device AI capabilities.