Tag
The author runs the Qwen3.6-35B-A3B model with the oMLX tool on a new local machine for daily tasks and finds that both speed and quality far exceed expectations, even outperforming remote LLMs in PA and coding scenarios, a sign of how much on-device AI has improved.
A deep benchmark of 8 tiny LLMs (135M to 1B parameters) on a $250 Jetson Orin Nano Super across four power modes finds 25W to be Pareto-optimal, with SmolLM2-135M achieving 165.1 tok/s and best efficiency.
NVIDIA announces JetPack 7.2 and NemoClaw support on Jetson, bringing agentic AI capabilities to edge devices like robotics and industrial automation, with performance boosts and new developer tools.
NVIDIA announced the Factory Operations Blueprint (FOX), a reference design for building autonomous factory manager AI agents that integrate real-time data, automate model training, and orchestrate specialized agents, with early adoption by major manufacturers.
Liquid AI releases LFM2.5-8B-A1B, an 8B MoE model with 1.5B active parameters and 128K context, optimized for edge devices.
A 7 MB open-source L4 self-driving AI learns navigation, lane following, and drift recovery from visual and sensor input, and is designed to run on phones and embedded devices without heavy infrastructure.
NVIDIA has launched the $249 Jetson Orin Nano Super developer kit, an AI computer that runs large models like Llama 3 and Mistral locally, cutting monthly OpenAI costs from $200 to just $22 in electricity.
A demonstration runs a DCGAN with 12.6M int8-quantized parameters on a low-cost RISC-V microcontroller (CH32H417), generating 64x64 cat faces in 26 seconds using pure C inference and quantum entropy sampling.
BitCPM is a new open-source model from ModelBest, Tsinghua, and OpenBMB that uses ternary weights (-1, 0, 1) to run full-sized AI models on phones.
A custom C++ inference engine for MiniCPM-V 4.6 on the Orange Pi AIPro (Ascend 310B NPU) achieves a 2x speedup over the stock framework through optimized AscendC kernels for matmul and causal-conv1d, reaching 5.90 tokens/s.
BitCPM-CANN is the first open-source 1.58-bit ternary LLM trained entirely on Chinese-developed AI infrastructure (Huawei Ascend 910B), offering extreme memory reduction for edge deployment.
General Instinct launches a deployment layer that enables frontier AI models to run on constrained edge hardware like Jetsons and mobile NPUs, helping robotics and physical AI teams achieve low-latency offline inference.
A tweet argues the next AI boom will be compact intelligence on edge devices rather than larger data centers, with Liquid AI supporting the vision of running AI on phones, cars, and everyday devices.
A detailed examination of the real-world challenges faced when updating AI models on edge devices deployed in remote or disconnected environments, covering strategies like connectivity windows, technician visits, mesh propagation, and accepting staleness.
This paper proposes an Edge-AI-driven decentralized task allocation framework for circular smart manufacturing that uses learning-to-rank to align with the ordering-based nature of winner selection. Simulation results show improved delay, deadline adherence, and energy efficiency under high-load and tight-deadline scenarios.
This paper proposes a knowledge-adaptive edge expert agent architecture for ecological monitoring, separating visual perception from reasoning to reduce reliance on cloud resources and enable sustainable on-device AI in remote deployments.
ExecuTorch, PyTorch's on-device AI deployment framework, won the Best Industry Paper Award at MLSysConf 2026. The paper introduces a unified solution for running models on diverse hardware, from microcontrollers to SoCs.
A user shares a hands-on comparison of running Gemma 4 with LiteRT-LM on mobile devices versus their previous llama.cpp setup, noting significantly lower memory usage (1.5-2 GB vs 4-5 GB) and faster inference (2-4 seconds vs 7-10 seconds) on smartphones such as the Samsung S25 Ultra and iPhone 13 Pro Max.
Sipeed's new K3 RISC-V single-board computers feature 32GB LPDDR5 and a 60 TOPS NPU, enabling local inference of large language models at up to 15 tokens per second.
The article highlights the ability to run Qwen3-35B-A3B locally on a laptop for free using llama.cpp and Unsloth 4-bit quantization.
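A note on the two BitCPM entries above: "ternary" and "1.58-bit" describe the same thing, since a weight limited to three values carries log2(3) ≈ 1.58 bits of information. The entries do not say how BitCPM maps weights to ternary values, so the sketch below assumes the absmean scheme published for BitNet b1.58 (divide by the mean absolute weight, round, clip to [-1, 1]). The function names `ternary_quantize` and `dequantize` are hypothetical and used only for illustration.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Map float weights to {-1, 0, 1} plus one shared scale.

    Assumption: absmean scaling as described for BitNet b1.58;
    this is not confirmed to be BitCPM's method.
    """
    # Shared scale: mean absolute value of the weights (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Approximate the original weights from ternary values and the scale."""
    return [q * scale for q in quantized]

# Information content of one three-valued weight, in bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

if __name__ == "__main__":
    q, s = ternary_quantize([0.9, -0.05, -1.2, 0.4])
    print(q, round(s, 4), round(BITS_PER_TERNARY_WEIGHT, 2))
```

Storing only the ternary values and one scale per group of weights is where the memory reduction mentioned for edge deployment comes from; real implementations pack several ternary values per byte, which this sketch does not attempt.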