This paper introduces CATS, a cascaded adaptive tree speculation framework designed to accelerate LLM inference on memory-constrained edge devices by optimizing memory usage while maintaining high token acceptance rates.
This paper introduces QuIDE, a framework featuring an Intelligence Index to evaluate the trade-offs between compression, accuracy, and latency in quantized neural networks. It demonstrates that the optimal bit-width varies by task: 4-bit suits LLMs and simple tasks, while 8-bit is better for complex CNNs.
This paper introduces EdgeFlowerTune, a benchmark for evaluating federated LLM fine-tuning under realistic edge system constraints, demonstrating that accuracy-only metrics can be misleading regarding deployability.
MiniCPM-V 4.6 is an ultra-efficient 1.3B vision-language model optimized for mobile devices.
The article analyzes the viability of running AI inference locally on a MacBook Pro, comparing a local Qwen 35B model against the cloud-based Claude Opus 4.5. It concludes that local models are 2x faster for routine tasks, making them a practical choice for half of daily workloads despite a slight capability gap.
OpenBMB has released MiniCPM-V 4.6, a 1B-parameter multimodal large language model optimized for mobile devices under the Apache 2.0 license. It features mixed visual token compression and claims approximately 1.5x faster throughput than Qwen3.5 0.8B while running natively on iOS, Android, and HarmonyOS.
Nvidia is backing Span's initiative to deploy residential mini-data centers that leverage underutilized home electricity to run distributed AI workloads. The concept aims to bypass grid constraints by placing GPU nodes beside houses, though it remains largely unproven in real-world deployments.
The article argues against relying on cloud-hosted AI APIs due to privacy and reliability concerns, advocating for on-device AI processing as demonstrated by a native iOS app using Apple's local model APIs.
A tutorial on building an autonomous AI agent on a $15 RISC-V device (LicheeRV Nano) that can manage its own Lightning Network wallet and make autonomous Bitcoin payments via Nostr.
This paper presents an automated diagnostic system for grading knee osteoarthritis severity using an optimized ResNet-18 model deployed on edge devices via TensorFlow Lite. It integrates an LLM interface using Gemini 2.0 Flash to provide structured interpretive findings while maintaining offline capability for resource-constrained environments.
MIT researchers developed a new framework called FTTE that accelerates privacy-preserving federated learning by 81%, enabling efficient AI training on resource-constrained edge devices like smartwatches and sensors.
Researchers introduce 8M-30M parameter micro language models that instantly generate the first few words on-device before cloud models complete responses, enabling responsive AI on ultra-constrained devices like smartwatches.
EdgeDetect is a federated intrusion detection system for 6G-IoT environments that combines importance-aware gradient binarization (32× compression) with Paillier homomorphic encryption to achieve 98% accuracy on CIC-IDS2017 while reducing communication overhead by 96.9% and enabling deployment on resource-constrained devices like Raspberry Pi 4.
Cloudflare and OpenAI have partnered to make OpenAI's frontier models, including GPT-5.4, directly accessible within Cloudflare Agent Cloud, enabling enterprises to deploy AI agents for real-world tasks at scale. The integration also includes Codex tools now generally available in Cloudflare Sandboxes and upcoming availability in Workers AI.
NVIDIA and Google collaborate to optimize Gemma 4 models for local deployment across RTX GPUs, DGX Spark, and Jetson devices, enabling efficient on-device agentic AI with support for reasoning, coding, multimodal capabilities, and 35+ languages.
Google DeepMind announces the full release of Gemma 3n, a mobile-first multimodal AI model optimized for on-device efficiency with MatFormer architecture. The release includes E2B and E4B variants designed for low memory usage while delivering strong performance in reasoning, coding, and multilingual tasks.
Supertonic is an open-source, on-device text-to-speech system designed for local inference with minimal overhead; its newly released version 3 adds support for 31 languages and improved accuracy.
AT&T outlined its AI strategy at MWC, focusing on 5G edge monetization and AI-driven 6G R&D to create new revenue streams and optimize network energy use.
RuView is an open-source WiFi sensing platform that uses Channel State Information (CSI) from low-cost ESP32 sensors to detect people, track movement, measure vital signs, and estimate pose through walls without cameras or wearables. The system runs entirely on edge hardware with cryptographic attestation and uses spiking neural networks for local adaptation.