Tag
A user reports running Google's Gemma 4 12B model locally on a single RTX 3090 via GGUF quantization and finding strong performance, including a genuine 256k-token context window, multimodal capabilities, and function calling that outperforms larger 70B models on coding tasks.
This paper systematically studies hybrid multi-agent systems combining cloud-based LLMs and on-device SLMs, revealing task-dependent optimal architectures and challenging the assumption that more frontier compute always improves performance.
MobileExplorer is a new framework that accelerates on-device inference for mobile GUI agents by performing lightweight parallel exploration of UI elements during model inference, reducing reasoning steps and latency by 23% while maintaining or improving task success rates.
This article introduces ExecuTorch, a unified PyTorch-native deployment framework designed to run AI models on diverse edge devices without requiring model conversion or reimplementation.
The article argues against relying on cloud-hosted AI APIs due to privacy and reliability concerns, advocating for on-device AI processing as demonstrated by a native iOS app using Apple's local model APIs.
A discussion post exploring where edge AI will have the greatest impact: autonomy and robotics, low-power vision systems, private local LLMs, or bandwidth-constrained industrial deployments.