Tag
Introduces Stratum, a system-hardware co-design approach utilizing 3D-stackable DRAM to efficiently accelerate Mixture of Experts (MoE) models.
Reason-ModernColBERT achieves near-perfect results on BrowseComp-Plus, surpassing the previous state of the art and models 54× larger; Agent-ModernColBERT improves on it further with minimal training.
MiniCPM-V 4.6 is an ultra-efficient 1.3B vision-language model optimized for mobile devices.
Lecture notes from an Efficient AI course covering Transformer and LLM fundamentals, including multi-head attention, positional encoding, KV cache, and the connection between model architecture and inference efficiency. The content explains how design choices in transformers affect memory, latency, and hardware efficiency.
The authors present TOPAS, a recursive AI architecture achieving 11.67% on ARC-AGI-2 using a single RTX 4090, aiming to demonstrate that architectural efficiency can outweigh raw compute power.
A highly efficient AI model architecture using ternary weights (-1, 0, +1) that achieves competitive performance at about 1.58 bits per parameter (log2 3 ≈ 1.585), enabling deployment on extremely constrained devices.
MiniCPM-V 4.5 is an 8B multimodal large language model that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a novel data strategy, and a hybrid reinforcement learning approach. The model reportedly outperforms larger proprietary and open-source models on benchmarks while significantly reducing GPU memory usage and inference time.