Tag
This article introduces a new method proposed by Google Research, Cornell, and USC that takes snapshots of RNN memory and caches them, enabling RNNs to efficiently handle long contexts. It combines Transformer-like strong memory with RNN-like low cost, offering a new direction for long-context AI.
HRM-Text is a newly released 1B-parameter base model that, according to its authors, can be pretrained from scratch for only ~$1000, cutting compute and data requirements by a factor of hundreds. It relies on efficiency techniques such as a hierarchical recursive architecture, latent-space reasoning, and PrefixLM packing. The paper and code are open source.
PrismML releases Bonsai Image 4B, a family of compact image generation models using 1-bit and ternary weights, enabling high-quality diffusion inference on local devices like laptops and iPhones with significantly reduced memory footprint.
This post summarizes Efficient AI Lecture 15 on long-context LLMs, covering RoPE position interpolation for context extension, the needle-in-haystack evaluation, and StreamingLLM's attention sink phenomenon and KV cache eviction strategy.
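The two mechanisms named in the summary above can be illustrated with a minimal sketch. It assumes the standard RoPE frequency schedule with base 10000 and a "keep the first few tokens plus a recent window" eviction rule. The function names, the sink count of 4, and the window size are illustrative choices, not values taken from the lecture.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under RoPE.

    A scale below 1 is position interpolation: positions are compressed
    so that a longer context maps back into the trained position range.
    """
    pos = position * scale
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def streaming_keep_indices(cache_len, num_sinks=4, window=8):
    """Indices of KV cache entries kept by a sink-plus-recent-window policy.

    The first num_sinks tokens (the attention sinks) are always kept,
    together with the most recent `window` tokens; the middle is evicted.
    """
    if cache_len <= num_sinks + window:
        return list(range(cache_len))
    recent = range(cache_len - window, cache_len)
    return list(range(num_sinks)) + list(recent)


# Hypothetical lengths: a model trained on 2048 tokens stretched to 8192.
trained_len, target_len = 2048, 8192
scale = trained_len / target_len

# Position 8000 in the stretched context reuses the angles of position 2000.
stretched = rope_angles(8000, head_dim=64, scale=scale)
original = rope_angles(2000, head_dim=64)

kept = streaming_keep_indices(cache_len=20)
```

With this eviction rule the cache size stays bounded at `num_sinks + window` entries no matter how long the stream runs, which is the property that makes the approach usable for unbounded generation.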
A developer tests a Cold War-era AI model on satellite image datasets using Monte Carlo simulations, finding it efficient and suitable for FPGA deployment.
Introduces Stratum, a system-hardware co-design approach utilizing 3D-stackable DRAM to efficiently accelerate Mixture of Experts (MoE) models.
Reason-ModernColBERT achieves near-perfect results on BrowseComp-Plus, surpassing the previous SOTA and models 54× larger. Agent-ModernColBERT then improves on it further with minimal training.
MiniCPM-V 4.6 is an ultra-efficient 1.3B vision-language model optimized for mobile devices.
Lecture notes from an Efficient AI course covering Transformer and LLM fundamentals, including multi-head attention, positional encoding, KV cache, and the connection between model architecture and inference efficiency. The content explains how design choices in transformers affect memory, latency, and hardware efficiency.
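The link between architecture and memory mentioned above can be made concrete with the usual back-of-the-envelope estimate of KV cache size. This is a sketch only: the configuration numbers below are hypothetical and are not taken from the lecture notes.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to fp16 or bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical 32-layer model with 32 KV heads of dimension 128, 4096 tokens.
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                          head_dim=128, seq_len=4096)

# Same model with grouped-query attention sharing 8 KV heads.
grouped = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                         head_dim=128, seq_len=4096)

print(full_mha / 2**30, "GiB with 32 KV heads")
print(grouped / 2**30, "GiB with 8 KV heads")
```

The estimate grows linearly in sequence length and batch size, which is why reducing the number of KV heads or the stored precision has a direct effect on how much context fits in memory.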
The authors present TOPAS, a recursive AI architecture achieving 11.67% on ARC-AGI-2 using a single RTX 4090, aiming to demonstrate that architectural efficiency can outweigh raw compute power.
A highly efficient AI model architecture using ternary weights (-1, 0, 1) that achieves competitive performance while requiring only 1.58 bits per parameter, enabling deployment on extremely constrained devices.
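The 1.58-bit figure follows from information content: a weight with three possible values carries log2(3) ≈ 1.58 bits. The sketch below shows that arithmetic alongside one simple way to map floating-point weights to {-1, 0, 1}. The absolute-mean scaling used here is an assumption for illustration; the summary does not say which quantization rule the model uses.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map weights to {-1, 0, 1} plus one shared scale.

    Assumed scheme: divide by the mean absolute value, round to the
    nearest integer, then clamp to the range [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate weights from ternary values and the scale."""
    return [q * scale for q in quantized]


# Information content of one three-valued weight.
bits_per_weight = math.log2(3)

# Illustrative weights, not taken from any real model.
weights = [0.9, -0.05, 0.4, -1.2]
quantized, scale = ternary_quantize(weights)
approx = dequantize(quantized, scale)
```

Because every weight is -1, 0, or 1, a matrix-vector product reduces to additions and subtractions followed by one multiplication by the shared scale, which is part of why such models suit constrained hardware.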
MiniCPM-V 4.5 is an 8B multimodal large language model that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a novel data strategy, and a hybrid reinforcement learning approach. The model reportedly surpasses larger proprietary and open-source models on benchmarks while significantly reducing GPU memory usage and inference time.