Tag
This paper explores how an exponentially decaying memory module from RAT+ can improve query-aware sparse inference methods for long-context language models, demonstrating consistent accuracy gains across various sparse budgets on needle-in-a-haystack tasks.
This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention without approximation error by leveraging kernel decomposition, and addresses gradient explosion and token dilution through constrained kernel functions. It also presents engineering innovations including Hyper Link, Memory Lobe, and a routing bias for Mixture of Experts.
This paper presents NGM, a plug-and-play, training-free memory module for LLMs that uses a Causal N-Gram Encoder and a Cosine-Gated Memory Injector to improve performance on code generation and knowledge-intensive tasks.
NGM is a training-free, plug-and-play memory module for LLMs that retrieves N-gram knowledge from pretrained token embeddings, with no separate retrieval pipeline, achieving gains of up to 3 points on code generation and knowledge tasks.
NanoResearch is a multi-agent framework designed to personalize research automation by co-evolving skills, memory, and policy to adapt to individual user preferences and research styles.