Tag
This paper introduces generic triple-latent recurrent models that compress token pair interactions into a latent state, and a gated associative retrieval variant that improves exact recall. The hybrid model outperforms Transformers on byte-level WikiText-2 and a tokenized language benchmark, achieving up to 41.9% associative recall versus 25%.
This paper introduces Toeplitz MLP Mixers (TMM), a novel architecture that replaces attention with Toeplitz matrix multiplication to achieve lower computational complexity while maintaining high information retention and training efficiency.
This paper presents Olmo Hybrid, a 7B-parameter language model that combines attention and Gated DeltaNet recurrent layers, demonstrating both theoretical and empirical advantages over pure transformers. The work shows that hybrid models have greater expressivity, scale more efficiently during pretraining, and outperform comparable transformer baselines.