transformer-alternatives

Generic Triple-Latent Compression with Gated Associative Retrieval

arXiv cs.CL · 2026-06-05

This paper introduces generic triple-latent recurrent models that compress token-pair interactions into a latent state, along with a gated associative retrieval variant that improves exact recall. The hybrid model outperforms Transformers on byte-level WikiText-2 and a tokenized language benchmark, achieving up to 41.9% associative recall versus 25%.

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

arXiv cs.LG · 2026-05-11

This paper introduces Toeplitz MLP Mixers (TMM), a novel architecture that replaces attention with Toeplitz matrix multiplication to achieve lower computational complexity while maintaining high information retention and training efficiency.
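The summary does not spell out the layer itself, so the following is only a rough sketch of what "Toeplitz matrix multiplication" over sequence positions can look like, not the paper's implementation. It assumes a causal (lower-triangular) mixing matrix whose entry for positions i and j depends only on the offset i - j, so a length-T sequence needs T mixing weights rather than a full T×T matrix. The names `toeplitz_mix`, `toeplitz_matrix`, and `kernel` are hypothetical and chosen for illustration.

```python
def toeplitz_matrix(kernel):
    """Build the explicit lower-triangular Toeplitz matrix for inspection.

    kernel[k] is the (assumed) weight given to the token k positions back.
    """
    T = len(kernel)
    return [[kernel[i - j] if j <= i else 0.0 for j in range(T)]
            for i in range(T)]


def toeplitz_mix(x, kernel):
    """Mix tokens along the sequence axis with a causal Toeplitz matrix.

    x      : list of T feature vectors, each a list of D floats
    kernel : list of T weights, one per offset (hypothetical parameterization)
    """
    T, D = len(x), len(x[0])
    out = [[0.0] * D for _ in range(T)]
    for i in range(T):
        for j in range(i + 1):          # causal: only positions j <= i
            w = kernel[i - j]           # weight depends only on the offset i - j
            for d in range(D):
                out[i][d] += w * x[j][d]
    return out


# Tiny illustrative input: 3 tokens with 2 features each.
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
kernel = [0.5, 0.25, 0.125]
mixed = toeplitz_mix(x, kernel)
print(mixed)  # [[0.5, 0.0], [0.25, 0.5], [1.125, 1.25]]
```

The loop above is written for readability and costs O(T²) per feature. A Toeplitz matrix-vector product can also be computed with FFT-based convolution in O(T log T); whether the paper relies on that route is not stated in the summary.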

Olmo Hybrid: From Theory to Practice and Back

arXiv cs.CL · 2026-04-20

This paper presents Olmo Hybrid, a 7B-parameter language model that combines attention and Gated DeltaNet recurrent layers, demonstrating both theoretical and empirical advantages over pure transformers. The work shows that hybrid models have greater expressivity, scale more efficiently during pretraining, and outperform comparable transformer baselines.
