Tag
An update on Matrix Recurrent Units (MRU), a linear-time attention alternative. The author explores methods to stabilize training, finding that orthogonal matrices underperform while LDU factorization works best, and shows MRU underperforms transformers on larger datasets like TinyStories.