Tag
Released RRT-355M, a softmax-free attention model at GPT-2 Medium scale (354M parameters), trained from scratch on 11.5B tokens. It uses structural sparsity and tile-skipping kernels for long-context efficiency and achieves performance comparable to GPT-2 Medium on a 22-task benchmark.