structural-sparsity


I released a softmax-free attention model at GPT-2 Medium scale (~354M params, 11.5B tokens): structural sparsity + tile-skipping kernels for long-context VRAM savings. Open weights + custom Triton kernels [R]

Reddit r/MachineLearning · 5d ago

RRT-355M is a softmax-free attention model at GPT-2 Medium scale (~354M parameters), trained from scratch on 11.5B tokens. It uses structural sparsity and tile-skipping kernels for long-context efficiency, and is reported to perform comparably to GPT-2 Medium on a 22-task benchmark.

