Tag
A user benchmarks a V100-compatible port of Flash Attention 2, reporting 3x-17x speedups and up to 94% memory reduction over default PyTorch attention.