Tag
An in-depth blog post on the inner workings of modern dense transformers, covering YaRN for positional information, hybrid attention for long contexts, soft capping, QK normalization, and transformer math, including FLOPs-per-token formulas and cluster sizing.