attention

#attention

High Dimensional, Dynamic Rotary Positional Embedding [P]

Reddit r/MachineLearning ↗ · 2026-06-24

Introduces HDD-RoPE, an extension of rotary positional embeddings that uses high-dimensional chunks and data-dependent rotation rates, showing faster convergence on TinyStories compared to xPos.

#attention

The memory wall gets expensive: KV cache is why you should stop worshiping softmax attention

Reddit r/singularity ↗ · 2026-06-24

The article discusses how rising DDR5 memory prices signal a broader memory bottleneck in AI, particularly the KV cache in softmax attention for LLMs, and highlights post-transformer architectures like linear attention and state space models that aim to reduce memory usage.

#attention

Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets

arXiv cs.LG ↗ · 2026-06-24 Cached

Introduces Nexus Sampling, a training-free KV-cache eviction method that replaces deterministic top-k selection with weighted reservoir sampling. It targets long-context LLM inference under a fixed memory budget and reportedly matches dense-attention performance at 80% eviction.
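
As a rough illustration of the sampling idea named in this summary, here is a minimal sketch of classic weighted reservoir sampling (the Efraimidis-Spirakis scheme) applied to a stream of cache positions. It is not the paper's method: the function name, the use of one importance weight per position, and the fixed `budget` are assumptions made for the example.

```python
import heapq
import random


def weighted_reservoir_sample(weights, budget, rng=None):
    """Keep up to `budget` stream positions, favoring larger weights.

    Hypothetical helper: `weights` stands in for per-token importance
    scores (an assumption), processed one at a time as a stream.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, position); smallest key is evicted first
    for pos, w in enumerate(weights):
        if w <= 0:
            continue  # zero-weight items are never kept
        # Each item draws key = u ** (1 / w); larger weights tend to give larger keys.
        key = rng.random() ** (1.0 / w)
        if len(heap) < budget:
            heapq.heappush(heap, (key, pos))
        elif key > heap[0][0]:
            # Evict the current smallest key in favor of the new item.
            heapq.heapreplace(heap, (key, pos))
    return sorted(pos for _, pos in heap)


scores = [0.1, 5.0, 0.2, 3.0, 0.05, 4.0]
kept = weighted_reservoir_sample(scores, budget=3, rng=random.Random(0))
print(kept)
```

Unlike deterministic top-k, a low-weight position still has a nonzero chance of surviving, and the reservoir never holds more than `budget` entries at any point in the stream.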

#attention

I Figured Out What Causes 'Super Weights'

Reddit r/ArtificialInteligence ↗ · 2026-06-23

Explains that super weights in large language models arise from the SoftMax-Attention interaction creating a 'Nothing Dump' token that serves as a stable reference point; removing these weights cripples performance.

#attention

Attention Is All You Need

Reddit r/ArtificialInteligence ↗ · 2026-06-22

A reflection on the landmark 'Attention Is All You Need' paper, highlighting how removing recurrence and relying solely on attention mechanisms revolutionized AI and led to modern LLMs like GPT and Claude.

#attention

@TheAhmadOsman: INCREDIBLE RESOURCE The MOST COMPLETE GUIDE for understanding LLMs from first principles is now available online to rea…

X AI KOLs Timeline ↗ · 2026-06-21 Cached

A comprehensive free guide explaining LLMs from first principles, covering tokens, transformers, attention, fine-tuning, and local deployment.

#attention

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

arXiv cs.AI ↗ · 2026-06-20 Cached

Introduces ITNet, a neural architecture based on a learnable integral transform that unifies convolution, attention, and recurrence, achieving strong results across multiple modalities.

#attention

Dual Dimensionality for Local and Global Attention

arXiv cs.CL ↗ · 2026-06-18 Cached

Proposes Distance-Adaptive Representation (DAR) which reduces key-value dimensionality for distant tokens while preserving full dimensionality for nearby tokens, improving KV cache efficiency without performance loss.

#attention

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

This paper introduces QG-MIL, a gated transformer aggregator that mitigates attention concentration in multiple instance learning for medical imaging, achieving domain-agnostic performance without auxiliary losses.

#attention

Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

Grouped Query Experts (GQE) applies a mixture-of-experts layer on top of grouped-query attention, activating only a subset of query heads per token while keeping the key-value cache benefits of GQA. At the 250M-parameter scale it matches baseline accuracy with half the query-head compute.

#attention

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

HydraHead is a novel attention hybridization architecture that combines Full and Linear Attention at the head level, achieving superior long-context performance with reduced training overhead via interpretability-driven selection and scale-normalized fusion.

#attention

@sairahul1: Nobody tells you what's actually inside GPT or Claude. They say "transformer" and move on. This repo builds one from sc…

X AI KOLs Timeline ↗ · 2026-06-15 Cached

A repository that builds a transformer from scratch without high-level libraries, explaining attention mechanisms and the full training pipeline, trainable in a day on free Colab.

#attention

@pradheepraop: implemented the top-k kernel from the kernel design section in the msa paper. https://github.com/Mantissagithub/learn_c…

X AI KOLs Timeline ↗ · 2026-06-15 Cached

An implementation of the top-k kernel from the kernel design section of the MSA paper, using exp-free comparison and warp-level tree merging with CUDA shuffles. The code is available on GitHub.

#attention

@freeman1266: You don't need math to understand most AI papers—just understand this chain: token → embedding → position encoding → attention → FFN → residual stream → next-token prediction. LLMs essentially stack Transf…

X AI KOLs Timeline ↗ · 2026-06-15 Cached

A Chinese-language explainer tweet that walks through the core chain of an LLM (token, embedding, position encoding, attention, FFN, residual stream, next-token prediction) to help readers without a math background follow AI papers.

#attention

@Fluyeporlaweb: This genius published a step-by-step guide on GitHub for building and training your own model from scratch. No magic. N…

X AI KOLs Timeline ↗ · 2026-06-15 Cached

A GitHub guide shared by @Fluyeporlaweb shows how to build and train a Transformer model from scratch without high-level libraries, implementing attention, multi-head attention, embeddings, and post-training algorithms (SFT, PPO, DPO, GRPO), with training on The Pile dataset.

#attention

MiniMax Sparse Attention for Million-Token Contexts (GitHub Repo)

TLDR AI ↗ · 2026-06-15 Cached

MiniMaxAI releases MSA, a library for dense and sparse attention kernels optimized for NVIDIA SM100 GPUs, enabling efficient processing of million-token contexts with FlashAttention and sparse top-k attention.

#attention

@CamilleRoux: Une explication bien faite du fonctionnement interne des LLMs : tokens, embeddings, positional encoding, attention, fee…

X AI KOLs Timeline ↗ · 2026-06-14 Cached

This tweet shares a well-made explanation of the internal workings of LLMs, covering tokens, embeddings, positional encoding, attention, and feed-forward networks, via a blog post by 0xkato.

#attention

Tiny Scale Is All I Can Spare To Play With Transformer

Reddit r/LocalLLaMA ↗ · 2026-06-11

A student introduces Silia, a transformer architecture that merges attention and FFN into a single operation to save parameters at scales of 10M parameters or fewer, reporting performance comparable to GPT-2 with fewer parameters despite limited compute.

#attention

Deficient executive control in transformer attention

Hacker News Top ↗ · 2026-06-10

The article discusses a deficiency in executive control within transformer attention mechanisms, highlighting limitations in how transformers manage sequential dependencies.

#attention

Blurry Window Attention

arXiv cs.LG ↗ · 2026-06-10 Cached

Introduces Blurry Window Attention (BLA), an attention method with bounded-memory control that reconstructs a blurry KV history via Dirichlet kernel interpolation, achieving 8x state efficiency over Sliding Window Attention on the Multi-Query Associative Recall task.
