transformer-architecture

#transformer-architecture

Open weights aren't catching up to closed models by copying them, but they're winning because of how the whole AI stack is quietly modularising

Reddit r/singularity ↗ · 4h ago

The article argues that open-weight AI models are catching up to closed ones not through distillation but because the AI stack is modularising. Stable interfaces (the Transformer architecture, OpenAI-compatible APIs, agentic harnesses) let innovations diffuse rapidly across the ecosystem, shrinking the capability gap while preserving a large price advantage and potentially commoditising frontier AI.

I wrote a free 15-part series on LLM internals — real math, real tensor shapes, real hardware constraints. All grounded in Gemma 4 12B's actual config.

Reddit r/LocalLLaMA ↗ · 2026-06-20

A comprehensive 15-part series covering LLM internals from tokenization to serving, grounded in Gemma 4 12B's actual config.

OneRank: Unified Transformer-Native Ranking Architecture for Multi-Task Recommendation

Hugging Face Daily Papers ↗ · 2026-06-15 Cached

OneRank proposes a Transformer-native multi-task ranking framework that integrates feature encoding and prediction to reduce inter-task interference and improve ranking performance in recommender systems.

@_rohit_tiwari_: https://x.com/_rohit_tiwari_/status/2063982924714901858

X AI KOLs Timeline ↗ · 2026-06-08 Cached

This article provides a visual guide to the Transformer architecture in Large Language Models, covering self-attention, causal self-attention, masked multi-head attention, and the output layer with step-by-step explanations and examples.
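
As a rough illustration of the causal self-attention step named in this summary, here is a minimal single-head sketch in plain Python. It is not taken from the linked guide: the function names `softmax` and `causal_self_attention` are hypothetical, and the queries, keys, and values are assumed to be already projected, one vector per token.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head causal attention (hypothetical helper).

    q, k, v are lists of equal length, one vector per token.
    Token t attends only to positions 0..t, which is the causal mask.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scaled dot-product scores against the current and earlier tokens only.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out
```

Because later positions are never scored, changing a later token's key or value leaves every earlier output unchanged. That property is what the mask is meant to guarantee.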

Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap

arXiv cs.LG ↗ · 2026-05-19 Cached

This paper introduces a Jacobian-PCA-Grassmann framework to analyze the geometric structure of expert specialization in Mixture-of-Experts (MoE) Transformers. It finds that experts exhibit strong functional decorrelation while their representations overlap, and that routing sparsity significantly influences this geometry.

Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

arXiv cs.LG ↗ · 2026-05-15 Cached

This paper demonstrates that cosine similarity is a poor proxy for assessing layer importance in LLMs, and proposes using the actual accuracy drop from layer removal as a more robust metric.

RT-Transformer: The Transformer Block as a Spherical State Estimator

arXiv cs.LG ↗ · 2026-05-13 Cached

This paper presents a theoretical framework interpreting Transformer components (attention, residual connections, normalization) as arising from a spherical state estimation problem using Radial-Tangential SDEs.

@pallavishekhar_: KV Cache in LLMs Read here: https://outcomeschool.com/blog/kv-cache-in-llms…

X AI KOLs Timeline ↗ · 2026-05-09 Cached

This article explains the concept of KV Cache in Large Language Models, detailing how it optimizes text generation by storing and reusing key-value pairs to avoid redundant computations during inference.
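
The reuse idea in this summary can be sketched in a few lines of plain Python. This is an illustrative sketch, not the linked article's code: `KVCache`, `attend`, and `decode_without_cache` are hypothetical names, and each token's query, key, and value vectors are assumed to be given. In a real model they come from projecting the token's hidden state, and the cache is what avoids redoing that projection for earlier tokens at every step.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    d = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * v[j] for e, v in zip(exps, values))
        for j in range(len(values[0]))
    ]


class KVCache:
    """Hypothetical per-layer cache: stores keys and values of past tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append only the new token's key/value, then attend over the cache.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


def decode_without_cache(qs, ks, vs):
    """Baseline: rebuild the key/value prefix from scratch at every step."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]
```

Calling `step` once per generated token should give the same outputs as `decode_without_cache`, while handling only one new key/value pair per step instead of rebuilding the whole prefix.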

@simplifyinAI: DeepSeek has dropped a fundamental rewrite of the Transformer architecture. And it solves the "identity crisis" that br…

X AI KOLs Timeline ↗ · 2026-05-09

DeepSeek has published a paper introducing mHC (Manifold-Constrained Hyper-Connections), a fundamental rewrite of the Transformer architecture that stabilizes large models by replacing standard residual connections with mathematically constrained multi-stream pathways.

@tetsuoai: Forty minutes of whiteboard. The full transformer architecture. Then open vim and write it in C.

X AI KOLs Timeline ↗ · 2026-05-08 Cached

A 40-minute walkthrough explains the complete Transformer architecture via whiteboard diagrams and demonstrates a practical implementation in C using Vim.

@AYi_AInotes: Mind blown—22-year-old founder open-sources what Anthropic kept locked away: the Claude Mythos black box

X AI KOLs Timeline ↗ · 2026-04-19 Cached

A 22-year-old founder claims to have cracked open the architecture black box of Anthropic’s Claude Mythos model via an open-source project, speculating it uses a recurrent-depth Transformer design instead of simply scaling parameters.

ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]

Reddit r/MachineLearning ↗ · 2026-04-16

ResBM introduces a transformer-based architecture with residual encoder-decoder bottlenecks for pipeline-parallel training, achieving 128× activation compression while maintaining convergence. The work advances decentralized, internet-grade distributed training by reducing inter-stage communication overhead.

Three-Phase Transformer

Hugging Face Daily Papers ↗ · 2026-04-15 Cached

This research paper introduces the Three-Phase Transformer (3PT), which applies Tesla's polyphase geometry to transformer architectures by organizing the residual stream into three phases offset by 120°. The approach achieves a 7.2% perplexity improvement on WikiText-103 with 0.00124% parameter overhead and a 1.93× convergence speedup.

Motif-Video 2B: Technical Report

Hugging Face Daily Papers ↗ · 2026-04-14 Cached

Motif-Video 2B is a 2B-parameter text-to-video generation model that achieves 83.76% on VBench, surpassing Wan2.1 14B with 7× fewer parameters. It was trained on fewer than 10M clips using less than 100,000 H200 GPU hours. The model uses a specialized architecture with shared cross-attention and a three-part backbone to separate prompt alignment, temporal consistency, and detail refinement.
