@ZhihuFrontier: Half a year ago, a Zhihu contributor predicted that the next Transformer would absorb loops, recurrent state, sparse ro…

X AI KOLs Timeline 06/26/26, 09:42 AM News

transformer architecture hybrid-models state-space-model sparse-routing latent-reasoning future-predictions

Summary

A Zhihu contributor's half-year-old prediction that the next Transformer would absorb loops, recurrent state, sparse routing, and latent reasoning is gaining relevance as Loop Engineering advances. The article explores how future Transformer architectures may evolve into hybrid models blending linear-complexity layers for background context with attention for precise reasoning, plus finer-grained sparsity and native System 2 reasoning.

Half a year ago, a Zhihu contributor predicted that the next Transformer would absorb loops, recurrent state, sparse routing, and latent reasoning. Now, as Loop Engineering gains momentum, that prediction feels strikingly relevant. Let's dive in.

Cached at: 06/26/26, 04:14 PM

What Will the Next Generation of Transformer Architecture Look Like? Insights from Zhihu contributor CodeCrafter

Transformer is far from over. Even with many challengers, it still dominates production. But the Transformer we use five years from now may look nothing like the original “Attention Is All You Need” design. It may evolve from a pure attention stack into a huge, sparse, hybrid architecture mixed with state space model (SSM) features. Models like GPT-5 and Claude 4.5 already show early signs of this direction. In 2023 and 2024, academia kept announcing “Transformer killers”: Mamba, RWKV, RetNet, and many others. But in real production, Transformer is still the main architecture. Why? Ecosystem. But that does not mean Transformer is unchanged. In long-context scenarios, pure attention has become too expensive to tolerate.

Pain Point 1: KV Cache Explosion

If you run a 1M-token context with full attention, the KV Cache grows linearly with context length and can by itself drain an H100 cluster. So the first big trend is clear: linearized attention and hybrid architectures will become normal. Architectures like Jamba already mix Mamba/SSM layers with Transformer layers. The logic is simple: an SSM carries a fixed-size state, so its inference memory is O(1) in sequence length and it does not need a massive KV Cache. But that fixed-size state also forgets. Ask it to recall a specific name from 20,000 words ago, and it may hallucinate. Attention is expensive, but it works like a precise lookup table. So the future may look like this: roughly 80% of the layers, mostly the lower ones, use linear-complexity models such as improved Mamba or RWKV to handle massive background context, while the remaining 20%, in the upper or key positions, still use full attention for precise recall and strong reasoning.

Think of it like the brain: most of the time, the subconscious runs in the background. When a hard problem appears, focused reasoning takes over.
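
To put rough numbers on the KV Cache argument, here is a minimal back-of-the-envelope sketch. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical choice for illustration, not the configuration of any model named in the article, and the fixed-size state of the non-attention layers is ignored.

```python
def kv_cache_bytes(seq_len, attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every attention layer.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
TOTAL_LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 1_000_000

# Every layer uses full attention.
full = kv_cache_bytes(SEQ_LEN, TOTAL_LAYERS, KV_HEADS, HEAD_DIM)

# Hybrid stack: only 20% of the layers keep full attention (and a KV cache).
hybrid_attn_layers = TOTAL_LAYERS * 20 // 100
hybrid = kv_cache_bytes(SEQ_LEN, hybrid_attn_layers, KV_HEADS, HEAD_DIM)

print(f"full attention:         {full / 1e9:.1f} GB")
print(f"hybrid (20% attention): {hybrid / 1e9:.1f} GB")
```

Under these assumptions the cache for a single 1M-token sequence is several times larger than the 80 GB of one H100, and the 80/20 split cuts it by a factor of five because the cache scales with the number of attention layers.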

Pain Point 2: Dense Compute Is Unsustainable

MoE became mainstream in 2024. By 2025, any lab without MoE already looks behind. But today’s MoE is still crude. Most current MoE systems route at the token level: one token comes in, and two experts are selected. The next generation will go finer. Sparsity may move down to the neuron level. Future networks may no longer have clear boundaries between FFN layers and attention layers. The whole model could become a massive dynamic routing graph. The goal is to decouple compute from both parameter count and raw token count. Simple tokens may pass through with almost no computation. Hard reasoning tasks may trigger multiple internal loops before output. In short: the model will learn to be lazy.
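
The token-level routing described above ("one token comes in, and two experts are selected") can be sketched in a few lines. This is a toy under stated assumptions: the experts are plain scaling functions, the router scores are hard-coded, and renormalizing the two selected weights is one common variant rather than the rule of any specific system.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Four toy "experts": each one just scales the token by a different factor.
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]

token = [1.0, -1.0]
router_logits = [0.0, 2.0, 2.0, -1.0]  # hypothetical router scores for this token

selected = route_top_k(router_logits)
output = moe_forward(token, router_logits, experts)
print(selected, output)
```

Only two of the four experts run for this token, which is the sense in which compute is decoupled from parameter count. Neuron-level sparsity would push the same selection step below the granularity of whole experts.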

Prediction 3: System 2 Becomes Native

OpenAI’s o1 series showed the industry something important: RL training combined with extra inference-time compute can produce slow thinking. Today’s Transformer is still mostly a System 1 model. It predicts the next token through statistical reflex. Current System 2 behavior mostly comes from data construction and inference workflows, such as CoT, not from the architecture itself. In five years, Transformer-like models may support System 2 reasoning at the architecture level. Current CoT is wasteful. The model prints its thinking into context, generating thousands of intermediate tokens that consume memory and decoding time. A future architecture may reason inside a latent state space. It could run multiple internal steps, test ideas, backtrack, and correct itself in high-dimensional space, then map the final result back into text. That means Transformer may grow something like working memory. It will no longer be a pure feed-forward network. It may include recurrent structures, not for sequence processing, but for thinking time. This is why Yann LeCun’s JEPA ideas are worth watching. His focus on world models and latent prediction may be absorbed into next-generation Transformer systems.
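
A minimal sketch of "recurrence for thinking time": the same block is applied repeatedly to a latent state, and the loop halts once the state stops changing. The block here is a toy contraction and the halting rule is a simple tolerance check; both are assumptions for illustration, and the sketch does not model backtracking.

```python
def latent_loop(block, state, max_steps=32, tol=1e-3):
    """Apply the same block repeatedly until the latent state settles.

    Returns the refined state and the number of internal steps used.
    No intermediate tokens are emitted; all work stays in the state.
    """
    for step in range(1, max_steps + 1):
        new_state = block(state)
        change = max(abs(n - o) for n, o in zip(new_state, state))
        state = new_state
        if change < tol:
            break  # halting rule: the state has settled
    return state, step


def toy_block(state):
    """Toy shared block (an assumption): a contraction with fixed point 2.0."""
    return [0.5 * v + 1.0 for v in state]


# A "hard" input starts far from the fixed point and needs many internal steps.
hard_state, hard_steps = latent_loop(toy_block, [0.0])

# An "easy" input is already settled and exits after a single step.
easy_state, easy_steps = latent_loop(toy_block, [2.0])

print(hard_steps, easy_steps)
```

The step count adapts to the input, which also matches the earlier point that simple tokens should pass through cheaply while hard ones trigger more internal loops.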

Prediction 4: Native Multimodality and the End of Tokenizer-Centric Design

By 2025, if a multimodal model still trains a separate vision encoder and text decoder, then connects them with a projector, it already feels old-school. Most multimodal models today are patched together. Images are split into patches and turned into tokens. Audio is cut into frames and turned into tokens. This creates information loss, because continuous signals are forced into discrete tokens. The next generation may no longer rely on discrete tokenizers.
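
The information-loss claim is easiest to see for a tokenizer that maps each continuous value to one entry of a finite codebook. The sketch below uses a uniform scalar quantizer as a stand-in; real image or audio tokenizers are learned and far more elaborate, so treat this as an illustration of the principle only.

```python
import math


def quantize(x, levels):
    """Map x in [0, 1] to one of `levels` discrete codes."""
    return min(int(x * levels), levels - 1)


def dequantize(code, levels):
    """Reconstruct the center of the bin that `code` stands for."""
    return (code + 0.5) / levels


def max_roundtrip_error(signal, levels):
    """Worst-case error after encoding to codes and decoding back."""
    return max(abs(x - dequantize(quantize(x, levels), levels)) for x in signal)


# A smooth toy signal scaled into [0, 1].
signal = [0.5 + 0.5 * math.sin(2 * math.pi * t / 50) for t in range(50)]

coarse_error = max_roundtrip_error(signal, levels=4)
fine_error = max_roundtrip_error(signal, levels=256)
print(coarse_error, fine_error)
```

The round-trip error is bounded by half a bin width, so a larger vocabulary shrinks the loss but never removes it. A signal-native design would feed the continuous values in directly instead.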

Future models may process raw signals directly. This requires architectures that can handle continuous-value inputs. It may involve diffusion-style understanding, new neural architectures, or other signal-native designs. At that point, text will no longer be the center. The core representation may become a physical world model, and text will be only one output interface. This is a huge engineering shift. Frameworks like Megatron-LM and DeepSpeed were optimized for discrete tokens. Moving beyond that means rebuilding the foundation. But it is necessary. Text alone cannot teach a model real physics.

A real lesson: a team once tried to fine-tune a coding model with text-only data to control a robotic arm. The code looked excellent, but the arm crashed in the real world. The model did not understand gravity or friction. After adding direct sensor embeddings, performance improved. That is why native multimodality is unavoidable.

Prediction 5: Hardware Will Force Architecture Change

Talking about architecture without hardware is incomplete. Transformer won partly because it fits GPUs perfectly. GPUs love matrix multiplication (MatMul), and Transformer is full of MatMul. But MatMul is getting too expensive, especially in energy cost. Future architectures will try to reduce their absolute dependence on MatMul. Today’s BitNet and 1-bit LLMs still look early, maybe even toy-like. But they reveal the right direction: quantization should not be just a deployment trick. It should be part of architecture design.
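
To show why very low-bit weights reduce the dependence on MatMul, here is a sketch of ternary weights with a single absmean scale, in the spirit of the scheme described for BitNet b1.58. It is not the reference implementation: the weights are hypothetical, and rounding is applied after the fact rather than during training.

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} with a single absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


def ternary_dot(inputs, codes, scale):
    """Dot product with ternary weights: adds and subtracts, then one multiply."""
    acc = 0.0
    for x, c in zip(inputs, codes):
        if c == 1:
            acc += x
        elif c == -1:
            acc -= x
        # c == 0 contributes nothing, so that input is skipped entirely.
    return acc * scale


weights = [0.8, -0.1, -0.7, 0.9]  # hypothetical full-precision weights
inputs = [1.0, 2.0, 3.0, 4.0]

codes, scale = ternary_quantize(weights)
approx = ternary_dot(inputs, codes, scale)
exact = sum(x * w for x, w in zip(inputs, weights))
print(codes, approx, exact)
```

The inner loop contains no multiplications, which is the hardware appeal. The visible gap between `approx` and `exact` is what naive post-hoc rounding costs, and it supports the article's point that quantization has to be designed into training rather than bolted on at deployment.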

Future Transformer-like models may train directly in INT4 or even INT1. That means activations and normalization layers such as RMSNorm and LayerNorm may all need redesign. Current LayerNorm is unstable under very low precision and can easily cause gradient explosions.

With processing-in-memory (PIM), architecture may also become more localized. Today’s Transformer moves the full hidden state across layers, creating a brutal bandwidth bottleneck. Future architectures may look more like cortex: mostly local compute, with only a few long-range connections. This connects back to fine-grained sparsity and MoE.

So the core point is simple: Transformer is not the endpoint. It is a transition state. If you work on algorithms, do not put your entire skill tree into tuning Transformer hyperparameters. RoPE variants and attention mask tricks may lose relevance in a few years.

Focus on deeper fundamentals:

· Information theory and compression: models are compression systems, and perplexity still matters.
· Optimization theory: SGD and AdamW have dominated for too long. Sparse architectures may need better optimizers.
· Data engineering: architecture gets open-sourced fast, but data recipes remain the real moat.
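
The link between perplexity and compression in the first bullet is a short calculation: perplexity is the exponential of the average negative log-likelihood, and the same quantity in base 2 is the average number of bits an ideal entropy coder would spend per token. A minimal sketch, with hypothetical probabilities:

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-likelihood (natural log)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def bits_per_token(token_probs):
    """Average code length in bits under an ideal entropy coder."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities a model assigned to the tokens that actually occurred.
probs = [0.25, 0.25, 0.25, 0.25]

ppl = perplexity(probs)
bits = bits_per_token(probs)
print(ppl, bits)
```

Since `log2(perplexity)` equals the bits per token, lowering perplexity and compressing the data better are the same objective stated in different units.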

Maybe we will stop calling it “Transformer.” It may become a general state machine or a neural reasoning engine. But its soul will remain: end-to-end learning through gradient descent. Even with all the AGI hype, today’s Transformer is still probability fitting. Its “creativity” is mostly interpolation across a massive sample space. To truly break through, the next architecture may need discrete symbolic reasoning modules. This could bring back neuro-symbolic AI.

If you have GPUs today, do not only run SFT. Try reproducing non-Transformer architectures. Try inserting SSM into Transformer. Try replacing embeddings with continuous signal inputs.
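
As a starting point for the SSM experiments suggested above, here is a toy scalar linear recurrence. It is a deliberately simplified stand-in: Mamba-style layers use vector-valued states and input-dependent parameters, which this sketch does not attempt to reproduce. The parameter names `a`, `b`, and `c` are illustrative.

```python
def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0  # the entire memory of the past is this one number
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


# An impulse at t=0 followed by silence shows how fast the state forgets.
impulse = [1.0] + [0.0] * 9
response = ssm_scan(impulse, a=0.5)
print(response)
```

The state never grows with sequence length, which is the constant-memory property, and the impulse response decays geometrically, which is the forgetting that the hybrid design compensates for with a few full-attention layers.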

Many ideas that look strange or weak today may become textbook answers five years later. This field moves fast. By the time this piece is finished, DeepSeek or OpenAI may have published a new paper that proves half of it wrong. But that is exactly why this is such an exciting era for builders. Stay curious. Do not worship authority. Run the code. Push VRAM to its limits. Watch the loss curve. That is where the future becomes real.

Original article (CN): https://zhihu.com/question/1904728228213548260/answer/1975169767355736614…

#Transformer #LoopEngineering #LLM #AIArchitecture #Mamba #MoE #SSM #AIAgents #MultimodalAI #AIResearch #MachineLearning #DeepLearning

Similar Articles

@DorothyDDU: LoopCoder-v2 is out Loop Transformers reuse the same block for recurrent hidden-state refinement — letting models “thin…

@retr0sushi_: looped transformer -> hyper-looped transformer -> looped world model ??

@askalphaxiv: Another cool research on Looped Transformers They ask the question: "Can we loop a frozen, off-the-shelf checkpoint dir…

@FinanceYF5: Next token prediction is short-sighted. What if the Transformer learns to predict its own next hidden state? Jayden Teoh proposes Next-Latent Prediction (NextLat): a self-supervised learning method that teaches the Transformer to form...

@gordic_aleksa: new in-depth blog post time: Inside the Transformer: The Life of a Token a deep dive into a modern dense transformer, i…
