@seclink: https://x.com/seclink/status/2067968283492712846
Summary
This article, based on a talk by researcher Victoria Lin, systematically reviews the mainstream technical approaches to native multimodal large models (Chameleon, Transfusion, MOT) and their pros and cons. It points out that multimodal AI is still in an early exploration stage, with open problems such as poorly understood scaling laws, the mismatch between image encodings for understanding and for generation, and the connection to the physical world.
Why Multimodal AI Is Just Getting Started
📌 Source: https://www.youtube.com/watch?v=NDdc39KYqDU
⚡ Core thesis upfront: The real world is inherently multimodal — pure language AI is fundamentally limited.
- A clear breakdown of current mainstream native multimodal large model categories, core technical approaches, and their pros/cons
- Why the next-generation MOT architecture can address existing solution pain points, and its core logic
- Key open problems in multimodal AI and future research directions
The GPT-4o and Kimi we use daily are already multimodal AIs that can see, speak, and generate images. Many people think multimodal technology is mature, and only real-world deployment is left. But at the academic frontier, the foundational architecture and training principles of native multimodal large models still have a long list of unsolved problems. The entire field is still in its early exploration phase.
This article summarizes a public technical talk by Victoria Lin, a researcher in native multimodal AI (formerly at Meta AI and Salesforce AI Research, currently at Thinking Machines Lab). All technical terms are explained in plain language, and complex technical paths are broken down into understandable pieces. After reading, you’ll grasp the present and future of multimodal AI.
🧩 What Is a Native Multimodal Large Model?
Thesis first: The core logic of native multimodal large models is to unify all types of information and directly replicate the successful recipe of pure text large models.
Let’s clarify some fundamental terms in plain language:
- Token: The basic unit of information that a large model processes. Breaking raw information into tokens is called tokenization.
- Transformer (basic architecture): The universal neural network architecture underlying all current large language models and multimodal large models.
- Autoregressive generation: A common way large models generate content — generating one content unit at a time, where each new unit depends on all previously generated units.
The familiar pure text large model follows a general paradigm: split all text into tokens, train the model to predict the next token. As data volume and model size scale up, advanced capabilities like knowledge retention, reasoning, and planning emerge naturally. This logic has been repeatedly proven successful.
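The "predict the next token" recipe can be sketched with a deliberately tiny stand-in model. This is an illustration only: a bigram counter conditions on just the previous token, whereas a Transformer conditions on the whole prefix, and the corpus and function names here are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_new_tokens):
    """Greedy autoregressive generation: append the most frequent follower each step."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this token
        out.append(followers.most_common(1)[0][0])
    return out


# Toy corpus (hypothetical); splitting on spaces stands in for real tokenization.
corpus = "the cat sat on the mat the cat sat on a rug".split()
model = train_bigram(corpus)
generated = generate(model, "the", 3)
print(generated)
```

Each generated token is fed back in as context for the next step, which is the autoregressive loop described above.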
Native multimodal large models directly reuse this logic: regardless of whether the input is text, image, video, or audio, everything is uniformly tokenized, converted into tokens that a Transformer can process, and then trained in exactly the same way as a pure text large model.
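One way to picture "everything is uniformly tokenized" is a single ID space shared by text tokens and discrete image codes. The sketch below is an assumption-laden illustration: the vocabulary sizes, the begin/end-of-image markers, and all function names are hypothetical and not taken from any specific model.

```python
TEXT_VOCAB_SIZE = 1000      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 512   # hypothetical number of discrete image codes
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image marker
EOI = BOI + 1                                # hypothetical end-of-image marker


def image_to_ids(codes):
    """Shift image codes past the text vocabulary so the two ID ranges never collide."""
    return [BOI] + [TEXT_VOCAB_SIZE + c for c in codes] + [EOI]


def interleave(segments):
    """Flatten ('text', ids) and ('image', codes) segments into one token sequence."""
    seq = []
    for kind, values in segments:
        if kind == "text":
            seq.extend(values)
        elif kind == "image":
            seq.extend(image_to_ids(values))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return seq


def next_token_pairs(seq):
    """Training pairs (current token, token to predict), identical to the text-only case."""
    return list(zip(seq[:-1], seq[1:]))


segments = [("text", [5, 17]), ("image", [3, 511]), ("text", [42])]
seq = interleave(segments)
print(seq)
print(next_token_pairs(seq))
```

Once the sequence is flat, the training objective does not need to know which positions came from an image.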
Currently, native multimodal models mainly fall into two categories:
- Multimodal input, text-only output: This is the pattern of most mainstream products, e.g., Gemini, Qwen, Kimi.
- Full-modal models: These support multimodal input and multimodal output, directly generating text, images, audio, etc. GPT-4o is a representative of this category, which is the main direction being explored in the industry today.
⚖️ Strengths and Weaknesses of Mainstream Full-Modal Technical Approaches
Thesis first: The two mainstream paths for full-modal models both solve old problems but leave new gaps.
The first path is Meta’s Chameleon series. Its core assumption: all modalities can be converted into discrete tokens. Specifically, an image is cut into fixed-size patches (commonly 16×16 pixels), and VQ-VAE (a technique that converts continuous images into discrete codes) transforms the image into discrete tokens that are interleaved with text as a sequence for joint training.
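The discretization step can be sketched as "cut into patches, then snap each patch to its nearest codebook entry". This is a simplification under stated assumptions: a real VQ-VAE quantizes learned encoder features rather than raw pixels, and the 4×4 image, 2×2 patch size, and three-entry codebook below are toy values chosen for the example.

```python
def patchify(image, patch):
    """Split a 2-D grayscale image (list of rows) into flattened patches, row-major.

    Assumes the image height and width are multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def quantize(vector, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


# Toy codebook and image (hypothetical values).
codebook = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
image = [
    [0, 0, 1,   1],
    [0, 0, 1,   1],
    [0, 1, 0.9, 1],
    [0, 1, 1,   0.8],
]

patches = patchify(image, 2)
tokens = [quantize(p, codebook) for p in patches]
reconstructed = [codebook[t] for t in tokens]
print(tokens)
print(patches[3], "->", reconstructed[3])
```

The last patch comes back as a different vector than it went in, which is the information loss that the next paragraph attributes to discretization.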
Chameleon was the first model to show that training from scratch on interleaved image-text sequences yields multimodal capabilities while preserving strong pure text performance. However, its drawbacks are clear: discretization loses a lot of image information, image understanding performance lags behind current mainstream continuous encoding schemes, and token efficiency is low, so large amounts of training data are needed to generate acceptable images.
The second path, proposed around the same time, is Transfusion, designed to address Chameleon’s flaws. It uses continuous representations of images, seamlessly unifying autoregressive language modeling and diffusion image generation (the current mainstream AI image generation technique) within the same Transformer. Inputs are still interleaved image-text sequences. Text is modeled with standard autoregression, while images use diffusion operations. The quality and token efficiency of generated images far surpass Chameleon’s discrete token scheme.
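A hedged sketch of how two objectives can share one training step: a next-token cross-entropy on text positions plus a noise-prediction error on image positions, added with a weight. The weight `lam`, the mean-squared-error form of the diffusion term, and the toy numbers are assumptions for illustration, not details taken from the talk.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability the model assigned to the correct next token."""
    return -math.log(probs[target])


def mse(pred, true):
    """Mean squared error between predicted and actual noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def combined_loss(text_steps, image_steps, lam):
    """Language-modeling loss on text positions plus lam times a diffusion-style loss."""
    lm = sum(cross_entropy(p, t) for p, t in text_steps) / len(text_steps)
    diffusion = sum(mse(p, t) for p, t in image_steps) / len(image_steps)
    return lm + lam * diffusion


# Toy values: (predicted distribution, correct token) and (predicted noise, true noise).
text_steps = [([0.5, 0.5], 0), ([0.25, 0.75], 1)]
image_steps = [([0.0, 1.0], [0.0, 0.0])]
loss = combined_loss(text_steps, image_steps, lam=2.0)
print(round(loss, 4))
```

Because both terms are computed from the same Transformer's outputs, one backward pass updates the shared weights for both modalities.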
But Transfusion also has unresolved issues: the community generally finds that image generation and image understanding require different encoding schemes. Transfusion’s image representation is friendly to generation tasks but inefficient for image understanding tasks. Today’s best full-modal models typically use a dual image encoding scheme to bypass this problem, but the fundamental issue remains unsolved and is still an open research topic.
🧠 MOT Architecture: Dedicated Parameters for Each Modality
Thesis first: The core idea of MOT (Mixture of Transformers) is very simple — different modalities are fundamentally different; sharing parameter sets is less effective than having dedicated ones.
Previous multimodal models used the same Transformer parameters for all modalities. But the MOT work found that information density and data characteristics vary greatly across modalities, so modalities forced to share parameters can interfere with one another.
MOT’s approach: assign independent Transformer parameters to each modality, including attention projections and feed-forward layers. When processing mixed-modal input, each token activates only the parameters of its own modality, while cross-modal information exchange happens through self-attention computed over the full mixed sequence.
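The routing idea can be sketched in a few lines: per-modality weights for the projection and feed-forward steps, with one attention step over all tokens in between. This is a toy under heavy assumptions: scalars stand in for vectors and weight matrices, there is no causal mask or normalization, and the parameter values and names are hypothetical.

```python
import math

# Hypothetical per-modality parameters: one scalar "projection" and one scalar "feed-forward" weight.
PARAMS = {
    "text": {"proj": 1.0, "ffn": 2.0},
    "image": {"proj": 0.5, "ffn": -1.0},
}


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mot_layer(tokens):
    """tokens: list of (modality, value) pairs; values are scalars for brevity."""
    # 1. Modality-specific projection: each token uses its own modality's weight.
    projected = [PARAMS[m]["proj"] * x for m, x in tokens]
    # 2. Global self-attention: every token attends over all tokens, whatever their modality.
    mixed = []
    for q in projected:
        weights = softmax([q * k for k in projected])
        mixed.append(sum(w * v for w, v in zip(weights, projected)))
    # 3. Modality-specific feed-forward, again routed by the token's modality.
    return [PARAMS[m]["ffn"] * h for (m, _), h in zip(tokens, mixed)]


output = mot_layer([("text", 1.0), ("image", 2.0)])
print(output)
```

Only step 2 mixes information across modalities; steps 1 and 3 never apply text weights to image tokens or the reverse.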
This architecture doesn’t require abandoning previous paths — it can be directly combined with existing schemes like Chameleon and Transfusion, serving as a general optimization idea.
Researchers conducted comparative experiments at scales from 163M to 7B parameters. The results are clear: compared to baseline models with equivalent total parameters, MOT significantly improves generation quality for non-text modalities like images without sacrificing pure text performance.
It also offers a practical advantage: it supports asynchronous training. If you already have a well-trained text-only model, you don’t need full fine-tuning. Simply freeze the original text model parameters and train only the new dedicated parameters for the image and speech modalities. This allows low-cost expansion of multimodal generation capabilities for older models.
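Freezing can be sketched as an update rule that skips every parameter group not marked trainable. The parameter names, gradient values, and learning rate below are hypothetical, and plain gradient descent stands in for whatever optimizer a real training run would use.

```python
def sgd_step(params, grads, trainable, lr):
    """Apply one gradient-descent step to trainable groups only; frozen groups are copied unchanged."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }


# Hypothetical parameter groups: pretrained text weights and newly added image weights.
params = {"text.ffn": 2.0, "image.ffn": 0.0}
grads = {"text.ffn": 0.5, "image.ffn": -1.0}

updated = sgd_step(params, grads, trainable={"image.ffn"}, lr=0.1)
print(updated)
```

The text weights receive a gradient but are never modified, so the original text-only behavior is preserved while the new modality's weights learn.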
This architectural idea has been adopted by many follow-up studies. The multimodal model Bagel, released last year, used a similar approach: dedicated parameters for image generation and shared base parameters for image understanding. This enabled a “generate thinking text, then generate image” pipeline that significantly improved the detail quality of generated images. In embodied AI and robotics, many teams have used this idea to allocate dedicated parameters for action prediction, leveraging large language model knowledge to improve action prediction performance.
❓ Key Unsolved Problems in Multimodal AI
Thesis first: Compared to the already mature pure text large models, multimodal AI has far more open problems than solved ones, with plenty of untapped opportunities.
The biggest gap is the scaling law. For pure text models, the relationship between performance and parameter count or training data volume is well understood, but the precise scaling law for multimodal models hasn’t been thoroughly studied. This makes it a valuable and largely uncontested research area, and the growth dividend of multimodality is far from exhausted.
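For readers unfamiliar with the term, a scaling law is usually a fitted curve relating loss to model or data size. The sketch below fits the simplest form, loss = a · size^(−b), by linear regression in log-log space. The data is synthetic and noise-free, generated from known constants purely to show the fitting step; it is not a measurement of any real model, and practical fits often add further terms such as an irreducible loss.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from a = 10, b = 0.1 (illustrative constants only).
sizes = [1e8, 1e9, 1e10]
losses = [10.0 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)
print(round(a, 3), round(b, 3))
```

With a trustworthy fit of this kind, one can estimate the payoff of a larger model before training it, which is the planning tool that multimodal training still lacks.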
The second core contradiction is the mismatch between image understanding and generation requirements. No single image encoding scheme can simultaneously satisfy both tasks — understanding needs one, generation needs another. No one has yet given a perfect answer to unified encoding.
The third fundamental difference is that the training dynamics of language and sensory modalities are completely different. Language is a highly compressed abstraction of human cognition, so training on next-token prediction essentially learns human reasoning and intention. Images and videos, by contrast, are passive sensory observations of the world, not abstracted cognitive symbols. Combined with high inter-frame information redundancy and more complex loss functions, this makes the pre-training payoff pattern entirely different from that of language. A Berkeley scholar previously pointed out a puzzling phenomenon: large language models develop strong emergent capabilities just from next-token prediction, but video models trained on next-frame prediction do not achieve equivalent capability gains. To this day, there is no clear answer.
The fourth major practical issue: current multimodal AI performs well only in digital information processing. Bridging to the physical world — tasks like spatial reasoning, real-time perception, and robot control — still has a vast number of unsolved problems.
Finally, consider the common question “Can pure vision do reasoning without language?” The current academic view is that, at this stage, using language as a backbone helps visual reasoning achieve better results because language has a higher level of abstraction, and this path has been validated. However, whether pure vision or video models can achieve reasoning once computational power is abundant enough remains an open question. No one can definitively say it’s impossible.
💡 Key Takeaways
- The real world is inherently multimodal; pure language AI is fundamentally limited.
- Breaking all modalities into unified tokens lets you directly replicate the success of large language models in multimodal settings.
- Dedicated parameters for each modality work better than sharing a single parameter set across all modalities.
- Multimodal AI’s growth dividend is far from exhausted; scaling laws are still an underexplored blue ocean.
- Understanding helps generation, but generation doesn’t necessarily help understanding.
- To unify modalities, align other modalities to text; don’t align text to other modalities.
- Multimodal models are far from fully understood; open problems far outnumber solved ones.
- Today’s multimodal AI excels at processing digital information but is still far from achieving multimodal intelligence in the physical world.