Which tokens does a hybrid model predict better?
Summary
A study comparing Olmo Hybrid and Olmo 3 transformers at the token level shows hybrid models better predict meaningful tokens like nouns/verbs, while transformers excel at copying tokens from input.
Cached at: 06/25/26, 05:10 PM
Source: https://huggingface.co/blog/allenai/hybrid-token-prediction
📄 Tech report: https://arxiv.org/abs/2606.20936
Which kinds of tokens does a model predict well, and which does it not? That question is especially intriguing in the case of hybrids, a language model architecture that’s begun to challenge the standard transformer and that we’ve been investigating with Olmo Hybrid.
Hybrids can match or beat transformers on standard benchmarks, but the headline numbers don’t reveal much about what specific advantages hybrid models have over transformers.
To shed light on these token-level behaviors, we recently ran experiments comparing our strongest 7B transformer, Olmo 3, head-to-head against our hybrid model, Olmo Hybrid. Specifically, we compare the two models’ predictions in a fine-grained way across different types of tokens, the units of text that an LLM takes as input.
Because Olmo 3 and Olmo Hybrid were built to be as alike as possible outside their architectures — closely matched in data, tokenizer, and training recipe — any difference in their predictions mostly reflects the architecture itself. Viewing these differences at the token level allows us to glean insights about the specific strengths of hybrid models over transformers.
Our results show that the hybrid’s advantage is real across many tokens, but not all. Olmo Hybrid is strongest on tokens that carry meaning, such as nouns, verbs, and adjectives, and on tokens that can only be predicted by following what’s going on, like which person a pronoun refers to. But the hybrid’s advantage almost disappears on tokens that simply repeat something already in the input, a word or phrase reproduced verbatim from earlier, where the answer is sitting right there to be looked up. That’s where the transformer’s strength lies.
Attention versus recurrence, and measuring the difference
A language model is built from a stack of repeated layers, each one refining its representation of every token using the tokens around it.
A transformer uses attention in every layer. The model can draw directly on every earlier token at once, weighing how relevant each is to the current prediction. That makes attention good at recalling a specific earlier token exactly, even when that token appeared far back in the input. The catch is that every token is compared against all the earlier ones, so attention’s cost climbs steeply as the input grows. And while attention is strong at recalling and aggregating information, it struggles to represent information that evolves sequentially over time.
A hybrid model keeps a few attention layers but swaps the rest for recurrent layers. Unlike an attention layer, a recurrent layer reads tokens left to right and carries a fixed-size memory, folding each new token into memory as it goes so the cost of processing each token stays flat however long the input gets. That memory is compressed and lossy, so a recurrent layer can’t reach back for an exact earlier token the way attention can. But it is well suited to keeping a running account of anything that changes as the model reads tokens, providing a complementary strength to attention.
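The contrast between the two layer types can be made concrete with a toy sketch. This is not the layer used in Olmo 3 or Olmo Hybrid: `attention_read` is a bare single-query softmax read, and `recurrent_update` is a simple exponential moving average standing in for a real recurrent layer. Both function names and the `decay` parameter are illustrative assumptions.

```python
import math

def attention_read(query, keys, values):
    """Softmax-weighted average over ALL earlier values.

    The work done here grows with len(keys), i.e. with the input length.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def recurrent_update(state, x, decay=0.9):
    """Fold one token into a fixed-size state (toy moving average).

    The work done here is constant per token, but older tokens are
    blended together and cannot be recovered exactly.
    """
    return [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]

# Two earlier tokens with distinct keys and values (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[3.0, 0.0], [0.0, 7.0]]

# A sharp query aligned with the first key retrieves its value almost exactly.
recalled = attention_read([50.0, 0.0], keys, values)

# The recurrent state stays the same size however many tokens are folded in.
state = [0.0, 0.0]
for v in values:
    state = recurrent_update(state, v)
```

In this toy, `recalled` is very close to the first stored value, while `state` holds a lossy blend of both tokens in a vector whose size never grows.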
To isolate the areas of strength and weakness for attention and recurrent layers, we fed Olmo 3 and Olmo Hybrid passages of text: articles, Wikipedia entries, books, and scientific papers, as well as structured text like Python, HTML, and LaTeX. We scored each model on how well it predicted each token from the tokens before it in a given sample.
Both models saw the same earlier tokens and assigned a probability to every possible next token. We recorded the probability each gave to the token that actually followed. We then summarized the difference token by token by computing the loss gap: the transformer’s loss on a token minus the hybrid’s loss on the same token. A positive gap means the hybrid predicted the real next token better. A negative gap means the transformer did.
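As a minimal sketch of this measurement, the snippet below computes a per-token loss gap from the probabilities each model gave to the true next token. It assumes the loss is the negative log probability in nats and that the gap is the transformer's loss minus the hybrid's, which matches the sign convention described above; the function names and the example probabilities are made up for illustration.

```python
import math

def token_losses(true_token_probs):
    """Negative log-likelihood (assumed: natural log) of each true next token."""
    return [-math.log(p) for p in true_token_probs]

def loss_gaps(transformer_probs, hybrid_probs):
    """Per-token gap: transformer loss minus hybrid loss.

    Positive values mean the hybrid gave the true token a higher probability.
    """
    if len(transformer_probs) != len(hybrid_probs):
        raise ValueError("both models must be scored on the same tokens")
    t_losses = token_losses(transformer_probs)
    h_losses = token_losses(hybrid_probs)
    return [t - h for t, h in zip(t_losses, h_losses)]

# Hypothetical probabilities for two tokens.
gaps = loss_gaps([0.5, 0.25], [0.5, 0.5])
```

Here the first gap is zero because both models agree, and the second is positive because the hybrid assigned twice the probability to the true token.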
To find where the loss gaps might concentrate, we ran several analyses. First, we sorted each token into a category and averaged the loss gap within these categories. Because a raw average can be skewed by other factors, such as a category’s rarity or how often tokens repeat in a sample of text, we re-checked each pattern with a regression that estimates the category’s own effect while holding other factors constant.
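The difference between a raw category average and a regression estimate can be sketched as follows. This is an illustration under stated assumptions, not the analysis code from the report: the data is synthetic, the only control is a single made-up "rare token" indicator, and `ols` is a small ordinary-least-squares solver written for this example.

```python
from collections import defaultdict

def mean_gap_by_category(categories, gaps):
    """Raw average loss gap within each category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, gap in zip(categories, gaps):
        sums[cat] += gap
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

def ols(rows, targets):
    """Least-squares fit with an intercept; returns [intercept, coef_1, ...]."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal equations [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic tokens: (is_content, is_rare). Content words here are more often rare.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (1, 1)]
# Synthetic gaps built as 0.01 + 0.03 * is_content + 0.02 * is_rare.
gaps = [0.01 + 0.03 * c + 0.02 * r for c, r in rows]

raw = mean_gap_by_category(["content" if c else "function" for c, _ in rows], gaps)
raw_difference = raw["content"] - raw["function"]
intercept, content_effect, rare_effect = ols(rows, gaps)
```

In this synthetic setup the raw difference between categories overstates the content-word effect because content words are also more often rare, while the regression coefficient recovers the effect that was built into the data.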
What real text shows
We find that Olmo Hybrid has lower loss than Olmo 3 on most kinds of tokens, though not by the same amount on each.
In prose, the clearest divide is between content words (meaning-bearing nouns, verbs, and adjectives) and function words like “the,” “of,” and “is.” The hybrid predicts content words better than the transformer, with a loss gap around 0.04, whereas the gap is closer to 0.02 on function words.
Among content-word categories, the hybrid’s advantage is especially pronounced on adverbs and adjectives, though some function-word categories, such as existentials like “there,” also show a large advantage for the hybrid. In short, the hybrid’s edge is biggest on the words that say what a sentence is about and smallest on the grammatical words any model can nearly guess from syntax.
In contrast, we find some specific contexts where the advantage of hybrid models over transformers disappears. The first is closing brackets (but not opening ones), a pattern that holds across brackets in natural language, code, and markup. Why? Attention is known to suffice for representing bracket matching, which suggests that attention alone is enough to predict a closing bracket.
The second place where the hybrid’s advantage all but disappears is when the next token simply repeats something already in the passage. We spot these cases by looking for repeated n-grams: runs of text where the token that completes a sequence has appeared, verbatim, earlier in the same passage. The longer the repeated run, the smaller the hybrid’s lead, until it approaches zero.
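One way to flag such repeats is sketched below. It is an assumption about how the detection might be done, not the report's implementation: for each position it records the length of the longest n-gram ending at that token that already appeared earlier in the same passage, capped at a hypothetical `max_n`. Earlier occurrences are allowed to overlap the current one.

```python
def repeated_ngram_lengths(tokens, max_n=4):
    """Longest repeated n-gram (up to max_n) that each token completes.

    Returns one integer per token; 0 means the token itself has not
    appeared earlier in the passage.
    """
    # seen[n] holds every n-gram that ended at an earlier position.
    seen = [set() for _ in range(max_n + 1)]
    lengths = []
    for i in range(len(tokens)):
        longest = 0
        grams = []
        for n in range(1, min(max_n, i + 1) + 1):
            gram = tuple(tokens[i - n + 1 : i + 1])
            grams.append((n, gram))
            if gram in seen[n]:
                longest = n
        lengths.append(longest)
        # Register this position's n-grams only after checking it.
        for n, gram in grams:
            seen[n].add(gram)
    return lengths

lengths = repeated_ngram_lengths("the cat sat on the cat".split())
```

For the example passage, the second “the” completes a repeated 1-gram and the second “cat” completes the repeated 2-gram “the cat”. Averaging the loss gap separately for each length would then show how the gap changes as the repeated run grows.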
Finally, inspired by these findings, we explore using filtered losses on specific types of tokens as an evaluation to better compare different architectures in pretraining experiments. We use three 1B-parameter models from our earlier Olmo Hybrid work: a transformer, a hybrid, and a pure recurrent model with no attention at all.
On meaning-bearing tokens that aren’t repeats, the hybrid and pure recurrent model overtake the transformer, with the hybrid performing the best. On repeated tokens, the pure recurrent model — with no attention to reach back for the copy — falls behind both the hybrid and the transformer.
Thus, filtered token losses reveal fine-grained differences between architectures early in training, including differences in copying ability and in performance on content words, that would not otherwise be visible.
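A filtered loss of this kind can be sketched as a masked average over per-token losses. The snippet is illustrative only: the part-of-speech tag names in `CONTENT_TAGS` are an assumed tag set, the helper names are hypothetical, and the losses are made-up numbers rather than results.

```python
# Assumed tag names for meaning-bearing (content) words.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def filtered_mean_loss(losses, keep):
    """Average loss over only the tokens where keep is True."""
    kept = [loss for loss, k in zip(losses, keep) if k]
    if not kept:
        raise ValueError("filter selected no tokens")
    return sum(kept) / len(kept)

def content_non_repeat_mask(pos_tags, is_repeat):
    """Select content words that do not repeat earlier text."""
    return [tag in CONTENT_TAGS and not rep
            for tag, rep in zip(pos_tags, is_repeat)]

# Hypothetical per-token data for one model.
losses = [2.0, 1.0, 4.0, 0.5]
pos_tags = ["NOUN", "DET", "VERB", "NOUN"]
is_repeat = [False, False, False, True]

content_loss = filtered_mean_loss(losses, content_non_repeat_mask(pos_tags, is_repeat))
repeat_loss = filtered_mean_loss(losses, is_repeat)
overall_loss = sum(losses) / len(losses)
```

Computing `content_loss` and `repeat_loss` for each architecture at each checkpoint gives separate curves for the two abilities, where the single `overall_loss` would blend them together.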
Where this leaves us
Figure: Filtered token losses surface architecture differences during 1B pretraining. Token-loss curves at WSD-annealed checkpoints for a transformer, a hybrid, and a pure recurrent neural network (RNN).
Two lessons follow from this work.
First, a single overall loss — the model’s average error across all tokens — is too blunt to compare transformer and hybrid architectures. Scoring the loss on just the tokens that test a specific model ability surfaces key differences.
Second, for hybrid models specifically, we found evidence of particular advantages on open-class tokens (content words), which may be related to the state-tracking capabilities of RNN layers.
As a next step, we’re taking these findings into our ongoing hybrid modeling work. We believe the best hybrid architectures will come from understanding, token by token, what each component of a model does well. We hope studies like this help that understanding grow across the whole AI community.
We encourage you to read our full report, explore Olmo 3, try Olmo Hybrid, and dig into their associated open artifacts.