Tag
A study comparing Olmo Hybrid and Olmo 3 transformers at the token level shows hybrid models better predict meaningful tokens like nouns/verbs, while transformers excel at copying tokens from input.
This paper proves that learning by predicting latent representations (as in world models like JEPA and data2vec) requires exponentially less data than predicting tokens (as in LLMs) for hierarchical data with hidden structure.
This paper proposes STOP (SuperTOken for Pruning), a systematic framework for pruning inefficient reasoning paths early in parallel reasoning with Large Reasoning Models. The method achieves superior efficiency and effectiveness across models from 1.5B to 20B parameters, boosting GPT-OSS-20B accuracy on AIME25 from 84% to 90% under fixed compute budgets.