OlmoEarth v1.1: A more efficient family of models
Summary
OlmoEarth v1.1 is a new family of satellite imagery analysis models from Allen AI that reduces compute costs by up to 3x while maintaining performance, achieved by decreasing token sequence lengths in transformer-based models.
Source: https://huggingface.co/blog/allenai/olmoearth-v1-1
- Increasing efficiency by decreasing sequence lengths
- Designing the token
- For developers
- For researchers
- Get started
🧠 Models: https://huggingface.co/collections/allenai/olmoearth | 📄 Tech Report: https://allenai.org/papers/olmoearth_v1_1 | 💻 Code: https://github.com/allenai/olmoearth_pretrain
We released OlmoEarth (v1) in November 2025. Since then, partners have applied it across a wide range of tasks, from tracking mangrove change to classifying drivers of forest loss to producing country-scale crop-type maps in days, scaling deployments to national, continental, and global areas. Every release moves us closer to our mission: bringing state-of-the-art AI to organizations and communities working to protect people and our planet.
When OlmoEarth processes satellite imagery to make predictions across tens to hundreds of thousands of square kilometers, efficiency shapes what’s possible. Over the full lifecycle of running OlmoEarth – data export, preprocessing, inference, and post-processing – compute is by far the highest cost. A more efficient model means we can support more partners on the OlmoEarth Platform, and that anyone running OlmoEarth on their own can leverage this technology faster and at lower expense.
That’s why we built **OlmoEarth v1.1: a new family of models that cuts compute costs by up to 3x** while maintaining OlmoEarth v1’s performance on a mix of research benchmarks and tasks we’ve constructed with partners.
Increasing efficiency by decreasing sequence lengths
The OlmoEarth models are transformer-based models, one of the dominant architectures in machine learning today. To process remote sensing data, we first convert it into a sequence of tokens the model can ingest.
Two important levers control efficiency in transformer-based models: model size (this is why we release a family of models, so users can pick the size that fits their compute budget) and token sequence length. Compute costs scale quadratically with the token sequence length, so even small reductions can meaningfully cut the cost of running the model.
Figure (not reproduced here): average rank plotted against MACs for each model family and size. MACs, or multiply-accumulate operations, estimate the computation needed for one model forward pass; lower MACs generally mean cheaper, faster inference. The y-axis is inverted because lower average rank is better. Labels show model family and size.
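To make the MAC estimate concrete, here is a minimal sketch of a per-layer count for a generic transformer encoder layer. It is not taken from the OlmoEarth code: the function name `layer_macs`, the MLP expansion ratio of 4, and the decision to ignore softmax, normalization, and bias terms are all assumptions made for illustration.

```python
def layer_macs(seq_len: int, dim: int, mlp_ratio: int = 4) -> dict:
    """Approximate MACs for one generic transformer encoder layer.

    Illustrative assumption, not the OlmoEarth implementation.
    Softmax, normalization, and bias terms are ignored.
    """
    # Q, K, V, and output projections: 4 matrix multiplies of size (n x d) @ (d x d).
    projections = 4 * seq_len * dim * dim
    # Two MLP matrix multiplies: (n x d) @ (d x r*d) and (n x r*d) @ (r*d x d).
    mlp = 2 * mlp_ratio * seq_len * dim * dim
    # Attention scores (Q @ K^T) and the weighted sum over V: both are n^2 * d.
    attention = 2 * seq_len * seq_len * dim
    linear = projections + mlp
    return {"linear": linear, "attention": attention, "total": linear + attention}


if __name__ == "__main__":
    for n in (256, 768):
        print(n, layer_macs(seq_len=n, dim=768))
```

Under this simplified model, the attention term grows with the square of `seq_len` while the projection and MLP term grows linearly, so how much a shorter sequence saves depends on which term dominates at a given model width.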
Designing the token
This raises an important question for transformer-based remote sensing models: what should a token represent?
Take Sentinel-2 imagery, a common modality we process. A Sentinel-2 input will be some tensor with a height and width (H, W representing the latitudinal and longitudinal pixels), a temporal dimension T, and 12 Sentinel-2 channels ([H, W, T, D=12]).
Currently, we split the data into *resolution-based patches*. Concretely, this means that we pick some spatial patch size p and split the overall Sentinel-2 image into patches of size p x p.
For each patch, we create a token per timestep per resolution. So a Sentinel-2 input with 2 timesteps yields 6 tokens per patch (2 timesteps x 3 resolutions, 10m, 20m, and 60m).
In total, a [H, W, T, D=12] Sentinel-2 input will yield H/p x W/p x T x 3 tokens.
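The token count above can be sketched as a small helper. This is an illustrative calculation only: the function name `sentinel2_token_count` and its parameters are hypothetical, and it assumes H and W divide evenly by the patch size.

```python
def sentinel2_token_count(
    height: int,
    width: int,
    timesteps: int,
    patch_size: int,
    tokens_per_patch_timestep: int = 3,
) -> int:
    """Count tokens for an [H, W, T, 12] input split into p x p patches.

    Hypothetical helper for illustration. The default of 3 corresponds to
    one token per resolution group (10m, 20m, 60m) per patch per timestep.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    patches = (height // patch_size) * (width // patch_size)
    return patches * timesteps * tokens_per_patch_timestep


if __name__ == "__main__":
    # A single patch with 2 timesteps gives 2 x 3 = 6 tokens.
    print(sentinel2_token_count(8, 8, timesteps=2, patch_size=8))
    # A 64 x 64 input with 2 timesteps and p = 8 gives 8 x 8 x 2 x 3 tokens.
    print(sentinel2_token_count(64, 64, timesteps=2, patch_size=8))
```

Setting `tokens_per_patch_timestep=1` gives the count for a scheme that uses one token for all bands.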
Using a unique token per resolution is a common technique when processing Sentinel-2 data: Galileo and SatMAE both take this approach, and SatMAE shows significantly better results when doing it. However, it is not universal: CROMA is a model that only uses a single token for all bands, regardless of resolution. Because token counts compound multiplicatively, collapsing resolutions into a single token produces three times fewer tokens and material savings across pretraining, fine-tuning, and inference.
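As a rough illustration of the two layouts, the sketch below counts how many raw pixel values each token carries for one patch at one timestep. The band grouping is an assumption based on the native resolutions of the 12 Sentinel-2 L2A bands; the exact band order and grouping used by OlmoEarth are not stated here, and `token_layout` is a hypothetical helper.

```python
# Assumed grouping of 12 Sentinel-2 bands by native resolution (illustrative).
BANDS_BY_RESOLUTION = {
    "10m": ["B02", "B03", "B04", "B08"],
    "20m": ["B05", "B06", "B07", "B8A", "B11", "B12"],
    "60m": ["B01", "B09"],
}


def token_layout(patch_size: int, merge_resolutions: bool) -> dict:
    """Map each token name to the number of raw values packed into it.

    Covers one p x p patch at one timestep, assuming every band has been
    resampled onto the same H x W grid.
    """
    pixels = patch_size * patch_size
    if merge_resolutions:
        n_bands = sum(len(bands) for bands in BANDS_BY_RESOLUTION.values())
        return {"all_bands": pixels * n_bands}
    return {res: pixels * len(bands) for res, bands in BANDS_BY_RESOLUTION.items()}


if __name__ == "__main__":
    split = token_layout(patch_size=4, merge_resolutions=False)
    merged = token_layout(patch_size=4, merge_resolutions=True)
    print(len(split), split)    # three tokens, one per resolution group
    print(len(merged), merged)  # one token carrying the same raw values
```

Both layouts cover the same raw values per patch; the difference is whether they are spread over three tokens or packed into one.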
Naively combining the tokens in this way leads to significant performance drops, including a 10 percentage point drop on m-eurosat kNN (a common benchmark task for remote sensing models). We hypothesize that separating Sentinel-2 bands into different tokens makes it easier for OlmoEarth to model important cross-band relationships.
Merging tokens without impacting performance required us to modify our pre-training regimen. We describe those changes in detail in our paper.
For developers
The result is a model family that does more with less. At every size, OlmoEarth v1.1 runs up to three times cheaper than OlmoEarth v1, making frequent, planet-scale map refreshes more affordable for every team running OlmoEarth. If you’re using a model from the original OlmoEarth family, try OlmoEarth v1.1. It provides similar performance to OlmoEarth v1 while requiring one third of the compute, though we have seen some regressions (see our technical report for more details). If it works for your task, you should see a significant speedup during fine-tuning and inference.
For researchers
Pretrained remote sensing models have many degrees of freedom, which makes them hard to study. When performance shifts, is it the architecture, the dataset, or the pre-training algorithm?
We train OlmoEarth v1.1 on the same dataset as OlmoEarth v1, so any differences between the two isolate the effect of methodological changes. We hope this advances understanding of scientific principles when pretraining models for remote sensing.
Get started
Check out the OlmoEarth v1.1 weights and training code, including the weights for our Base, Tiny, and Nano models.