microsoft/Lens-Turbo
Summary
Microsoft releases Lens, a 3.8B-parameter foundational text-to-image model with efficient training and fast high-resolution generation, featuring dense-caption pre-training and mixed-resolution learning.
Cached at: 05/25/26, 07:54 AM
Source: https://huggingface.co/microsoft/Lens-Turbo
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Contributors (Alphabetical Order): Baining Guo, Chong Luo, Dong Chen†, Dongdong Chen, Fangyun Wei†, Ji Li, Jianmin Bao, Jiawei Zhang*, Jinjing Zhao*, Lei Shi, Qinhong Yang, Sirui Zhang*, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yang Yue*, Yitong Wang, Yunuo Chen, Zhiyang Liang*, Ziyu Wan†
Microsoft | *Core Contributors | †Project Lead
Lens is a 3.8B-parameter foundational text-to-image model designed for efficient training and fast high-resolution generation. It combines dense-caption pre-training, mixed-resolution learning, GPT-OSS multi-layer text features, and the FLUX.2 semantic VAE to reach competitive quality with substantially less training compute than larger T2I models.
This repository provides the minimal inference code for generating images from Lens DiT checkpoints.
## Highlights
- Efficient Foundation: Trained on Lens-800M, an 800M image-text corpus with long GPT-4.1 captions, maximizing information density per training batch.
- Compact & Expressive: A 48-block MMDiT denoiser leverages FLUX.2 latents and concatenated multi-layer GPT-OSS features for stronger prompt following and multilingual generalization.
- Flexible Resolution: Mixed-resolution training enables inference across aspect ratios from 1:2 to 2:1 and resolutions up to 1440×1440.
- Post-trained Variants: RL tuning improves visual quality and artifact suppression; the distilled Lens-Turbo supports fast 4-step generation.
## Gallery
Sample 000 · 1440x1440
A generous portion of classic British fish and chips served on a sheet of white paper, golden crispy beer-battered cod fillet alongside thick-cut chips, a wedge of lemon, mushy peas in a small dish, malt vinegar bottle nearby, wooden pub table, overhead shot
Sample 001 · 1440x1440
The iconic Big Ben clock tower and the Houses of Parliament in London at golden hour, the River Thames reflecting warm amber light, Westminster Bridge in the foreground, a classic red double-decker bus crossing, dramatic clouds lit by sunset
Sample 002 · 1440x1440
La Tour Eiffel au crépuscule vue depuis le Trocadéro, la structure en fer illuminée de milliers de lumières dorées scintillantes, le ciel passant du bleu profond au violet, les fontaines du Trocadéro au premier plan avec des reflets dorés, silhouettes de promeneurs
Sample 003 · 1248x1664
A crystal dragon soaring through an aurora borealis sky, its entire body made of transparent faceted crystal refracting the green and purple aurora light into rainbow spectra, ice particles trailing from its wings, high fantasy digital art
Sample 004 · 1664x1248
Aerial view of Yuanyang rice terraces in Yunnan province at sunrise, thousands of cascading water-filled paddies reflecting golden and pink sky colors, morning mist weaving between terrace layers, lush green hillside with scattered palm trees, drone photography
Sample 005 · 1664x1248
A green iguana basking on a moss-covered fallen log in a tropical rainforest, every scale and spine rendered in sharp detail, dewdrops clinging to its skin, a blurred waterfall and lush tropical foliage in the background, National Geographic wildlife photography style
Sample 006 · 1248x1664
Oil painting portrait of a Renaissance noblewoman in a deep blue velvet dress with pearl drop earrings, soft chiaroscuro lighting revealing delicate skin, craquelure texture on the painted surface, in the style of Vermeer
Sample 007 · 1440x1440
An artisan honey jar with a hand-illustrated vintage botanical label reading “Mountain Wildflower Honey” in brown serif letterpress-style typography with decorative flourishes, detailed ink drawings of wildflowers, clover and honeybees surrounding the text, kraft paper label on clear glass jar
Sample 008 · 1440x1440
Watercolor portrait of a thoughtful young man reading a worn leather book in a Parisian cafe, loose wet-on-wet brushstrokes bleeding into warm amber and burnt sienna washes, visible paper grain texture
Sample 009 · 1664x1248
An explorer’s oak desk with an aged world map spread open, a brass sextant, leather-bound navigation journal with handwritten entries, melting candle in a copper holder, scattered compass and quill pen, warm window light, still life photography
Sample 010 · 1664x1248
New York Grand Central Terminal subway station with the classic station name “GRAND CENTRAL” spelled out in elegant white ceramic mosaic tile letters embedded in a dark green tile wall, each letter approximately eight inches tall, ornate tile border frames, the S-curve of train tracks visible
Sample 011 · 1664x1248
A ruby-throated hummingbird hovering in front of a bright red heliconia flower, wings frozen in a figure-eight pattern showing iridescent feather detail, individual water droplets suspended around the bird, high-speed macro photography with dark background
Sample 012 · 1664x1248
An old Remington typewriter with a sheet of cream-colored paper rolled into the carriage, the typed words “Chapter One: The Beginning” visible in slightly uneven Courier typeface with characteristic ink density variations, some letters slightly misaligned, warm desk lamp lighting
Sample 013 · 1664x1248
The Great Wildebeest Migration crossing the Mara River at golden hour, hundreds of animals plunging into churning water sending spray everywhere, dust clouds rising from the riverbank, dramatic backlit scene, National Geographic documentary style
Sample 014 · 1248x1664
A charming flower shop storefront window with hand-painted white script lettering on the glass reading “Fresh Flowers Daily” in flowing connected cursive with decorative swashes, roses and peonies arranged in buckets visible through the lettering, morning sunlight catching the painted letters
Sample 015 · 1248x1664
A steampunk floating sky-city built on massive gear-driven platforms, brass and copper towers connected by chain bridges, steam-powered airships and hot air balloons docking at various levels, sunset clouds below the city, detailed concept art
Sample 016 · 1664x1248
Milford Sound in New Zealand at dawn, a perfect mirror reflection of steep fjord walls on glass-still water, waterfalls streaming down thousand-foot cliffs, morning mist hovering above the water surface, panoramic landscape photography
Sample 017 · 1248x1664
An Indian Bharatanatyam classical dancer in the aramandi pose, bronze ankle bells and elaborate hand mudra gestures, rich silk costume with gold temple jewelry, captured mid-performance with dramatic stage lighting
Sample 018 · 1248x1664
A narrow alleyway in Marrakech’s old medina with walls painted in vivid cobalt blue, colorful handwoven rugs and ceramic plates displayed along the walls, ornate wooden doors, warm sunlight from above creating dramatic shadows, Moroccan architecture
Sample 019 · 1664x1248
A rustic wooden sign at a fishing village dock reading “Fresh Catch of the Day” in hand-carved letters painted nautical blue, thick hemp rope threaded through the sign as a border, fishing nets and lobster traps stacked in the background, seaside atmosphere
Sample 020 · 1664x1248
A sunken shipwreck on the ocean floor completely overgrown with colorful coral formations, schools of tropical fish swimming through the broken hull and portholes, shafts of sunlight streaming down from the surface above, underwater archaeology photography
Sample 021 · 1664x1248
Zhangjiajie pillar mountains rising above a sea of clouds at sunrise, golden light painting the sandstone peaks, the surreal Avatar-like floating mountain landscape stretching to the horizon, aerial drone photography capturing immense vertical scale
Sample 022 · 1440x1440
A red-eyed tree frog perched on a bright red bromeliad flower in the Costa Rican cloud forest, its neon green body contrasting with blue-striped flanks and orange feet, water droplets on its smooth skin, extreme macro with ring flash lighting
Sample 023 · 1248x1664
Inside a massive limestone cave, ancient stalactites and stalagmites meeting to form columns, an underground river reflecting the formations like a mirror, subtle warm lighting revealing millions of years of mineral deposits, spelunking exploration photography
Sample 024 · 1664x1248
A weathered 1960s gas station with a large roadside sign reading “ROUTE 66 GAS & GO” in retro rounded sans-serif letters with a red and white color scheme, vintage gas pumps with analog dials in the foreground, a classic Chevrolet parked to the side, Americana nostalgia
Sample 025 · 1664x1248
Construction site hoarding covered in unauthorized street art with “ART IS EVERYWHERE” spray-painted in large freehand capital letters using multiple overlapping colors of red, yellow and blue, paint drips running down from each letter, chaotic beautiful urban canvas
Sample 026 · 1664x1248
Top-down view of a koi pond, dozens of ornamental koi fish in vivid red white orange and gold patterns swimming through crystal-clear emerald water, fallen cherry blossom petals floating on the surface, Japanese garden aerial photography
Sample 027 · 1664x1248
The Potala Palace in Lhasa under a canopy of stars with the Milky Way arching overhead, Tibetan prayer wheels and butter lamps in the foreground casting warm golden light, the massive white and red palace walls glowing in moonlight, night photography
Sample 028 · 1248x1664
Yellowstone’s Grand Prismatic Spring shot from directly above by drone, concentric rings of vivid blue turquoise green yellow and orange created by thermophilic bacteria, steam rising from the surface, abstract natural color palette
Sample 029 · 1664x1248
A herd of African elephants walking in a line across the savanna with Mount Kilimanjaro’s snow-capped peak behind them, golden sunset dust kicked up by their feet creating a hazy atmosphere, telephoto wildlife photography showing massive scale
## Installation
**Tested environment:** Python 3.12 · CUDA 12.6 · PyTorch 2.11.0+cu126 · TorchVision 0.26.0+cu126
conda create -n lens python=3.12 -y
conda activate lens
uv pip install torch==2.11.0+cu126 torchvision==0.26.0+cu126 \
    --index-url https://download.pytorch.org/whl/cu126
uv pip install -r requirements.txt
The default GPT-OSS encoder and FLUX.2 VAE are loaded from Hugging Face. Make sure your environment has access to any gated model repositories you use.
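Access to gated repositories usually depends on a Hugging Face token being visible to the process. The following is a minimal pre-flight sketch, assuming the token is supplied through the `HF_TOKEN` environment variable that `huggingface_hub` reads. The helper name `resolve_hf_token` is hypothetical and not part of this repository.

```python
import os


def resolve_hf_token(env=None):
    """Return the Hugging Face token from the environment, or None if unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if resolve_hf_token() is None:
        print("HF_TOKEN is not set; gated repositories may fail to download.")
    else:
        print("HF_TOKEN found; repositories you have access to should load.")
```

A token alone is not sufficient for a gated repository: the account behind it must also have been granted access on the model page.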
## Checkpoints
Pick a variant by passing its repo id to `--repo_id` (CLI) or `LensPipeline.from_pretrained(...)` (Python).
## Inference
**Important:** run from the cloned repo root so that `from lens import LensPipeline` resolves to this package. Importing `lens` is what registers `LensGptOssEncoder` / `LensTransformer2DModel` with the `transformers` and `diffusers` namespaces that `model_index.json` references.
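If a script has to live outside the repo root, one workaround is to put the clone on `sys.path` before importing. This is a sketch under that assumption; the helper name and the example path are hypothetical.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_root):
    """Put the cloned repo root on sys.path (at the front) if it is missing."""
    root = str(Path(repo_root).expanduser().resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# Hypothetical usage; replace the path with the location of your clone:
# add_repo_to_path("~/src/Lens")
# from lens import LensPipeline  # importing `lens` performs the registration
```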
Python API:
import torch
from lens import LensPipeline

pipe = LensPipeline.from_pretrained(
    "microsoft/Lens", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="A cat holding a sign that says \"hello world\"",
    base_resolution=1440, aspect_ratio="1:1",
    num_inference_steps=20, guidance_scale=5.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("lens.png")
To trade speed for VRAM, replace `.to("cuda")` with `pipe.enable_model_cpu_offload()`.
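The Python example targets the base `microsoft/Lens` checkpoint with 20 steps, while the highlights describe 4-step generation for the distilled Lens-Turbo. A sketch of switching between the two might look like the following. The helper names are hypothetical, the repo id `microsoft/Lens-Turbo` is taken from this page's title, and the right `guidance_scale` for Turbo is not stated here, so the documented default is reused as a placeholder.

```python
def build_generation_config(prompt, turbo=False, guidance_scale=5.0,
                            base_resolution=1440, aspect_ratio="1:1"):
    """Return (repo_id, call_kwargs) for the base or Turbo checkpoint."""
    repo_id = "microsoft/Lens-Turbo" if turbo else "microsoft/Lens"
    call_kwargs = {
        "prompt": prompt,
        "base_resolution": base_resolution,
        "aspect_ratio": aspect_ratio,
        # 4 steps for the distilled variant, 20 as in the base example.
        "num_inference_steps": 4 if turbo else 20,
        # Assumption: reuse the documented default; tune for Turbo as needed.
        "guidance_scale": guidance_scale,
    }
    return repo_id, call_kwargs


def generate(prompt, turbo=False, offload=False, seed=0):
    """Load the pipeline and render one image (needs a GPU and the lens repo)."""
    import torch
    from lens import LensPipeline

    repo_id, call_kwargs = build_generation_config(prompt, turbo=turbo)
    pipe = LensPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    if offload:
        pipe.enable_model_cpu_offload()  # lower peak VRAM, slower
    else:
        pipe = pipe.to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(generator=generator, **call_kwargs).images[0]
```

Keeping the configuration in a pure function makes it possible to inspect the chosen settings without loading any weights.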
CLI — basic usage:
python inference.py \
    --repo_id "microsoft/Lens" \
    --prompt "A cinematic mountain lake at sunrise, soft mist, detailed reflections" \
    --base_resolution 1440 --aspect_ratio 1:1 \
    --steps 20 --cfg 5.0 --n 1 --seed 42 \
    --out ./outputs
Batch generation: join multiple prompts with `|`:
python inference.py \
    --repo_id "microsoft/Lens" \
    --steps 20 --cfg 5.0 \
    --prompt "a red fox in snow|a glass greenhouse at night"
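How `inference.py` parses the joined string is not shown here. A plausible sketch, assuming a plain split on `|` with surrounding whitespace trimmed and empty entries dropped (the function name is hypothetical):

```python
def split_prompts(joined):
    """Split a `|`-joined prompt string into individual prompts."""
    return [part.strip() for part in joined.split("|") if part.strip()]


prompts = split_prompts("a red fox in snow|a glass greenhouse at night")
print(prompts)  # ['a red fox in snow', 'a glass greenhouse at night']
```

Under this assumption a literal `|` inside a prompt would also be treated as a separator, so such prompts would need to be passed one at a time.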
A100 / V100 (no MXFP4 kernels): dequantize the GPT-OSS encoder to bf16:
python inference.py \
    --repo_id "microsoft/Lens" \
    --steps 20 --cfg 5.0 \
    --prompt "a cat" \
    --disable_mxfp4 --offload
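Whether `--disable_mxfp4` is needed can be estimated from the GPU's CUDA compute capability. The sketch below assumes that "Hopper+" corresponds to compute capability 9.0 or higher (V100 is 7.0, A100 is 8.0, H100 is 9.0). The helper names are hypothetical, and the repository may use a different check.

```python
def needs_mxfp4_dequant(major, minor=0):
    """True if the GPU is older than Hopper (compute capability < 9.0)."""
    return (major, minor) < (9, 0)


def extra_cli_flags(major, minor=0):
    """Suggest extra inference.py flags for the given compute capability."""
    return ["--disable_mxfp4"] if needs_mxfp4_dequant(major, minor) else []


def detect_capability():
    """Return (major, minor) of the current CUDA device, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


print(extra_cli_flags(8, 0))  # A100 -> ['--disable_mxfp4']
print(extra_cli_flags(9, 0))  # H100 -> []
```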
## Options
| Flag | Description | Default |
| --- | --- | --- |
| `--repo_id` | HF repo id (or local path) of the assembled Lens pipeline | `microsoft/Lens` |
| `--base_resolution` | `1024` or `1440` | `1440` |
| `--aspect_ratio` | `1:2`, `9:16`, `2:3`, `3:4`, `1:1`, `4:3`, `3:2`, `16:9`, `2:1` | `1:1` |
| `--steps` | Number of denoising steps | `20` |
| `--cfg` | Classifier-free guidance scale | `5.0` |
| `--n` | Number of images per prompt | `1` |
| `--seed` | Random seed (omit for non-deterministic) | none |
| `--out` | Output directory | `./outputs` |
| `--dtype` | Compute dtype: `bfloat16`, `float16`, `float32` | `bfloat16` |
| `--disable_mxfp4` | Dequantize the GPT-OSS text encoder to `--dtype` (required on A100 / V100; Hopper+ keeps MXFP4 by default for less VRAM) | none |
| `--offload` | Enable diffusers CPU offload (`text_encoder -> transformer -> vae`) to reduce peak VRAM | none |
| `--reasoner` | Refine prompts with the loaded GPT-OSS encoder before generation | none |
| `--api_url` / `--api_key` / `--api_model` | Use an OpenAI-compatible API for prompt refinement (takes precedence over `--reasoner`) | none |
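The gallery sizes (1440x1440, 1248x1664, 1664x1248) are consistent with keeping the pixel area near `base_resolution` squared and rounding each side to a multiple of 32, with 1:1, 3:4 and 4:3 as the matching width:height ratios. The sketch below reproduces those three sizes under that assumption. The function name and the rounding multiple are guesses rather than the repository's actual bucketing code, and other ratios may resolve differently in practice.

```python
import math


def resolve_size(base_resolution, aspect_ratio, multiple=32):
    """Return (width, height) with area close to base_resolution ** 2."""
    w_part, h_part = (int(x) for x in aspect_ratio.split(":"))
    ratio = w_part / h_part
    width = base_resolution * math.sqrt(ratio)
    height = base_resolution / math.sqrt(ratio)

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for ar in ("1:1", "3:4", "4:3"):
    print(ar, resolve_size(1440, ar))
```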
## Citation
@article{zhao2026lens,
title = {Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models},
author = {Guo, Baining and Luo, Chong and Chen, Dong and Chen, Dongdong and Wei, Fangyun and Li, Ji and Bao, Jianmin and Zhang, Jiawei and Zhao, Jinjing and Shi, Lei and Yang, Qinhong and Zhang, Sirui and Wu, Xiuyu and Feng, Xuelu and Lu, Yan and Dong, Yanchen and Yue, Yang and Wang, Yitong and Chen, Yunuo and Liang, Zhiyang and Wan, Ziyu},
journal = {arXiv preprint arXiv:2605.21573},
year = {2026}
}
## Responsible AI
The model is released for research purposes only and is not intended for product or service deployment. Responsible AI considerations were incorporated throughout the development process, including data selection, model training, and evaluation. The training data includes a combination of public, licensed, and internal datasets that were processed to remove clearly identifiable personal information and reduce harmful content where possible. However, as the data is largely sourced from web-scale collections, it may contain biases or uneven representation. As a result, the model may generate outputs that are inaccurate, biased, or inappropriate under certain prompts, including content that could be misleading or raise copyright or IP-related concerns. Given these limitations, the model should be used in controlled research settings, with appropriate human oversight. Downstream users are responsible for applying additional safeguards, such as content moderation, validation, and compliance checks, before using the model in broader applications.
## Privacy
This project does not collect any usage data. For more information, see the Microsoft Privacy Statement.
## License
This project is released under the MIT License.