@Azure: Three open-source image models, one platform. Microsoft Foundry and Hugging Face bring developers the largest catalog f…
Summary
Microsoft Foundry integrates three open-source image models (SDXL, FLUX.1-schnell, and Z-Image-Turbo) via Hugging Face, offering developers a unified platform for AI image generation.
Cached at: 05/23/26, 10:12 PM
Three open-source image models, one platform. Microsoft Foundry and Hugging Face bring developers the largest catalog for AI innovation.
Build with Stability AI’s SDXL, Black Forest Labs’ FLUX.1-schnell, and Tongyi-MAI’s Z-Image-Turbo in Foundry today: https://t.co/ceTA6AF0we https://t.co/p7ioLwIch8
Now in Foundry: Tongyi-MAI Z-Image-Turbo, with FLUX.1-schnell and SDXL base 1.0
Source: https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/now-in-foundry-tongyi-mai-z-image-turbo-with-flux-1-schnell-and-sdxl-base-1-0/4520199
This week’s Model Mondays edition pairs three models available through the Hugging Face collection in Microsoft Foundry: **Tongyi-MAI’s Z-Image-Turbo**, a new model designed for lower latency on a single GPU and native bilingual text rendering; **Black Forest Labs’ FLUX.1-schnell**, a 12B rectified flow transformer distilled to 1–4 step inference and one of the most adopted open-weight image models since its 2024 release; and **Stability AI’s stable-diffusion-xl-base-1.0 (SDXL)**, a latent diffusion research model that can be used to generate and modify images based on text prompts.
Model Specs
- Parameters / size: 6B (BF16)
- Resolution: Up to 1024×1024 native
- Primary task: Text-to-image generation (English and Chinese)
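The parameter count and precision above imply a rough weight footprint, which a few lines of arithmetic make concrete. This is a minimal sketch, assuming 2 bytes per BF16 parameter and decimal gigabytes; the helper name `weight_memory_gb` is illustrative, and the estimate covers only the 6B diffusion transformer weights, not the text encoder, VAE, or activations.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 6B parameters stored in BF16 (2 bytes each), per the specs above.
z_image_weights_gb = weight_memory_gb(6e9)
print(f"Z-Image-Turbo transformer weights: ~{z_image_weights_gb:.0f} GB")
```

Under these assumptions the weights alone come to about 12 GB, which leaves limited headroom for the remaining components on a 16 GB card.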
Why it’s interesting (Spotlight)
- **Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture:** Z-Image concatenates text tokens, visual semantic tokens, and image VAE tokens into a single unified input stream rather than running text and image through separate branches. This single-stream design can improve parameter efficiency relative to dual-stream DiT architectures at the same capacity. See the Z-Image technical report for details.
- **8-step inference at sub-second latency, fits in 16GB VRAM:** Z-Image-Turbo is distilled with Decoupled Distribution Matching Distillation (Decoupled-DMD) and further refined with DMDR, a method that fuses DMD with reinforcement learning during post-training. The result is a model that needs 8 function evaluations (Number of Function Evaluations, NFE) per image with no Classifier-Free Guidance (CFG), which roughly halves the per-step compute compared to CFG-based inference. See the Decoupled-DMD and DMDR papers.
- **Native bilingual text rendering and strong instruction adherence:** Unlike most open-weight image models, which struggle with legible in-image text, Z-Image-Turbo renders complex English and Chinese text accurately, which is useful for posters, signage, packaging mockups, and marketing creative.
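The compute claim in the second bullet can be checked with simple counting. The sketch below assumes the usual CFG setup, in which each sampling step evaluates the network once with the prompt and once without it; the function name `forward_passes` is illustrative. In practice the two CFG passes are often batched together, so the saving shows up as compute rather than as separate calls.

```python
def forward_passes(steps: int, use_cfg: bool) -> int:
    """Count network evaluations for one image.

    With CFG, every step needs a conditional and an unconditional
    prediction, so the cost per step doubles.
    """
    passes_per_step = 2 if use_cfg else 1
    return steps * passes_per_step


turbo_passes = forward_passes(8, use_cfg=False)    # 8 NFE, no CFG
same_steps_cfg = forward_passes(8, use_cfg=True)   # same step count with CFG

print(f"8 steps without CFG: {turbo_passes} passes")
print(f"8 steps with CFG:    {same_steps_cfg} passes")
```

At the same step count, dropping CFG halves the number of evaluations, which matches the "roughly halves the per-step compute" statement above.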
Try it
Figure 1. Cherry cake generated by Z-Image-Turbo
Figure 2. Using the original image to create a poster for marketing material
Imagine you’re a community programs coordinator at your city’s parks department, planning a new summer event series, a “Cake Picnic in the Park”, designed to bring neighbors together over food in shared green space. The event is a few weeks out. You haven’t booked bakery partners yet, so no actual cake exists, and you need marketing assets this week to start driving sign-ups: a hero image for the registration page, a flyer for community centers and libraries, and social tiles for the city’s channels. Use the prompt below to generate a photorealistic image that can then be adapted into additional assets, such as printed flyers or social images, in minutes using image editing tools (or another model).
Prompt: A round layered cake displayed on a white ceramic cake stand, topped with glossy fresh red cherries and smooth pastel pink buttercream frosting piped in delicate rosettes around the edge. One generous slice has been cleanly cut and removed from the front, revealing a perfect cross-section: four distinct horizontal layers alternating between soft pink sponge cake and fluffy white vanilla cream frosting. Professional bakery photography, soft natural window light from the left, shallow depth of field, marble countertop, warm and inviting atmosphere, photorealistic detail on the cake texture, cherry highlights, and frosting swirls.
Model Specs
- Parameters / size: 12B (rectified flow transformer)
- Resolution: Flexible up to 2 megapixels
- Primary task: Text-to-image generation
Why it’s interesting (Spotlight)
- **Rectified flow transformer with adversarial distillation for 1–4 step inference:** FLUX.1-schnell is the distilled, Apache 2.0 sibling of the FLUX.1 family. It uses a rectified flow formulation (a diffusion variant that learns straight-line probability paths between noise and data, reducing the number of solver steps needed) and is further compressed with latent adversarial diffusion distillation. The model generates high-quality images in 1–4 steps, which suits latency-sensitive workloads.
- **Permissive licensing for commercial use:** Released under Apache 2.0, FLUX.1-schnell can be used for personal, scientific, and commercial purposes. This has driven broad adoption across product features that need an open, redistributable image backbone.
- **Strong prompt adherence at its parameter range:** At 12B parameters, FLUX.1-schnell sits between the SDXL family and frontier proprietary image models. Roughly two years after its initial release, it remains a common reference point for evaluating prompt following in open image generation, particularly for complex compositional prompts and longer captions.
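To see why straight-line paths reduce the number of solver steps, here is a toy illustration rather than anything from FLUX itself. It assumes an oracle velocity field in place of a trained network and uses two-dimensional points; the names `euler_sample`, `straight_velocity`, and `curved_velocity` are illustrative. With a constant velocity (a straight path), a single Euler step lands exactly on the target, while a curved path needs many steps to get close.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(x0: Vector, velocity: Callable[[Vector, float], Vector], steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 2.0]   # toy starting point standing in for a noise sample
data = [1.0, -1.0]   # toy target standing in for a data sample
delta = [d - n for d, n in zip(data, noise)]


def straight_velocity(x: Vector, t: float) -> Vector:
    # Rectified-flow-style path x_t = (1 - t) * noise + t * data: constant velocity.
    return delta


def curved_velocity(x: Vector, t: float) -> Vector:
    # A curved path x_t = noise + t**2 * delta: velocity changes along the way.
    return [2.0 * t * d for d in delta]


for steps in (1, 4):
    print(steps, "straight:", euler_sample(noise, straight_velocity, steps))
    print(steps, "curved:  ", euler_sample(noise, curved_velocity, steps))
```

On the straight path both 1 and 4 steps reach the target; on the curved path a single step does not move at all and 4 steps cover only three quarters of the distance. A trained model only approximates the straight path, which is one reason distillation is still applied on top.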
Try it
Hugging Face Spaces give developers the ability to experiment with and try new models before deploying them. Test out a few prompts here:
https://black-forest-labs-flux-1-schnell.hf.space
Then, when you are ready, deploy the model in Microsoft Foundry.
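For local experimentation outside Spaces or Foundry, the model can also be run with the Hugging Face `diffusers` library. The following is a minimal sketch, assuming `torch` and `diffusers` are installed and a GPU with enough memory is available; the helper names `schnell_call_kwargs` and `generate` are illustrative, and the settings (no guidance, at most 4 steps, 256-token prompt limit) follow the usage shown on the model's Hugging Face card. It is not the Foundry deployment path.

```python
def schnell_call_kwargs(prompt: str, steps: int = 4) -> dict:
    """Build pipeline arguments for FLUX.1-schnell (distilled for 1 to 4 steps)."""
    if not 1 <= steps <= 4:
        raise ValueError("FLUX.1-schnell is distilled for 1 to 4 inference steps")
    return {
        "prompt": prompt,
        "num_inference_steps": steps,
        "guidance_scale": 0.0,       # the distilled model is run without guidance
        "max_sequence_length": 256,  # prompt token limit used for schnell
    }


def generate(prompt: str, steps: int = 4):
    """Load the pipeline and return one PIL image. Downloads the weights on first use."""
    # Heavy imports are kept inside the function so the helper above stays lightweight.
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # trades speed for lower GPU memory use
    return pipe(**schnell_call_kwargs(prompt, steps)).images[0]


if __name__ == "__main__":
    print(schnell_call_kwargs("A round layered cake on a white ceramic cake stand"))
```

Calling `generate(...)` and then `.save("cake.png")` on the result would write the image to disk.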
Figure 2. Architectural diagram available here: stabilityai/stable-diffusion-xl-base-1.0 · Hugging Face
Model Specs
- Parameters / size: 2.6B UNet (≈3.5B total with text encoders)
- Resolution: 1024×1024 native
- Primary task: Text-to-image generation
Why it’s interesting (Spotlight)
- **Dual text encoder design and an ensemble-of-experts pipeline:** SDXL uses two pretrained text encoders (OpenCLIP-ViT/G and CLIP-ViT/L), concatenated to capture both broad semantic alignment and finer-grained token-level cues. It can be run standalone or paired with the SDXL refiner in an ensemble-of-experts pipeline where the base model handles early denoising and the refiner specializes in the final steps. See the SDXL report for the original training and architecture details.
- **CreativeML Open RAIL++-M licensing for managed deployments:** SDXL is distributed under the CreativeML Open RAIL++-M license, which permits commercial use and downstream fine-tuning with documented use restrictions.
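The base-plus-refiner handoff described in the first bullet can be sketched with the `diffusers` library, which exposes it through the `denoising_end` and `denoising_start` arguments. This is a minimal sketch, assuming `torch` and `diffusers` are installed and a CUDA GPU is available; the helper names `split_steps` and `generate_with_refiner`, the 40-step budget, and the 0.8 handoff fraction are illustrative choices, and `split_steps` only approximates how the library divides the schedule.

```python
from typing import Tuple


def split_steps(total_steps: int, high_noise_frac: float) -> Tuple[int, int]:
    """Approximate how many steps the base and the refiner each run."""
    base_steps = round(total_steps * high_noise_frac)
    return base_steps, total_steps - base_steps


def generate_with_refiner(prompt: str, total_steps: int = 40, high_noise_frac: float = 0.8):
    """Base model denoises the early, high-noise part; the refiner finishes the rest."""
    # Heavy imports are kept inside the function so split_steps stays lightweight.
    import torch
    from diffusers import StableDiffusionXLImg2ImgPipeline, StableDiffusionXLPipeline

    base = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2, vae=base.vae,  # share components with the base
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")

    # The base stops early and hands over latents instead of a decoded image.
    latents = base(
        prompt=prompt, num_inference_steps=total_steps,
        denoising_end=high_noise_frac, output_type="latent",
    ).images
    return refiner(
        prompt=prompt, num_inference_steps=total_steps,
        denoising_start=high_noise_frac, image=latents,
    ).images[0]


if __name__ == "__main__":
    print(split_steps(40, 0.8))
```

With these illustrative settings the base model covers roughly the first 32 steps and the refiner the last 8.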
Try it
To go deeper on SDXL, take a look at Stability AI’s generative-models GitHub repository, which implements the most popular diffusion frameworks for both training and inference and continues to expand with new capabilities like distillation.
You can deploy open-source Hugging Face models directly in Microsoft Foundry in two ways. The first is to browse the Hugging Face collection in the Foundry model catalog and deploy to managed endpoints in just a few clicks. The second is to go through the Hugging Face Hub: select any supported model and then choose “Deploy on Microsoft Foundry”, which brings you straight into Azure. Learn how to discover models and deploy them using Microsoft Foundry documentation: