@ZhengyangGeng: You can always trust Kaiming's quality bar. Writing, code, data, recipe, ckpt... https://github.com/PeppaKing8/minit2i-…

X AI KOLs Timeline 06/17/26, 10:16 PM Models

text-to-image diffusion open-source jax pytorch flan-t5 minit2i

Summary

MiniT2I is a minimalist direct-RGB text-to-image generator using a pixel-space MM-JiT denoiser with flow matching and frozen FLAN-T5-Large text tokens, with open-source JAX/Flax and PyTorch implementations released along with checkpoints.

You can always trust Kaiming's quality bar. Writing, code, data, recipe, ckpt... https://github.com/PeppaKing8/minit2i-jax… https://github.com/Hope7Happiness/minit2i-torch… It's a great pleasure working with all undergrads! They're literally "postdocs" in the group. @kevinxbwang2007 @Hope7Happiness @Lyy_iiis 带带弟弟

Original Article

View Cached Full Text

Cached at: 06/18/26, 08:12 AM

You can always trust Kaiming’s quality bar. Writing, code, data, recipe, ckpt… https://github.com/PeppaKing8/minit2i-jax… https://github.com/Hope7Happiness/minit2i-torch… It’s a great pleasure working with all undergrads! They’re literally “postdocs” in the group. @kevinxbwang2007 @Hope7Happiness @Lyy_iiis 带带弟弟

PeppaKing8/minit2i-jax

Source: https://github.com/PeppaKing8/minit2i-jax

MiniT2I logo

MiniT2I: A Minimalist Baseline for Text-to-Image Generation

Official JAX/Flax training implementation of MiniT2I.

MiniT2I is a simple direct-RGB text-to-image generator that trains a pixel-space MM-JiT denoiser with flow matching, conditioned on frozen FLAN-T5-Large text tokens. The recipe is intentionally plain: avoiding image tokenizers, cascaded generation, RL stages, and any auxiliary losses. Data used in training MiniT2I is fully public and easy to implement. For more details, please refer to our blog post.

This repository contains the original JAX/Flax diffusion training and evaluation code used for MiniT2I.

For the JAX Mean Flow distillation code used by the four-step MiniT2I-B/16-MF checkpoint, use the mean_flow_distill branch.
For a PyTorch/Diffusers implementation with inference and LoRA adaptation, see Hope7Happiness/minit2i-torch.

Model Zoo

Model	Params	Patch	GenEval	DPG-Bench	Hugging Face
MiniT2I-B/16	258M + 341M text encoder	16	0.873	84.2	MiniT2I-B-16
MiniT2I-L/16	912M + 341M text encoder	16	0.883	85.9	MiniT2I-L-16

The repository also includes our default baseline B/32 for ablation and reproduction studies.

Repository Layout

.
|-- main.py                     # JAX distributed train/eval entry point
|-- train.py                    # training loop, checkpointing, sampling, online eval
|-- diffusion.py                # flow-matching objective and samplers
|-- configs/                    # defaults plus b32/b16/l16 YAML recipes
|-- settings.py                 # local paths, checkpoints, eval assets, logging
|-- models/                     # MM-JiT, T5 encoder, Flax/Torch-compatible layers
|-- utils/                      # input pipeline, pjit sharding, checkpoints, logging
|-- evaluators/                 # FID, GenEval, and DPG-Bench dispatch
|-- external/                   # JAX evaluator/model ports used by benchmarks
`-- scripts/                    # install/train/eval launch helpers

Installation

This codebase is TPU-oriented that uses JAX distributed initialization.

Create a Python environment, install a TPU-compatible JAX build, then install the remaining dependencies:

python -m pip install "jax[tpu]" \
  -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
python -m pip install -r requirements.txt

The same commands are wrapped by:

bash scripts/install.sh

Authenticate only with services needed by your run:

hf auth login
wandb login

Project-level paths, checkpoint roots, benchmark asset roots, and W&B defaults live in settings.py. Experiment-dependent fields such as load_from belong in run YAML configs or command-line overrides.

Dataset Preparation

Training uses WebDataset tar shards. Each sample must contain an image under jpg or png and a caption under txt. The input pipeline resizes and center-crops images, normalizes them to [-1, 1], and tokenizes captions with the configured frozen text encoder.

See scripts/datasets/README.md for more details.

Training

We provide training config files for B/32, B/16, and L/16, respectively:

For B/32, start from configs/b32/pretrain.yml;
For B/16, start from configs/b16/pretrain.yml;
For L/16, start from configs/l16/pretrain.yml.

Set CC12M_ROOT in settings.py, then keep the YAML focused on the training recipe:

eval_only: False

dataset:
  use_cc12m: True
  use_blip3_ft60k: False
  use_dalle3: False
  use_sharegpt4o: False

eval:
  on_training: False

Launch:

bash scripts/train.sh pretrain \
  --config configs/load_config.py:pretrain_b16 \
  --workdir /path/to/runs/minit2i-pretrain

For the B/32 ablation recipe:

bash scripts/train.sh pretrain_b32 \
  --config configs/load_config.py:pretrain_b32 \
  --workdir /path/to/runs/minit2i-b32-pretrain

For fine-tuning, pass the pretrained checkpoint with --load_from or set load_from in the run YAML, then enable the 120K mix sources:

eval_only: False
load_from: /path/to/pretrained/checkpoint_or_run_dir

dataset:
  use_cc12m: False
  use_blip3_ft60k: True
  use_dalle3: True
  use_sharegpt4o: True

Launch:

bash scripts/train.sh finetune \
  --config configs/load_config.py:finetune_b16 \
  --workdir /path/to/runs/minit2i-finetune \
  --load_from /path/to/pretrained/checkpoint_or_run_dir

Checkpoints are saved through Flax checkpointing directly under workdir. Local absolute paths and gs:// paths (GCS bucket) are both supported.

Evaluation

Evaluation is driven by the same entry point with eval_only: True. The checkpoint is passed with --load_from when using scripts/eval.sh, or with the experiment-level load_from field in a YAML config. It is restored in train.py.

load_from may point either to a Flax checkpoint directory (starting with checkpoint_), or to a parent directory containing Flax checkpoints. If a parent directory is given, the latest checkpoint_* under that directory is restored.

Download Checkpoints

Install the Hugging Face CLI if needed:

python -m pip install -U "huggingface_hub[cli]"

Download a JAX checkpoint:

hf download MiniT2I/MiniT2I-B-16-jax \
  --local-dir /path/to/checkpoints/MiniT2I-B-16-jax

hf download MiniT2I/MiniT2I-L-16-jax \
  --local-dir /path/to/checkpoints/MiniT2I-L-16-jax

The model architecture in the YAML config must match the checkpoint for successful parameter loading:

Checkpoint	Config
MiniT2I-B/16 (`MiniT2I/MiniT2I-B-16-jax`)	`configs/load_config.py:eval_b16`
MiniT2I-L/16 (`MiniT2I/MiniT2I-L-16-jax`)	`configs/load_config.py:eval_l16`

Run Evaluation

For B/16:

bash scripts/eval.sh eval \
  --config configs/load_config.py:eval_b16 \
  --workdir /path/to/runs/minit2i-eval \
  --load_from /path/to/checkpoints/MiniT2I-B-16-jax

For L/16:

bash scripts/eval.sh eval_l \
  --config configs/load_config.py:eval_l16 \
  --workdir /path/to/runs/minit2i-l-eval \
  --load_from /path/to/checkpoints/MiniT2I-L-16-jax

The combined evaluator currently wires:

FID on MSCOCO-30K when enabled.
GenEval with the JAX Mask2Former detector and optional CLIP color classifier (This JAX version is our reproduction).
DPG-Bench with the JAX mPLUG VQA evaluator (This JAX version is our reproduction).

Evaluation Assets

We evaluate MSCOCO-FID-30K, GenEval, and DPGBench. For each benchmark, set the asset paths in settings.py, then enable the benchmark via setting enable: True inside the config yaml file. You can specify the CFG scale for each benchmark. For instance, if using GenEval with guidance scale 6.0:

eval:
  geneval:
    enable: True
    cfg_scale: 6.0

If a benchmark-specific value is missing, the evaluator falls back to the top-level eval.cfg_scale.

Sampling

During training or eval-only runs, setting eval_show_sample: True writes samples from the built-in visualization prompts. If W&B or TensorBoard logging is disabled, images are saved under:

<workdir>/writed_images/

Acknowledgments

This codebase builds on a number of open-source efforts:

Our GenEval and DPGBench evaluations on JAX are the re-implementation of the original evaluations from djghosh13/geneval and Jialuo21/DPG-Bench.
Public training data from CaptionEmporium/conceptual-captions-cc12m-llavanext, BLIP3o/BLIP3o-60k, CaptionEmporium/dalle3-llama3.2-11b, FreedomIntelligence/ShareGPT-4o-Image, and lambdalabs/naruto-blip-captions.

We also thank GCP for providing the computational resources.

Citation

@misc{minit2i2026,
  title  = {MiniT2I: A Minimalist Baseline for Text-to-Image Generation},
  author = {Wang, Xianbang and Zhao, Hanhong and Lu, Yiyang and Zhou, Kangyang and Ma, Linrui and He, Kaiming},
  year   = {2026},
  url    = {https://peppaking8.github.io/#/post/minit2i}
}

Zhengyang Geng (@ZhengyangGeng): MiniT2I from Kaiming’s group! text-2-img made simple 🤏🏼

@ZhengyangGeng: You can always trust Kaiming's quality bar. Writing, code, data, recipe, ckpt... https://github.com/PeppaKing8/minit2i-…

PeppaKing8/minit2i-jax

MiniT2I: A Minimalist Baseline for Text-to-Image Generation

Model Zoo

Repository Layout

Installation

Dataset Preparation

Training

Evaluation

Download Checkpoints

Run Evaluation

Evaluation Assets

Sampling

Acknowledgments

Citation

Similar Articles

@jun_song: Working on fitting Kimi-K2.6 (1T) on 128GB Mac. Trying to get 40tok/s, and minimize the quality loss.

@HuggingPapers: Alibaba released Qwen-Image-Flash Few-step distillation goes beyond objectives. Data composition, teacher guidance, and…

Nano Banana Finally Dethroned. GPT-Image 2.0 FULLY tested

@QingQ77: Training a 0.1B end-to-end omnimodal model from scratch. A single set of weights handles text, speech, and image inputs, while outputting text and streaming speech. https://github.com/jingyaogong/minimind-o… MiniMind-O is an omnimodal model with only 0.1B parameters…

This is ChatGPT Images 2.0

Submit Feedback

Similar Articles

@jun_song: Working on fitting Kimi-K2.6 (1T) on 128GB Mac. Trying to get 40tok/s, and minimize the quality loss.
A developer is optimizing the Kimi-K2.6 (1T) model to run efficiently on a 128GB Mac, targeting 40 tokens per second while minimizing quality loss.

@HuggingPapers: Alibaba released Qwen-Image-Flash Few-step distillation goes beyond objectives. Data composition, teacher guidance, and…

Nano Banana Finally Dethroned. GPT-Image 2.0 FULLY tested

@QingQ77: Training a 0.1B end-to-end omnimodal model from scratch. A single set of weights handles text, speech, and image inputs, while outputting text and streaming speech. https://github.com/jingyaogong/minimind-o… MiniMind-O is an omnimodal model with only 0.1B parameters…