Tag
This paper argues that vanilla conditional diffusion models fundamentally fail at compositional generation when the target distribution is out-of-distribution, due to score estimation error, and that inference-time corrections cannot fully compensate.
This paper introduces BiDPO, a framework that enhances text-to-image models for complex compositional prompts through preference-based fine-tuning and region-level guidance, achieving state-of-the-art results on compositional fidelity benchmarks.