LISA: Likelihood Score Alignment for Visual-condition Controllable Generation

Hugging Face Daily Papers 06/25/26, 12:00 AM Papers

Summary

This paper introduces LISA, a regularization method that aligns the intermediate features of a side network with an approximated likelihood score to improve training efficiency and the quality of visual-condition controllable generation in score-based generative models.

The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features to a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the side branch and its training efficiency remain underexplored. In this paper, we first revisit this mainstream paradigm through the lens of score-based generative modeling: 1) The main network preserves visual perceptual quality by providing a prior unconditional score. 2) The side network steers conditional control by implicitly contributing a likelihood score. Guided by this perspective, we propose LIkelihood Score Alignment (LISA), an effective regularization method that explicitly aligns the intermediate feature of the side network with an approximated likelihood score. Specifically, we first hook features from a designated layer of the side network and project them into the score latent space by a lightweight decoder. Then, we construct an approximated likelihood score target and calculate the distance between the decoder's output and this target as an additional regularization loss. Finally, we jointly optimize the side network and decoder with both standard diffusion loss and our regularization loss. Experiments across various image/video tasks, architectures, and diffusion/flow models demonstrated that LISA can not only consistently accelerate the training convergence and improve final synthetic results, but also encourage the side network's features to be more disentangled for conditional modeling with negligible additional training cost and zero extra inference cost.
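The score-based reading above follows from Bayes' rule applied to the noised data distribution. The block below is a sketch in notation the abstract does not define: the symbols x_t (noised sample), c (visual condition), p_t, h (hooked side-network feature), D_\phi (lightweight decoder), \hat{s} (approximated likelihood score target), d (a distance), and \lambda (loss weight) are all assumptions introduced here for illustration, and the second equation is only one plausible way to write the joint objective the abstract describes.

```latex
% Bayes' rule: p_t(x_t | c) = p_t(x_t) p_t(c | x_t) / p(c).
% Taking \nabla_{x_t} \log of both sides removes p(c), which does not depend on x_t:
\nabla_{x_t} \log p_t(x_t \mid c)
  = \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\text{prior unconditional score (main network)}}
  + \underbrace{\nabla_{x_t} \log p_t(c \mid x_t)}_{\text{likelihood score (side network)}}

% Assumed form of the jointly optimized objective (notation is hypothetical):
\mathcal{L}
  = \mathcal{L}_{\text{diffusion}}
  + \lambda \, d\big(D_\phi(h),\ \hat{s}\big)
```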

Original Article

Cached at: 06/26/26, 06:08 PM

Source: https://huggingface.co/papers/2606.27192

Abstract

Score-based generative modeling reveals that side networks contribute likelihood scores to conditional control, leading to improved training efficiency through likelihood score alignment regularization.

The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features to a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the side branch and its training efficiency remain underexplored. In this paper, we first revisit this mainstream paradigm through the lens of score-based generative modeling: 1) The main network preserves visual perceptual quality by providing a prior unconditional score. 2) The side network steers conditional control by implicitly contributing a likelihood score. Guided by this perspective, we propose LIkelihood Score Alignment (LISA), an effective regularization method that explicitly aligns the intermediate feature of the side network with an approximated likelihood score. Specifically, we first hook features from a designated layer of the side network and project them into the score latent space by a lightweight decoder. Then, we construct an approximated likelihood score target and calculate the distance between the decoder’s output and this target as an additional regularization loss. Finally, we jointly optimize the side network and decoder with both standard diffusion loss and our regularization loss. Experiments across various image/video tasks, architectures, and diffusion/flow models demonstrated that LISA can not only consistently accelerate the training convergence and improve final synthetic results, but also encourage the side network’s features to be more disentangled for conditional modeling with negligible additional training cost and zero extra inference cost.
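The three steps in the abstract (hook a feature, decode it, add a regularization loss) can be sketched in PyTorch as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `ToySideNet`, `lisa_losses`, the additive fusion of the side output, the weight `lam`, and in particular the target `noise - eps_uncond` are all hypothetical choices made here. The abstract does not say how the approximated likelihood score target is built; the difference between the noise target and the frozen network's unconditional prediction is used only because it is proportional to a score difference (up to sign and noise-level scaling) in a noise-prediction setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class ToySideNet(nn.Module):
    """Hypothetical stand-in for the trainable side network."""
    def __init__(self, ch=4, hidden=8):
        super().__init__()
        self.encode = nn.Conv2d(2 * ch, hidden, 3, padding=1)  # "designated layer" we hook
        self.out = nn.Conv2d(hidden, ch, 3, padding=1)

    def forward(self, x_t, cond):
        h = F.silu(self.encode(torch.cat([x_t, cond], dim=1)))
        return self.out(h)  # residual that is fused with the main network's output

# Frozen "pretrained" main network (a single conv here; timestep input omitted for brevity).
main_net = nn.Conv2d(4, 4, 3, padding=1).requires_grad_(False)
side_net = ToySideNet()
decoder = nn.Conv2d(8, 4, 1)  # lightweight decoder: hooked features -> score latent space

# Forward hook that stores the designated layer's output on every forward pass.
features = {}
handle = side_net.encode.register_forward_hook(
    lambda module, inputs, output: features.update(feat=output)
)

def lisa_losses(x0, cond, lam=0.5):
    """Return (diffusion loss, regularization loss, total). `lam` is an assumed weight."""
    noise = torch.randn_like(x0)
    alpha = torch.rand(x0.shape[0], 1, 1, 1)            # toy noise level per sample
    x_t = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
    with torch.no_grad():
        eps_uncond = main_net(x_t)                      # prior (unconditional) prediction
    eps_pred = eps_uncond + side_net(x_t, cond)         # assumed additive fusion
    diffusion_loss = F.mse_loss(eps_pred, noise)        # standard noise-prediction loss
    target = (noise - eps_uncond).detach()              # ASSUMED likelihood-score proxy
    reg_loss = F.mse_loss(decoder(features["feat"]), target)
    return diffusion_loss, reg_loss, diffusion_loss + lam * reg_loss

# Jointly optimize the side network and the decoder; the main network stays frozen.
opt = torch.optim.AdamW(list(side_net.parameters()) + list(decoder.parameters()), lr=1e-3)
x0, cond = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
diff_loss, reg_loss, total = lisa_losses(x0, cond)
opt.zero_grad()
total.backward()
opt.step()

handle.remove()  # at inference the hook and decoder are not needed
```

In this sketch the decoder and hook exist only during training, which is one way the "zero extra inference cost" claim could hold: once training ends, the side network runs exactly as it would without the regularizer.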

Similar Articles

Cross-scale Aligned Supervision for Training GANs

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

Goal-Conditioned Supervised Learning for LLM Fine-Tuning

Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text

Optimizing Visual Generative Models via Distribution-wise Rewards
