LISA: Likelihood Score Alignment for Visual-condition Controllable Generation
Summary
This paper introduces LISA, a regularization method that aligns the intermediate features of a side network with an approximated likelihood score to improve training efficiency and the quality of visual-condition controllable generation in score-based generative models.
Cached at: 06/26/26, 06:08 PM
Source: https://huggingface.co/papers/2606.27192
Abstract
Score-based generative modeling reveals that side networks contribute likelihood scores to conditional control, leading to improved training efficiency through likelihood score alignment regularization.
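The score-based reading of this claim follows from Bayes' rule. As a sketch, writing x_t for the noised sample at time t and c for the visual condition (this notation is assumed here and does not come from the page), the conditional score splits into an unconditional score plus a likelihood score, because log p_t(c) does not depend on x_t:

```latex
\[
\nabla_{x_t} \log p_t(x_t \mid c)
  = \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\text{unconditional score}}
  + \underbrace{\nabla_{x_t} \log p_t(c \mid x_t)}_{\text{likelihood score}}
\]
```

Under this view, the frozen main network is associated with the first term and the side network with the second.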
The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features to a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the side branch and its training efficiency remain underexplored. In this paper, we first revisit this mainstream paradigm through the lens of score-based generative modeling: 1) The main network preserves visual perceptual quality by providing a prior unconditional score. 2) The side network steers conditional control by implicitly contributing a likelihood score. Guided by this perspective, we propose LIkelihood Score Alignment (LISA), an effective regularization method that explicitly aligns the intermediate feature of the side network with an approximated likelihood score. Specifically, we first hook features from a designated layer of the side network and project them into the score latent space by a lightweight decoder. Then, we construct an approximated likelihood score target and calculate the distance between the decoder’s output and this target as an additional regularization loss. Finally, we jointly optimize the side network and decoder with both standard diffusion loss and our regularization loss. Experiments across various image/video tasks, architectures, and diffusion/flow models demonstrated that LISA can not only consistently accelerate the training convergence and improve final synthetic results, but also encourage the side network’s features to be more disentangled for conditional modeling with negligible additional training cost and zero extra inference cost.
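The joint objective described in the abstract can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the function names, the use of mean squared error as the distance, and the `reg_weight` value are hypothetical, and the abstract does not specify how the approximated likelihood score target is built, so it is passed in as a plain vector.

```python
from typing import Sequence, Tuple


def mean_squared_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared difference between two equal-length, non-empty vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if len(pred) == 0:
        raise ValueError("vectors must be non-empty")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def lisa_style_loss(
    diffusion_loss: float,
    decoded_feature: Sequence[float],
    score_target: Sequence[float],
    reg_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> Tuple[float, float]:
    """Combine a diffusion loss with an alignment regularizer.

    decoded_feature stands in for the decoder's projection of the hooked
    side-network feature; score_target stands in for the approximated
    likelihood score target. MSE is an assumed choice of distance.
    Returns (total_loss, regularization_loss).
    """
    reg = mean_squared_distance(decoded_feature, score_target)
    return diffusion_loss + reg_weight * reg, reg


# Toy usage with made-up numbers.
total, reg = lisa_style_loss(1.0, [0.0, 2.0], [0.0, 0.0], reg_weight=0.5)
print(total, reg)
```

In a real training loop both terms would be tensors and gradients would flow to the side network and the decoder only, since the main network is frozen; the decoder is needed only for the training-time regularizer, which is consistent with the zero extra inference cost stated above.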
Similar Articles
Cross-scale Aligned Supervision for Training GANs
This paper proposes CAT, a cross-scale aligned transformer that enforces consistency between intermediate and final GAN outputs to resolve trajectory misalignment, achieving state-of-the-art FID of 1.56 on ImageNet-256.
Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
This paper proposes a conformal prediction framework for LLMs that leverages internal representations rather than output-level statistics, introducing Layer-Wise Information (LI) scores as nonconformity measures to improve validity-efficiency trade-offs under distribution shift. The method demonstrates stronger robustness to calibration-deployment mismatch compared to text-level baselines across QA benchmarks.
Goal-Conditioned Supervised Learning for LLM Fine-Tuning
This paper proposes goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs, which treats feedback as an explicit goal and trains models via supervised learning with a novel goal formulation and natural-language goal representations. Evaluated on non-toxic generation, code generation, and recommendation, it outperforms standard offline baselines.
Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text
This paper addresses the degradation of likelihood-based machine-generated text detectors by identifying a Simpson's paradox in token-score aggregation. It proposes a learned local calibration step that significantly improves detection performance across various models and datasets.
Optimizing Visual Generative Models via Distribution-wise Rewards
This paper presents a reinforcement learning framework for visual generative models that uses distribution-wise rewards, with a subset-replace strategy for efficiency, improving image diversity and quality while addressing mode collapse and reward hacking.