DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

arXiv cs.LG 06/09/26, 04:00 AM Papers
Summary
DiffoR proposes a novel continuous generative framework for ordinal regression using diffusion models, overcoming limitations of discrete methods. Extensive experiments on 12 benchmarks demonstrate state-of-the-art performance across four domains.
arXiv:2606.07599v1 Announce Type: new Abstract: Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under the novel paradigm, we introduce DiffoR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffoR's consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.
# DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression
Source: [https://arxiv.org/html/2606.07599](https://arxiv.org/html/2606.07599)
\(2026\)

###### Abstract\.

Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data.

In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under this paradigm, we introduce DiffoR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffoR’s consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.

Ordinal Regression, Diffusion Models, Continuous Generative Learning

KDD ’26: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, August 09–13, 2026, Jeju Island, Republic of Korea. Copyright: CC. DOI: 10.1145/3770855.3818149. ISBN: 979-8-4007-2259-2/2026/08. CCS: Information systems → Learning to rank.

## 1. Introduction

![Refer to caption](https://arxiv.org/html/2606.07599v1/x1.png)

Figure 1. Conceptual comparison of (a) Naive Classification, (b) Ordinal Regression, and (c) Naive Regression. Ordinal Regression uniquely addresses data that is ordered but non-stationary, distinct from the nominal nature of classification and the equidistant metric of regression.

Table 1. Evolution of Ordinal Regression Paradigms. ‘L’ and ‘A’ denote ‘Limitation’ and ‘Advantage’, respectively.

| Paradigm | Core Mechanism | Target Representation | Intrinsic Limitation / Advantage |
| --- | --- | --- | --- |
| Naive Regression | Point-wise Mapping | Deterministic Point | L: Fails to capture complex ordinal distributions. |
| Space Discretization | Space Quantization | Hard Categories | L: Disregards inherent ordinal relationships. |
| Rank-based | Binary Decomposition | Independent Bits | L: Struggles with interval dependencies. |
| Discrete Generation | Token Autoregression | Quantized Tokens | L: Heuristic vocabulary and quantization error. |
| Continuous Generation | Manifold Diffusion | Continuous Distribution | A: Captures soft transitions & global topology. |

Predicting target values with inherent ordering, known as Ordinal Regression (OR), is instrumental across domains such as computer vision (e.g., facial age estimation (Niu et al., [2016b](https://arxiv.org/html/2606.07599#bib.bib42); Li et al., [2022b](https://arxiv.org/html/2606.07599#bib.bib80))), medical diagnosis (e.g., disease staging (Wang et al., [2023a](https://arxiv.org/html/2606.07599#bib.bib111))), and recommendation systems (e.g., preference ranking (Sun et al., [2024](https://arxiv.org/html/2606.07599#bib.bib1); Ma et al., [2024](https://arxiv.org/html/2606.07599#bib.bib2); Jin et al., [2026](https://arxiv.org/html/2606.07599#bib.bib7); Drachen et al., [2018](https://arxiv.org/html/2606.07599#bib.bib86))). As illustrated in Fig. [1](https://arxiv.org/html/2606.07599#S1.F1), unlike classification (which ignores order) or metric regression (which assumes equidistant intervals), OR must model non-stationary semantic boundaries (e.g., the perceptual gap between ages 20→21 is smaller than 60→61 in facial aging). This inherent non-uniformity defies rigid discretization (Wang et al., [2025](https://arxiv.org/html/2606.07599#bib.bib139)), a limitation we theoretically verify in Appendix [A.1](https://arxiv.org/html/2606.07599#A1.SS1) and [A.2](https://arxiv.org/html/2606.07599#A1.SS2).

From the perspective of paradigm evolution (summarized in Tab. [1](https://arxiv.org/html/2606.07599#S1.T1)), early approaches largely treat OR as a Naive Regression task (Rothe et al., [2015](https://arxiv.org/html/2606.07599#bib.bib45)), attempting to fit targets via point-wise mapping but often failing to capture complex ordinal distributions. Subsequently, Continuous Space Discretization (CSD) (Wang et al., [2025](https://arxiv.org/html/2606.07599#bib.bib139); Diaz and Marathe, [2019a](https://arxiv.org/html/2606.07599#bib.bib44)) becomes the dominant paradigm, simplifying regression into classification via space quantization. Its derivatives, Rank-based methods (Niu et al., [2016b](https://arxiv.org/html/2606.07599#bib.bib42); Sun et al., [2024](https://arxiv.org/html/2606.07599#bib.bib1)), encode ordinality via binary subtasks yet struggle with inherent interval independence and rigid predictions. Recently, analogous to Large Language Models (LLMs), Discrete Generative methods (Wang et al., [2023a](https://arxiv.org/html/2606.07599#bib.bib111); Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)) reformulate OR as sequence generation, but remain hampered by heuristic vocabulary design and inevitable quantization error. Crucially, as we theoretically show in Appendix [A.2](https://arxiv.org/html/2606.07599#A1.SS2) and [A.3](https://arxiv.org/html/2606.07599#A1.SS3), these discretization-based formulations introduce deterministic boundaries into an inherently continuous ordinal space. As a result, predictions are often fragmented locally, insensitive to ordinal distances, and blind to the global topology of the ordinal space, especially near semantic transition regions.
While a number of works attempt to enhance encoders (Wang et al., [2017](https://arxiv.org/html/2606.07599#bib.bib43); Yang et al., [2023](https://arxiv.org/html/2606.07599#bib.bib119); Zhang et al., [2025](https://arxiv.org/html/2606.07599#bib.bib3); Ma et al., [2025b](https://arxiv.org/html/2606.07599#bib.bib4), [a](https://arxiv.org/html/2606.07599#bib.bib5); Feng et al., [2026](https://arxiv.org/html/2606.07599#bib.bib6)) or tackle boundary ambiguity (Li et al., [2021](https://arxiv.org/html/2606.07599#bib.bib83); Shin et al., [2022](https://arxiv.org/html/2606.07599#bib.bib65)), these optimizations mostly address domain-specific symptoms rather than the intrinsic limitations of the underlying discrete paradigms, remaining inherently palliative and lacking cross-task generalization.

This situation poses a fundamental question: Is there a paradigm that can circumvent the intrinsic limitations of discretization while natively accommodating the strict ordinal dependencies and non-stationary boundaries of OR?

A natural alternative is to directly regress continuous targets. However, naive regression collapses the ordinal structure into a single scalar estimate, failing to explicitly represent hierarchical ordinal semantics or capture uncertainty across different scales. In practice, coarse ordinal levels (e.g., “which stage”) and fine-grained variations (e.g., “where within a stage”) are governed by distinct semantic cues, which are difficult to disentangle within a single scalar prediction.

In response to this question, this paper turns to Continuous Generative Models (Kingma and Welling, [2013](https://arxiv.org/html/2606.07599#bib.bib27); Ho et al., [2020a](https://arxiv.org/html/2606.07599#bib.bib129); Lipman et al., [2022](https://arxiv.org/html/2606.07599#bib.bib29)) and posits that their exceptional capability in characterizing complex probability density manifolds enables the adaptive learning of latent target distributions across diverse domains, offering a principled solution to model the continuous ordinal space. Unlike traditional discretization that enforces deterministic boundary delineation, this continuous generative capability enables the dynamic learning of soft semantic transitions, thereby accurately capturing the non-stationary gradations between adjacent ordinal values. To this end, we formally propose a novel paradigm, Continuous Generative Ordinal Regression, and introduce DiffoR, a unified diffusion-based framework that models ordinal targets as a structured multivariate continuous distribution via joint generation rather than discrete classification.

Unlike prior paradigms, we reformulate OR as a conditional continuous value recovery process, progressively estimating the target via iterative denoising. To inject ordinal priors and explicitly preserve the ordinal topology within the generation process, we devise Multi-scale Increment Aggregation, which models the prediction as a *multivariate continuous distribution*. This mechanism decouples the target into hierarchical continuous increments, spanning from global coarse-grained estimations to local fine-grained refinements. Instead of independent predictions, our model jointly generates all components from a shared latent representation, ensuring cross-scale ordinal dependency and global consistency. By dedicating distinct attention heads in the Transformer to encode features at specific scales, we leverage parallel attention subspaces to capture multi-granular ordinal dependencies, enabling efficient joint decoding.

Furthermore, mirroring the coarse-to-fine cognitive perception where macroscopic judgments (e.g., “elderly”) rely on low-frequency structures and microscopic estimations (e.g., “65 years”) demand high-frequency details, we introduce a Dynamic Denoising Perception strategy. Recognizing that distinct ordinal increments align with varying feature frequency spectra, we discard the unified timestep convention of standard diffusion models. By injecting scale-adaptive stochastic perturbations to decouple the denoising process, we enable the model to learn robust representations for coarse scales under high noise levels, while reserving low-noise steps for fine-grained calibration. This dynamic synchronization effectively aligns the denoising trajectory with the hierarchical semantics of the target.

Overall, the contributions of this paper are as follows:

- We provide a rigorous theoretical analysis that exposes the limitations inherent to existing paradigms and propose a novel paradigm that formalizes ordinal regression as a Continuous Generative task to bypass quantization limits.
- We develop DiffoR, a unified continuous regression framework tailored for OR that adapts diffusion generation to capture diverse latent ordinal spaces. By leveraging soft semantic transitions, it accurately captures the non-stationary gradations between adjacent values.
- We design a Dual-Decoupling Strategy, comprising Multi-scale Increment Aggregation to construct a structured ordinal space and Dynamic Denoising Perception to align semantic granularity with feature frequency. Theoretical analysis confirms that this design significantly enhances the model’s representation capability and interpretability.
- Extensive experiments across 12 ordinal regression benchmarks spanning four domains demonstrate DiffoR’s strong generalization and consistent superiority over state-of-the-art methods.

## 2. Related Work

### 2.1. Ordinal Regression (OR)

OR addresses prediction tasks with ordered targets and is widely applied in diverse domains such as facial age estimation (Niu et al., [2016a](https://arxiv.org/html/2606.07599#bib.bib128); Chen et al., [2017](https://arxiv.org/html/2606.07599#bib.bib115)), image aesthetic/quality assessment (He et al., [2022](https://arxiv.org/html/2606.07599#bib.bib135), [2023](https://arxiv.org/html/2606.07599#bib.bib81)), watch-time prediction (Sun et al., [2024](https://arxiv.org/html/2606.07599#bib.bib1); Lin et al., [2023](https://arxiv.org/html/2606.07599#bib.bib8); Ma et al., [2024](https://arxiv.org/html/2606.07599#bib.bib2); Jin et al., [2026](https://arxiv.org/html/2606.07599#bib.bib7)), and life-time value prediction (Wang et al., [2019](https://arxiv.org/html/2606.07599#bib.bib96); Li et al., [2022a](https://arxiv.org/html/2606.07599#bib.bib112); Weng et al., [2024](https://arxiv.org/html/2606.07599#bib.bib124)). The landscape of OR has witnessed a shift from continuous mapping to discrete modeling. Initial Naive Regression methods (Rothe et al., [2015](https://arxiv.org/html/2606.07599#bib.bib45)) utilize standard regression losses but struggle with non-uniform ordinal intervals. Continuous Space Discretization (CSD) (Wang et al., [2025](https://arxiv.org/html/2606.07599#bib.bib139); Diaz and Marathe, [2019a](https://arxiv.org/html/2606.07599#bib.bib44)) and Rank-based Methods (Niu et al., [2016b](https://arxiv.org/html/2606.07599#bib.bib42); Sun et al., [2024](https://arxiv.org/html/2606.07599#bib.bib1); Lin et al., [2023](https://arxiv.org/html/2606.07599#bib.bib8)) transform regression into classification or binary subtasks. Despite their popularity, they still suffer from inherent quantization artifacts and rigid boundary delineations that fragment the global ordinal space.
Discrete Generative Models (Wang et al., [2023a](https://arxiv.org/html/2606.07599#bib.bib111); Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)) emulate LLMs to generate ordinal sequences, yet remain bound by discrete vocabularies and heuristic tokenization. CLIP-based methods (Li et al., [2022c](https://arxiv.org/html/2606.07599#bib.bib161); Yu et al., [2024](https://arxiv.org/html/2606.07599#bib.bib164); Wang et al., [2023b](https://arxiv.org/html/2606.07599#bib.bib162); Du et al., [2024](https://arxiv.org/html/2606.07599#bib.bib163)) enhance semantic understanding via vision-language alignment, but they do not address the fundamental limitations of discretization. In contrast, DiffoR, the first Continuous Generative framework for OR, models the target distribution as a continuous manifold via diffusion and can thereby capture the soft, non-stationary transitions between ordinal values.

### 2.2. Generative Regression Modeling

Generative modeling has traditionally been bifurcated into discrete and continuous paradigms. Discrete Generative Models, exemplified by Large Language Models (LLMs) (Brown et al., [2020](https://arxiv.org/html/2606.07599#bib.bib104); Touvron et al., [2023](https://arxiv.org/html/2606.07599#bib.bib106); Wang et al., [2026](https://arxiv.org/html/2606.07599#bib.bib107); Xu et al., [2025a](https://arxiv.org/html/2606.07599#bib.bib108), [b](https://arxiv.org/html/2606.07599#bib.bib109)), excel in capturing sequential dependencies via autoregressive token prediction. While recently adapted for Ordinal Regression (Wang et al., [2023a](https://arxiv.org/html/2606.07599#bib.bib111); Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)), these methods are inherently constrained by heuristic vocabularies and quantization errors, limiting their estimation precision. Conversely, Continuous Generative Models, such as Diffusion (Ho et al., [2020a](https://arxiv.org/html/2606.07599#bib.bib129)) and Flow Matching (Lipman et al., [2022](https://arxiv.org/html/2606.07599#bib.bib29)), have surged in prominence due to their exceptional capability in modeling complex, high-dimensional probability density manifolds. Leveraging this capability, recent works have successfully extended diffusion models to standard classification (Han et al., [2022](https://arxiv.org/html/2606.07599#bib.bib110); Yang et al., [2023](https://arxiv.org/html/2606.07599#bib.bib119); Uliana and Krohling, [2025](https://arxiv.org/html/2606.07599#bib.bib122); Chen et al., [2023](https://arxiv.org/html/2606.07599#bib.bib125)) and regression (Zhao et al., [2019](https://arxiv.org/html/2606.07599#bib.bib114)), demonstrating superior robustness and uncertainty quantification.
However, the application of continuous generation to Ordinal Regression remains unexplored, primarily due to the difficulty of capturing strict ordinal relationships and non-stationary semantic boundaries. This work addresses both issues.

## 3. Preliminaries

### 3.1. Conditional Diffusion Models

Diffusion Models (Ho et al., [2020b](https://arxiv.org/html/2606.07599#bib.bib151)) learn complex data distributions by simulating a gradual noise injection and removal process. In this work, we focus on *conditional latent diffusion models*, where the generation process is guided by auxiliary conditions $\mathbf{x}$ (i.e., the input features).

#### 3.1.1. Forward Diffusion Process

Given an initial latent representation $\mathbf{z}_0$, Gaussian noise is progressively injected through a Markov chain controlled by a predefined variance schedule $\beta_t \in (0,1)$:

$$
q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}\!\left(\mathbf{z}_t;\ \sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\ \beta_t\mathbf{I}\right). \tag{1}
$$

Let $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$. Using the reparameterization trick, the latent variable at time step $t$ can be expressed as:

$$
\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I}). \tag{2}
$$

When the total number of diffusion steps $T$ becomes sufficiently large, $\mathbf{z}_T$ approaches an isotropic Gaussian distribution.
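As a minimal sketch of Eq. 2, the closed-form noising step can be written in a few lines of NumPy. The linear schedule, its endpoints, the value of `T`, and the helper names `linear_beta_schedule` and `q_sample` are illustrative assumptions, not choices stated in the paper.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Variance schedule beta_1..beta_T (endpoints are illustrative assumptions)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t ~ q(z_t | z_0) in closed form (Eq. 2); t is 1-indexed."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps, eps

T = 1000  # assumed number of diffusion steps
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t = prod_{i<=t} alpha_i

rng = np.random.default_rng(0)
z0 = rng.standard_normal(8)  # stand-in for a latent z_0
z_T, _ = q_sample(z0, T, alpha_bar, rng)
```

With this particular schedule `alpha_bar[-1]` is close to zero, so `z_T` is dominated by the noise term, which matches the isotropic-Gaussian limit described above.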

#### 3.1.2. Reverse Denoising Process

The reverse process aims to learn a parameterized conditional distribution

$$
p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x}), \tag{3}
$$

which iteratively removes noise from $\mathbf{z}_t$ conditioned on $\mathbf{x}$. At each time step, the model predicts either the injected noise or the original latent representation. When predicting noise, the reverse update takes the form:

$$
\mathbf{z}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{z}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{z}_t,\mathbf{x},t)\right) + \sigma_t\mathbf{v}, \tag{4}
$$

where $\mathbf{v} \sim \mathcal{N}(0,\mathbf{I})$ and $\sigma_t$ controls the stochasticity of the sampling process.
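The update in Eq. 4 can be sketched as a single function. Two details below are assumptions, since the text leaves them open: $\sigma_t = \sqrt{\beta_t}$ (one common choice) and no noise added at the last step $t = 1$. The noise estimate `eps_hat` stands in for the output of $\epsilon_\theta$; here the true noise is passed in, purely to check the algebra.

```python
import numpy as np

def reverse_step(z_t, eps_hat, t, betas, alpha_bar, rng):
    """One reverse update following Eq. 4; t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    mean = (z_t - beta_t / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat) / np.sqrt(alpha_t)
    if t == 1:
        return mean  # assumption: the final step is deterministic
    sigma_t = np.sqrt(beta_t)  # assumption: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(z_t.shape)

betas = np.linspace(1e-4, 2e-2, 1000)  # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

z0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
z1 = np.sqrt(alpha_bar[0]) * z0 + np.sqrt(1.0 - alpha_bar[0]) * eps  # Eq. 2 at t = 1

# With the true noise as the estimate, the step from t = 1 recovers z0 exactly.
z0_rec = reverse_step(z1, eps, 1, betas, alpha_bar, rng)
```

In a full sampler this function would be called for $t = T, \dots, 1$ with `eps_hat` produced by the trained denoiser.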

## 4. Methodology

This section instantiates the paradigm described in the Introduction: we treat ordinal regression as continuous generative ordinal regression over an ordered latent space. DiffoR maps $\mathbf{x}$ to conditional features $\mathbf{c} = E_\psi(\mathbf{x})$, runs a conditional diffusion denoiser $\epsilon_\theta(\mathbf{z}_t, \mathbf{c}, t)$ to obtain denoised latents $\{\hat{\mathbf{z}}_0(t)\}$, and decodes an *order-aware* increment vector $\hat{\mathbf{B}}$ whose sum yields the final prediction $\hat{y}$. Our key structural novelty is spatio-temporal decoupling, implemented by two modules: MIA (Multi-scale Increment Aggregation) *spatially* decomposes the target into $S$ ordered continuous increments and assigns each increment to a dedicated head; DDP (Dynamic Denoising Perception) *temporally* aligns each head to a dedicated denoising timestep $t_k$, so coarse-to-fine semantics are decoded from noise levels matched to feature frequency. The rest of this section introduces the increment target space, the conditional diffusion backbone, the dual-decoupled decoding rule, and the training/inference procedures.

##### Overall architecture

DiffoR consists of: (i) an encoder $E_\psi$ producing conditional features $\mathbf{c} = E_\psi(\mathbf{x})$; (ii) a diffusion denoiser (backbone) $\epsilon_\theta(\mathbf{z}_t, \mathbf{c}, t)$; (iii) $S$ scale-specific decoding heads $\{\mathcal{D}_k\}_{k=1}^{S}$; and (iv) an order-preserving aggregation rule.

As illustrated in Fig. [2](https://arxiv.org/html/2606.07599#S4.F2), the overall pipeline is: (i) compute $\mathbf{c} = E_\psi(\mathbf{x})$; (ii) perform conditional diffusion to obtain denoised latent estimates $\{\hat{\mathbf{z}}_0(t)\}$; (iii) decode multi-scale increments $\hat{\mathbf{B}}$ from aligned timesteps; (iv) aggregate $\hat{\mathbf{B}}$ to produce the final ordinal prediction $\hat{y}$. Crucially, DiffoR follows a Dual-Decoupling design (matching the Contributions): spatially, *Multi-scale Increment Aggregation (MIA)* decomposes targets into hierarchical continuous increments and assigns each increment to a dedicated head; temporally, *Dynamic Denoising Perception (DDP)* *synchronizes denoising steps with semantic granularity / feature frequency*, so coarse semantics are perceived under higher noise while fine semantics are calibrated under lower noise. This design ensures robust coarse-to-fine refinement and yields provable guarantees (Sec. [4.6](https://arxiv.org/html/2606.07599#S4.SS6)).

### 4.1. Problem Formulation for Multi-scale Increment Representation

Given a normalized ordinal target $y \in [0,1]$, we divide the target range into $S$ ordered intervals with uniform width $\Delta = 1/S$. With this partition, we represent each target using an ordered $S$-dimensional continuous vector

$$
\mathbf{B} = [b_1, b_2, \dots, b_S]^{\top}, \tag{5}
$$

where each component corresponds to the *usable completion* of $y$ within the $k$-th ordinal interval. We use the explicit construction

$$
b_k = \min\Big\{\max\big(y-(k-1)\Delta,\ 0\big),\ \Delta\Big\}, \qquad k = 1,\dots,S, \tag{6}
$$

which guarantees $b_k \in [0,\Delta]$ and yields an exact additive reconstruction:

$$
y = \sum_{k=1}^{S} b_k. \tag{7}
$$

This representation induces a hierarchical structure: lower-index components encode coarse ordinal completion, while higher-index components remain sensitive near semantic transition regions.
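To make Eqs. 6 and 7 concrete, the sketch below builds the increment vector for one target and checks the additive reconstruction. The function name `to_increments` and the example values $y = 0.62$, $S = 4$ are illustrative assumptions.

```python
import numpy as np

def to_increments(y, S):
    """Eq. 6: b_k = min(max(y - (k-1)*Delta, 0), Delta) for k = 1..S."""
    delta = 1.0 / S
    k = np.arange(1, S + 1)
    return np.clip(y - (k - 1) * delta, 0.0, delta)

B = to_increments(0.62, S=4)
# The first two intervals are full, the third is partly filled, the fourth is empty.
print(B)
print(B.sum())  # Eq. 7: the increments add back up to y (up to float rounding)
```

For $y = 0.62$ and $\Delta = 0.25$ the vector is approximately $[0.25, 0.25, 0.12, 0]$: every interval below the target is saturated at $\Delta$, every interval above it is zero, and only the interval containing $y$ carries a fractional value.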

### 4.2. Conditional Diffusion Backbone with Encoder and Denoiser

To model the conditional distribution over $\mathbf{B}$, we adopt a conditional diffusion framework. Given input $\mathbf{x}$, we first encode it as $\mathbf{c} = E_\psi(\mathbf{x})$. We assume the structured ordinal information is embedded in a latent variable $\mathbf{z}_0$ on a continuous ordinal manifold, and we learn a conditional denoiser to recover $\mathbf{z}_0$ from its noised versions. Since $\mathbf{c}$ is a deterministic function of $\mathbf{x}$, conditioning $\epsilon_\theta(\cdot)$ on $\mathbf{c}$ is equivalent to conditioning on $\mathbf{x}$ as in standard conditional diffusion notation. Using the standard reparameterization, the noised latent at time $t$ can be written as

$$
\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I}). \tag{8}
$$

We parameterize the reverse model by *noise prediction* $\epsilon_\theta(\mathbf{z}_t, \mathbf{c}, t)$ and form the standard estimator

$$
\hat{\mathbf{z}}_0(t) = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{z}_t, \mathbf{c}, t)\right). \tag{9}
$$
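The estimator in Eq. 9 is simply Eq. 8 solved for $\mathbf{z}_0$. The following sketch checks this numerically; the noise level `alpha_bar_t = 0.3` and the perturbation of `0.1` are arbitrary illustrative values, and `eps_hat` stands in for the denoiser output.

```python
import numpy as np

def predict_z0(z_t, eps_hat, alpha_bar_t):
    """Eq. 9: invert the forward reparameterization given a noise estimate."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
z0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
alpha_bar_t = 0.3  # arbitrary illustrative noise level
z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps  # Eq. 8

z0_exact = predict_z0(z_t, eps, alpha_bar_t)        # oracle noise: recovers z0
z0_noisy = predict_z0(z_t, eps + 0.1, alpha_bar_t)  # imperfect noise estimate
```

An error in the noise estimate is scaled by $\sqrt{(1-\bar{\alpha}_t)/\bar{\alpha}_t}$ in $\hat{\mathbf{z}}_0(t)$, so estimates taken at noisier timesteps are coarser. This is consistent with the later choice of reading coarse increments from high-noise states and fine increments from low-noise states.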
![Refer to caption](https://arxiv.org/html/2606.07599v1/x2.png)

Figure 2. Overview of the proposed Generative Ordinal Regression framework DiffoR. (a) The overall pipeline operates as a conditional generative process. It consists of a forward diffusion process that perturbs the label $y$ into a latent space $\mathbf{h}_T$, and a reverse denoising process that recovers $\hat{y}$ using features from a task-specific encoder. (b) The detailed architecture of the Denoising Module with Multi-scale Increment Aggregation and Dynamic Denoising Perception.

### 4.3. Dual-Decoupling Strategy

#### 4.3.1. Spatial decoupling: Multi-scale Increment Aggregation (MIA)

MIA implements the spatial decoupling described in the Introduction: instead of predicting $y$ with a single head, we decode the increment vector $\hat{\mathbf{B}}$ using $S$ parallel *scale-specific heads* $\{\mathcal{D}_1, \dots, \mathcal{D}_S\}$, each specializing in one ordinal increment $b_k$. Each head predicts a bounded scalar increment in $[0,\Delta]$ by

$$
\hat{b}_k = \Delta \cdot \sigma\!\left(\mathrm{MLP}_k(\mathbf{h}_k)\right), \qquad k = 1,\dots,S, \tag{10}
$$

where $\sigma(\cdot)$ denotes the sigmoid function. Importantly, the heads are not forced to share the same denoising state; instead, DDP below specifies the scale-specific input $\mathbf{h}_k$ as an aligned denoising state. This design is motivated by the non-stationary nature of ordinal semantics: coarse-level decisions (e.g., “which interval”) and fine-level calibration (e.g., “where within the interval”) are governed by different cues and should not be entangled into a single scalar readout. Representing $y$ as an *ordered sum* of bounded increments (Eq. [6](https://arxiv.org/html/2606.07599#S4.E6)) has three benefits: each head becomes interpretable and ablatable at the interval level; the bounded range $\hat{b}_k \in [0,\Delta]$ stabilizes decoding and facilitates the order-preserving aggregation (Eq. [17](https://arxiv.org/html/2606.07599#S4.E17)); and the multi-head structure encourages specialization across semantic scales under a fixed overall model size. All heads share the same conditional diffusion backbone and differ only in lightweight head parameters (the $\mathrm{MLP}_k$’s), so the additional cost is $O(S)$ small MLPs on top of one shared denoiser.
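A minimal sketch of the bounded heads in Eq. 10 is given below. The two-layer ReLU MLP, its sizes, the random initialization, and the class name `IncrementHead` are all hypothetical; the random vectors stand in for the states $\mathbf{h}_k$; and a plain sum is used as a stand-in for the aggregation rule of Eq. 17, which is not reproduced in this section.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class IncrementHead:
    """Hypothetical two-layer MLP head that outputs one increment in (0, Delta)."""

    def __init__(self, dim, hidden, delta, rng):
        self.W1 = rng.standard_normal((dim, hidden)) / np.sqrt(dim)
        self.b1 = np.zeros(hidden)
        self.w2 = rng.standard_normal(hidden) / np.sqrt(hidden)
        self.b2 = 0.0
        self.delta = delta

    def __call__(self, h):
        u = np.maximum(h @ self.W1 + self.b1, 0.0) @ self.w2 + self.b2
        return self.delta * sigmoid(u)  # Eq. 10

S, dim = 4, 16
rng = np.random.default_rng(0)
heads = [IncrementHead(dim, 32, 1.0 / S, rng) for _ in range(S)]
h = [rng.standard_normal(dim) for _ in range(S)]  # stand-ins for the states h_k
B_hat = np.array([head(h_k) for head, h_k in zip(heads, h)])
y_hat = B_hat.sum()  # assumed aggregation: plain sum of increments
```

Because every head is squashed into $(0, \Delta)$ and $S\Delta = 1$, the summed prediction stays inside the normalized target range regardless of the head weights.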

#### 4.3.2. Temporal decoupling: Dynamic Denoising Perception (DDP)

DDP implements the temporal decoupling described in the Introduction: it *synchronizes denoising steps with semantic granularity / feature frequency*. Coarse semantics are dominated by low-frequency structures and are robust under higher noise, while fine semantics require high-frequency details and are better calibrated under lower noise. Accordingly, DDP aligns semantic granularity to denoising noise levels by assigning each scale head $k$ a dedicated timestep $t_k$:

$$
t_1 > t_2 > \cdots > t_S, \qquad t_k \in \{1,\dots,T\}. \tag{11}
$$

Smaller $k$ corresponds to coarser increments and uses larger noise (larger $t_k$), while larger $k$ corresponds to finer increments and uses smaller noise (smaller $t_k$). The schedule $\{t_k\}$ is a simple hyper-parameter (e.g., linear or log-spaced over $[1,T]$). In our implementation we use a monotone spacing over $[1,T]$, e.g., a simple linear map,

$$
t_k = 1 + \left\lfloor \frac{S-k}{S-1}(T-1) \right\rfloor, \qquad k = 1,\dots,S, \tag{12}
$$

and we also consider log-spaced schedules in ablations. This explicit mapping makes DDP fully reproducible.
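The linear map in Eq. 12 is short enough to evaluate directly. The sketch below uses integer arithmetic for the floor; the function name `aligned_timesteps` and the example values $S = 4$, $T = 1000$ are illustrative assumptions.

```python
def aligned_timesteps(S, T):
    """Eq. 12: t_k = 1 + floor((S - k) / (S - 1) * (T - 1)) for k = 1..S."""
    if S < 2:
        raise ValueError("Eq. 12 divides by S - 1, so it needs S >= 2")
    # Integer arithmetic gives the floor exactly and avoids float rounding.
    return [1 + ((S - k) * (T - 1)) // (S - 1) for k in range(1, S + 1)]

print(aligned_timesteps(S=4, T=1000))  # [1000, 667, 334, 1]
```

The coarsest head ($k = 1$) is read at the noisiest step $T$ and the finest head ($k = S$) at step 1. Note that the strict ordering required by Eq. 11 only holds when $T \ge S$; with fewer timesteps than heads, the floor produces ties.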

Each head consumes its aligned denoised estimate:

$$
\mathbf{h}_k = \hat{\mathbf{z}}_0(t_k), \qquad \hat{b}_k = \Delta \cdot \sigma\!\left(\mathrm{MLP}_k(\mathbf{h}_k)\right), \qquad k = 1,\dots,S. \tag{13}
$$

This realizes temporal decoupling: different semantic scales are decoded from different denoising states rather than from a single shared timestep representation. During inference, we cache $\hat{\mathbf{z}}_0(t_k)$ when the denoising chain reaches $t_k$ and decode $\hat{b}_k$ by Eq. [13](https://arxiv.org/html/2606.07599#S4.E13) (one pass, negligible overhead). Besides the standard diffusion loss sampled at a random timestep $t$, we explicitly supervise the scale heads at their aligned timesteps $\{t_k\}$ via $\mathcal{L}_{\mathrm{mia}}$ (Eq. [13](https://arxiv.org/html/2606.07599#S4.E13)). To emphasize DDP during diffusion training, we use a mixture timestep sampler

$$
p(t) = (1-\beta)\cdot\mathrm{Unif}\{1,\dots,T\} + \beta\cdot\frac{1}{S}\sum_{k=1}^{S}\delta_{t_k}(t), \tag{14}
$$

where $\mathrm{Unif}\{1,\dots,T\}$ denotes the uniform distribution over all timesteps and $\delta_{t_k}(t)$ denotes a point mass at $t = t_k$ (i.e., sampling $t_k$ with probability $1$). Equivalently, Eq. [14](https://arxiv.org/html/2606.07599#S4.E14) can be implemented as follows: with probability $1-\beta$, sample $t$ uniformly from $\{1,\dots,T\}$; with probability $\beta$, sample $t$ uniformly from the aligned set $\{t_1,\dots,t_S\}$. The purpose of this mixture is twofold: (i) the uniform term maintains global coverage of the denoising trajectory, preserving the standard diffusion training behavior; (ii) the aligned-set term increases the training frequency of the DDP-critical timesteps, making the denoised states $\{\hat{\mathbf{z}}_0(t_k)\}$ more reliable for coarse-to-fine decoding. The hyper-parameter $\beta \in [0,1]$ (a mixing weight, distinct from the variance schedule $\beta_t$) controls the trade-off between global trajectory learning ($\beta\!\downarrow$) and aligned-step emphasis for DDP ($\beta\!\uparrow$). The complete training and inference procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.07599#alg1).
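The two-branch description of Eq. 14 translates directly into a batched sampler. In this sketch the function name `sample_timesteps`, the mixing weight `beta=0.3`, and the aligned set (taken from a linear schedule with $S = 4$, $T = 1000$) are illustrative assumptions rather than settings reported in the paper.

```python
import numpy as np

def sample_timesteps(n, T, aligned, beta, rng):
    """Draw n timesteps from the mixture in Eq. 14."""
    uniform = rng.integers(1, T + 1, size=n)  # Unif{1, ..., T}
    focused = rng.choice(aligned, size=n)     # uniform over {t_1, ..., t_S}
    use_aligned = rng.random(n) < beta        # pick the aligned branch w.p. beta
    return np.where(use_aligned, focused, uniform)

rng = np.random.default_rng(0)
aligned = np.array([1000, 667, 334, 1])  # assumed aligned set {t_k}
t = sample_timesteps(100_000, T=1000, aligned=aligned, beta=0.3, rng=rng)
frac_aligned = np.isin(t, aligned).mean()
```

The fraction of draws that land on an aligned timestep should be close to $\beta + (1-\beta)\,S/T$, since the uniform branch also hits the aligned set occasionally. Setting `beta=0` recovers standard uniform timestep sampling.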

A standard conditional diffusion regressor typically decodes the prediction from a *single* denoising state (often the final sample), i.e.,

$$\text{(single-time, single-head)}\qquad\hat{y}_{\mathrm{std}} \;=\; g\!\left(\hat{\mathbf{z}}_{0}(t_{\star}),\,\mathbf{c}\right), \tag{15}$$

for some fixed $t_{\star}$ (or using the final denoised output). In contrast, DiffoR jointly uses *multiple* aligned timesteps $\{t_{k}\}$ and *multiple* scale heads to generate a structured increment vector and then aggregates it as follows:

$$\text{(DiffoR)}\quad\hat{b}_{k}=\mathcal{D}_{k}\!\left(\hat{\mathbf{z}}_{0}(t_{k})\right),\quad\hat{\mathbf{B}}=[\hat{b}_{1},\dots,\hat{b}_{S}]^{\top},\quad\hat{y}=\mathrm{Agg}(\hat{\mathbf{B}}), \tag{16}$$

matching the spatio-temporal decoupling advocated in Sec. [1](https://arxiv.org/html/2606.07599#S1).

Algorithm 1. DiffoR: training and inference with scale–time aligned decoding

1. Inputs: sample $(\mathbf{x},y)$, intervals $S$ (width $\Delta=1/S$), aligned timesteps $\{t_{k}\}_{k=1}^{S}$, weights $\lambda$
2. Compute conditional features $\mathbf{c}\leftarrow E_{\psi}(\mathbf{x})$
3. Construct targets $b_{k}\leftarrow\min\{\max(y-(k-1)\Delta,0),\Delta\}$ for $k=1,\dots,S$ (Eq. [6](https://arxiv.org/html/2606.07599#S4.E6))
4. Sample $\bm{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$ and sample a timestep $t$ for diffusion training
5. Form $\mathbf{z}_{t}\leftarrow\sqrt{\bar{\alpha}_{t}}\,\mathbf{z}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon}$ (forward diffusion)
6. Predict noise $\hat{\bm{\epsilon}}\leftarrow\epsilon_{\theta}(\mathbf{z}_{t},\mathbf{c},t)$ and compute $\mathcal{L}_{\mathrm{diff}}\leftarrow\|\bm{\epsilon}-\hat{\bm{\epsilon}}\|_{2}^{2}$
7. for $k=1$ to $S$ do
8. Form $\mathbf{z}_{t_{k}}\leftarrow\sqrt{\bar{\alpha}_{t_{k}}}\,\mathbf{z}_{0}+\sqrt{1-\bar{\alpha}_{t_{k}}}\,\bm{\epsilon}$ (reuse $\bm{\epsilon}$)
9. Compute $\hat{\mathbf{z}}_{0}(t_{k})\leftarrow\frac{1}{\sqrt{\bar{\alpha}_{t_{k}}}}\big(\mathbf{z}_{t_{k}}-\sqrt{1-\bar{\alpha}_{t_{k}}}\,\epsilon_{\theta}(\mathbf{z}_{t_{k}},\mathbf{c},t_{k})\big)$
10. Decode increment $\hat{b}_{k}\leftarrow\Delta\cdot\sigma(\mathrm{MLP}_{k}(\hat{\mathbf{z}}_{0}(t_{k})))$ (Eq. [13](https://arxiv.org/html/2606.07599#S4.E13))
11. end for
12. Compute $\mathcal{L}_{\mathrm{mia}}\leftarrow\sum_{k=1}^{S}\|b_{k}-\hat{b}_{k}\|_{2}^{2}$ and total loss $\mathcal{L}\leftarrow\mathcal{L}_{\mathrm{diff}}+\lambda\mathcal{L}_{\mathrm{mia}}$
13. Inference: run the reverse denoising chain; cache $\hat{\mathbf{z}}_{0}(t_{k})$ at aligned timesteps; decode $\hat{b}_{k}$; apply truncation (Eq. [17](https://arxiv.org/html/2606.07599#S4.E17)) to obtain $\hat{y}$
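The training part of Algorithm 1 (steps 3 to 12) can be sketched for a scalar latent as below. This is a toy sketch under several assumptions that are not in the paper: the latent $z_0$ is a single float, the conditioning $\mathbf{c}$ is omitted, the target $y$ is already normalized to $[0,1]$, and `eps_model`, `heads`, and `abar` are hypothetical stand-ins for the noise predictor $\epsilon_\theta$, the scale heads $\mathrm{MLP}_k$, and the schedule $\bar{\alpha}_t$.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def increment_targets(y, S):
    """Step 3: b_k = min{max(y - (k-1)*Delta, 0), Delta}, with Delta = 1/S."""
    delta = 1.0 / S
    return [min(max(y - (k - 1) * delta, 0.0), delta) for k in range(1, S + 1)]


def forward_diffuse(z0, eps, abar_t):
    """Steps 5 and 8: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    return math.sqrt(abar_t) * z0 + math.sqrt(1.0 - abar_t) * eps


def recover_z0(z_t, eps_hat, abar_t):
    """Step 9: invert the forward process using the predicted noise."""
    return (z_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)


def training_loss(z0, y, abar, t, aligned, eps_model, heads, lam, rng):
    """One-sample loss L = L_diff + lam * L_mia (steps 3-12, scalar toy)."""
    S = len(aligned)
    delta = 1.0 / S
    b = increment_targets(y, S)
    eps = rng.gauss(0.0, 1.0)                      # step 4
    z_t = forward_diffuse(z0, eps, abar[t])        # step 5
    loss_diff = (eps - eps_model(z_t, t)) ** 2     # step 6
    loss_mia = 0.0
    for k, t_k in enumerate(aligned):              # steps 7-11
        z_tk = forward_diffuse(z0, eps, abar[t_k])  # same eps is reused
        z0_hat = recover_z0(z_tk, eps_model(z_tk, t_k), abar[t_k])
        b_hat = delta * sigmoid(heads[k](z0_hat))  # Eq. 13
        loss_mia += (b[k] - b_hat) ** 2
    return loss_diff + lam * loss_mia              # step 12


if __name__ == "__main__":
    abar = {10: 0.9, 50: 0.5, 90: 0.1}             # hypothetical schedule values
    heads = [lambda h: h, lambda h: 2.0 * h, lambda h: -h]
    print(increment_targets(0.6, 4))
    print(training_loss(0.3, 0.6, abar, 50, [90, 50, 10],
                        lambda z, t: 0.0, heads, 0.5, random.Random(0)))
```

For $y\in[0,1]$ the targets from step 3 sum back to $y$, which is why the additive aggregation at inference can recover the ordinal value.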

### 4\.4\.Order\-preserving Aggregation and Inference

In inference, DiffoR runs the reverse process and obtains the aligned denoised estimates $\{\hat{\mathbf{z}}_{0}(t_{k})\}_{k=1}^{S}$ along the denoising trajectory, then decodes $\hat{\mathbf{B}}$ using ([13](https://arxiv.org/html/2606.07599#S4.E13)). The final ordinal prediction is obtained via an additive aggregation with deterministic truncation:

$$\begin{aligned}
\tilde{b}_{1} &:= \min\{\hat{b}_{1},\Delta\},\\
\tilde{b}_{k} &:= \min\Big\{\hat{b}_{k},\,\Delta,\,\big[1-\sum_{j=1}^{k-1}\tilde{b}_{j}\big]_{+}\Big\},\qquad k\geq 2,\\
\hat{y} &:= \sum_{k=1}^{S}\tilde{b}_{k},
\end{aligned} \tag{17}$$

where $[\cdot]_{+}=\max\{\cdot,0\}$. Intuitively, the nested $\min\{\cdot\}$ enforces feasibility of the increment representation by clipping each predicted increment to both its interval range ($[0,\Delta]$) and the remaining ordinal capacity, ensuring $\tilde{b}_{k}\in[0,\Delta]$ and $\hat{y}\in[0,1]$. This mapping prevents later increments from exceeding the remaining ordinal capacity. Note that this dual-decoupled formulation is not merely heuristic. Sec. [4.6](https://arxiv.org/html/2606.07599#S4.SS6) shows that spatial decoupling can mitigate representation collapse and tighten ordinal error bounds, while temporal decoupling via multi-time inference yields non-worse (and potentially tighter) ordinal guarantees.
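The truncation in Eq. 17 is a short sequential loop. The sketch below is a minimal illustration; the function name `aggregate` and the example inputs are assumptions, and the increments are assumed non-negative, as they would be when produced by $\Delta\cdot\sigma(\cdot)$.

```python
def aggregate(b_hat, delta):
    """Apply the truncation of Eq. 17 and return (clipped increments, y_hat)."""
    clipped = []
    total = 0.0
    for k, b in enumerate(b_hat):
        if k == 0:
            b_tilde = min(b, delta)                  # first line of Eq. 17
        else:
            remaining = max(1.0 - total, 0.0)        # [1 - sum_{j<k} b~_j]_+
            b_tilde = min(b, delta, remaining)       # second line of Eq. 17
        clipped.append(b_tilde)
        total += b_tilde
    return clipped, total


if __name__ == "__main__":
    # Hypothetical decoded increments for S = 4, so Delta = 0.25.
    print(aggregate([0.3, 0.2, 0.1, 0.05], 0.25))
    # A wider Delta than 1/S, chosen only to make the capacity term bind.
    print(aggregate([0.5, 0.4, 0.3, 0.2], 0.5))
```

In the first call only the interval clip is active. When $\Delta=1/S$ exactly, the first $k-1$ clipped increments sum to at most $(k-1)\Delta$, so the remaining capacity is never smaller than $\Delta$; the second call therefore uses a wider $\Delta$ purely to show the capacity term taking effect.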

Table 2. Results of Image Aesthetics Assessment on four datasets.

| Dataset | Metric | RAPID (Lu et al., [2014](https://arxiv.org/html/2606.07599#bib.bib98)) | AADB (Kong et al., [2016](https://arxiv.org/html/2606.07599#bib.bib101)) | PAM (Ren et al., [2017](https://arxiv.org/html/2606.07599#bib.bib89)) | NIMA (Talebi and Milanfar, [2018](https://arxiv.org/html/2606.07599#bib.bib143)) | TANet (He et al., [2022](https://arxiv.org/html/2606.07599#bib.bib135)) | Mamba (Gao et al., [2024](https://arxiv.org/html/2606.07599#bib.bib131)) | GoR (Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)) | DiffoR |
|---|---|---|---|---|---|---|---|---|---|
| ICAA17K | MAE ↓ | 0.742 | 0.714 | 0.707 | 0.696 | 0.679 | 0.613 | 0.584 | 0.552 |
| ICAA17K | XAUC ↑ | 0.642 | 0.666 | 0.673 | 0.684 | 0.701 | 0.766 | 0.791 | 0.841 |
| ICAA17K | LCC ↑ | 0.516 | 0.531 | 0.539 | 0.546 | 0.559 | 0.614 | 0.682 | 0.759 |
| ICAA17K | SRCC ↑ | 0.508 | 0.520 | 0.525 | 0.533 | 0.547 | 0.629 | 0.679 | 0.765 |
| TAD66K | MAE ↓ | 1.766 | 1.463 | 1.314 | 1.422 | 1.081 | 1.035 | 0.996 | 0.914 |
| TAD66K | XAUC ↑ | 0.510 | 0.523 | 0.534 | 0.511 | 0.649 | 0.666 | 0.677 | 0.694 |
| TAD66K | LCC ↑ | 0.332 | 0.400 | 0.440 | 0.405 | 0.452 | 0.482 | 0.541 | 0.553 |
| TAD66K | SRCC ↑ | 0.314 | 0.379 | 0.422 | 0.390 | 0.428 | 0.468 | 0.513 | 0.526 |
| AVA | MAE ↓ | 0.978 | 0.784 | 0.614 | 0.715 | 0.577 | 0.522 | 0.395 | 0.388 |
| AVA | XAUC ↑ | 0.513 | 0.534 | 0.619 | 0.532 | 0.659 | 0.697 | 0.751 | 0.766 |
| AVA | LCC ↑ | 0.336 | 0.431 | 0.531 | 0.472 | 0.568 | 0.663 | 0.726 | 0.734 |
| AVA | SRCC ↑ | 0.327 | 0.408 | 0.521 | 0.447 | 0.554 | 0.656 | 0.701 | 0.725 |
| SPAQ | MAE ↓ | 1.089 | 1.083 | 1.073 | 1.076 | 1.047 | 0.988 | 0.872 | 0.768 |
| SPAQ | XAUC ↑ | 0.700 | 0.704 | 0.710 | 0.708 | 0.728 | 0.752 | 0.765 | 0.805 |
| SPAQ | LCC ↑ | 0.657 | 0.665 | 0.669 | 0.671 | 0.684 | 0.726 | 0.743 | 0.799 |
| SPAQ | SRCC ↑ | 0.613 | 0.616 | 0.622 | 0.620 | 0.638 | 0.690 | 0.723 | 0.785 |

- \*The best and second best results are marked in bold and underline, respectively. ↑ indicates that higher values are better, while ↓ indicates the opposite.

### 4\.5\.Optimization Objective

DiffoR is trained by jointly optimizing the diffusion denoising objective and the multi\-scale increment supervision\.

#### 4\.5\.1\.Latent manifold \(diffusion\) loss\.

We adopt the standard noise\-prediction objective:

$$\mathcal{L}_{\mathrm{diff}} \;:=\; \mathbb{E}_{t,\bm{\epsilon}}\Big[\big\|\bm{\epsilon}-\epsilon_{\theta}(\mathbf{z}_{t},\mathbf{c},t)\big\|_{2}^{2}\Big], \tag{18}$$

where $t$ is sampled from a pre-defined distribution over $\{1,\dots,T\}$. To better match DDP, we use a mixture that emphasizes the aligned timesteps $\{t_{k}\}$ while still covering the full trajectory.

#### 4\.5\.2\.Multi\-scale aggregation loss\.

To supervise the decoded multivariate representation, we define a scale\-wise regression objective:

$$\mathcal{L}_{\mathrm{mia}} \;:=\; \sum_{k=1}^{S}\big\|b_{k}-\hat{b}_{k}\big\|_{2}^{2}, \tag{19}$$

encouraging different heads to specialize in distinct ordinal scales (coarse-to-fine).

#### 4\.5\.3\.Overall objective\.

The whole training objective is

$$\mathcal{L} \;:=\; \mathcal{L}_{\mathrm{diff}}+\lambda\,\mathcal{L}_{\mathrm{mia}}, \tag{20}$$

where $\lambda$ controls the trade-off between denoising quality and multi-scale ordinal supervision.

The above design directly supports the statements in the Introduction section and our contributions: spatial decoupling \(MIA\) improves representational capacity for hierarchical ordinal semantics, while temporal decoupling \(DDP\) provides multi\-time, frequency\-aligned denoising states for coarse\-to\-fine calibration\. Sec\.[4\.6](https://arxiv.org/html/2606.07599#S4.SS6)formalizes how these mechanisms translate into tighter ordinal error bounds and improved optimization behavior\.

### 4\.6\.Theoretical Results

We provide theoretical results connecting spatio-temporal decoupling to *ordinal prediction quality*. For simplicity we present the scalar case (extension to vectors follows by replacing squares with squared norms). We establish: (i) spatial decoupling mitigates representation collapse and tightens the ordinal bound; (ii) temporal decoupling yields a non-worse ordinal bound via multi-time selection/fusion; (iii) MH-MT admits linear convergence under standard smoothness and PL conditions. Full proofs are in Appendix [A](https://arxiv.org/html/2606.07599#A1).

###### Theorem 1 \(Spatial decoupling mitigates collapse and tightens the ordinal bound\)\.

Consider two representations over the same samples: a single-embedding representation matrix $U^{\mathrm{SE}}$ and a multi-embedding (concatenated) representation matrix $U^{\mathrm{ME}}$ with the *same total dimension* (e.g., $4\times 32$ vs. $1\times 128$). If the multi-embedding representation spans a larger subspace,

$$\mathrm{col}(U^{\mathrm{SE}})\subseteq\mathrm{col}(U^{\mathrm{ME}}),$$

then the best linear readout risk (least-squares) for predicting the continuous target $z_{0}$ is non-increasing:

$$\mathcal{R}(U^{\mathrm{ME}})\leq\mathcal{R}(U^{\mathrm{SE}}),\qquad\mathcal{R}(U)\triangleq\min_{w}\frac{1}{n}\|Uw-z_{0}\|_{2}^{2}.$$

If the inclusion is strict and the target contains a component explainable by $\mathrm{col}(U^{\mathrm{ME}})$ but not by $\mathrm{col}(U^{\mathrm{SE}})$, then $\mathcal{R}(U^{\mathrm{ME}})<\mathcal{R}(U^{\mathrm{SE}})$. Consequently, via the ordinal MSE-to-error inequality, improving representation fit tightens the ordinal error upper bound.

Proof sketch. Least squares is orthogonal projection; projecting onto a larger subspace cannot increase the residual norm (and strictly decreases it when part of the residual becomes representable).
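The projection argument can be checked numerically on a tiny example. The sketch below is an illustration under stated assumptions: the helper `ls_risk` and the three-sample toy data are hypothetical, and the example varies the number of columns rather than holding the total dimension fixed, since the column-space inclusion is the only hypothesis the inequality uses.

```python
import math


def ls_risk(columns, target):
    """Return min_w (1/n) * ||U w - target||^2, where U has the given columns.

    The minimum is the squared norm of the target's residual after orthogonal
    projection onto col(U), computed here with Gram-Schmidt.
    """
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            coef = sum(a * b for a, b in zip(v, q))
            v = [a - coef * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:                      # skip linearly dependent columns
            basis.append([a / norm for a in v])
    residual = list(target)
    for q in basis:
        coef = sum(a * b for a, b in zip(residual, q))
        residual = [a - coef * b for a, b in zip(residual, q)]
    return sum(a * a for a in residual) / len(target)


if __name__ == "__main__":
    u1, u2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]   # toy feature columns
    z0 = [1.0, 2.0, 3.0]                        # toy continuous target
    print(ls_risk([u1], z0))       # col(U_SE) = span{u1}
    print(ls_risk([u1, u2], z0))   # col(U_ME) = span{u1, u2}
```

Here the second column explains the component $2$ of the target that the first column cannot, so the risk drops from $13/3$ to $3$, which is the strict-inequality case of the theorem.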

###### Theorem 2 \(Temporal decoupling improves the best achievable ordinal\-error bound\)\.

Letz^0\(t\)\\hat\{z\}\_\{0\}\(t\)be the diffusion\-derived estimator constructed from a noise predictor at timett\(Appendix, Eq\. \(9\)\)\. Assume threshold decoding with minimum gapγ\>0\\gamma\>0\(Appendix, Assumption[4](https://arxiv.org/html/2606.07599#Thmassumption4)\) and the standard diffusion forward process\. Then for any fixed timett,

ℙ\(y^\(t\)≠y\)≤4γ2𝔼\[\(z^0\(t\)−z0\)2\],\\mathbb\{P\}\(\\hat\{y\}\(t\)\\neq y\)\\ \\leq\\ \\frac\{4\}\{\\gamma^\{2\}\}\\,\\mathbb\{E\}\[\(\\hat\{z\}\_\{0\}\(t\)\-z\_\{0\}\)^\{2\}\],and the regression MSE satisfies the exact identity

𝔼\[\(z^0\(t\)−z0\)2\]:=1−α¯tα¯t𝔼\[\(ϵ^\(t\)−ϵ\)2\]\.\\mathbb\{E\}\[\(\\hat\{z\}\_\{0\}\(t\)\-z\_\{0\}\)^\{2\}\]:=\\frac\{1\-\\bar\{\\alpha\}\_\{t\}\}\{\\bar\{\\alpha\}\_\{t\}\}\\,\\mathbb\{E\}\[\(\\hat\{\\epsilon\}\(t\)\-\\epsilon\)^\{2\}\]\.Moreover, given a set of time points𝒯=\{t1,…,tM\}\\mathcal\{T\}=\\\{t\_\{1\},\\dots,t\_\{M\}\\\}, define the oracle\-selected predictort†∈arg⁡mint∈𝒯⁡𝔼\[\(z^0\(t\)−z0\)2\]t^\{\\dagger\}\\in\\arg\\min\_\{t\\in\\mathcal\{T\}\}\\mathbb\{E\}\[\(\\hat\{z\}\_\{0\}\(t\)\-z\_\{0\}\)^\{2\}\]\. Then its ordinal error bound is no worse than any fixed single\-time baselinet⋆∈𝒯t\_\{\\star\}\\in\\mathcal\{T\}:

ℙ\(y^\(t†\)≠y\)≤4γ2𝔼\[\(z^0\(t†\)−z0\)2\]≤4γ2𝔼\[\(z^0\(t⋆\)−z0\)2\]\.\\mathbb\{P\}\(\\hat\{y\}\(t^\{\\dagger\}\)\\neq y\)\\ \\leq\\ \\frac\{4\}\{\\gamma^\{2\}\}\\,\\mathbb\{E\}\[\(\\hat\{z\}\_\{0\}\(t^\{\\dagger\}\)\-z\_\{0\}\)^\{2\}\]\\ \\leq\\ \\frac\{4\}\{\\gamma^\{2\}\}\\,\\mathbb\{E\}\[\(\\hat\{z\}\_\{0\}\(t\_\{\\star\}\)\-z\_\{0\}\)^\{2\}\]\.

Proof sketch. The first inequality is the ordinal MSE-to-error bound; the MSE identity is a direct algebraic consequence of the diffusion reparameterization; the oracle bound follows from the minimizer definition.
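The algebra behind the identity and the first inequality can be written out in the scalar case. This is a sketch rather than the appendix proof: the last line assumes that a decoding error can only occur when $|\hat{z}_{0}(t)-z_{0}|\geq\gamma/2$, which is our reading of the minimum-gap assumption (stated in the appendix and not reproduced here).

```latex
\begin{aligned}
z_t &= \sqrt{\bar\alpha_t}\,z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
\qquad
\hat z_0(t) = \frac{z_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon(t)}{\sqrt{\bar\alpha_t}}, \\
\hat z_0(t) - z_0 &= -\sqrt{\frac{1-\bar\alpha_t}{\bar\alpha_t}}\,\bigl(\hat\epsilon(t)-\epsilon\bigr), \\
\mathbb{E}\bigl[(\hat z_0(t)-z_0)^2\bigr] &= \frac{1-\bar\alpha_t}{\bar\alpha_t}\,
\mathbb{E}\bigl[(\hat\epsilon(t)-\epsilon)^2\bigr], \\
\mathbb{P}\bigl(\hat y(t)\neq y\bigr)
&\le \mathbb{P}\bigl(|\hat z_0(t)-z_0|\ge \gamma/2\bigr)
\le \frac{4}{\gamma^{2}}\,\mathbb{E}\bigl[(\hat z_0(t)-z_0)^2\bigr]
\quad\text{(Markov's inequality on the squared error)}.
\end{aligned}
```

The factor $(1-\bar{\alpha}_{t})/\bar{\alpha}_{t}$ grows as $\bar{\alpha}_{t}\to 0$, so the same noise-prediction error translates into a larger $z_0$ error at noisier timesteps.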

###### Theorem 3 \(Linear convergence rate of DiffoR on the multi\-time objective\)\.

Consider the multi\-time training objective

$$F_{\mathrm{MH}}(\vartheta)=\sum_{m=1}^{M}\lambda_{m}f_{m}(\theta_{m}),\qquad\vartheta=(\theta_{1},\dots,\theta_{M}),\ \lambda_{m}>0,$$

where each $f_{m}$ is differentiable. Assume each $f_{m}$ is $L_{m}$-smooth and satisfies the Polyak–Łojasiewicz (PL) inequality with constant $\mu_{m}>0$:

$$\frac{1}{2}\|\nabla f_{m}(\theta)\|^{2}\geq\mu_{m}\bigl(f_{m}(\theta)-f_{m}^{\star}\bigr),\qquad f_{m}^{\star}=\inf_{\theta}f_{m}(\theta).$$

Let $L\triangleq\sum_{m=1}^{M}\lambda_{m}L_{m}$, $\mu_{\min}\triangleq\min_{m}\mu_{m}$, and $\lambda_{\min}\triangleq\min_{m}\lambda_{m}$. Then gradient descent with step size $\eta\leq 1/L$ yields the linear rate

$$F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star}\leq\bigl(1-\eta\,\mu_{\min}\lambda_{\min}\bigr)^{k}\bigl(F_{\mathrm{MH}}(\vartheta^{0})-F_{\mathrm{MH}}^{\star}\bigr),$$

where $F_{\mathrm{MH}}^{\star}=\sum_{m=1}^{M}\lambda_{m}f_{m}^{\star}$.

Proof sketch. Using smoothness, one obtains a standard descent inequality $F(\vartheta^{k+1})\leq F(\vartheta^{k})-\frac{\eta}{2}\|\nabla F(\vartheta^{k})\|^{2}$ for $\eta\leq 1/L$. For MH-MT, the gradient is block-separated across heads, so $\|\nabla F_{\mathrm{MH}}\|^{2}=\sum_{m}\lambda_{m}^{2}\|\nabla f_{m}(\theta_{m})\|^{2}$ with no cross terms; applying per-task PL and $\lambda_{m}^{2}\geq\lambda_{\min}\lambda_{m}$ yields $\frac{1}{2}\|\nabla F_{\mathrm{MH}}\|^{2}\geq\mu_{\min}\lambda_{\min}(F_{\mathrm{MH}}-F_{\mathrm{MH}}^{\star})$, giving the stated recursion.
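The stated rate can be sanity-checked on a toy instance. The sketch below assumes quadratic heads $f_m(\theta)=\tfrac{1}{2}a_m\theta^2$, for which $L_m=\mu_m=a_m$ and $f_m^{\star}=0$; these functions, the name `run_gd`, and the numeric values are illustrative assumptions, not the paper's training objective.

```python
def run_gd(a, lam, theta0, steps):
    """Gradient descent on F(theta) = sum_m lam[m] * 0.5 * a[m] * theta[m]**2.

    Returns the objective history and the contraction factor
    1 - eta * mu_min * lam_min from Theorem 3, using eta = 1 / L.
    """
    L = sum(l * ai for l, ai in zip(lam, a))       # L = sum_m lam_m * L_m
    eta = 1.0 / L
    rate = 1.0 - eta * min(a) * min(lam)           # mu_m = a_m for quadratics

    def objective(theta):
        return sum(l * 0.5 * ai * t * t for l, ai, t in zip(lam, a, theta))

    theta = list(theta0)
    history = [objective(theta)]
    for _ in range(steps):
        # Block-separated gradient: dF/dtheta_m = lam_m * a_m * theta_m.
        theta = [t - eta * l * ai * t for l, ai, t in zip(lam, a, theta)]
        history.append(objective(theta))
    return history, rate


if __name__ == "__main__":
    history, rate = run_gd([1.0, 4.0], [0.5, 2.0], [3.0, -2.0], steps=50)
    print(rate)
    print(all(history[k] <= rate ** k * history[0] + 1e-12 for k in range(51)))
```

Because $F_{\mathrm{MH}}^{\star}=0$ in this toy case, the check compares $F_{\mathrm{MH}}(\vartheta^{k})$ directly against $(1-\eta\mu_{\min}\lambda_{\min})^{k}F_{\mathrm{MH}}(\vartheta^{0})$.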

Table 3. Facial age estimation results on four benchmarks.

| Method | UTKFace MAE ↓ | UTKFace CS(%) ↑ | FG-NET MAE ↓ | FG-NET CS(%) ↑ | MORPH MAE ↓ | MORPH CS(%) ↑ | CACD MAE ↓ | CACD CS(%) ↑ |
|---|---|---|---|---|---|---|---|---|
| OR-CNN (Niu et al., [2016b](https://arxiv.org/html/2606.07599#bib.bib42)) | 4.40 | 63.67 | 5.09 | 83.80 | 2.83 | 61.97 | 4.01 | 73.41 |
| DLDL (Gao et al., [2017](https://arxiv.org/html/2606.07599#bib.bib51)) | 4.39 | 63.65 | 5.26 | 83.83 | 2.81 | 62.43 | 3.96 | 73.37 |
| SORD (Diaz and Marathe, [2019b](https://arxiv.org/html/2606.07599#bib.bib67)) | 4.36 | 64.25 | 5.59 | 82.83 | 2.81 | 61.31 | 3.96 | 73.48 |
| Mean-Var. (Pan et al., [2018](https://arxiv.org/html/2606.07599#bib.bib68)) | 4.42 | 63.36 | 5.45 | 83.43 | 2.83 | 62.87 | 4.07 | 72.98 |
| Unimodal (Li et al., [2022b](https://arxiv.org/html/2606.07599#bib.bib80)) | 4.47 | 62.67 | 5.13 | 83.97 | 2.78 | 63.15 | 4.10 | 73.55 |
| FaRL (Paplhám et al., [2024](https://arxiv.org/html/2606.07599#bib.bib94)) | 3.87 | 65.38 | 4.95 | 84.52 | 3.04 | 63.49 | 3.96 | 74.18 |
| GoR (Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)) | 3.43 | 66.58 | 4.68 | 85.66 | 2.69 | 64.95 | 3.73 | 75.29 |
| DiffoR | 3.39 | 67.23 | 4.29 | 87.35 | 2.58 | 66.79 | 3.64 | 76.88 |

## 5\.Experiments

In this section, we first evaluate DiffoR’s overall performance across diverse domains, followed by an in-depth analysis of architectural choices, distributional visualizations, and component ablations to elucidate its underlying mechanisms. We employ a comprehensive set of metrics, including Mean Absolute Error (MAE), Cumulative Score (CS), XAUC (Zhan et al., [2022](https://arxiv.org/html/2606.07599#bib.bib11)), Linear Correlation Coefficient (LCC), and Spearman’s Rank Correlation Coefficient (SRCC). Due to space limits, detailed metric definitions, implementation specifics, and supplementary results are provided in Appendix [C.1](https://arxiv.org/html/2606.07599#A3.SS1).

### 5\.1\.Overall Performance across Domains

#### 5\.1\.1\.Image Aesthetics Assessment \(IAA\)

##### Setting\.

Following the protocol in (He et al., [2023](https://arxiv.org/html/2606.07599#bib.bib81)), we benchmark DiffoR against 14 representative baselines on four standard datasets: AVA (Murray et al., [2012](https://arxiv.org/html/2606.07599#bib.bib82)), TAD66K (He et al., [2022](https://arxiv.org/html/2606.07599#bib.bib135)), ICAA17K (He et al., [2023](https://arxiv.org/html/2606.07599#bib.bib81)), and SPAQ (Fang et al., [2020](https://arxiv.org/html/2606.07599#bib.bib70)). Performance is assessed using MAE, XAUC, LCC, and SRCC. Due to space limits, here we present only the results of six top-performing methods in Tab. [2](https://arxiv.org/html/2606.07599#S4.T2). The complete results are in Appendix [C.2](https://arxiv.org/html/2606.07599#A3.SS2). For all experiments, we employ a ResNet50 (He et al., [2016](https://arxiv.org/html/2606.07599#bib.bib60)) backbone as the encoder.

#### 5\.1\.2\.Watch Time Prediction \(WTP\)

Table 4. Performance comparison among different approaches on KuaiRec and KuaiRand.

| Method | KuaiRec MAE ↓ | KuaiRec XAUC ↑ | KuaiRand MAE ↓ | KuaiRand XAUC ↑ |
|---|---|---|---|---|
| D2Co (Zhao et al., [2023](https://arxiv.org/html/2606.07599#bib.bib14)) | 3.2633 | 0.5895 | 20.7854 | 0.6547 |
| CWM (Zhao et al., [2024](https://arxiv.org/html/2606.07599#bib.bib10)) | 3.3532 | 0.5899 | 19.6351 | 0.6668 |
| D2Q (Zhan et al., [2022](https://arxiv.org/html/2606.07599#bib.bib11)) | 3.2696 | 0.6043 | 19.4258 | 0.6715 |
| TPM (Lin et al., [2023](https://arxiv.org/html/2606.07599#bib.bib8)) | 3.4584 | 0.5819 | 22.5950 | 0.6303 |
| CREAD (Sun et al., [2024](https://arxiv.org/html/2606.07599#bib.bib1)) | 3.2290 | 0.6119 | 19.8087 | 0.6678 |
| PTPM (Chen et al., [2025](https://arxiv.org/html/2606.07599#bib.bib13)) | 3.2865 | 0.6033 | 20.6584 | 0.6679 |
| SWaT (Yang et al., [2024](https://arxiv.org/html/2606.07599#bib.bib9)) | 3.3496 | 0.5888 | 22.3353 | 0.6515 |
| GoR (Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)) | 3.1985 | 0.6117 | 19.2742 | 0.6682 |
| EGMN (Zhao et al., [2025](https://arxiv.org/html/2606.07599#bib.bib12)) | 3.1890 | 0.6098 | 19.3246 | 0.6682 |
| DiffoR | 3.1427 | 0.6126 | 19.1066 | 0.6763 |

##### Performance\.

For the IAA task, Tab. [2](https://arxiv.org/html/2606.07599#S4.T2) shows that DiffoR achieves the best result on every dataset and metric. Even with a generic ResNet50, DiffoR surpasses all SOTA methods equipped with specialized architectures in both ordinal-sensitive and accuracy metrics. Considering the pivotal role of visual features in IAA (He et al., [2022](https://arxiv.org/html/2606.07599#bib.bib135)), our continuous generative paradigm not only excels in ordinal modeling but also demonstrates robust encoder-agnostic generalization.

#### 5\.1\.3\.Facial Age Estimation \(FAE\)

##### Setting\.

FAE aims to predict chronological age from facial imagery by analyzing visual cues. Adhering to the evaluation protocol established in (Paplhám et al., [2024](https://arxiv.org/html/2606.07599#bib.bib94)), we benchmark DiffoR on four standard datasets: UTKFace (Zhang et al., [2017](https://arxiv.org/html/2606.07599#bib.bib62)), FG-NET (Lanitis et al., [2002](https://arxiv.org/html/2606.07599#bib.bib73)), MORPH (Ricanek and Tesafaye, [2006](https://arxiv.org/html/2606.07599#bib.bib72)), and CACD (Chen et al., [2014](https://arxiv.org/html/2606.07599#bib.bib64)), using Mean Absolute Error (MAE) and Cumulative Score (CS) with a tolerance level of $L=5$ as metrics. We compare against seven SOTA methods: OR-CNN (Niu et al., [2016b](https://arxiv.org/html/2606.07599#bib.bib42)), DLDL (Gao et al., [2017](https://arxiv.org/html/2606.07599#bib.bib51)), SORD (Diaz and Marathe, [2019b](https://arxiv.org/html/2606.07599#bib.bib67)), Mean-Var. (Pan et al., [2018](https://arxiv.org/html/2606.07599#bib.bib68)), Unimodal (Li et al., [2022b](https://arxiv.org/html/2606.07599#bib.bib80)), FaRL (Paplhám et al., [2024](https://arxiv.org/html/2606.07599#bib.bib94)), and GoR (Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)). For all experiments, we employ a standard ResNet50 backbone as the encoder.
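The two FAE metrics can be computed as in the sketch below. It assumes the definition of CS commonly used in age estimation, namely the percentage of samples whose absolute error does not exceed the tolerance $L$; the paper gives its exact metric definitions in the appendix, and the function names and sample ages here are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between predicted and true ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cumulative_score(y_true, y_pred, tol=5):
    """CS(tol) in percent: share of samples with absolute error <= tol."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return 100.0 * hits / len(y_true)


if __name__ == "__main__":
    ages_true = [20, 30, 40, 50]   # made-up ages for illustration only
    ages_pred = [22, 37, 40, 44]
    print(mean_absolute_error(ages_true, ages_pred))
    print(cumulative_score(ages_true, ages_pred, tol=5))
```

The absolute errors in this example are 2, 7, 0, and 6 years, so two of the four predictions fall within the five-year tolerance.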

##### Performance\.

As presented in Tab. [3](https://arxiv.org/html/2606.07599#S4.T3), DiffoR achieves SOTA performance, outperforming all baselines across all datasets and metrics. Relative to the runner-up GoR, DiffoR reduces MAE by about 4.1% on MORPH and 8.33% on FG-NET, and raises CS by 0.65 points on UTKFace and 1.59 points on CACD. These consistent improvements underscore DiffoR’s robust generalization capability across diverse ordinal distributions.

##### Setting\.

For the WTP task, we conduct evaluations on two publicly available datasets, KuaiRec (Gao et al., [2022a](https://arxiv.org/html/2606.07599#bib.bib20)) and KuaiRand (Gao et al., [2022b](https://arxiv.org/html/2606.07599#bib.bib19)), both collected from real-world recommender platforms. Using a Feed-Forward Network (FFN) as the encoder (Ma et al., [2024](https://arxiv.org/html/2606.07599#bib.bib2)), we report MAE and XAUC.

##### Performance\.

As shown in Tab. [4](https://arxiv.org/html/2606.07599#S5.T4), DiffoR surpasses all nine baselines. Against the runner-up, it reduces MAE by 1.45% on KuaiRec. This advantage extends to KuaiRand, where DiffoR achieves a 0.168 MAE drop and a 0.71% XAUC gain. These results not only validate the method’s efficacy, but also highlight its practical value for optimizing real-world recommendation systems.

#### 5\.1\.4\.Life Time Value Prediction \(LTV\)

Table 5. Performance comparison on LTV datasets.

| Method | Criteo-SSC MAE ↓ | Criteo-SSC SRCC ↑ | Kaggle MAE ↓ | Kaggle SRCC ↑ |
|---|---|---|---|---|
| Two-stage (Drachen et al., [2018](https://arxiv.org/html/2606.07599#bib.bib86)) | 21.719 | 0.2386 | 74.782 | 0.431 |
| MTL-MSE (Ma et al., [2018](https://arxiv.org/html/2606.07599#bib.bib88)) | 21.190 | 0.2478 | 74.065 | 0.433 |
| ZILN (Wang et al., [2019](https://arxiv.org/html/2606.07599#bib.bib96)) | 20.880 | 0.2434 | 72.528 | 0.524 |
| MDME (Li et al., [2022a](https://arxiv.org/html/2606.07599#bib.bib112)) | 16.598 | 0.2269 | 72.900 | 0.516 |
| MDAN (Liu et al., [2024](https://arxiv.org/html/2606.07599#bib.bib123)) | 20.030 | 0.2470 | 73.940 | 0.437 |
| OptDist (Weng et al., [2024](https://arxiv.org/html/2606.07599#bib.bib124)) | 15.784 | 0.2505 | 70.929 | 0.525 |
| HiLTV (Xu et al., [2025c](https://arxiv.org/html/2606.07599#bib.bib159)) | 14.764 | 0.2645 | 69.331 | 0.512 |
| GoR (Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)) | 12.996 | 0.3026 | 67.035 | 0.533 |
| DiffoR | 12.533 | 0.3066 | 66.522 | 0.557 |

##### Setting\.

We benchmark DiffoR on the Criteo-SSC and Kaggle datasets following (Weng et al., [2024](https://arxiv.org/html/2606.07599#bib.bib124)), reporting MAE and SRCC. An FFN-based encoder (identical to the WTP task) is employed. Detailed specifications of the encoder, baselines, and dataset statistics are provided in Appendix [C.4](https://arxiv.org/html/2606.07599#A3.SS4).

##### Performance\.

As presented in Tab. [5](https://arxiv.org/html/2606.07599#S5.T5), DiffoR consistently outperforms eight existing methods across both ordinal (SRCC) and numeric (MAE) metrics. On Criteo-SSC, DiffoR surpasses GoR, reducing MAE by 3.56% and improving SRCC by 1.32%. On Kaggle, DiffoR achieves a 0.513 reduction in MAE and a 4.5% boost in SRCC compared to GoR, substantiating the superiority of DiffoR.
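The relative and absolute gains over GoR can be recomputed from the GoR and DiffoR rows of Tab. 5. The helper names below are hypothetical; the numbers are copied from the table.

```python
def relative_reduction(baseline, ours):
    """Percentage reduction for a lower-is-better metric such as MAE."""
    return 100.0 * (baseline - ours) / baseline


def relative_gain(baseline, ours):
    """Percentage gain for a higher-is-better metric such as SRCC."""
    return 100.0 * (ours - baseline) / baseline


if __name__ == "__main__":
    # (GoR, DiffoR) pairs taken from Tab. 5.
    print(round(relative_reduction(12.996, 12.533), 2))  # Criteo-SSC MAE, percent
    print(round(relative_gain(0.3026, 0.3066), 2))       # Criteo-SSC SRCC, percent
    print(round(67.035 - 66.522, 3))                     # Kaggle MAE, absolute drop
    print(round(relative_gain(0.533, 0.557), 2))         # Kaggle SRCC, percent
```

Reporting the Kaggle MAE change as an absolute difference and the others as percentages follows the way the paragraph above mixes the two conventions.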

### 5\.2\.Further Analysis

#### 5\.2\.1\.Ablation Study

##### Module Contribution

We investigate the effects of our core components: Multi-scale Increment Aggregation (MIA) and Dynamic Denoising Perception (DDP). As shown in Tab. [6](https://arxiv.org/html/2606.07599#S5.T6), removing MIA (Row (d)) leads to a sharp performance drop (e.g., SRCC falls by about 0.1, from 0.765 to 0.666), highlighting its necessity in capturing hierarchical ordinal dependencies. Similarly, discarding DDP (Row (e)) degrades performance, confirming the importance of synchronizing denoising steps with feature frequencies. The removal of both modules results in the worst performance (Row (f)), verifying that MIA and DDP function synergistically to enable robust learning.

##### Architecture Analysis

To validate the universality of our paradigm, we replace the diffusion backbone with Flow Matching (Row (b) in Tab. [6](https://arxiv.org/html/2606.07599#S5.T6)). The comparable performance confirms that the efficacy stems from the Continuous Generative Ordinal Regression formulation itself, rather than a specific generative model. Furthermore, replacing the Transformer architecture with a simple MLP-based structure (Row (c) in Tab. [6](https://arxiv.org/html/2606.07599#S5.T6)) causes a performance dip due to the loss of attention-based feature interaction. However, this variant still outperforms the SOTA baselines, demonstrating that our generative framework provides a fundamental improvement robust to architectural simplifications.

Table 6. Ablation study of key modules and generative architectures on the ICAA17K dataset under the IAA task.

| Row | Variant | MAE ↓ | XAUC ↑ | LCC ↑ | SRCC ↑ |
|---|---|---|---|---|---|
| (a) | DiffoR | 0.552 | 0.841 | 0.759 | 0.765 |
| | Generative Architecture Ablation | | | | |
| (b) | w/ Flow Matching | 0.551 | 0.842 | 0.758 | 0.765 |
| (c) | Rep. Trans. w/ MLP | 0.568 | 0.805 | 0.710 | 0.728 |
| | Key Modules Ablation | | | | |
| (d) | w/o MIA | 0.597 | 0.786 | 0.675 | 0.666 |
| (e) | w/o DDP | 0.566 | 0.801 | 0.696 | 0.701 |
| (f) | w/o Both line (d) & (e) | 0.606 | 0.763 | 0.653 | 0.642 |

#### 5\.2\.2\.Latent Space Visualization

![Refer to caption](https://arxiv.org/html/2606.07599v1/x3.png)

Figure 3. t-SNE visualization of embeddings from different attention heads. The distinct clusters demonstrate that each head captures diverse and non-redundant ordinal patterns.

To see whether DiffoR effectively decomposes the ordinal regression task, we visualize the latent representations learned by different attention heads using t-SNE. As illustrated in Fig. [3](https://arxiv.org/html/2606.07599#S5.F3), the embeddings from the 8 heads exhibit clear spatial disentanglement, forming distinct and cohesive clusters without significant overlap. This implies that each head specializes in capturing a specific ordinal pattern or semantic subspace. Such diversity confirms that our Multi-scale Increment Aggregation strategy successfully encourages the model to learn complementary features, rather than collapsing into redundant representations.

#### 5\.2\.3\.Hyperparameter Effect

![Refer to caption](https://arxiv.org/html/2606.07599v1/x4.png)

Figure 4. Impact of the number of attention heads on ICAA17K. “0 head” corresponds to the variant without MIA.

We further investigate the impact of the number of attention heads on the ICAA17K dataset, which governs the granularity of ordinal subspace decoupling. In Fig. [4](https://arxiv.org/html/2606.07599#S5.F4), the configuration with 0 heads is equivalent to removing MIA (cf. Tab. [6](https://arxiv.org/html/2606.07599#S5.T6), Row (d)). As the number of heads increases, we observe a steady improvement across all metrics, peaking at 8 heads. This suggests that a sufficient number of parallel subspaces is crucial for capturing diverse ordinal increments. However, further increasing the heads (to 10 or 12) leads to a slight performance degradation, likely due to over-parameterization or the fragmentation of semantic features into overly fine-grained, noisy subspaces.

## 6\.Conclusion

This work addresses the fundamental bottlenecks of discretization in Ordinal Regression (OR) by proposing DiffoR, a novel continuous generative framework. By reformulating OR as a conditional value recovery process via diffusion, DiffoR bypasses quantization artifacts and natively captures the non-stationary semantic transitions of ordinal data. Our proposed Dual-Decoupling Strategy, which combines spatial increment aggregation with temporal denoising perception, ensures the preservation of hierarchical ordinal topology. Extensive experiments across 12 benchmarks in four domains demonstrate DiffoR’s consistent superiority and robust generalization. By establishing a high-precision, universal solution for OR, this work offers a robust foundation for future research in modeling data with inherent ordering.

## Acknowledgments

This work was partially supported by Kuaishou Technology\. Shuigeng Zhou was supported by National Social Science Fund of China \(NSFC\) under grant No\. 24&ZD185\. The computations in this research were performed using the CFFF platform of Fudan University\.

## References

- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901. Cited by: [§2.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1).
- B. Chen, C. Chen, and W. H. Hsu (2014) Cross-age reference coding for age-invariant face recognition and retrieval. In Computer Vision–ECCV 2014. Cited by: [§5.1.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1).
- H. Chen, Y. Dong, Z. Wang, X. Yang, C. Duan, H. Su, and J. Zhu (2023) Robust classification via a single diffusion model. arXiv preprint arXiv:2305.15241. Cited by: [§2.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1).
- S. Chen, C. Zhang, M. Dong, et al. (2017) Using ranking-cnn for age estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition. Cited by: [§2.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1).
- X. Chen, X. Lin, C. Li, and P. Jiang (2025) Personalized tree-based progressive regression model for watch-time prediction in short video recommendation. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 5609–5616. Cited by: [item 6](https://arxiv.org/html/2606.07599#A3.I2.i6.p1.1.1), [Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.11.1).
- R. Diaz and A. Marathe (2019a) Soft labels for ordinal regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4738–4747. Cited by: [§1](https://arxiv.org/html/2606.07599#S1.p2.1), [§2.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1).
- R. Diaz and A. Marathe (2019b) Soft labels for ordinal regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Cited by: [Table 3](https://arxiv.org/html/2606.07599#S4.T3.8.8.12.1), [§5.1.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1).
- A. Drachen, M. Pastor, A. Liu, D. J. Fontaine, Y. Chang, J. Runge, R. Sifa, and D. Klabjan (2018) To be or not to be… social: incorporating simple social features in mobile game customer lifetime value predictions. In Proceedings of the Australasian Computer Science Week Multiconference. Cited by: [item 1](https://arxiv.org/html/2606.07599#A3.I3.i1.p1.1), [§C.4.2](https://arxiv.org/html/2606.07599#A3.SS4.SSS2.p1.1), [§1](https://arxiv.org/html/2606.07599#S1.p1.1), [Table 5](https://arxiv.org/html/2606.07599#S5.T5.4.4.6.1).
- Y. Du, Q. Zhai, W. Dai, and X. Li (2024) Teach clip to develop a number sense for ordinal regression. In European Conference on Computer Vision, pp. 1–17. Cited by: [§2.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1).
- Y. Fang, H. Zhu, Y. Zeng, K. Ma, and Z. Wang (2020) Perceptual quality assessment of smartphone photography. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Cited by: [§C.2.1](https://arxiv.org/html/2606.07599#A3.SS2.SSS1.Px1.p1.1), [Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.17.1), [§5.1.1](https://arxiv.org/html/2606.07599#S5.SS1.SSS1.Px1.p1.1).
- Y. Feng, J. Hu, Q. Lu, J. Niu, L. Tan, S. Yuan, Z. Yan, Y. Jia, Q. He, S. Ge, et al. (2026) MUVR: a multi-modal untrimmed video retrieval benchmark with multi-level visual correspondence. Advances in Neural Information Processing Systems 38. Cited by: [§1](https://arxiv.org/html/2606.07599#S1.p2.1).
- B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng (2017) Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing. Cited by: [Table 3](https://arxiv.org/html/2606.07599#S4.T3.8.8.11.1), [§5.1.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1).
- C. Gao, S. Li, W. Lei, J. Chen, B. Li, P. Jiang, X. He, J. Mao, and T. Chua (2022a) KuaiRec: a fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 540–550. Cited by: [§C.3.1](https://arxiv.org/html/2606.07599#A3.SS3.SSS1.Px1.p1.1), [§5.1.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px3.p1.1).
- C. Gao, S. Li, Y. Zhang, J. Chen, B. Li, W. Lei, P. Jiang, and X. He (2022b) Kuairand: an unbiased sequential recommendation dataset with randomly exposed videos. In Proceedings of the 31st ACM international conference on information & knowledge management, pp. 3953–3957. Cited by: [§C.3.1](https://arxiv.org/html/2606.07599#A3.SS3.SSS1.Px1.p1.1), [§5.1.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px3.p1.1).
- F. Gao, Y. Lin, J. Shi, M. Qiao, and N. Wang (2024) AesMamba: universal image aesthetic assessment with state space models. In Proceedings of the 32nd ACM International Conference on Multimedia. Cited by: [Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.24.1), [Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.26.1), [Table 2](https://arxiv.org/html/2606.07599#S4.T2.16.16.17.8).
- X\. Guo, J\. Pan, X\. Wang, B\. Chen, J\. Jiang, and M\. Long \(2023\)On the embedding collapse when scaling up recommendation models\.arXiv preprint arXiv:2310\.04400\.Cited by:[§B\.5\.3](https://arxiv.org/html/2606.07599#A2.SS5.SSS3.p1.2),[Corollary 13](https://arxiv.org/html/2606.07599#A2.Thmtheorem13.p1.1.1)\.
- X\. Han, H\. Zheng, and M\. Zhou \(2022\)Card: classification and regression diffusion models\.Advances in Neural Information Processing Systems35,pp\. 18100–18115\.Cited by:[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE conference on computer vision and pattern recognition,Cited by:[§5\.1\.1](https://arxiv.org/html/2606.07599#S5.SS1.SSS1.Px1.p1.1)\.
- S\. He, A\. Ming, Y\. Li, J\. Sun, S\. Zheng, and H\. Ma \(2023\)Thinking image color aesthetics assessment: models, datasets and benchmarks\.InProceedings of the IEEE/CVF International Conference on Computer Vision,Cited by:[§C\.2\.1](https://arxiv.org/html/2606.07599#A3.SS2.SSS1.Px1.p1.1),[§C\.2\.1](https://arxiv.org/html/2606.07599#A3.SS2.SSS1.Px3.p1.1),[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.23.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.25.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[§5\.1\.1](https://arxiv.org/html/2606.07599#S5.SS1.SSS1.Px1.p1.1)\.
- S\. He, Y\. Zhang, R\. Xie, D\. Jiang, and A\. Ming \(2022\)Rethinking image aesthetics assessment: models, datasets and benchmarks\.InIJCAI,Cited by:[§C\.2\.1](https://arxiv.org/html/2606.07599#A3.SS2.SSS1.Px1.p1.1),[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.21.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.23.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[Table 2](https://arxiv.org/html/2606.07599#S4.T2.16.16.17.7),[§5\.1\.1](https://arxiv.org/html/2606.07599#S5.SS1.SSS1.Px1.p1.1),[§5\.1\.2](https://arxiv.org/html/2606.07599#S5.SS1.SSS2.Px1.p1.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020a\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020b\)Denoising diffusion probabilistic models\.Advances in neural information processing systems\.Cited by:[§3\.1](https://arxiv.org/html/2606.07599#S3.SS1.p1.1)\.
- V\. Hosu, B\. Goldlucke, and D\. Saupe \(2019\)Effective aesthetics prediction with multi\-level spatially pooled features\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.16.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.16.1)\.
- C\. Jin, Y\. Ren, H\. Ma, Y\. Xia, Y\. Guan, H\. Zhang, J\. Ding, J\. Guan, and S\. Zhou \(2026\)Invariant feature learning for counterfactual watch\-time prediction in video recommendation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 14964–14972\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1)\.
- J\. Ke, Q\. Wang, Y\. Wang, P\. Milanfar, and F\. Yang \(2021\)Musiq: multi\-scale image quality transformer\.InProceedings of the IEEE/CVF international conference on computer vision,Cited by:[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.21.1)\.
- D\. P\. Kingma and J\. Ba \(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§C\.1\.2](https://arxiv.org/html/2606.07599#A3.SS1.SSS2.Px2.p1.2)\.
- D\. P\. Kingma and M\. Welling \(2013\)Auto\-encoding variational bayes\.arXiv preprint arXiv:1312\.6114\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p5.1)\.
- S\. Kong, X\. Shen, Z\. Lin, R\. Mech, and C\. Fowlkes \(2016\)Photo aesthetics ranking network with attributes and content adaptation\.InComputer Vision–ECCV 2016,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.12.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.12.1),[Table 2](https://arxiv.org/html/2606.07599#S4.T2.16.16.17.4)\.
- A\. Lanitis, C\.J\. Taylor, and T\.F\. Cootes \(2002\)Toward automatic simulation of aging effects on face images\.IEEE Transactions on Pattern Analysis and Machine Intelligence\.Cited by:[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1)\.
- K\. Li, G\. Shao, N\. Yang, X\. Fang, and Y\. Song \(2022a\)Billion\-user customer lifetime value prediction: an industrial\-scale solution from kuaishou\.InProceedings of the 31st ACM International Conference on Information & Knowledge Management,Cited by:[item 4](https://arxiv.org/html/2606.07599#A3.I3.i4.p1.1),[§C\.4\.2](https://arxiv.org/html/2606.07599#A3.SS4.SSS2.p1.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[Table 5](https://arxiv.org/html/2606.07599#S5.T5.4.4.9.1)\.
- Q\. Li, J\. Wang, Z\. Yao, Y\. Li, P\. Yang, J\. Yan, C\. Wang, and S\. Pu \(2022b\)Unimodal\-concentrated loss: fully adaptive label distribution learning for ordinal regression\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p1.1),[Table 3](https://arxiv.org/html/2606.07599#S4.T3.8.8.14.1),[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1)\.
- W\. Li, X\. Huang, J\. Lu, J\. Feng, and J\. Zhou \(2021\)Learning probabilistic ordinal embeddings for uncertainty\-aware regression\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.19.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.20.1),[§1](https://arxiv.org/html/2606.07599#S1.p2.1)\.
- W\. Li, X\. Huang, Z\. Zhu, Y\. Tang, X\. Li, J\. Zhou, and J\. Lu \(2022c\)Ordinalclip: learning rank prompts for language\-guided ordinal regression\.Advances in Neural Information Processing Systems35,pp\. 35313–35325\.Cited by:[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1)\.
- X\. Lin, X\. Chen, L\. Song, J\. Liu, B\. Li, and P\. Jiang \(2023\)Tree based progressive regression model for watch\-time prediction in short\-video recommendation\.InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 4497–4506\.Cited by:[item 5](https://arxiv.org/html/2606.07599#A3.I2.i5.p1.1.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.9.1)\.
- Y\. Lipman, R\. T\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le \(2022\)Flow matching for generative modeling\.arXiv preprint arXiv:2210\.02747\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- W\. Liu, G\. Xu, B\. Ye, X\. Luo, Y\. He, and C\. Yin \(2024\)MDAN: multi\-distribution adaptive networks for ltv prediction\.InPacific\-Asia Conference on Knowledge Discovery and Data Mining,Cited by:[item 5](https://arxiv.org/html/2606.07599#A3.I3.i5.p1.1),[§C\.4\.2](https://arxiv.org/html/2606.07599#A3.SS4.SSS2.p1.1),[Table 5](https://arxiv.org/html/2606.07599#S5.T5.4.4.10.1)\.
- X\. Lu, Z\. Lin, H\. Jin,et al\.\(2014\)Rapid: rating pictorial aesthetics using deep learning\.InProceedings of the 22nd ACM international conference on Multimedia,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.11.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.11.1),[Table 2](https://arxiv.org/html/2606.07599#S4.T2.16.16.17.3)\.
- H\. Ma, K\. Tian, T\. Zhang, X\. Zhang, H\. Zhou, C\. Chen, H\. Li, J\. Guan, and S\. Zhou \(2024\)Generative regression based watch time prediction for short\-video recommendation\.arXiv preprint arXiv:2412\.20211\.Cited by:[§A\.2](https://arxiv.org/html/2606.07599#A1.SS2.p1.4),[§A\.3](https://arxiv.org/html/2606.07599#A1.SS3.p1.1),[§1](https://arxiv.org/html/2606.07599#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px3.p1.1)\.
- H\. Ma, G\. Wang, F\. Yu, Q\. Jia, and S\. Ding \(2025a\)MS\-DETR: towards effective video moment retrieval and highlight detection by joint motion\-semantic learning\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 4514–4523\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p2.1)\.
- H\. Ma, C\. Zhang, L\. Zhang, J\. Zhou, J\. Guan, and S\. Zhou \(2025b\)Fine\-grained zero\-shot object detection\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 4504–4513\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p2.1)\.
- H\. Ma, H\. Zhou, K\. Tian, X\. Zhang, C\. Chen, H\. Li, J\. Guan, and S\. Zhou \(2026\)GoR: a unified and extensible generative framework for ordinal regression\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ys80cc2N5M)Cited by:[item 10](https://arxiv.org/html/2606.07599#A3.I2.i10.p1.1.1),[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.25.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.27.1),[§1](https://arxiv.org/html/2606.07599#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1),[Table 2](https://arxiv.org/html/2606.07599#S4.T2.16.16.17.9),[Table 3](https://arxiv.org/html/2606.07599#S4.T3.8.8.16.1),[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1),[Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.13.1),[Table 5](https://arxiv.org/html/2606.07599#S5.T5.4.4.13.1)\.
- S\. Ma, J\. Liu, and C\. Wen Chen \(2017\)A\-lamp: adaptive layout\-aware multi\-patch deep convolutional neural network for photo aesthetic assessment\.InProceedings of the IEEE conference on computer vision and pattern recognition,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.15.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.14.1)\.
- X\. Ma, L\. Zhao, G\. Huang, Z\. Wang, Z\. Hu, X\. Zhu, and K\. Gai \(2018\)Entire space multi\-task model: an effective approach for estimating post\-click conversion rate\.InThe 41st International ACM SIGIR Conference on Research & Development in Information Retrieval,Cited by:[item 2](https://arxiv.org/html/2606.07599#A3.I3.i2.p1.1),[§C\.4\.2](https://arxiv.org/html/2606.07599#A3.SS4.SSS2.p1.1),[Table 5](https://arxiv.org/html/2606.07599#S5.T5.4.4.7.1)\.
- N\. Murray, L\. Marchesotti, and F\. Perronnin \(2012\)AVA: a large\-scale database for aesthetic visual analysis\.In2012 IEEE conference on computer vision and pattern recognition,Cited by:[§C\.2\.1](https://arxiv.org/html/2606.07599#A3.SS2.SSS1.Px1.p1.1),[§5\.1\.1](https://arxiv.org/html/2606.07599#S5.SS1.SSS1.Px1.p1.1)\.
- Z\. Niu, M\. Zhou, L\. Wang, X\. Gao, and G\. Hua \(2016a\)Ordinal regression with multiple output cnn for age estimation\.InProceedings of the IEEE conference on computer vision and pattern recognition,Cited by:[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1)\.
- Z\. Niu, M\. Zhou, L\. Wang, X\. Gao, and G\. Hua \(2016b\)Ordinal regression with multiple output cnn for age estimation\.InProceedings of the IEEE conference on computer vision and pattern recognition,Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p1.1),[§1](https://arxiv.org/html/2606.07599#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[Table 3](https://arxiv.org/html/2606.07599#S4.T3.8.8.10.1),[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1)\.
- H\. Pan, H\. Han, S\. Shan, and X\. Chen \(2018\)Mean\-variance loss for deep age estimation from a face\.InProceedings of the IEEE conference on computer vision and pattern recognition,Cited by:[Table 3](https://arxiv.org/html/2606.07599#S4.T3.8.8.13.1),[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1)\.
- J\. Paplhám, V\. Franc,et al\.\(2024\)A call to reflect on evaluation practices for age estimation: comparative analysis of the state\-of\-the\-art and a unified benchmark\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,Cited by:[Table 3](https://arxiv.org/html/2606.07599#S4.T3.8.8.15.1),[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1)\.
- J\. Ren, X\. Shen, Z\. Lin, R\. Mech, and D\. J\. Foran \(2017\)Personalized image aesthetics\.InProceedings of the IEEE international conference on computer vision,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.13.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.13.1),[Table 2](https://arxiv.org/html/2606.07599#S4.T2.16.16.17.5)\.
- K\. Ricanek and T\. Tesafaye \(2006\)Morph: a longitudinal image database of normal adult age\-progression\.In7th international conference on automatic face and gesture recognition \(FGR06\),Cited by:[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1)\.
- R\. Rothe, R\. Timofte, and L\. Van Gool \(2015\)Dex: deep expectation of apparent age from a single image\.InProceedings of the IEEE international conference on computer vision workshops,pp\. 10–15\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1)\.
- D\. She, Y\. Lai, G\. Yi, and K\. Xu \(2021\)Hierarchical layout\-aware graph convolutional network for unified aesthetics assessment\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.20.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.22.1)\.
- K\. Sheng, W\. Dong, C\. Ma, X\. Mei, F\. Huang, and B\. Hu \(2018\)Attention\-based multi\-patch aggregation for image aesthetic assessment\.InProceedings of the 26th ACM international conference on Multimedia,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.9.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.9.1)\.
- N\. Shin, S\. Lee, and C\. Kim \(2022\)Moving window regression: a novel approach to ordinal regression\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p2.1)\.
- J\. Sun, Z\. Ding, X\. Chen, Q\. Chen, Y\. Wang, K\. Zhan, and B\. Wang \(2024\)CREAD: a classification\-restoration framework with error adaptive discretization for watch time prediction in video recommender systems\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§A\.2](https://arxiv.org/html/2606.07599#A1.SS2.p1.1),[item 7](https://arxiv.org/html/2606.07599#A3.I2.i7.p1.1.1),[§1](https://arxiv.org/html/2606.07599#S1.p1.1),[§1](https://arxiv.org/html/2606.07599#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.10.1)\.
- H\. Talebi and P\. Milanfar \(2018\)NIMA: neural image assessment\.IEEE Transactions on Image Processing\.Cited by:[4th item](https://arxiv.org/html/2606.07599#A3.I1.i4.p1.2.1),[5th item](https://arxiv.org/html/2606.07599#A3.I1.i5.p1.1.1),[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.14.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.15.1),[Table 2](https://arxiv.org/html/2606.07599#S4.T2.16.16.17.6)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- Z\. Tu, H\. Talebi, H\. Zhang, F\. Yang, P\. Milanfar, A\. Bovik, and Y\. Li \(2022\)Maxvit: multi\-axis vision transformer\.InEuropean conference on computer vision,Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.22.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.24.1)\.
- J\. J\. Uliana and R\. A\. Krohling \(2025\)Diffusion models applied to skin and oral cancer classification\.arXiv preprint arXiv:2504\.00026\.Cited by:[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- J\. Wang, J\. Chen, J\. Liu, D\. Tang, D\. Z\. Chen, and J\. Wu \(2025\)A survey on ordinal regression: applications, advances and prospects\.arXiv preprint arXiv:2503\.00952\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p1.1),[§1](https://arxiv.org/html/2606.07599#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1)\.
- J\. Wang, Y\. Cheng, J\. Chen, T\. Chen, D\. Chen, and J\. Wu \(2023a\)Ord2Seq: regarding ordinal regression as label sequence prediction\.InProceedings of the IEEE/CVF International Conference on Computer Vision,Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p1.1),[§1](https://arxiv.org/html/2606.07599#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- R\. Wang, P\. Li, H\. Huang, C\. Cao, R\. He, and Z\. He \(2023b\)Learning\-to\-rank meets language: boosting language\-driven ordering alignment for ordinal classification\.Advances in Neural Information Processing Systems36,pp\. 76908–76922\.Cited by:[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1)\.
- X\. Wang, T\. Liu, and J\. Miao \(2019\)A deep probabilistic model for customer lifetime value prediction\.arXiv preprint arXiv:1912\.07753\.Cited by:[item 3](https://arxiv.org/html/2606.07599#A3.I3.i3.p1.3),[§C\.4\.2](https://arxiv.org/html/2606.07599#A3.SS4.SSS2.p1.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[Table 5](https://arxiv.org/html/2606.07599#S5.T5.4.4.8.1)\.
- Z\. Wang, Y\. Yin, J\. Shi, W\. Fang, H\. Li, and X\. Wang \(2017\)Zoom\-in\-net: deep mining lesions for diabetic retinopathy detection\.InInternational conference on medical image computing and computer\-assisted intervention,pp\. 267–275\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p2.1)\.
- Z\. Wang, L\. Nguyen, Z\. Zhao, M\. Yang, C\. Qin, Y\. Yang, and L\. Yang \(2026\)CreativeBench: benchmarking and enhancing machine creativity via self\-evolving challenges\.arXiv preprint arXiv:2603\.11863\.Cited by:[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- Y\. Weng, X\. Tang, Z\. Xu, F\. Lyu, D\. Liu, Z\. Sun, and X\. He \(2024\)OptDist: learning optimal distribution for customer lifetime value prediction\.arXiv preprint arXiv:2408\.08585\.Cited by:[item 6](https://arxiv.org/html/2606.07599#A3.I3.i6.p1.1),[§C\.4\.1](https://arxiv.org/html/2606.07599#A3.SS4.SSS1.p1.1),[§C\.4\.2](https://arxiv.org/html/2606.07599#A3.SS4.SSS2.p1.1),[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1),[§5\.1\.4](https://arxiv.org/html/2606.07599#S5.SS1.SSS4.Px1.p1.1),[Table 5](https://arxiv.org/html/2606.07599#S5.T5.4.4.11.1)\.
- H\. Xu, Z\. Wang, Z\. Zhu, L\. Pan, X\. Chen, S\. Fan, L\. Chen, and K\. Yu \(2025a\)Alignment for efficient tool calling of large language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 17776–17792\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.898),[Link](https://aclanthology.org/2025.emnlp-main.898/)Cited by:[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- H\. Xu, Z\. Zhu, L\. Pan, Z\. Wang, S\. Zhu, D\. Ma, R\. Cao, L\. Chen, and K\. Yu \(2025b\)Reducing tool hallucination via reliability alignment\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=WeOLZmDXyA)Cited by:[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- J\. Xu, A\. Zheng, L\. Ding, H\. Zhang, Z\. Deng, Q\. Yu, and X\. Zhang \(2025c\)HiLTV: hierarchical multi\-distribution modeling for lifetime value prediction in online games\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,Cited by:[item 7](https://arxiv.org/html/2606.07599#A3.I3.i7.p1.1),[Table 5](https://arxiv.org/html/2606.07599#S5.T5.4.4.12.1)\.
- S\. Yang, H\. Yang, L\. Du, A\. Ganesh, et al\. \(2024\)SWaT: statistical modeling of video watch time through user behavior analysis\.arXiv preprint arXiv:2408\.07759\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2408.07759),[Link](https://arxiv.org/abs/2408.07759)Cited by:[§A\.2](https://arxiv.org/html/2606.07599#A1.SS2.p1.1),[item 8](https://arxiv.org/html/2606.07599#A3.I2.i8.p1.1.1),[Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.12.1)\.
- Y\. Yang, H\. Fu, A\. I\. Aviles\-Rivero, C\. Schönlieb, and L\. Zhu \(2023\)Diffmic: dual\-guidance diffusion network for medical image classification\.InInternational conference on medical image computing and computer\-assisted intervention,pp\. 95–105\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- Q\. Yu, J\. Xie, A\. Nguyen, H\. Zhao, J\. Zhang, H\. Fu, Y\. Zhao, Y\. Zheng, and Y\. Meng \(2024\)CLIP\-dr: textual knowledge\-guided diabetic retinopathy grading with ranking\-aware prompting\.InInternational Conference on Medical Image Computing and Computer\-Assisted Intervention,pp\. 667–677\.Cited by:[§2\.1](https://arxiv.org/html/2606.07599#S2.SS1.p1.1)\.
- H\. Zeng, Z\. Cao, L\. Zhang, and A\. C\. Bovik \(2019\)A unified probabilistic formulation of image aesthetic assessment\.IEEE Transactions on Image Processing\.Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.18.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.19.1)\.
- R\. Zhan, C\. Pei, Q\. Su, J\. Wen, X\. Wang, G\. Mu, D\. Zheng, P\. Jiang, and K\. Gai \(2022\)Deconfounding duration bias in watch\-time prediction for video recommendation\.InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 4472–4481\.Cited by:[3rd item](https://arxiv.org/html/2606.07599#A3.I1.i3.p1.1),[item 2](https://arxiv.org/html/2606.07599#A3.I2.i2.p1.1.1),[Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.8.1),[§5](https://arxiv.org/html/2606.07599#S5.p1.1)\.
- C\. Zhang, B\. Huangfu, H\. Ma, J\. Guan, and S\. Zhou \(2025\)Multi\-modal prototype guided few\-shot object detection\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 1852–1861\.Cited by:[§1](https://arxiv.org/html/2606.07599#S1.p2.1)\.
- Z\. Zhang, Y\. Song,et al\.\(2017\)Age progression/regression by conditional adversarial autoencoder\.InProceedings of the IEEE conference on computer vision and pattern recognition,Cited by:[§5\.1\.3](https://arxiv.org/html/2606.07599#S5.SS1.SSS3.Px1.p1.1)\.
- H\. Zhao, G\. Cai, J\. Zhu, Z\. Dong, J\. Xu, and J\. Wen \(2024\)Counteracting duration bias in video recommendation via counterfactual watch time\.InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 4455–4466\.Cited by:[item 4](https://arxiv.org/html/2606.07599#A3.I2.i4.p1.1.1),[Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.7.1)\.
- H\. Zhao, L\. Zhang, J\. Xu, G\. Cai, Z\. Dong, and J\. Wen \(2023\)Uncovering user interest from biased and noised watch time in video recommendation\.InProceedings of the 17th ACM Conference on Recommender Systems,pp\. 528–539\.Cited by:[item 3](https://arxiv.org/html/2606.07599#A3.I2.i3.p1.1.1),[Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.6.1)\.
- Q\. Zhao, E\. Adeli, N\. Honnorat, T\. Leng, and K\. M\. Pohl \(2019\)Variational autoencoder for regression: application to brain aging analysis\.InInternational conference on medical image computing and computer\-assisted intervention,pp\. 823–831\.Cited by:[§2\.2](https://arxiv.org/html/2606.07599#S2.SS2.p1.1)\.
- X\. Zhao, R\. Ma, J\. Chen, W\. Zhao, P\. Yang, and Y\. Hu \(2025\)Multi\-granularity distribution modeling for video watch time prediction via exponential\-gaussian mixture network\.InProceedings of the Nineteenth ACM Conference on Recommender Systems,pp\. 309–318\.Cited by:[item 9](https://arxiv.org/html/2606.07599#A3.I2.i9.p1.1.1),[Table 4](https://arxiv.org/html/2606.07599#S5.T4.4.4.14.1)\.
- H\. Zhu, L\. Li, J\. Wu, S\. Zhao, G\. Ding, and G\. Shi \(2020\)Personalized image aesthetics assessment via meta\-learning with bilevel gradient optimization\.IEEE Transactions on Cybernetics\.Cited by:[Table 8](https://arxiv.org/html/2606.07599#A3.T8.9.9.17.1),[Table 9](https://arxiv.org/html/2606.07599#A3.T9.9.9.18.1)\.

## Appendix A Limitations of Existing Modeling Paradigms

This appendix provides a self-contained theoretical treatment of existing modeling paradigms for watch-time prediction (WTP) and includes full proofs for the propositions stated in the main paper. Throughout, $\mathbf{x}$ denotes the conditioning features, $y \in \mathbb{R}^{+}$ denotes watch time, and $P_{\text{data}}(\cdot)$ denotes the true (unknown) data-generating distribution.

### A\.1\. Mean Collapse in Conventional Regression

A common baseline is point\-wise regression trained with mean squared error \(MSE\):

$$\min_{f}\;\mathbb{E}_{(\mathbf{x},y)\sim P_{\text{data}}}\big[(y-f(\mathbf{x}))^{2}\big].$$

When $P_{\text{data}}(y\mid\mathbf{x})$ is multimodal, the MSE objective drives the predictor to the conditional mean, which can lie in a low-density region.

###### Proposition 0 \(Mean Collapse Effect\)\.

Assume that, for a fixed $\mathbf{x}$, $P_{\text{data}}(y\mid\mathbf{x})$ consists of multiple well-separated modes with separation $\Delta$ and within-mode scale $\sigma$. Let $f^{*}(\mathbf{x})=\mathbb{E}[y\mid\mathbf{x}]$ denote the MSE-optimal regressor. Then, under the disjoint-modes regime $\Delta/\sigma\to\infty$,

$$\lim_{\Delta/\sigma\to\infty}P_{\text{data}}\!\left(f^{*}(\mathbf{x})\mid\mathbf{x}\right)=0. \tag{21}$$

###### Proof\.

Step 1: MSE-optimal predictor. For any measurable $f$, the MSE risk is

$$\mathcal{R}(f)=\mathbb{E}\big[(y-f(\mathbf{x}))^{2}\big].$$

Conditioning on $\mathbf{x}$ and minimizing pointwise gives

$$f^{*}(\mathbf{x})=\arg\min_{a\in\mathbb{R}}\mathbb{E}\big[(y-a)^{2}\mid\mathbf{x}\big]=\mathbb{E}[y\mid\mathbf{x}].$$

The second equality follows from the identity $\mathbb{E}[(y-a)^{2}\mid\mathbf{x}]=\operatorname{Var}(y\mid\mathbf{x})+(\mathbb{E}[y\mid\mathbf{x}]-a)^{2}$, whose right-hand side is minimized at $a=\mathbb{E}[y\mid\mathbf{x}]$.

Step 2: Density at the conditional mean under separated modes. Consider the bimodal case for clarity:

$$P_{\text{data}}(y\mid\mathbf{x})=\frac{1}{2}\mathcal{N}(y;\mu_{1},\sigma^{2})+\frac{1}{2}\mathcal{N}(y;\mu_{2},\sigma^{2}),\quad\Delta\triangleq|\mu_{1}-\mu_{2}|.$$

Then

$$f^{*}(\mathbf{x})=\bar{\mu}\triangleq\frac{\mu_{1}+\mu_{2}}{2}.$$

The conditional density at $\bar{\mu}$ is

$$
\begin{aligned}
P_{\text{data}}(\bar{\mu}\mid\mathbf{x})
&=\frac{1}{2}\mathcal{N}(\bar{\mu};\mu_{1},\sigma^{2})+\frac{1}{2}\mathcal{N}(\bar{\mu};\mu_{2},\sigma^{2})\\
&=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\Delta^{2}}{8\sigma^{2}}\right).
\end{aligned}
\tag{22}
$$

Taking the limit $\Delta/\sigma\to\infty$ yields

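$$\lim_{\Delta/\sigma\to\infty}P_{\text{data}}(\bar{\mu}\mid\mathbf{x})=0,$$

which proves the claim in the bimodal case, assuming $\sigma$ stays bounded away from zero so that the prefactor $1/(\sqrt{2\pi}\,\sigma)$ cannot offset the exponential decay. The argument extends to multiple mutually separated modes whenever the conditional mean falls between modes rather than on one of them; its density then decays exponentially with the separation. A symmetric mixture with an odd number of modes, whose mean coincides with the central mode, is an exception. ∎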
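As a quick numerical sanity check of Eq. (22), the sketch below evaluates the equal-weight bimodal mixture at its mean and compares it with the closed form. The mode locations (`0` and `delta`), the unit scale `sigma=1.0`, and all function names are illustrative assumptions, not part of the paper.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def mixture_pdf(y, mu1, mu2, sigma):
    """Equal-weight two-component Gaussian mixture, as in Step 2."""
    return 0.5 * normal_pdf(y, mu1, sigma) + 0.5 * normal_pdf(y, mu2, sigma)


def density_at_mean(delta, sigma=1.0):
    """Mixture density at the MSE-optimal prediction (the midpoint of the modes)."""
    mu1, mu2 = 0.0, delta        # illustrative mode locations (assumption)
    mean = 0.5 * (mu1 + mu2)     # f*(x) = E[y | x]
    return mixture_pdf(mean, mu1, mu2, sigma)


def closed_form(delta, sigma=1.0):
    """Right-hand side of Eq. (22)."""
    return math.exp(-delta ** 2 / (8.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


for delta in (1.0, 4.0, 8.0, 16.0):
    at_mean = density_at_mean(delta)
    at_mode = mixture_pdf(0.0, 0.0, delta, 1.0)
    print(f"delta/sigma={delta:5.1f}  p(mean)={at_mean:.3e}  p(mode)={at_mode:.3e}")
```

With `sigma` held fixed, the density at the mean shrinks rapidly as `delta` grows, while the density at a mode settles near half the peak of a single Gaussian.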

### A\.2\. Limitations of Discretization

Ordinal regression methods (e.g., CREAD (Sun et al., [2024](https://arxiv.org/html/2606.07599#bib.bib1)) and SWaT (Yang et al., [2024](https://arxiv.org/html/2606.07599#bib.bib9))) discretize the continuous watch time $y$ using predefined thresholds

$$c_{1}<c_{2}<\dots<c_{M},$$

and transform $y$ into an ordered sequence of binary decisions:

$$\mathbf{B}^{m}=\mathbb{I}(y>c_{m}),\quad m=1,\dots,M.$$

This converts a continuous regression problem into multiple classification sub-tasks and can reduce sensitivity to long-tailed targets. However, it introduces: (i) hard quantization error, since thresholds break continuity and create boundary effects; and (ii) an interval independence assumption, since standard losses (e.g., cross-entropy) typically treat $\{\mathbf{B}^{m}\}$ as conditionally independent given $\mathbf{x}$, ignoring ordinal dependencies across intervals (Ma et al., [2024](https://arxiv.org/html/2606.07599#bib.bib2)).
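The two issues can be made concrete with a small sketch. It encodes a watch time into the binary decisions $\mathbf{B}^{m}$, decodes it with a hypothetical midpoint rule, and counts how many binary codes are ordinally valid. The threshold values, the `y_max` cap, and the midpoint decoder are assumptions chosen for illustration; they are not taken from CREAD or SWaT.

```python
from itertools import product


def binary_decisions(y, thresholds):
    """B^m = 1 if y > c_m else 0, for sorted thresholds c_1 < ... < c_M."""
    return [1 if y > c else 0 for c in thresholds]


def is_ordinally_consistent(bits):
    """Valid codes are non-increasing: failing one threshold means failing all larger ones."""
    return all(a >= b for a, b in zip(bits, bits[1:]))


def decode_midpoint(bits, thresholds, y_max):
    """Hypothetical decoder: midpoint of the interval selected by the code."""
    k = sum(bits)  # number of thresholds exceeded
    edges = [0.0] + list(thresholds) + [y_max]
    return 0.5 * (edges[k] + edges[k + 1])


thresholds = [5.0, 15.0, 30.0, 60.0]  # illustrative values, in seconds (assumption)

# (i) Quantization: nearby targets split across a boundary, distant ones merge.
for y in (14.9, 15.1, 29.0):
    bits = binary_decisions(y, thresholds)
    print(y, bits, decode_midpoint(bits, thresholds, y_max=120.0))

# (ii) Dependence: only M + 1 of the 2^M binary codes can actually occur.
valid = [b for b in product((0, 1), repeat=len(thresholds)) if is_ordinally_consistent(b)]
print(len(valid), "valid codes out of", 2 ** len(thresholds))
```

Targets 14.9 and 15.1 decode to different interval midpoints although they differ by 0.2, whereas 15.1 and 29.0 decode to the same value. A model that treats the $M$ decisions as independent places probability mass on codes that can never be observed.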

###### Proposition 0 \(Dependency Error in Discretized Modeling\)\.

Let $P_{\text{data}}(\mathbf{B}\mid\mathbf{x})$ denote the true joint distribution of interval decisions, where $\mathbf{B}=(\mathbf{B}^{1},\dots,\mathbf{B}^{M})$, and let the naive discretization model assume conditional independence:

$$P_{\text{naive}}(\mathbf{B}\mid\mathbf{x})=\prod_{m=1}^{M}P(\mathbf{B}^{m}\mid\mathbf{x}),$$

where each factor $P(\mathbf{B}^{m}\mid\mathbf{x})$ is the corresponding marginal of $P_{\text{data}}$. Then the modeling error measured by KL divergence admits the decomposition

$$
\begin{aligned}
D_{\mathrm{KL}}\!\left(P_{\text{data}}\,\|\,P_{\text{naive}}\right)
&=\sum_{m=1}^{M}\mathbb{E}_{\mathbf{B}^{<m}\sim P_{\text{data}}}\!\left[D_{\mathrm{KL}}\!\left(P(\mathbf{B}^{m}\mid\mathbf{x},\mathbf{B}^{<m})\,\|\,P(\mathbf{B}^{m}\mid\mathbf{x})\right)\right]\\
&=\sum_{m=1}^{M}I(\mathbf{B}^{m};\mathbf{B}^{<m}\mid\mathbf{x}),
\end{aligned}
\tag{23}
$$

where $\mathbf{B}^{<m}=(\mathbf{B}^{1},\dots,\mathbf{B}^{m-1})$ and $I(\mathbf{B}^{m};\mathbf{B}^{<m}\mid\mathbf{x})$ is the conditional mutual information.
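The decomposition in Eq. (23) can be checked numerically on a toy example. The sketch below fixes one $\mathbf{x}$ (so the conditioning is implicit), builds the joint law of $\mathbf{B}$ from assumed bucket probabilities, and compares the KL divergence to the product of true marginals with the sum of mutual-information terms. The bucket probabilities and all helper names are illustrative assumptions.

```python
import math
from collections import defaultdict


def joint_from_bucket_probs(bucket_probs):
    """Joint law of B = (B^1..B^M) when y exceeds exactly the first k thresholds
    with probability bucket_probs[k], i.e. B = (1,)*k + (0,)*(M-k)."""
    M = len(bucket_probs) - 1
    return {(1,) * k + (0,) * (M - k): p for k, p in enumerate(bucket_probs) if p > 0}


def marginalize(joint, n):
    """Marginal law of the first n coordinates (B^1..B^n)."""
    out = defaultdict(float)
    for bits, p in joint.items():
        out[bits[:n]] += p
    return out


def single_marginal(joint, m):
    """Marginal law of the coordinate at 0-based index m."""
    out = defaultdict(float)
    for bits, p in joint.items():
        out[bits[m]] += p
    return out


def kl_to_naive(joint):
    """D_KL(P_data || P_naive), with P_naive the product of the true marginals."""
    M = len(next(iter(joint)))
    marg = [single_marginal(joint, m) for m in range(M)]
    return sum(
        p * math.log(p / math.prod(marg[m][b] for m, b in enumerate(bits)))
        for bits, p in joint.items()
    )


def sum_conditional_mi(joint):
    """Sum over m of I(B^m ; B^{<m}), the right-hand side of Eq. (23)."""
    M = len(next(iter(joint)))
    total = 0.0
    for m in range(M):
        prefix = marginalize(joint, m)      # law of B^{<m}
        upto = marginalize(joint, m + 1)    # law of B^{<=m}
        marg = single_marginal(joint, m)    # law of B^m
        for bits, p in upto.items():
            cond = p / prefix[bits[:m]]     # P(B^m | B^{<m})
            total += p * math.log(cond / marg[bits[m]])
    return total


joint = joint_from_bucket_probs([0.1, 0.2, 0.3, 0.4])  # assumed toy distribution
print(kl_to_naive(joint), sum_conditional_mi(joint))
```

The two printed values agree up to floating-point error, and both are strictly positive because threshold codes are strongly dependent. For a joint law that really is a product of its marginals, both quantities are zero.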

###### Proof\.

We first write the true distribution using the chain rule:

$$P_{\text{data}}(\mathbf{B}\mid\mathbf{x})=\prod_{m=1}^{M}P(\mathbf{B}^{m}\mid\mathbf{x},\mathbf{B}^{<m}).$$

By definition,

$$\begin{aligned}
D_{\mathrm{KL}}\!\left(P_{\text{data}}\,\|\,P_{\text{naive}}\right)
&=\mathbb{E}_{\mathbf{B}\sim P_{\text{data}}}\left[\log\frac{P_{\text{data}}(\mathbf{B}\mid\mathbf{x})}{P_{\text{naive}}(\mathbf{B}\mid\mathbf{x})}\right]\\
&=\mathbb{E}_{\mathbf{B}\sim P_{\text{data}}}\left[\log\frac{\prod_{m=1}^{M}P(\mathbf{B}^{m}\mid\mathbf{x},\mathbf{B}^{<m})}{\prod_{m=1}^{M}P(\mathbf{B}^{m}\mid\mathbf{x})}\right]\\
&=\sum_{m=1}^{M}\mathbb{E}_{\mathbf{B}\sim P_{\text{data}}}\left[\log\frac{P(\mathbf{B}^{m}\mid\mathbf{x},\mathbf{B}^{<m})}{P(\mathbf{B}^{m}\mid\mathbf{x})}\right].
\end{aligned}\tag{24}$$

Taking expectation over $\mathbf{B}^{<m}$ explicitly yields

$$\sum_{m=1}^{M}\mathbb{E}_{\mathbf{B}^{<m}\sim P_{\text{data}}}\left[\sum_{b\in\{0,1\}}P(\mathbf{B}^{m}=b\mid\mathbf{x},\mathbf{B}^{<m})\log\frac{P(\mathbf{B}^{m}=b\mid\mathbf{x},\mathbf{B}^{<m})}{P(\mathbf{B}^{m}=b\mid\mathbf{x})}\right],\tag{25}$$

which is exactly

$$\sum_{m=1}^{M}\mathbb{E}_{\mathbf{B}^{<m}\sim P_{\text{data}}}\left[D_{\mathrm{KL}}\!\left(P(\mathbf{B}^{m}\mid\mathbf{x},\mathbf{B}^{<m})\,\|\,P(\mathbf{B}^{m}\mid\mathbf{x})\right)\right].$$

Finally, by the definition of conditional mutual information,

$$I(\mathbf{B}^{m};\mathbf{B}^{<m}\mid\mathbf{x})\triangleq\mathbb{E}_{\mathbf{B}^{<m}\sim P_{\text{data}}}\left[D_{\mathrm{KL}}\!\left(P(\mathbf{B}^{m}\mid\mathbf{x},\mathbf{B}^{<m})\,\|\,P(\mathbf{B}^{m}\mid\mathbf{x})\right)\right],\tag{26}$$

so summing over $m$ gives the stated equality. ∎
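The identity can be checked numerically for a fixed $\mathbf{x}$. The sketch below is only an illustration: it assumes a toy label distribution `p_y` over $M+1$ ordered bins and hypothetical thresholds $c_m$ placed between consecutive bins, builds the induced joint law of $(\mathbf{B}^{1},\dots,\mathbf{B}^{M})$, and compares the KL divergence to the product of marginals with the sum of mutual-information terms (computed from entropies). All function names are illustrative.

```python
import numpy as np

def thermometer_joint(p_y):
    """Joint pmf of B = (B^1..B^M), B^m = 1[y > c_m], for y in {0..M} with pmf p_y.
    Thresholds are assumed to sit between consecutive integer labels."""
    M = len(p_y) - 1
    joint = {}
    for y, p in enumerate(p_y):
        b = tuple(1 if y > m else 0 for m in range(M))
        joint[b] = joint.get(b, 0.0) + p
    return joint, M

def marginal(joint, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for b, p in joint.items():
        key = tuple(b[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def kl_to_product(joint, M):
    """D_KL(P_data || prod_m P(B^m)) for a fixed x."""
    singles = [marginal(joint, (m,)) for m in range(M)]
    total = 0.0
    for b, p in joint.items():
        if p > 0:
            q = np.prod([singles[m][(b[m],)] for m in range(M)])
            total += p * np.log(p / q)
    return total

def entropy(pmf):
    return -sum(p * np.log(p) for p in pmf.values() if p > 0)

def sum_conditional_mi(joint, M):
    """sum_m I(B^m; B^{<m}), using I = H(B^m) + H(B^{<m}) - H(B^{<=m})."""
    total = 0.0
    for m in range(M):
        prev = tuple(range(m))
        total += (entropy(marginal(joint, (m,))) + entropy(marginal(joint, prev))
                  - entropy(marginal(joint, prev + (m,))))
    return total

joint, M = thermometer_joint([0.1, 0.2, 0.3, 0.4])
print(kl_to_product(joint, M), sum_conditional_mi(joint, M))
```

Because the thermometer code is monotone, the interval decisions are strongly dependent, so both quantities are strictly positive in this toy case.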

### A.3. Proof of AR Limitations

We provide a bias–variance decomposition of the expected squared regression error for tokenized AR models (Ma et al., [2024](https://arxiv.org/html/2606.07599#bib.bib2)), and use it to derive an upper bound.

##### Setup and Notation\.

Consider a continuous target reconstructed from a sequence of value tokens: $y\triangleq\sum_{t=1}^{T}\phi(s^{t})$, where $s^{t}$ denotes the ground-truth token at step $t$ and $\phi(\cdot)$ maps a token to its numeric value. The AR model predicts a token sequence $\hat{s}^{1:T}$, yielding the prediction $\hat{y}\triangleq\sum_{t=1}^{T}\phi(\hat{s}^{t})$.

Define the numeric token values $C^{t}\triangleq\phi(s^{t})$ and $\hat{C}^{t}\triangleq\phi(\hat{s}^{t})$, and the step-wise prediction error $\Delta_{t}\triangleq\hat{C}^{t}-C^{t}$. By assumption, all token values are bounded: $C^{t},\hat{C}^{t}\in[w_{\min},w_{\max}]$, and the step-wise bias satisfies $\lvert\mathbb{E}[\Delta_{t}]\rvert\leq B$ for all $t$.

##### Error Decomposition\.

The squared regression error can be written as

$$\mathbb{E}\!\left[(\hat{y}-y)^{2}\right]=\mathbb{E}\!\left[\left(\sum_{t=1}^{T}\Delta_{t}\right)^{2}\right].\tag{27}$$

Using the bias–variance decomposition, we obtain

$$\mathbb{E}\!\left[\left(\sum_{t=1}^{T}\Delta_{t}\right)^{2}\right]=\left(\sum_{t=1}^{T}\mathbb{E}[\Delta_{t}]\right)^{2}+\mathbb{V}\!\left(\sum_{t=1}^{T}\Delta_{t}\right).\tag{28}$$

##### Bias Term\.

Let $b_{t}\triangleq\mathbb{E}[\Delta_{t}]$. Since $|b_{t}|\leq B$, we have

$$\left(\sum_{t=1}^{T}b_{t}\right)^{2}\leq T^{2}B^{2}.\tag{29}$$

##### Variance Term\.

The variance term expands as

$$\mathbb{V}\!\left(\sum_{t=1}^{T}\Delta_{t}\right)=\sum_{t=1}^{T}\mathbb{V}(\Delta_{t})+\sum_{t\neq t'}\mathrm{Cov}(\Delta_{t},\Delta_{t'}).\tag{30}$$

By the Cauchy–Schwarz inequality, $\mathrm{Cov}(\Delta_{t},\Delta_{t'})\leq\sqrt{\mathbb{V}(\Delta_{t})\,\mathbb{V}(\Delta_{t'})}\leq\max_{s}\mathbb{V}(\Delta_{s})$. Applying this to each of the $T(T-1)$ ordered pairs $t\neq t'$ yields

$$\sum_{t\neq t'}\mathrm{Cov}(\Delta_{t},\Delta_{t'})\leq T(T-1)\max_{t}\mathbb{V}(\Delta_{t}).\tag{31}$$
Since $\Delta_{t}=\hat{C}^{t}-C^{t}$ and both $\hat{C}^{t}$ and $C^{t}$ lie in $[w_{\min},w_{\max}]$, the difference $\Delta_{t}$ lies in $[-(w_{\max}-w_{\min}),\,w_{\max}-w_{\min}]$, an interval of length $2(w_{\max}-w_{\min})$. Popoviciu's inequality therefore gives

$$\mathbb{V}(\Delta_{t})\leq\frac{\bigl(2(w_{\max}-w_{\min})\bigr)^{2}}{4}=(w_{\max}-w_{\min})^{2}.\tag{32}$$

Therefore,

$$\mathbb{V}\!\left(\sum_{t=1}^{T}\Delta_{t}\right)\leq T^{2}\,(w_{\max}-w_{\min})^{2}.\tag{33}$$

##### Final Bound.

Combining the bias and variance bounds, we obtain

$$\mathbb{E}\!\left[(\hat{y}-y)^{2}\right]\leq T^{2}B^{2}+T^{2}(w_{\max}-w_{\min})^{2}.\tag{34}$$

∎
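The decomposition in (28) and the variance expansion in (30) are exact identities, which can be confirmed on simulated step-wise errors. The following is a minimal sketch under assumed, synthetic $\Delta_t$ (correlated Gaussian errors with a small common bias); the function name and the data-generating choices are illustrative and not part of the original analysis.

```python
import numpy as np

def decompose_sum_error(delta):
    """delta: array (n_samples, T) of step-wise errors Delta_t.
    Returns (mse, bias_sq, var_sum, var_from_cov) using population (ddof=0) moments."""
    total = delta.sum(axis=1)                       # sum_t Delta_t per sample
    mse = np.mean(total ** 2)                       # E[(sum_t Delta_t)^2]
    bias_sq = delta.mean(axis=0).sum() ** 2         # (sum_t E[Delta_t])^2
    var_sum = total.var()                           # V(sum_t Delta_t)
    cov = np.atleast_2d(np.cov(delta, rowvar=False, bias=True))
    diag = np.trace(cov)                            # sum_t V(Delta_t)
    off_diag = cov.sum() - diag                     # sum_{t != t'} Cov(Delta_t, Delta_t')
    return mse, bias_sq, var_sum, diag + off_diag

rng = np.random.default_rng(0)
shared = rng.standard_normal((5000, 1))             # common factor makes steps correlated
delta = 0.1 + 0.5 * shared + rng.standard_normal((5000, 4))
mse, bias_sq, var_sum, var_from_cov = decompose_sum_error(delta)
print(mse, bias_sq + var_sum, var_from_cov)
```

With positively correlated errors the off-diagonal covariance sum is a large share of the total variance, which is the mechanism behind the $T^{2}$ scaling of the bound.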

##### Vocabulary\-induced trade\-off\.

Both the sequence length $T$ and the step-wise bias bound $B$ are induced by the vocabulary design. Let $\mathcal{V}$ denote the value-token vocabulary and $\phi(\mathcal{V})\subset[w_{\min},w_{\max}]$ its numeric range. A finer-grained vocabulary yields smaller token magnitudes and typically requires a longer sequence to represent the same target, leading to larger $T$. Conversely, a coarser vocabulary reduces $T$ but increases discretization error at each step, enlarging the attainable bias bound $B$. In particular, since $\Delta_{t}=\phi(\hat{s}^{t})-\phi(s^{t})$ and $\phi(s^{t}),\phi(\hat{s}^{t})\in[w_{\min},w_{\max}]$, we have $|\Delta_{t}|\leq w_{\max}-w_{\min}$ and hence $|\mathbb{E}[\Delta_{t}]|\leq w_{\max}-w_{\min}$, so the bias bound can always be chosen with $B\leq w_{\max}-w_{\min}$. Therefore, tokenized AR regression is intrinsically constrained by a vocabulary-dependent trade-off between $T$ and $B$, which directly controls the error bound.

## Appendix B Theoretical Analysis

This section provides a rigorous theoretical analysis for our spatio-temporal multi-head diffusion-style ordinal regression formulation under explicit assumptions. We focus on statements with step-by-step derivations for key equations.

### B.1. Notation and Symbols

Table [7](https://arxiv.org/html/2606.07599#A2.T7) summarizes key notation.

Table 7. Summary of key mathematical notation.

| Symbol | Description |
| --- | --- |
| $x,y$ | Input features and ordinal label ($y\in\{1,\dots,K\}$) |
| $z_{0}=\psi(y)$ | Continuous target after encoding $\psi:\{1,\dots,K\}\to[0,1]$ |
| $\mathcal{T}$ | Discrete diffusion time points ($\{t_{1},\dots,t_{M}\}$) |
| $\bar{\alpha}_{t}$ | Noise schedule at time $t$ |
| $\epsilon_{\Theta}(\cdot)$ | Noise predictor (parameters $\Theta$) |
| $\epsilon$ | Standard Gaussian noise ($\mathcal{N}(0,1)$) |
| $\gamma$ | Minimum gap between ordinal thresholds |
| $\rho_{m},\eta_{m}$ | Contraction factor and error term at step $m$ |
| $r_{\mathrm{eff}}(U)$ | Effective rank of representation matrix $U$ |

### B.2. Setup: Ordinal Regression and Diffusion Training

#### B\.2\.1\.Data and ordinal regression

Let $(x,y)\sim\mathcal{D}$, where $x\in\mathcal{X}\subseteq\mathbb{R}^{d}$ and $y\in\{1,2,\dots,K\}$ is an ordinal label. We assume an encoding $\psi:\{1,\dots,K\}\to[0,1]$ and define a continuous target,

$$z_{0}=\psi(y).\tag{35}$$

#### B.2.2. Forward diffusion and noise-prediction loss

Fix a discrete set of time points $\mathcal{T}=\{t_{1},\dots,t_{M}\}$ with $0<t_{1}<\cdots<t_{M}$. For each $t\in\mathcal{T}$, define the forward process,

$$z_{t}=\sqrt{\bar{\alpha}_{t}}\,z_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon,\qquad\epsilon\sim\mathcal{N}(0,1),\tag{36}$$

where $\bar{\alpha}_{t}\in(0,1)$ is a prescribed noise schedule.

###### Lemma 1 (Gaussian reparameterization).

If $Z\sim\mathcal{N}(\mu,\sigma^{2})$ with $\sigma>0$, and $\epsilon\sim\mathcal{N}(0,1)$, then $\tilde{Z}\triangleq\mu+\sigma\epsilon\sim\mathcal{N}(\mu,\sigma^{2})$.

###### Proof.

For any $u\in\mathbb{R}$,

$$\mathbb{P}(\tilde{Z}\leq u)=\mathbb{P}(\mu+\sigma\epsilon\leq u)=\mathbb{P}\!\left(\epsilon\leq\frac{u-\mu}{\sigma}\right)=\Phi\!\left(\frac{u-\mu}{\sigma}\right),$$

where $\Phi$ is the standard normal CDF. This matches the CDF of $\mathcal{N}(\mu,\sigma^{2})$. ∎
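As an illustration of the forward process (36) and the reparameterization lemma, the sketch below draws $z_t$ for a fixed $z_0$ and compares the sample mean and variance with $\sqrt{\bar{\alpha}_t}\,z_0$ and $1-\bar{\alpha}_t$. The encoding `encode_label` (bin centres of $K$ equal-width intervals) is a hypothetical choice of $\psi$, since the text only requires $\psi:\{1,\dots,K\}\to[0,1]$.

```python
import numpy as np

def encode_label(y, K):
    """Hypothetical psi: map y in {1..K} to the centre of the y-th of K equal bins in [0, 1]."""
    return (np.asarray(y, dtype=float) - 0.5) / K

def forward_diffuse(z0, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps with eps ~ N(0, 1).
    Returns both z_t and the noise eps that was drawn."""
    z0 = np.asarray(z0, dtype=float)
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    return z_t, eps

rng = np.random.default_rng(0)
z0 = np.full(200_000, encode_label(2, K=5))      # every sample has label y = 2
z_t, eps = forward_diffuse(z0, alpha_bar=0.64, rng=rng)
print(z_t.mean(), np.sqrt(0.64) * z0[0])         # sample mean vs. sqrt(alpha_bar) * z0
print(z_t.var(), 1.0 - 0.64)                     # sample variance vs. 1 - alpha_bar
```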

We use the standard noise-prediction MSE objective. Given a predictor $\epsilon_{\Theta}(\cdot)$, define the per-time expected loss,

$$\mathcal{L}_{m}(\Theta)\triangleq\mathbb{E}_{(x,z_{0})\sim\mathcal{D}}\;\mathbb{E}_{\epsilon\sim\mathcal{N}(0,1)}\Bigl[\bigl(\epsilon-\epsilon_{\Theta}(z_{t_{m}},t_{m},x)\bigr)^{2}\Bigr],\tag{37}$$

and the weighted multi-time loss,

$$\mathcal{L}(\Theta)\triangleq\sum_{m=1}^{M}\lambda_{m}\,\mathcal{L}_{m}(\Theta),\qquad\lambda_{m}>0.\tag{38}$$

#### B.2.3. Architectures: SH, MH, and the SH-1T baseline

##### Shared-head (SH).

Parameters are $\Theta_{\mathrm{SH}}=(\phi,\theta)$ with encoder $E_{\phi}$ and a single decoder $g_{\theta}$:

$$\epsilon_{\Theta_{\mathrm{SH}}}(z_{t},t,x)=g_{\theta}(E_{\phi}(x),z_{t},t).$$

##### Multi-head over discrete times (MH-MT).

Parameters are $\Theta_{\mathrm{MH}}=(\phi,\theta_{1},\dots,\theta_{M})$, and each time $t_{m}$ has its own head:

$$\epsilon_{\Theta_{\mathrm{MH}}}(z_{t_{m}},t_{m},x)=g_{\theta_{m}}(E_{\phi}(x),z_{t_{m}},t_{m}).$$

###### Proposition 2 (No cross-time mixing in head gradients).

In MH-MT, the total loss decomposes as

$$\mathcal{L}(\phi,\theta_{1},\dots,\theta_{M})=\sum_{m=1}^{M}\lambda_{m}\mathcal{L}_{m}(\phi,\theta_{m}).\tag{39}$$

Hence, for any $k$,

$$\nabla_{\theta_{k}}\mathcal{L}(\phi,\theta_{1},\dots,\theta_{M})=\lambda_{k}\nabla_{\theta_{k}}\mathcal{L}_{k}(\phi,\theta_{k}),$$

and for any $m\neq k$, $\nabla_{\theta_{k}}\mathcal{L}_{m}(\phi,\theta_{m})=0$.

###### Proof\.

By linearity of differentiation,

$$\nabla_{\theta_{k}}\mathcal{L}=\nabla_{\theta_{k}}\sum_{m=1}^{M}\lambda_{m}\mathcal{L}_{m}=\sum_{m=1}^{M}\lambda_{m}\nabla_{\theta_{k}}\mathcal{L}_{m}.$$

For $m\neq k$, $\mathcal{L}_{m}(\phi,\theta_{m})$ does not depend on $\theta_{k}$, hence $\nabla_{\theta_{k}}\mathcal{L}_{m}=0$. The only remaining term is $m=k$. ∎
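The separation of head gradients can be seen in a toy setting. The sketch below assumes linear heads acting on fixed per-time feature matrices (standing in for the encoder output $E_{\phi}(x)$ together with $z_{t_m}$); this is a simplification for illustration, not the paper's architecture. It compares a finite-difference gradient of the total loss with respect to one head against $\lambda_k\nabla_{\theta_k}\mathcal{L}_k$.

```python
import numpy as np

def per_time_loss(theta_m, feats_m, eps_m):
    """L_m: mean squared error of a linear head theta_m on the features for time t_m."""
    return np.mean((eps_m - feats_m @ theta_m) ** 2)

def total_loss(thetas, feats, eps, lam):
    """L = sum_m lam_m * L_m; each term only touches its own head theta_m."""
    return sum(l * per_time_loss(th, f, e) for l, th, f, e in zip(lam, thetas, feats, eps))

def grad_head(theta_m, feats_m, eps_m):
    """Analytic gradient of L_m with respect to theta_m."""
    resid = feats_m @ theta_m - eps_m
    return 2.0 * feats_m.T @ resid / len(eps_m)

def numeric_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        g[i] = (f(theta + step) - f(theta - step)) / (2 * h)
    return g

rng = np.random.default_rng(0)
M, n, p = 3, 40, 4
feats = [rng.standard_normal((n, p)) for _ in range(M)]
eps = [rng.standard_normal(n) for _ in range(M)]
thetas = [rng.standard_normal(p) for _ in range(M)]
lam = [0.5, 1.0, 2.0]

k = 1  # differentiate the total loss with respect to head k only
f_k = lambda th: total_loss(thetas[:k] + [th] + thetas[k + 1:], feats, eps, lam)
print(numeric_grad(f_k, thetas[k]))
print(lam[k] * grad_head(thetas[k], feats[k], eps[k]))
```

The two printed vectors agree up to finite-difference error, and changing `thetas[k]` leaves every other `per_time_loss` unchanged.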

##### Baseline: single\-head single\-time \(SH\-1T\)\.

Fix a time $t_{\star}\in\mathcal{T}$. SH-1T only trains and predicts at $t_{\star}$:

$$\epsilon_{\Theta_{\mathrm{1T}}}(z_{t_{\star}},t_{\star},x)=g_{\theta}(E_{\phi}(x),z_{t_{\star}},t_{\star}),$$

with objective

$$\mathcal{L}_{\mathrm{1T}}(\Theta_{\mathrm{1T}})\triangleq\mathbb{E}\Bigl[\bigl(\epsilon-\epsilon_{\Theta_{\mathrm{1T}}}(z_{t_{\star}},t_{\star},x)\bigr)^{2}\Bigr].$$

### B\.3\.Assumptions

###### Assumption 1 \(Smoothness and exchanging gradient/expectation\)\.

Each $\mathcal{L}_{m}(\Theta)$ is differentiable and $\nabla\mathcal{L}_{m}(\Theta)=\mathbb{E}[\nabla\ell_{m}(\Theta;\xi)]$, where $\ell_{m}$ is the per-sample loss and $\xi$ denotes the sampling randomness.

###### Assumption 2 (Gradient interference condition).

In SH, define $g_{m}\triangleq\nabla_{\theta}\mathcal{L}_{m}(\phi,\theta)$. A sufficient "conflict" condition is that for some pairs $(i,j)$,

$$\langle g_{i},g_{j}\rangle\leq-c_{ij},\qquad c_{ij}\geq 0.$$

This is checkable by logging cosine similarity during training.
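One way to monitor this condition is to flatten the per-time gradients of the shared head and log their pairwise cosine similarities. The helper below is a minimal sketch: the function names and the `(M, P)` array layout are assumptions, and in practice the rows would come from whatever autodiff framework computes $g_m$.

```python
import numpy as np

def pairwise_cosine(grads, eps=1e-12):
    """grads: array (M, P); row m is the flattened gradient g_m.
    Returns the (M, M) matrix of cosine similarities."""
    g = np.asarray(grads, dtype=float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    unit = g / np.maximum(norms, eps)        # guard against zero gradients
    return unit @ unit.T

def conflicting_pairs(grads, tol=0.0):
    """Index pairs (i, j), i < j, whose inner product <g_i, g_j> is below -tol."""
    g = np.asarray(grads, dtype=float)
    inner = g @ g.T
    M = len(g)
    return [(i, j) for i in range(M) for j in range(i + 1, M) if inner[i, j] < -tol]

grads = [[1.0, 0.0], [-1.0, 0.1], [0.0, 1.0]]
print(pairwise_cosine(grads))
print(conflicting_pairs(grads))
```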

###### Assumption 3 \(Coarse\-to\-fine contraction form\)\.

A refinement sequence $\hat{z}_{M},\hat{z}_{M-1},\dots,\hat{z}_{0}$ exists such that for all samples,

$$|\hat{z}_{m-1}-z_{0}|\leq\rho_{m}|\hat{z}_{m}-z_{0}|+\eta_{m},\qquad m=1,2,\dots,M,$$

where $\rho_{m}\in[0,1)$ and $\eta_{m}\geq 0$.

###### Assumption 4 (Ordinal decoding thresholds).

Let $0=\tau_{0}<\tau_{1}<\cdots<\tau_{K-1}<\tau_{K}=1$, and decode $\hat{y}=k$ iff $\hat{z}\in[\tau_{k-1},\tau_{k})$. Define the minimum gap $\gamma\triangleq\min_{1\leq k\leq K}(\tau_{k}-\tau_{k-1})>0$.

###### Assumption 5 (Error-covariance bounds for multi-time fusion).

For each time point $t_{m}$, define the $z_{0}$ regression error $e_{m}\triangleq\hat{z}_{0}^{(m)}-z_{0}$. Assume

$$\mathbb{E}[e_{m}^{2}]\leq v_{m},\qquad|\mathbb{E}[e_{i}e_{j}]|\leq b_{ij}\quad(i\neq j),$$

for some $v_{m}\geq 0$ and $b_{ij}\geq 0$.

### B.4. Coarse-to-Fine Refinement Error Bound

###### Theorem 3 (Explicit error bound for contractive refinement).

Under Assumption [3](https://arxiv.org/html/2606.07599#Thmassumption3), if $|\hat{z}_{m-1}-z_{0}|\leq\rho_{m}|\hat{z}_{m}-z_{0}|+\eta_{m}$ for $m=1,\dots,M$, then

$$|\hat{z}_{0}-z_{0}|\leq\left(\prod_{m=1}^{M}\rho_{m}\right)|\hat{z}_{M}-z_{0}|+\sum_{k=1}^{M}\left(\eta_{k}\prod_{j=1}^{k-1}\rho_{j}\right),$$

where the empty product is defined as $1$.

###### Proof\.

We prove the claim by induction on $M$.

Base case ($M=1$): Directly from Assumption [3](https://arxiv.org/html/2606.07599#Thmassumption3) with $m=1$:

$$|\hat{z}_{0}-z_{0}|\leq\rho_{1}|\hat{z}_{1}-z_{0}|+\eta_{1}.\tag{40}$$

Inductive step: Assume the bound holds for $M-1$ steps:

$$|\hat{z}_{0}-z_{0}|\leq\left(\prod_{m=1}^{M-1}\rho_{m}\right)|\hat{z}_{M-1}-z_{0}|+\sum_{k=1}^{M-1}\left(\eta_{k}\prod_{j=1}^{k-1}\rho_{j}\right).\tag{41}$$

From Assumption [3](https://arxiv.org/html/2606.07599#Thmassumption3) with $m=M$:

$$|\hat{z}_{M-1}-z_{0}|\leq\rho_{M}|\hat{z}_{M}-z_{0}|+\eta_{M}.\tag{42}$$

Substituting ([42](https://arxiv.org/html/2606.07599#A2.E42)) into ([41](https://arxiv.org/html/2606.07599#A2.E41)):

$$\begin{aligned}
|\hat{z}_{0}-z_{0}|
&\leq\left(\prod_{m=1}^{M-1}\rho_{m}\right)\bigl(\rho_{M}|\hat{z}_{M}-z_{0}|+\eta_{M}\bigr)+\sum_{k=1}^{M-1}\left(\eta_{k}\prod_{j=1}^{k-1}\rho_{j}\right)\\
&=\left(\prod_{m=1}^{M}\rho_{m}\right)|\hat{z}_{M}-z_{0}|+\eta_{M}\prod_{j=1}^{M-1}\rho_{j}+\sum_{k=1}^{M-1}\left(\eta_{k}\prod_{j=1}^{k-1}\rho_{j}\right)\\
&=\left(\prod_{m=1}^{M}\rho_{m}\right)|\hat{z}_{M}-z_{0}|+\sum_{k=1}^{M}\left(\eta_{k}\prod_{j=1}^{k-1}\rho_{j}\right).
\end{aligned}\tag{43}$$

This completes the induction. ∎
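The bound can be evaluated directly and compared with the worst case of the recursion, in which every step of the contraction holds with equality. The sketch below assumes illustrative values for $\rho_m$, $\eta_m$ and the initial error $|\hat{z}_M-z_0|$; list position `m-1` stores the quantity for step $m$.

```python
def refinement_bound(rho, eta, init_err):
    """Closed-form bound: (prod_m rho_m) * init_err + sum_k eta_k * prod_{j<k} rho_j.
    rho[m-1], eta[m-1] hold rho_m, eta_m for m = 1..M; init_err is |z_hat_M - z_0|."""
    additive, prefix = 0.0, 1.0        # prefix = prod_{j<k} rho_j, empty product = 1
    for r, e in zip(rho, eta):
        additive += e * prefix
        prefix *= r
    return prefix * init_err + additive

def worst_case_error(rho, eta, init_err):
    """Iterate err_{m-1} = rho_m * err_m + eta_m from m = M down to m = 1."""
    err = init_err
    for r, e in zip(reversed(rho), reversed(eta)):
        err = r * err + e
    return err

rho = [0.5, 0.8, 0.9]
eta = [0.01, 0.02, 0.03]
print(refinement_bound(rho, eta, 1.0), worst_case_error(rho, eta, 1.0))
```

Both calls return the same value, which shows that the bound is attained when each contraction step is tight. Note that $\eta_k$ for late refinement steps (small $k$) is damped by fewer factors $\rho_j$, so errors introduced near the end of refinement dominate the additive term.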

### B.5. Core Lemmas and Theorems

#### B.5.1. Cross-term structure: SH vs MH

###### Proposition 4 (Gradient interference: SH suffers potential conflict, MH avoids it by design).

Consider the multi-time objective $\mathcal{L}=\sum_{m=1}^{M}\lambda_{m}\mathcal{L}_{m}$.

1. (a) Shared-head (SH). The total gradient w.r.t. the shared parameters $\theta$ is
   $$g_{\mathrm{SH}}=\sum_{m=1}^{M}\lambda_{m}g_{m},\quad g_{m}\triangleq\nabla_{\theta}\mathcal{L}_{m}.$$
   Expanding the squared norm yields
   $$\|g_{\mathrm{SH}}\|^{2}=\sum_{m=1}^{M}\lambda_{m}^{2}\|g_{m}\|^{2}+2\sum_{1\leq i<j\leq M}\lambda_{i}\lambda_{j}\langle g_{i},g_{j}\rangle.\tag{44}$$
   If $\langle g_{i},g_{j}\rangle\leq-c_{ij}$ with $c_{ij}>0$ for some pair $(i,j)$ (Assumption [2](https://arxiv.org/html/2606.07599#Thmassumption2)), and the remaining cross terms sum to a non-positive value (which holds automatically when $M=2$), then
   $$\|g_{\mathrm{SH}}\|^{2}\leq\sum_{m=1}^{M}\lambda_{m}^{2}\|g_{m}\|^{2}-2\lambda_{i}\lambda_{j}c_{ij}<\sum_{m=1}^{M}\lambda_{m}^{2}\|g_{m}\|^{2}.\tag{45}$$
   Thus, conflicting gradients weaken the combined update magnitude.
2. (b) Multi-head (MH-MT). With independent parameters $\theta_{1},\dots,\theta_{M}$, the gradient is a concatenation of per-head blocks:
   $$\nabla_{(\theta_{1},\dots,\theta_{M})}\mathcal{L}=(\lambda_{1}g_{1},\,\lambda_{2}g_{2},\,\dots,\,\lambda_{M}g_{M}),\quad g_{m}\triangleq\nabla_{\theta_{m}}\mathcal{L}_{m}.$$
   Consequently,
   $$\|\nabla\mathcal{L}\|^{2}=\sum_{m=1}^{M}\lambda_{m}^{2}\|g_{m}\|^{2},\tag{46}$$
   with no cross terms (Proposition [2](https://arxiv.org/html/2606.07599#A2.Thmtheorem2)).

###### Proof\.

Equation ([44](https://arxiv.org/html/2606.07599#A2.E44)) follows from expanding $\|\sum_{m}\lambda_{m}g_{m}\|^{2}=\langle\sum_{m}\lambda_{m}g_{m},\sum_{n}\lambda_{n}g_{n}\rangle$ and separating diagonal ($m=n$) and off-diagonal ($m\neq n$) terms. Equation ([45](https://arxiv.org/html/2606.07599#A2.E45)) substitutes $\langle g_{i},g_{j}\rangle\leq-c_{ij}$ into the off-diagonal sum. Equation ([46](https://arxiv.org/html/2606.07599#A2.E46)) holds because the squared Euclidean norm of a concatenated vector equals the sum of the squared norms of its blocks. ∎
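The difference between the two squared norms is exactly the cross-term sum in (44). The sketch below makes this concrete for small hand-picked vectors. As a simplification, the same rows are used as stand-ins for the per-time gradients in both architectures, although in MH-MT each $g_m$ lives in its own parameter block; all names are illustrative.

```python
import numpy as np

def sh_grad_sq_norm(grads, lam):
    """|| sum_m lam_m g_m ||^2: shared head, gradients are added in one parameter space."""
    g, lam = np.asarray(grads, float), np.asarray(lam, float)
    return float(np.sum((lam[:, None] * g).sum(axis=0) ** 2))

def mh_grad_sq_norm(grads, lam):
    """|| (lam_1 g_1, ..., lam_M g_M) ||^2: separate heads, gradients are concatenated."""
    g, lam = np.asarray(grads, float), np.asarray(lam, float)
    return float(np.sum((lam[:, None] * g) ** 2))

def cross_terms(grads, lam):
    """2 * sum_{i<j} lam_i lam_j <g_i, g_j>, i.e. the off-diagonal part of the Gram matrix."""
    g, lam = np.asarray(grads, float), np.asarray(lam, float)
    w = lam[:, None] * g
    gram = w @ w.T
    return float(gram.sum() - np.trace(gram))

grads = [[1.0, 0.0], [-1.0, 0.5]]   # <g_1, g_2> = -1, a conflicting pair
lam = [1.0, 1.0]
print(sh_grad_sq_norm(grads, lam), mh_grad_sq_norm(grads, lam), cross_terms(grads, lam))
```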

#### B.5.2. From noise prediction to ordinal error bounds

###### Theorem 5 (Ordinal error upper bound from regression MSE).

Assume the threshold rule from Assumption [4](https://arxiv.org/html/2606.07599#Thmassumption4) and that $z_{0}$ lies at least $\gamma/2$ away from the two nearest thresholds of its true class interval. Then

$$\{\hat{y}\neq y\}\subseteq\{|\hat{z}-z_{0}|\geq\gamma/2\},$$

hence

$$\mathbb{P}(\hat{y}\neq y)\leq\mathbb{P}(|\hat{z}-z_{0}|\geq\gamma/2)\leq\frac{4\,\mathbb{E}[(\hat{z}-z_{0})^{2}]}{\gamma^{2}}.\tag{47}$$

###### Proof\.

If $\hat{y}\neq y$, then $\hat{z}\notin[\tau_{y-1},\tau_{y})$. Two cases:

- $\hat{z}<\tau_{y-1}$: $z_{0}-\hat{z}\geq z_{0}-\tau_{y-1}\geq\gamma/2$,
- $\hat{z}\geq\tau_{y}$: $\hat{z}-z_{0}\geq\tau_{y}-z_{0}\geq\gamma/2$.

Thus $|\hat{z}-z_{0}|\geq\gamma/2$, proving $\{\hat{y}\neq y\}\subseteq\{|\hat{z}-z_{0}|\geq\gamma/2\}$. Applying Markov's inequality to $U=(\hat{z}-z_{0})^{2}$ with threshold $(\gamma/2)^{2}$ yields ([47](https://arxiv.org/html/2606.07599#A2.E47)). ∎
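A small simulation illustrates how loose or tight (47) is in a simple case. The sketch assumes equally spaced thresholds $\tau_k=k/K$ (so $\gamma=1/K$), targets $z_0$ at bin centres (exactly $\gamma/2$ from both neighbouring thresholds), and Gaussian regression errors with standard deviation `sigma`; these choices and the function names are hypothetical.

```python
import numpy as np

def decode(z_hat, K):
    """Threshold decoding with tau_k = k / K: y_hat = k iff z_hat in [tau_{k-1}, tau_k).
    Values outside [0, 1) map to labels outside {1..K} and therefore count as errors."""
    return np.floor(np.asarray(z_hat) * K).astype(int) + 1

def error_rate_and_bound(K, sigma, n, rng):
    """Empirical P(y_hat != y) and the empirical version of 4 E[(z_hat - z0)^2] / gamma^2."""
    gamma = 1.0 / K
    y = rng.integers(1, K + 1, size=n)
    z0 = (y - 0.5) / K                              # bin centres
    z_hat = z0 + sigma * rng.standard_normal(n)     # noisy regression output
    err_rate = float(np.mean(decode(z_hat, K) != y))
    bound = float(4.0 * np.mean((z_hat - z0) ** 2) / gamma ** 2)
    return err_rate, bound

rng = np.random.default_rng(0)
print(error_rate_and_bound(K=5, sigma=0.05, n=100_000, rng=rng))
```

Since the second step relies only on Markov's inequality, the bound is usually far from tight for light-tailed errors; its value is that it holds without distributional assumptions on $\hat{z}-z_0$.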

###### Definition 6 ($z_{0}$ estimator from noise prediction).

For any $t\in\mathcal{T}$, given $z_{t}$ and a noise predictor $\epsilon_{\Theta}(z_{t},t,x)$, define

$$\hat{z}_{0}(t)\triangleq\frac{z_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\Theta}(z_{t},t,x)}{\sqrt{\bar{\alpha}_{t}}}.\tag{48}$$

###### Theorem 7 (Exact relation: noise error $\Rightarrow$ $z_{0}$ regression error).

With $z_{t}=\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ and $\epsilon\sim\mathcal{N}(0,1)$, the estimator from Definition [6](https://arxiv.org/html/2606.07599#A2.Thmtheorem6) satisfies

$$\hat{z}_{0}(t)-z_{0}=\sqrt{\frac{1-\bar{\alpha}_{t}}{\bar{\alpha}_{t}}}\left(\epsilon-\epsilon_{\Theta}(z_{t},t,x)\right).\tag{49}$$

Consequently,

$$\mathbb{E}\!\left[(\hat{z}_{0}(t)-z_{0})^{2}\right]=\frac{1-\bar{\alpha}_{t}}{\bar{\alpha}_{t}}\;\mathbb{E}\!\left[\left(\epsilon_{\Theta}(z_{t},t,x)-\epsilon\right)^{2}\right].\tag{50}$$

###### Proof\.

Substituting $z_{t}=\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ into ([48](https://arxiv.org/html/2606.07599#A2.E48)):

$$\begin{aligned}
\hat{z}_{0}(t)
&=\frac{\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\Theta}(z_{t},t,x)}{\sqrt{\bar{\alpha}_{t}}}\\
&=z_{0}+\sqrt{\frac{1-\bar{\alpha}_{t}}{\bar{\alpha}_{t}}}\bigl(\epsilon-\epsilon_{\Theta}(z_{t},t,x)\bigr),
\end{aligned}\tag{51}$$

which gives ([49](https://arxiv.org/html/2606.07599#A2.E49)). Squaring both sides of ([49](https://arxiv.org/html/2606.07599#A2.E49)) and taking expectation yields ([50](https://arxiv.org/html/2606.07599#A2.E50)); since the factor $(1-\bar{\alpha}_{t})/\bar{\alpha}_{t}$ is deterministic, it moves outside the expectation and no further assumption on the predictor is needed. ∎
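The relation (49) is algebraic, so it holds sample by sample for any predictor output. The sketch below checks this with an arbitrary stand-in for $\epsilon_{\Theta}$ (a shrunken, shifted copy of the true noise); the stand-in and the function names are assumptions made only for this illustration.

```python
import numpy as np

def z0_estimate(z_t, eps_pred, alpha_bar):
    """Estimator (48): recover z_0 from z_t and a predicted noise value."""
    return (z_t - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

def check_relation(alpha_bar, n=1000, seed=0):
    """Return per-sample left- and right-hand sides of (49) plus the noise error."""
    rng = np.random.default_rng(seed)
    z0 = rng.uniform(size=n)
    eps = rng.standard_normal(n)
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = 0.8 * eps + 0.1                      # arbitrary stand-in predictor output
    lhs = z0_estimate(z_t, eps_pred, alpha_bar) - z0
    rhs = np.sqrt((1.0 - alpha_bar) / alpha_bar) * (eps - eps_pred)
    return lhs, rhs, eps - eps_pred

lhs, rhs, noise_err = check_relation(alpha_bar=0.7)
print(np.max(np.abs(lhs - rhs)))                    # numerically zero
print(np.mean(lhs ** 2), (0.3 / 0.7) * np.mean(noise_err ** 2))
```

The amplification factor $(1-\bar{\alpha}_t)/\bar{\alpha}_t$ grows quickly as $\bar{\alpha}_t\to 0$, so the same noise-prediction MSE translates into a much larger $z_0$ error at heavily noised time points.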

###### Corollary 8 (SH-1T baseline: ordinal error bound via training MSE).

For SH-1T at $t_{\star}$, let $\hat{z}_{0}^{\star}=\hat{z}_{0}(t_{\star})$ and decode $\hat{y}^{\star}$ by thresholds. Under the conditions of Theorem [5](https://arxiv.org/html/2606.07599#A2.Thmtheorem5),

$$\mathbb{P}(\hat{y}^{\star}\neq y)\leq\frac{4}{\gamma^{2}}\cdot\frac{1-\bar{\alpha}_{t_{\star}}}{\bar{\alpha}_{t_{\star}}}\;\mathbb{E}\!\left[\bigl(\epsilon_{\Theta}(z_{t_{\star}},t_{\star},x)-\epsilon\bigr)^{2}\right].\tag{52}$$

###### Proof.

Applying Theorem [5](https://arxiv.org/html/2606.07599#A2.Thmtheorem5) with $\hat{z}=\hat{z}_{0}^{\star}$ gives $\mathbb{P}(\hat{y}^{\star}\neq y)\leq 4\,\mathbb{E}[(\hat{z}_{0}^{\star}-z_{0})^{2}]/\gamma^{2}$. Substituting ([50](https://arxiv.org/html/2606.07599#A2.E50)) with $t=t_{\star}$ yields ([52](https://arxiv.org/html/2606.07599#A2.E52)). ∎

#### B.5.3. Multi-embedding (multi-head representation) and information gain

We formalize why splitting into multiple embedding sets (multi-embedding) can increase the *effective information* even when the *total dimensionality* is fixed (e.g., $4\times 32$ vs. $1\times 128$). Our analysis is inspired by the embedding-collapse viewpoint of (Guo et al., [2023](https://arxiv.org/html/2606.07599#bib.bib165)), with fully rigorous statements under explicit assumptions.

###### Definition 9 (Representation matrix and effective information).

For a batch $\{x_{i}\}_{i=1}^{n}$, let $u_{i}=h(x_{i})\in\mathbb{R}^{D}$ and $U=[u_{1}^{\top};\dots;u_{n}^{\top}]\in\mathbb{R}^{n\times D}$. We use $\operatorname{rank}(U)$ to measure effective dimensionality; collapse corresponds to low rank.

###### Definition 10 (Effective rank).

Define $r_{\mathrm{eff}}(U)\triangleq\|U\|_{*}^{2}/\|U\|_{F}^{2}$ for $U\neq 0$, where $\|\cdot\|_{*}$ is the nuclear norm and $\|\cdot\|_{F}$ the Frobenius norm. It satisfies $1\leq r_{\mathrm{eff}}(U)\leq\operatorname{rank}(U)\leq\min\{n,D\}$.
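The effective rank can be computed from the singular values, since $\|U\|_{*}$ is their sum and $\|U\|_{F}^{2}$ is the sum of their squares. The helper below is a minimal sketch (the function name is illustrative) and assumes a nonzero matrix.

```python
import numpy as np

def effective_rank(U):
    """r_eff(U) = ||U||_*^2 / ||U||_F^2, computed from the singular values of U.
    Assumes U is not the zero matrix."""
    s = np.linalg.svd(np.asarray(U, dtype=float), compute_uv=False)
    return float(s.sum() ** 2 / np.sum(s ** 2))

rng = np.random.default_rng(0)
U_full = rng.standard_normal((100, 8))                       # generic full-rank batch
U_collapsed = np.outer(rng.standard_normal(100), np.ones(8)) # every column identical: rank 1
print(effective_rank(U_full), np.linalg.matrix_rank(U_full))
print(effective_rank(U_collapsed), np.linalg.matrix_rank(U_collapsed))
```

Unlike the integer rank, this quantity decreases smoothly as the singular-value spectrum becomes more concentrated, which makes it convenient for tracking partial collapse during training.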

###### Theorem 11 (Analytical example: matching a low-rank Gram matrix forces low-rank embeddings).

Let $G^{\star}\in\mathbb{R}^{n\times n}$ be PSD with $\operatorname{rank}(G^{\star})=r$. Consider

$$\min_{U\in\mathbb{R}^{n\times D}}\ \|UU^{\top}-G^{\star}\|_{F}^{2}.$$

If $D\geq r$, there exists a global minimizer $U^{\star}$ with $U^{\star}(U^{\star})^{\top}=G^{\star}$ and $\operatorname{rank}(U^{\star})=r$.

###### Proof.

Since $G^{\star}$ is PSD, $G^{\star}=Q\Lambda Q^{\top}$ with $\Lambda=\operatorname{diag}(\lambda_{1},\dots,\lambda_{r},0,\dots,0)$, $\lambda_{i}>0$ for $i\leq r$. Let $\Lambda_{r}^{1/2}=\operatorname{diag}(\sqrt{\lambda_{1}},\dots,\sqrt{\lambda_{r}})$ and let $Q_{r}$ be the matrix of the first $r$ eigenvectors. For $D\geq r$, define

$$U^{\star}\triangleq[\,Q_{r}\Lambda_{r}^{1/2}\ \ 0\,]\in\mathbb{R}^{n\times D}.$$

Then $U^{\star}(U^{\star})^{\top}=Q_{r}\Lambda_{r}Q_{r}^{\top}=G^{\star}$, achieving objective value $0$. Since $\operatorname{rank}(G^{\star})=\operatorname{rank}(U^{\star}(U^{\star})^{\top})=\operatorname{rank}(U^{\star})$, we have $\operatorname{rank}(U^{\star})=r$. ∎
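The construction in the proof translates directly into code. The sketch below builds $U^{\star}$ from an eigendecomposition, padding with zero columns up to width $D$; the function name and the tolerance used to decide which eigenvalues are nonzero are assumptions for this illustration.

```python
import numpy as np

def gram_factor(G, D, tol=1e-10):
    """Build U* = [Q_r Lambda_r^{1/2}, 0] in R^{n x D} with U* U*^T = G,
    for a PSD matrix G whose rank r satisfies r <= D."""
    w, Q = np.linalg.eigh(G)                       # eigenvalues in ascending order
    keep = w > tol * max(w.max(), 1.0)             # numerically nonzero eigenvalues
    r = int(keep.sum())
    if D < r:
        raise ValueError("need D >= rank(G)")
    U = np.zeros((G.shape[0], D))
    U[:, :r] = Q[:, keep] * np.sqrt(w[keep])       # scale each kept eigenvector
    return U

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
G = A @ A.T                                        # PSD with rank 2
U = gram_factor(G, D=4)
print(np.allclose(U @ U.T, G), np.linalg.matrix_rank(U))
```

Even though the embedding has $D=4$ columns available, the minimizer built this way uses only two of them, which is the collapse phenomenon the theorem describes.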

###### Theorem 12 (Larger representation subspace $\Rightarrow$ smaller best-fit MSE).

Fix targets $z=(z_{1},\dots,z_{n})^{\top}\in\mathbb{R}^{n}$. For any representation matrix $U\in\mathbb{R}^{n\times D}$, define

$$\mathcal{R}(U)\triangleq\min_{w\in\mathbb{R}^{D}}\frac{1}{n}\|Uw-z\|_{2}^{2}.\tag{53}$$

Let $P_{U}$ be the orthogonal projector onto $\operatorname{col}(U)$. Then

$$\mathcal{R}(U)=\frac{1}{n}\|(I-P_{U})z\|_{2}^{2}.\tag{54}$$

If $\operatorname{col}(U^{\mathrm{SE}})\subseteq\operatorname{col}(U^{\mathrm{ME}})$, then $\mathcal{R}(U^{\mathrm{ME}})\leq\mathcal{R}(U^{\mathrm{SE}})$, with strict inequality when $\operatorname{col}(U^{\mathrm{SE}})\subsetneq\operatorname{col}(U^{\mathrm{ME}})$ and $(I-P_{U^{\mathrm{SE}}})z\neq 0$ lies in $\operatorname{col}(U^{\mathrm{ME}})$.

###### Proof.

Equation ([54](https://arxiv.org/html/2606.07599#A2.E54)) follows because $\{Uw:w\in\mathbb{R}^{D}\}=\operatorname{col}(U)$ and least squares projects $z$ orthogonally onto this subspace. If $\operatorname{col}(U^{\mathrm{SE}})\subseteq\operatorname{col}(U^{\mathrm{ME}})$, the residual norm cannot increase when projecting onto the larger subspace, yielding $\mathcal{R}(U^{\mathrm{ME}})\leq\mathcal{R}(U^{\mathrm{SE}})$. Strict inequality holds when the residual component w.r.t. $\operatorname{col}(U^{\mathrm{SE}})$ becomes representable by $\operatorname{col}(U^{\mathrm{ME}})$. ∎
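The monotonicity of the best-fit error under nested column spaces can be observed with an ordinary least-squares solve. In the sketch below, the "multi-embedding" matrix is formed by appending extra random columns to the "single-embedding" matrix, which guarantees the inclusion of column spaces; this construction and the names are illustrative only.

```python
import numpy as np

def best_fit_mse(U, z):
    """R(U) = min_w (1/n) ||U w - z||^2, solved by least squares."""
    w, *_ = np.linalg.lstsq(U, z, rcond=None)
    return float(np.mean((U @ w - z) ** 2))

def projection_residual_mse(U, z):
    """(1/n) ||(I - P_U) z||^2 with P_U = U U^+ the orthogonal projector onto col(U)."""
    P = U @ np.linalg.pinv(U)
    return float(np.mean((z - P @ z) ** 2))

rng = np.random.default_rng(0)
U_se = rng.standard_normal((50, 3))
U_me = np.hstack([U_se, rng.standard_normal((50, 2))])   # col(U_se) is contained in col(U_me)
z = rng.standard_normal(50)
print(best_fit_mse(U_se, z), best_fit_mse(U_me, z))
print(projection_residual_mse(U_se, z))                   # agrees with best_fit_mse(U_se, z)
```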

###### Corollary 13 (Representation improvement $\Rightarrow$ ordinal-error bound improvement).

If multi-embedding reduces $\mathbb{E}[(\hat{z}-z_{0})^{2}]$ (e.g., by Theorem [12](https://arxiv.org/html/2606.07599#A2.Thmtheorem12)), then the ordinal-error upper bound decreases via ([47](https://arxiv.org/html/2606.07599#A2.E47)). This is aligned with the empirical spectral-collapse observation emphasized by (Guo et al., [2023](https://arxiv.org/html/2606.07599#bib.bib165)).

###### Theorem 14 (Analytical example: multi-embedding can increase rank at fixed total dimension).

Let $G_{1}^{\star},\dots,G_{M}^{\star}\in\mathbb{R}^{n\times n}$ be PSD with ranks $r_{m}=\operatorname{rank}(G_{m}^{\star})$ and column spaces that form a direct sum, i.e., $\operatorname{col}(G_{i}^{\star})\cap\sum_{j\neq i}\operatorname{col}(G_{j}^{\star})=\{0\}$ for every $i$ (for $M=2$ this reduces to $\operatorname{col}(G_{1}^{\star})\cap\operatorname{col}(G_{2}^{\star})=\{0\}$; for $M\geq 3$, pairwise trivial intersections alone would not suffice). Consider multi-embedding variables $U^{(m)}\in\mathbb{R}^{n\times d}$ ($d\geq r_{m}$) minimizing $\sum_{m=1}^{M}\|U^{(m)}(U^{(m)})^{\top}-G_{m}^{\star}\|_{F}^{2}$, and define $U^{\mathrm{ME}}\triangleq[U^{(1)}~\cdots~U^{(M)}]\in\mathbb{R}^{n\times D}$ with $D=Md$. Then there exists a global minimizer with $\operatorname{rank}(U^{\mathrm{ME}})=\sum_{m=1}^{M}r_{m}$.

###### Proof.

For each $m$, by Theorem [11](https://arxiv.org/html/2606.07599#A2.Thmtheorem11), there exists $U^{(m)\star}\in\mathbb{R}^{n\times d}$ with $U^{(m)\star}(U^{(m)\star})^{\top}=G_{m}^{\star}$. The concatenated representation $U^{\mathrm{ME}\star}=[U^{(1)\star}\cdots U^{(M)\star}]$ satisfies

$$\operatorname{col}(U^{\mathrm{ME}\star})=\sum_{m=1}^{M}\operatorname{col}(U^{(m)\star})=\sum_{m=1}^{M}\operatorname{col}(G_{m}^{\star}).$$

By the direct-sum assumption, $\dim(\operatorname{col}(U^{\mathrm{ME}\star}))=\sum_{m=1}^{M}\dim(\operatorname{col}(G_{m}^{\star}))=\sum_{m=1}^{M}r_{m}$, hence $\operatorname{rank}(U^{\mathrm{ME}\star})=\sum_{m=1}^{M}r_{m}$. ∎
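A concrete instance of this rank addition: the sketch below builds two PSD targets whose column spaces are spanned by disjoint sets of orthonormal vectors (so they form a direct sum), factors each one, and concatenates the factors. The helper `psd_factor`, the chosen eigenvalues, and the dimensions are assumptions for illustration.

```python
import numpy as np

def psd_factor(G, d, tol=1e-10):
    """Return U in R^{n x d} with U U^T = G for a PSD matrix G of rank <= d."""
    w, Q = np.linalg.eigh(G)
    keep = w > tol * max(w.max(), 1.0)             # numerically nonzero eigenvalues
    U = np.zeros((G.shape[0], d))
    U[:, :int(keep.sum())] = Q[:, keep] * np.sqrt(w[keep])
    return U

def concat_embeddings(grams, d):
    """U^ME = [U^(1) ... U^(M)] where each block satisfies U^(m) U^(m)^T = G_m."""
    return np.hstack([psd_factor(G, d) for G in grams])

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # orthonormal basis of R^8
G1 = Q[:, :2] @ np.diag([3.0, 1.0]) @ Q[:, :2].T           # rank 2
G2 = Q[:, 2:5] @ np.diag([2.0, 1.0, 0.5]) @ Q[:, 2:5].T    # rank 3, independent directions
U_me = concat_embeddings([G1, G2], d=3)
print(U_me.shape, np.linalg.matrix_rank(U_me))
```

Each block on its own has rank at most its target's rank, while the concatenation reaches the sum of the two ranks with a total width of only $D=Md=6$.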

### B\.6\.Multi\-Time\-Step Advantage: Selection and Linear Fusion

#### B\.6\.1\.Baseline\-time expressivity: MH can represent SH\-1T at the same time point

###### Theorem 15 \(Function\-class inclusion at the baseline time point\)\.

Let $t_{\star}\in\mathcal{T}$ be the baseline time and assume $t_{m_{\star}}=t_{\star}$ for some $m_{\star}$. Assume SH-1T and the $m_{\star}$-th head of MH-MT use the same decoder family $\{g_{\theta}\}$ and encoder family $\{E_{\phi}\}$. Define the optimal noise-prediction risks:

$$R_{\mathrm{1T}}^{\star}\triangleq\inf_{\phi,\theta}\;\mathbb{E}\!\left[\bigl(\epsilon-g_{\theta}(E_{\phi}(x),z_{t_{\star}},t_{\star})\bigr)^{2}\right],\tag{55}$$

$$R_{\mathrm{MH},m_{\star}}^{\star}\triangleq\inf_{\phi,\theta_{1},\dots,\theta_{M}}\;\mathbb{E}\!\left[\bigl(\epsilon-g_{\theta_{m_{\star}}}(E_{\phi}(x),z_{t_{\star}},t_{\star})\bigr)^{2}\right].\tag{56}$$

Then $R_{\mathrm{MH},m_{\star}}^{\star}\leq R_{\mathrm{1T}}^{\star}$.

###### Proof\.

For any baseline parameters $(\phi,\theta)$, construct MH parameters $(\phi,\theta_{1},\dots,\theta_{M})$ with $\theta_{m_{\star}}=\theta$ and arbitrary $\theta_{m}$ for $m\neq m_{\star}$. At time $t_{\star}$, both models produce identical outputs:

$$g_{\theta_{m_{\star}}}(E_{\phi}(x),z_{t_{\star}},t_{\star})=g_{\theta}(E_{\phi}(x),z_{t_{\star}},t_{\star}),$$

so their expected losses are equal. Since $R_{\mathrm{MH},m_{\star}}^{\star}$ minimizes over a superset of parameter choices, $R_{\mathrm{MH},m_{\star}}^{\star}\leq R_{\mathrm{1T}}^{\star}$. ∎

###### Corollary 16 \(Best\-achievable ordinal\-error upper bound at $t_{\star}$ is no worse for MH\)\.

Define $\mathrm{UB}(t;R)\triangleq\frac{4}{\gamma^{2}}\cdot\frac{1-\bar{\alpha}_{t}}{\bar{\alpha}_{t}}\cdot R$. By Theorems [7](https://arxiv.org/html/2606.07599#A2.Thmtheorem7) and [5](https://arxiv.org/html/2606.07599#A2.Thmtheorem5), any model with time-$t$ noise risk $R$ satisfies $\mathbb{P}(\hat{y}\neq y)\leq\mathrm{UB}(t;R)$. Combining with Theorem [15](https://arxiv.org/html/2606.07599#A2.Thmtheorem15),

$$\inf_{\phi,\theta_{1},\dots,\theta_{M}}\mathrm{UB}(t_{\star};R_{\mathrm{MH},m_{\star}}^{\star})\leq\inf_{\phi,\theta}\mathrm{UB}(t_{\star};R_{\mathrm{1T}}^{\star}).$$
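To make the scaling of $\mathrm{UB}(t;R)$ concrete, the bound can be evaluated directly. This is a minimal sketch: the function name and the numerical values of $\bar{\alpha}_{t}$, $\gamma$, and the two risks are hypothetical and not taken from the paper.

```python
import math

def ordinal_error_upper_bound(alpha_bar_t: float, noise_risk: float, gamma: float) -> float:
    """Evaluate UB(t; R) = (4 / gamma^2) * ((1 - alpha_bar_t) / alpha_bar_t) * R."""
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in (0, 1]")
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    return (4.0 / gamma**2) * ((1.0 - alpha_bar_t) / alpha_bar_t) * noise_risk

# Hypothetical values: same time point and margin, two different noise risks.
alpha_bar, gamma = 0.8, 1.0
ub_single_head = ordinal_error_upper_bound(alpha_bar, noise_risk=0.02, gamma=gamma)
ub_multi_head = ordinal_error_upper_bound(alpha_bar, noise_risk=0.01, gamma=gamma)
print(ub_single_head, ub_multi_head)
```

Because the bound is linear in $R$ at a fixed time point, any reduction in the noise risk carries over proportionally to the ordinal-error bound.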

###### Definition 17 \(Oracle best\-time selection\)\.

For each $m$, let $\hat{z}_{0}^{(m)}=\hat{z}_{0}(t_{m})$ and decode $\hat{y}^{(m)}$ by thresholds. Define

$$m^{\dagger}\triangleq\arg\min_{m\in\{1,\dots,M\}}\mathbb{E}[(\hat{z}_{0}^{(m)}-z_{0})^{2}],\qquad\hat{z}_{0}^{\dagger}\triangleq\hat{z}_{0}^{(m^{\dagger})},\ \hat{y}^{\dagger}\triangleq\hat{y}^{(m^{\dagger})}.\tag{57}$$

###### Theorem 18 \(Oracle selection is no worse than any fixed time point\)\.

If $t_{m_{\star}}=t_{\star}$ is the baseline time point, then

$$\mathbb{E}[(\hat{z}_{0}^{\dagger}-z_{0})^{2}]\leq\mathbb{E}[(\hat{z}_{0}^{(m_{\star})}-z_{0})^{2}].\tag{58}$$

Moreover, under Theorem [5](https://arxiv.org/html/2606.07599#A2.Thmtheorem5),

$$\mathbb{P}(\hat{y}^{\dagger}\neq y)\leq\frac{4}{\gamma^{2}}\mathbb{E}[(\hat{z}_{0}^{\dagger}-z_{0})^{2}]\leq\frac{4}{\gamma^{2}}\mathbb{E}[(\hat{z}_{0}^{(m_{\star})}-z_{0})^{2}].\tag{59}$$

###### Proof\.

By definition of $m^{\dagger}$ in ([57](https://arxiv.org/html/2606.07599#A2.E57)), ([58](https://arxiv.org/html/2606.07599#A2.E58)) holds for any fixed $m_{\star}$. Applying Theorem [5](https://arxiv.org/html/2606.07599#A2.Thmtheorem5) to $\hat{z}=\hat{z}_{0}^{\dagger}$ yields the first inequality in ([59](https://arxiv.org/html/2606.07599#A2.E59)); the second follows from ([58](https://arxiv.org/html/2606.07599#A2.E58)). ∎
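Oracle selection can be illustrated with a small simulation. This is a sketch under stated assumptions: the per-head error levels in `noise_std` are hypothetical, the estimates are synthetic rather than produced by a diffusion model, and the empirical mean squared error stands in for the expectation in the definition of $m^{\dagger}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10_000, 4

z0 = rng.uniform(0.0, 1.0, N)                        # synthetic clean targets
noise_std = np.array([0.30, 0.10, 0.20, 0.15])       # hypothetical error level of each head
z_hat = z0[:, None] + rng.standard_normal((N, M)) * noise_std  # column m = estimate from head m

# Empirical stand-in for E[(z_hat^(m) - z0)^2], one value per head.
mse = ((z_hat - z0[:, None]) ** 2).mean(axis=0)

m_dagger = int(np.argmin(mse))                       # oracle best-time index
print(m_dagger, mse)
```

By construction `mse[m_dagger]` is no larger than the error of any fixed head, which is exactly the content of the inequality between the oracle and baseline errors.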

###### Definition 19 \(Linear fusion across time points\)\.

Let $w\in\mathbb{R}^{M}$ satisfy $\sum_{m=1}^{M}w_{m}=1$. Define the fused estimator

$$\hat{z}_{0}^{\mathrm{ens}}\triangleq\sum_{m=1}^{M}w_{m}\hat{z}_{0}^{(m)}.\tag{60}$$

###### Theorem 20 \(MSE bound for linear fusion under bounded covariance\)\.

Under Assumption [5](https://arxiv.org/html/2606.07599#Thmassumption5), with $e_{m}=\hat{z}_{0}^{(m)}-z_{0}$, we have

$$\mathbb{E}[(\hat{z}_{0}^{\mathrm{ens}}-z_{0})^{2}]\leq\sum_{m=1}^{M}w_{m}^{2}v_{m}+2\sum_{1\leq i<j\leq M}|w_{i}w_{j}|\,b_{ij}.\tag{61}$$

Consequently, by Theorem [5](https://arxiv.org/html/2606.07599#A2.Thmtheorem5),

$$\mathbb{P}(\hat{y}^{\mathrm{ens}}\neq y)\leq\frac{4}{\gamma^{2}}\mathbb{E}[(\hat{z}_{0}^{\mathrm{ens}}-z_{0})^{2}].\tag{62}$$

###### Proof\.

Using $\hat{z}_{0}^{\mathrm{ens}}-z_{0}=\sum_{m}w_{m}e_{m}$ (which holds because $\sum_{m}w_{m}=1$) and expanding the square:

$$\begin{aligned}\mathbb{E}\!\left[\left(\sum_{m}w_{m}e_{m}\right)^{2}\right]&=\sum_{m}w_{m}^{2}\mathbb{E}[e_{m}^{2}]+2\sum_{i<j}w_{i}w_{j}\mathbb{E}[e_{i}e_{j}]\\&\leq\sum_{m}w_{m}^{2}v_{m}+2\sum_{i<j}|w_{i}w_{j}|\,|\mathbb{E}[e_{i}e_{j}]|\\&\leq\sum_{m}w_{m}^{2}v_{m}+2\sum_{i<j}|w_{i}w_{j}|\,b_{ij},\end{aligned}\tag{63}$$

where the first inequality uses $w_{i}w_{j}\mathbb{E}[e_{i}e_{j}]\leq|w_{i}w_{j}||\mathbb{E}[e_{i}e_{j}]|$, and the second applies Assumption [5](https://arxiv.org/html/2606.07599#Thmassumption5). Equation ([62](https://arxiv.org/html/2606.07599#A2.E62)) follows by applying Theorem [5](https://arxiv.org/html/2606.07599#A2.Thmtheorem5) to $\hat{z}=\hat{z}_{0}^{\mathrm{ens}}$. ∎
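The fusion bound can be verified numerically on synthetic errors. The sketch below assumes that the bounded-covariance assumption supplies $v_{m}\geq\mathbb{E}[e_{m}^{2}]$ and $b_{ij}\geq|\mathbb{E}[e_{i}e_{j}]|$, and uses the tightest such values computed from the samples; the weights and the error model are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 50_000, 3

# Correlated synthetic errors: column m holds samples of e_m.
mixing = 0.1 * rng.standard_normal((M, M))
errors = rng.standard_normal((N, M)) @ mixing.T

w = np.array([0.5, 0.8, -0.3])             # fusion weights, may be negative, sum to 1

second_moments = errors.T @ errors / N     # empirical E[e_i e_j]
v = np.diag(second_moments)                # tightest admissible v_m
b = np.abs(second_moments)                 # tightest admissible b_ij

fused_mse = float(np.mean((errors @ w) ** 2))
iu, ju = np.triu_indices(M, k=1)           # index pairs with i < j
bound = float(np.sum(w**2 * v) + 2.0 * np.sum(np.abs(w[iu] * w[ju]) * b[iu, ju]))
print(fused_mse, bound)
```

With these tightest constants the bound is attained when every product $w_{i}w_{j}\mathbb{E}[e_{i}e_{j}]$ is non-negative, and it is loose when negatively signed cross terms cancel part of the error.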

### B\.7\.Convergence Analysis on the Multi\-Time Objective \(MH\-MT vs SH\-MT\)

This subsection analyzes convergence on the same multi-time objective $\mathcal{L}=\sum_{m=1}^{M}\lambda_{m}\mathcal{L}_{m}$ for SH-MT vs MH-MT (distinct from SH-1T, which optimizes a single-time objective).

###### Definition 21 \(SH\-MT and MH\-MT objectives\)\.

Fix encoder $E_{\phi}$ and define

$$f_{m}(\theta)\triangleq\mathbb{E}\Bigl[\bigl(\epsilon-g_{\theta}(E_{\phi}(x),z_{t_{m}},t_{m})\bigr)^{2}\Bigr],\qquad F_{\mathrm{SH}}(\theta)\triangleq\sum_{m=1}^{M}\lambda_{m}f_{m}(\theta).$$

For MH-MT, define $F_{\mathrm{MH}}(\vartheta)\triangleq\sum_{m=1}^{M}\lambda_{m}f_{m}(\theta_{m})$ with $\vartheta=(\theta_{1},\dots,\theta_{M})$.

###### Assumption 6 \($L$\-smoothness and PL condition\)\.

Assume each $f_{m}$ is $L_{m}$-smooth and satisfies the Polyak-Łojasiewicz (PL) inequality with constant $\mu_{m}>0$:

$$\frac{1}{2}\|\nabla f_{m}(\theta)\|^{2}\geq\mu_{m}(f_{m}(\theta)-f_{m}^{\star}),\qquad f_{m}^{\star}=\inf_{\theta}f_{m}(\theta).$$

Let $L\triangleq\sum_{m}\lambda_{m}L_{m}$, $\mu_{\min}\triangleq\min_{m}\mu_{m}$, and $\lambda_{\min}\triangleq\min_{m}\lambda_{m}>0$.

###### Lemma 22 \(Descent lemma for $L$\-smooth functions\)\.

If $F$ is $L$-smooth, then for $\eta\in(0,1/L]$ and $\theta^{+}=\theta-\eta\nabla F(\theta)$,

$$F(\theta^{+})\leq F(\theta)-\frac{\eta}{2}\|\nabla F(\theta)\|^{2}.\tag{64}$$

###### Proof\.

By $L$-smoothness,

$$F(\theta^{+})\leq F(\theta)+\langle\nabla F(\theta),\theta^{+}-\theta\rangle+\frac{L}{2}\|\theta^{+}-\theta\|^{2}.$$

Substituting $\theta^{+}-\theta=-\eta\nabla F(\theta)$ gives

$$F(\theta^{+})\leq F(\theta)-\eta\|\nabla F(\theta)\|^{2}+\frac{L\eta^{2}}{2}\|\nabla F(\theta)\|^{2}=F(\theta)-\eta\left(1-\frac{L\eta}{2}\right)\|\nabla F(\theta)\|^{2}.$$

For $\eta\leq1/L$, $1-L\eta/2\geq1/2$, yielding ([64](https://arxiv.org/html/2606.07599#A2.E64)). ∎
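The descent inequality is easy to check on a quadratic, where the smoothness constant is the largest eigenvalue of the Hessian. The following is a minimal sketch with a hypothetical random quadratic; it is not tied to the diffusion objective.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = B @ B.T + 0.1 * np.eye(5)            # symmetric positive definite Hessian
L = float(np.linalg.eigvalsh(A).max())   # smoothness constant of F

def F(theta):
    return 0.5 * theta @ A @ theta

def grad_F(theta):
    return A @ theta

theta = rng.standard_normal(5)
g = grad_F(theta)

# slack = F(theta) - (eta / 2) * ||grad||^2 - F(theta_plus); the lemma says slack >= 0.
slacks = []
for eta in (1.0 / L, 0.5 / L, 0.1 / L):
    theta_plus = theta - eta * g
    slacks.append(float(F(theta) - 0.5 * eta * (g @ g) - F(theta_plus)))
print(slacks)
```

All three slacks are non-negative (up to rounding), as the lemma predicts for any step size in $(0,1/L]$.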

###### Lemma 23 \(MH gradient lower bound from per\-task PL\)\.

For MH-MT, $\nabla_{\theta_{m}}F_{\mathrm{MH}}(\vartheta)=\lambda_{m}\nabla f_{m}(\theta_{m})$ and

$$\frac{1}{2}\|\nabla F_{\mathrm{MH}}(\vartheta)\|^{2}\geq\mu_{\min}\lambda_{\min}(F_{\mathrm{MH}}(\vartheta)-F_{\mathrm{MH}}^{\star}),\tag{65}$$

where $F_{\mathrm{MH}}^{\star}=\sum_{m}\lambda_{m}f_{m}^{\star}$.

###### Proof\.

Using Proposition [2](https://arxiv.org/html/2606.07599#A2.Thmtheorem2) and the PL condition for each $f_{m}$:

$$\begin{aligned}\|\nabla F_{\mathrm{MH}}(\vartheta)\|^{2}&=\sum_{m}\|\lambda_{m}\nabla f_{m}(\theta_{m})\|^{2}=\sum_{m}\lambda_{m}^{2}\|\nabla f_{m}(\theta_{m})\|^{2}\\&\geq2\sum_{m}\lambda_{m}^{2}\mu_{m}(f_{m}(\theta_{m})-f_{m}^{\star})\geq2\mu_{\min}\sum_{m}\lambda_{m}^{2}(f_{m}(\theta_{m})-f_{m}^{\star}).\end{aligned}\tag{66}$$

Since $\lambda_{m}^{2}\geq\lambda_{\min}\lambda_{m}$,

$$\sum_{m}\lambda_{m}^{2}(f_{m}(\theta_{m})-f_{m}^{\star})\geq\lambda_{\min}\sum_{m}\lambda_{m}(f_{m}(\theta_{m})-f_{m}^{\star})=\lambda_{\min}(F_{\mathrm{MH}}(\vartheta)-F_{\mathrm{MH}}^{\star}).\tag{67}$$

Combining ([66](https://arxiv.org/html/2606.07599#A2.E66)) and ([67](https://arxiv.org/html/2606.07599#A2.E67)) and dividing by $2$ yields ([65](https://arxiv.org/html/2606.07599#A2.E65)). ∎

###### Theorem 24 \(Linear convergence of MH\-MT\)\.

With step size $\eta\leq1/L$ and GD update $\vartheta^{k+1}=\vartheta^{k}-\eta\nabla F_{\mathrm{MH}}(\vartheta^{k})$,

$$F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star}\leq\bigl(1-\eta\mu_{\min}\lambda_{\min}\bigr)^{k}\bigl(F_{\mathrm{MH}}(\vartheta^{0})-F_{\mathrm{MH}}^{\star}\bigr).\tag{68}$$

###### Proof\.

Applying Lemma [22](https://arxiv.org/html/2606.07599#A2.Thmtheorem22) to $F_{\mathrm{MH}}$ and using ([65](https://arxiv.org/html/2606.07599#A2.E65)):

$$\begin{aligned}F_{\mathrm{MH}}(\vartheta^{k+1})-F_{\mathrm{MH}}^{\star}&\leq F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star}-\frac{\eta}{2}\|\nabla F_{\mathrm{MH}}(\vartheta^{k})\|^{2}\\&\leq F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star}-\eta\mu_{\min}\lambda_{\min}(F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star})\\&=(1-\eta\mu_{\min}\lambda_{\min})(F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star}).\end{aligned}\tag{69}$$

Iterating ([69](https://arxiv.org/html/2606.07599#A2.E69)) yields ([68](https://arxiv.org/html/2606.07599#A2.E68)). ∎
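The linear rate can be observed on a toy separable objective. The sketch below is an assumption-laden illustration: each $f_{m}$ is a hypothetical scalar quadratic $\tfrac{1}{2}a_{m}(\theta_{m}-c_{m})^{2}$, for which $L_{m}=\mu_{m}=a_{m}$ and $f_{m}^{\star}=0$, and the weights and curvatures are made up.

```python
import numpy as np

a = np.array([1.0, 4.0, 2.5])      # curvature of each f_m, so L_m = mu_m = a_m
lam = np.array([0.2, 0.5, 0.3])    # loss weights lambda_m
c = np.array([1.0, -2.0, 0.5])     # minimizer of each f_m

L = float(np.sum(lam * a))         # L = sum_m lambda_m L_m, as in the assumption
eta = 1.0 / L
rate = 1.0 - eta * float(a.min()) * float(lam.min())   # 1 - eta * mu_min * lambda_min

def suboptimality(theta):
    """F_MH(theta) - F_MH^star for the toy objective (here F_MH^star = 0)."""
    return float(np.sum(lam * 0.5 * a * (theta - c) ** 2))

theta = np.zeros(3)
gaps = [suboptimality(theta)]
for _ in range(50):
    theta = theta - eta * lam * a * (theta - c)   # gradient step on every head
    gaps.append(suboptimality(theta))

bounds = [rate**k * gaps[0] for k in range(len(gaps))]
within_bound = all(g <= bnd + 1e-12 for g, bnd in zip(gaps, bounds))
print(within_bound, gaps[-1], bounds[-1])
```

In this example the observed gap shrinks faster than the guaranteed rate, which is expected: the theorem bounds the worst head, while well-conditioned heads converge more quickly.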

###### Theorem 25 \(Linear convergence of SH\-MT under global PL\)\.

Assume $F_{\mathrm{SH}}$ is $L$-smooth and satisfies a global PL inequality with constant $\tilde{\mu}_{\mathrm{SH}}>0$:

$$\frac{1}{2}\|\nabla F_{\mathrm{SH}}(\theta)\|^{2}\geq\tilde{\mu}_{\mathrm{SH}}(F_{\mathrm{SH}}(\theta)-F_{\mathrm{SH}}^{\star}).$$

Then for GD with $\eta\leq1/L$,

$$F_{\mathrm{SH}}(\theta^{k})-F_{\mathrm{SH}}^{\star}\leq\bigl(1-\eta\tilde{\mu}_{\mathrm{SH}}\bigr)^{k}\bigl(F_{\mathrm{SH}}(\theta^{0})-F_{\mathrm{SH}}^{\star}\bigr).\tag{70}$$

###### Proof\.

Applying Lemma [22](https://arxiv.org/html/2606.07599#A2.Thmtheorem22) to $F_{\mathrm{SH}}$ and using the global PL condition:

$$F_{\mathrm{SH}}(\theta^{k+1})-F_{\mathrm{SH}}^{\star}\leq F_{\mathrm{SH}}(\theta^{k})-F_{\mathrm{SH}}^{\star}-\eta\tilde{\mu}_{\mathrm{SH}}(F_{\mathrm{SH}}(\theta^{k})-F_{\mathrm{SH}}^{\star}),$$

which iterates to ([70](https://arxiv.org/html/2606.07599#A2.E70)). ∎

###### Corollary 26 \(Convergence implies decreasing ordinal\-error upper bound\)\.

Fix time $t_{m}$. Let $\hat{z}_{0}^{(m),k}$ be constructed from $\theta_{m}^{k}$ via Definition [6](https://arxiv.org/html/2606.07599#A2.Thmtheorem6) and decoded to $\hat{y}^{(m),k}$. Then

$$\mathbb{P}(\hat{y}^{(m),k}\neq y)\leq\frac{4}{\gamma^{2}}\cdot\frac{1-\bar{\alpha}_{t_{m}}}{\bar{\alpha}_{t_{m}}}\;f_{m}(\theta_{m}^{k}),\tag{71}$$

and

$$f_{m}(\theta_{m}^{k})-f_{m}^{\star}\leq\frac{1}{\lambda_{m}}\bigl(F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star}\bigr)\leq\frac{1}{\lambda_{m}}\bigl(1-\eta\mu_{\min}\lambda_{\min}\bigr)^{k}\bigl(F_{\mathrm{MH}}(\vartheta^{0})-F_{\mathrm{MH}}^{\star}\bigr).\tag{72}$$

###### Proof\.

Equation ([71](https://arxiv.org/html/2606.07599#A2.E71)) follows by combining ([50](https://arxiv.org/html/2606.07599#A2.E50)) with Theorem [5](https://arxiv.org/html/2606.07599#A2.Thmtheorem5). For ([72](https://arxiv.org/html/2606.07599#A2.E72)), note that

$$F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star}=\sum_{j=1}^{M}\lambda_{j}(f_{j}(\theta_{j}^{k})-f_{j}^{\star})\geq\lambda_{m}(f_{m}(\theta_{m}^{k})-f_{m}^{\star}),$$

hence $f_{m}(\theta_{m}^{k})-f_{m}^{\star}\leq\frac{1}{\lambda_{m}}(F_{\mathrm{MH}}(\vartheta^{k})-F_{\mathrm{MH}}^{\star})$. Applying ([68](https://arxiv.org/html/2606.07599#A2.E68)) yields the final bound. ∎

## Appendix C Additional Experiments

### C\.1\.Experimental Settings

#### C\.1\.1\.Metrics

We utilize a comprehensive suite of metrics to evaluate performance, selecting specific indicators based on task characteristics:

- MAE (Mean Absolute Error): Measures the average magnitude of errors between predictions $\{\hat{y}_{i}\}$ and ground truth $\{y_{i}\}$, defined as $\frac{1}{N}\sum_{i=1}^{N}|y_{i}-\hat{y}_{i}|$.
- CS (Cumulative Score): Quantifies the percentage of test instances whose absolute prediction error $|\hat{y}_{i}-y_{i}|$ falls within a specified tolerance threshold $L$.
- XAUC (Zhan et al., [2022](https://arxiv.org/html/2606.07599#bib.bib11)): Evaluates pairwise ordinal consistency. It calculates the probability that the predicted relative order of a randomly sampled pair aligns with the ground truth, serving as a robust indicator of ranking capability.
- LCC (Linear Correlation Coefficient) (Talebi and Milanfar, [2018](https://arxiv.org/html/2606.07599#bib.bib143)): Assesses the linear dependence between predicted and ground truth values. Computed as the covariance normalized by the product of their standard deviations, it takes values in $[-1,1]$, where values closer to $\pm1$ denote stronger correlation.
- SRCC (Spearman's Rank Correlation Coefficient) (Talebi and Milanfar, [2018](https://arxiv.org/html/2606.07599#bib.bib143)): Measures monotonic rank correlation. Unlike LCC, SRCC operates on rank variables rather than raw scores, offering greater robustness to outliers and non-linear monotonic relationships.
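The metrics above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the evaluation code used in the paper: the function names are hypothetical, CS is returned as a fraction rather than a percentage, and XAUC here skips pairs with tied ground truth and counts tied predictions as incorrect, which is one of several possible tie conventions.

```python
import numpy as np

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y, float) - np.asarray(y_hat, float))))

def cumulative_score(y, y_hat, tol):
    """Fraction of samples with absolute error <= tol (multiply by 100 for percent)."""
    err = np.abs(np.asarray(y, float) - np.asarray(y_hat, float))
    return float(np.mean(err <= tol))

def xauc(y, y_hat):
    """Fraction of pairs (with distinct ground truth) whose predicted order is correct."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    i, j = np.triu_indices(len(y), k=1)
    dy, dp = y[i] - y[j], y_hat[i] - y_hat[j]
    valid = dy != 0
    return float(np.mean(np.sign(dy[valid]) == np.sign(dp[valid])))

def lcc(y, y_hat):
    return float(np.corrcoef(np.asarray(y, float), np.asarray(y_hat, float))[0, 1])

def _average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, float)
    ranks = np.empty(len(x))
    ranks[np.argsort(x, kind="stable")] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def srcc(y, y_hat):
    return lcc(_average_ranks(y), _average_ranks(y_hat))

y_true = [1, 2, 3, 4, 5]
y_pred = [1.2, 1.9, 3.5, 3.4, 5.1]
print(mae(y_true, y_pred), cumulative_score(y_true, y_pred, 0.5),
      xauc(y_true, y_pred), lcc(y_true, y_pred), srcc(y_true, y_pred))
```

In the toy example only one pair (the third and fourth items) is ordered incorrectly, so XAUC and SRCC are both below 1 while LCC remains high.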

#### C\.1\.2\.Implementation Details

Unless otherwise specified, we adopt the following protocol\.

##### Architecture\.

The proposed framework employs an encoder-decoder structure. The encoder is task-specific (detailed in respective sections), while the decoder is consistently instantiated as a two-layer Transformer with 8-head attention for the conditional diffusion process. A dropout rate of 0.1 is applied to mitigate overfitting. The loss weight $\lambda$ in Eq. ([20](https://arxiv.org/html/2606.07599#S4.E20)) is set to 10.

##### Optimization\.

We minimize the objective using the Adam optimizer (Kingma and Ba, [2014](https://arxiv.org/html/2606.07599#bib.bib137)) ($\beta_{1}=0.9$, $\beta_{2}=0.999$) with a learning rate of $5\times10^{-4}$.

##### Training Regimen\.

For vision tasks, models are trained for 100 epochs with a batch size of 128\. For structured data tasks \(WTP, LTV\), we train for 50 epochs with a batch size of 1024\. All experiments are conducted on a single NVIDIA RTX 4090 GPU\.

### C\.2\.Image Aesthetics Assessment \(IAA\)

#### C\.2\.1\.Experimental Setup

##### Datasets\.

We evaluate DiffoR on four standard IAA benchmarks: TAD66K (He et al., [2022](https://arxiv.org/html/2606.07599#bib.bib135)), AVA (Murray et al., [2012](https://arxiv.org/html/2606.07599#bib.bib82)), ICAA17K (He et al., [2023](https://arxiv.org/html/2606.07599#bib.bib81)), and SPAQ (Fang et al., [2020](https://arxiv.org/html/2606.07599#bib.bib70)). Following standard practice, each dataset is randomly partitioned into 80% training, 10% validation, and 10% testing sets.
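A random 80/10/10 partition of this kind can be sketched as follows. The helper name, the seed, and the dataset size are hypothetical; the paper does not specify the seed or the exact splitting code.

```python
import numpy as np

def random_split(n_samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return disjoint index arrays (train, val, test) from a random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    # The test set takes whatever remains, so every index is used exactly once.
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

train_idx, val_idx, test_idx = random_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the partition reproducible across baselines, which matters when all methods are meant to be compared on identical test sets.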

##### Preprocessing\.

Given the narrow range of aesthetic scores \(typically 0–10\), we scale the target values by a factor of 100 during training to facilitate robust vocabulary construction and ordinal sequencing\. For evaluation, all predictions are rescaled to the original range to ensure metric consistency\.

##### Baselines\.

We select baselines based on two criteria: (1) classical architectures with open-source implementations; and (2) state-of-the-art (SOTA) methods in specialized domains (e.g., personalized IAA). To ensure a fair comparison, all baselines are trained using their official hyperparameters and subjected to identical data preprocessing and evaluation protocols, consistent with (He et al., [2023](https://arxiv.org/html/2606.07599#bib.bib81)).

#### C\.2\.2\.Performance\.

Due to space constraints in Sec. [5.1.1](https://arxiv.org/html/2606.07599#S5.SS1.SSS1) of the main paper, comprehensive baseline comparisons for the Image Aesthetics Assessment task are presented here. Tab. [8](https://arxiv.org/html/2606.07599#A3.T8) and Tab. [9](https://arxiv.org/html/2606.07599#A3.T9) detail the performance of all compared methods on the TAD66K, AVA, ICAA17K, and SPAQ datasets.

Table 8. Results of the Image Aesthetics Assessment task on the TAD66K and AVA datasets.

| Method | TAD66K MAE↓ | TAD66K XAUC↑ | TAD66K LCC↑ | TAD66K SRCC↑ | AVA MAE↓ | AVA XAUC↑ | AVA LCC↑ | AVA SRCC↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RAPID (Lu et al., [2014](https://arxiv.org/html/2606.07599#bib.bib98)) | 1.766 | 0.510 | 0.332 | 0.314 | 0.978 | 0.513 | 0.336 | 0.327 |
| AADB (Kong et al., [2016](https://arxiv.org/html/2606.07599#bib.bib101)) | 1.463 | 0.523 | 0.400 | 0.379 | 0.784 | 0.534 | 0.431 | 0.408 |
| PAM (Ren et al., [2017](https://arxiv.org/html/2606.07599#bib.bib89)) | 1.314 | 0.534 | 0.440 | 0.422 | 0.614 | 0.619 | 0.531 | 0.521 |
| NIMA (Talebi and Milanfar, [2018](https://arxiv.org/html/2606.07599#bib.bib143)) | 1.422 | 0.511 | 0.405 | 0.390 | 0.715 | 0.532 | 0.472 | 0.447 |
| ALamp (Ma et al., [2017](https://arxiv.org/html/2606.07599#bib.bib92)) | 1.349 | 0.523 | 0.422 | 0.411 | 0.657 | 0.579 | 0.498 | 0.487 |
| $\text{MP}_{ada}$ (Sheng et al., [2018](https://arxiv.org/html/2606.07599#bib.bib147)) | 1.191 | 0.589 | 0.408 | 0.389 | 0.602 | 0.632 | 0.543 | 0.531 |
| MLSP (Hosu et al., [2019](https://arxiv.org/html/2606.07599#bib.bib133)) | 1.132 | 0.620 | 0.432 | 0.409 | 0.579 | 0.657 | 0.563 | 0.553 |
| BIAA (Zhu et al., [2020](https://arxiv.org/html/2606.07599#bib.bib136)) | 1.329 | 0.538 | 0.431 | 0.348 | 0.672 | 0.566 | 0.496 | 0.476 |
| UIAA (Zeng et al., [2019](https://arxiv.org/html/2606.07599#bib.bib140)) | 1.281 | 0.548 | 0.441 | 0.361 | 0.608 | 0.626 | 0.535 | 0.525 |
| POE (Li et al., [2021](https://arxiv.org/html/2606.07599#bib.bib83)) | 1.185 | 0.588 | 0.420 | 0.377 | 0.633 | 0.608 | 0.524 | 0.506 |
| HGCN (She et al., [2021](https://arxiv.org/html/2606.07599#bib.bib148)) | 1.141 | 0.615 | 0.419 | 0.406 | 0.658 | 0.578 | 0.511 | 0.486 |
| TANet (He et al., [2022](https://arxiv.org/html/2606.07599#bib.bib135)) | 1.081 | 0.649 | 0.452 | 0.428 | 0.577 | 0.659 | 0.568 | 0.554 |
| MaxViT (Tu et al., [2022](https://arxiv.org/html/2606.07599#bib.bib132)) | 1.054 | 0.659 | 0.472 | 0.441 | 0.559 | 0.679 | 0.594 | 0.571 |
| Delegate (He et al., [2023](https://arxiv.org/html/2606.07599#bib.bib81)) | 1.041 | 0.661 | 0.477 | 0.451 | 0.541 | 0.688 | 0.642 | 0.634 |
| AesMamba (Gao et al., [2024](https://arxiv.org/html/2606.07599#bib.bib131)) | 1.035 | 0.666 | 0.482 | 0.468 | 0.522 | 0.697 | 0.663 | 0.656 |
| GoR (Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)) | 0.996 | 0.677 | 0.541 | 0.513 | 0.395 | 0.751 | 0.726 | 0.701 |
| DiffoR (Ours) | 0.914 | 0.694 | 0.553 | 0.526 | 0.388 | 0.766 | 0.734 | 0.725 |

Table 9. Results of the Image Aesthetics Assessment task on the ICAA17K and SPAQ datasets.

| Method | ICAA17K MAE↓ | ICAA17K XAUC↑ | ICAA17K LCC↑ | ICAA17K SRCC↑ | SPAQ MAE↓ | SPAQ XAUC↑ | SPAQ LCC↑ | SPAQ SRCC↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RAPID (Lu et al., [2014](https://arxiv.org/html/2606.07599#bib.bib98)) | 0.7415 | 0.6416 | 0.5164 | 0.5083 | 1.0890 | 0.6997 | 0.6565 | 0.6128 |
| AADB (Kong et al., [2016](https://arxiv.org/html/2606.07599#bib.bib101)) | 0.7142 | 0.6661 | 0.5311 | 0.5195 | 1.083 | 0.7036 | 0.6646 | 0.6162 |
| PAM (Ren et al., [2017](https://arxiv.org/html/2606.07599#bib.bib89)) | 0.7070 | 0.6729 | 0.5385 | 0.5247 | 1.0726 | 0.7104 | 0.6691 | 0.6222 |
| ALamp (Ma et al., [2017](https://arxiv.org/html/2606.07599#bib.bib92)) | 0.6948 | 0.6847 | 0.5478 | 0.5339 | 1.0511 | 0.7250 | 0.6835 | 0.6349 |
| NIMA (Talebi and Milanfar, [2018](https://arxiv.org/html/2606.07599#bib.bib143)) | 0.6957 | 0.6839 | 0.5458 | 0.5333 | 1.0756 | 0.7084 | 0.6709 | 0.6204 |
| $\text{MP}_{ada}$ (Sheng et al., [2018](https://arxiv.org/html/2606.07599#bib.bib147)) | 0.6948 | 0.6848 | 0.5485 | 0.5340 | 1.0525 | 0.7240 | 0.6808 | 0.6341 |
| MLSP (Hosu et al., [2019](https://arxiv.org/html/2606.07599#bib.bib133)) | 0.6814 | 0.6983 | 0.5606 | 0.5445 | 1.0428 | 0.7306 | 0.6952 | 0.6402 |
| MT-A (Fang et al., [2020](https://arxiv.org/html/2606.07599#bib.bib70)) | 0.6855 | 0.6940 | 0.5558 | 0.5412 | 1.0455 | 0.7289 | 0.6862 | 0.6384 |
| BIAA (Zhu et al., [2020](https://arxiv.org/html/2606.07599#bib.bib136)) | 0.6864 | 0.6932 | 0.5552 | 0.5405 | 1.0497 | 0.7259 | 0.6826 | 0.6358 |
| UIAA (Zeng et al., [2019](https://arxiv.org/html/2606.07599#bib.bib140)) | 0.6889 | 0.6907 | 0.5559 | 0.5386 | 1.0469 | 0.7278 | 0.6862 | 0.6376 |
| POE (Li et al., [2021](https://arxiv.org/html/2606.07599#bib.bib83)) | 0.6808 | 0.6966 | 0.5583 | 0.5432 | 1.0456 | 0.7307 | 0.6877 | 0.6368 |
| MUSIQ (Ke et al., [2021](https://arxiv.org/html/2606.07599#bib.bib116)) | 0.6740 | 0.7059 | 0.5632 | 0.5504 | 1.0427 | 0.7308 | 0.6925 | 0.6401 |
| HGCN (She et al., [2021](https://arxiv.org/html/2606.07599#bib.bib148)) | 0.6813 | 0.6983 | 0.5566 | 0.5445 | 1.040 | 0.7328 | 0.6934 | 0.6417 |
| TANet (He et al., [2022](https://arxiv.org/html/2606.07599#bib.bib135)) | 0.6789 | 0.7008 | 0.5599 | 0.5465 | 1.0469 | 0.7279 | 0.6844 | 0.6375 |
| MaxViT (Tu et al., [2022](https://arxiv.org/html/2606.07599#bib.bib132)) | 0.6582 | 0.7227 | 0.5853 | 0.5636 | 1.042 | 0.7308 | 0.6925 | 0.6401 |
| Delegate (He et al., [2023](https://arxiv.org/html/2606.07599#bib.bib81)) | 0.6345 | 0.7498 | 0.6034 | 0.5847 | 1.019 | 0.7473 | 0.7114 | 0.6545 |
| AesMamba (Gao et al., [2024](https://arxiv.org/html/2606.07599#bib.bib131)) | 0.6129 | 0.7663 | 0.6137 | 0.6294 | 0.9875 | 0.7522 | 0.7261 | 0.6895 |
| GoR (Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)) | 0.5842 | 0.7913 | 0.6823 | 0.6789 | 0.8722 | 0.7648 | 0.7434 | 0.7233 |
| DiffoR (Ours) | 0.552 | 0.841 | 0.759 | 0.765 | 0.768 | 0.805 | 0.799 | 0.785 |

### C\.3\.Watch Time Prediction \(WTP\)

#### C\.3\.1\.Experimental Setup

##### Dataset\.

We evaluate our method on two large-scale real-world datasets derived from Kuaishou video logs. KuaiRand (Gao et al., [2022b](https://arxiv.org/html/2606.07599#bib.bib19)) comprises 26,988 users and 6,598 items, with a total of 1,266,560 interaction impressions. KuaiRec (Gao et al., [2022a](https://arxiv.org/html/2606.07599#bib.bib20)), a significantly larger collection by impression count, consists of 7,176 users and 10,728 items, with 12,530,806 impressions.

##### Architecture\.

Unlike traditional sequential recommendation tasks, Watch Time Prediction \(WTP\) does not inherently rely on historical action sequences\. Accordingly, we employ a simple two\-layer Multi\-Layer Perceptron \(MLP\) as the encoder, maintaining consistency with the configuration of baseline methods to ensure fair comparison\.

##### Baselines\.

Details about these compared methods are as follows:

1. VR (Value Regression): This method employs direct regression to predict the absolute watch time, optimized via Mean Squared Error (MSE).
2. D2Q (Zhan et al., [2022](https://arxiv.org/html/2606.07599#bib.bib11)): This method segments data by video duration, predicts watch time quantiles within groups via regression, and then maps them to the final watch time.
3. D2Co (Zhao et al., [2023](https://arxiv.org/html/2606.07599#bib.bib14)): It introduces a sensitivity-controlled correction mechanism to mitigate both duration bias and noisy watching, aiming to recover true user interest via a unified causal framework.
4. CWM (Zhao et al., [2024](https://arxiv.org/html/2606.07599#bib.bib10)): It models counterfactual watch time (CWT) by estimating user interest via a cost-based transform function, optimizing a counterfactual likelihood for prediction.
5. TPM (Lin et al., [2023](https://arxiv.org/html/2606.07599#bib.bib8)): It utilizes a tree structure to model multi-granular time interval relationships, predicting watch time as a weighted sum of probabilities along the tree path.
6. PTPM (Chen et al., [2025](https://arxiv.org/html/2606.07599#bib.bib13)): It replaces the pre-defined binary tree of TPM with a personalized, learnable structure, and adds adaptive ordinal discretization and robust bias correction.
7. CREAD (Sun et al., [2024](https://arxiv.org/html/2606.07599#bib.bib1)): This method constructs dynamic, error-adaptive time intervals. Within each, a classifier determines threshold exceedance, and the final prediction is derived from a weighted sum of probabilistic estimates.
8. SWaT (Yang et al., [2024](https://arxiv.org/html/2606.07599#bib.bib9)): A user-centric statistical framework that models watch time under behavioral assumptions. It employs bucketization for non-stationary viewing probabilities, with prediction via a weighted sum of probabilistic estimates.
9. EGMN (Zhao et al., [2025](https://arxiv.org/html/2606.07599#bib.bib12)): It parameterizes an Exponential-Gaussian Mixture to simultaneously characterize the coarse-grained skewness of quick skips and the fine-grained diversity of user interactions.
10. GoR (Ma et al., [2026](https://arxiv.org/html/2606.07599#bib.bib149)): It employs an autoregressive language modeling approach to generate watch time tokens, where the final predicted duration is the sum of these generated tokens.

For a fair comparison, all compared methods are implemented with their reported optimal hyperparameters and configured to have approximately the same number of model parameters.

### C\.4\.Life Time Value Prediction \(LTV\)

#### C\.4\.1\.Experimental Setup

We evaluate DiffoR on the Criteo-SSC (https://ailab.criteo.com/criteo-sponsored-search-conversion-log-dataset/) and Kaggle (https://www.kaggle.com/c/acquire-valued-shoppers-challenge) datasets. For both datasets, a random 7:1:2 split is used for training, validation, and testing, respectively. Criteo-SSC is a large-scale public dataset derived from Criteo Predictive Search (CPS) logs. Each instance represents a user's click behavior, and the task is to predict conversion and the associated 30-day revenue. The product price feature is excluded from the inputs. The Kaggle dataset contains transaction records. Following (Weng et al., [2024](https://arxiv.org/html/2606.07599#bib.bib124)), the task is to predict a user's total purchase value from a specific company in the year following their initial purchase. Our experiments focus on initial purchases between 2012-03-01 and 2012-07-01, using data from the three companies with the highest transaction volume.

#### C\.4\.2\.Baselines

We compare our method against several existing state-of-the-art LTV methods (Drachen et al., [2018](https://arxiv.org/html/2606.07599#bib.bib86); Ma et al., [2018](https://arxiv.org/html/2606.07599#bib.bib88); Wang et al., [2019](https://arxiv.org/html/2606.07599#bib.bib96); Li et al., [2022a](https://arxiv.org/html/2606.07599#bib.bib112); Liu et al., [2024](https://arxiv.org/html/2606.07599#bib.bib123); Weng et al., [2024](https://arxiv.org/html/2606.07599#bib.bib124)). Details of the compared methods are as follows:

1. Two\-stage \(Drachen et al\., [2018](https://arxiv.org/html/2606.07599#bib.bib86)\) decomposes CLTV prediction into two tasks: a classification task that predicts whether a user will churn, and a regression task that predicts the revenue the user brings\.
2. MTL\-MSE \(Ma et al\., [2018](https://arxiv.org/html/2606.07599#bib.bib88)\) estimates conversion rate and CLTV with an MSE loss under the multi\-task learning paradigm\.
3. ZILN \(Wang et al\., [2019](https://arxiv.org/html/2606.07599#bib.bib96)\) assumes that the long\-tailed CLTV distribution follows a zero\-inflated log\-normal distribution and uses a DNN to estimate the mean $\mu$, standard deviation $\sigma$, and conversion rate $p$ for each sample\.
4. MDME \(Li et al\., [2022a](https://arxiv.org/html/2606.07599#bib.bib112)\) divides the training samples by CLTV into multiple sub\-distributions and buckets, and constructs corresponding classification problems to predict the bucket a sample belongs to\. In the next stage, the bias within the bucket is estimated so that each sample obtains a fine\-grained CLTV value\.
5. MDAN \(Liu et al\., [2024](https://arxiv.org/html/2606.07599#bib.bib123)\) predicts predefined LTV bucket labels using a multi\-class classification network and leverages a multi\-channel learning network to derive embeddings for each bucket\. The final sample representation is obtained by fusing these embeddings with the classification network’s output through a weighted sum, and is then used for CLTV prediction\.
6. OptDist \(Weng et al\., [2024](https://arxiv.org/html/2606.07599#bib.bib124)\) employs an adaptive mechanism to model and select optimal sub\-distributions for individual samples\. It consists of a distribution learning module \(DLM\) that trains multiple sub\-distribution networks and a distribution selection module \(DSM\) that dynamically chooses the appropriate sub\-distribution for each customer\.
7. HiLTV \(Xu et al\., [2025c](https://arxiv.org/html/2606.07599#bib.bib159)\) is a hierarchical framework for game LTV prediction that models multi\-modal recharge behaviors with a Zero\-Inflated Mixture\-of\-Logistic loss and introduces a calibration module for robust new\-user prediction\.
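To make the zero\-inflated log\-normal assumption behind the ZILN baseline (item 3) concrete, the sketch below evaluates the per\-sample negative log\-likelihood and the implied expected value for given $p$, $\mu$, and $\sigma$. It is an illustrative scalar version, not the baseline's actual implementation: the function names `ziln_nll` and `ziln_mean` are hypothetical, and $\mu$ and $\sigma$ are taken here to parameterize the logarithm of the positive values, which is the usual log\-normal convention.

```python
import math


def ziln_nll(y, p, mu, sigma, eps=1e-12):
    """Negative log-likelihood of one label y under a zero-inflated log-normal.

    p     : predicted probability that y > 0 (conversion rate)
    mu    : mean of log(y) for positive y (assumed parameterization)
    sigma : standard deviation of log(y) for positive y, must be > 0
    """
    if y <= 0:
        # Zero label: only the classification term contributes.
        return -math.log(max(1.0 - p, eps))
    cls_term = -math.log(max(p, eps))
    # Log-normal density: 1 / (y * sigma * sqrt(2*pi)) * exp(-(ln y - mu)^2 / (2 sigma^2))
    reg_term = (math.log(y * sigma * math.sqrt(2.0 * math.pi))
                + (math.log(y) - mu) ** 2 / (2.0 * sigma ** 2))
    return cls_term + reg_term


def ziln_mean(p, mu, sigma):
    """Expected value: P(y > 0) times the log-normal mean exp(mu + sigma^2 / 2)."""
    return p * math.exp(mu + 0.5 * sigma ** 2)


print(ziln_nll(0.0, p=0.25, mu=0.0, sigma=1.0))
print(ziln_nll(3.0, p=0.25, mu=1.0, sigma=0.5))
print(ziln_mean(p=0.25, mu=1.0, sigma=0.5))
```

In a trained model the three quantities would be outputs of the network for each sample; the point prediction used for ranking or regression metrics is then `ziln_mean`.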

For this task, DiffoR uses the same encoder architecture as in Appendix [C\.3\.1](https://arxiv.org/html/2606.07599#A3.SS3.SSS1)\.