From Generalist to Specialist Representation

arXiv cs.LG Papers

Summary

This paper proves that task-relevant latent representations can be identified from generalist models in a fully nonparametric setting without interventions or parametric constraints, achieving a hierarchical identifiability guarantee across time steps and within each step.


# From Generalist to Specialist Representation
Source: [https://arxiv.org/html/2605.12733](https://arxiv.org/html/2605.12733)
###### Abstract

Given a generalist model, learning a task-relevant specialist representation is fundamental for downstream applications. Identifiability, the asymptotic guarantee of recovering the ground-truth representation, is critical because it sets the ultimate limit of any model, even with infinite data and computation. We study this problem in a completely nonparametric setting, without relying on interventions, parametric forms, or structural constraints. We first prove that the structure between time steps and tasks is identifiable in a fully unsupervised manner, even when sequences lack strict temporal dependence and may exhibit disconnections, and task assignments can follow arbitrarily complex and interleaving structures. We then prove that, within each time step, the task-relevant latent representation can be disentangled from the irrelevant part under a simple sparsity regularization, without any additional information or parametric constraints. Together, these results establish a hierarchical foundation: task structure is identifiable across time steps, and task-relevant latent representations are identifiable within each step. To our knowledge, each result provides a first general nonparametric identifiability guarantee, and together they mark a step toward provably moving from generalist to specialist models.


### 1 Introduction

Learning latent representations from high-dimensional observations is central to enabling machines to understand and act in the world (Bengio et al., [2013](https://arxiv.org/html/2605.12733#bib.bib264); Schölkopf et al., [2021](https://arxiv.org/html/2605.12733#bib.bib261)). World models, for instance, compress raw sensory input into low-dimensional features that capture dynamics (Ha and Schmidhuber, [2018](https://arxiv.org/html/2605.12733#bib.bib260)). Rather than modeling the entire environment, task-relevant representations are desirable because they retain only the information required for the task, providing both efficiency and robustness (Tishby and Zaslavsky, [2015](https://arxiv.org/html/2605.12733#bib.bib258); Wong et al., [2025](https://arxiv.org/html/2605.12733#bib.bib259)). For instance, in autonomous driving, planning depends on the positions and velocities of nearby vehicles and pedestrians, not on the color of the cars or billboards along the road.

Without identifiability, a learned representation cannot be guaranteed to reflect the ground truth, even with infinite data and computation. This challenge has long been central to latent representation learning, extending beyond task-relevant settings (Hyvärinen and Pajunen, [1999](https://arxiv.org/html/2605.12733#bib.bib189); Locatello et al., [2019](https://arxiv.org/html/2605.12733#bib.bib190)). Given two observationally equivalent models $\mathbf{o}=f(\mathbf{s})$ and $\mathbf{o}=\hat{f}(\hat{\mathbf{s}})$, an arbitrary transformation $\phi$ may exist such that $\hat{\mathbf{s}}=\phi(\mathbf{s})$. In this case, the recovered latents need not correspond in any meaningful way to the true ones. Task-relevant variables, for example, may remain entangled with irrelevant factors, making it impossible to isolate what actually matters for the task. Such ambiguity introduces irreducible uncertainty into a machine's internal model of the world, constraining the ceiling of achievable intelligence and creating risks in high-stakes applications.

Existing theory provides conditions for identifiability of latent representations. In classical linear settings, identifiability can be obtained under additional parametric assumptions, for example in factor models with constraints on loadings (Anderson et al., [1956](https://arxiv.org/html/2605.12733#bib.bib249); Jöreskog, [1969](https://arxiv.org/html/2605.12733#bib.bib250); Shapiro, [1985](https://arxiv.org/html/2605.12733#bib.bib251)), in linear Independent Component Analysis (ICA) via non-Gaussianity (Comon, [1994](https://arxiv.org/html/2605.12733#bib.bib144); Hyvärinen et al., [2001](https://arxiv.org/html/2605.12733#bib.bib147)), and in tensor or multi-view models via Kruskal-type rank conditions (Kruskal, [1977](https://arxiv.org/html/2605.12733#bib.bib252); Sidiropoulos and Bro, [2000](https://arxiv.org/html/2605.12733#bib.bib253); Allman et al., [2009](https://arxiv.org/html/2605.12733#bib.bib263)). More recently, nonlinear theory has advanced along two routes. In nonlinear ICA, one line leverages auxiliary information across domains or time (Hyvärinen and Morioka, [2016](https://arxiv.org/html/2605.12733#bib.bib153); Hyvärinen et al., [2019](https://arxiv.org/html/2605.12733#bib.bib138); Yao et al., [2021](https://arxiv.org/html/2605.12733#bib.bib227); Hälvä et al., [2021](https://arxiv.org/html/2605.12733#bib.bib236); Lachapelle et al., [2022](https://arxiv.org/html/2605.12733#bib.bib175)), while another constrains the mixing class (Taleb and Jutten, [1999](https://arxiv.org/html/2605.12733#bib.bib199); Moran et al., [2021](https://arxiv.org/html/2605.12733#bib.bib134); Kivva et al., [2022](https://arxiv.org/html/2605.12733#bib.bib78); Zheng et al., [2022](https://arxiv.org/html/2605.12733#bib.bib111); Gresele et al., [2021](https://arxiv.org/html/2605.12733#bib.bib137); Buchholz et al., [2022](https://arxiv.org/html/2605.12733#bib.bib83)). In causal representation learning, identifiability is often derived from interventional data (von Kügelgen et al., [2023](https://arxiv.org/html/2605.12733#bib.bib170); Jiang and Aragam, [2023](https://arxiv.org/html/2605.12733#bib.bib171); Jin and Syrgkanis, [2023](https://arxiv.org/html/2605.12733#bib.bib157); Zhang et al., [2024](https://arxiv.org/html/2605.12733#bib.bib169); Varici et al., [2025](https://arxiv.org/html/2605.12733#bib.bib265)) or counterfactual views (von Kügelgen et al., [2021](https://arxiv.org/html/2605.12733#bib.bib205); Brehmer et al., [2022](https://arxiv.org/html/2605.12733#bib.bib173)), which require some control over the data-generating process. Recent work considers the general setting without extra information, under the assumption that both latent and observed variables are Boolean vectors (Zhang et al., [2025](https://arxiv.org/html/2605.12733#bib.bib180)). These conditions provide significant insights into recovering the underlying generative process, but may overly restrict the range of applicable scenarios.

At the same time, most existing theoretical results focus on full identifiability of the latent system: either recovering all latent variables component-wise, or identifying them up to ancestors or neighborhoods. Yet such comprehensive recovery is often unnecessary. In many applications, tasks depend only on a subset of latent factors: for instance, in robotic manipulation, success hinges on object pose and gripper position, while lighting and textures are irrelevant. Shifting the goal from full-system identifiability to task-relevant identifiability enables weaker assumptions while still directly supporting planning, transfer, and generalization. Recent works have explored subspace factorization (von Kügelgen et al., [2021](https://arxiv.org/html/2605.12733#bib.bib205); Kong et al., [2022](https://arxiv.org/html/2605.12733#bib.bib115); Li et al., [2023](https://arxiv.org/html/2605.12733#bib.bib164); Liu et al., [2023](https://arxiv.org/html/2605.12733#bib.bib262)), aiming to decompose latent factors into interpretable blocks. However, these approaches impose fixed structures, such as content–style separation, and are not designed to accommodate flexible task settings, where latent variables may correspond to tasks with unknown number, structure, and assignment, and where this uncertainty can further vary across time steps. Thus, the question remains:

Is a task-relevant world representation identifiable in the general setting?

##### Contributions.

To answer this, we develop a theoretical framework for identifying task-relevant representations from the complex dynamics of the observational world. Our first result proves that task structure across time is identifiable in a fully general setting, without any parametric or structural assumptions (Section [3](https://arxiv.org/html/2605.12733#S3)). We do not require strict temporal dependence: steps may be disconnected or even i.i.d., so temporal information cannot be leveraged. In addition, tasks may appear, disappear, and reappear in arbitrary order, allowing interleaving task-time structures. After identifying the tasks for each time step, we further ask which latent variables are relevant to those tasks, and provide the first nonparametric identifiability result for task-relevant latent representations without relying on interventions or functional constraints (Section [4](https://arxiv.org/html/2605.12733#S4)). Specifically, we show that fine-tuning a pretrained model with a simple task-latent regularization provably disentangles task-relevant variables from irrelevant ones. Together, these results mark a step towards establishing principled pathways from generalist to specialist models that achieve both compression and fidelity.

### 2 Preliminaries

![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/dgp_2.png)

Figure 1: An illustration of the generative process. Latent states $\mathbf{s}_t$ generate observations $\mathbf{o}_t$ via nonlinear functions and interact with actions $\mathbf{a}_t$ under varying temporal connectivity, where consecutive steps may be arbitrarily disconnected. Tasks $\mathbf{g}_i$ are defined as colliders across time steps, and different tasks can arbitrarily interleave with one another. The zoomed-in view (right) shows how different components of $\mathbf{s}_t$ connect to multiple tasks via the intermediate actions.

We assume an observed sequence $\{\mathbf{o}_t\}_{t=1}^{T}$ generated by latent states $\{\mathbf{s}_t\}_{t=1}^{T}$, with $\mathbf{o}_t \in \mathbb{R}^{d_o}$, $\mathbf{s}_t \in \mathbb{R}^{d_s}$, and actions $\mathbf{a}_t \in \mathbb{R}^{d_a}$. Observations satisfy

$$\mathbf{o}_t = f_t(\mathbf{s}_t), \tag{1}$$

where $f_t$ is a diffeomorphism onto its image. The generative function $f_t$ is hidden and completely unknown. We allow varying temporal connectivity: $\mathbf{s}_t \to \mathbf{a}_t$ for all $t$, and $\mathbf{a}_t \to \mathbf{s}_{t+1}$, $\mathbf{s}_t \to \mathbf{s}_{t+1}$ whenever the boundary $t \to t{+}1$ is connected; both edges into $\mathbf{s}_{t+1}$ are omitted when it is disconnected. A Structural Causal Model (SCM) consistent with these is defined as $\mathbf{a}_t = \pi_t(\mathbf{s}_t, \eta_t)$, where

$$\mathbf{s}_{t+1} = \begin{cases} F_t(\mathbf{s}_t, \mathbf{a}_t, \xi_t), & \text{if } t \to t{+}1 \text{ is connected}, \\ F_t^{0}(\xi_t), & \text{otherwise}, \end{cases} \tag{2}$$

with independent noises $\eta_t, \xi_t$. We define tasks $\{\mathbf{g}_i\}_{i=1}^{M}$ as colliders among different time steps, that is, $\mathbf{s}_t \to \mathbf{a}_t \to \mathbf{g}_i$ if the time step $t$ is relevant to $\mathbf{g}_i$. The process is visualized in Figure [1](https://arxiv.org/html/2605.12733#S2.F1).

Given the observed variables $\{\mathbf{o}_t\}_{t=1}^{T}$ and the global set of tasks $\{\mathbf{g}_i\}_{i=1}^{M}$, our goal is first to identify the structure linking time steps and tasks (Section [3](https://arxiv.org/html/2605.12733#S3)), and then, within each latent state $\mathbf{s}_t$, to isolate the components relevant to the associated tasks (Section [4](https://arxiv.org/html/2605.12733#S4)). All theoretical guarantees need to be achieved in the general nonparametric setting without additional information.
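To make the setup concrete, the following is a minimal simulation sketch of this generative process, assuming linear-Gaussian mechanisms, small illustrative dimensions, and a random disconnection pattern; the theory itself places no such parametric restrictions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_s, d_a = 8, 4, 2                      # illustrative sizes
connected = rng.random(T - 1) > 0.2        # some boundaries may be disconnected

def policy(s):                             # a_t = pi_t(s_t, eta_t)
    return s[:d_a] + 0.1 * rng.normal(size=d_a)

states, actions = [rng.normal(size=d_s)], []
for t in range(T - 1):
    actions.append(policy(states[-1]))
    if connected[t]:                       # s_{t+1} = F_t(s_t, a_t, xi_t)
        nxt = (0.8 * states[-1] + 0.5 * np.pad(actions[-1], (0, d_s - d_a))
               + 0.1 * rng.normal(size=d_s))
    else:                                  # s_{t+1} = F_t^0(xi_t)
        nxt = rng.normal(size=d_s)
    states.append(nxt)
actions.append(policy(states[-1]))

# Tasks are colliders: each g_i aggregates the actions of its relevant
# steps, and task-time assignments may interleave arbitrarily.
task_steps = {1: [0, 1, 6, 7], 2: [2, 3, 4, 5]}
tasks = {i: sum(actions[t] for t in steps) + 0.1 * rng.normal(size=d_a)
         for i, steps in task_steps.items()}

# Observations o_t = f_t(s_t), with f_t a diffeomorphism onto its image.
observations = [np.tanh(s) for s in states]
```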

### 3 Learning Temporal Task Structure

We first establish the identifiability of the time-task structure in the general setting. This structure is essential, as it forms the foundation for recovering task-relevant latent representations within each step. Without knowing which tasks are active at which times, disentangling latent variables at the step level would be ill-posed. Providing a formal guarantee in the most general scenario is challenging, mainly for the following reasons:

- The hidden process is fully nonparametric, with no auxiliary information or distributional constraints.
- Tasks may interleave arbitrarily over time, while classical decomposition assumes sequential completion.
- Temporal dependence is not guaranteed; the sequence may contain arbitrary disconnected boundaries.

Despite these challenges, we prove that the structure between time steps and tasks is identifiable under standard conditions. This result forms the first pillar of our framework: a principled characterization of temporal task structure in the general regime without additional information.

#### 3.1 Characterization of Pairwise Structure

We assume $T$ time steps, partitioned into $N$ contiguous segments of equal length $L = T/N$, with $L \geq 2$ and $N \mid T$. Define

$$\mathbf{S} = \{\mathbf{S}_1, \dots, \mathbf{S}_N\}, \qquad \mathbf{S}_i = \{\mathbf{s}_{(i-1)L+1}, \dots, \mathbf{s}_{iL}\}. \tag{3}$$

All states within a segment share the same set of active tasks, and each task $\mathbf{g}_i$ must appear in at least two segments. Segments can be short (as few as two steps), ensuring flexibility in capturing state changes. To formalize the conditions used in our theory, we introduce the following notion.

###### Definition 1 (Band conditioning set).

For $k < v$ and task $\mathbf{g}_i$, define

$$\mathbf{Z}_{\mathrm{band}}(k,v,i) = \big(\{\mathbf{s}_{kL-1}, \mathbf{s}_{kL+1}, \mathbf{s}_{vL-1}, \mathbf{s}_{vL+1}\} \cap \{\mathbf{s}_1, \dots, \mathbf{s}_T\}\big) \cup \{\mathbf{g}_i\},$$

with out-of-range indices omitted.
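Read as an index computation, the band conditioning set admits a small helper; this is a sketch in which state indices are 1-based over the horizon $T$, and the task variable $\mathbf{g}_i$ is appended by the caller.

```python
def band_conditioning_set(k, v, L, T):
    """State indices of Z_band(k, v, i) for segments k < v.

    Returns {kL-1, kL+1, vL-1, vL+1} clipped to the horizon {1, ..., T};
    the task variable g_i is always conditioned on as well and is handled
    separately by the caller.
    """
    candidates = [k * L - 1, k * L + 1, v * L - 1, v * L + 1]
    return sorted(t for t in candidates if 1 <= t <= T)  # omit out-of-range
```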

##### Why Temporal Segmentation.

It is worth noting that our segmentation does not presume prior knowledge of how a sequence should be divided, such as knowing in advance that a video breaks into distinct semantic periods, where missing that knowledge would cause segmentation errors. Instead, the purpose is simply to ensure that our tasks are well defined. The only requirement is that each segment contains more than one time step, which represents the minimal granularity needed to preserve temporal coherence. Theoretically, we could always set the segment length to the minimum to capture the finest granularity of changes. In practice, one can always view the sequence as a collection of two-step segments without relying on any semantic understanding of the underlying process. With granularity this fine, segmentation has negligible effect on the tests.

Our main result is the following, with the standard Markov and Faithfulness conditions.

###### Assumption 1 (Markov and Faithfulness (Spirtes et al., [2000](https://arxiv.org/html/2605.12733#bib.bib182))).

Let $\mathbf{G}$ be a Directed Acyclic Graph (DAG) and $\mathbb{P}$ a distribution over variables $\mathbf{V}$. The Markov property requires that each $X \in \mathbf{V}$ is independent of its non-descendants given its parents in $\mathbf{G}$. Faithfulness requires that $\mathbb{P}$ entails no conditional independence relations beyond those implied by the Markov property of $\mathbf{G}$.

![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/example.png)

Figure 2: A quick example for Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1). Note that the observed variables $\mathbf{o}_t$ have been omitted for brevity. We test whether $\mathbf{S}_k = \{\mathbf{s}_3, \mathbf{s}_4\}$ and $\mathbf{S}_v = \{\mathbf{s}_7, \mathbf{s}_8\}$ belong to task $\mathbf{g}_1$ by checking the conditional dependence $\mathbf{s}_4 \not\perp\!\!\!\perp \mathbf{s}_8 \mid \mathbf{Z}_{\mathrm{band}}(k,v,1)$, where $\mathbf{Z}_{\mathrm{band}}(k,v,1) = \{\mathbf{s}_3, \mathbf{s}_5, \mathbf{s}_7, \mathbf{s}_9, \mathbf{g}_1\}$. Since $\mathbf{s}_4$ and $\mathbf{s}_8$ are conditionally dependent given $\mathbf{Z}_{\mathrm{band}}(k,v,1)$, $\mathbf{g}_1$ is identified as (one of) the underlying tasks. Note that our theory accommodates arbitrary disconnections between time steps (e.g., $\mathbf{s}_1$ and $\mathbf{s}_2$), multiple tasks, and arbitrarily interleaving task structures.

###### Theorem 1.

Assume the Markov property and Faithfulness with respect to the graph above, and $L \geq 2$. Fix $k < v$ and a task $\mathbf{g}_i$. Then $\mathbf{g}_i$ is relevant to segments $\mathbf{S}_k$ and $\mathbf{S}_v$ if and only if

$$\mathbf{s}_{kL} \not\perp\!\!\!\perp \mathbf{s}_{vL} \,\big|\, \mathbf{Z}_{\mathrm{band}}(k,v,i).$$

##### Proof Sketch.

The proof (Appendix [A.2](https://arxiv.org/html/2605.12733#A1.SS2)) relies on characterizing all possible d-connecting paths between $\mathbf{s}_{kL}$ and $\mathbf{s}_{vL}$ under the band conditioning set. Conditioning on the immediate boundary states blocks any path that propagates purely through the temporal dynamics, so dependence can only be transmitted through a shared task. Since tasks have only incoming edges, any task other than $\mathbf{g}_i$ appears as a closed collider and blocks the path, which implies that $\mathbf{g}_i$ must be the unique source of dependence. Careful consideration of local structures and corner cases then shows that the only admissible d-connecting paths are those where actions adjacent to $\mathbf{s}_{kL}$ and $\mathbf{s}_{vL}$ both feed into $\mathbf{g}_i$.

##### Implication.

Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1) provides a provable way to determine whether two segments share the same task $\mathbf{g}_i$, giving an exact characterization of temporal task relevance (visualized in Fig. [2](https://arxiv.org/html/2605.12733#S3.F2)). This is powerful: once we can identify the corresponding tasks of any pair of segments, the entire task structure can be discovered. Moreover, the condition is testable directly from observed data, since conditional independence is preserved under the invertible map $\mathbf{o}_t = f_t(\mathbf{s}_t)$ and the task variables $\mathbf{g}_i$ are observed. Hence the procedure requires no parametric assumptions and is broadly applicable. Finally, the result does not rely on restrictive structural constraints, allowing tasks to appear, disappear, and interleave in arbitrary order across time, and sequences can be disconnected. This directly generalizes the most common assumption of sequential completion.

Since all states within a segment share the same task set, conditional independence (CI) tests involving the boundary states are equivalent to tests involving any other pair of states within the two segments (provided $L > 2$). Intuitively, this homogeneity means that the specific choice of representative states does not matter: any pair of states across two segments encodes the same task-level dependence. For example, $\mathbf{s}_{kL} \not\perp\!\!\!\perp \mathbf{s}_{vL} \mid \mathbf{Z}_{\mathrm{band}}(k,v,i)$ is equivalent to

$$\mathbf{s}_{kL-1} \not\perp\!\!\!\perp \mathbf{s}_{vL-1} \,\big|\, \big(\{\mathbf{s}_{kL-2}, \mathbf{s}_{kL}, \mathbf{s}_{vL-2}, \mathbf{s}_{vL}\} \cap \{\mathbf{s}_1, \dots, \mathbf{s}_T\}\big) \cup \{\mathbf{g}_i\}.$$

This invariance ensures that identifiability does not hinge on an arbitrary boundary choice, but is intrinsic to the task structure itself.

###### Corollary 1.

Assume the Markov property and Faithfulness with respect to the graph above, and $L > 2$. Fix $k < v$ and a task $\mathbf{g}_i$. Then $\mathbf{g}_i$ is relevant to segments $\mathbf{S}_k$ and $\mathbf{S}_v$ iff

$$\mathbf{s}_{j} \not\perp\!\!\!\perp \mathbf{s}_{q} \,\big|\, \big(\{\mathbf{s}_{j-1}, \mathbf{s}_{j+1}, \mathbf{s}_{q-1}, \mathbf{s}_{q+1}\} \cap \{\mathbf{s}_1, \dots, \mathbf{s}_T\}\big) \cup \{\mathbf{g}_i\},$$

for any $j \in \{(k-1)L+1, \dots, kL\}$ and $q \in \{(v-1)L+1, \dots, vL\}$.

This corollary does not impose additional conditions but establishes an equivalent characterization, guaranteed by the basic coherence of the tasks. It strengthens the applicability of Thm. [1](https://arxiv.org/html/2605.12733#Thmtheorem1) by showing that task relevance can be tested using arbitrary representatives within segments, not only their boundaries. Conceptually, this flexibility highlights that identifiability of the temporal task structure arises from the global dependency pattern induced by colliders, rather than from local temporal adjacency. As a consequence, the result is robust to segmentation choices and ensures that the recovered structure reflects intrinsic properties of the underlying process rather than artifacts.

#### 3.2 Discovering Global Task Structure

Building on Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1) and Corollary [1](https://arxiv.org/html/2605.12733#Thmcorollary1), the characterization of task relevance naturally yields an algorithmic procedure. With the proposed test, one can systematically determine whether two segments share a common task. Aggregating these pairwise tests across all segment pairs yields the complete temporal task structure, as detailed in Algorithm [1](https://arxiv.org/html/2605.12733#alg1).

###### Proposition 1.

Under the conditions of Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1), Algorithm [1](https://arxiv.org/html/2605.12733#alg1) exactly recovers the temporal task structure.

The procedure is not only theoretically sound but also computationally efficient: its cost scales with the temporal horizon rather than the observation dimension. Moreover, because conditional independence is preserved under the invertible observation map, the tests can be performed directly in the observed space, without knowledge of the latent states or parametric assumptions on the dynamics. This provides an operational bridge from identifiability theory to practice: hidden temporal task structure can be precisely recovered by a simple, general, and provably correct algorithm, even in environments with arbitrary interleaving, recurrence, and disconnections across time.

**Algorithm 1: Global task structure discovery**

**Input:** Segments $\mathbf{S}_{1:N}$ of length $L \geq 2$; tasks $\mathbf{G} = \{\mathbf{g}_{1:M}\}$

**Output:** Segment–task sets $\{\mathcal{T}(\mathbf{S}_k)\}_{k=1}^{N}$ and step labels $\{\mathcal{T}(t)\}_{t=1}^{T}$

1. Initialize $\mathcal{T}(\mathbf{S}_{1:N}) \leftarrow [\emptyset]^{N}$ and $\mathcal{P} \leftarrow \{(k,v) \mid 1 \leq k < v \leq N\}$.
2. For each $i \in [1..M]$ and each $(k,v) \in \mathcal{P}$: if $\mathbf{s}_{kL} \not\perp\!\!\!\perp \mathbf{s}_{vL} \mid \mathbf{Z}_{\mathrm{band}}(k,v,i)$, then $\mathcal{T}(\mathbf{S}_k) \leftarrow \mathcal{T}(\mathbf{S}_k) \cup \{\mathbf{g}_i\}$ and $\mathcal{T}(\mathbf{S}_v) \leftarrow \mathcal{T}(\mathbf{S}_v) \cup \{\mathbf{g}_i\}$.
3. For each $k \in [1..N]$ and each $t \in \mathbf{S}_k$: $\mathcal{T}(t) \leftarrow \mathcal{T}(\mathbf{S}_k)$.
4. Return $\{\mathcal{T}(\mathbf{S}_k)\}_{k=1}^{N}$ and $\{\mathcal{T}(t)\}_{t=1}^{T}$.
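A direct Python transcription of Algorithm 1 might look as follows. This is a sketch reusing the hypothetical `band_conditioning_set` helper above; `ci_dependent` stands in for any consistent conditional independence test applied to the observed data (e.g., the Fisher z-test sketch in Section 5).

```python
from itertools import combinations

def discover_task_structure(N, L, T, M, ci_dependent):
    """Algorithm 1 sketch. ci_dependent(x, y, cond, i) should return True
    iff s_x and s_y are conditionally dependent given the states in `cond`
    together with task g_i, tested on the observed data."""
    seg_tasks = {k: set() for k in range(1, N + 1)}       # T(S_k) <- {}
    for i in range(1, M + 1):                             # each task g_i
        for k, v in combinations(range(1, N + 1), 2):     # each pair k < v
            cond = band_conditioning_set(k, v, L, T)      # from Section 3.1
            if ci_dependent(k * L, v * L, cond, i):
                seg_tasks[k].add(i)
                seg_tasks[v].add(i)
    # Propagate segment labels to their time steps.
    step_tasks = {t: set(seg_tasks[(t - 1) // L + 1]) for t in range(1, T + 1)}
    return seg_tasks, step_tasks
```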

##### What If Tasks Are Not Given.

In practice, tasks may be unobserved and must be inferred from data. In this case, we treat the inferred task representation as a latent variable and apply the CI tests to it directly, preserving the original logic. To avoid confounding, the representation is learned independently of the CI relations being evaluated. Since representation learning is conceptually separate from temporal structure recovery, extending the method to latent task settings remains fully feasible.

Moreover, the method does not require prior knowledge of the exact task set. Starting from a large pool of candidate tasks, the algorithm provably recovers the correct subset together with its temporal structure. This assumption is substantially weaker than knowing the true task set in advance. In practice, one often has access to, or can infer, a broad collection of basic tasks and only needs to identify which of them, and in what structure, appear in the trajectory. Therefore, our problem setting fits a wide range of real-world scenarios even without precise knowledge of the task set.

##### Complexity.

The main practical trade-off concerns computational complexity. For large-scale datasets, using very short segment lengths leads to a large number of segments and thus many candidate temporal structures. While this does not affect correctness, the runtime grows with the number of segment pairs to be tested. In practice, increasing segment length can significantly reduce computational cost, at the expense of a modest loss in temporal resolution. This provides a controllable accuracy–scalability trade-off.

### 4 Learning Task-Relevant Representation

Having established the identifiability of temporal task structure, we now turn to the problem of learning task-relevant representations within each time step. Identifying which tasks are active at which times clarifies the dynamics across segments and ensures that temporal dependencies are properly aligned with task boundaries. This strengthens the focus on the temporal dimension, but it does not yet resolve the finer question of representation: within a single latent state $\mathbf{s}_t$, only a subset of variables may be relevant to the task, while the rest correspond to nuisance factors. To obtain a minimal yet sufficient representation, we must therefore dig deeper into the latent space of $\mathbf{s}_t$ and disentangle the components that are truly task-relevant from those that are irrelevant. Specifically, we aim to ensure that the estimated latents (e.g., $\mathbf{s}_{t_1}$) associated with each task (e.g., at time $t_1$) are not functions of any other latent variables, whether tied to other tasks or unrelated altogether.

Identifiability of the latent variables concerns recovering the unique ground truth $\mathbf{s}_t$ from two observationally equivalent models $\mathbf{o}_t = f_t(\mathbf{s}_t)$ and $\mathbf{o}_t = \hat{f}_t(\hat{\mathbf{s}}_t)$. Let $\mathbf{g} = u(\mathbf{s}, \theta)$ and $\hat{\mathbf{g}} = \hat{u}(\hat{\mathbf{s}}, \hat{\theta})$, where $\theta$ and $\hat{\theta}$ denote variables other than $\mathbf{s}$ and $\hat{\mathbf{s}}$. These mappings exist due to the dependency structure $\mathbf{s}_t \to \mathbf{a}_t \to \mathbf{g}_i$. With slight abuse of notation, we mostly omit $\theta$ and $\hat{\theta}$ and write $\mathbf{g} = u(\mathbf{s})$ and $\hat{\mathbf{g}} = \hat{u}(\hat{\mathbf{s}})$ for brevity.

#### 4.1 Identifiability with a Generalist Model

We begin by asking what can be achieved without imposing any structural constraint beyond observational equivalence. That is, we consider a *generalist model* that is not explicitly regularized to focus on the corresponding tasks. While such a model may capture the necessary information for prediction, its ability to recover the ground-truth task-relevant latent representation is limited.

##### Additional Notation.

For a vector-valued function $u: \mathbb{R}^{d_s} \to \mathbb{R}^{d_g}$, we denote by $J_u(\mathbf{s}_t)$ the Jacobian matrix with respect to $\mathbf{s}_t$, whose $(i,j)$ entry is $\partial u_i / \partial s_{t,j}$. For a vector or matrix $A$, we write $\mathcal{I}(A)$ for the set of indices corresponding to its nonzero entries, and $\|\mathcal{I}(A)\|$ for its cardinality (the number of nonzeros, i.e., the $\ell_0$ norm). We denote $I_k \subseteq [d_s]$ as the set of indices of the latent variables relevant to task $\mathbf{g}_k$, and $\mathbf{s}_{t,I_k}$ as the corresponding set of latent variables.
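In this notation, $\mathcal{I}(A)$ and $\|\mathcal{I}(A)\|$ are simply the support of $A$ and its size; a small numpy illustration follows (the tolerance for numerical zeros is an implementation detail, not part of the definition).

```python
import numpy as np

def support(A, tol=1e-8):
    """I(A): index set of the (numerically) nonzero entries of A."""
    return set(zip(*np.nonzero(np.abs(A) > tol)))

A = np.array([[0.0, 2.0], [1.5, 0.0]])
assert support(A) == {(0, 1), (1, 0)}   # I(A)
assert len(support(A)) == 2             # ||I(A)||, the l0 norm
```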

###### Proposition 2.

Assume that, for each $i \in [d_g]$, there exists a set $\mathcal{N}_i$ of $\|\mathcal{I}(J_u(\mathbf{s}_t)_{i,\cdot})\|$ distinct points such that the corresponding Jacobian row vectors

$$\left(\frac{\partial u_i}{\partial s_{t,1}}, \frac{\partial u_i}{\partial s_{t,2}}, \ldots, \frac{\partial u_i}{\partial s_{t,d_s}}\right)\Bigg|_{\mathbf{s}_t = \mathbf{s}_t^{(l)}}, \qquad l \in \mathcal{N}_i,$$

are linearly independent, and $\mathcal{I}\big((J_u(\mathbf{s}_t^{(l)})\,M)_{i,\cdot}\big) \subseteq \mathcal{I}\big((J_{\hat{u}}(\hat{\mathbf{s}}_t^{(l)}))_{i,\cdot}\big)$, where $M$ is a matrix sharing the nonzero index set of the matrix-valued function $M'(\mathbf{s}, \hat{\mathbf{s}})$ in $J_u(\mathbf{s})\,M'(\mathbf{s}, \hat{\mathbf{s}}) = J_{\hat{u}}(\hat{\mathbf{s}})$. Then, for any task $\mathbf{g}_k$ with latent index set $I_k$, the number of estimated task-relevant latent variables is at least that of the ground truth, i.e.,

$$\big\|\mathcal{I}\big((J_{\hat{u}})_{i,\cdot}\big)\big\| \;\geq\; \big\|\mathcal{I}\big((J_u)_{i,\cdot}\big)\big\|.$$

##### Proof Sketch.

The argument starts by connecting the support of the Jacobian to the underlying dependency graph. The span condition ensures that information is preserved during estimation, so no true dependency can be eliminated in the transformation between $\mathbf{s}$ and $\hat{\mathbf{s}}$. Equivalently, the nonzero pattern of $J_u(\mathbf{s})$ must be contained within that of $J_{\hat{u}}(\hat{\mathbf{s}})$. Translated back to the task–latent structure, this implies that the support of the estimated task-relevant latent variables is always a superset of the true one.

##### Discussion on Assumptions.

The requirement of sufficient nonlinearity is standard in identifiability analyses of nonlinear models (Lachapelle et al., [2022](https://arxiv.org/html/2605.12733#bib.bib175); Zheng et al., [2022](https://arxiv.org/html/2605.12733#bib.bib111)). Specifically, it rules out degenerate cases where samples concentrate on an extremely small subset (e.g., as few as several samples) such that the Jacobian vectors cannot even span their own supports. At the same time, identifiability is defined as an asymptotic property (infinite samples), and the assumption only requires the existence of several nondegenerate samples in the whole space, which is almost always satisfied in practice. A more detailed discussion of the assumption is in Appendix [B](https://arxiv.org/html/2605.12733#A2).

##### Implication.

This result shows that, without explicit modeling of specific tasks, generalist models tend to learn a representation that is larger than necessary. The conclusion is intuitive: a sufficiently expressive foundation model can capture a representation that contains all information needed for downstream tasks.

At the same time, the guarantee $\|\mathcal{I}((J_{\hat{u}})_{i,\cdot})\| \geq \|\mathcal{I}((J_u)_{i,\cdot})\|$ remains weak: it ensures only that the estimated representation is no smaller in size than the true one, not that it matches it or recovers the correct variables. In practice, this means a generalist model often learns an overcomplete representation, where task-relevant variables may still be missed or entangled with irrelevant ones. Worse, the inequality provides no guarantee on recovering variable values, since the enlarged representation may distort or discard essential information. This formalizes the intuition that while frontier generalist models are expressive enough to encode all task information, without additional regularization they fail to isolate the minimal task-relevant latent representation.

#### 4.2 From Generalist to Specialist

The previous result shows that a generalist model often produces an enlarged representation, estimating more task-relevant variables than truly exist. Such oversizing does not guarantee that all genuine factors are preserved; irrelevant latents may be included, while essential ones can still be distorted or obscured. To recover the *true* task-relevant representation, additional inductive bias is needed.

A natural choice is sparsity regularization on the estimated task–latent structure, which enforces minimality in the recovered structure. Intuitively, sparsity prunes away superfluous dimensions and curbs over-expansion, ensuring that the final representation retains only the variables genuinely required for each task. The corresponding guarantee is as follows:

###### Theorem 2.

Consider two observationally equivalent generative processes $\mathbf{o}_t = f_t(\mathbf{s}_t)$ and $\mathbf{o}_t = \hat{f}_t(\hat{\mathbf{s}}_t)$, and assume the conditions in Proposition [2](https://arxiv.org/html/2605.12733#Thmproposition2). Then, for any task $\mathbf{g}_k$ with latent index set $I_k$, with a sparsity regularization

$$\|\mathcal{I}(J_{\hat{u}})\| \;\leq\; \|\mathcal{I}(J_u)\|,$$

under some permutation $\pi$, the estimated task-relevant latent variables $\hat{\mathbf{s}}_{t,\pi(I_k)}$ are an invertible function $h_k$ of only the ground-truth task-relevant latent variables $\mathbf{s}_{t,I_k}$, i.e.,

$$\hat{\mathbf{s}}_{t,\pi(I_k)} \;=\; h_k(\mathbf{s}_{t,I_k}).$$

##### Proof Sketch.

Under the span assumptions from Proposition [2](https://arxiv.org/html/2605.12733#Thmproposition2), we first show that every nonzero entry of $J_u(\mathbf{s})$ must correspond to a nonzero entry of $J_{\hat{u}}(\hat{\mathbf{s}})$, up to a column permutation $\pi$. Sparsity regularization enforces that no additional entries can remain nonzero, which upgrades inclusion into exact equivalence of supports. Finally, algebraic analysis moves from structure to variables, exploiting the separation between task-relevant and task-irrelevant latents.

##### Implication.

Practically, Theorem [2](https://arxiv.org/html/2605.12733#Thmtheorem2) establishes formal guarantees for going from a generalist to a specialist model. It shows that, on top of the general guarantee in Proposition [2](https://arxiv.org/html/2605.12733#Thmproposition2), a simple sparsity regularization is sufficient to disentangle task-relevant latent variables from the irrelevant ones. Unlike the generalist guarantee, which recovers a superset of the true support, the sparsity constraint sharpens recovery to the variable level and disentangles the irrelevant parts. Conceptually, this result highlights the necessity of moving from generalist to specialist modeling: while a generalist can cover all possible dependencies, only task-specific modeling with appropriate regularization yields a representation that is both minimal and faithful. This provides a principled justification for why specialist models can achieve disentangled task representations where generalist models cannot, offering formal guarantees for the intuition.

Theoretically, our result suggests a new strategy for provably uncovering the latent variables underlying the observational world. Importantly, because we allow arbitrary disconnections between time steps, Theorem [2](https://arxiv.org/html/2605.12733#Thmtheorem2) also covers the i.i.d. setting. This is a substantially harder case than prior work that exploits temporal information or domain shifts, since changes across time or environments inherently provide extra signals for identification, whereas identifiability in the absence of changes is notoriously difficult. Existing i.i.d. results rely either on restrictive functional assumptions (Taleb and Jutten, [1999](https://arxiv.org/html/2605.12733#bib.bib199); Buchholz et al., [2022](https://arxiv.org/html/2605.12733#bib.bib83)) or graphical criteria on the underlying structure (Moran et al., [2021](https://arxiv.org/html/2605.12733#bib.bib134); Zheng et al., [2022](https://arxiv.org/html/2605.12733#bib.bib111)) to achieve full component-wise identifiability. By contrast, our focus is not on recovering every individual latent, but rather on identifying all task-relevant ones as a subgroup. This relaxation allows us to bypass such strong assumptions and still establish general identifiability guarantees. Beyond our setting, this perspective may prove methodologically useful for a wide range of latent-variable problems where isolating task-relevant latents is important.

![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/accuracy_T.png)
![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/mcc_T.png)
![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/accuracy_M.png)
![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/mcc_M.png)

Figure 3: Temporal task structure identification. Top: varying the number of time steps $T$ with $T/5$ tasks. Bottom: varying the number of tasks $M$ with 20 time steps. Left: accuracy. Right: MCC.

### 5 Experiments

In this section, we present comprehensive empirical results supporting our theory on both temporal task structure learning and task-relevant representation learning across diverse settings. Due to page limits, some setup details are deferred to Appendix [C](https://arxiv.org/html/2605.12733#A3), and many additional empirical results are included in Appendix [D](https://arxiv.org/html/2605.12733#A4).

##### Identifiability of Temporal Task Structure.

We evaluate whether the proposed algorithm can recover temporal task structures under challenging conditions, including disconnected temporal relations and arbitrarily interleaving tasks. Two setups are considered: (1) varying the number of time steps $T$ from 8 to 20 with $T/5$ tasks, and (2) varying the number of tasks $M$ from 2 to 10 with 20 time steps. To maximize structural complexity, we set the minimum segment length to 2 and randomly generate the task–time step dependencies. Additionally, 20% of the dependencies between consecutive segments are randomly removed. Each dataset contains 10,000 samples generated from linear Gaussian SCMs. All results are over 10 random runs with Fisher's z-test (Fisher, [1921](https://arxiv.org/html/2605.12733#bib.bib148)) at a p-value threshold of 0.05.
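For reference, the Fisher z-test of conditional independence via partial correlation, the test used in these experiments, can be sketched as follows; the data layout (samples by variables) and the use of a pseudo-inverse are illustrative choices, and the test is appropriate for the linear-Gaussian setting above.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_dependent(data, x, y, cond, alpha=0.05):
    """Return True iff columns x and y of `data` (samples x variables) are
    conditionally dependent given columns `cond`, by Fisher's z-test on
    the partial correlation."""
    idx = [x, y] + list(cond)
    prec = np.linalg.pinv(np.corrcoef(data[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    r = np.clip(r, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r))                 # Fisher z-transform
    stat = np.sqrt(data.shape[0] - len(cond) - 3) * abs(z)
    p_value = 2 * (1 - norm.cdf(stat))
    return p_value < alpha                              # reject independence
```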

For both settings, we report accuracy and the Matthews correlation coefficient (MCC) of the identified tasks. As baselines, we consider classical CCA (Anderson, [2003](https://arxiv.org/html/2605.12733#bib.bib146)) and Group Lasso (Yuan and Lin, [2006](https://arxiv.org/html/2605.12733#bib.bib145)), as well as the recent SelTask (Qiu et al., [2024](https://arxiv.org/html/2605.12733#bib.bib257)). Results are summarized in Fig. [3](https://arxiv.org/html/2605.12733#S4.F3). Our method dominates across all $T$ and $M$ in both accuracy and MCC. As expected, performance degrades as the problem becomes harder (larger $T$ or $M$), but the gap persists.

![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/violin_map.png)

Figure 4: Real-world temporal task structure discovery.

##### Real-World Structure.

To explore whether we can identify real-world structures between tasks and time steps, we further conduct experiments on the recent SportsHHI video dataset (Wu et al., [2024](https://arxiv.org/html/2605.12733#bib.bib266)). For each time frame in the video, the objective is to discover its corresponding task labels, which in this context correspond to the behaviors of the humans captured in the video. Because the dataset involves multiple individuals with complex and overlapping interactions, each frame typically contains multiple task labels, resulting in highly intricate task structures. This makes it a challenging and suitable testbed for stress-testing the identification.

Following common practice, we use a pretrained CLIP encoder (Radford et al., [2021](https://arxiv.org/html/2605.12733#bib.bib268)) to obtain visual embeddings $\mathbf{o}$ and task embeddings $\mathbf{g}$, and employ a variational autoencoder to estimate the latent state variables $\mathbf{s}$. The latent transition dynamics between consecutive states are parameterized by an MLP, while conditional mutual information (CMI) is used as a proxy for conditional independence to mitigate the curse of dimensionality in statistical testing. We compare our approach against two baselines: (i) applying Alg. [1](https://arxiv.org/html/2605.12733#alg1) directly to observed variables instead of latent ones (replacing $\mathbf{s}$ with $\mathbf{o}$), and (ii) LEAP (Yao et al., [2021](https://arxiv.org/html/2605.12733#bib.bib227)), a representative method for learning latent temporal representations with identifiability guarantees on the latent variables but *not* on the structure. Following prior work, we evaluate using mean average precision (mAP). The results in Fig. [4](https://arxiv.org/html/2605.12733#S5.F4) demonstrate that modeling the complexity of general temporal task structures is essential for accurate discovery in complex real-world scenarios. It is worth noting that we have also compared with more standard video models that do not target identifiability; those results are shown in Table [3](https://arxiv.org/html/2605.12733#A4.T3) in Appendix [D](https://arxiv.org/html/2605.12733#A4).

![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/r2_ours.png)

Figure 5: $R^2$ for relevant and irrelevant parts.

##### Identifiability of Task-Relevant Representation.

After establishing identifiability of the temporal task structure, we zoom in on a single step and evaluate recovery of the task-relevant latent representation conditioned on the corresponding tasks. The data-generating process follows the theoretical setup, with an MLP using Leaky ReLU as the nonlinear function. For every dataset, we randomly select $1/5$ of the dimensions as task-relevant latent variables. For estimation, we employ a VAE with $\ell_1$ regularization on the task-latent structure. As the evaluation metric, we report the $R^2$ between the estimated and ground-truth latent components: higher values indicate accurate recovery of relevant parts, while lower values indicate effective separation from irrelevant parts. Figure [5](https://arxiv.org/html/2605.12733#S5.F5) shows a clear gap: (1) task-relevant representations are successfully disentangled from irrelevant ones (low $R^2$ for irrelevant parts), and (2) the estimated task-relevant part captures most of the information in the ground-truth one (high $R^2$ for relevant parts). These results validate the identifiability theory, confirming that task-relevant latent variables can be uncovered as a group, with both information preservation and irrelevance disentanglement, in practice.
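A sketch of this sparsity regularization in PyTorch, assuming an encoder and task head of illustrative sizes and eliding the VAE reconstruction and KL terms: an $\ell_1$ penalty on the Jacobian of the task head with respect to the estimated latents acts as a relaxed, differentiable surrogate for the support constraint $\|\mathcal{I}(J_{\hat{u}})\| \leq \|\mathcal{I}(J_u)\|$ in Theorem 2.

```python
import torch
import torch.nn as nn

d_o, d_s, d_g = 32, 16, 3   # observation, latent, and task dimensions (assumed)
encoder = nn.Sequential(nn.Linear(d_o, 64), nn.LeakyReLU(), nn.Linear(64, d_s))
task_head = nn.Sequential(nn.Linear(d_s, 64), nn.LeakyReLU(), nn.Linear(64, d_g))

def task_latent_sparsity(o_batch, lam=1e-2):
    """L1 penalty on the estimated task-latent Jacobian
    J_u-hat = d(task_head)/d(s_hat), averaged over the batch."""
    s_hat = encoder(o_batch)
    g_hat = task_head(s_hat)
    rows = []
    for i in range(d_g):    # one Jacobian row per task dimension
        grad_i, = torch.autograd.grad(g_hat[:, i].sum(), s_hat,
                                      create_graph=True, retain_graph=True)
        rows.append(grad_i)                   # each: (batch, d_s)
    jac = torch.stack(rows, dim=1)            # (batch, d_g, d_s)
    return lam * jac.abs().mean()

# Added to the usual VAE objective, e.g.:
# loss = recon_loss + kl_loss + task_loss + task_latent_sparsity(o_batch)
```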

##### Task-Relevant Latents in Realistic Vision.

We next investigate the recovery of task-relevant latent representations in realistic scenarios. Since ground-truth latents are usually unavailable in practice, direct comparison with the truth is infeasible. Evaluation therefore turns to human interpretability, where visualizing the identified latents provides key evidence. We construct a dataset of cat images using Flux, with tasks such as "wearing eyeglasses," "wearing a hat," and "wearing a tie," explicitly considering realistic images to align with real-world vision. For estimation, we adopt a GAN-based generator where each task is associated with a learned transformation of the latents. Concretely, given $\mathbf{s}$, a task-specific operator modifies only a sparse subset of coordinates via an $\ell_1$-regularized mask, producing masked latents that are then passed to the generator. Figure [6](https://arxiv.org/html/2605.12733#S5.F6) compares results with and without sparsity. With sparsity, the recovered latents correspond closely to the intended task attributes, while without sparsity, irrelevant factors such as color are entangled with the target tasks. This further supports task-relevant identifiability and the role of sparsity. Additional results on more scenarios are included in Appendix [D](https://arxiv.org/html/2605.12733#A4) (e.g., Figure [7](https://arxiv.org/html/2605.12733#A4.F7)), which further verify the claims.

With sparsity ![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/sparse/3_1.jpg) ![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/sparse/4_2.jpg) ![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/sparse/9_3.jpg)

Without sparsity ![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/nosparse/4_1.jpg) ![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/nosparse/53_2.jpg) ![Refer to caption](https://arxiv.org/html/2605.12733v1/figures/nosparse/13_3.jpg)

Figure 6: Qualitative comparison of identified task-relevant latents. Tasks include "wearing glasses," "wearing a hat," and "wearing a tie." With sparsity, the model isolates a minimal but sufficient subset of latents aligned with each task. Without sparsity, irrelevant factors such as color become entangled with task-relevant ones.
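A hypothetical sketch of such a task-specific operator, with a sigmoid gate standing in for the $\ell_1$-regularized mask; the gating form, names, and training details are illustrative assumptions, not the exact architecture used here.

```python
import torch
import torch.nn as nn

class TaskOperator(nn.Module):
    """Task-specific latent edit: shift an L1-sparse subset of latent
    coordinates, then feed the masked latents to the generator."""
    def __init__(self, d_s):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(d_s))       # which coords
        self.direction = nn.Parameter(0.01 * torch.randn(d_s))  # how to move

    def forward(self, s):
        gate = torch.sigmoid(self.mask_logits)  # soft mask in [0, 1]^{d_s}
        return s + gate * self.direction         # edit only gated coordinates

    def sparsity_penalty(self):                  # L1 pressure on the mask
        return torch.sigmoid(self.mask_logits).sum()

# Usage with a pretrained generator G (assumed):
# x_task = G(op(s)); loss = task_loss(x_task) + lam * op.sparsity_penalty()
```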

### 6 Conclusion and Discussion

In this paper, we initiated the theoretical investigation of learning task-relevant world representations, aiming to move from generalist to specialist. The main challenges lie in the level of generality, which requires handling both complex structures, such as disconnected sequences, interleaving tasks, and frequent switches, and general processes, including nonlinear functions, arbitrary distributions, and the absence of auxiliary information. While we have addressed these, several related questions remain open. First, although identifiability is defined asymptotically and frontier models are often trained on web-scale data, it is still important to understand the finite-sample regime, and the lack of related analysis is a limitation in data-sparse scenarios. Second, our present way of leveraging identifiability is relatively simple, essentially a standard estimator with sparsity regularization. While such simplicity and universality are often advantageous, it is also intriguing to consider identifiability-inspired architectures that depart more radically from existing patterns. A stronger focus on identifiability within the community may reveal barrier-breaking insights that have been overshadowed by the pursuit of purely empirical gains, and we aim to contribute toward this shift.

### Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

### References

- E. S. Allman, C. Matias, and J. A. Rhodes (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics 37(6A), pp. 3099.
- T. W. Anderson, H. Rubin, et al. (1956). Statistical inference in factor analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 5, pp. 111–150.
- T. Anderson (2003). An Introduction to Multivariate Statistical Analysis.
- M. Bagatella, J. Hübotter, G. Martius, and A. Krause (2025). Active fine-tuning of multi-task policies. In Forty-second International Conference on Machine Learning.
- M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm (2018). Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540.
- Y. Bengio, A. Courville, and P. Vincent (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8), pp. 1798–1828.
- J. Brehmer, P. De Haan, P. Lippe, and T. S. Cohen (2022). Weakly supervised causal representation learning. Advances in Neural Information Processing Systems 35, pp. 38319–38331.
- S. Buchholz, M. Besserve, and B. Schölkopf (2022). Function classes for identifiable nonlinear independent component analysis. arXiv preprint arXiv:2208.06406.
- P. Comon (1994). Independent component analysis, a new concept? Signal Processing 36(3), pp. 287–314.
- C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019). SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211.
- R. A. Fisher (1921). On the "probable error" of a coefficient of correlation deduced from a small sample. Metron 1, pp. 3–32.
- L. Gresele, J. von Kügelgen, V. Stimper, B. Schölkopf, and M. Besserve (2021). Independent mechanism analysis, a new concept? Advances in Neural Information Processing Systems.
- D. Ha and J. Schmidhuber (2018). World models. arXiv preprint arXiv:1803.10122.
- H. Hälvä, S. Le Corff, L. Lehéricy, J. So, Y. Zhu, E. Gassiat, and A. Hyvärinen (2021). Disentangling identifiable features from noisy data with structured nonlinear ICA. Advances in Neural Information Processing Systems 34.
- A. Hyvärinen, J. Karhunen, and E. Oja (2001). Independent Component Analysis. John Wiley & Sons, Inc.
- A. Hyvärinen and H. Morioka (2016). Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems 29, pp. 3765–3773.
- A. Hyvärinen and P. Pajunen (1999). Nonlinear independent component analysis: existence and uniqueness results. Neural Networks 12(3), pp. 429–439.
- A. Hyvärinen, H. Sasaki, and R. Turner (2019). Nonlinear ICA using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics, pp. 859–868.
- Y. Jiang and B. Aragam (2023). Learning nonparametric latent causal graphs with unknown interventions. Advances in Neural Information Processing Systems 36, pp. 60468–60513.
- J. Jin and V. Syrgkanis (2023). Learning causal representations from general environments: identifiability and intrinsic ambiguity. arXiv preprint arXiv:2311.12267.
- K. G. Jöreskog (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika 34(2), pp. 183–202.
- B. Kivva, G. Rajendran, P. Ravikumar, and B. Aragam (2022). Identifiability of deep generative models without auxiliary information. Advances in Neural Information Processing Systems 35, pp. 15687–15701.
- L. Kong, S. Xie, W. Yao, Y. Zheng, G. Chen, P. Stojanov, V. Akinwande, and K. Zhang (2022). Partial identifiability for domain adaptation. In International Conference on Machine Learning, pp. 11455–11472.
- J. B. Kruskal (1977). Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications 18(2), pp. 95–138.
- S. Lachapelle, P. R. López, Y. Sharma, K. Everett, R. L. Priol, A. Lacoste, and S. Lacoste-Julien (2022). Disentanglement via mechanism sparsity regularization: a new principle for nonlinear ICA. Conference on Causal Learning and Reasoning.
- Z. Li, R. Cai, G. Chen, B. Sun, Z. Hao, and K. Zhang (2023). Subspace identification for multi-source domain adaptation. Advances in Neural Information Processing Systems 36, pp. 34504–34518.
- Y. Liu, B. Huang, Z. Zhu, H. Tian, M. Gong, Y. Yu, and K. Zhang (2023). Learning world models with identifiable factorization. Advances in Neural Information Processing Systems 36, pp. 31831–31864.
- F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124.
- S. Molavipour, G. Bassi, and M. Skoglund (2020). Conditional mutual information neural estimator. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5025–5029.
- A. Mondal, A. Bhattacharjee, S. Mukherjee, H. Asnani, S. Kannan, and P. AP (2020). C-MI-GAN: estimation of conditional mutual information using minmax formulation. In Conference on Uncertainty in Artificial Intelligence, pp. 849–858.
- G. E. Moran, D. Sridhar, Y. Wang, and D. M. Blei (2021). Identifiable variational autoencoders via sparse decoding. arXiv preprint arXiv:2110.10804.
- A. v. d. Oord, Y. Li, and O. Vinyals (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc.
- Y. Qiu, Y. Zheng, and K. Zhang (2024). Identifying selections for unsupervised subtask discovery. Advances in Neural Information Processing Systems 37, pp. 11966–11996.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763.
- B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio (2021). Toward causal representation learning. Proceedings of the IEEE 109(5), pp. 612–634.
- A\. Shapiro \(1985\)Identifiability of factor analysis: some results and open problems\.Linear Algebra and its Applications70,pp\. 1–7\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1)\.
- N\. D\. Sidiropoulos and R\. Bro \(2000\)On the uniqueness of multilinear decomposition of n\-way arrays\.Journal of Chemometrics: A Journal of the Chemometrics Society14\(3\),pp\. 229–239\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1)\.
- A\. Sordoni, N\. Dziri, H\. Schulz, G\. Gordon, P\. Bachman, and R\. T\. Des Combes \(2021\)Decomposed mutual information estimation for contrastive representation learning\.InInternational conference on machine learning,pp\. 9859–9869\.Cited by:[Appendix C](https://arxiv.org/html/2605.12733#A3.p1.1)\.
- P\. Spirtes, C\. N\. Glymour, R\. Scheines, and D\. Heckerman \(2000\)Causation, prediction, and search\.MIT press\.Cited by:[Assumption 1](https://arxiv.org/html/2605.12733#Thmassumption1)\.
- A\. Taleb and C\. Jutten \(1999\)Source separation in post\-nonlinear mixtures\.IEEE Transactions on signal Processing47\(10\),pp\. 2807–2820\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1),[§4\.2](https://arxiv.org/html/2605.12733#S4.SS2.SSS0.Px2.p2.1)\.
- N\. Tishby and N\. Zaslavsky \(2015\)Deep learning and the information bottleneck principle\.In2015 IEEE Information Theory Workshop, ITW 2015,pp\. 7133169\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p1.1)\.
- Z\. Tong, Y\. Song, J\. Wang, and L\. Wang \(2022\)Videomae: masked autoencoders are data\-efficient learners for self\-supervised video pre\-training\.Advances in neural information processing systems35,pp\. 10078–10093\.Cited by:[Appendix D](https://arxiv.org/html/2605.12733#A4.SS0.SSS0.Px2.p1.1)\.
- B\. Varici, E\. Acartürk, K\. Shanmugam, A\. Kumar, and A\. Tajer \(2025\)Score\-based causal representation learning: linear and general transformations\.Journal of Machine Learning Research26\(112\),pp\. 1–90\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1)\.
- J\. von Kügelgen, M\. Besserve, L\. Wendong, L\. Gresele, A\. Kekić, E\. Bareinboim, D\. Blei, and B\. Schölkopf \(2023\)Nonparametric identifiability of causal representations from unknown interventions\.Advances in Neural Information Processing Systems36,pp\. 48603–48638\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1)\.
- J\. von Kügelgen, Y\. Sharma, L\. Gresele, W\. Brendel, B\. Schölkopf, M\. Besserve, and F\. Locatello \(2021\)Self\-supervised learning with data augmentations provably isolates content from style\.arXiv preprint arXiv:2106\.04619\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1),[§1](https://arxiv.org/html/2605.12733#S1.p4.1)\.
- L\. Wong, K\. M\. Collins, L\. Ying, C\. E\. Zhang, A\. Weller, T\. Gerstenberg, T\. O’Donnell, A\. K\. Lew, J\. D\. Andreas, J\. B\. Tenenbaum,et al\.\(2025\)Modeling open\-world cognition as on\-demand synthesis of probabilistic models\.arXiv preprint arXiv:2507\.12547\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p1.1)\.
- T\. Wu, R\. He, G\. Wu, and L\. Wang \(2024\)SportsHHI: a dataset for human\-human interaction detection in sports videos\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 18537–18546\.Cited by:[§5](https://arxiv.org/html/2605.12733#S5.SS0.SSS0.Px2.p1.1)\.
- W\. Yao, Y\. Sun, A\. Ho, C\. Sun, and K\. Zhang \(2021\)Learning temporally causal latent processes from general temporal data\.arXiv preprint arXiv:2110\.05428\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1),[§5](https://arxiv.org/html/2605.12733#S5.SS0.SSS0.Px2.p2.5)\.
- T\. Yu, D\. Quillen, Z\. He, R\. Julian, K\. Hausman, C\. Finn, and S\. Levine \(2020\)Meta\-world: a benchmark and evaluation for multi\-task and meta reinforcement learning\.InConference on robot learning,pp\. 1094–1100\.Cited by:[Appendix C](https://arxiv.org/html/2605.12733#A3.SS0.SSS0.Px3.p1.9)\.
- M\. Yuan and Y\. Lin \(2006\)Model selection and estimation in regression with grouped variables\.Journal of the Royal Statistical Society Series B: Statistical Methodology68\(1\),pp\. 49–67\.Cited by:[§5](https://arxiv.org/html/2605.12733#S5.SS0.SSS0.Px1.p2.4)\.
- K\. Zhang, S\. Xie, I\. Ng, and Y\. Zheng \(2024\)Causal representation learning from multiple distributions: a general setting\.InInternational Conference on Machine Learning,pp\. 60057–60075\.Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1)\.
- T\. Zhang, G\. Chen, and F\. Chen \(2025\)When do neural networks learn world models?\.InForty\-second International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.12733#S1.p3.1)\.
- Y\. Zheng, I\. Ng, and K\. Zhang \(2022\)On the identifiability of nonlinear ICA: sparsity and beyond\.InAdvances in Neural Information Processing Systems,Cited by:[1st item](https://arxiv.org/html/2605.12733#A2.I2.i1.p1.1),[2nd item](https://arxiv.org/html/2605.12733#A2.I2.i2.p1.1),[3rd item](https://arxiv.org/html/2605.12733#A2.I2.i3.p1.1),[§B\.1](https://arxiv.org/html/2605.12733#A2.SS1.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.12733#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.12733#S4.SS1.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2605.12733#S4.SS2.SSS0.Px2.p2.1)\.

## Appendices

### Appendix A Proofs

#### A.1 Notation

We first provide a summary of notation in Table [1](https://arxiv.org/html/2605.12733#A1.T1).

Table 1: Summary of notation.

#### A.2 Proof of Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1)

See Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1).

##### Notation and Blocking Rules.

A path is a sequence of distinct nodes $(v_0,\dots,v_r)$ with each consecutive pair adjacent. Along a path, a node is a collider if both incident path edges have arrowheads into the node, and a non-collider otherwise. A path is blocked by a conditioning set $\mathbf{Z}$ if it contains a non-collider in $\mathbf{Z}$ or a collider that is neither in $\mathbf{Z}$ nor has a descendant in $\mathbf{Z}$ (i.e., d-separation (Pearl, [1988](https://arxiv.org/html/2605.12733#bib.bib142))). In our graph, tasks have only incoming edges, and tasks have no descendants.

##### Band Conditioning Set.

Throughout this subsection the conditioning set is

$$\mathbf{Z}_{\mathrm{band}}(k,v,i)=\big(\{\mathbf{s}_{kL-1},\mathbf{s}_{kL+1},\mathbf{s}_{vL-1},\mathbf{s}_{vL+1}\}\cap\{\mathbf{s}_1,\dots,\mathbf{s}_T\}\big)\cup\{\mathbf{g}_i\},\tag{4}$$

with out-of-range indices omitted. Thus only the two immediate inner neighbors $\mathbf{s}_{kL+1},\mathbf{s}_{vL-1}$ and the two immediate outer neighbors $\mathbf{s}_{kL-1},\mathbf{s}_{vL+1}$ (when they exist) are conditioned; among tasks, only $\mathbf{g}_i$ is conditioned.
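
To make the graph and the band conditioning concrete, the following is a minimal, self-contained sketch assuming only the Python standard library; the chain structure, the segment length $L=2$, and the interleaved task assignment below are illustrative, not taken from the paper's experiments. It builds the state–action–task graph described above, forms $\mathbf{Z}_{\mathrm{band}}(k,v,i)$, and checks d-connection with a standard reachability (Bayes-ball) routine implementing the blocking rules.

```python
# Illustrative sketch: d-separation over the state/action/task graph.
from collections import defaultdict, deque

def d_connected(edges, x, y, z):
    """Return True iff x and y are d-connected given the set z."""
    parents, children = defaultdict(set), defaultdict(set)
    for a, b in edges:                       # directed edge a -> b
        children[a].add(b)
        parents[b].add(a)
    # Nodes in z or with a descendant in z (these open colliders).
    opens, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in opens:
            opens.add(n)
            stack.extend(parents[n])
    # BFS over (node, direction): 'up' = arrived from a child,
    # 'down' = arrived from a parent.
    seen, queue = set(), deque([(x, "up")])
    while queue:
        n, d = queue.popleft()
        if (n, d) in seen:
            continue
        seen.add((n, d))
        if n == y:
            return True
        if d == "up" and n not in z:         # non-collider: continue anywhere
            queue.extend((p, "up") for p in parents[n])
            queue.extend((c, "down") for c in children[n])
        elif d == "down":
            if n not in z:                   # chain through a non-collider
                queue.extend((c, "down") for c in children[n])
            if n in opens:                   # collider opened by conditioning
                queue.extend((p, "up") for p in parents[n])
    return False

# Graph: s_t -> s_{t+1}, s_t -> a_t, a_t -> s_{t+1}, and a_t -> g_i
# whenever task g_i is active at step t (illustrative assignment).
T, L = 8, 2
active = {"g1": {1, 2, 5, 6}, "g2": {3, 4, 7}}
edges = []
for t in range(1, T):
    edges += [(f"s{t}", f"s{t+1}"), (f"s{t}", f"a{t}"), (f"a{t}", f"s{t+1}")]
for g, steps in active.items():
    edges += [(f"a{t}", g) for t in steps]

def z_band(k, v, i):
    idx = [k * L - 1, k * L + 1, v * L - 1, v * L + 1]
    return {f"s{t}" for t in idx if 1 <= t <= T} | {f"g{i}"}

# g1 is relevant to segments 1 and 3, so their endpoint states stay dependent:
print(d_connected(edges, "s2", "s6", z_band(1, 3, 1)))   # True
# g2 is not relevant to segment 1, so the same endpoints are d-separated:
print(d_connected(edges, "s2", "s6", z_band(1, 3, 2)))   # False
```

Consistent with the lemmas below, the endpoint states of two segments remain dependent under the band conditioning exactly when the conditioned task is relevant to both segments.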

###### Lemma 1 (A d-connecting path uses exactly one task, equal to $\mathbf{g}_i$).

Every path from $\mathbf{s}_{kL}$ to $\mathbf{s}_{vL}$ that is d-connecting given $\mathbf{Z}_{\mathrm{band}}(k,v,i)$ contains exactly one task node, and that task is $\mathbf{g}_i$.

###### Proof.

Consider any path with no task nodes. Such a path alternates among states and actions and moves in time via $\mathbf{s}_t\to\mathbf{s}_{t+1}$, $\mathbf{s}_t\to\mathbf{a}_t$, or $\mathbf{a}_t\to\mathbf{s}_{t+1}$. Any forward traversal from $\mathbf{s}_{kL}$ toward $\mathbf{s}_{vL}$ must pass through the cut state $\mathbf{s}_{kL+1}$; symmetrically, any approach into $\mathbf{s}_{vL}$ from the left must pass through $\mathbf{s}_{vL-1}$. All these cut states are in $\mathbf{Z}_{\mathrm{band}}$ and are non-colliders on such chain paths, so the path is blocked. Hence any d-connecting path must include at least one task.

If a path contains a task $\mathbf{g}_j\neq\mathbf{g}_i$, then both path edges incident to $\mathbf{g}_j$ point into it, so $\mathbf{g}_j$ is a collider. Since $\mathbf{g}_j\notin\mathbf{Z}_{\mathrm{band}}$ and tasks have no descendants, this collider is closed and the path is blocked. Therefore no d-connecting path can contain any task other than $\mathbf{g}_i$.

If a path contains two or more tasks, at least one of them is not $\mathbf{g}_i$, which blocks the path by the previous argument. Thus every d-connecting path contains exactly one task, and that task is $\mathbf{g}_i$. ∎

###### Lemma 2 (Local structure of d-connecting paths).

Under the graph and conditioning in Equation [4](https://arxiv.org/html/2605.12733#A1.E4), every d-connecting path between $\mathbf{s}_{kL}$ and $\mathbf{s}_{vL}$ has one of the four forms

$$\begin{aligned}
&\text{(I)}\ \mathbf{s}_{kL}\to\mathbf{a}_{kL}\to\mathbf{g}_i\leftarrow\mathbf{a}_{vL}\leftarrow\mathbf{s}_{vL}, &&\text{(II)}\ \mathbf{s}_{kL}\to\mathbf{a}_{kL}\to\mathbf{g}_i\leftarrow\mathbf{a}_{vL-1}\to\mathbf{s}_{vL},\\
&\text{(III)}\ \mathbf{s}_{kL}\leftarrow\mathbf{a}_{kL-1}\to\mathbf{g}_i\leftarrow\mathbf{a}_{vL}\leftarrow\mathbf{s}_{vL}, &&\text{(IV)}\ \mathbf{s}_{kL}\leftarrow\mathbf{a}_{kL-1}\to\mathbf{g}_i\leftarrow\mathbf{a}_{vL-1}\to\mathbf{s}_{vL},
\end{aligned}\tag{5}$$

with out-of-range indices omitted.

###### Proof.

By Lemma [1](https://arxiv.org/html/2605.12733#Thmlemma1), any d-connecting path contains exactly one task node, namely $\mathbf{g}_i$.

*Left boundary.* The first neighbor of $\mathbf{s}_{kL}$ on any d-connecting path cannot be a state, because the only state neighbors are $\mathbf{s}_{kL-1}$ and $\mathbf{s}_{kL+1}$, both in $\mathbf{Z}_{\mathrm{band}}$ and both non-colliders on chain moves, which would block the path. Hence the neighbor must be an adjacent action, $\mathbf{a}_{kL-1}$ (if $kL>1$) or $\mathbf{a}_{kL}$. From that action, any continuation to a state would encounter one of the conditioned cut states as a non-collider, so the next node must be $\mathbf{g}_i$ via an edge $\mathbf{a}\to\mathbf{g}_i$. This yields the two left fragments $\mathbf{s}_{kL}\leftarrow\mathbf{a}_{kL-1}\to\mathbf{g}_i$ and $\mathbf{s}_{kL}\to\mathbf{a}_{kL}\to\mathbf{g}_i$.

*Right boundary.* Symmetrically, the predecessor of $\mathbf{s}_{vL}$ on the path cannot be a state, since the only state neighbors are $\mathbf{s}_{vL-1}$ and $\mathbf{s}_{vL+1}$, which are in $\mathbf{Z}_{\mathrm{band}}$ and would block as non-colliders on the potential additional paths. Specifically, $\mathbf{s}_{vL-1}$ is a non-collider on paths through $\mathbf{s}_{vL-1}\to\mathbf{s}_{vL}$, and $\mathbf{s}_{vL+1}$ is a non-collider on paths through $\mathbf{s}_{vL+1}\to\mathbf{a}_{vL+1}$ or $\mathbf{s}_{vL+1}\to\mathbf{s}_{vL+2}$. Thus the predecessor must be $\mathbf{a}_{vL-1}$ or $\mathbf{a}_{vL}$, linked to $\mathbf{g}_i$ by an edge $\mathbf{a}\to\mathbf{g}_i$ traversed in reverse on the path. This yields the two right fragments $\mathbf{g}_i\leftarrow\mathbf{a}_{vL-1}\to\mathbf{s}_{vL}$ and $\mathbf{g}_i\leftarrow\mathbf{a}_{vL}\leftarrow\mathbf{s}_{vL}$.

Combining the two left fragments with the two right fragments gives exactly the four forms in Equation [5](https://arxiv.org/html/2605.12733#A1.E5). On each such path, $\mathbf{g}_i$ is the unique collider and is in $\mathbf{Z}_{\mathrm{band}}$, while all other nodes are non-colliders that are not in $\mathbf{Z}_{\mathrm{band}}$, so these paths are d-connecting. ∎

Now we are ready to prove the theorem.

See Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1).

###### Proof.

($\Rightarrow$) Suppose $\mathbf{s}_{kL}$ and $\mathbf{s}_{vL}$ are conditionally dependent given $\mathbf{Z}_{\mathrm{band}}(k,v,i)$. By Lemma [2](https://arxiv.org/html/2605.12733#Thmlemma2), there exists a d-connecting path of one of the four forms in Equation [5](https://arxiv.org/html/2605.12733#A1.E5). In each form, the actions adjacent to $\mathbf{s}_{kL}$ and $\mathbf{s}_{vL}$ that appear on the path are parents of $\mathbf{g}_i$. Hence $\mathbf{g}_i$ is relevant to segments $\mathbf{S}_k$ and $\mathbf{S}_v$.

($\Leftarrow$) Conversely, suppose both intersections are nonempty. Choose $p\in\{kL-1,kL\}$ and $q\in\{vL-1,vL\}$ such that $\mathbf{a}_p\to\mathbf{g}_i$ and $\mathbf{a}_q\to\mathbf{g}_i$. Then one of the four forms in Equation [5](https://arxiv.org/html/2605.12733#A1.E5) exists. Along that path, $\mathbf{g}_i$ is the unique collider and is conditioned, while all other nodes are non-colliders not in $\mathbf{Z}_{\mathrm{band}}(k,v,i)$. Therefore the path is not blocked and

$$\mathbf{s}_{kL}\not\perp\!\!\!\perp\mathbf{s}_{vL}\;\big|\;\mathbf{Z}_{\mathrm{band}}(k,v,i).\tag{6}$$

This proves the equivalence stated in Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1). ∎

#### A.3 Proof of Corollary [1](https://arxiv.org/html/2605.12733#Thmcorollary1)

See Corollary [1](https://arxiv.org/html/2605.12733#Thmcorollary1).

###### Proof.

Fix $k<v$, pick any $j\in\{(k-1)L+1,\dots,kL\}$ and $q\in\{(v-1)L+1,\dots,vL\}$, and define the local band set

$$\mathbf{Z}_{\mathrm{loc}}(j,q,i)=\big(\{\mathbf{s}_{j-1},\mathbf{s}_{j+1},\mathbf{s}_{q-1},\mathbf{s}_{q+1}\}\cap\{\mathbf{s}_1,\dots,\mathbf{s}_T\}\big)\cup\{\mathbf{g}_i\}.\tag{7}$$

By the same blocking argument as in Lemma [1](https://arxiv.org/html/2605.12733#Thmlemma1), any d-connecting path between $\mathbf{s}_j$ and $\mathbf{s}_q$ given $\mathbf{Z}_{\mathrm{loc}}(j,q,i)$ must contain exactly one task node, and it must be $\mathbf{g}_i$. The neighbor of $\mathbf{s}_j$ on any such path cannot be a state, since $\mathbf{s}_{j-1}$ and $\mathbf{s}_{j+1}$ are in $\mathbf{Z}_{\mathrm{loc}}$ and are non-colliders on chain moves, hence they would block. Therefore the path must leave $\mathbf{s}_j$ through an adjacent action $\mathbf{a}_{j-1}$ or $\mathbf{a}_j$, and from there enter $\mathbf{g}_i$ via an edge $\mathbf{a}\to\mathbf{g}_i$. A symmetric argument holds at the right end near $\mathbf{s}_q$. Consequently, every d-connecting path between $\mathbf{s}_j$ and $\mathbf{s}_q$ given $\mathbf{Z}_{\mathrm{loc}}(j,q,i)$ has one of the four forms

$$\begin{aligned}
&\text{(I)}\ \mathbf{s}_j\to\mathbf{a}_j\to\mathbf{g}_i\leftarrow\mathbf{a}_q\leftarrow\mathbf{s}_q, &&\text{(II)}\ \mathbf{s}_j\to\mathbf{a}_j\to\mathbf{g}_i\leftarrow\mathbf{a}_{q-1}\to\mathbf{s}_q,\\
&\text{(III)}\ \mathbf{s}_j\leftarrow\mathbf{a}_{j-1}\to\mathbf{g}_i\leftarrow\mathbf{a}_q\leftarrow\mathbf{s}_q, &&\text{(IV)}\ \mathbf{s}_j\leftarrow\mathbf{a}_{j-1}\to\mathbf{g}_i\leftarrow\mathbf{a}_{q-1}\to\mathbf{s}_q,
\end{aligned}\tag{8}$$

with out-of-range indices omitted. On each such path, $\mathbf{g}_i$ is the unique collider and is conditioned, while all other nodes are non-colliders that are not conditioned, so the path is d-connecting.

Note that states inside the same segment share the same task set, and task nodes have only incoming edges from actions. It follows that if $\mathbf{g}_i$ is relevant to $\mathbf{S}_k$ and $\mathbf{S}_v$, then there exist $p\in\{j-1,j\}$ and $r\in\{q-1,q\}$ such that $\mathbf{a}_p\to\mathbf{g}_i$ and $\mathbf{a}_r\to\mathbf{g}_i$.

We then prove the equivalence as follows.

($\Rightarrow$) If $\mathbf{g}_i$ is relevant to $\mathbf{S}_k$ and $\mathbf{S}_v$, pick $p\in\{j-1,j\}$ and $r\in\{q-1,q\}$ with $\mathbf{a}_p\to\mathbf{g}_i$ and $\mathbf{a}_r\to\mathbf{g}_i$ as above. Then one of the four local forms in Equation [8](https://arxiv.org/html/2605.12733#A1.E8) exists and is d-connecting given $\mathbf{Z}_{\mathrm{loc}}(j,q,i)$, hence

$$\mathbf{s}_j\not\perp\!\!\!\perp\mathbf{s}_q\;\big|\;\mathbf{Z}_{\mathrm{loc}}(j,q,i).\tag{9}$$

($\Leftarrow$) Conversely, if $\mathbf{s}_j$ and $\mathbf{s}_q$ are conditionally dependent given $\mathbf{Z}_{\mathrm{loc}}(j,q,i)$, then by Equation [8](https://arxiv.org/html/2605.12733#A1.E8) the actions adjacent to $\mathbf{s}_j$ and $\mathbf{s}_q$ that lie on a d-connecting path are parents of $\mathbf{g}_i$. Together with segment homogeneity, $\mathbf{g}_i$ is relevant to $\mathbf{S}_k$ and $\mathbf{S}_v$.

This proves the stated equivalence for arbitrary $j\in\mathbf{S}_k$ and $q\in\mathbf{S}_v$ with $L>2$. ∎

#### A.4 Proof of Proposition [1](https://arxiv.org/html/2605.12733#Thmproposition1)

See Proposition [1](https://arxiv.org/html/2605.12733#Thmproposition1).

###### Proof.

Fix a task $\mathbf{g}_i$ and a segment $\mathbf{S}_k$. If $\mathbf{S}_k$ truly contains $\mathbf{g}_i$, then for any $\mathbf{S}_v$ with $v\neq k$ that also contains $\mathbf{g}_i$, Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1) gives

$$\mathbf{s}_{kL}\not\perp\!\!\!\perp\mathbf{s}_{vL}\;\big|\;\mathbf{Z}_{\mathrm{band}}(k,v,i).\tag{10}$$

The oracle CI test returns dependence, so Algorithm [1](https://arxiv.org/html/2605.12733#alg1) adds $\mathbf{g}_i$ to both $\mathcal{T}(\mathbf{S}_k)$ and $\mathcal{T}(\mathbf{S}_v)$.

Conversely, if $\mathbf{S}_k$ does not contain $\mathbf{g}_i$, then Theorem [1](https://arxiv.org/html/2605.12733#Thmtheorem1) implies conditional independence for all pairs involving $\mathbf{S}_k$, hence the algorithm never adds $\mathbf{g}_i$ to $\mathcal{T}(\mathbf{S}_k)$. Therefore the recovered segment–task incidence is exact, and per-step labels are correct by assignment. ∎
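
For completeness, the following is a minimal sketch of the loop that this proposition analyzes, assuming an oracle CI test; `ci_dependent` is a hypothetical callable standing in for Algorithm 1's test, and the d-separation routine sketched in Section A.2 can serve as that oracle on a known graph.

```python
# Illustrative sketch: recover the segment-task incidence with an oracle CI test.
def recover_incidence(n_segments, n_tasks, ci_dependent):
    """Return the estimated incidence {segment index: set of task indices}.

    ci_dependent(k, v, i) must return True iff s_{kL} and s_{vL} are
    dependent given Z_band(k, v, i).
    """
    tasks = {k: set() for k in range(1, n_segments + 1)}
    for i in range(1, n_tasks + 1):
        for k in range(1, n_segments + 1):
            for v in range(k + 1, n_segments + 1):
                if ci_dependent(k, v, i):   # oracle reports dependence
                    tasks[k].add(i)         # add g_i to T(S_k) and T(S_v)
                    tasks[v].add(i)
    return tasks

# Usage with the illustrative graph from A.2 (T = 8, L = 2, tasks g1, g2):
# oracle = lambda k, v, i: d_connected(
#     edges, f"s{k*2}", f"s{v*2}", z_band(k, v, i))
# recover_incidence(4, 2, oracle)  # -> {1: {1}, 2: {2}, 3: {1}, 4: {2}}
```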

#### A.5 Proof of Proposition [2](https://arxiv.org/html/2605.12733#Thmproposition2)

See Proposition [2](https://arxiv.org/html/2605.12733#Thmproposition2).

###### Proof.

Since $\mathbf{o}_t=f_t(\mathbf{s}_t)$ and $\mathbf{o}_t=\hat{f}_t(\hat{\mathbf{s}}_t)$ are observationally equivalent, there exists an invertible mapping $\phi$ such that

$$\hat{\mathbf{s}}_t=\hat{f}_t^{-1}\circ f_t(\mathbf{s}_t)=\phi(\mathbf{s}_t),\tag{11}$$

with inverse $\phi^{-1}$. By the chain rule,

$$J_{\hat{u}}=J_u\,J_{\phi^{-1}}.\tag{12}$$

Fix $i\in[d_g]$. Consider a set $\mathcal{N}_i$ of $\|\mathcal{I}((J_u)_{i,\cdot})\|$ distinct points and the corresponding Jacobian row vectors

$$\left(\frac{\partial u_i}{\partial\mathbf{s}_{t,1}},\ldots,\frac{\partial u_i}{\partial\mathbf{s}_{t,d_s}}\right)\bigg|_{\mathbf{s}_t=\mathbf{s}_t^{(l)}},\quad l\in\mathcal{N}_i,\tag{13}$$

which are linearly independent by assumption.

Now construct a matrix $M$ with $\mathcal{I}(M)=\mathcal{I}(J_{\phi^{-1}}(\hat{\mathbf{s}}))$. By the index-set inclusion assumption, for each $l\in\mathcal{N}_i$ we have

$$\mathcal{I}\big((J_u(\mathbf{s}_t^{(l)})M)_{i,\cdot}\big)\subseteq\mathcal{I}\big((J_{\hat{u}}(\hat{\mathbf{s}}_t^{(l)}))_{i,\cdot}\big).\tag{14}$$

Thus,

$$(J_u(\mathbf{s}_t^{(l)}))_{i,\cdot}\,M\in\operatorname{span}\{e_j:j\in\mathcal{I}((J_{\hat{u}})_{i,\cdot})\}.\tag{15}$$

Taking linear combinations across $l\in\mathcal{N}_i$ preserves this property, so in particular,

$$M_{j,\cdot}\in\operatorname{span}\{e_k:k\in\mathcal{I}((J_{\hat{u}})_{i,\cdot})\},\quad\forall j\in\mathcal{I}((J_u)_{i,\cdot}).\tag{16}$$

Since $J_{\phi^{-1}}(\hat{\mathbf{s}}_t)$ is invertible, there exists a permutation $\pi$ such that

$$(J_{\phi^{-1}}(\hat{\mathbf{s}}_t))_{j,\pi(j)}\neq 0,\quad\forall j\in\{1,\dots,d_s\}.\tag{17}$$

Because $\mathcal{I}(M)=\mathcal{I}(J_{\phi^{-1}}(\hat{\mathbf{s}}_t))$, we obtain

$$M_{j,\pi(j)}\neq 0,\quad\forall j\in\mathcal{I}((J_u)_{i,\cdot}).\tag{18}$$

Combining this with Equation [16](https://arxiv.org/html/2605.12733#A1.E16), we conclude

$$\pi(j)\in\mathcal{I}((J_{\hat{u}})_{i,\cdot}),\quad\forall j\in\mathcal{I}((J_u)_{i,\cdot}).\tag{19}$$

Equation [19](https://arxiv.org/html/2605.12733#A1.E19) shows that each ground-truth relevant index $j\in\mathcal{I}((J_u)_{i,\cdot})$ is mapped to a distinct estimated relevant index $\pi(j)\in\mathcal{I}((J_{\hat{u}})_{i,\cdot})$. Therefore, the estimated index set must contain at least as many elements as the ground-truth one:

$$\|\mathcal{I}((J_{\hat{u}})_{i,\cdot})\|\;\geq\;\|\mathcal{I}((J_u)_{i,\cdot})\|.\tag{20}$$

This completes the proof. ∎
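
As a sanity check on this counting argument, the following numpy sketch (with illustrative dimensions and supports) shows that composing $J_u$ with a generic invertible $J_{\phi^{-1}}$ can only enlarge row supports, whereas a coordinate-wise permutation and scaling preserves them; the latter is exactly the regime that the sparsity regularization selects.

```python
# Illustrative check of the support inequality in Equation (20).
import numpy as np

rng = np.random.default_rng(0)
d_s, d_g = 6, 3
# Ground-truth Jacobian with a known sparse support (rows = task variables).
J_u = np.zeros((d_g, d_s))
J_u[0, [0, 1]] = rng.normal(size=2)      # u_1 depends on s_1, s_2
J_u[1, [2, 3]] = rng.normal(size=2)      # u_2 depends on s_3, s_4
J_u[2, [4]] = rng.normal(size=1)         # u_3 depends on s_5

def support_sizes(J, tol=1e-9):
    return (np.abs(J) > tol).sum(axis=1)

# A generic invertible mixing: row supports can only grow.
J_mix = rng.normal(size=(d_s, d_s))
print(support_sizes(J_u), support_sizes(J_u @ J_mix))   # e.g. [2 2 1] -> [6 6 6]

# A permutation plus scaling: the minimal-sparsity case, supports preserved.
perm = rng.permutation(d_s)
J_ps = np.zeros((d_s, d_s))
J_ps[np.arange(d_s), perm] = rng.normal(size=d_s)
print(support_sizes(J_u @ J_ps))                        # [2 2 1] again
```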

#### A.6 Proof of Theorem [2](https://arxiv.org/html/2605.12733#Thmtheorem2)

See Theorem [2](https://arxiv.org/html/2605.12733#Thmtheorem2).

###### Proof.

The first part of the proof follows a strategy similar to that of Proposition [2](https://arxiv.org/html/2605.12733#Thmproposition2); we provide the full details for completeness. Since $\mathbf{o}_t=f_t(\mathbf{s}_t)$ and $\mathbf{o}_t=\hat{f}_t(\hat{\mathbf{s}}_t)$ are observationally equivalent, there exists an invertible mapping $\phi$ such that

$$\hat{\mathbf{s}}_t=\hat{f}_t^{-1}\circ f_t(\mathbf{s}_t)=\phi(\mathbf{s}_t),\tag{21}$$

with inverse $\phi^{-1}$. By the chain rule,

$$J_{\hat{u}}=J_u\,J_{\phi^{-1}}.\tag{22}$$

For each $i\in[d_g]$, consider a set $\mathcal{N}_i$ of $\|\mathcal{I}((J_u)_{i,\cdot})\|$ distinct points and the corresponding Jacobians

$$\left(\frac{\partial u_i}{\partial\mathbf{s}_{t,1}},\frac{\partial u_i}{\partial\mathbf{s}_{t,2}},\ldots,\frac{\partial u_i}{\partial\mathbf{s}_{t,d_s}}\right)\bigg|_{\mathbf{s}=\mathbf{s}_t^{(l)}},\quad l\in\mathcal{N}_i.\tag{23}$$

By assumption, the vectors in Equation [23](https://arxiv.org/html/2605.12733#A1.E23) are linearly independent.

We now construct a matrix $M$. Because the row vectors in Equation [23](https://arxiv.org/html/2605.12733#A1.E23) are linearly independent, any row $M_{j,\cdot}$ with $j\in\mathcal{I}((J_u)_{i,\cdot})$ can be expressed as a linear combination of them; that is, there exist coefficients $\{\beta_l\}_{l\in\mathcal{N}_i}$ such that

$$M_{j,\cdot}=\sum_{l\in\mathcal{N}_i}\beta_l\,(J_u(\mathbf{s}_t^{(l)}))_{i,\cdot}\,M.\tag{24}$$

We require $M$ to satisfy two conditions: (i) for each $i\in[d_g]$, the linear combination in Equation [24](https://arxiv.org/html/2605.12733#A1.E24) must lie in the span of the canonical basis vectors indexed by $\mathcal{I}((J_{\hat{u}})_{i,\cdot})$, i.e.,

$$\sum_{l\in\mathcal{N}_i}\beta_l\,(J_u(\mathbf{s}_t^{(l)}))_{i,\cdot}\,M\in\operatorname{span}\{e_j:j\in\mathcal{I}((J_{\hat{u}})_{i,\cdot})\},\tag{25}$$

and (ii) its index set matches that of $J_{\phi^{-1}}(\hat{\mathbf{s}}_t)$:

$$\mathcal{I}(M)=\mathcal{I}(J_{\phi^{-1}}(\hat{\mathbf{s}}_t)).\tag{26}$$

By the index-set inclusion assumption, for all $l\in\mathcal{N}_i$ we have

$$\mathcal{I}\big((J_u(\mathbf{s}_t^{(l)})\,M)_{i,\cdot}\big)\subseteq\mathcal{I}\big((J_{\hat{u}}(\hat{\mathbf{s}}_t^{(l)}))_{i,\cdot}\big).\tag{27}$$

This guarantees

$$(J_u(\mathbf{s}_t^{(l)}))_{i,\cdot}\,M\in\operatorname{span}\{e_j:j\in\mathcal{I}((J_{\hat{u}})_{i,\cdot})\}.\tag{28}$$

Taking linear combinations with the coefficients $\{\beta_l\}$, we conclude

$$\sum_{l\in\mathcal{N}_i}\beta_l\,(J_u(\mathbf{s}_t^{(l)}))_{i,\cdot}\,M\in\operatorname{span}\{e_j:j\in\mathcal{I}((J_{\hat{u}})_{i,\cdot})\}.\tag{29}$$
Equivalently, for every $j\in\mathcal{I}((J_u)_{i,\cdot})$,

$$M_{j,\cdot}\in\operatorname{span}\{e_k:k\in\mathcal{I}((J_{\hat{u}})_{i,\cdot})\}.\tag{30}$$

Since $J_{\phi^{-1}}(\hat{\mathbf{s}}_t)$ is invertible, its determinant is nonzero. Expanding the determinant as a sum over permutations, there must exist a permutation $\pi$ such that

$$(J_{\phi^{-1}}(\hat{\mathbf{s}}_t))_{j,\pi(j)}\neq 0,\quad\forall j\in\{1,\dots,d_s\}.\tag{31}$$

This establishes a one-to-one correspondence between the indices of $\mathbf{s}_t$ and $\hat{\mathbf{s}}_t$ through $\pi$.

In particular, for every $j\in\mathcal{I}((J_u)_{i,\cdot})$, we have

$$(J_{\phi^{-1}}(\hat{\mathbf{s}}_t))_{j,\pi(j)}\neq 0.\tag{32}$$

Because $\mathcal{I}(M)=\mathcal{I}(J_{\phi^{-1}}(\hat{\mathbf{s}}_t))$, this implies

$$M_{j,\pi(j)}\neq 0,\quad\forall j\in\mathcal{I}((J_u)_{i,\cdot}).\tag{33}$$

Combining this with Equation [30](https://arxiv.org/html/2605.12733#A1.E30), it follows that

$$\pi(j)\in\mathcal{I}((J_{\hat{u}})_{i,\cdot}),\quad\forall j\in\mathcal{I}((J_u)_{i,\cdot}).\tag{34}$$

Therefore, every nonzero entry of $J_u$ has a corresponding nonzero entry of $J_{\hat{u}}$ at the permuted column index:

$$(J_u)_{i,j}\neq 0\;\implies\;(J_{\hat{u}})_{i,\pi(j)}\neq 0.\tag{35}$$

Finally, with the sparsity regularization $\|J_{\hat{u}}\|_0\leq\|J_u\|_0$, this implication strengthens to an equivalence:

$$(J_u)_{i,j}\neq 0\;\iff\;(J_{\hat{u}})_{i,\pi(j)}\neq 0.\tag{36}$$
For $c\in I_k$, we have $c\in\mathcal{I}((J_u)_{k,\cdot})$. Hence, by Equation [30](https://arxiv.org/html/2605.12733#A1.E30),

$$M_{c,\cdot}\in\operatorname{span}\{e_{k'}:k'\in\mathcal{I}((J_{\hat{u}})_{k,\cdot})\}.\tag{37}$$

Suppose, for contradiction, that $M_{c,\pi(r)}\neq 0$ for some $r\in I\setminus I_k$. Then $\pi(r)$ belongs to the index set on the right-hand side of Equation [37](https://arxiv.org/html/2605.12733#A1.E37).

By Equation [36](https://arxiv.org/html/2605.12733#A1.E36), this implies that $r\in\mathcal{I}((J_u)_{k,\cdot})$, i.e., $r\in I_k$, contradicting $r\in I\setminus I_k$. Therefore $M_{c,\pi(r)}=0$, which together with $\mathcal{I}(M)=\mathcal{I}(J_{\phi^{-1}}(\hat{\mathbf{s}}_t))$ yields

$$\frac{\partial\mathbf{s}_{t,c}}{\partial\hat{\mathbf{s}}_{t,\pi(r)}}=0,\quad\forall c\in I_k,\;r\in I\setminus I_k.\tag{38}$$

Since $\phi$ is invertible, there exists an invertible mapping between $\mathbf{s}_{t,c}$ and $\hat{\mathbf{s}}_{t,\pi(c)}$, and $\mathbf{s}_{t,c}$ depends only on $\hat{\mathbf{s}}_{t,\pi(c)}$. Moreover, because $r\in I\setminus I_k$ and $c\in I_k$, $\mathbf{s}_{t,r}$ is independent of $\mathbf{s}_{t,c}$. Hence $\mathbf{s}_{t,r}$ does not depend on $\hat{\mathbf{s}}_{t,\pi(c)}$, in the sense that their mutual information is zero. Thus, we further have

$$\frac{\partial\mathbf{s}_{t,r}}{\partial\hat{\mathbf{s}}_{t,\pi(c)}}=0,\quad\forall c\in I_k,\;r\in I\setminus I_k.\tag{39}$$

Given the invertibility of the mapping between $\mathbf{s}_t$ and $\hat{\mathbf{s}}_t$, the counterparts of Equations [38](https://arxiv.org/html/2605.12733#A1.E38) and [39](https://arxiv.org/html/2605.12733#A1.E39) with the roles of $\mathbf{s}_t$ and $\hat{\mathbf{s}}_t$ exchanged also hold. Thus the only remaining dependencies are within the estimated and ground-truth task-relevant parts, completing the proof. ∎

### Appendix B Supplementary Discussions

#### B.1 Further Comparison with Related Works

##### Learning Temporal Task Structure.

For our temporal task structure results in Section [3](https://arxiv.org/html/2605.12733#S3), the most relevant prior work is Qiu et al. ([2024](https://arxiv.org/html/2605.12733#bib.bib257)), which also models tasks as colliders and seeks to recover their structure in an unsupervised manner. However, the problem we consider is *fundamentally different*, as follows:

- **Theoretical Foundation vs. Heuristic Decomposition.** The most essential distinction lies in identifiability. As noted in Section [3](https://arxiv.org/html/2605.12733#S3), SelTask relies on sequential non-negative matrix factorization, which is a purely heuristic decomposition without any identifiability guarantee for recovering the true temporal task structure. Identifiability is critical because it sets the ultimate limit of any model and provides the guarantee of recovering the ground-truth representation. Our work fills this theoretical gap by providing the first general nonparametric identifiability guarantee for this problem, ensuring the recovered structure is faithful to the underlying generative process.
- **More General Problem Setting.** In addition to providing theoretical guarantees, our approach generalizes previous heuristic methods (including SelTask) in several crucial ways. First, our theory accommodates tasks that may appear, disappear, and interleave arbitrarily over time, moving beyond the assumption of sequential completion typical of decomposition-based methods like SelTask. Second, we do not require strict temporal dependence, and our framework accounts for sequences that may contain an arbitrary number of disconnected boundaries, or even i.i.d. settings; SelTask does not support such temporal disconnections. Lastly, the data-generating process is fully nonparametric, allowing for complex nonlinear functions and arbitrary distributions, without relying on auxiliary information or distributional constraints.

##### Learning Task-Relevant Representation.

Many previous works study the identification of latent variables that generate observational data, but they do not address our goal of recovering a task-relevant representation. On the technical side, some works also adopt a structural view of the hidden generative process. A key example is Zheng et al. ([2022](https://arxiv.org/html/2605.12733#bib.bib111)), which establishes identifiability results for nonlinear ICA under structural conditions. To avoid confusion, we clarify the differences between their setting and ours as follows:

- **Different settings.** Zheng et al. ([2022](https://arxiv.org/html/2605.12733#bib.bib111)) consider the identifiability of nonlinear ICA in the i.i.d. setting, while we consider general temporal settings.
- **Different goals.** Zheng et al. ([2022](https://arxiv.org/html/2605.12733#bib.bib111)) aim to recover all individual latent variables, while we aim to recover the temporal task structure and the task-relevant variables as a group.
- **Different assumptions.** Zheng et al. ([2022](https://arxiv.org/html/2605.12733#bib.bib111)) assume structural sparsity, i.e., a specific graphical criterion on the underlying structure between latent and observed variables, while we do not impose such constraints on the data-generating process. Moreover, Zheng et al. ([2022](https://arxiv.org/html/2605.12733#bib.bib111)) assume latent independence, while we do not.
- **Different proof strategies.** Since the considered problems and techniques are fundamentally different, the proof strategies naturally differ. If we treat task variables as observed variables, the proof ideas align up to the point where we recover the support of the Jacobian up to permutation. However, this yields $J_{\hat{u}}(\hat{s})=D_1 J_u(s) D_2 P$, which says *nothing* about the identifiability of latent variables: the latent components can still be mixed arbitrarily, including across the sets associated with different tasks, let alone within each set. Previous works relying on structural sparsity use that assumption to go further and achieve element-wise identifiability, obtaining $J_{\hat{u}}(\hat{s})=J_u(s)DP$. Our setting does not require that level of identifiability. The central question for us is more modest but crucial: for two task variables $u_1$ and $u_2$, how do we ensure that estimated latents associated with $u_1$ do not mix with ground-truth latents associated with $u_2$? This separation cannot be guaranteed by any existing proof logic. Our result provides exactly that guarantee without imposing structural sparsity.

#### B.2 Detailed Discussion on Main Conditions

For identifying task-relevant representations, our main condition is the one in Proposition [2](https://arxiv.org/html/2605.12733#Thmproposition2). The assumptions are intended to ensure that the Jacobian carries enough variation to span its support, thereby capturing the underlying dependencies between latent state variables and task variables in the nonlinear setting. While these assumptions may appear technical at first, they are usually quite mild in practice. The span condition requires that, across a small number of samples, the Jacobian vectors for each task variable $u_i$ span the relevant support. Intuitively, this rules out degenerate situations where all data points come from an extremely narrow subpopulation that fails to exhibit the necessary variation. In typical settings with smooth mappings and a latent state distribution with a continuous density, Jacobian evaluations at independently drawn samples form continuous random vectors that are in general position with probability one. This means that only $\|\mathcal{I}((J_u(\mathbf{s}_t))_{i,\cdot})\|$ random samples are usually enough to span the support; this number corresponds to how many latent state variables are relevant to $u_i$, which is usually much smaller than the full sample size. As a result, the condition is typically satisfied with very few data points. The index-set inclusion assumption is also mild. Since

$$J_u(\mathbf{s}_t)\,M'(\mathbf{s}_t,\hat{\mathbf{s}}_t)=J_{\hat{u}}(\hat{\mathbf{s}}_t),\tag{40}$$

and $M$ shares the same nonzero index pattern as $M'$, the row $(J_u(\mathbf{s}_t)M)_{i,\cdot}$ already lies inside the support of $(J_{\hat{u}}(\hat{\mathbf{s}}_t))_{i,\cdot}$. While special value combinations could in principle cause the supports to differ at particular points, the condition only requires the existence of one such Jacobian in the relevant space, which is almost always satisfied in practice.
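
As a quick numerical illustration of why the span condition is mild, the following sketch (assuming only numpy; the map $u_i$ and all dimensions are arbitrary smooth choices for illustration) checks that Jacobian rows evaluated at $\|\mathcal{I}((J_u)_{i,\cdot})\|$ independently drawn points are already linearly independent on the relevant support.

```python
# Illustrative check: the span condition holds with |I| random samples.
import numpy as np

rng = np.random.default_rng(0)
d_s, relevant = 6, [0, 2, 5]        # u_i depends on s_1, s_3, s_6

def jac_row(s):
    # Jacobian row of u_i(s) = sin(s_1) + s_3^2 * s_6, an arbitrary
    # smooth, genuinely nonlinear choice.
    row = np.zeros(d_s)
    row[0] = np.cos(s[0])
    row[2] = 2 * s[2] * s[5]
    row[5] = s[2] ** 2
    return row

samples = rng.normal(size=(len(relevant), d_s))
rows = np.stack([jac_row(s) for s in samples])       # (3, d_s)
rank = np.linalg.matrix_rank(rows[:, relevant])
print(rank == len(relevant))                         # True almost surely
```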

### Appendix C Supplementary Experimental Setups

In this section, we include further details of the experimental setups not fully elaborated in the main text because of space constraints. In summary, our practical CI implementation is aligned with existing high-dimensional practice. When the variables are high-dimensional, it is routine to first map them into a lower-dimensional representation and then estimate conditional mutual information (CMI) in that reduced space. This representation step is widely used in CMI-based CI testing to avoid the curse of dimensionality. It is also common to approximate CMI with scalable variational bounds. Neural and variational CMI estimators (Belghazi et al., [2018](https://arxiv.org/html/2605.12733#bib.bib51); Molavipour et al., [2020](https://arxiv.org/html/2605.12733#bib.bib188); Mondal et al., [2020](https://arxiv.org/html/2605.12733#bib.bib187)) and contrastive objectives such as InfoNCE (Oord et al., [2018](https://arxiv.org/html/2605.12733#bib.bib186); Sordoni et al., [2021](https://arxiv.org/html/2605.12733#bib.bib185)) follow exactly this paradigm and have been used extensively in high-dimensional dependence estimation. Our method adopts the same blueprint.

##### CMI Surrogate for the CI Test.

In high-dimensional settings, when direct conditional independence (CI) testing is computationally infeasible, it is standard to use approximations such as conditional mutual information (CMI). Below we provide additional details on this surrogate.

For each pair $(\mathbf{S}_k,\mathbf{S}_v)$ and task $\mathbf{g}_i$, we replace the CI test

$$H_0:\;\mathbf{s}_{kL}\perp\mathbf{s}_{vL}\;\big|\;\mathbf{Z}_{\mathrm{band}}(k,v,i)\tag{41}$$

with an estimate of the conditional mutual information

$$I\big(\mathbf{s}_{kL};\mathbf{s}_{vL}\mid\mathbf{Z}_{\mathrm{band}}(k,v,i)\big)=\mathbb{E}\left[\log\frac{p(\mathbf{s}_{kL},\mathbf{s}_{vL}\mid\mathbf{Z}_{\mathrm{band}})}{p(\mathbf{s}_{kL}\mid\mathbf{Z}_{\mathrm{band}})\,p(\mathbf{s}_{vL}\mid\mathbf{Z}_{\mathrm{band}})}\right].\tag{42}$$

Direct estimation with $\mathbf{Z}_{\mathrm{band}}$ can be high-dimensional. We therefore learn a task-conditioned representation $\mathbf{c}_i=h_\phi\big(\mathbf{Z}_{\mathrm{band}}(k,v,i)\big)$ and instead test with $I(\mathbf{s}_{kL};\mathbf{s}_{vL}\mid\mathbf{c}_i)$. If $h_\phi$ is conditionally sufficient for $\mathbf{Z}_{\mathrm{band}}$ with respect to $\{\mathbf{s}_{kL},\mathbf{s}_{vL}\}$, then

$$I(\mathbf{s}_{kL};\mathbf{s}_{vL}\mid\mathbf{Z}_{\mathrm{band}})=I(\mathbf{s}_{kL};\mathbf{s}_{vL}\mid\mathbf{c}_i).$$

We estimate a variational lower bound on $I(\mathbf{s}_{kL};\mathbf{s}_{vL}\mid\mathbf{c}_i)$ using a conditional InfoNCE objective. Let $f_\theta(\mathbf{s}_{kL},\mathbf{s}_{vL},\mathbf{c}_i)$ be a critic. For each positive triple $(\mathbf{s}_{kL},\mathbf{s}_{vL},\mathbf{c}_i)$, draw $K$ negatives $\{\tilde{\mathbf{s}}_{vL}^{(j)}\}_{j=1}^K$ by shuffling $\mathbf{s}_{vL}$ within mini-batches that share $\mathbf{c}_i$ (or within nearest neighbors of $\mathbf{c}_i$). Optimize

$$\mathcal{L}_{\mathrm{cNCE}}(\theta,\phi)=\mathbb{E}\left[\log\frac{\exp f_\theta(\mathbf{s}_{kL},\mathbf{s}_{vL},\mathbf{c}_i)}{\exp f_\theta(\mathbf{s}_{kL},\mathbf{s}_{vL},\mathbf{c}_i)+\sum_{j=1}^{K}\exp f_\theta(\mathbf{s}_{kL},\tilde{\mathbf{s}}_{vL}^{(j)},\mathbf{c}_i)}\right],\tag{43}$$

which lower-bounds $I(\mathbf{s}_{kL};\mathbf{s}_{vL}\mid\mathbf{c}_i)$ up to a constant. After training a single task-conditioned critic across all $(k,v,i)$, define

$$\widehat{I}_{\mathrm{cNCE}}(k,v,i)=\mathbb{E}\left[\log\frac{\exp f_\theta(\mathbf{s}_{kL},\mathbf{s}_{vL},\mathbf{c}_i)}{\frac{1}{K}\sum_{j=1}^{K}\exp f_\theta(\mathbf{s}_{kL},\tilde{\mathbf{s}}_{vL}^{(j)},\mathbf{c}_i)}\right].\tag{44}$$

We reject $H_0$ for $(k,v,i)$ if $\widehat{I}_{\mathrm{cNCE}}(k,v,i)$ exceeds a permutation threshold obtained by re-sampling $\{\tilde{\mathbf{s}}_{vL}^{(j)}\}$ within $\mathbf{c}_i$ buckets.
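
To make this pipeline concrete, the following is a minimal sketch of the critic and the estimator in Equations (43)–(44), assuming PyTorch; the architecture, the dimensions, and the simple in-batch shuffling used for negatives are illustrative choices, not the exact experimental configuration.

```python
# Illustrative sketch of the conditional InfoNCE surrogate.
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    """f_theta(s_kL, s_vL, c_i): a scalar score for a (pair, context) triple."""
    def __init__(self, d_s, d_c, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d_s + d_c, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s_k, s_v, c):
        return self.net(torch.cat([s_k, s_v, c], dim=-1)).squeeze(-1)

def cnce_loss(critic, s_k, s_v, c, K=16):
    """Equation (43): negatives are in-batch shuffles of s_v, approximating
    resampling within a shared-context (c_i) bucket."""
    pos = critic(s_k, s_v, c)                                   # (B,)
    negs = torch.stack(
        [critic(s_k, s_v[torch.randperm(len(s_v))], c) for _ in range(K)],
        dim=-1)                                                 # (B, K)
    logits = torch.cat([pos.unsqueeze(-1), negs], dim=-1)       # (B, 1+K)
    return -torch.log_softmax(logits, dim=-1)[:, 0].mean()

@torch.no_grad()
def cmi_estimate(critic, s_k, s_v, c, K=16):
    """Equation (44): log-ratio of the positive score to the mean
    exponentiated negative score."""
    pos = critic(s_k, s_v, c)
    negs = torch.stack(
        [critic(s_k, s_v[torch.randperm(len(s_v))], c) for _ in range(K)],
        dim=-1)
    return (pos - torch.logsumexp(negs, dim=-1) + math.log(K)).mean()
```

In this sketch, one would train `Critic` by minimizing `cnce_loss` over mini-batches and then reject $H_0$ whenever `cmi_estimate` exceeds the permutation threshold computed from re-shuffled negatives.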

##### Additional Details of SportsHHI.

SportsHHI contains 11,398 video sequences, partitioned into short clips of 5 frames each, with 55,631 annotated pairwise interaction instances. The HHID task labels the interaction for each pair of human actors in a video; interactions often occupy short temporal windows embedded in long sequences, so a single sequence typically contains multiple, possibly overlapping interactions. This results in complex temporal patterns, with flexible interactions across multiple actors.

In our implementation of Algorithm [1](https://arxiv.org/html/2605.12733#alg1) on SportsHHI, we set the number of latent state variables equal to the number of humans at frame $t$. For all baselines, we use a pretrained CLIP encoder (Radford et al., [2021](https://arxiv.org/html/2605.12733#bib.bib268)) with a ResNet-50 backbone to obtain the observed RGB features $\mathbf{o}$. To handle temporal dynamics, an MLP parameterizes the transitions $\mathbf{s}_{t-1}\to\mathbf{s}_t$, while conditional mutual information (CMI) is estimated on latent trajectories as a surrogate for conditional independence testing. To ensure fairness, all baselines employ a ResNet-50 backbone for RGB feature extraction, consistent with prior work.

##### Downstream Benefit.

We evaluate our method on the Meta-World benchmark (Yu et al., [2020](https://arxiv.org/html/2605.12733#bib.bib11)) by constructing an interleaved offline dataset from the door-open/close and drawer-open tasks. Both tasks involve a 7-DoF robotic arm manipulating the same door but with opposite goals, making them an ideal testbed for multi-task interference. We first train task-specific expert policies using SAC until they reach a 60% success rate, then collect ~300 successful and ~300 mixed-quality trajectories for each task. To create interleaved data, we segment trajectories into 30–60 step skill chunks (e.g., reaching, grasping, rotating). With probability $p=0.8$, we randomly splice open- and close-task segments into a single trajectory, inserting short transition phases to ensure physical continuity. This results in ~2.4k interleaved trajectories, with on average 2.1 task switches per trajectory. We provide only weak or noisy task labels derived from the door angle change, simulating realistic, partially labeled data. We build upon the Active Fine-Tuning (AMF) framework (Bagatella et al., [2025](https://arxiv.org/html/2605.12733#bib.bib27)). Specifically, the agent learns a policy over identified tasks using their representation $\mathbf{g}$, which replaces the task embedding $\mu_c$ in AMF. This enables the agent to actively select tasks that improve generalization. To evaluate this, we train on three tasks (door-open, door-close, and drawer-open) and test generalization to the new task drawer-close with only $10^4$ samples.
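
For concreteness, the following is a schematic sketch of the splicing procedure, assuming only numpy; the chunking bounds, the linear blending used as a transition phase, and all names are simplified stand-ins for the actual Meta-World pipeline rather than its implementation.

```python
# Schematic sketch of interleaved-trajectory construction.
import numpy as np

rng = np.random.default_rng(0)

def chunk(traj, lo=30, hi=60):
    """Split a trajectory (T, d) into skill chunks of 30-60 steps."""
    chunks, t = [], 0
    while t < len(traj):
        step = int(rng.integers(lo, hi + 1))
        chunks.append(traj[t:t + step])
        t += step
    return chunks

def interleave(traj_open, traj_close, p=0.8, blend=5):
    """With probability p, splice open- and close-task chunks, inserting a
    short linearly blended transition phase for continuity."""
    if rng.random() > p:
        return traj_open
    pieces, out = chunk(traj_open) + chunk(traj_close), []
    rng.shuffle(pieces)
    for piece in pieces:
        if out and blend > 0:
            a, b = out[-1][-1], piece[0]          # boundary states
            w = np.linspace(0, 1, blend)[:, None]
            out.append((1 - w) * a + w * b)       # transition phase
        out.append(piece)
    return np.concatenate(out, axis=0)

# e.g., two 200-step, 7-DoF state trajectories (random stand-ins)
mixed = interleave(rng.normal(size=(200, 7)), rng.normal(size=(200, 7)))
print(mixed.shape)
```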

### Appendix D Supplementary Experimental Results

Table 2: Runtime (expected value) under varying numbers of time steps.

##### Runtime Analysis.

We conducted additional runtime analysis on seven datasets with four methods. To test our algorithm in the most computationally heavy case, we set the segment length to its minimal value of 2. Following the same setting as in the manuscript, we vary the number of time steps from 8 to 20 and set the number of tasks to $M=T/5$. The results (in seconds) are reported in Table [2](https://arxiv.org/html/2605.12733#A4.T2).

Table 3: Additional results on SportsHHI.

##### Additional Results on Learning Temporal Task Structure.

To further study the ability of different models to recover temporal task structure, we evaluated several additional approaches on the task structure prediction benchmark. The goal is to compare with more standard video models that do not target identifiability. We further included two new video models, SlowFast (Feichtenhofer et al., [2019](https://arxiv.org/html/2605.12733#bib.bib179)) and ViT-B (Tong et al., [2022](https://arxiv.org/html/2605.12733#bib.bib181)). As shown in Table [3](https://arxiv.org/html/2605.12733#A4.T3), our method achieves the best performance, which further validates that a principled structure-learning approach yields the most reliable recovery of temporal task structure. LEAP outperforms the Base model, which highlights the benefit of identifiable representations for structure learning. However, when observations are modeled more appropriately, this advantage becomes less pronounced, as seen in the similar performance of SlowFast and ViT-B.

##### Additional Results on Controllable Generation.

Figure 7: Comparison of controllable generation for the task “a dog playing ball, jumping high, lying on the grass, and reading a book”. With sparsity, the model learns task-relevant latent representations and modifies only the intended concepts for each instruction. Without sparsity, the representation entangles irrelevant factors, causing unintended changes and reduced precision.

Moreover, we conduct additional experiments to evaluate the benefit of recovering task-relevant representations for controllable generation. We consider the task of “a dog playing ball, jumping high, lying on the grass, and reading a book”. The empirical setup follows that of Figure [6](https://arxiv.org/html/2605.12733#S5.F6). From Figure [7](https://arxiv.org/html/2605.12733#A4.F7), it is evident that precise control requires learning task-relevant representations with sparsity regularization. Without it, the learned representation absorbs irrelevant hidden factors and introduces unwanted changes in the generated outputs. For instance, the dog may change into a different one, its shape may become unnatural, and sometimes irrelevant backgrounds dominate the generation.

(Qualitative example frames: [1](https://arxiv.org/html/2605.12733v1/figures/dog/104_index113_2.jpg), [2](https://arxiv.org/html/2605.12733v1/figures/dog/104_index34_1.jpg), [3](https://arxiv.org/html/2605.12733v1/figures/dog/117_index113_2.jpg), [4](https://arxiv.org/html/2605.12733v1/figures/dog/117_index34_1.jpg).)

The latents corresponding to “running” and “season” are successfully identified. Even in complex settings where attributes are not clearly separable visually, the method recovers meaningful representations, and the identified latents vary across tasks in an interpretable way.
