Diverse Dictionary Learning
Summary
The paper introduces diverse dictionary learning, showing that key set-theoretic relationships among latent variables can be identified from observational data without strong assumptions, enabling partial or full identifiability with minimal inductive bias.
Source: https://huggingface.co/papers/2604.17568
Abstract
Diverse dictionary learning makes latent variable recovery possible without strong assumptions by identifying set-theoretic relationships and structures from observational data.
Given only observational data X = g(Z), where both the latent variables Z and the generating process g are unknown, recovering Z is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving uncertainty about how to reliably understand the hidden world. To make identifiability actionable in real-world scenarios, we take a complementary view: in general settings where full identifiability is unattainable, what can still be recovered with guarantees, and what biases could be universally adopted? We introduce the problem of diverse dictionary learning to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, are still identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as genus-differentia definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
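To make the abstract's set-theoretic claims concrete, here is a minimal, hypothetical sketch in Python (not the paper's algorithm). It assumes the latent-to-observed dependency structure, which the paper states is identifiable, has already been recovered and is represented as supports: for each observed variable x_i, the set S_i of latent indices it depends on. The `supports` dictionary and its values are illustrative placeholders, not from the paper.

```python
# A toy illustration of composing set-theoretic views of the latent
# space with ordinary set algebra, assuming the latent-to-observed
# dependency structure is known: S[i] is the set of latent indices
# that observation x_i depends on, i.e., x_i = g_i({z_j : j in S[i]}).

supports = {
    "x1": {0, 1, 2},  # illustrative values, not from the paper
    "x2": {1, 2, 3},
    "x3": {2, 3, 4},
}
all_latents = set().union(*supports.values())

# Intersection: latents shared by observations x1 and x2.
shared = supports["x1"] & supports["x2"]                  # {1, 2}

# Complement (relative to all latents): latents x1 does not depend on.
outside_x1 = all_latents - supports["x1"]                 # {3, 4}

# Symmetric difference: latents used by exactly one of x1 and x2.
exclusive = supports["x1"] ^ supports["x2"]               # {0, 3}

# Composition via set algebra, genus-differentia style: latents common
# to x1 and x2 (the genus) that x3 does not use (the differentia).
genus_differentia = (supports["x1"] & supports["x2"]) - supports["x3"]  # {1}

print(shared, outside_x1, exclusive, genus_differentia)
```

Under the abstract's claims, each of these composed sets is identifiable up to appropriate indeterminacies even when Z and g themselves are not.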
Get this paper in your agent:
hf papers read 2604.17568
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
This paper introduces a validity-diversity framework attributing diversity collapse in LLMs to order and shape miscalibration during decoding, and validates it across 14 language models.
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.
DyStruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference
DyStruct is a training-free Bayesian decoding framework for discrete Diffusion Language Models that enables flexible-length generation by dynamically determining expansion size and decoding order, improving accuracy on math and code tasks.
MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series
This paper introduces MOSAIC, a method for module discovery in scientific time series based on sparse additive identifiable causal learning. It aims to recover interpretable latent variables and their associated observations without post-hoc alignment, and is validated on domains such as molecular dynamics and climate data.
How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation
This paper presents a systematic evaluation of how differential privacy impacts social bias in large language models, finding that while it reduces bias in sentence scoring, the effect does not generalize across all tasks.