Diverse Dictionary Learning
Summary
The paper introduces diverse dictionary learning, showing that key set-theoretic relationships among latent variables can be identified from observational data without strong assumptions, enabling partial or full identifiability with minimal inductive bias.
Source: https://huggingface.co/papers/2604.17568
Abstract
Without strong assumptions, diverse dictionary learning enables latent variable recovery by identifying set-theoretic relationships and dependency structures from observational data.
Given only observational data X = g(Z), where both the latent variables Z and the generating process g are unknown, recovering Z is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving uncertainty about how to reliably understand the hidden world. To make identifiability actionable in real-world scenarios, we take a complementary view: in general settings where full identifiability is unattainable, what can still be recovered with guarantees, and what biases could be universally adopted? We introduce the problem of diverse dictionary learning to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, are still identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as genus-differentia definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
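To make the set-algebra idea concrete, here is a minimal illustrative sketch (not the paper's implementation): each observation is modeled as depending on a set of latent variables, and the operations the paper proves identifiable (intersection, relative complement, symmetric difference) are composed into a genus-differentia style view. The observation names `x1`..`x3` and latent names `z1`..`z5` are hypothetical.

```python
# Hypothetical latent supports: which latent variables each observation
# depends on. In the paper's setting these sets are not given; the result
# is that their intersections, complements, and symmetric differences are
# identifiable from observational data alone.
support = {
    "x1": frozenset({"z1", "z2", "z3"}),
    "x2": frozenset({"z2", "z3", "z4"}),
    "x3": frozenset({"z3", "z4", "z5"}),
}

def intersection(a, b):
    # latents shared by observations a and b
    return support[a] & support[b]

def complement(a, b):
    # latents in a's support but not in b's (relative complement)
    return support[a] - support[b]

def symmetric_difference(a, b):
    # latents in exactly one of the two supports
    return support[a] ^ support[b]

# A genus-differentia style composition: what x1 shares with x2 (genus),
# restricted to what is absent from x3 (differentia).
genus = intersection("x1", "x2")       # {'z2', 'z3'}
differentia = genus - support["x3"]    # {'z2'}
print(sorted(genus), sorted(differentia))
```

Composing only these identifiable primitives keeps every derived view within the identifiability guarantees, which is the sense in which structured views of the hidden world can be built without recovering each latent variable individually.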
Get this paper in your agent:
hf papers read 2604.17568
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.
MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series
This paper introduces MOSAIC, a method for module discovery in scientific time series that combines causal representation learning with sparse additive identifiable causal learning. It aims to recover interpretable latent variables and their associated observations without post-hoc alignment, validated on domains like molecular dynamics and climate data.
Data-Driven Variational Basis Learning Beyond Neural Networks: A Non-Neural Framework for Adaptive Basis Discovery
This paper introduces Data-Driven Variational Basis Learning (DVBL), a non-neural framework that learns basis functions directly from data through variational optimization, offering interpretability and mathematical transparency compared to neural networks.
Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms
This paper systematically investigates unlearnable examples under diverse training paradigms, revealing that pretrained weights weaken existing methods, and proposes Shallow Semantic Camouflage (SSC) to maintain unlearnability by generating perturbations in a semantically valid subspace.
Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
This paper introduces Bipredictability (P) and the Information Digital Twin (IDT), a lightweight method to monitor conversational consistency in multi-turn LLM interactions using token frequency statistics without embeddings or model internals. The approach achieves 100% sensitivity in detecting contradictions and topic shifts while establishing a practical monitoring framework for extended LLM deployments.