Tag
This paper proposes using sparse autoencoders to detect out-of-distribution inputs for transformers, including typos and jailbreak prompts, by analyzing spurious concept activations. The method enables a mechanistically grounded fine-tuning strategy to improve LLM robustness.
This paper argues that vanilla conditional diffusion models fundamentally fail at compositional generation when the target distribution is out-of-distribution, due to score estimation error, and that inference-time corrections cannot fully compensate.
This paper studies how self-driving car systems and humans perform on visual question answering tasks across different geographic locations (Lima and New York City), finding that both humans and VLMs show similar performance regardless of location but diverge based on question type.
This paper argues that aggregate-score leaderboards for LLM agent benchmarks fail to capture deployment-relevant dimensions and show rank instability. It proposes ranking configurations by predictive validity—the correlation between in-sample and out-of-sample rank—and introduces a twelve-tier measurement apparatus along with falsifiable out-of-distribution criteria.
This paper examines whether language models can independently discover the concept of zero as a form of out-of-distribution generalization, finding that GPT-2-sized models cannot do so at test time but improve when trained on examples of zero, and that language pretraining reduces the number of examples required.
This paper introduces a non-parametric multi-view Gaussian process framework for detecting machine-generated text that is robust to adversarial manipulations like paraphrasing. By combining complementary features and providing calibrated uncertainty, it outperforms existing detectors on held-out attacks.
ADAPTOOD is a novel framework that uses data uncertainty to quantify distribution shift severity and guide fine-tuning of ECG time series models for out-of-distribution settings. It combines uncertainty estimation with low-rank model updates and adaptive hyperparameter optimization, achieving up to 7% higher accuracy and 12.9% higher precision than existing OOD adaptation methods.
Proposes Latent-Predictive Counterfactual Decoupling (LPCD) to address tactical out-of-distribution shifts in live streaming risk assessment by decoupling stable malicious intent from evolving narrative tactics at the latent level, achieving superior performance on large-scale industrial datasets.
This paper introduces DOPA, a demonstration search framework that uses an out-of-distribution proxy to retrieve robust demonstrations for LLMs when the target domain is inaccessible, enhancing in-context learning performance under distribution shift.
This paper proposes Staged-Competence, a curriculum learning framework for DPO-based safety alignment that organizes preference data by difficulty, improving robustness and data efficiency while preserving general capabilities.
Introduces GORMPO, a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas, achieving a 17% improvement on a real-world medical dataset and outperforming state-of-the-art baselines.
This paper presents the first theoretical model for out-of-distribution generalization in reinforcement learning, showing that smaller abstract state spaces enable cross-scale generalization in POMDPs.
This paper introduces Domain Generalizable Dataset Distillation (DGDD), a new problem setting that targets out-of-distribution generalization of distilled datasets, and proposes Spectral Gradient Surgery (SGS) to disentangle class-discriminative and domain-specific information by leveraging cross-domain gradient agreement in the spectral domain.
A developer notes that coding agents consistently fail to help his 10-year-old build creative simulators, revealing LLMs' inability to handle out-of-distribution use cases and arguing that claims of imminent AGI are overstated.
This paper introduces a protocol for fair comparison of diffusion-based OOD detectors and proposes Canonical Feature Snapshots (CFS), which leverage sparse internal activations for efficient detection.
CPCANet is a domain generalization framework that uses Common Principal Component Analysis to discover structured domain-invariant subspaces, achieving state-of-the-art performance in zero-shot transfer.
This paper proposes CAP-TTA, a test-time adaptation framework that uses preconditioned LoRA updates triggered by bias-risk scores to mitigate toxicity and bias in large language models during narrative generation, achieving faster optimization and better fluency than standard baselines.
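One summary above describes ranking agent configurations by predictive validity, defined as the correlation between in-sample and out-of-sample rank. The sketch below illustrates that idea only, under stated assumptions: the summary does not say which correlation is used, so Spearman rank correlation is assumed here, and the function names (`ranks`, `rank_correlation`) and the example scores are hypothetical, not taken from the paper.

```python
import math
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank configurations by score (1 = best); tied scores share the average rank."""
    ordered = sorted(scores, key=lambda name: -scores[name])
    result: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of configurations tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation; raises if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ranks")
    return cov / math.sqrt(var_x * var_y)


def rank_correlation(in_sample: Dict[str, float],
                     out_of_sample: Dict[str, float]) -> float:
    """Spearman correlation between in-sample and out-of-sample ranks (assumed metric)."""
    if set(in_sample) != set(out_of_sample):
        raise ValueError("both score tables must cover the same configurations")
    names = sorted(in_sample)
    r_in, r_out = ranks(in_sample), ranks(out_of_sample)
    return pearson([r_in[n] for n in names], [r_out[n] for n in names])


# Hypothetical scores for three agent configurations.
in_sample_scores = {"config_a": 0.9, "config_b": 0.7, "config_c": 0.5}
held_out_scores = {"config_a": 0.8, "config_b": 0.6, "config_c": 0.4}
validity = rank_correlation(in_sample_scores, held_out_scores)
```

A value near 1 would indicate that the in-sample ordering carries over to held-out tasks, while a value near 0 or below would suggest the leaderboard ordering is not predictive. How the paper aggregates this across its measurement tiers is not stated in the summary and is not modeled here.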