Tag
Proposes ARIADNE, a training-free, adapter-agnostic routing framework that selects the optimal PEFT adapter at inference time by measuring input proximity to adapter-specific centroids in embedding space, recovering 97.44% of upper-bound performance on 23 tasks.
This paper introduces Relational Reflective Intelligence (RRI), an inference-time governance layer that uses auditable reasoning loops to stabilize human-AI reasoning, addressing cognitive vulnerabilities shared by humans and LLMs.
The paper introduces Probe-Conditioned Head Intervention (PCHI), an inference-time method for LLMs that selectively reduces overconfidence on wrong answers without significantly reducing confidence on correct ones, by conditionally rescaling attention head outputs when the model is likely wrong but confident.
Evoflux uses evolutionary search at inference time to repair failed tool workflows for compact language models, yielding significantly higher execution feasibility than fine-tuning methods.
An independent researcher introduces Epistemic Lattice Tethering (ELT), an inference-time scaffolding framework that extends coherent LLM threads to 325k–1M tokens by applying epistemic and ontological governance.
This paper demonstrates that LLM safety vulnerabilities extend beyond 'shallow safety' (first-token alignment) to any point during generation, showing that short token injections mid-sequence can redirect models toward harmful outputs. The authors propose training on generation trajectories with simulated mid-sequence perturbations to improve robustness.
This paper presents the 'Digital Apprentice,' a framework for scalable and safe agentic AI in which autonomy is earned incrementally through observational learning, human authorization, and continuous alignment correction. It introduces ADAPT, an inference-time control plane that operationalizes graduated autonomy tiers and converts human corrections into reusable preference data.
This paper proposes Dynamic Contextual Orthogonalization (DCO), an inference-time method that reduces hallucinations in large language models by aligning attention head outputs with the context manifold, achieving superior faithfulness on benchmarks with Llama-3 models.
Introduces Latent Reward Steering (LRS), an adaptive inference-time framework that uses sparse autoencoder latent states and a learned reward model to implicitly promote cognitive behaviors like verification and backtracking in reasoning LLMs, improving performance across multiple models and benchmarks.
TIGER is an inference-time framework that mitigates hallucinations in multimodal generation by extracting observation and claim graphs and assigning risk scores to repair unsupported facts. It reduces unsupported content across image-to-text, image+text-to-text, audio-to-text, and video-to-text tasks.
SITA (Scalable Inference-Time Annealing) introduces a method for efficiently sampling molecular Boltzmann distributions by retraining flow-based models along a temperature ladder using energy-based surrogate likelihoods, avoiding costly divergence computations. The approach achieves state-of-the-art performance on Alanine Dipeptide and Tripeptide benchmarks.
DenseSteer is a training-free inference-time framework that improves small language models' math reasoning by steering their internal representations towards dense reasoning patterns, achieving accuracy gains without increasing token-level negative log-likelihood.
The paper introduces SeDT, a training-free inference-time method that improves LLM reliability in multi-turn conversations by annotating conversation history with cumulative relevance scores from three signals, achieving performance gains of up to 37.7% on the Lost-in-Conversation benchmark.
Visual Concept Fusion (VCF) enables dual conditioning on both an image and text prompt in diffusion models at inference time without retraining, using a lightweight aligner and fusion strategy.
This paper studies whether tabular foundation models based on pretrained prior-data fitted networks (PFNs) can generalize to strategic tabular data where individuals modify features after deployment. It proposes Strategic Prior-data Fitted Network (SPN), an inference-time framework that aligns PFN predictions with the post-manipulation distribution without retraining.
Proposes an inference-time method for long video generation that uses overlapping sliding windows with Tweedie matching and stochastic early-phase sampling to improve temporal consistency and visual quality without additional training.
This paper tests whether varying inference-time reasoning effort affects the alignment between large reasoning models' chain-of-thought lengths and human reaction times. Results show alignment is invariant to effort perturbations, suggesting it is a training-time achievement.
HASP is a framework that upgrades agent skills into executable program functions acting as guardrails, enabling direct intervention in LLM agent loops and improving performance on complex tasks like web-search, math reasoning, and coding.
This paper introduces GUARD-IT, a training-free method for machine unlearning that uses input-dependent activation steering at inference time to remove targeted knowledge from LLMs without modifying weights, matching or exceeding gradient-based baselines while preserving utility and robustness to quantization.
This article introduces SkillGen, a multi-agent framework that synthesizes and verifies reusable inference-time skills for LLM agents by contrasting successful and failed trajectories. The method ensures skills are auditable and empirically verified for their net positive impact on agent performance.
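Illustrative note on the ARIADNE entry above: routing by proximity to adapter-specific centroids can be sketched roughly as follows. This is a minimal sketch, not the paper's implementation. The summary does not say which embedding model or distance measure is used, so cosine similarity, the toy 2-D vectors, the adapter names, and the helper names `centroid`, `cosine`, and `route` are all assumptions made for illustration.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route(embedding, centroids):
    """Pick the adapter whose centroid is most similar to the input embedding."""
    return max(centroids, key=lambda name: cosine(embedding, centroids[name]))

# Hypothetical toy data: a few example embeddings per adapter (assumed, not from the paper).
adapter_examples = {
    "sentiment": [[1.0, 0.1], [0.9, 0.2]],
    "translation": [[0.1, 1.0], [0.2, 0.8]],
}

# Centroids are computed once, offline; no training or adapter changes are needed.
centroids = {name: centroid(vecs) for name, vecs in adapter_examples.items()}

# At inference time, embed the input (here a made-up vector) and route it.
chosen = route([0.8, 0.3], centroids)
print(chosen)
```

In this toy setup the query vector lies closer to the "sentiment" centroid, so that adapter is selected; a real system would replace the hand-written vectors with embeddings from an actual encoder.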