Tag
GEOALIGN, from the Alibaba team behind Qwen, finds that instability in RL for LLMs often stems from a few bad rollouts that push updates in conflicting directions, and proposes a lightweight method that curates rollouts by directional consistency, improving training stability and performance.
This paper analyzes latent reasoning models (LRMs) and demonstrates that observable patterns in latent states are not causal explanations of reasoning; it advocates for matched controls and causal tests in interpretability research.
This paper reveals that low-bit KV cache quantization can silently destroy safety alignment in instruction-tuned LLMs, and proposes a diagnostic method (PCR) to classify failure modes along with a training-free mitigation protocol that recovers up to 97% of lost alignment.
This paper introduces Contribution Weights, a projection-based metric that accounts for attention weight, value magnitude, and directional alignment to measure token importance in transformer LLMs more faithfully, revealing that attention sinks play active functional roles.
This paper analyzes linear activation steering in language models by decomposing interventions into angular and radial components. It finds that concepts are primarily encoded in angular structure, but that norm adjustments are crucial for stability. This supports spherical steering methods and shows that a single additive coefficient conflates the angular and radial effects.
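As a rough illustration of the angular/radial split described above (not the paper's code), the sketch below measures how an additive steering step changes both the direction and the norm of an activation. The function names, the toy vectors, and the simple rescaling used to hold the norm fixed are all assumptions made for this example; the rescaling is a stand-in, not the spherical steering methods the paper evaluates.

```python
import numpy as np

def decompose_additive_steering(h, v, alpha):
    """Return (angle_in_radians, norm_ratio) for the step h -> h + alpha * v."""
    h = np.asarray(h, dtype=float)
    v = np.asarray(v, dtype=float)
    steered = h + alpha * v
    # Radial component: how much the activation norm grows or shrinks.
    norm_ratio = np.linalg.norm(steered) / np.linalg.norm(h)
    # Angular component: rotation between the original and steered activation.
    cos = np.dot(h, steered) / (np.linalg.norm(h) * np.linalg.norm(steered))
    angle = float(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angle, float(norm_ratio)

def norm_preserving_steer(h, v, alpha):
    """Apply the same additive step, then rescale back to the original norm."""
    h = np.asarray(h, dtype=float)
    v = np.asarray(v, dtype=float)
    steered = h + alpha * v
    return steered * (np.linalg.norm(h) / np.linalg.norm(steered))

# Toy example: one coefficient alpha changes angle and norm at the same time.
h = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
angle, norm_ratio = decompose_additive_steering(h, v, alpha=1.0)
rotated_only = norm_preserving_steer(h, v, alpha=1.0)
```

In this toy case the additive step both rotates the activation and inflates its norm, while the rescaled version keeps only the rotation.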
This paper examines counterfactual behavior in ML models through a geometric lens, showing that models with similar predictive performance can differ substantially in counterfactual outcomes due to the interaction between decision-boundary proximity and local data support. The findings identify counterfactual behavior as a distinct dimension from predictive performance, with implications for model selection and reliability of counterfactual explanation methods.
This paper geometrically analyzes why LLMs acting as judges agree strongly with each other but weakly with humans, finding that inter-LLM consensus reflects a collapsed subspace rather than true human alignment on subjective rubrics. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.
This paper introduces a Jacobian-PCA-Grassmann framework to analyze the geometric structure of expert specialization in Mixture-of-Experts (MoE) Transformers. It finds that experts exhibit strong functional decorrelation while their representations overlap, and that routing sparsity significantly influences this geometry.
A developer built Arc Gate, a monitoring proxy for LLMs that uses Fisher information manifold geometry to detect session-level prompt injection attacks, identifying Crescendo-style gradual manipulation by tracking t-values against a phase transition threshold t* = 1.2247 rather than per-turn phrase detection.
This paper introduces Shesha, a geometric stability metric that quantifies directional coherence of single-cell CRISPR perturbation responses using mean cosine similarity, revealing regulatory architecture and predicting cellular stress across 2,200+ perturbations in five CRISPR datasets.
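A minimal sketch of a mean-cosine-similarity coherence score of the kind the Shesha summary mentions. The summary does not say how the metric is computed, so the sketch assumes each cell's response is a shift vector (perturbed expression minus a control mean) and that coherence is the mean pairwise cosine similarity between those shifts; the function name and toy data are hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(shifts):
    """Mean cosine similarity over all distinct pairs of row vectors.

    Assumes at least two rows and no all-zero rows.
    """
    shifts = np.asarray(shifts, dtype=float)
    # Normalize each shift vector to unit length so dot products are cosines.
    unit = shifts / np.linalg.norm(shifts, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Keep only the strict upper triangle: each pair once, no self-similarity.
    rows, cols = np.triu_indices(len(unit), k=1)
    return float(sim[rows, cols].mean())

# Toy data: cells that all shift the same way vs. cells that shift oppositely.
coherent = mean_pairwise_cosine([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
opposed = mean_pairwise_cosine([[1.0, 0.0], [-1.0, 0.0]])
```

Under these assumptions a score near 1 means the cells respond in a shared direction, and a score near 0 or below means the responses are scattered or opposed.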