steerability

When is Your LLM Steerable?

arXiv cs.CL · 2026-06-11

This paper investigates when activation steering succeeds or fails for LLMs by analyzing early decoding dynamics. The authors introduce ASTEER, a large testbed of steered generations, and train a GBDT classifier to predict steering outcomes from early hidden states, enabling efficient steering strength search.

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

arXiv cs.AI · 2026-06-09

This paper investigates whether LLM judges used for safety evaluation can adapt to contextual information and varying safety definitions, finding that they are largely rigid and fail to adjust when the context contradicts their internal priors.
