feature-directions


Evidence for feature-specific error correction in LLMs

arXiv cs.LG · yesterday

This paper provides the first empirical evidence for feature-specific error correction in large language models. It shows that residual-stream activations are robust to small perturbations overall but less robust to perturbations along candidate feature directions. This finding supports the theory of computation in superposition.
