Evidence for feature-specific error correction in LLMs

arXiv cs.LG · yesterday

This paper provides the first empirical evidence for feature-specific error correction in large language models, showing that residual-stream activations are robust to small perturbations but less robust along candidate feature directions, supporting the theory of computation in superposition.
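As a rough illustration of the kind of comparison the summary describes, the sketch below perturbs an activation vector by a fixed amount along a single candidate direction and along random directions, then measures how far a downstream output moves. Everything here is an assumption made for illustration: the two-layer ReLU map `toy_downstream` is a hypothetical stand-in for the later layers of a model, the "candidate feature direction" is simply taken to be one row of the toy input weights, and the sizes and `eps` are arbitrary. It is not the paper's procedure, and a toy of this kind says nothing about what a real LLM would show.

```python
import math
import random


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def toy_downstream(w_in, w_out, x):
    """Hypothetical stand-in for the layers after the perturbed activation."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]  # ReLU nonlinearity
    return matvec(w_out, hidden)


def output_shift(w_in, w_out, x, direction, eps):
    """Norm of the output change when x moves by eps along a unit direction."""
    d = unit(direction)
    x_pert = [xi + eps * di for xi, di in zip(x, d)]
    base = toy_downstream(w_in, w_out, x)
    pert = toy_downstream(w_in, w_out, x_pert)
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(pert, base)))


rng = random.Random(0)  # fixed seed so repeated runs match
dim, hidden_dim = 16, 32  # arbitrary toy sizes
w_in = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]
w_out = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(dim)]
x = [rng.gauss(0, 1) for _ in range(dim)]

# Assumption: one input-weight row plays the role of a candidate feature direction.
candidate = w_in[0]
random_dirs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(100)]

eps = 0.1  # perturbation size, identical for every direction
shift_candidate = output_shift(w_in, w_out, x, candidate, eps)
shift_random = sum(
    output_shift(w_in, w_out, x, d, eps) for d in random_dirs
) / len(random_dirs)
print(f"candidate: {shift_candidate:.4f}  random mean: {shift_random:.4f}")
```

Normalizing every direction to unit length before scaling by `eps` keeps the perturbation size equal across directions, so any difference in the measured shift reflects the direction rather than the magnitude.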
