Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect
Summary
This paper characterizes compositional literary primitives in instruction-tuned LLMs using sparse autoencoders, discovering feature classes for self, style, and affect that enable emotion steering across two architectures.
Similar Articles
Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography
This paper uses sparse autoencoders to decompose LLMs into interpretable features and shows that semantic features explain brain alignment with cortical semantic topography, generalizing across English, Chinese, and French.
Interpreting Style Representations via Style-Eliciting Prompts
This paper proposes a framework to interpret style representations by using style-eliciting prompts—natural language instructions that steer LLMs to generate text with specific stylistic attributes. The method outperforms baseline LLM prompting techniques in both describing and imitating writing styles.
Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
This paper proposes LiSCP, a lightweight stylistic consistency profiling method for robust detection of LLM-generated text, focusing on feature stability under adversarial manipulation. The authors report superior in-domain and cross-domain detection performance and notable robustness.
Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
This paper investigates preference instability in reward models for LLMs, where subtle input variations cause contradictory preference assignments. The authors propose two SAE-based mitigation strategies—SAE Feature Steering and SAE Residual Correction—to reduce incorrect preference assignments without retraining.
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans
This study investigates how LLMs ground abstract concepts compared to humans, finding a significant 'grounding gap' where models rely heavily on word associations rather than emotional or internal states. Using sparse autoencoders, the authors identify internal features related to grounding dimensions, suggesting LLMs possess this information but do not recruit it naturally during generation.