Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

arXiv cs.LG 05/20/26, 04:00 AM Papers

Summary

This paper characterizes compositional literary primitives in instruction-tuned LLMs using sparse autoencoders, discovering feature classes for self, style, and affect that enable emotion steering across two architectures.

arXiv:2605.18808v1 Announce Type: new Abstract: We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of $10^{-3}$ and the expected number of two-seed false-positive cells across the catalog is negligible, so the observed coverage is not consistent with chance. A cross-architectural asymmetry sits in the strict-versus-soft judge contrast: on the same generations, judges agree more often on Llama outputs than on Gemma outputs because Llama outputs name the target affect more directly while Gemma outputs evoke it through scene and imagery. Both architectures contain self-features that serve simultaneously as register markers and as emotion emitters, including a single most-RLHF-loaded self-feature per architecture that intensifies the institutional Helper-AI persona at one operating regime and produces affect-categorizable output at the same calibrated coefficient. Methodologically, the paper presents a three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) with documented anti-patterns; the total compute is single-GPU and about 15 minutes per emotion-feature discovery cycle.

Original Article

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

Similar Articles

Mechanistic Personality Analysis of LLMs Steering Personality via Latent Feature Interventions

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

Interpreting Style Representations via Style-Eliciting Prompts

Literary Non-Style in LLM-Generated Text

Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation

Submit Feedback

Similar Articles

Mechanistic Personality Analysis of LLMs Steering Personality via Latent Feature Interventions

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

Interpreting Style Representations via Style-Eliciting Prompts

Literary Non-Style in LLM-Generated Text

Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation