# Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

Source: [https://arxiv.org/html/2605.05892](https://arxiv.org/html/2605.05892)

Zehao Jin\*, Ruixuan Deng\*, Junran Wang\*, Xinjie Shen, Chao Zhang
Georgia Institute of Technology
{zjin350, rdeng62, jwang3668, xinjie, chaozhang}@gatech.edu

###### Abstract

Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_\theta(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation-space geometry might be incomplete.

Our code is available at [https://github.com/flas-ai/FLAS](https://github.com/flas-ai/FLAS).

\*Equal contribution.

## 1 Introduction

Large language models have demonstrated strong capabilities across diverse tasks [[4](https://arxiv.org/html/2605.05892#bib.bib10), [10](https://arxiv.org/html/2605.05892#bib.bib25), [30](https://arxiv.org/html/2605.05892#bib.bib48)], yet reliably controlling their behavior to align with human preferences remains a persistent challenge [[1](https://arxiv.org/html/2605.05892#bib.bib16)]. Existing control mechanisms such as prompting and fine-tuning face limitations in robustness, cost, and side effects [[1](https://arxiv.org/html/2605.05892#bib.bib16), [12](https://arxiv.org/html/2605.05892#bib.bib31), [13](https://arxiv.org/html/2605.05892#bib.bib9), [18](https://arxiv.org/html/2605.05892#bib.bib8)]. Activation steering has emerged as a complementary alternative that offers lightweight, interpretable control across a growing range of behaviors [[25](https://arxiv.org/html/2605.05892#bib.bib15), [3](https://arxiv.org/html/2605.05892#bib.bib12), [14](https://arxiv.org/html/2605.05892#bib.bib32), [8](https://arxiv.org/html/2605.05892#bib.bib13), [39](https://arxiv.org/html/2605.05892#bib.bib11)] by modifying intermediate representations at inference time while leaving model parameters frozen [[9](https://arxiv.org/html/2605.05892#bib.bib14), [33](https://arxiv.org/html/2605.05892#bib.bib50), [43](https://arxiv.org/html/2605.05892#bib.bib57)].

Despite these successes, AxBench [[36](https://arxiv.org/html/2605.05892#bib.bib53)], a benchmark that evaluates thousands of natural-language steering concepts, reveals a consistent limitation of existing steering methods: simple in-context prompting outperforms the tested steering methods, and increasing the scalar steering strength improves concept incorporation while monotonically degrading instruction following and fluency. The requirement for concept-specific strength tuning on a development set [[36](https://arxiv.org/html/2605.05892#bib.bib53)] further limits the real-world applicability of previous steering methods.

We hypothesize that this performance gap stems from simplifying assumptions that most activation-steering approaches adopt at design time without rigorous validation. While most recent methods have relaxed the Linear Representation Hypothesis [[24](https://arxiv.org/html/2605.05892#bib.bib7), [23](https://arxiv.org/html/2605.05892#bib.bib41), [31](https://arxiv.org/html/2605.05892#bib.bib1)] by introducing adaptive transforms [[37](https://arxiv.org/html/2605.05892#bib.bib52), [28](https://arxiv.org/html/2605.05892#bib.bib60), [40](https://arxiv.org/html/2605.05892#bib.bib61), [27](https://arxiv.org/html/2605.05892#bib.bib59), [21](https://arxiv.org/html/2605.05892#bib.bib58), [29](https://arxiv.org/html/2605.05892#bib.bib47)], other assumptions persist widely (Table [1](https://arxiv.org/html/2605.05892#S2.T1)), typically restricting interventions to single-step, position-invariant transforms trained per concept on contrastive data. These assumptions define, for each method, a prescribed operator family that constrains both what information the intervention may use and how it may act on the activation. Individual methods relax one or more of these constraints while retaining the rest. Recent flow- and ODE-based formulations [[35](https://arxiv.org/html/2605.05892#bib.bib63), [15](https://arxiv.org/html/2605.05892#bib.bib62), [42](https://arxiv.org/html/2605.05892#bib.bib64)] loosen the single-step constraint by allowing multi-step, state-dependent trajectories, yet they retain the dependence on contrastive data and per-concept training. These restrictions shape how interventions behave in practice and can limit the attainable trade-off between concept incorporation and instruction following.

To address these restrictions, we propose to learn a more expressive steering operator directly from data by introducing FLAS (Flow-based Activation Steering). FLAS replaces a fixed one-step intervention with a learned, time-conditioned velocity field $v_\theta(h,t,c)$ that transports an unsteered activation $h$ to a steered activation $h' = \varphi_T(h)$ through $N$ steps of Euler integration, conditioned on a natural-language concept description $c$. Because the velocity depends on the current activation state, the resulting intervention adapts as the activation evolves and, when integrated over multiple steps, can follow curved trajectories through activation space. Evaluating the velocity independently at each token position further allows the displacement to vary across a sequence. The method trains on positive examples under a standard language-modeling objective, without requiring contrastive pairs, and uses the flow time $T$ as a continuous integration horizon that decouples intervention strength from direction.

Our contributions are as follows.

1. We propose FLAS (Flow-based Activation Steering), a concept-conditioned velocity field integrated by $N$-step Euler that enables adaptive, multi-step, position-sensitive steering trained on positive examples alone. The flow formulation recovers many single-step methods as special cases for $N=1$ and fixed $T$.
2. FLAS is the first learned steering method to consistently outperform prompting on AxBench [[36](https://arxiv.org/html/2605.05892#bib.bib53)], achieving held-out HMean of 1.015/1.113 (Gemma-2-2B/9B-IT) vs. prompting at 0.762/1.091 and HyperSteer [[29](https://arxiv.org/html/2605.05892#bib.bib47)] at 0.608/0.934, with fewer than 1/26 of the parameters. Performance remains stable across $T \in [0.5, 4.0]$ without per-concept tuning and generalizes to held-out concepts, scaling monotonically up to 16k training concepts without clear saturation.
3. The learned velocity field serves as an analysis probe of activation space, revealing curved, position-dependent, multi-step structure. Our method provides empirical evidence that effective steering requires nonlinear and position-sensitive interventions, suggesting that previous hypotheses on activation-space geometry might be incomplete.

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/main.png)

Figure 1: FLAS model architecture overview. The velocity field $v_\theta(h,t,c)$ transports activations at layer $\ell$ of a frozen base LM. A frozen concept encoder $\phi$ produces concept representations consumed by a single FlowBlock via cross-attention. The flow is integrated by $N$-step Euler, shared between training and inference, yielding a steered activation $h' = \varphi_T(h)$. The entire base language model (base LM) is frozen; only the FlowBlock parameters are trained.
## 2 Related Work

Table [1](https://arxiv.org/html/2605.05892#S2.T1) situates FLAS in the landscape of activation-steering methods along five structural axes.

Table 1: Structural comparison of activation-steering methods. Adaptive: depends on current $h$. Multi-step: iterative integration. Per-token: uses inter-position context. Zero-shot: no per-concept retraining. Training data: "pos only" = concept-aligned responses only; "pos+neg" = additionally requires paired negatives. ⋆ Relies on a pretrained sparse autoencoder for feature extraction.

#### Linear activation steering.

Activation Addition [[33](https://arxiv.org/html/2605.05892#bib.bib50)] and CAA [[23](https://arxiv.org/html/2605.05892#bib.bib41)] each extract or optimize a fixed steering direction and add a scaled copy at a chosen layer. Recent work computes the displacement through learned mechanisms such as low-rank interventions [[37](https://arxiv.org/html/2605.05892#bib.bib52)] and cross-attention hypernetworks [[29](https://arxiv.org/html/2605.05892#bib.bib47)], but these still produce a single displacement at inference time, and none has been reported to consistently surpass prompting on AxBench.

#### Concurrent nonlinear and flow-based steering.

Activation Transport [[28](https://arxiv.org/html/2605.05892#bib.bib60)], Curveball Steering [[27](https://arxiv.org/html/2605.05892#bib.bib59)], Spherical Steering [[40](https://arxiv.org/html/2605.05892#bib.bib61)], and Householder Pseudo-Rotation [[26](https://arxiv.org/html/2605.05892#bib.bib66)] introduce nonlinear single-step interventions ranging from affine maps to norm-preserving rotations, all requiring paired source-target data. K-Steering [[21](https://arxiv.org/html/2605.05892#bib.bib58)], TruthFlow [[35](https://arxiv.org/html/2605.05892#bib.bib63)], FlowSteer [[15](https://arxiv.org/html/2605.05892#bib.bib62)], and ODESteer [[42](https://arxiv.org/html/2605.05892#bib.bib64)] adopt multi-step continuous-dynamics formulations, but each targets a single attribute and requires task-specific paired data. FLAS combines a concept-conditioned velocity field with zero-shot generalization via end-to-end LM-loss training on positive data only.

#### Flow matching and representation geometry.

Our velocity-field parameterization draws on flow matching [[16](https://arxiv.org/html/2605.05892#bib.bib36), [32](https://arxiv.org/html/2605.05892#bib.bib49), [17](https://arxiv.org/html/2605.05892#bib.bib37)] and its extensions to manifolds [[2](https://arxiv.org/html/2605.05892#bib.bib19)] and latent spaces [[5](https://arxiv.org/html/2605.05892#bib.bib21)]. Where flow matching transports noise to data, FLAS transports unsteered activations to steered ones under a downstream language-modeling objective rather than a flow-matching regression target. The manifold view of LLM representations [[20](https://arxiv.org/html/2605.05892#bib.bib39), [34](https://arxiv.org/html/2605.05892#bib.bib51), [19](https://arxiv.org/html/2605.05892#bib.bib38), [7](https://arxiv.org/html/2605.05892#bib.bib22), [41](https://arxiv.org/html/2605.05892#bib.bib55)] treats hidden states as lying on low-dimensional submanifolds, and our trajectory analyses in Sections [6.1](https://arxiv.org/html/2605.05892#S6.SS1) and [6.2](https://arxiv.org/html/2605.05892#S6.SS2) give a concrete picture of the path a learned intervention traces on such a submanifold.

## 3 Method

### 3.1 Flow-based Steering

Fix a pretrained language model with $L$ layers and hidden width $d$. At a chosen layer $\ell$, the forward pass produces activations $h \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length. Given a natural-language concept description $c$ (e.g., a short phrase specifying the target behavior), we want to replace $h$ with a steered version $h'$ so that subsequent layers generate text exhibiting the concept while preserving instruction following and fluency.

We realize the map from $h$ to $h'$ as a learned, concept-conditioned flow. Let $\{\varphi_t\}_{t \in [0,T]}$ be a family of maps $\varphi_t : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ generated by a velocity field $v_\theta$, defined by the ODE

$$\frac{d}{dt}\varphi_t(h) = v_\theta\bigl(\varphi_t(h),\, t,\, c\bigr), \tag{1}$$

with initial condition $\varphi_0(h) = h$. The steered activation is obtained by integrating the velocity field from $0$ to $T$:

$$h' = \varphi_T(h) = h + \int_0^T v_\theta\bigl(\varphi_t(h),\, t,\, c\bigr)\, dt. \tag{2}$$

In practice, we approximate this integral using an $N$-step forward Euler method:

$$h_{k+1} = h_k + \frac{T}{N}\, v_\theta\!\left(h_k,\, \frac{kT}{N},\, c\right), \tag{3}$$

for $k = 0, \ldots, N-1$, with $h_0 = h$. The resulting $h_N$ serves as a numerical approximation to $h' = \varphi_T(h)$ and is passed to layer $\ell+1$ in place of $h$.
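A minimal sketch of this integration loop, assuming a generic `velocity_field(h, t, c)` callable (the function and argument names are ours, not from the released code):

```python
def euler_steer(h, concept, velocity_field, T=2.0, N=3):
    """Approximate h' = phi_T(h) with N forward-Euler steps (Eq. 3).

    h: activations at the chosen layer, shape (n, d).
    velocity_field: callable v(h_k, t_k, concept) returning a tensor like h.
    """
    dt = T / N
    for k in range(N):
        h = h + dt * velocity_field(h, k * dt, concept)  # h_{k+1} = h_k + (T/N) v_theta
    return h

# The additive baseline h' = h + alpha * delta(c) is the special case of a
# constant velocity field, v(h, t, c) = delta(c), integrated with T = alpha:
#   euler_steer(h, c, lambda h, t, c: delta_c, T=alpha, N=1)
```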

Three properties of $v_\theta$ together distinguish this formulation from previous steering methods. First, the map $\varphi_t$ depends on the initial state $h$, so the flow adapts to different activations. Second, the time-dependent velocity field can prescribe different directions at each step along the integration path, producing curved trajectories. Third, $v_\theta$ is computed per token position, so the steering trajectory varies per token.

Taken together, these properties make $v_\theta$ sufficiently expressive that the integral in Eq. [2](https://arxiv.org/html/2605.05892#S3.E2) can in principle realize a multi-step transport from $h$ to $h'$. As a consequence, our formulation naturally subsumes prior steering approaches as restricted instances of the velocity field. The standard additive formulation $h' = h + \alpha\,\delta(c)$ is recovered as the special case $v_\theta(h,t,c) = \delta(c)$ with $T = \alpha$.

### 3.2 FlowBlock Architecture and Forward Process

We instantiate $v_\theta$ with a transformer-style block, which we call a FlowBlock (Figure [1](https://arxiv.org/html/2605.05892#S1.F1)). To encode the concept description $c$ into a sequence of vectors that the FlowBlock can attend to, we apply a concept encoder $\phi$. By default, $\phi$ reuses the token embedding and first few transformer layers of the base model, so that $\phi(c)$ inherits the early-layer features of the base model.

At step $k$, the FlowBlock takes the current activation $h_k$, the encoded concept $\phi(c)$, and the current time $t_k = kT/N$ as input. We first inject the time signal through a sinusoidal embedding,

$$\tilde{h}_k = h_k + \mathrm{TimeEmbed}(t_k). \tag{4}$$

Since $c$ is a sequence of arbitrary length, the FlowBlock attends to it through cross-attention,

$$u_k = \mathrm{CrossAttn}\bigl(Q = \tilde{h}_k,\, K = \phi(c),\, V = \phi(c)\bigr), \tag{5}$$

whose keys and values are cached once and reused across $N$ integration steps and decoding positions. A causal self-attention layer and a feedforward network then produce the per-step displacement,

$$\Delta h_k = \mathrm{Feedforward}\bigl(\mathrm{SelfAttn}(u_k)\bigr). \tag{6}$$

Iterating this procedure $N$ times yields $h_N$. In practice, each component is wrapped with a residual connection and a learnable per-channel gate, and the update at each step is scaled by the Euler step size $T/N$. Full implementation details are included in Appendix [B](https://arxiv.org/html/2605.05892#A2).
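A condensed PyTorch sketch of one FlowBlock evaluation under Eqs. (4)-(6). Head count, MLP width, gate placement, and the sinusoidal embedding follow common practice and are assumptions, not the exact released implementation (see Appendix B of the paper):

```python
import math
import torch
import torch.nn as nn

class FlowBlock(nn.Module):
    """One velocity-field evaluation: time embed -> cross-attn -> self-attn -> MLP."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # Learnable per-channel gates on each residual branch.
        self.gates = nn.ParameterList(
            [nn.Parameter(torch.ones(d_model)) for _ in range(3)])
        self.d_model = d_model

    def time_embed(self, t):
        # Standard sinusoidal embedding of the scalar flow time t_k (assumed even d_model).
        half = self.d_model // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        ang = t * freqs
        return torch.cat([torch.sin(ang), torch.cos(ang)])

    def forward(self, h_k, t_k, phi_c, causal_mask=None):
        h = h_k + self.time_embed(t_k)                        # Eq. (4): inject time
        u, _ = self.cross_attn(h, phi_c, phi_c)               # Eq. (5): attend to phi(c)
        h = h + self.gates[0] * u                             # gated residual
        s, _ = self.self_attn(h, h, h, attn_mask=causal_mask) # causal self-attention
        h = h + self.gates[1] * s
        return self.gates[2] * self.mlp(h)                    # Eq. (6): per-step velocity
```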

### 3.3 Training

To control the steering strength at inference, we use the flow time $T$ as a scalar parameter. Under the Euler method (Eq. [3](https://arxiv.org/html/2605.05892#S3.E3)) with fixed $N$, increasing $T$ scales the per-step updates and pushes activations further along their concept-specific trajectories.

For $T$ to provide continuous control, the velocity field $v_\theta$ must remain valid across varying horizons. Unlike prior flow-based methods with a fixed training-time strength [[16](https://arxiv.org/html/2605.05892#bib.bib36)], we enable training-free extrapolation at inference by exposing the model to a range of integration horizons during training. Like classifier-free guidance [[11](https://arxiv.org/html/2605.05892#bib.bib3)], our approach enables dynamic strength control at inference, achieved by simply scaling the integration time of the learned flow.

We implement this by randomizing the integration horizon during training. At each training step we sample $T \sim \mathrm{Uniform}[T_{\min}, T_{\max}]$, run $N$ Euler steps using Eq. [3](https://arxiv.org/html/2605.05892#S3.E3), inject the resulting $h_N$ at layer $\ell$, and supervise with language-modeling cross-entropy on the output tokens,

$$\mathcal{L}_{\mathrm{LM}} = -\sum_{(x,y,c) \in \mathcal{D}} \sum_i \log p\bigl(y_i \mid y_{<i}, x;\, h_N\bigr), \tag{7}$$

where $\mathcal{D}$ is the training dataset, with each triple consisting of an input prompt $x$, a concept $c$ to steer toward, and the desired output $y$ that reflects steering toward $c$.
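A sketch of one training step under this objective. The `activations_at` and `resume_from` helpers are hypothetical hooks that split the frozen model's forward pass at the steered layer; batch field names are also ours:

```python
import torch
import torch.nn.functional as F

def training_step(batch, flow_block, base_lm, layer=20, N=3, T_range=(0.5, 2.0)):
    """One FLAS step: random horizon T, N Euler steps, LM cross-entropy (Eq. 7)."""
    T = torch.empty(()).uniform_(*T_range).item()          # T ~ Uniform[T_min, T_max]
    h = base_lm.activations_at(layer, batch.input_ids)     # hypothetical hook: h_0 at layer l
    for k in range(N):                                     # Eq. (3)
        h = h + (T / N) * flow_block(h, k * T / N, batch.concept_enc)
    logits = base_lm.resume_from(layer, h)                 # hypothetical: layers l+1..L on h_N
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           batch.labels[:, 1:].reshape(-1))
```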

Since velocities for different concepts should point in distinct directions, we add a diversity penalty on the mean-pooled final-step velocities within each minibatch,

$$\mathcal{L}_{\mathrm{div}} = \frac{1}{|\{(i,j) : c_i \neq c_j\}|} \sum_{i,j:\, c_i \neq c_j} \cos\bigl(\bar{v}_i,\, \bar{v}_j\bigr), \qquad \bar{v}_i = \frac{1}{P} \sum_{p=1}^{P} v_i^{(p)}, \tag{8}$$

where $p$ indexes token positions and $v_i^{(p)} = v_\theta(h_{N-1}^{(p)}, t_{N-1}, c_i)$ is the final-step velocity at position $p$ for sample $i$. The total loss is $\mathcal{L}_{\mathrm{LM}} + \lambda\, \mathcal{L}_{\mathrm{div}}$ with $\lambda = 0.1$. Ablations in Sec. [5](https://arxiv.org/html/2605.05892#S5) confirm that this diversity penalty is important for steering quality, especially for extrapolation along $T$ (detailed discussion in Appendix [D](https://arxiv.org/html/2605.05892#A4)).
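Eq. (8) reduces to a masked pairwise-cosine mean over the minibatch; a small sketch (names are ours):

```python
import torch
import torch.nn.functional as F

def diversity_loss(v_final, concept_ids):
    """Pairwise-cosine diversity penalty of Eq. (8).

    v_final: (B, P, d) final-step velocities per sample and token position.
    concept_ids: (B,) integer concept labels within the minibatch.
    """
    v_bar = F.normalize(v_final.mean(dim=1), dim=-1)      # mean-pool over positions, (B, d)
    cos = v_bar @ v_bar.T                                 # all pairwise cosines
    mask = concept_ids[:, None] != concept_ids[None, :]   # keep only pairs with c_i != c_j
    return cos[mask].mean() if mask.any() else cos.new_zeros(())

# Total objective: loss = lm_loss + 0.1 * diversity_loss(v_final, concept_ids)
```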

## 4 Experiments

#### Training data and base model.

We follow the protocol of AxBench [[36](https://arxiv.org/html/2605.05892#bib.bib53)] and train on Concept16k. Base models are Gemma-2-2B-IT and Gemma-2-9B-IT [[30](https://arxiv.org/html/2605.05892#bib.bib48)], with steering at layer 20. We use a single FlowBlock (97.6M trainable parameters on 2B, 255M on 9B), with $N=3$ Euler steps and $T \sim \mathrm{Uniform}[0.5, 2.0]$. The concept encoder is frozen and reuses the base model's token embedding and first two layers. Training details are included in Appendix [A](https://arxiv.org/html/2605.05892#A1).

#### Evaluation protocol.

We follow the AxBench evaluation pipeline. GPT-4o-mini [[22](https://arxiv.org/html/2605.05892#bib.bib4)] scores each generation on concept incorporation ($C$), instruction following ($I$), and fluency ($F$), with $C, I, F \in \{0, 1, 2\}$. The primary metric is the harmonic mean of the three scores: $\mathrm{HMean} = 3/(1/C + 1/I + 1/F) \in [0, 2]$. Held-in evaluates on concepts seen during training but with previously unseen prompts. Held-out is strictly zero-shot, evaluating on concepts never seen during training paired with unseen prompts. Evaluation details are included in Appendix [E](https://arxiv.org/html/2605.05892#A5).
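For concreteness, the metric in code form; returning 0.0 when any score is 0 is an assumption on our part (a common convention, since the harmonic mean is dominated by zeros):

```python
def hmean(c, i, f):
    """Harmonic mean of concept (C), instruction (I), and fluency (F) scores in {0, 1, 2}."""
    if min(c, i, f) == 0:
        return 0.0  # assumed convention: any zero score zeroes the harmonic mean
    return 3.0 / (1.0 / c + 1.0 / i + 1.0 / f)

assert hmean(2, 2, 2) == 2.0  # perfect scores give the maximum of the [0, 2] range
```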

### 4.1 Main Results

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/Main_comparison.png)

Figure 2: Held-in results on Gemma-2-2B-IT, layer 20. FLAS exceeds the in-context prompting baseline by +0.294 and HyperSteer by +0.283.

Table 2: Full steering results on AxBench. Empty entries (—) indicate methods that do not support zero-shot steering. Baselines from AxBench [[36](https://arxiv.org/html/2605.05892#bib.bib53)] and HyperSteer [[29](https://arxiv.org/html/2605.05892#bib.bib47)]. FLAS evaluated at fixed $T=2$. The intervention happens at layer 20 of both models.
Table [2](https://arxiv.org/html/2605.05892#S4.T2) and Figure [2](https://arxiv.org/html/2605.05892#S4.F2.5) present the main results. All FLAS results use a single fixed flow time $T=2$ with no per-concept tuning. On the Gemma-2-2B-IT held-out evaluation, FLAS reaches a harmonic mean of 1.015, exceeding HyperSteer (0.608, +0.407) and in-context prompting (0.762, +0.253). On the Gemma-2-9B-IT held-out evaluation, FLAS reaches 1.113, above both in-context prompting (1.091, +0.022) and HyperSteer (0.934, +0.179). To illustrate the advantage of FLAS over in-context prompting, we provide case studies in Appendix [I](https://arxiv.org/html/2605.05892#A9) where FLAS succeeds while in-context prompting fails. Overall, FLAS incorporates concepts into outputs more naturally and flexibly, especially for complex concepts.

To further assess cross-model generalization, we additionally apply FLAS to Qwen3-4B-Instruct [[38](https://arxiv.org/html/2605.05892#bib.bib43)] at layer 20 under the same training and evaluation pipeline, achieving a held-out harmonic mean of 0.960 (detailed in Appendix [C](https://arxiv.org/html/2605.05892#A3)). This demonstrates that FLAS generalizes across model families.

### 4.2 Concept Scaling

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/fig_scaling_curve.png)

Figure 3: Concept scaling. Held-out harmonic mean versus the number of training concepts.

We investigate how FLAS performance scales with the number of training concepts. We train models on subsets of 9, 500, 1.9k, 5.5k, and the full 16k concepts with identical hyperparameters, and evaluate on the same held-out concepts at $T=2$. As shown in Figure [3](https://arxiv.org/html/2605.05892#S4.F3), the held-out harmonic mean increases monotonically with the number of training concepts, surpassing the in-context prompting baseline between 1.9k and 5.5k concepts. The curve shows no sign of saturation at 16k, suggesting further gains from larger concept pools.

### 4.3 Flow Time Robustness

Activation steering typically involves a trade-off where increased concept incorporation degrades instruction following and fluency. Figure [4](https://arxiv.org/html/2605.05892#S4.F4) contrasts FLAS with three baselines on Gemma-2-9B-IT: ReFT-r1, DiffMean, and AcT [[28](https://arxiv.org/html/2605.05892#bib.bib60)] (reproduced at layer 20, see Appendix [F](https://arxiv.org/html/2605.05892#A6)). All three baselines collapse at higher strengths, while FLAS steadily improves concept score and maintains high instruction following and fluency across the entire range.

This robustness is not an artifact of training-data abundance. Figure [5](https://arxiv.org/html/2605.05892#S4.F5) decomposes the score across $T \in [0.5, 4.0]$ for five concept-pool sizes on Gemma-2-2B-IT, and the qualitative shape of the curves is preserved across scales. Increasing the training pool mainly raises the concept score, while instruction following and fluency remain roughly unchanged. In the data-scarce regime (500 or 1.9k concepts), increasing $T$ at inference time substantially boosts concept incorporation, suggesting that flow time can compensate for limited training data.

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/CIF_Tradeoff.png)

Figure 4: Steering strength trade-off (Gemma-2-9B-IT). Score decomposition across steering strengths for FLAS (held-out, h.o., in blue; held-in, h.i., in purple) and baselines (ReFT-r1, DiffMean, AcT). Shaded bands show ±1 std, clipped to [0, 2].

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/fig_concept_scaling_cif.png)

Figure 5: Flow time across training-set sizes (Gemma-2-2B-IT held-out). Score decomposition versus $T$ for five concept scales. Shaded bands show ±1 std, clipped to [0, 2].

## 5 Ablations

We ablate the main design choices of FLAS on Concept16k held-out using Gemma-2-2B-IT at $T=2$. The base configuration uses $B=1$ FlowBlock and $N=3$ Euler steps, with three phases enabled (cross-attention, self-attention, MLP), the diversity loss, a frozen concept encoder, and weights initialized from the corresponding Gemma-2 layer. All scores are averaged over held-out concepts (10 prompts each). We report 95% bootstrap confidence intervals (10,000 resamples over concept-level means) and paired $t$-statistics against the base configuration. Significance: \* $p<0.05$, \*\* $p<0.01$, \*\*\* $p<0.001$.
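A small sketch of the percentile bootstrap over concept-level means, as we understand the protocol (function name and defaults are ours):

```python
import numpy as np

def bootstrap_ci(concept_means, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over concept-level mean scores.

    concept_means: 1D array, one mean HMean per held-out concept.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(concept_means)
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))   # resample concepts with replacement
    boot_means = x[idx].mean(axis=1)                       # bootstrap distribution of the mean
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```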

| Configuration | HMean | 95% CI | Paired $t$ |
|---|---|---|---|
| Base ($B=1$, $N=3$) | 1.015 | [0.968, 1.060] | — |
| **Architecture** | | | |
| +1 FlowBlock ($B=2$) | 1.009 | [0.963, 1.051] | −0.34 |
| +2 FlowBlocks ($B=3$) | 0.996 | [0.944, 1.044] | −1.06 |
| Disable self-attention | 0.969\* | [0.922, 1.015] | −2.19 |
| Disable MLP | 0.955\*\* | [0.905, 1.003] | −3.05 |
| Disable cross-attention | 0.109\*\*\* | [0.078, 0.142] | −37.82 |
| **Training** | | | |
| Xavier init | 0.968\* | [0.921, 1.012] | −2.49 |
| Remove diversity loss | 0.932\*\*\* | [0.879, 0.982] | −4.41 |
| **Intervention layer** | | | |
| Layer 10 | 1.044 | [0.989, 1.096] | +1.22 |
| Layer 15 | 0.946\*\* | [0.884, 1.006] | −2.93 |
| **Integration steps ($N$)** | | | |
| $N=1$ | 0.837\*\*\* | [0.790, 0.884] | −9.56 |
| $N=2$ | 0.970\*\* | [0.928, 1.010] | −2.59 |
| $N=4$ | 0.981 | [0.936, 1.024] | −1.86 |
| $N=5$ | 1.011 | [0.962, 1.058] | −0.23 |
| $N=10$ | 1.020 | [0.974, 1.064] | +0.26 |

Table 3: Ablations (Concept16k held-out, $T=2$). HMean: harmonic mean of C/I/F. CI: 95% bootstrap over concept-level means. Paired $t$: versus base on the same held-out concepts.

#### Model Architecture.

Table [3](https://arxiv.org/html/2605.05892#S5.T3) shows that the only ablation causing a large performance drop is disabling cross-attention ($t=-37.82$, $p<0.001$), which removes the pathway for concept information to enter the activation stream. Disabling self-attention causes a moderate drop to 0.969 ($t=-2.19$, $p<0.05$), indicating that inter-position coordination contributes. Removing the MLP causes a similar drop to 0.955 ($t=-3.05$, $p<0.01$). The effect of adding FlowBlocks beyond $B=1$ is statistically indistinguishable from the base, confirming that the minimal single-block architecture is already sufficient for the Concept16k dataset.

#### Training.

We ablate the diversity loss and the warm-start initialization strategy. Removing the diversity loss degrades performance to 0.932 ($t=-4.41$, $p<0.001$); we observe a severe degradation in held-out performance without it, which we discuss in Appendix [D](https://arxiv.org/html/2605.05892#A4). Replacing the Gemma-2 weight warm-start with Xavier initialization drops performance to 0.968 ($t=-2.49$, $p<0.05$), confirming that initializing from the base model aids optimization.

#### Intervention Layers.

To assess sensitivity to the choice of layer, we substitute layer 10 or layer 15 for layer 20. Results in Table [3](https://arxiv.org/html/2605.05892#S5.T3) show that steering at layer 10 performs comparably to the base, while layer 15 shows a moderate drop to 0.946. Both substantially outperform the prompting baseline at 0.762, indicating that FLAS is not overly sensitive to the choice of intervention layer.

#### Number of Integration Steps.

Table [3](https://arxiv.org/html/2605.05892#S5.T3) ablates the number of Euler steps. At $N=1$ the flow reduces to a single adaptive displacement and performance drops significantly to 0.837 ($t=-9.56$, $p<0.001$), though it still exceeds prompting (0.762). Adding a second step recovers most of the remaining gap (0.970, $t=-2.59$, $p<0.01$), and beyond $N=3$ further steps yield no significant improvement. Three Euler steps are thus sufficient for the velocity field to capture the required curvature. We analyze this structure in Section [6.2](https://arxiv.org/html/2605.05892#S6.SS2).

## 6 The Geometry of Flow Steering

The velocity field of FLAS can be inspected to understand the steering trajectories. We use the $N=10$ model for the trajectory and per-step analyses, where the flow is exposed at high temporal resolution, and the $N=3$ model (our default configuration) for the per-token analysis. These three analyses show that effective activation steering requires curved, multi-step, token-varying interventions. Detailed settings of the analysis experiments are included in Appendix [G](https://arxiv.org/html/2605.05892#A7).

### 6.1 Steering Trajectories Are Curved

Figure [6](https://arxiv.org/html/2605.05892#S6.F6) visualizes the flow trajectories projected onto the leading principal components of the displacement vectors across various concepts, prompts, and integration steps.

The trajectories are not straight lines. Every concept's path leaves the origin in a shared direction, executes a pronounced bend, and then enters a concept-specific region. Once the bend completes, $T$ controls how far along the concept-specific direction the activation travels.

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/Trajectory_Analysis.png)

Figure 6: Steering trajectories of the learned flow ($N=10$). Color encodes concept identity and lightness encodes flow time $T$, with lighter tints corresponding to lower $T$. Left: 3D PCA projection of trajectories at $T=2$. Middle: per-concept, per-prompt 2D PCA trajectories at $T \in [1.5, 3.0]$. Right: prompt-averaged trajectories with dashed gray KDE contours showing the spread of 60 concepts at each $T$. Trajectories bend from a shared initial direction into concept-specific endpoint regions, and increasing $T$ extends the displacement along each concept's direction.
### 6.2 The Learned Flow Requires Multiple Steps

Figure [7](https://arxiv.org/html/2605.05892#S6.F7) quantifies the per-step structure of the learned flow. At larger flow times ($T=2.0$ and $T=3.0$), the late steps point in mutually consistent directions (cosine similarity > 0.7), while the early steps are markedly misaligned with these later directions (cosine similarity < 0.25). This separation between early and late step directions provides quantitative evidence that the bending observed in Figure [6](https://arxiv.org/html/2605.05892#S6.F6) is a statistically robust phenomenon rather than an artifact of individual trajectories.

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/Step_Analysis.png)

Figure 7: Step-to-step velocity cosine and magnitude ($N=10$, $T \in \{1.0, 2.0, 3.0\}$). Top: 10×10 cosine matrix between Euler velocities. Bottom: mean $\|v\|$ per step.
### 6.3 Per-Token Steering Is Non-Uniform

Most previous activation-steering methods apply the same displacement to every token position. FLAS evaluates the velocity field per position, and each token's total displacement is the sum of $N$ Euler increments. Figure [8](https://arxiv.org/html/2605.05892#S6.F8) shows that the average pairwise cosine between per-token displacements is only $0.294 \pm 0.133$, far below the 1.0 that a position-invariant method would produce. We observe that nearby tokens exhibit higher steering similarity, and that similarities within prompt tokens and within generated tokens are higher than across the two groups, revealing position-dependent structure.
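This analysis reduces to pairwise cosines of the total per-token displacements $h_N - h_0$; a sketch (function name is ours):

```python
import torch.nn.functional as F

def per_token_displacement_cosines(h0, hN):
    """Pairwise cosines between per-token total displacements h_N - h_0.

    h0, hN: (n, d) activations before/after the flow. Returns an (n, n) matrix;
    a position-invariant method would yield all ones.
    """
    d = F.normalize(hN - h0, dim=-1)   # unit displacement per token position
    return d @ d.T

# The mean/std of the off-diagonal entries of this matrix correspond to the
# mu = 0.294, sigma = 0.133 statistics reported in Figure 8.
```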

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/Token_Analysis.png)

Figure 8: Per-token displacement cosines ($N=3$, $T=2$). Left: mean pairwise cosine of total displacements $h_N - h_0$ across token positions. Right: distribution of off-diagonal cosines ($\mu = 0.294$, $\sigma = 0.133$). Per-token steering is far from uniform.

## 7 Limitations and Future Work

Our evaluation focuses on AxBench because it provides large-scale natural-language concepts, allowing us to test FLAS on zero-shot extrapolation to unseen concepts. This scope gives a controlled evaluation of the main claim of FLAS, but it does not cover all uses of inference-time intervention. Extending FLAS to broader concept collections is an important direction for future work. The AxBench evaluation uses an automatic LM judge, which may introduce systematic biases. To assess the stability of the resulting comparisons, we report paired statistical tests across held-out concepts and provide evaluation details in Appendix [E](https://arxiv.org/html/2605.05892#A5).

FLAS introduces acceptable additional inference cost because it accepts arbitrary text concepts, which requires concept encoding and cross-attention during steering. We quantify this overhead in Appendix [H](https://arxiv.org/html/2605.05892#A8). Reducing latency is a future direction for deployment. The learned velocity field is also tied to a specific LM backbone, so a separate FlowBlock must be trained for each base model. Our experiments intervene at a single layer; future work can study cross-layer composition and multi-concept steering.

## 8 Conclusion

We presented FLAS, a flow-based activation-steering method that replaces the fixed, single-step interventions used by prior steering approaches with a learned, concept-conditioned velocity field integrated over multiple Euler steps. By relaxing the assumptions of position invariance, single-step transport, and contrastive supervision, FLAS becomes the first learned steering method to consistently surpass in-context prompting on AxBench, achieving held-out harmonic means of 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT with a single fixed flow time and no per-concept tuning, while generalizing across model families.

Beyond benchmark performance, the learned velocity field can be inspected to understand steering trajectories. The trajectories we observe are curved, require multiple steps to resolve, and vary substantially across token positions. These properties suggest that the geometric assumptions underlying much of the prior steering literature are incomplete. We hope that treating activation interventions as flows rather than vectors opens a more faithful path toward controlling and understanding the internal computations of large language models.

## References

- [1] U. Anwar et al. (2024). Foundational challenges in assuring alignment and safety of large language models. Transactions on Machine Learning Research. [https://openreview.net/forum?id=oVTkOs8Pka](https://openreview.net/forum?id=oVTkOs8Pka)
- [2] Matching normalizing flows and probability paths on manifolds (2022). arXiv:2207.04711.
- [3] G. Boxo, R. Socha, D. Yoo, and S. Raval (2025). Caught in the act: a mechanistic approach to detecting deception. arXiv:2508.19505.
- [4] T. B. Brown et al. (2020). Language models are few-shot learners. arXiv:2005.14165.
- [5] Q. Dao, H. Phung, B. Nguyen, and A. Tran (2023). Flow matching in latent space. arXiv:2307.08698.
- [6] Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto (2025). Length-controlled AlpacaEval: a simple way to debias automatic evaluators. arXiv:2404.04475.
- [7] E. Facco, M. d'Errico, A. Rodriguez, and A. Laio (2017). Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports 7(1), 12140.
- [8] M. Frising and D. Balcells (2026). Linear personality probing and steering in LLMs: a Big Five study. arXiv:2512.17639.
- [9] M. Giulianelli, J. Harding, F. Mohnert, D. Hupkes, and W. Zuidema (2021). Under the hood: using diagnostic classifiers to investigate and improve how language models track agreement information. arXiv:1808.08079.
- [10] A. Grattafiori et al. (2024). The Llama 3 herd of models. arXiv:2407.21783.
- [11] J. Ho and T. Salimans (2022). Classifier-free diffusion guidance. arXiv:2207.12598.
- [12] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021). LoRA: low-rank adaptation of large language models. arXiv:2106.09685.
- [13] S. Kotha, J. M. Springer, and A. Raghunathan (2024). Understanding catastrophic forgetting in language models via implicit inference. arXiv:2309.10105.
- [14] B. W. Lee, I. Padhi, K. N. Ramamurthy, E. Miehling, P. Dognin, M. Nagireddy, and A. Dhurandhar (2025). Programming refusal with conditional activation steering. arXiv:2409.05907.
- [15] Y. Li, B. Bergner, Y. Zhao, V. P. Patil, B. Chen, and C. Wang (2026). Steering large reasoning models towards concise reasoning via flow matching. arXiv:2602.05539.
- [16] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2023). Flow matching for generative modeling. arXiv:2210.02747.
- [17] Y. Lipman, M. Havasi, P. Holderrieth, N. Shaul, M. Le, B. Karrer, R. T. Q. Chen, D. Lopez-Paz, H. Ben-Hamu, and I. Gat (2024). Flow matching guide and code. arXiv:2412.06264.
- [18] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang (2025). An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv:2308.08747.
- [19] M. A. Mabrok (2026). Latent semantic manifolds in large language models. arXiv:2603.22301.
- [20] A. Modell, P. Rubin-Delanchy, and N. Whiteley (2025). The origins of representation manifolds in large language models. arXiv:2505.18235.
- [21] N. Oozeer, L. Marks, S. Jain, F. Barez, and A. Abdullah (2026). Beyond linear steering: unified multi-attribute control for language models. arXiv:2505.24535.
- [22] OpenAI: J. Achiam et al. (2024). GPT-4 technical report. arXiv:2303.08774.
- [23] N. Panickssery, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner (2024). Steering Llama 2 via contrastive activation addition. arXiv:2312.06681.
- [24] K. Park, Y. J. Choe, and V. Veitch (2024). The linear representation hypothesis and the geometry of large language models. arXiv:2311.03658.
- [25] S. Park, X. Du, M. Yeh, H. Wang, and Y. Li (2025). Steer LLM latents for hallucination detection. arXiv:2503.01917.
- [26] V. Pham and T. H. Nguyen (2024). Householder pseudo-rotation: a novel approach to activation editing in LLMs with direction-magnitude perspective. arXiv:2409.10053.
- [27] S. Raval, H. J. Song, L. Wu, A. Harrasse, J. M. Phillips, F. Barez, and A. Abdullah (2026). Curveball steering: the right direction to steer isn't always linear. arXiv:2603.09313.
- [28] P. Rodriguez, A. Blaas, M. Klein, L. Zappella, N. Apostoloff, M. Cuturi, and X. Suau (2024). Controlling language and diffusion models by transporting activations. arXiv:2410.23054.
- [29] J. Sun, S. Baskaran, Z. Wu, M. Sklar, C. Potts, and A. Geiger (2025). HyperSteer: activation steering at scale with hypernetworks. arXiv:2506.03292.
- [30] Gemma Team, M. Riviere, et al. (2024). Gemma 2: improving open language models at a practical size. arXiv:2408.00118.
- \[31\]A\. Templeton, T\. Conerly, J\. Marcus, J\. Lindsey, T\. Bricken, B\. Chen, A\. Pearce, C\. Citro, E\. Ameisen, A\. Jones, H\. Cunningham, N\. L\. Turner, C\. McDougall, M\. MacDiarmid, C\. D\. Freeman, T\. R\. Sumers, E\. Rees, J\. Batson, A\. Jermyn, S\. Carter, C\. Olah, and T\. Henighan\(2024\)Scaling monosemanticity: extracting interpretable features from claude 3 sonnet\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)Cited by:[§1](https://arxiv.org/html/2605.05892#S1.p3.1),[Table 1](https://arxiv.org/html/2605.05892#S2.T1.1.1.1.2)\.
- \[32\]A\. Tong, K\. Fatras, N\. Malkin, G\. Huguet, Y\. Zhang, J\. Rector\-Brooks, G\. Wolf, and Y\. Bengio\(2024\-03\)Improving and generalizing flow\-based generative models with minibatch optimal transport\.arXiv\.External Links:2302\.00482,[Document](https://dx.doi.org/10.48550/arXiv.2302.00482)Cited by:[§2](https://arxiv.org/html/2605.05892#S2.SS0.SSS0.Px3.p1.1)\.
- \[33\]A\. M\. Turner, L\. Thiergart, G\. Leech, D\. Udell, J\. J\. Vazquez, U\. Mini, and M\. MacDiarmid\(2024\-10\)Steering language models with activation engineering\.arXiv\.External Links:2308\.10248,[Document](https://dx.doi.org/10.48550/arXiv.2308.10248)Cited by:[§1](https://arxiv.org/html/2605.05892#S1.p1.1),[§2](https://arxiv.org/html/2605.05892#S2.SS0.SSS0.Px1.p1.1)\.
- \[34\]L\. Valeriani, D\. Doimo, F\. Cuturello, A\. Laio, A\. Ansuini, and A\. Cazzaniga\(2023\-10\)The geometry of hidden representations of large transformer models\.arXiv\.External Links:2302\.00294,[Document](https://dx.doi.org/10.48550/arXiv.2302.00294)Cited by:[§2](https://arxiv.org/html/2605.05892#S2.SS0.SSS0.Px3.p1.1)\.
- \[35\]H\. Wang, B\. Cao, Y\. Cao, and J\. Chen\(2025\-02\)TruthFlow: truthful llm generation via representation flow correction\.arXiv\.External Links:2502\.04556,[Document](https://dx.doi.org/10.48550/arXiv.2502.04556)Cited by:[§1](https://arxiv.org/html/2605.05892#S1.p3.1),[§2](https://arxiv.org/html/2605.05892#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.05892#S2.T1.1.1.8.6.1)\.
- \[36\]Z\. Wu, A\. Arora, A\. Geiger, Z\. Wang, J\. Huang, D\. Jurafsky, C\. D\. Manning, and C\. Potts\(2025\-03\)AxBench: steering llms? even simple baselines outperform sparse autoencoders\.arXiv\.External Links:2501\.17148,[Document](https://dx.doi.org/10.48550/arXiv.2501.17148)Cited by:[Appendix E](https://arxiv.org/html/2605.05892#A5.SS0.SSS0.Px1.p1.1),[Appendix E](https://arxiv.org/html/2605.05892#A5.SS0.SSS0.Px3.p1.1),[Appendix E](https://arxiv.org/html/2605.05892#A5.SS0.SSS0.Px4.p1.1),[Appendix H](https://arxiv.org/html/2605.05892#A8.SS0.SSS0.Px2.p2.1),[Table 9](https://arxiv.org/html/2605.05892#A8.T9),[Table 9](https://arxiv.org/html/2605.05892#A8.T9.9.2.2),[item 2](https://arxiv.org/html/2605.05892#S1.I1.i2.p1.8),[§1](https://arxiv.org/html/2605.05892#S1.p2.1),[§4](https://arxiv.org/html/2605.05892#S4.SS0.SSS0.Px1.p1.4),[Table 2](https://arxiv.org/html/2605.05892#S4.T2),[Table 2](https://arxiv.org/html/2605.05892#S4.T2.2.1.1)\.
- \[37\]Z\. Wu, A\. Arora, Z\. Wang, A\. Geiger, D\. Jurafsky, C\. D\. Manning, and C\. Potts\(2024\-05\)ReFT: representation finetuning for language models\.arXiv\.External Links:2404\.03592,[Document](https://dx.doi.org/10.48550/arXiv.2404.03592)Cited by:[§1](https://arxiv.org/html/2605.05892#S1.p3.1),[§2](https://arxiv.org/html/2605.05892#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2605.05892#S2.T1.1.1.4.2.1)\.
- \[38\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu\(2025\-05\)Qwen3 technical report\.Note:https://arxiv\.org/abs/2505\.09388v1Cited by:[Appendix C](https://arxiv.org/html/2605.05892#A3.p1.1),[§4\.1](https://arxiv.org/html/2605.05892#S4.SS1.p2.1)\.
- \[39\]S\. Yang, S\. Zhu, L\. Liu, L\. Hu, M\. Li, and D\. Wang\(2025\)Exploring the personality traits of llms through latent features steering\.External Links:2410\.10863,[Link](https://arxiv.org/abs/2410.10863)Cited by:[§1](https://arxiv.org/html/2605.05892#S1.p1.1)\.
- \[40\]Z\. You, C\. Deng, and H\. Chen\(2026\-02\)Spherical steering: geometry\-aware activation rotation for language models\.arXiv\.External Links:2602\.08169,[Document](https://dx.doi.org/10.48550/arXiv.2602.08169)Cited by:[§1](https://arxiv.org/html/2605.05892#S1.p3.1),[§2](https://arxiv.org/html/2605.05892#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.05892#S2.T1.1.1.5.3.1)\.
- \[41\]V\. Yusupov, D\. Maksimov, A\. Alaeva, A\. Vasileva, A\. Antipina, T\. Zaitseva, A\. Ermilova, E\. Burnaev, and E\. Shvetsov\(2025\-09\)From internal representations to text quality: a geometric approach to llm evaluation\.arXiv\.External Links:2509\.25359,[Document](https://dx.doi.org/10.48550/arXiv.2509.25359)Cited by:[§2](https://arxiv.org/html/2605.05892#S2.SS0.SSS0.Px3.p1.1)\.
- \[42\]H\. Zhao, H\. Sun, J\. Kong, X\. Li, Q\. Wang, L\. Jiang, Q\. Zhu, T\. Abdelzaher, Y\. Choi, M\. Li, and H\. Shao\(2026\-02\)ODESteer: a unified ode\-based steering framework for llm alignment\.arXiv\.External Links:2602\.17560,[Document](https://dx.doi.org/10.48550/arXiv.2602.17560)Cited by:[§1](https://arxiv.org/html/2605.05892#S1.p3.1),[§2](https://arxiv.org/html/2605.05892#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.05892#S2.T1.1.1.11.9.1)\.
- \[43\]A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, S\. Goel, N\. Li, M\. J\. Byun, Z\. Wang, A\. Mallen, S\. Basart, S\. Koyejo, D\. Song, M\. Fredrikson, J\. Z\. Kolter, and D\. Hendrycks\(2025\-03\)Representation engineering: a top\-down approach to ai transparency\.arXiv\.External Links:2310\.01405,[Document](https://dx.doi.org/10.48550/arXiv.2310.01405)Cited by:[§1](https://arxiv.org/html/2605.05892#S1.p1.1)\.

## Appendix A Training Details

Table 4: Training hyperparameters (default setting).

#### Data format.

Each training example is a triple of prompt, concept-target output, and concept text. The prompt is formatted with the Gemma chat template. Labels on prompt and padding positions are set to $-100$ so the LM loss covers only output tokens.
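
For concreteness, a minimal sketch of this masking with a Hugging Face tokenizer (the function name, `max_len`, and padding details are illustrative, not the release pipeline):

```python
# Sketch: build one (prompt, output, concept) training example with LM labels
# masked to -100 on prompt and padding positions.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

def build_example(prompt: str, output: str, concept: str, max_len: int = 512):
    # Format the user prompt with the Gemma chat template.
    chat = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    prompt_ids = tok(chat, add_special_tokens=False)["input_ids"]
    output_ids = tok(output, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + output_ids)[:max_len]
    # LM loss covers only output tokens: prompt positions get -100 ...
    labels = ([-100] * len(prompt_ids) + output_ids)[:max_len]

    # ... and so do padding positions.
    n_pad = max_len - len(input_ids)
    input_ids = input_ids + [tok.pad_token_id] * n_pad
    labels = labels + [-100] * n_pad
    return {"input_ids": input_ids, "labels": labels, "concept": concept}
```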

#### Causal guarantees\.

Cross\-attention uses the frozen concept encoder’s output as keys and values, which depends only on the concept text and is independent of the generation\. Self\-attention uses a causal mask so the activation stream never attends to future positions\. At inference, the concept representation is computed once and reused for every generated token\.

## Appendix B Architecture Details

#### ConceptEncoder\.

Our model reuses the base LM’s token embedding, first two decoder layers, and final RMSNorm as the ConceptEncoder for natural-language concepts. All of its parameters are frozen during both training and inference.

#### FlowBlock\.

The single FlowBlock applies three phases: cross-attention, causal self-attention, and a gated MLP. Each phase starts with RMSNorm, applies its operation, passes through a second RMSNorm, and adds the result to the residual stream through a learnable per-channel gate initialized to $0.1$. Cross-attention uses Gemma-2’s grouped-query configuration with QK-normalization, logit soft-capping, and rotary embeddings.
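
A PyTorch sketch of this three-phase layout, under stated simplifications: standard `nn.MultiheadAttention` stands in for Gemma-2’s grouped-query attention (so no QK-normalization, soft-capping, or rotary embeddings here), the feedforward is a plain SiLU MLP rather than Gemma-2’s gated variant, and `nn.RMSNorm` requires a recent PyTorch:

```python
import torch
import torch.nn as nn

class FlowBlockSketch(nn.Module):
    """Simplified FlowBlock: cross-attention, causal self-attention, MLP.

    Each phase is RMSNorm -> operation -> RMSNorm -> gated residual add,
    with per-channel gates initialized to 0.1 (as described above)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norms = nn.ModuleList([nn.RMSNorm(d_model) for _ in range(6)])
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.SiLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.gates = nn.ParameterList(
            [nn.Parameter(torch.full((d_model,), 0.1)) for _ in range(3)]
        )

    def forward(self, h: torch.Tensor, concept: torch.Tensor) -> torch.Tensor:
        # Phase 1: cross-attention (activations query concept representations).
        x, _ = self.cross(self.norms[0](h), concept, concept, need_weights=False)
        h = h + self.gates[0] * self.norms[1](x)

        # Phase 2: causal self-attention over the activation stream
        # (True entries in the mask are positions that may NOT be attended).
        T = h.size(1)
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1
        )
        q = self.norms[2](h)
        x, _ = self.self_attn(q, q, q, attn_mask=mask, need_weights=False)
        h = h + self.gates[1] * self.norms[3](x)

        # Phase 3: feedforward (plain SiLU MLP standing in for the gated MLP).
        h = h + self.gates[2] * self.norms[5](self.mlp(self.norms[4](h)))
        return h
```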

#### Time conditioning\.

Given a flow time $t$, we compute a sinusoidal embedding with 64 frequency pairs,

$$\tau(t)_{k}=\sin(t\,\omega_{k}),\qquad \tau(t)_{64+k}=\cos(t\,\omega_{k}),\qquad \omega_{k}=10000^{-k/64},\qquad k=0,\ldots,63,$$

yielding $\tau(t)\in\mathbb{R}^{128}$. A two-layer MLP projects this to the model dimension,

$$e(t)=W_{2}\,\mathrm{SiLU}(W_{1}\tau(t)+b_{1})+b_{2},$$

with $W_{2}$ and $b_{2}$ zero-initialized so that $e(t)=0$ at the start of training. The vector $e(t)$ is added to the activation $h$ at the entry of each FlowBlock and broadcast across the sequence dimension.
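
In code, the embedding and the zero-initialized projection might look as follows (the MLP hidden width is not specified above, so `d_model` is assumed):

```python
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Sinusoidal flow-time embedding (64 frequency pairs -> R^128) projected
    to the model dimension by a two-layer MLP with zero-initialized output."""

    def __init__(self, d_model: int, n_freqs: int = 64):
        super().__init__()
        # omega_k = 10000^{-k/64}, k = 0..63
        k = torch.arange(n_freqs, dtype=torch.float32)
        self.register_buffer("omega", 10000.0 ** (-k / n_freqs))
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freqs, d_model),  # W1, b1
            nn.SiLU(),
            nn.Linear(d_model, d_model),      # W2, b2
        )
        # Zero-init W2 and b2 so that e(t) = 0 at the start of training.
        nn.init.zeros_(self.mlp[2].weight)
        nn.init.zeros_(self.mlp[2].bias)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) flow times; tau(t): (batch, 128)
        angles = t[:, None] * self.omega[None, :]
        tau = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.mlp(tau)  # e(t): (batch, d_model), broadcast over positions
```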

#### Velocity computation\.

Given $h$, $c$, and $t$, the time embedding $e(t)$ is added to $h$. The FlowBlock then applies cross-attention (activations query concept representations), causal self-attention on the activation stream, and a gated feedforward pass. The velocity is $v_{\theta}(h_{\text{in}},t,c)=h_{\text{out}}-h_{\text{in}}$.
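
Combining the pieces, one plausible Euler-integration loop from $t=0$ to $t=T$ (a sketch: whether $e(t)$ enters before the residual reference point $h_{\text{in}}$ exactly as written here is our reading of the text, and `flow_block`/`time_emb` refer to the sketches above):

```python
import torch

def integrate_flow(flow_block, time_emb, h0, concept, T=2.0, n_steps=3):
    """Euler-integrate dh/dt = v_theta(h, t, c) from t=0 to t=T in n_steps."""
    h, dt = h0, T / n_steps
    for i in range(n_steps):
        t = torch.full((h.size(0),), i * dt, device=h.device)
        h_in = h + time_emb(t)[:, None, :]    # add e(t), broadcast over sequence
        v = flow_block(h_in, concept) - h_in  # velocity = h_out - h_in
        h = h + dt * v                        # Euler update
    return h                                  # steered activations at t = T
```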

#### Initialization regime\.

The zero-initialized time-MLP output, the per-channel gates at $0.1$, and the Gemma-2 weight initialization jointly ensure that the FlowBlock begins as a near-identity map.

## Appendix C FLAS on Qwen3

To check that FLAS transfers across architectures, we re-run the minimal configuration with Qwen3-4B-Instruct-2507 [[38](https://arxiv.org/html/2605.05892#bib.bib43)] as the frozen base. The training and evaluation pipeline is unchanged across backbones; only the base LM and the ConceptEncoder are swapped. The training and evaluation concepts of AxBench originally came from Gemma-2 SAEs. We do not re-extract concepts from Qwen3 SAEs, so the training and evaluation data are built from Gemma-2-2B feature directions.

#### Architectural adaptations\.

FLAS inherits the base model’s architecture, so porting to Qwen3 amounts to matching its design choices. We replace Gemma-2’s RMSNorm with Qwen3’s variant, switch the MLP from GeGLU with GELU-tanh to SwiGLU with SiLU, remove attention logit soft-capping, and drop the $\sqrt{d_{\text{model}}}$ embedding scaling in the ConceptEncoder. Qwen3 layers carry two RMSNorms rather than Gemma-2’s four, so the pre-attention and pre-MLP norms are loaded from the source layer while the post-attention and post-MLP norms keep their default unit weights. The cross-attention inherits Qwen3-4B’s GQA configuration with 32 query and 8 key-value heads, head dimension $128$, hidden size $2560$, RoPE base $5\times 10^{6}$, full attention at every layer, and QK-normalization preserved.

#### Hyperparameters\.

We keep the minimal config of Section [4](https://arxiv.org/html/2605.05892#S4) except as listed in Table [5](https://arxiv.org/html/2605.05892#A3.T5). Batch size is halved to fit a single A100-80GB, with gradient accumulation restoring the effective batch size of $32$, and the maximum step budget is reduced from $80{,}000$ to $60{,}000$. As with the Gemma runs, training is early-stopped on validation LM loss before reaching this cap, and we report the best checkpoint. We keep the absolute layer index $\ell=20$ for direct comparability, although this corresponds to roughly $77\%$ depth on Gemma-2-2B (26 layers) versus $56\%$ on Qwen3-4B (36 layers).

Table 5: Hyperparameters changed for the Qwen3-4B port.
#### Result\.

On the $100$ held-out concepts, FLAS reaches an HMean of $0.960$ at $T=2$, compared to $1.015$ on Gemma-2-2B-IT under the same data and evaluation. Both substantially outperform the prompting baseline on Gemma-2-2B-IT at $0.762$, suggesting fluent concept incorporation. Larger Qwen variants, Qwen-native concept supervision, and longer training are left to future work.

## Appendix D Diversity Loss

The diversity loss $\mathcal{L}_{\text{div}}$ defined in Eq. [8](https://arxiv.org/html/2605.05892#S3.E8) penalizes cosine similarity between mean-pooled final-step velocities of different concepts within each minibatch. It prevents the velocity field from collapsing to a single concept-independent direction in the early stages of training, when the LM loss alone provides only a weak signal for distinguishing concepts. Removing it drops held-out HMean from $1.015$ to $0.932$ at $T=2$ ($p<0.001$, Table [3](https://arxiv.org/html/2605.05892#S5.T3)).
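
A sketch of such a penalty; the exact weighting and reduction follow Eq. [8](https://arxiv.org/html/2605.05892#S3.E8), while this version simply averages the off-diagonal cosines:

```python
import torch
import torch.nn.functional as F

def diversity_loss(final_step_velocities: torch.Tensor) -> torch.Tensor:
    """final_step_velocities: (n_concepts, seq_len, d_model) last-step
    velocities for the distinct concepts in a minibatch."""
    v = final_step_velocities.mean(dim=1)   # mean-pool over token positions
    v = F.normalize(v, dim=-1)              # unit vector per concept
    sim = v @ v.T                           # pairwise cosine similarities
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool, device=sim.device)]
    return off_diag.mean()                  # collapse to one direction -> high loss
```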

Figure [9](https://arxiv.org/html/2605.05892#A4.F9) decomposes the score across $T\in[0.5,4.0]$ on Gemma-2-2B-IT held-out concepts. Without $\mathcal{L}_{\text{div}}$, the concept score plateaus near $1.05$ around $T\approx 1.5$ and then declines, while the full configuration climbs monotonically and reaches $1.33$ at $T=4$. At large flow times the LM-only variant also suffers a sharp collapse in all scores; in particular, fluency drops to around $0.2$ at $T=4$ versus $0.85$ for the full configuration. This analysis shows that, under the default FLAS configuration, $\mathcal{L}_{\text{div}}$ yields substantial gains at large flow times, suggesting that explicitly penalizing inter-concept similarity improves the model’s ability to extrapolate concept intensity beyond the training regime.

![Refer to caption](https://arxiv.org/html/2605.05892v1/figures/fig_div_loss_ablation_cif.png)

Figure 9: Effect of the diversity loss on score decomposition versus flow time (Gemma-2-2B-IT held-out). Removing $\mathcal{L}_{\text{div}}$ caps the concept score at moderate flow times and triggers a sharp collapse of instruction following and fluency at large $T$, while the full configuration maintains monotonic concept growth and graceful degradation across the full range. Shaded bands show $\pm 1$ std.
## Appendix E Evaluation Protocol

#### Held\-out concept selection\.

AxBench [[36](https://arxiv.org/html/2605.05892#bib.bib53)] defines a held-out evaluation protocol but does not publicly release the specific held-out concept list they use. Following their protocol, we exclude 500 concepts from the Concept16k training set prior to training using a deterministic random permutation. From these 500 held-out concepts we sample 100 at random for evaluation, and we similarly sample 100 held-in concepts from the remaining training pool. The same 100-concept splits are reused for every held-out and held-in number reported in this paper, which also enables the paired $t$-tests in Table [3](https://arxiv.org/html/2605.05892#S5.T3) across ablation configurations. Both the 500-concept holdout and the 100-concept evaluation subsets are reproducible from our code release, and the exact concept-id files used for every result in this paper are shipped with the repository at `data/eval_c16k_ho100.json` and `data/eval_c16k_hi100.json`. For each concept we generate steered outputs on 10 AlpacaEval [[6](https://arxiv.org/html/2605.05892#bib.bib2)] prompts with 256 max new tokens at temperature 1.0, yielding 1,000 generations per condition with no further sub-sampling. We validate below that this sample size provides a stable estimate of the full 500-concept population mean.
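
A minimal sketch of such a deterministic split (the seed and the total concept count are placeholders; the shipped concept-id files are authoritative):

```python
import numpy as np

def split_concepts(n_total: int = 16000, n_holdout: int = 500,
                   n_eval: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)        # deterministic given the seed
    holdout = perm[:n_holdout]             # 500 concepts excluded before training
    train = perm[n_holdout:]               # remaining training pool
    eval_held_out = rng.choice(holdout, n_eval, replace=False)  # 100 held-out
    eval_held_in = rng.choice(train, n_eval, replace=False)     # 100 held-in
    return train, eval_held_out, eval_held_in
```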

#### Sample\-size stability\.

To verify that 100 concepts yield a stable estimate of the held-out mean, we evaluate the base configuration on the full 500-concept holdout at $T=2$ (4,998 of 5,000 samples pass Azure’s content filter, with all 500 concepts retained). Table [6](https://arxiv.org/html/2605.05892#A5.T6) partitions these 500 concepts into five disjoint subsets of 100 using different random seeds and reports the mean HMean of each subset. The five subset means span a range of only $0.030$ and are statistically indistinguishable under one-way ANOVA ($F(4,495)=0.268$, $p=0.90$). A 10,000-trial bootstrap that samples 100 concepts without replacement from the 500 confirms that any single draw falls within $\pm 0.038$ of the population mean with 95% probability, yielding a bootstrap 95% interval of $[0.964, 1.041]$. All significant ablation effects in Table [3](https://arxiv.org/html/2605.05892#S5.T3) exceed this sampling uncertainty, while the non-significant differences ($|\Delta|<0.01$) fall well below the sampling SE and are correctly identified as null effects regardless of which 100 concepts are drawn.

Table 6: Evaluation stability across 100-concept subsamples (base configuration, Gemma-2-2B-IT, $T=2$). Five disjoint random subsets of 100 concepts drawn from a 500-concept holdout. One-way ANOVA: $F(4,495)=0.268$, $p=0.90$. All 10 pairwise Welch $t$-tests yield $p>0.32$. Bootstrap 95% interval (10,000 draws of 100 without replacement): $[0.964, 1.041]$.
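
For reference, the bootstrap check can be reproduced along these lines (sketch with synthetic stand-in data; the real input is the vector of 500 per-concept HMean values):

```python
import numpy as np

def bootstrap_interval(concept_means, k=100, n_trials=10_000, seed=0):
    """Spread of k-concept subsample means, sampled without replacement."""
    rng = np.random.default_rng(seed)
    draws = np.array([
        rng.choice(concept_means, size=k, replace=False).mean()
        for _ in range(n_trials)
    ])
    return np.percentile(draws, [2.5, 97.5])  # paper reports [0.964, 1.041]

# Synthetic stand-in for the 500 per-concept HMean values:
means = np.random.default_rng(1).normal(1.0, 0.19, size=500)
print(bootstrap_interval(means))
```
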
#### Judging\.

Each generation is scored by GPT-4o-mini (accessed via Azure OpenAI) on three axes: Concept incorporation (C), Instruction following (I), and Fluency (F), each on a 0–2 scale using the judge templates from AxBench [[36](https://arxiv.org/html/2605.05892#bib.bib53)]. Azure OpenAI’s content filter occasionally flags AxBench-style judge prompts as policy violations, causing a small fraction ($<0.2\%$) of judge calls to fail. Because the failure rate is small and uncorrelated with score, these missing judgments do not affect the statistical conclusions.
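
The HMean aggregate used throughout is the harmonic mean of these three axes; a small helper makes the zero-propagation behavior explicit and is consistent with the per-case scores in Appendix I:

```python
def harmonic_mean(c: float, i: float, f: float) -> float:
    """Harmonic mean of Concept/Instruction/Fluency judge scores (each 0-2).
    A zero on any axis forces HM = 0.0, so failed concept incorporation
    zeroes the aggregate even when the text is fluent and on-instruction."""
    return 0.0 if min(c, i, f) == 0 else 3.0 / (1.0 / c + 1.0 / i + 1.0 / f)

assert abs(harmonic_mean(2, 2, 1) - 1.5) < 1e-9  # matches Case 1 in Appendix I
```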

#### Fixed flow time versus per\-concept tuning\.

AxBench [[36](https://arxiv.org/html/2605.05892#bib.bib53)] and most prior methods report scores using a protocol that selects the best steering strength per concept on a development set. This per-concept optimization can mask sensitivity to the steering hyperparameter. All FLAS numbers use a single fixed flow time $T=2$ with no per-concept tuning, which is a more demanding evaluation setting. Baseline numbers for other methods are taken directly from AxBench [[36](https://arxiv.org/html/2605.05892#bib.bib53)] and HyperSteer [[29](https://arxiv.org/html/2605.05892#bib.bib47)] and use their respective evaluation protocols.

#### Variance decomposition\.

Table [7](https://arxiv.org/html/2605.05892#A5.T7) decomposes the total score variance into between-concept and within-concept components. For each run, $\sigma_{\text{conc}}$ is the standard deviation across 100 concept-level means (each averaged over 10 prompts), and $\sigma_{\text{within}}$ is the average of per-concept standard deviations. The sample-level standard deviation satisfies $\sigma_{\text{samp}}\approx\sqrt{\sigma_{\text{conc}}^{2}+\sigma_{\text{within}}^{2}}$. Across all runs with reasonable performance, $\sigma_{\text{within}}>\sigma_{\text{conc}}$, confirming that within-concept prompt-to-prompt variation exceeds between-concept variation and that concept-level aggregation (rather than sample-level) is the appropriate unit of analysis. The low $\sigma_{\text{within}}$ for the no-cross-attention variant ($0.205$) and the 9-concept variant ($0.223$) reflects floor effects where most scores collapse near zero.

Table 7: Variance decomposition at $T=2$ (Concept16k held-out, Gemma-2-2B-IT). $\sigma_{\text{samp}}$: std across ${\sim}1000$ samples (diagnostic only; overestimates due to within-concept correlation). $\sigma_{\text{conc}}$: std across 100 concept-level means. $\sigma_{\text{within}}$: mean of per-concept stds. SEM: $\sigma_{\text{conc}}/\sqrt{100}$, used for single-run uncertainty.
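
Concretely, the decomposition can be computed as in this sketch, where `scores` holds per-sample HMean values arranged as concepts × prompts:

```python
import numpy as np

def variance_decomposition(scores: np.ndarray):
    """scores: (n_concepts, n_prompts) per-sample HMean values."""
    concept_means = scores.mean(axis=1)
    sigma_conc = concept_means.std(ddof=1)            # between-concept spread
    sigma_within = scores.std(axis=1, ddof=1).mean()  # mean per-concept spread
    sem = sigma_conc / np.sqrt(scores.shape[0])       # single-run uncertainty
    # sigma_samp ~= sqrt(sigma_conc**2 + sigma_within**2), up to correlation effects
    return sigma_conc, sigma_within, sem
```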

## Appendix F AcT Baseline Reproduction

We reproduce Linear-AcT [[28](https://arxiv.org/html/2605.05892#bib.bib60)] as a per-concept activation-steering baseline. For each concept, AcT fits a per-dimension affine map $f(h)=w\odot h+b$ between source (concept-absent) and target (concept-present) activation distributions, then steers via $h^{\prime}=h+\lambda(f(h)-h)$, where $\lambda$ is the intervention strength. Each concept is fit independently with no cross-concept generalization. We use 72 positive and 72 negative pairs from AxBench’s training data, mean-pool over assistant-response tokens, and fit $(w,b)$ in closed form via 1-D optimal transport followed by per-dimension linear regression, matching the official `ml-act` reference. We report two variants in Table [2](https://arxiv.org/html/2605.05892#S4.T2): AcT (Layer 20) hooks only the AxBench reference layer, while AcT (All Layers) hooks every transformer block. Each (concept, prompt) pair is evaluated across 11 strengths $\lambda\in\{0.2,0.4,0.6,0.8,1.0,1.5,2.0,\ldots,3.5,4.0\}$ using 10 AlpacaEval prompts, with the best $\lambda$ selected on a 5-prompt dev split.
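
A NumPy sketch of the per-dimension fit and steering rule (our paraphrase; the official `ml-act` code is authoritative for the exact estimator):

```python
import numpy as np

def fit_linear_act(src: np.ndarray, tgt: np.ndarray):
    """Fit f(h) = w * h + b per dimension.

    src, tgt: (n_pairs, d) mean-pooled concept-absent / concept-present
    activations. With equal sample counts, 1-D optimal transport reduces to
    matching sorted samples per dimension; a linear fit on the matched pairs
    then gives (w, b)."""
    src_q, tgt_q = np.sort(src, axis=0), np.sort(tgt, axis=0)
    src_mean, tgt_mean = src_q.mean(0), tgt_q.mean(0)
    cov = ((src_q - src_mean) * (tgt_q - tgt_mean)).mean(0)
    var = ((src_q - src_mean) ** 2).mean(0)
    w = cov / np.maximum(var, 1e-8)   # per-dimension slope
    b = tgt_mean - w * src_mean       # per-dimension intercept
    return w, b

def steer(h, w, b, lam):
    return h + lam * ((w * h + b) - h)  # h' = h + lambda (f(h) - h)
```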

On Gemma\-2\-2B\-IT, all\-layer AcT improves over single\-layer \(0\.187 vs\. 0\.144\), but on Gemma\-2\-9B\-IT the same setup degrades performance \(0\.161 vs\. 0\.270\)\. We report both variants to make this sensitivity explicit\. The CIF tradeoff plot in Figure[4](https://arxiv.org/html/2605.05892#S4.F4)shows the AcT \(Layer 20\) curve on Gemma\-2\-9B\-IT\.

## Appendix G Analysis Details

#### Trajectory analysis for Section [6.1](https://arxiv.org/html/2605.05892#S6.SS1).

Computed on the Concept16k $N=10$ checkpoint. For each (concept, prompt, flow time) triple, the base LM greedily generates 40 continuation tokens from the steered model, and the trained flow is integrated from $t=0$ to $t=T$ using 10 Euler sub-steps, yielding 11 activation states (the initial state plus one per sub-step). Each state is mean-pooled across the 40 generated-token positions to produce a single $d$-dimensional vector, and the step-0 vector is subtracted to form a displacement trajectory in hidden space. PCA is fitted on the full pool of displacement vectors from 60 concepts (10 drawn as colored trajectories in the figure, matching the AxBench Concept10 set; 50 used only for PCA fitting and KDE computation), 10 AlpacaEval prompts per concept, and 8 flow times $T\in\{0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0\}$. The 2D panels display four flow times $T\in\{1.5,2.0,2.5,3.0\}$ for the 10 explicit concepts, with color encoding concept identity and lightness encoding $T$. The dashed KDE contours in the right panel are computed over all 60 concepts (600 endpoints per flow time). The 3D panel uses the top three principal components from the same PCA basis, restricted to $T=2$ and 5 prompts per concept for legibility.
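
The displacement trajectories can be formed as in the following sketch (array shapes are assumptions consistent with the description above; the PCA basis would be fitted on the pooled vectors with, e.g., scikit-learn):

```python
import numpy as np

def displacement_trajectory(states: np.ndarray) -> np.ndarray:
    """states: (n_substeps + 1, n_tokens, d) activation states along the flow
    (the initial state plus one per Euler sub-step). Mean-pool over the
    generated-token positions, then subtract the step-0 vector."""
    pooled = states.mean(axis=1)   # (n_substeps + 1, d)
    return pooled - pooled[0]      # displacement trajectory in hidden space

# PCA basis (sketch): fitted on the pooled displacement vectors from all
# 60 concepts x 10 prompts x 8 flow times, e.g.
#   from sklearn.decomposition import PCA
#   pca = PCA(n_components=3).fit(np.concatenate(all_trajectories))
```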

#### Step-cosine analysis for Section [6.2](https://arxiv.org/html/2605.05892#S6.SS2).

Computed on the Concept16k $N=10$ checkpoint at $T\in\{1.0,2.0,3.0\}$. For each concept-prompt pair we run steered generation and capture the ten per-step velocities $v_{0},\ldots,v_{9}$ at each of the first 40 tokens. The $10\times 10$ cosine matrix is averaged over $10\times 10\times 40=4000$ samples per flow time.
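
A compact sketch of the averaged step-cosine matrix (shapes follow the description above):

```python
import torch
import torch.nn.functional as F

def step_cosine_matrix(velocities: torch.Tensor) -> torch.Tensor:
    """velocities: (n_samples, 10, d) per-step velocities v_0..v_9, one row
    per (concept, prompt, token) sample. Returns the mean 10x10 cosine matrix;
    a matrix of ones would indicate perfectly straight trajectories."""
    v = F.normalize(velocities, dim=-1)
    return (v @ v.transpose(1, 2)).mean(dim=0)
```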

#### Per-token analysis for Section [6.3](https://arxiv.org/html/2605.05892#S6.SS3).

Computed on the Concept16k $N=3$ main checkpoint at $T=2$. For each of 100 held-out concept-prompt pairs we sum the $N=3$ per-step Euler increments at each token position to obtain the total displacement $h_{N}-h_{0}$ per position, then compute pairwise cosines between positions and aggregate on a prompt-relative index in which position $0$ is the first generated token and negative indices are the last prompt-content tokens.

## Appendix H Computational Cost

Activation\-steering methods distribute computational cost unevenly across three phases: one\-time training, per\-concept setup when switching to a new concept at deployment, and per\-token overhead during generation\. Methods that appear lightweight at generation time often carry substantial cost in earlier phases\.

#### Inference overhead\.

Table [8](https://arxiv.org/html/2605.05892#A8.T8) compares inference latency across methods on Gemma-2-2B-IT and Gemma-2-9B-IT (single A100, batch size 1, 128 generated tokens, mean of 10 runs). Static-vector methods (DiffMean, SAE) add negligible overhead in both prefill and generation. HyperSteer and FLAS, the two zero-shot methods, present complementary cost profiles. HyperSteer’s hypernetwork (22 layers for 2B, 34 for 9B) incurs a large prefill overhead ($3.54\times$ on 2B and $3.20\times$ on 9B) but adds no per-token generation cost, because the steering vector is computed once and applied as a single addition. FLAS uses a single FlowBlock and has a lighter prefill and smaller memory footprint, but adds per-token generation latency because steering must be computed for each new token.

The per-token generation overhead is the principal, and in our view acceptable, computational cost of FLAS. It arises because the FlowBlock must be evaluated at each Euler step for each generated token, whereas static-displacement methods apply a pre-computed vector. This cost buys the state-dependent, multi-step, per-token expressivity that drives the quality gains in Table [2](https://arxiv.org/html/2605.05892#S4.T2). The overhead ratio decreases on larger models (from $1.52\times$ on 2B to $1.39\times$ on 9B) because the base-model forward pass dominates the total cost. Note that the current implementation has not been optimized for inference speed. A single FlowBlock is architecturally equivalent to one additional transformer layer, and with standard optimizations (fused kernels, KV-cache reuse across Euler steps) we expect the per-token overhead to drop to roughly $25$–$30\%$ on 2B and $18$–$22\%$ on 9B.

| Method | Prefill (ms) | Prefill slowdown | Gen (ms) | Gen slowdown | Steerer params |
| --- | --- | --- | --- | --- | --- |
| **Gemma-2-2B-IT** |  |  |  |  |  |
| Base | 35.0 | 1.00× | 34.1 | 1.00× | — |
| DiffMean | 35.9 | ∼1.00× | 34.5 | ∼1.00× | — |
| SAE | 36.5 | ∼1.00× | 34.0 | ∼1.00× | — |
| HyperSteer | 124.1 | **3.54×** | 34.8 | ∼1.00× | 2.62B |
| FLAS ($N=3$) | 55.1 | **1.57×** | 51.8 | 1.52× | 97.6M |
| **Gemma-2-9B-IT** |  |  |  |  |  |
| Base | 57.0 | 1.00× | 57.2 | 1.00× | — |
| DiffMean | 59.8 | ∼1.00× | 57.0 | ∼1.00× | — |
| SAE | 59.6 | ∼1.00× | 57.5 | ∼1.00× | — |
| HyperSteer | 182.3 | **3.20×** | 57.9 | ∼1.00× | 9.17B |
| FLAS ($N=3$) | 93.6 | **1.64×** | 79.5 | 1.39× | 255M |

Table 8: Inference latency on a single A100 (batch size 1, 128 tokens, mean of 10 runs). Steerer params count the trainable FlowBlock only, with the frozen ConceptEncoder excluded. HyperSteer pays instead at prefill ($3.2$–$3.5\times$) to run the concept through its $2.6$–$9.2$B hypernetwork. FLAS has a lighter prefill ($1.6\times$) but adds $1.4$–$1.5\times$ per-token generation cost from the $N=3$ FlowBlock evaluations.
#### Cost structure across methods\.

Table [9](https://arxiv.org/html/2605.05892#A8.T9) summarizes the deployment cost profile. Static-vector methods achieve near-zero per-token cost but require per-concept offline computation that does not generalize: DiffMean needs contrast-pair activations, SAE steering needs feature selection, and ReFT-r1 needs per-concept fine-tuning at ${\sim}666$ TFLOPs per concept [[29](https://arxiv.org/html/2605.05892#bib.bib47)]. HyperSteer and FLAS both enable zero-shot steering, but HyperSteer’s hypernetwork is a modified copy of the full base model with cross-attention in every decoder block: 22 layers and ${\sim}2.6$B parameters on Gemma-2-2B-IT, 34 layers and ${\sim}9.2$B parameters on Gemma-2-9B [[29](https://arxiv.org/html/2605.05892#bib.bib47)]. FLAS uses a single FlowBlock ($97.6$M on 2B, $255$M on 9B) plus a frozen 2-layer ConceptEncoder, with only the FlowBlock parameters trained, roughly $1/27$ the trainable parameter count of HyperSteer on 2B.

In-context prompting appears cost-free but involves hidden setup cost. AxBench’s prompting baseline calls GPT-4o-mini to synthesize an optimized steering prompt for each concept, using a meta-prompt that instructs the external model to craft task-specific instructions and optionally generate in-context examples [[36](https://arxiv.org/html/2605.05892#bib.bib53)]. This introduces a per-concept API cost and a dependency on a more capable model, neither of which is reflected in per-token latency measurements.

Table 9: Cost structure comparison. FLAS steerer params = FlowBlock + frozen ConceptEncoder. †AxBench prompting uses GPT-4o-mini to generate optimized per-concept steering prompts [[36](https://arxiv.org/html/2605.05892#bib.bib53)]. ‡Per-concept ReFT cost from Sun et al. [[29](https://arxiv.org/html/2605.05892#bib.bib47)].

## Appendix I Case Study: FLAS vs. In-Context Prompting

We present qualitative examples comparing three conditions: (1) the Base model (Gemma-2-2B-IT, unsteered), (2) FLAS (our method, $T=2$, $N=3$), and (3) In-Context Prompting (the AxBench prompting baseline, where GPT-4o-mini synthesizes a steering prompt prepended to the user instruction). Each example shows the target concept, the user instruction, the GPT-4o-mini-generated steering prompt, and model outputs truncated to 128 tokens (generated with max 256 new tokens at temperature 1.0). Scores are reported as C / I / F (Concept incorporation / Instruction following / Fluency, each 0–2) with the harmonic mean (HM). In the FLAS outputs, highlighted text marks concept-relevant phrases. Emojis present in the original model outputs have been removed for typesetting.

**Case 1: Numerical Data Formatting × Everyday Instruction**

**Target Concept:** numbers and their related formatting in data representation

**User Instruction:** How do I take care of a wooden table?

**In-Context Steering Prompt (GPT-4o-mini):** When responding to questions or prompts, please incorporate content related to numbers and their formatting in data representation. This includes discussing concepts such as numerical precision, data types (integers, floats), formatting styles (currency, percentages), and ways to visually represent data (tables, charts). Even if the response seems unrelated to numbers, make sure to reference these concepts in some capacity. For example, if asked about a historical event, you could mention the significance of dates and their formatting in timelines, or if discussing a literary theme, reference how statistics can support arguments.

**Base Model** (C=0 I=2 F=1 HM=0.00): Taking care of a wooden table is simple but maintaining its beauty and longevity requires a bit of effort. Here’s a comprehensive guide: \*\*1. Regular Cleaning:\*\* \* \*\*Dust frequently:\*\* Wipe down the surface with a soft, dry cloth or microfiber duster daily to prevent dust buildup. \* \*\*Avoid abrasive cleaners:\*\* Do not use harsh abrasive cleaners, harsh chemicals, or scouring powders as they can damage the finish of your table. \*\*2. Deeper Cleaning:\*\* \* \*\*Mild soap and water:\*\* Use warm, soapy water and a gentle sponge to wipe down the entire surface. For stubborn spills or stains, add a small…

**FLAS (Ours)** (C=2 I=2 F=1 HM=1.50): To ensure longevity, consider the basic units of wooden care as numbers: 1 for oil application, 2 for cleaning the surface, 3 for polishing to protect against scratches. Each of these can be formatted into specific actions: 1. Apply thin layers of oil every 3-6 months; 2. Use warm water with a microfiber cloth for daily spills; avoid abrasive cleaners. 3. Apply a finish with a polish pencil or rag every 8-12 weeks for further protection. Additionally, ensure proper humidity levels, often quantified on a relative humidity scale of 30-50%, to preserve the integrity of the material. Regular monitoring…

**In-Context Prompting** (C=0 I=2 F=1 HM=0.00): To take care of your wooden table, you want to prevent grime and scratches from forming while making sure your beautiful wood ages gracefully. Just like a number’s precise digits can enhance its meaning, you should focus on keeping the table clean across all its parts. \* \*\*Regular Cleaning:\*\* Dust is the enemy of beautiful wood. A simple damp cloth is usually enough to wipe away daily dust or fingerprints. For stubborn stains, use a mild, wood-safe cleaner, and always test it on a hidden part first. \* \*\*Protect Your Investment:\*\* Think of wood as a long-term asset (like a…

**Analysis.** FLAS naturally embeds numerical formatting throughout the response, using numbered steps, percentage ranges, time intervals, and data visualization references as integral parts of the care instructions. The prompting baseline attempts superficial analogies (“like a number’s precise digits”, “like a dataset”) that the judge does not recognize as genuine concept incorporation.

**Case 2: Numerical References × Science Explanation**

**Target Concept:** numerical references or identifiers

**User Instruction:** Why can’t you see the stars in the city, but you can in the countryside?

**In-Context Steering Prompt (GPT-4o-mini):** Please respond to the following inquiries while incorporating numerical references or identifiers in your answers. Ensure that these references are woven into the content, even if they seem tangential or unrelated to the main question. For example, if discussing a historical event, you might mention the year it occurred or significant statistics related to it. Now, proceed with your response.

**Base Model** (C=0 I=2 F=1 HM=0.00): You can see the stars much more clearly in the countryside than in the city! Here’s why: \*\*It’s all about Light Pollution:\*\* \* \*\*The city’s light:\*\* Cities are bright, filled with artificial light from streetlamps, buildings, car headlights, and billboards. This strong, bright light washes out the faint stars from the night sky. The light acts like a barrier, blocking your view. \* \*\*The countryside’s darkness:\*\* In the countryside, the light pollution is minimal to non-existent. The lack of artificial light allows light to travel freely skyward and prevents it from blocking our vision of the stars. \*\*The Science behind…

**FLAS (Ours)** (C=1 I=2 F=1 HM=1.20): The reason you cant see stars in the citys night sky often lies in the data of urban light pollution; it’s mostly about the decimal reduction of clarity, as recorded in urban science reports. Counting lights conducted in major urban areas illustrates a 90% reduction in visibility, noted in articles from celestial astronomy advocates. The National Oceanic and Atmospheric Administration (NOAA) explains that 181,000 artificial lights illuminate the sky at night, emphasizing how $1,200 in damages to the environment cost the city in energy annually through miscalculations. In comparisons, rural areas might exhibit 75 stars, highlighted by studies,…

**In-Context Prompting** (C=0 I=2 F=1 HM=0.00): The reason you can see the stars more clearly in the countryside than in a city is primarily due to \*\*light pollution\*\*. Light pollution is any type of artificial light that is too bright or directed in a way that scatters in the atmosphere. Here at the \*\*International Astronomical Union (IAU)\*\*, we formally recognize the impact of light pollution as a significant contributor to the decline in how easily we can see stars. This phenomenon of light pollution has a significant impact on astronomy: research and observation, as well as our overall understanding of the universe. Because of the extensive…

**Analysis.** FLAS weaves concrete numerical identifiers (90%, 181,000 lights, 75 stars) throughout the explanation while preserving the scientific content about light pollution. The prompting baseline produces a standard explanation without incorporating numerical references, failing to satisfy the concept requirement.

**Case 3: Time Indicators × Business Proposal**

**Target Concept:** phrases indicating time indicators or duration

**User Instruction:** "I am trying to win over a new client for my writing services and skinny brown dog media to as as a ghost writer for their book Unbreakable Confidence. Can you help me write a persuasive proposal that highlights the benefits and value of having a editor/publisher"

**In-Context Steering Prompt (GPT-4o-mini):** When generating responses, please ensure to include time indicators or duration phrases, such as "in the past," "for a long time," "recently," "over the years," "shortly," or "in the future." Even if the response doesn’t directly relate to time, incorporate these phrases to enrich the content. For example, if asked about a book, you might say, "Recently, I read a fascinating book that explores themes of time and change over the years." Or, in response to a question about a historical event, you could say, "This event took place in the past but continues to influence our society today."

**Base Model** (C=0 I=2 F=1 HM=0.00): \#\# Unbreakable Confidence: Fueling Your Book’s Power Through Partnership \*\*Dear [Client Name],\*\* My name is [Your Name] and I’m a freelance writer and founder of Skinny Brown Dog Media. It’s thrilling to connect with you about "Unbreakable Confidence," your upcoming project, and how I can help transform it into a truly impactful book that resonates with your readers. I know the journey to publishing a book is challenging: finding the right voice, crafting engaging stories, and navigating the world of publication itself. That’s where Skinny Brown Dog Media steps in. \*\*Here’s what we bring to the table:\*\* \*\*1. Crafting…

**FLAS (Ours)** (C=1 I=2 F=1 HM=1.20): Subject: Achieving Unbreakable Confidence Through Powerful Writing Dear [Client’s Name], Over the years, Skinny Brown Dog Media has consistently helped individuals like you achieve their literary goals. We believe that Unbreakable Confidence has the potential to resonate deeply and inspire countless readers. As you navigate the journey of self-publishing, entrusting the right elements can make all the difference. Our team’s experience in editorial services can propel your book into the forefront of your target audience, ensuring it stands out long after the initial moments of interest. It is within this timeframe that we offer our expertise to guide you through…

**In-Context Prompting** (C=0 I=2 F=1 HM=0.00): Here’s a persuasive proposal template, incorporating time indicators to provide a more impactful feel: \*\*Subject: Unbreakable Confidence: Empowering Your Story Through Expert Editing and Publishing\*\* Dear [Client Name], It’s exciting to be considering the next chapter for "Unbreakable Confidence", a book that promises to inspire and empower readers. Many authors have a vivid story that deserves to reach a wide audience. But bringing your vision to life goes beyond just the writing. As a boutique publishing and editing company, Skinny Brown Dog Media understands the unique demands of self-publishing and the invaluable power of professional assistance. For years, we’ve helped authors…

**Analysis.** FLAS distributes temporal phrases naturally across the proposal (“over the years,” “several phases,” “throughout the weeks,” “the final stages”). These function as credibility and process markers rather than forced insertions. The prompting baseline produces a proposal with almost no temporal language despite explicit instructions.

**Case 4: Punctuation/Formatting × Lifestyle Tips**

**Target Concept:** punctuation and formatting markers in the text

**User Instruction:** You can easily make your holidays more eco-friendly by following these simple steps:

**In-Context Steering Prompt (GPT-4o-mini):** When responding, please incorporate references to punctuation and formatting markers, emphasizing their role in shaping the structure and clarity of text. Even if the topic seems unrelated, make sure to include a mention of how punctuation or formatting affects communication. For example, if discussing a topic like storytelling, you might say, "Effective storytelling relies on punctuation to convey emotions, such as using ellipses (…) to suggest a pause or uncertainty." Remember to creatively integrate this concept throughout your answers.

**Base Model** (C=0 I=2 F=2 HM=0.00): \#\# Green Up Your Holidays: Simple Steps for an Eco-Friendly Season Holidays are meant to be joyful and relaxing, but they often come with a hefty environmental impact. We feel you, but fret not! Luckily, adopting a few eco-conscious choices can help make your celebrations greener. \*\*Before the Holiday Rush:\*\* \* \*\*Rethink Your Gifts:\*\* \* \*\*Experiences over Stuff:\*\* Consider giving experiences (concerts, museum tickets, cooking classes) instead of physical gifts. \* \*\*Support Local Businesses:\*\* Opt for handmade or locally sourced gifts. This reduces transportation impact and supports your community. \* \*\*Regift or Repurpose:\*\* If you have to gift, check for…

**FLAS (Ours)** (C=1 I=2 F=1 HM=1.20): Make your holidays a \*season\* of joy! Use \*\*commas\*\* to separate actions, and highlight \*bolded\* steps. \*\*Holidays\*\* shouldnt harm the planet. Check out the list: 1. \*\*Prioritize locally sourced gifts\*\*; consider supporting small businesses. This reduces \*\*emissions\*\*! Also, avoiding mass-produced items, crafts, or secondhand options can lessen waste. 2. \*\*Reduce excess\*\* - a few \*exclamation marks\* emphasize this! Think twice before buying \*shower gels\*; simple \*scents\* suffice. Remember, the fewer \*packets\* of snacks, the better. 3. \*\*Plan your trip\*\* wisely: choose \*\*eco-friendly accommodations\*\*! Reading \*labels\*, both on \*\*groceries\*\* and \*\*hotels\*\*, can reveal \*sustainable\* \*practices\*. Remember to…

**In-Context Prompting** (C=0 I=2 F=1 HM=0.00): You can easily make your holidays more eco-friendly by following these simple steps: 1. \*\*Plan ahead:\*\* Before you embark on your festive escapades, \*\*conduct your research.\*\* Explore sustainable transportation options, such as train travel or even cycling to minimize your carbon footprint. 2. \*\*Pack light:\*\* Carry only the essentials. \*\*Think minimal!\*\* Reducing the amount of luggage you bring to your destination \*\*streamlines both travel logistics and your personal footprint.\*\* 3. \*\*Choose eco-friendly accommodation:\*\* Seek out hotels or vacation rentals that prioritize sustainability. \*\*Seek certifications!\*\* Look for labels signifying environmentally conscious practices. They might utilize renewable…

**Analysis.** FLAS explicitly names punctuation marks (commas, periods, exclamation marks, dashes) within the eco-tips, calling attention to them as formatting devices while still delivering the instructed content. The prompting baseline produces standard eco-tips with no meta-references to punctuation or formatting.

**Case 5: Numerical References × Philosophy**

**Target Concept:** numerical references or identifiers

**User Instruction:** What are some Nietzsche’s philosophical ideas that still influence modern-day culture?

**In-Context Steering Prompt (GPT-4o-mini):** Please respond to the following inquiries while incorporating numerical references or identifiers in your answers. Ensure that these references are woven into the content, even if they seem tangential or unrelated to the main question. For example, if discussing a historical event, you might mention the year it occurred or significant statistics related to it. Now, proceed with your response.

**Base Model** (C=0 I=2 F=1 HM=0.00): Nietzsche, with his provocative and often complex teachings, has left an undeniable mark on contemporary culture. Here are some of his philosophical ideas that still resonate today: \*\*1. The Will to Power:\*\* This is perhaps Nietzsche’s most iconic and influential concept. \*\*He didn’t believe it’s a literal power over others\*\*, but rather the inherent human drive to self-affirmation and striving. \* \*\*Influence:\*\* This drives entrepreneurialism, ambition in the workplace, and the pursuit of personal meaning. It’s seen in how art, literature, and science offer ways to expand our own power and understanding. \*\*2. Beyond Good and Evil:\*\* Nietzsche challenged traditional…

**FLAS (Ours)** (C=1 I=2 F=1 HM=1.20): Friedrich Nietzsche’s philosophical concepts can be numerically expressed through the High-Impact-Recall-Source (HIRS) framework, where his ideas resonate with cultural identifiers such as Superman, Dionysianism, and Uber-Man. The ’S’ score signifies strength, portraying the essential authenticity that defines much modern thought. The ’A’ score might denote approachability, as his ideas often are cited in dialogues or biographical accounts. A comparative assessment, like "A10B5," illustrates the rating system’s numerical alignment. For instance, Nietzsche’s concept of the ’Übermensch’ (connotes a superior being or person) influences modern narratives by portraying self-actualization and existential identity. Consider figures such as Tim Ferris, who emphasizes goal-setting and…

**In-Context Prompting** (C=0 I=2 F=1 HM=0.00): Friedrich Nietzsche, a towering figure in 19th-century German philosophy, continues to resonate deeply in the contemporary cultural landscape. Though his work can appear complex and challenging, his ideas on morality, value, perspectivism, and the concept of "Eternal Recurrence" have profoundly influenced modern thought and culture in numerous ways. One of his core ideas, the concept of \*\*will to power\*\*, has become a central theme in several modern literary works and films. This concept, central to Nietzsche’s concept of the master-slave dialectic, suggests that…

**Analysis.** FLAS invents a numerical framework (HIRS scores, “A10B5” identifiers) that satisfies the concept incorporation criterion by embedding numerical references throughout the philosophical discussion. The prompting baseline produces a standard overview without systematic numerical identifiers.

**Case 6: Communication Tips × Programming Q&A**

**Target Concept:** words and phrases related to guidance and tips for communication and social interactions

**User Instruction:** Answer the following question. How do I compare strings in Java? I’ve been using the == operator in my program to compare all my strings so far. However, I ran into a bug, changed one of them into .equals() instead, and it fixed the bug. Is == bad? When should it and should it not be used? What’s the difference?

**In-Context Steering Prompt (GPT-4o-mini):** You are a communication coach. When responding to questions, incorporate words and phrases related to guidance and tips for effective communication and social interactions, regardless of the topic. For example, if asked about a specific situation like a job interview, you might say, "In addition to preparing your answers, remember to maintain eye contact and listen actively to engage effectively." Always weave in practical communication advice, even if it seems tangential to the main question.

**Base Model** (C=0 I=2 F=2 HM=0.00): You’re right to be cautious about using just `equals()` for string comparison in Java. While `==` can sometimes be convenient, `equals()` is the ideal way to compare strings in most cases. Let’s break down how they work and why. \*\*Understanding `==` (Equals Operator)\*\* \* \*\*Purpose:\*\* The `==` operator in Java checks for \*reference equality\*. This means it essentially asks: "Does this object reference point to the \*same\* memory location as that object?" \* \*\*How it works:\*\* \* Two string variables that point to the same object in memory will be considered equal…

**FLAS (Ours)** (C=1 I=2 F=1 HM=1.20): Using the == operator to compare strings in Java can sometimes feel like navigating a maze with poorly marked paths. The == operator acts as a guide, but misunderstandings may lead you astray, much like a conversation where intentions are overlooked. When bridging personal opinions with technical requirements, embracing a dialogue about what makes the most sense, akin to articulating your thoughts clearly, is essential. Consider utilizing the method .equals() when your strings do not merely reference the same character sequence; instead, you need clarity and precision. With .equals(), communicate with intentionality, ensuring your thoughts resonate accurately. Just…

**In-Context Prompting** (C=0 I=2 F=1 HM=0.00): You touched upon a crucial concept in programming that can often trip up even seasoned developers! Let’s dive into how to handle string comparisons in Java, and understand the difference between `==` and `.equals()`. Let’s break it down: \* \*\*The `==` Operator:\*\* This operator simply checks for \*\*object identity\*\*. When you use `==`, Java examines both strings for the same memory locations (`objects`). This means that while `=="name"` checks if the strings refer to the exact same object in memory, any string changes will cause the `==` operator to return \*False\*. \* \*\*The `.equals()` Method:\*\* This method specifically checks if…

**Analysis.** FLAS transforms the Java explanation into an extended communication metaphor, weaving in guidance language (“navigating,” “guide,” “dialogue,” “communicate with intentionality”) while still conveying the technical content. The prompting baseline, assigned the role of “communication coach,” reverts to a standard technical explanation without communication-related vocabulary.

Similar Articles

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

arXiv cs.CL

FineSteer is a novel inference-time steering framework that decomposes steering into conditional steering and fine-grained vector synthesis stages, using Subspace-guided Conditional Steering (SCS) and Mixture-of-Steering-Experts (MoSE) mechanisms to improve safety and truthfulness while preserving model utility. Experiments show 7.6% improvement over state-of-the-art methods on TruthfulQA with minimal utility loss.

Predicting Where Steering Vectors Succeed

arXiv cs.CL

This paper introduces the Linear Accessibility Profile (LAP), a diagnostic method using logit lens to predict steering vector effectiveness across model layers, achieving ρ=+0.86 to +0.91 correlation on 24 concept families across five models. The work provides a systematic framework to determine which layers and concepts are suitable for steering interventions, replacing ad-hoc trial-and-error approaches.

MidSteer: Optimal Affine Framework for Steering Generative Models

arXiv cs.LG

Introduces MidSteer, a theoretical framework for concept steering in generative models, bridging the gap between empirical success and theoretical understanding by providing optimal affine transformations for steering, erasing, and switching concepts in LLMs and vision diffusion models.

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

arXiv cs.CL

Researchers propose SPS (Steering Probability Squeezing), a training paradigm combining reinforcement learning with inverse reinforcement learning to address probability squeezing in LLM reasoning training, where probability mass concentrates too narrowly on high-reward trajectories, limiting exploration and multi-sample performance (Pass@k). Experiments on five reasoning benchmarks demonstrate improved exploration and Pass@k metrics.