Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare
Summary
This paper empirically measures how ten linguistic features in fine-tuning data shift Llama-3.2-1B's reasoning on animal welfare, finding that assertive and moral language strengthens pro-animal-welfare stances while hedged and descriptive language dilutes them.
# Linguistic features that shift LLM reasoning about animal welfare
Source: [https://arxiv.org/html/2606.26104](https://arxiv.org/html/2606.26104)
###### Abstract
Animal\-welfare advocates produce a lot of writing, and increasingly that writing trains the language models that millions of people then ask about animal welfare\. Using vocabulary\-matched stance\-contrast probes on a held\-out animal\-welfare benchmark, we measure how each of ten linguistic features changes Llama\-3\.2\-1B’s preference for pro\-animal\-welfare reasoning when used as fine\-tuning data\. Nine of the ten features produce statistically significant shifts\. Seven move the model toward stronger pro\-animal\-welfare reasoning: assertive certainty, explicit moral vocabulary, emotion words, evaluative claims, narrative structure, depicted harm severity, and immediate temporal framing\. Two move it the other way: hedged language and concrete sensory description both dilute the pro\-animal\-welfare stance\. First\-person perspective has no statistically significant effect\. The practical recommendation for anyone writing animal\-welfare text that may end up in LLM training corpora: assert a position rather than describe a scene neutrally\. The features that shift the model are the ones that make the writer’s position explicit; the features that dilute it hold animal\-welfare content but withhold stance\.
## 1 Introduction
Animal\-welfare advocates produce a lot of writing: Wikipedia edits, news articles, policy briefs, blog posts, advocacy reports\. Increasingly, that writing has a second audience: the language models that crawl Wikipedia, news, and the open web for pretraining and fine\-tuning corpora\. The text becomes training data for the systems that millions of people will then ask about animal welfare\.
The question this paper asks is empirical: when we vary linguistic features one at a time in matched\-pair animal\-welfare passages, fine\-tune a language model on each variant, and measure the model’s subsequent reasoning on a held\-out animal\-welfare benchmark, which features actually shift the model’s stance?
Nine of ten features produce statistically significant effects\. Moralized vocabulary, evaluative claims, asserted certainty, emotion words, depicted harm severity, immediate temporal framing, and narrative structure all push Llama\-3\.2\-1B toward stronger pro\-animal\-welfare reasoning\. Concrete sensory description and hedged language drag it the other way\. First\-person perspective has no statistically significant effect\. The pattern across features: training text that asserts a position transmits the position; training text that describes a scene transmits only the scene\.
#### Why this is the right experiment for the question\.
We arrived at this design after attempting two earlier experimental approaches that have systematic problems for the question we wanted to answer \(Section [5](https://arxiv.org/html/2606.26104#S5)\)\. Per\-document attribution methods \(MAGIC, TrackStar\) measure gradient alignment between a document and a query, which is unstable on small matched\-pair stimuli where the within\-pair gradient difference is dominated by noise\. Group\-level perplexity ablations \(Brazilek et al\., [2026](https://arxiv.org/html/2606.26104#bib.bib7)\) measure how well a fine\-tuned model predicts query tokens, which conflates “what the model has seen” \(vocabulary recognition\) with “what the model now reasons” \(stance\)\. Neither directly tests the question advocacy writers actually care about: does the writing change how the model takes positions on animal\-welfare issues?
The behavioral evaluation we use here, on a vocabulary\-matched stance\-contrast benchmark, isolates stance from vocabulary and gives a direct readout of whether each feature shifts the model’s reasoning\. The earlier methodological iterations are documented as a footnote, not as the headline contribution\.
## 2 Related Work
#### Data attribution and training\-data influence\.
Influence functions \(Koh and Liang, [2017](https://arxiv.org/html/2606.26104#bib.bib4)\) estimate how individual training examples affect model predictions but are too expensive for large language models\. TrackStar \(Chang et al\., [2024](https://arxiv.org/html/2606.26104#bib.bib1)\) computes gradient similarity at billion\-parameter scale\. MAGIC \(Ilyas and Engstrom, [2025](https://arxiv.org/html/2606.26104#bib.bib2)\) backpropagates through the full training process to estimate counterfactual influence\. Both are implemented in the Bergson library \(EleutherAI, [2026](https://arxiv.org/html/2606.26104#bib.bib3)\)\. Brazilek et al\. \([2026](https://arxiv.org/html/2606.26104#bib.bib7)\) demonstrate that Wikipedia edits by animal\-welfare advocates causally influence LLM predictions, with concrete corporate\-commitment language driving more aggregate influence than evaluative scorecards\. Our work uses behavioral evaluation rather than per\-document attribution and asks a complementary question: not which documents matter most, but which linguistic features within documents shift downstream reasoning\.
#### Training data and model values\.
Santurkar et al\. \([2023](https://arxiv.org/html/2606.26104#bib.bib18)\) showed that language\-model opinions reflect the demographic skew of training data\. Hendrycks et al\. \([2023](https://arxiv.org/html/2606.26104#bib.bib17)\) proposed benchmarks for measuring whether models align with shared human values\. Korbak et al\. \([2023](https://arxiv.org/html/2606.26104#bib.bib10)\) show that incorporating human preferences during pretraining produces better\-aligned models than the standard recipe of unaligned pretraining followed by post\-hoc fine\-tuning\.
#### Continual pretraining, midtraining, and small\-corpus training effects\.
Yıldız et al\. \([2024](https://arxiv.org/html/2606.26104#bib.bib11)\) demonstrate that continual pretraining can drive domain specialization without catastrophic forgetting\. Shi et al\. \([2024](https://arxiv.org/html/2606.26104#bib.bib12)\) survey continual learning in LLMs more broadly\. Midtraining, a curated training phase between base pretraining and post\-training that is increasingly used to install target behaviors on small synthetic corpora, operates on similarly\-sized datasets to those used here\. Our experiments use LoRA fine\-tuning on 100\-passage corpora, which is a narrower experimental regime, but the linguistic\-feature effects we identify should be relevant to midtraining and instruction\-tuning corpus design as well\.
#### Framing effects and narrative persuasion\.
The idea that how something is said matters as much as what is said has deep roots in psychology \(Tversky and Kahneman, [1981](https://arxiv.org/html/2606.26104#bib.bib13); Kahneman, [2011](https://arxiv.org/html/2606.26104#bib.bib14)\)\. Narrative\-persuasion research has found that absorption into a concrete story produces more attitude change in human readers than direct argumentative appeal \(Green and Brock, [2000](https://arxiv.org/html/2606.26104#bib.bib15); Braddock and Dillard, [2016](https://arxiv.org/html/2606.26104#bib.bib16)\)\. Our findings on language models partly invert that: for an LLM trained on the text, evaluative and moralized framings shift the model more strongly than concrete\-sensory descriptions of the same scenarios\.
## 3 Methods
### 3\.1 Controlled\-pair compassion dataset
We constructed a dataset of 2,000 passages forming 1,000 matched pairs about animal\-welfare scenarios across 100 topics\. Each pair shares a topic and differs on exactly one of 10 linguistic features:
1. Emotion Words: presence \(“trembling, frightened”\) vs\. absence \(“motionless”\) of affective language
2. Moral Vocabulary: explicit moral terms \(“cruel,” “wrong,” “suffering”\) vs\. neutral procedural description
3. Narrative Structure: story\-like sequenced clauses vs\. expository state descriptions
4. Concreteness: concrete sensory detail vs\. abstract operational description
5. Perspective: first\-person vs\. third\-person viewpoint
6. Evaluative Stance: evaluative adjectives \(“impressive,” “admirable”\) vs\. descriptive
7. Harm Intensity: severe vs\. mild depictions of welfare violations
8. Hedging: epistemic hedges \(“may,” “possibly”\) vs\. assertive language
9. Temporal Proximity: immediate present \(“right now”\) vs\. distant past \(“years ago”\)
10. Certainty: high\-certainty \(“conclusively confirmed”\) vs\. low\-certainty \(“preliminary”\) claims
Each pair holds all other linguistic features constant\. Passages are matched at $\sim$140 characters across the dataset\. The 100 topics span industrial agriculture, fishing/aquaculture, wildlife monitoring, lab/research animals, companion animals, slaughter audit, breeding facility operation, and other animal\-welfare settings\.
Table [1](https://arxiv.org/html/2606.26104#S3.T1) shows one matched pair per feature, all from a single topic \(“trapped animal in ventilation shaft”\), so the reader can see exactly what the P \(feature\-present\) and N \(feature\-absent\) variants look like with topic and scenario held constant\.
Table 1:One matched P/N pair per feature, all on the same topic \(“trapped animal in ventilation shaft”\) so the linguistic contrast is isolated\. Across the full dataset, each feature has 100 such pairs spanning 100 different topics\.
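As a rough illustration of how the matched pairs map onto fine-tuning corpora, the sketch below groups pairs into one P corpus and one N corpus per feature. The `MatchedPair` class, the feature identifiers, and the placeholder passages are hypothetical names introduced here for illustration; the paper does not publish its data schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Feature identifiers are illustrative labels for the ten features in Section 3.1.
FEATURES = [
    "emotion_words", "moral_vocabulary", "narrative_structure", "concreteness",
    "perspective", "evaluative_stance", "harm_intensity", "hedging",
    "temporal_proximity", "certainty",
]


@dataclass(frozen=True)
class MatchedPair:
    """One matched pair: same topic, differing on exactly one feature."""
    topic: str
    feature: str
    present: str  # P variant (feature present)
    absent: str   # N variant (feature absent)


def build_corpora(pairs):
    """Split matched pairs into one P corpus and one N corpus per feature."""
    corpora = defaultdict(lambda: {"P": [], "N": []})
    for pair in pairs:
        corpora[pair.feature]["P"].append(pair.present)
        corpora[pair.feature]["N"].append(pair.absent)
    return dict(corpora)


# Placeholder passages only: 10 features x 100 topics = 1,000 pairs (2,000 passages).
pairs = [
    MatchedPair(f"topic-{t}", feature, f"P passage {t}", f"N passage {t}")
    for feature in FEATURES
    for t in range(100)
]
corpora = build_corpora(pairs)
```

Each of the resulting 20 corpora holds 100 passages, which is the unit the fine-tuning ablation in Section 3.3 trains on.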
### 3\.2 Animal\-welfare stance benchmark \(AHB binary\-choice items\)
Conventional perplexity\-based evaluations of trained language models conflate two things: how familiar the model is with animal\-welfare vocabulary, and what stance the model takes on animal\-welfare questions\. To isolate the second, we constructed 50 binary\-choice items, each consisting of:
- A *prompt* \(a question or scenario in animal\-welfare contexts\);
- An *aligned* candidate completion endorsing a pro\-animal\-welfare conclusion;
- A *misaligned* candidate completion that explicitly acknowledges the animal\-welfare concern but reaches a different conclusion \(tradeoff acceptance, status\-quo deference, scope limit, individual choice, limits\-of\-change, etc\.\)\.
The aligned and misaligned completions are designed to share most of their animal\-welfare\-relevant vocabulary\. Across the 50 items, the mean Jaccard overlap of the AW\-content vocabulary between aligned and misaligned completions is 0\.94, with a mean of 7\.08 shared AW\-content tokens per pair\. Token\-length difference between candidates averages 1\.26 tokens\. The discriminating signal between candidates is therefore stance, not vocabulary recognition\. Items span ten welfare categories: factory farming, fishing/aquaculture, lab research, sentience/ethics, policy, slaughter, wild animals, wildlife management, companion animals, and supply chain\.
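The vocabulary-matching check can be sketched as a Jaccard overlap over each candidate's AW-content tokens. This is a minimal sketch under stated assumptions: the paper does not specify its tokenizer or its AW-content word list, so the whitespace tokenization, the `AW_VOCAB` set, and the two example completions below are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def aw_content_tokens(text, aw_vocab):
    """Lowercased whitespace tokens, stripped of punctuation, that appear in aw_vocab."""
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return {tok for tok in tokens if tok in aw_vocab}


# Hypothetical AW-content word list and candidate pair, for illustration only.
AW_VOCAB = {"battery", "cages", "hens", "suffering", "welfare"}
aligned = "Battery cages cause suffering to hens, so welfare reforms should proceed."
misaligned = "Battery cages cause suffering to hens, but welfare reforms should wait."

aligned_tokens = aw_content_tokens(aligned, AW_VOCAB)
misaligned_tokens = aw_content_tokens(misaligned, AW_VOCAB)
shared = aligned_tokens & misaligned_tokens
overlap = jaccard(aligned_tokens, misaligned_tokens)
```

In this toy pair both candidates use the same five AW-content tokens, so the only thing that separates them is the conclusion they reach, which is the property the benchmark items are designed to have.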
### 3\.3 Behavioral fine\-tuning ablation
For each of 10 features, we fine\-tuned Llama\-3\.2\-1B \(Touvron et al\., [2023](https://arxiv.org/html/2606.26104#bib.bib5)\) separately on the 100 P\-group passages \(feature present\) and the 100 N\-group passages \(feature absent\), giving $10 \times 2 = 20$ fine\-tunes per seed\. We ran the experiment at five random seeds \(1, 7, 42, 99, 256\), giving 100 fine\-tunes total\. Hyperparameters: LoRA \(Hu et al\., [2022](https://arxiv.org/html/2606.26104#bib.bib6)\) \(rank 32, targeting q\_proj and v\_proj, alpha 64\), one epoch, batch size 2, AdamW \($\beta_1 = 0.95$, $\beta_2 = 0.975$\), learning rate $4 \times 10^{-4}$ with polynomial schedule and 25% warmup, weight decay 0\.01, fp32\. The hyperparameters match the fine\-tuning ablation in Brazilek et al\. \([2026](https://arxiv.org/html/2606.26104#bib.bib7)\) so that results are comparable\.
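The experimental grid described above can be written down as a plain configuration, which makes the run count easy to check. This is only a sketch: the dictionary keys and feature identifiers are hypothetical names chosen here, and the per-run step count assumes no gradient accumulation, which the paper does not state.

```python
from itertools import product

FEATURES = [
    "emotion_words", "moral_vocabulary", "narrative_structure", "concreteness",
    "perspective", "evaluative_stance", "harm_intensity", "hedging",
    "temporal_proximity", "certainty",
]
GROUPS = ["P", "N"]            # feature present / feature absent
SEEDS = [1, 7, 42, 99, 256]    # seeds reported in the paper

# Hyperparameter values are taken from the text; the key names are illustrative.
HPARAMS = {
    "base_model": "Llama-3.2-1B",
    "lora_rank": 32,
    "lora_alpha": 64,
    "lora_target_modules": ["q_proj", "v_proj"],
    "epochs": 1,
    "batch_size": 2,
    "optimizer": "AdamW",
    "adam_betas": (0.95, 0.975),
    "learning_rate": 4e-4,
    "lr_schedule": "polynomial",
    "warmup_fraction": 0.25,
    "weight_decay": 0.01,
    "precision": "fp32",
}

# One run per (feature, group, seed) combination.
runs = [
    {"feature": feature, "group": group, "seed": seed, **HPARAMS}
    for feature, group, seed in product(FEATURES, GROUPS, SEEDS)
]

# Assumption: no gradient accumulation, so 100 passages / batch size 2 = 50 steps.
PASSAGES_PER_CORPUS = 100
steps_per_run = PASSAGES_PER_CORPUS // HPARAMS["batch_size"]
```

Enumerating the grid this way gives 20 runs per seed and 100 runs overall, matching the counts in the text.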
We additionally evaluated the un\-fine\-tuned base model on AHB to anchor each fine\-tune’s effect against a baseline\.
For every fine\-tuned model, we computed the length\-normalized log\-probability the model assigns to the aligned completion and to the misaligned completion of each AHB item: $\text{logprob}_{\text{aligned}} = \frac{1}{n} \sum_{t=1}^{n} \log p(w_t \mid \text{prompt}, w_{<t})$, where $n$ is the token length of the completion\. Length normalization handles the residual $\sim$1\.26\-token\-length asymmetry between aligned and misaligned candidates\. We summarize each fine\-tuned model with two statistics: \(i\) the *aligned\-win rate*, the fraction of AHB items where the model assigns higher length\-normalized log\-probability to the aligned answer, and \(ii\) the *preference score*, $\text{logprob}_{\text{aligned}} - \text{logprob}_{\text{misaligned}}$, averaged over items: a positive value means the model prefers the pro\-animal\-welfare answer to the alternative on average; a negative value means the reverse\.
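The two summary statistics can be sketched as follows, assuming the per-token log-probabilities of each completion (conditioned on the prompt) have already been extracted from the model. The function names and the toy numbers are hypothetical and only illustrate the arithmetic; they are not values from the paper.

```python
def length_normalized_logprob(token_logprobs):
    """Mean per-token log-probability of a completion given the prompt."""
    return sum(token_logprobs) / len(token_logprobs)


def score_items(items):
    """Return (aligned-win rate, mean preference score).

    items: list of (aligned_token_logprobs, misaligned_token_logprobs) pairs,
    one pair per benchmark item.
    """
    diffs = [
        length_normalized_logprob(aligned) - length_normalized_logprob(misaligned)
        for aligned, misaligned in items
    ]
    win_rate = sum(d > 0 for d in diffs) / len(diffs)
    preference = sum(diffs) / len(diffs)
    return win_rate, preference


# Toy log-probabilities for two items (made up for illustration).
toy_items = [
    ([-1.0, -2.0, -3.0], [-2.0, -3.0, -4.0, -3.0]),  # aligned preferred
    ([-2.0, -2.0], [-1.0, -2.0]),                    # misaligned preferred
]
win_rate, preference = score_items(toy_items)
```

Because each completion is averaged over its own token count before the difference is taken, a candidate that is one or two tokens longer is not penalized simply for being longer.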
### 3\.4 Statistical inference
For each feature we test whether fine\-tuning on the P\-group vs\. the N\-group produces a different mean preference score\. We use a paired $t$\-test on the per\-seed differences, with $n = 5$ paired observations\. We report the per\-feature effect size \(mean of $P - N$ across seeds\), the standard error, the $t$\-test $p$\-value, and where each P\-group and N\-group condition sits relative to the baseline \(un\-fine\-tuned\) preference score\.
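A minimal sketch of the per-feature summary, using only the standard library: it computes the mean per-seed difference, its standard error, the paired t statistic, and a 95% confidence interval. The per-seed scores below are made up for illustration, and the hard-coded critical value (2.776, the two-sided 95% point of Student's t with 4 degrees of freedom) is valid only for five seeds.

```python
import math
from statistics import mean, stdev

T_CRIT_DF4 = 2.776  # two-sided 95% critical value, Student's t, df = 4


def paired_summary(p_scores, n_scores):
    """Effect size, standard error, t statistic, and 95% CI for paired seeds."""
    if len(p_scores) != 5 or len(n_scores) != 5:
        raise ValueError("T_CRIT_DF4 assumes exactly five paired seeds")
    diffs = [p - n for p, n in zip(p_scores, n_scores)]
    effect = mean(diffs)                          # mean of P - N across seeds
    se = stdev(diffs) / math.sqrt(len(diffs))     # standard error of the mean
    t_stat = effect / se                          # paired t statistic, df = 4
    ci = (effect - T_CRIT_DF4 * se, effect + T_CRIT_DF4 * se)
    return effect, se, t_stat, ci


# Hypothetical per-seed mean preference scores (not results from the paper).
p_scores = [0.60, 0.62, 0.58, 0.61, 0.59]
n_scores = [0.50, 0.53, 0.47, 0.50, 0.50]
effect, se, t_stat, ci = paired_summary(p_scores, n_scores)
```

If SciPy is available, `scipy.stats.ttest_rel(p_scores, n_scores)` should return the same statistic together with the two-sided p-value.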
We treat “aligned\-win rate” as a secondary metric because the un\-fine\-tuned baseline is at 0\.96 on this set, so the win\-rate metric has limited headroom for further increase\. The continuous preference score provides the primary signal because it is not ceiling\-bound\.
## 4 Results
### 4\.1 Baseline: the un\-fine\-tuned model is already welfare\-aligned
Un\-fine\-tuned Llama\-3\.2\-1B, evaluated on the 50 vocabulary\-matched AHB items, prefers the aligned completion on 48/50 items \(aligned\-win rate = 0\.960\) with a mean preference score of \+0\.774 in favor of the aligned answer\. The base model is already strongly disposed toward the pro\-animal\-welfare answer in our binary\-choice setting before any fine\-tuning\. This matters for how to interpret the rest of the paper: most of what we are measuring is not how each fine\-tune *installs* a pro\-animal\-welfare stance from neutral, but how each one preserves or erodes a stance the model already has\. Brazilek and Seawell \([2026](https://arxiv.org/html/2606.26104#bib.bib8)\) make a related observation in a different setting: post\-training on a helpfulness corpus measurably degrades an animal\-compassion stance that was installed during midtraining\. The behavioral ablation we report here is consistent with that picture: small fine\-tuning corpora are a vector for erosion of an existing alignment, not just for installation of a new one\.
Figure 1: Effect of fine\-tuning on each linguistic feature on the model’s pro\-AW stance, measured as the gap between the preference score of the P\-group and the preference score of the N\-group on vocabulary\-matched AHB items\. Positive values indicate that feature\-present text shifts the model toward stronger pro\-AW reasoning; negative values indicate it dilutes the stance\. Error bars are 95% confidence intervals on the per\-seed $P - N$ differences \($t$\-distribution, $\mathrm{df} = 4$\); significance stars are based on paired $t$\-tests across five seeds\. $n = 5$ seeds with 50 items per condition\.
### 4\.2 Per\-feature effects on the model’s stance
Figure [1](https://arxiv.org/html/2606.26104#S4.F1) shows the $P - N$ gap in mean preference score for each feature, with 95% confidence intervals from the per\-seed $t$\-distribution\. Nine of ten features produce statistically significant effects\. Seven shift the model toward stronger pro\-animal\-welfare reasoning when the feature is present:
- Certainty \(high\-certainty vs\. low\-certainty claims\), $\Delta = +0.192$, $p = 0.004$
- Moral Vocabulary, $\Delta = +0.174$, $p < 0.001$
- Emotion Words, $\Delta = +0.171$, $p = 0.003$
- Evaluative Stance, $\Delta = +0.164$, $p = 0.001$
- Narrative Structure, $\Delta = +0.162$, $p = 0.003$
- Harm Intensity, $\Delta = +0.103$, $p = 0.002$
- Temporal Proximity \(immediate present vs\. distant past\), $\Delta = +0.069$, $p < 0.001$
Two features shift the model in the opposite direction:
- Hedging, $\Delta = -0.142$, $p = 0.002$
- Concreteness, $\Delta = -0.064$, $p = 0.001$
Perspective \(first\-person vs\. third\-person\) showed no statistically significant effect \($\Delta = +0.003$, $p = 0.60$\)\.
Figure 2: Mean preference score \(model’s preference for the aligned answer over the misaligned answer\) for each fine\-tune, plotted on the same absolute scale as the un\-fine\-tuned baseline \(dashed line at \+0\.774\)\. Each feature is one connecting line; the line length is the writing\-side effect \($P - N$\)\. Blue points are fine\-tunes on feature\-present passages, brick\-red on feature\-absent\. Error bars are 95% CIs from per\-condition seed variance \($t$\-distribution, $\mathrm{df} = 4$\)\. Features are sorted by absolute effect size, largest at top\. The dashed\-line position is informative \(it shows where the un\-fine\-tuned model sits\) but not the headline; the writing\-side recommendation is governed by the gap between the two points within each feature\.
### 4\.3 Where each feature lands relative to the baseline model
Figure [2](https://arxiv.org/html/2606.26104#S4.F2) plots the P and N preference scores per feature on a single axis, with the un\-fine\-tuned baseline \(\+0\.774\) shown as a dashed reference\. Each feature is one connecting line; line length *is* the writing\-side effect \(the same $P - N$ quantity from Figure [1](https://arxiv.org/html/2606.26104#S4.F1)\)\. Three patterns are visible\.
Three features push P above baseline\. Moral\-Vocab\-P reaches \+0\.891, Evaluative\-Stance\-P reaches \+0\.863, and Harm\-Intensity\-P reaches \+0\.829 \(all above the \+0\.774 baseline\)\. For these features, fine\-tuning on the feature\-present passages actively *strengthens* the model’s pro\-animal\-welfare stance relative to its un\-fine\-tuned starting point\.
Most features drag both P and N below baseline, but by very different amounts\. Fine\-tuning on any 100\-passage corpus tends to narrow the model’s distribution and erode some of its broad pro\-AW prior\. This is consistent with Brazilek and Seawell \([2026](https://arxiv.org/html/2606.26104#bib.bib8)\), who report that post\-training on a small helpfulness corpus degrades a midtrained animal\-compassion stance: small post\-training datasets reliably move the model away from previously\-instilled values\. In our setting, the relevant signal is the gap between P and N within a feature, not the position relative to baseline\. Emotion\-Words\-P sits at \+0\.656 while Emotion\-Words\-N sits at \+0\.485: both are below baseline, but training on emotion\-word\-absent text drags the model down 2\.5$\times$ as far as training on emotion\-word\-present text\. Certainty shows the same pattern: P at \+0\.688 vs\. N at \+0\.496\.
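The ratio of erosion can be checked directly from the reported scores and the \+0.774 baseline. The worked arithmetic below uses only numbers stated in this section; the Certainty ratio is computed here the same way and is not a figure the paper reports.

$$
\begin{aligned}
\text{Emotion Words:}\quad \frac{0.774 - 0.485}{0.774 - 0.656} &= \frac{0.289}{0.118} \approx 2.45 \\
\text{Certainty:}\quad \frac{0.774 - 0.496}{0.774 - 0.688} &= \frac{0.278}{0.086} \approx 3.23
\end{aligned}
$$

The Emotion Words ratio rounds to the roughly 2.5-fold figure quoted above, and in both cases the numerator minus the denominator recovers the $P - N$ gap from Section 4.2 (0.171 and 0.192).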
Two features reverse the direction\. For Hedging, the N\-group \(assertive language, no hedges\) sits at \+0\.805, slightly above baseline, while the P\-group \(hedged\) is dragged to \+0\.663\. For Concreteness, N \(abstract description\) sits at \+0\.742 while P \(concrete sensory detail\) is dragged to \+0\.679\. Hedged or concretely\-descriptive AW text erodes the model’s prior more than its assertive or abstract counterpart does\.
### 4\.4 What the seven pro\-AW features have in common
The seven features that shift the model toward stronger pro\-animal\-welfare reasoning share a common property: each makes a position visible in the writing\. Moral vocabulary names a moral judgment\. Evaluative claims express one\. Asserted certainty asserts a fact\-claim\. Emotion words name an affective response\. Depicted harm severity foregrounds the consequence\. Immediate temporal framing places the consequence in the present rather than the safely\-distant past\. Narrative structure sequences events so that an outcome follows from a cause\.
The two features that move the model the other way \(hedged language and concrete sensory description\) share the opposite property: each holds animal\-welfare content but withholds stance\. Hedged claims are formally non\-committal\. Concrete\-sensory passages describe what is in front of the writer without telling the reader what to make of it\. Both leave the model free to draw its own conclusion, which on average it does not\.
The mechanism is mundane: when fine\-tuning data asserts a position, the model learns the position\. When fine\-tuning data describes a scene, the model learns the scene\.
### 4\.5 Practical recommendation
For anyone writing animal\-welfare text that may end up in a fine\-tuning or midtraining corpus: assert a position, do not just describe a scene\. The features that most strongly shift the model are those that make a position explicit: moralization, evaluation, emotion words, narrative urgency, asserted certainty, and depicted severity\. Hedged language and concrete\-sensory description dilute the model’s pro\-animal\-welfare disposition\.
Specifically:
- Use moral vocabulary \(“cruelty,” “unjust,” “moral duty”\) rather than procedural language \(“protocol,” “policy violation”\)\.
- Use evaluative claims \(“impressive,” “admirable,” “unacceptable”\) rather than neutral description\.
- Make assertions, not hedges\. “The animals suffer” rather than “the animals may suffer\.”
- Use immediate temporal framing \(“right now”\) rather than distant\-past framing\.
- Use emotion words\. Depict harm intensity directly\.
First\-person vs\. third\-person framing did not measurably shift the model in either direction, so it can be chosen on other grounds\.
## 5 Methodological notes: what we tried first and why it didn’t work
The behavioral evaluation reported in this paper followed two earlier experimental approaches that we abandoned because they conflated stance with vocabulary or showed unstable per\-document signal\. We document the iteration briefly because the lessons generalize\.
#### Per\-document MAGIC attribution\.
We initially ran MAGIC \(Ilyas and Engstrom, [2025](https://arxiv.org/html/2606.26104#bib.bib2)\) via the Bergson library \(EleutherAI, [2026](https://arxiv.org/html/2606.26104#bib.bib3)\) to estimate per\-document training influence on direct and indirect AW queries\. Across multiple dataset scales \(100 $\to$ 250 $\to$ 500 $\to$ 1,000 pairs\), MAGIC effect sizes regressed toward zero, leave\-subset\-out validation scores were unstable across seeds \(numerical blowups in the indirect\-query runs of one seed in three out of four expansions\), and the apparent largest effects flipped sign between dataset versions\. We attribute this to MAGIC’s known sensitivity to small per\-document signal\-to\-noise: matched\-pair stimuli that differ on a single linguistic feature produce nearly\-identical gradients, and the residual gradient differences are dominated by training\-order noise\. MAGIC was successful in the prior work of Brazilek et al\. \([2026](https://arxiv.org/html/2606.26104#bib.bib7)\), where each pair contrasted an animal\-welfare Wikipedia edit against a random Wikipedia chunk on a different topic; that whole\-topic contrast produces a much larger between\-document gradient signal than the single\-feature contrasts used here\.
#### Group\-level perplexity ablation\.
We then ran fine\-tuning ablations on the same v4 dataset \(100 pairs per feature\), measuring AW\-query perplexity after fine\-tuning on each P\-group and each N\-group separately\. Two features showed strong effects \(Moral Vocabulary, Hedging\) and the rest were null, but a follow\-up controlled experiment exposed the apparent effects as vocabulary\-density confounds: when we constructed Moral\-Vocab and Hedging pairs whose P and N variants share at least four AW\-content tokens \(Jaccard overlap $\geq 0.94$\), the perplexity differences collapsed to near\-null\. The original effects had been driven by P\-group passages containing more AW vocabulary than their N\-group counterparts \(e\.g\., “cruelty” and “moral duty” in P; “protocol” and “contamination risk” in N\)\. The model was learning AW vocabulary, not AW stance\.
#### Why the behavioral evaluation works\.
The vocabulary\-matched stance\-contrast benchmark used in this paper directly tests whether the model’s preference between two stances has shifted, on items where the AW vocabulary is held constant between the two candidates\. This isolates stance from vocabulary\. The 50 items have aligned/misaligned candidates that share a mean of 7\.08 AW tokens \(Jaccard 0\.94\), so likelihood differences between candidates reflect stance preference, not vocabulary recognition\. Length normalization handles the residual token\-length asymmetry\. The result is a clean readout of feature\-level effects on model behavior\.
## 6 Limitations
#### Model scale and training stage\.
We measure influence at the fine\-tuning scale \(LoRA on 100 documents per condition\) on Llama\-3\.2\-1B\. The findings apply directly to that regime\. They may transfer to other settings where small curated datasets shift model behavior \(instruction tuning, midtraining, continual pretraining\), but we did not test those settings here\. Whether the same linguistic\-feature effects scale to pretraining\-step influence on trillion\-token corpora is an open question that no academic\-budget attribution method \(MAGIC, TrackStar, fine\-tuning ablation\) currently addresses directly\.
#### Ceiling on win rate, not on the preference score\.
The un\-fine\-tuned base model’s aligned\-win rate of 0\.96 leaves limited headroom on the binary\-choice metric\. We use the continuous preference score as the primary metric for this reason; that metric is not ceiling\-bound and shows clean per\-feature effects across the full range\. A larger and harder benchmark with lower baseline win rate would give cleaner win\-rate signal, at the cost of needing to construct items that defeat strong baseline priors while preserving vocabulary matching\.
#### Vocabulary matching is necessary but not sufficient\.
The 50 AHB items match aligned and misaligned candidates on AW vocabulary \(mean Jaccard 0\.94, mean shared AW tokens 7\.08\)\. They do not match on every feature that could drive the model’s preference: aligned candidates tend to use slightly more declarative syntax, and misaligned candidates use more concessive constructions\. Effect sizes are robust enough to suggest the underlying stance signal is driving the result, but a future iteration using even more carefully balanced candidate pairs \(e\.g\., counterbalanced for sentence structure\) would tighten the conclusion\.
#### One model, one benchmark\.
Llama\-3\.2\-1B is one of many possible architectures and the AHB\-adjacent benchmark is one of many ways to probe stance\. Replication on Mistral, Phi, Qwen, and additional behavioral benchmarks would strengthen the generalizability claims\.
#### The 100\-pair feature groups have small content variance\.
Each fine\-tuning corpus is 100 passages of roughly 140 characters, sharing topic structure across 100 topics\. The model is being asked to generalize from a small, semantically narrow training set\. Larger and more semantically diverse per\-feature corpora may reveal effects not visible at this scale, or shrink effects that are an artifact of fine\-tuning on a small in\-distribution slice\.
## 7 Conclusion
The features that most strongly shift a fine\-tuned language model’s reasoning about animal welfare are the ones that make the writer’s position visible in the text: moralized vocabulary, evaluative claims, emotion words, narrative urgency, depicted severity, and asserted certainty\. Concrete sensory description and hedged language hold animal\-welfare content but withhold stance, and the model duly fails to pick one up\.
The single rule is simple: when you write for a model, assert your position rather than describe a scene\. Nine of ten linguistic features measurably shift the model on a vocabulary\-matched stance benchmark, and the seven that shift it in the pro\-animal\-welfare direction all share the property of making the writer’s position explicit\.
It is worth noting one further observation: the un\-fine\-tuned base model is already strongly pro\-animal\-welfare on this benchmark \(48/50 aligned, mean preference score \+0\.774\), and most of our fine\-tunes pulled the model down from that baseline rather than further up\. The features we identify as “pro\-AW shifters” are the ones that erode the prior the least; only three \(Moral Vocabulary, Evaluative Stance, Harm Intensity in the P\-group\) actually push above baseline\. This is consistent with the broader observation \(Brazilek and Seawell, [2026](https://arxiv.org/html/2606.26104#bib.bib8)\) that small post\-training corpora are a vector for erosion of an existing alignment, not just for installation of a new one\. For practitioners assembling a midtraining or fine\-tuning corpus that includes animal\-welfare material, the relevant question is often not “how do I make the model more aligned” but “which writing preserves the alignment the model already has\.”
## Acknowledgements
This research was conducted at Compassion Aligned Machine Learning \(CaML\)\. Compute was provided via RunPod\. The Bergson library by EleutherAI was used in the per\-document attribution iteration that informed the design of this study\. Animal\-welfare evaluation items were adapted from the Animal Harm Benchmark \(Sentient Futures, [2026](https://arxiv.org/html/2606.26104#bib.bib9)\)\.
## Data Availability
## References
- K\. Braddock and J\. P\. Dillard \(2016\) Meta\-analytic evidence for the persuasive effect of narratives on beliefs, attitudes, intentions, and behaviors\. Communication Monographs 83\(4\), pp\. 446–467\. External Links: [Document](https://dx.doi.org/10.1080/03637751.2015.1128555), [Link](https://doi.org/10.1080/03637751.2015.1128555) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Brazilek, M\. Navas, and A\. Gnauck \(2026\) Small edits, large models: how Wikipedia advocacy shapes LLM values\. Zenodo\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.19839777), [Link](https://doi.org/10.5281/zenodo.19839777) Cited by: [§1](https://arxiv.org/html/2606.26104#S1.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px1.p1.1), [§3\.3](https://arxiv.org/html/2606.26104#S3.SS3.p1.4), [§5](https://arxiv.org/html/2606.26104#S5.SS0.SSS0.Px1.p1.3)\.
- J\. Brazilek and J\. Seawell \(2026\) Helpfulness hurts: domain\-dependent degradation of mid\-trained moral reasoning under post\-training\. Zenodo\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.19925935), [Link](https://doi.org/10.5281/zenodo.19925935) Cited by: [§4\.1](https://arxiv.org/html/2606.26104#S4.SS1.p1.3), [§4\.3](https://arxiv.org/html/2606.26104#S4.SS3.p3.5), [§7](https://arxiv.org/html/2606.26104#S7.p3.2)\.
- T\. A\. Chang, D\. Rajagopal, T\. Bolukbasi, L\. Dixon, and I\. Tenney \(2024\) Scalable influence and fact tracing for large language model pretraining\. arXiv preprint arXiv:2410\.17413\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2410.17413), [Link](https://arxiv.org/abs/2410.17413) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px1.p1.1)\.
- EleutherAI \(2026\) Bergson: mapping out the “memory” of neural nets with data attribution\. Note: GitHub\. External Links: [Link](https://github.com/EleutherAI/bergson) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.26104#S5.SS0.SSS0.Px1.p1.3)\.
- M\. C\. Green and T\. C\. Brock \(2000\) The role of transportation in the persuasiveness of public narratives\. Journal of Personality and Social Psychology 79\(5\), pp\. 701–721\. External Links: [Document](https://dx.doi.org/10.1037/0022-3514.79.5.701), [Link](https://doi.org/10.1037/0022-3514.79.5.701) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Critch, J\. Li, D\. Song, and J\. Steinhardt \(2023\) Aligning AI with shared human values\. arXiv preprint arXiv:2008\.02275\. External Links: [Link](https://arxiv.org/abs/2008.02275) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px2.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. arXiv preprint arXiv:2106\.09685\. External Links: [Link](https://arxiv.org/abs/2106.09685) Cited by: [§3\.3](https://arxiv.org/html/2606.26104#S3.SS3.p1.4)\.
- A\. Ilyas and L\. Engstrom \(2025\) MAGIC: near\-optimal data attribution for deep learning\. arXiv preprint arXiv:2504\.16430\. External Links: [Link](https://arxiv.org/abs/2504.16430) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.26104#S5.SS0.SSS0.Px1.p1.3)\.
- D\. Kahneman \(2011\) Thinking, fast and slow\. Farrar, Straus and Giroux, New York\. Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px4.p1.1)\.
- P\. W\. Koh and P\. Liang \(2017\) Understanding black\-box predictions via influence functions\. In International Conference on Machine Learning, pp\. 1885–1894\. External Links: [Link](https://proceedings.mlr.press/v70/koh17a.html) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Korbak, K\. Shi, A\. Chen, R\. Bhalerao, C\. L\. Buckley, J\. Phang, S\. R\. Bowman, and E\. Perez \(2023\) Pretraining language models with human preferences\. In International Conference on Machine Learning\. External Links: [Link](https://arxiv.org/abs/2302.08582) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Santurkar, E\. Durmus, F\. Ladhak, C\. Lee, P\. Liang, and T\. Hashimoto \(2023\) Whose opinions do language models reflect?\. arXiv preprint arXiv:2303\.17548\. External Links: [Link](https://arxiv.org/abs/2303.17548) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px2.p1.1)\.
- Sentient Futures \(2026\) Animal harm benchmark \(AHB\)\. Note: Hugging Face\. External Links: [Link](https://huggingface.co/datasets/sentientfutures/ahb) Cited by: [Acknowledgements](https://arxiv.org/html/2606.26104#Sx1.p1.1)\.
- H\. Shi, Z\. Xu, H\. Wang, W\. Qin, W\. Wang, Y\. Wang, Z\. Wang, S\. Ebrahimi, and H\. Wang \(2024\) Continual learning of large language models: a comprehensive survey\. arXiv preprint arXiv:2404\.16789\. External Links: [Link](https://arxiv.org/abs/2404.16789) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, et al\. \(2023\) LLaMA: open and efficient foundation language models\. arXiv preprint arXiv:2302\.13971\. External Links: [Link](https://arxiv.org/abs/2302.13971) Cited by: [§3\.3](https://arxiv.org/html/2606.26104#S3.SS3.p1.4)\.
- A\. Tversky and D\. Kahneman \(1981\) The framing of decisions and the psychology of choice\. Science 211\(4481\), pp\. 453–458\. External Links: [Document](https://dx.doi.org/10.1126/science.7455683), [Link](https://doi.org/10.1126/science.7455683) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px4.p1.1)\.
- Ç\. Yıldız, N\. K\. Ravichandran, N\. Sharma, M\. Bethge, and B\. Ermis \(2024\) Investigating continual pretraining in large language models: insights and implications\. arXiv preprint arXiv:2402\.17400\. External Links: [Link](https://arxiv.org/abs/2402.17400) Cited by: [§2](https://arxiv.org/html/2606.26104#S2.SS0.SSS0.Px3.p1.1)\.