DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge

arXiv cs.CL 05/25/26, 04:00 AM Papers
Summary
This paper presents the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, which applies activation steering to multilingual LLMs using language vectors from parallel FLORES data. The system achieved 86.96% accuracy in the MCQ track, ranking 7th out of 17 teams, and post-hoc analyses reveal that gains are layer-sensitive and vary across language-region pairs.
arXiv:2605.23069v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used across diverse linguistic and cultural contexts, yet their cultural knowledge remains uneven across regions and languages. We present the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, where we apply activation steering to multilingual LLMs using language vectors extracted from parallel FLORES data. Our method performs inference-time adaptation by adding language-specific steering vectors to the residual stream at a selected transformer layer, without any parameter updates. We participated in both the short-answer (SAQ) and multiple-choice (MCQ) tracks; however, only our MCQ submission received an official score. In the official MCQ track, we achieved 86.96% accuracy, ranking 7th out of 17 teams. To better understand system behavior, we conduct post-hoc analyses on the shared-task MCQ and SAQ settings. These analyses show that activation steering yields modest and heterogeneous improvements on cultural reasoning: gains are strongly layer-sensitive, vary substantially across language-region pairs, with some configurations even degrading performance, and interact with prompt formulation, comparing generic and culturally conditioned prompts. Our findings suggest that prompt design and activation steering should be jointly optimized for culturally aware multilingual inference.
# DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge
Source: [https://arxiv.org/html/2605.23069](https://arxiv.org/html/2605.23069)
Yusser Al Ghussin (1,2), Daniil Gurgurov (1,2), Yasser Hamidullah (1,2), Josef van Genabith (1,2), Cristina España-Bonet (1,3), Simon Ostermann (1,2)

(1) German Research Center for Artificial Intelligence (DFKI GmbH); (2) Saarland Informatics Campus, Saarbrücken, Germany; (3) Barcelona Supercomputing Center (BSC-CNS), Barcelona, Catalonia, Spain

###### Abstract

Large language models (LLMs) are increasingly used across diverse linguistic and cultural contexts, yet their cultural knowledge remains uneven across regions and languages. We present the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, where we apply *activation steering* to multilingual LLMs using language vectors extracted from parallel FLORES data. Our method performs inference-time adaptation by adding language-specific steering vectors to the residual stream at a selected transformer layer, without any parameter updates. We participated in both the short-answer (SAQ) and multiple-choice (MCQ) tracks; however, only our MCQ submission received an official score. In the official MCQ track, we achieved 86.96% accuracy, ranking 7th out of 17 teams. To better understand system behavior, we conduct post-hoc analyses on the shared-task MCQ and SAQ settings. These analyses show that activation steering yields *modest* and *heterogeneous* improvements on cultural reasoning: gains are strongly *layer-sensitive*, vary substantially across language–region pairs (some configurations even degrade performance), and interact with prompt formulation (generic vs. culturally conditioned prompts). Our findings suggest that prompt design and activation steering should be jointly optimized for culturally aware multilingual inference. We release our code and experimental configurations at [https://github.com/Yusser96/SemEval-2026-Track7](https://github.com/Yusser96/SemEval-2026-Track7).

## 1 Introduction

Figure 1: Motivation: if culture overlaps with language representations and language identity forms stable directions, then steering with language vectors may improve access to culturally relevant knowledge.

Large language models (LLMs) are increasingly deployed in multilingual settings, but strong multilingual performance does not necessarily imply strong *cultural* competence. Recent work shows that LLMs often underperform on culturally grounded reasoning and everyday cultural knowledge, especially for underrepresented regions and languages, even when they appear linguistically fluent (Myung et al., [2024](https://arxiv.org/html/2605.23069#bib.bib2); Romero et al., [2024](https://arxiv.org/html/2605.23069#bib.bib31)). These concerns have motivated a growing body of research on *cultural awareness* and its evaluation in language models (Pawar et al., [2025](https://arxiv.org/html/2605.23069#bib.bib9)). This challenge is central to SemEval-2026 Task 7 (Ousidhoum et al., [2026](https://arxiv.org/html/2605.23069#bib.bib1)), which evaluates cultural knowledge and reasoning across diverse languages and cultures using BLEnD-style evaluation protocols (Myung et al., [2024](https://arxiv.org/html/2605.23069#bib.bib2)).

In this paper, we describe the DFKI-MLT submission to SemEval-2026 Task 7 (Ousidhoum et al., [2026](https://arxiv.org/html/2605.23069#bib.bib1); Ghosh et al., [2026](https://arxiv.org/html/2605.23069#bib.bib3)). Prior work provides mechanistic evidence that multilingual LLMs encode cultural information in representations that overlap and interact with language-specific components (Namazifard and Poech, [2025](https://arxiv.org/html/2605.23069#bib.bib12)), suggesting that intervening on *language-aligned directions* may also modulate culturally relevant behavior. Motivated by this, our system uses *activation steering*: instead of optimizing model parameters through fine-tuning, we modify internal activations at inference time using steering vectors (Rimsky et al., [2024](https://arxiv.org/html/2605.23069#bib.bib25)). Concretely, we extract language steering vectors and inject them into the residual stream of multilingual LLMs during generation. We build on evidence that language identity is encoded as a stable direction in activation space (Marks and Tegmark, [2023](https://arxiv.org/html/2605.23069#bib.bib83)), and hypothesize that steering along such directions can improve access to culturally relevant knowledge (Figure [1](https://arxiv.org/html/2605.23069#S1.F1)).

Our experiments across multiple multilingual instruction-tuned models, prompts, and languages show that activation steering yields *modest* and *heterogeneous* effects on cultural reasoning: at best, we observe improvements of up to +1.5% absolute accuracy over the unsteered baseline on individual locales, but other configurations degrade performance, and gains do not generalize uniformly across language-region pairs. These results highlight both the appeal of steering as a lightweight inference-time intervention and its current limitations as a stand-alone solution to cultural alignment.

Beyond reporting shared-task performance, we aim to provide a detailed analysis of *when* and *why* using language vectors for activation steering can help cultural reasoning.

## 2 Task Background

SemEval-2026 Task 7 (Ousidhoum et al., [2026](https://arxiv.org/html/2605.23069#bib.bib1)) evaluates the *cultural awareness* of language models and NLP systems across languages and regions. The task is based on the manually constructed BLEnD benchmark (Myung et al., [2024](https://arxiv.org/html/2605.23069#bib.bib2)), which is designed specifically for evaluation and therefore does not provide training data. By withholding BLEnD from system training, the shared task aims to assess whether models can generalize to unseen cultural and linguistic contexts rather than memorizing benchmark content.

BLEnD currently covers multiple languages and cultures, and the shared task expands coverage with further language-culture pairs. Participants may compete in one or more tracks.

#### Track 1: Short Answer Questions \(SAQ\)\.

In the SAQ track, systems answer short questions in the same language as the input question\. The goal is to generate a culturally appropriate response while respecting linguistic and regional variation\. Answers are evaluated against human\-annotated BLEnD responses\.

#### Track 2: Multiple\-Choice Questions \(MCQ\)\.

In the MCQ track, questions are provided in English, and each question includes four answer options representing different cultural perspectives \(one option per country/region candidate, subject to the benchmark construction constraints\)\. Systems must select the culturally appropriate option for the target region\.

MCQ example:

> Question: What sports do men like to watch the most in Ireland?
> A. baseball
> B. basketball
> C. cricket
> D. football
>
> Gold label: D

#### Our participation\.

We participated in both Track 1 (SAQ) and Track 2 (MCQ). Our submission uses inference-time activation steering with language vectors extracted from multilingual parallel data, without model fine-tuning.

#### Evaluation metric\.

The official metric is accuracy, with evaluation designed to account for valid response variation. In the SAQ track, a generated answer is considered correct if it matches any acceptable human-annotated response for the same question. In the MCQ track, accuracy is computed based on whether the selected option matches the correct culturally appropriate choice.
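The two correctness checks can be sketched as follows. This is a minimal illustration, not the organizers' scorer: the `normalize` function (lowercasing, trimming, dropping a trailing period) and the exact-match comparison are assumptions, and all function names are hypothetical.

```python
from typing import List


def normalize(text: str) -> str:
    """Hypothetical normalization: trim, lowercase, drop a trailing period."""
    return text.strip().lower().rstrip(".").strip()


def saq_correct(prediction: str, accepted: List[str]) -> bool:
    """SAQ: correct if the prediction matches ANY acceptable annotated answer."""
    return normalize(prediction) in {normalize(answer) for answer in accepted}


def mcq_correct(selected: str, gold: str) -> bool:
    """MCQ: correct if the selected letter equals the gold letter."""
    return selected == gold


def accuracy(flags: List[bool]) -> float:
    """Fraction of correct items (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


flags = [
    saq_correct("  Football. ", ["football", "soccer"]),  # matches after normalization
    saq_correct("rugby", ["football", "soccer"]),         # no acceptable match
    mcq_correct("D", "D"),
    mcq_correct("D", "D"),
]
score = accuracy(flags)
```

Because SAQ correctness depends on matching short surface forms in this sketch, any extra words in the prediction turn a plausible answer into a miss.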

## 3 System Overview

Our SemEval-2026 Task 7 submission uses *activation steering* for culturally aware multilingual inference. Instead of fine-tuning model parameters, we add a steering vector to the residual stream at a selected transformer layer at inference time.

The central hypothesis is that language identity is encoded as a direction in activation space (Marks and Tegmark, [2023](https://arxiv.org/html/2605.23069#bib.bib83)) and that steering along this direction may modulate access to culturally relevant knowledge for a target language-region pair. We therefore construct language vectors from multilingual sentence representations and inject them during decoding.

The system has three components:

1. Offline language vector extraction from FLORES-based multilingual data;
2. Inference-time activation steering with tunable strength $\beta$;
3. Development-time selection of model, layer, and steering strength using the SemEval-2026 development phase.

In the final submission, we selected a single steering configuration based on development performance and applied it to both shared\-task tracks\.

### 3.1 Language Vector Extraction

We compute language vectors from FLORES (Team et al., [2022](https://arxiv.org/html/2605.23069#bib.bib8)) sentences by averaging residual-stream activations and taking a difference of means, similar to the approach used in AxBench (Wu et al., [2025](https://arxiv.org/html/2605.23069#bib.bib84)). Let $h^{(l)}(x)$ denote the post-normalization residual-stream activation at layer $l$ for input sentence $x$. For a target language $\ell$, the language vector is defined as:

$$v_{\ell}^{(l)} = \frac{1}{|D_{\ell}|}\sum_{x \in D_{\ell}} h^{(l)}(x) - \frac{1}{|D_{\neg\ell}|}\sum_{x \in D_{\neg\ell}} h^{(l)}(x), \tag{1}$$

where $D_{\ell}$ is the set of sentences for the target language and $D_{\neg\ell}$ is the set of sentences from the remaining languages.
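Equation (1) can be illustrated with a small difference-of-means sketch. It assumes that one sentence-level activation vector per FLORES sentence has already been extracted; plain Python lists stand in for tensors, and the function names and toy 2-dimensional values are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def language_vector(activations: Dict[str, List[Vector]], target: str) -> Vector:
    """Mean over the target language minus mean over all other languages."""
    target_mean = mean_vector(activations[target])
    # Pool the sentences of every non-target language into a single set.
    rest = [vec for lang, vecs in activations.items() if lang != target for vec in vecs]
    rest_mean = mean_vector(rest)
    return [t - r for t, r in zip(target_mean, rest_mean)]


# Toy sentence-level activations (hypothetical 2-dimensional hidden states).
toy = {
    "spa_Latn": [[1.0, 0.0], [3.0, 2.0]],
    "bul_Cyrl": [[0.0, 1.0]],
    "arb_Arab": [[2.0, 3.0]],
}
v_spa = language_vector(toy, "spa_Latn")  # [1.0, -1.0]
```

The second mean pools all non-target sentences, following the $1/|D_{\neg\ell}|$ normalization in Equation (1), rather than averaging per-language means.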

#### Activation extraction details:

We use the post-normalization residual stream and compute the mean activation over all tokens in each sentence. Sentences are processed one at a time, and no additional prompt template is used during vector extraction (i.e., we feed the original FLORES sentence directly).

#### FLORES mapping to shared\-task language\-region pairs:

BLEnD targets language-region pairs (e.g., ar-DZ, es-MX), while FLORES (Team et al., [2022](https://arxiv.org/html/2605.23069#bib.bib8)) provides language/script identifiers. We therefore define a mapping from shared-task pairs to FLORES language codes. For some cases where an exact regional mapping is unavailable in FLORES (e.g., multiple regions sharing the same language variety), we approximate using the closest available language-level FLORES code (e.g., a shared Spanish code for multiple Spanish-speaking regions). We provide the full mapping in Appendix [A](https://arxiv.org/html/2605.23069#A1).
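A lookup with a language-level fallback might look like the sketch below. The entries are illustrative only and are not copied from the paper's Appendix A; the table names, the `flores_code` helper, and the fallback rule (first part of the locale tag) are assumptions.

```python
from typing import Dict

# Illustrative entries only; the actual mapping is the one in Appendix A.
LOCALE_TO_FLORES: Dict[str, str] = {
    "es-MX": "spa_Latn",
    "es-ES": "spa_Latn",
    "bg-BG": "bul_Cyrl",
}

# Assumed language-level fallbacks for locales without an explicit entry.
LANGUAGE_FALLBACK: Dict[str, str] = {"es": "spa_Latn", "bg": "bul_Cyrl"}


def flores_code(locale: str) -> str:
    """Return the FLORES code for a locale, falling back to the language level."""
    if locale in LOCALE_TO_FLORES:
        return LOCALE_TO_FLORES[locale]
    language = locale.split("-")[0]
    if language in LANGUAGE_FALLBACK:
        return LANGUAGE_FALLBACK[language]
    raise KeyError(f"no FLORES mapping for {locale!r}")
```

With this rule, every Spanish-speaking locale resolves to the same code and therefore receives the same language vector, so within-language regional variation is not captured.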

#### Data size and preprocessing:

For each mapped language, we use the first 1,000 available FLORES dev sentences (Team et al., [2022](https://arxiv.org/html/2605.23069#bib.bib8)) to compute the vector. We do not apply additional preprocessing beyond standard tokenization by the model tokenizer. A sample-size convergence study in Appendix [B](https://arxiv.org/html/2605.23069#A2) shows that the resulting DiffMean directions are already highly stable at substantially smaller sample sizes across the models we analyze.

### 3.2 Inference-Time Steering

During inference, we steer the hidden state at a selected transformer layer:

$$\tilde{h}^{(l)} = h^{(l)} + \beta \cdot v_{\ell}^{(l)}, \tag{2}$$

where $v_{\ell}^{(l)}$ is the language vector for the target language and $\beta$ is a scalar steering strength.

We evaluated a small set of steering strengths $\beta \in \{1, 3, 5\}$ during development and found that $\beta = 1$ performs best for cultural steering in our setting. This value is used in the final submission.
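The update in Equation (2) is a single vector addition. The sketch below applies it to every token position, which is an assumption (the set of steered positions is not specified here); plain Python lists stand in for tensors and `steer` is a hypothetical name. In a real model the same addition would typically run inside a hook on the chosen layer.

```python
from typing import List

Vector = List[float]


def steer(hidden_states: List[Vector], language_vec: Vector, beta: float = 1.0) -> List[Vector]:
    """Add beta * language_vec to the hidden state at every token position."""
    return [
        [h_i + beta * v_i for h_i, v_i in zip(token_state, language_vec)]
        for token_state in hidden_states
    ]


# Two token positions with a hypothetical 3-dimensional residual stream.
states = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
vec = [1.0, 0.0, -1.0]
steered = steer(states, vec, beta=1.0)  # [[1.0, 1.0, 1.0], [2.0, 1.0, 0.0]]
```

Setting `beta=0.0` recovers the unsteered baseline, which makes the baseline and the steered runs directly comparable within one code path.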

### 3.3 Development-Time Model and Layer Selection

We perform model and layer selection during the SemEval development phase by evaluating a set of multilingual instruction-tuned LLMs and candidate steering layers. We tested older and newer models in different sizes that have proven to perform well in multilingual settings, including Qwen2.5-72B-Instruct and Qwen2.5-7B-Instruct (Team, [2024](https://arxiv.org/html/2605.23069#bib.bib5)), Aya Expanse 8B and Aya Expanse 32B (Dang et al., [2024](https://arxiv.org/html/2605.23069#bib.bib6)), and Qwen3-8B and Qwen3-32B (Team, [2025](https://arxiv.org/html/2605.23069#bib.bib4)).

Based on development performance, we select Qwen2.5-72B-Instruct with steering applied at Layer 26 for the final shared-task submission.

## 4 Experimental Setup

### 4.1 Decoding and Inference

We use greedy decoding (temperature=0) for both tracks to minimize confounding factors when evaluating activation steering. Since our method intervenes directly on internal representations, stochastic decoding (e.g., sampling with nonzero temperature) would introduce additional variance that can obscure whether performance changes are caused by the intervention or by decoding randomness. Deterministic decoding therefore allows a clearer attribution of gains or degradations to the steering configuration (layer and $\beta$), and improves reproducibility across layer sweeps and prompt comparisons.

#### Track 2 \(MCQ\)\.

For each question, we prompt the model to choose one option from A/B/C/D. We score the four answer letters using their output log-probabilities and select the option with the highest log-probability. We generate at most 1 token.
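The letter-scoring step can be sketched as an argmax restricted to the four answer letters. The sketch assumes that next-token logits are available as a token-to-logit mapping and that each letter is a single token; the function names and the toy logits are hypothetical.

```python
import math
from typing import Dict, Sequence


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable log-softmax over a token -> logit mapping."""
    peak = max(logits.values())
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits.values()))
    return {token: x - log_norm for token, x in logits.items()}


def pick_option(next_token_logits: Dict[str, float],
                letters: Sequence[str] = ("A", "B", "C", "D")) -> str:
    """Return the answer letter with the highest next-token log-probability."""
    log_probs = log_softmax(next_token_logits)
    return max(letters, key=lambda letter: log_probs[letter])


# Hypothetical next-token logits; "the" is a distractor, not an answer letter.
logits = {"A": 0.2, "B": -1.0, "C": 0.5, "D": 2.3, "the": 3.0}
choice = pick_option(logits)  # "D"
```

Restricting the argmax to the four letters means a high-probability non-letter token (here `"the"`) cannot produce an unparseable answer.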

#### Track 1 \(SAQ\)\.

We generate up to 32 tokens to balance completeness and evaluation stability. Although SAQ targets concise answers, the required length varies across languages due to tokenization and morphology (e.g., multi-word expressions), and overly small limits risk truncating otherwise correct answers. At the same time, longer generations increase the chance of irrelevant continuations that can hurt near-exact matching. To reduce formatting artifacts, we apply a lightweight normalization procedure to the generated text (normalization details in Appendix [C](https://arxiv.org/html/2605.23069#A3)).

### 4.2 Prompting Strategy

We evaluate two prompt formulations for both tracks during analysis: a generic prompt and a cultural prompt. The official shared-task submission uses the cultural prompt.

#### Generic prompt\.

The generic prompt instructs the model to answer the question \(or select one MCQ option\) without explicitly mentioning the target region or language in the instruction text\.

Generic prompt template:

> Select exactly one option: A, B, C, or D.
> Question: {question}
> A. {option_a}
> B. {option_b}
> C. {option_c}
> D. {option_d}
> Answer (A/B/C/D):

#### Cultural prompt \(official submission\)\.

The cultural prompt explicitly conditions the model on the target region and language \(e\.g\., “for someone living in \[region\]” and “respond in \[language\]”\)\. For SAQ, it additionally instructs the model to produce a concise answer without explanation\.

Cultural prompt template:

> You are answering a multiple-choice question for someone living in {Region}. Respond strictly in {Language} and select exactly one option: A, B, C, or D.
> Question: {question}
> A. {option_a}
> B. {option_b}
> C. {option_c}
> D. {option_d}
> Answer (A/B/C/D):
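Filling the cultural template is plain string formatting, sketched below. The wording follows the template above, but the line breaks between fields are an assumption, and `build_cultural_prompt` is a hypothetical helper name. The example reuses the Ireland question from Section 2.

```python
from typing import Sequence

# Wording follows the cultural prompt template; the "\n" placement is assumed.
CULTURAL_TEMPLATE = (
    "You are answering a multiple-choice question for someone living in {Region}. "
    "Respond strictly in {Language} and select exactly one option: A, B, C, or D.\n"
    "Question: {question}\n"
    "A. {option_a}\n"
    "B. {option_b}\n"
    "C. {option_c}\n"
    "D. {option_d}\n"
    "Answer (A/B/C/D):"
)


def build_cultural_prompt(region: str, language: str, question: str,
                          options: Sequence[str]) -> str:
    """Fill the cultural template with a region, a language, and four options."""
    if len(options) != 4:
        raise ValueError("expected exactly four options")
    option_a, option_b, option_c, option_d = options
    return CULTURAL_TEMPLATE.format(
        Region=region, Language=language, question=question,
        option_a=option_a, option_b=option_b, option_c=option_c, option_d=option_d,
    )


prompt = build_cultural_prompt(
    "Ireland", "English",
    "What sports do men like to watch the most in Ireland?",
    ["baseball", "basketball", "cricket", "football"],
)
```

Ending the prompt at `Answer (A/B/C/D):` leaves the answer letter as the very next token, which is what single-token letter scoring relies on.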

### 4.3 Hyperparameters

We select the steering strength from $\beta \in \{1, 3, 5\}$ on the SemEval-2026 development phase and use $\beta = 1$ in the final submission. We run layer sweeps to locate the best depth for steering. The steering layer (Layer 26) and backbone model (Qwen2.5-72B-Instruct) are also chosen based on development performance for the official results.
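Selecting a single global configuration amounts to a grid search over (layer, $\beta$) pairs scored on development data. The sketch below assumes a callable that returns development accuracy for a given pair; `select_config`, `toy_dev_accuracy`, the candidate layers, and the synthetic scores are all hypothetical and are not results from the paper.

```python
from typing import Callable, Iterable, Tuple


def select_config(layers: Iterable[int], betas: Iterable[float],
                  dev_accuracy: Callable[[int, float], float]) -> Tuple[int, float, float]:
    """Return the (layer, beta, score) triple with the highest dev accuracy."""
    beta_list = list(betas)
    best = None
    for layer in layers:
        for beta in beta_list:
            score = dev_accuracy(layer, beta)
            # Strict comparison keeps the first configuration on ties.
            if best is None or score > best[2]:
                best = (layer, beta, score)
    if best is None:
        raise ValueError("empty search grid")
    return best


def toy_dev_accuracy(layer: int, beta: float) -> float:
    """Synthetic stand-in for a dev evaluation run (NOT real results)."""
    return 0.85 - 0.001 * abs(layer - 26) - 0.01 * (beta - 1)


best_layer, best_beta, best_score = select_config([2, 8, 26, 40], [1, 3, 5], toy_dev_accuracy)
```

Running the same search separately per track or per prompt would expose cases where one global choice is a compromise.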

## 5 Results and Analysis

Table 1: Official SemEval-2026 Task 7 results for our submission. The official submission used the cultural prompt. Our SAQ submission was not evaluated due to a corrupted/incorrect file and therefore has no official score.

Table 2: Top-5 language–region pairs (Track 2 MCQ) by our official accuracy. “Best” denotes the top-ranked system on the leaderboard (overall winner). Positive gap indicates our score exceeds the winner’s per-locale score in the excerpt.

### 5.1 Official Shared-Task Results

Due to an incorrect/corrupted submission file, our Track 1 \(SAQ\) submission was not successfully evaluated by the organizers and therefore has no official score\. We therefore report official leaderboard results only for Track 2 \(MCQ\)\.

Using the cultural prompt and activation steering, our Track 2 system achieved 86.96% overall accuracy and ranked 7th out of 17 teams (Table [1](https://arxiv.org/html/2605.23069#S5.T1)). The best-performing system on the leaderboard achieved 96.78%, leaving a gap of 9.82 percentage points to our submission.

Table [2](https://arxiv.org/html/2605.23069#S5.T2) lists the five language-region pairs on which our system performs best in the MCQ track. For each locale, we report our accuracy and our locale-specific rank (based on the official leaderboard). For ar-EG, we obtain 94.84% while the overall winning system reports 91.03%, meaning we outperform it by 3.81 percentage points on this locale. In contrast, for bg-BG, we reach 94.60% while the winner achieves 99.54%, leaving a 4.94 percentage-point deficit. This heterogeneity aligns with our post-hoc analyses, which indicate that both steering and prompting effects are highly locale-dependent.

### 5.2 Post-hoc Analysis

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_mcq/Qwen__Qwen2p5-72B-Instruct__quen72b__beta_1/cross_prompt_overall_layer_sweep.png)

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_saq/Qwen__Qwen2p5-72B-Instruct__quen72b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 2: Post-hoc cross-prompt layer sweeps for Qwen2.5-72B-Instruct with $\beta = 1$ on MCQ (top) and SAQ (bottom). The official submission uses the cultural prompt.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen72b_overall_layer_sweep_multi_beta.png)

Figure 3: Post-hoc overall MCQ layer sweeps for Qwen2.5-72B-Instruct under different steering strengths ($\beta \in \{1, 3, 5\}$).

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen72b_beta1_top_bottom_L2.png)

Figure 4: Top and bottom per-language MCQ accuracy changes (steered minus baseline, percentage points) using the cultural prompt.

To characterize system behavior beyond the single locked submission configuration, we ran post-hoc analyses on both MCQ and SAQ evaluation data across multiple models (Qwen2.5-72B/7B, Aya Expanse 8B/32B, Qwen3 8B/32B) using the same evaluation metrics provided by the SemEval-2026 organizers for each track. We observe:

(i) Strong layer sensitivity: steering gains concentrate in a subset of layers, while some layers degrade performance (e.g., the MCQ/SAQ cross-prompt layer sweeps in Figure [2](https://arxiv.org/html/2605.23069#S5.F2)). For Qwen2.5-72B, the best steering layer differs by both task and prompt: MCQ peaks at Layer 2 with the cultural prompt and Layer 3 with the generic prompt, while SAQ peaks at Layer 8 and Layer 7, respectively. A single global layer choice is therefore a compromise. The best post-hoc layer also differs from the layer selected for the official submission due to differences in evaluation split.

(ii) $\beta$ sensitivity: larger steering strengths are more prone to early-layer instability, whereas smaller strengths are generally more robust. In practice we found $\beta = 1$ to be the most reliable setting for Qwen2.5 models (Figure [3](https://arxiv.org/html/2605.23069#S5.F3)), while some Qwen3/Aya configurations tolerate stronger steering in post-hoc sweeps (Appendices [D](https://arxiv.org/html/2605.23069#A4) and [E](https://arxiv.org/html/2605.23069#A5)).

(iii) Prompt-task interaction: cultural prompting tends to be stronger for MCQ, where it conditions choice probabilities without changing output format, whereas the generic prompt is often better for SAQ across several models, likely because SAQ scoring depends on matching short surface forms and culturally conditioned prompts can induce verbose or stylistically marked responses (Figure [2](https://arxiv.org/html/2605.23069#S5.F2)). For example, for the SAQ item *“What is a popular snack at an amusement park in Azerbaijan?”*, the generic prompt yields a short candidate (*“Somsa/Samsa”*), while the cultural prompt produces a longer explanatory response (e.g., *“A popular snack at an amusement park in Azerbaijan is pakhlava, a sweet pastry…”*), which is more likely to fail evaluation even when broadly plausible. This aligns with findings that cultural prompting can be beneficial but is not uniformly effective across settings (Tao et al., [2024](https://arxiv.org/html/2605.23069#bib.bib7)).

(iv) Model- and locale-dependent effects: steering impacts vary substantially across language-region pairs (Figure [4](https://arxiv.org/html/2605.23069#S5.F4)), with some locales showing large gains and others degradations. These patterns are not uniform across models, which motivates model- and locale-aware steering policies in future work.

(v) Model and $\beta$ effects: we do not observe a simple monotonic relationship between model parameter count or depth and the optimal steering strength $\beta$ in our post-hoc sweeps. The preferred $\beta$ appears model- and setting-dependent: across our evaluated models, $\beta = 1$ is the safest default, while a few Qwen3/Aya configurations tolerate stronger steering in localized layers (Appendices [D](https://arxiv.org/html/2605.23069#A4) and [E](https://arxiv.org/html/2605.23069#A5)). We therefore caution against treating $\beta$ as a function of scale alone, and recommend re-tuning it per model and prompt.

![Refer to caption](https://arxiv.org/html/2605.23069v1/x1.png)

Figure 5: Per-locale steering effect for Qwen2.5-72B: $\Delta_{\mathrm{random}}$ averaged over four Gaussian draws (x-axis) vs. $\Delta_{\mathrm{language}}$ (y-axis), using the two dev-selected layers and both prompts. Each point is a (layer, prompt, locale) cell; $n = 60$ per prompt. Random-vector effects concentrate near zero, while language-vector effects span a wider range and include negative outliers.

(vi) Random vs. language vector effects: to check whether language-vector effects are distinguishable from generic activation perturbations, we compare them against L2-normalized Gaussian random vectors at the same layers with the same $\beta = 1$ intervention (Appendix F). For Qwen2.5-72B, random-vector effects remain concentrated near zero after averaging over four draws, while language-vector effects are somewhat more dispersed and include negative outliers (Figure [5](https://arxiv.org/html/2605.23069#S5.F5)). This suggests that random perturbations do not fully explain the language-vector effects, but the effects are also not reliably beneficial.
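The random-vector control can be sketched as follows. It assumes that "L2-normalized" means rescaling each Gaussian draw to unit norm and that the random-vector effect is the accuracy change averaged over draws; the function names, the dimensionality, and the seed are hypothetical.

```python
import math
import random
from typing import List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a standard Gaussian vector and rescale it to unit L2 norm."""
    draw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in draw))
    return [x / norm for x in draw]


def mean_delta(steered_accuracies: List[float], baseline_accuracy: float) -> float:
    """Average accuracy change (steered minus baseline) over several draws."""
    deltas = [acc - baseline_accuracy for acc in steered_accuracies]
    return sum(deltas) / len(deltas)


rng = random.Random(0)  # fixed seed so the control vectors are reproducible
controls = [random_unit_vector(8, rng) for _ in range(4)]  # four draws per cell
```

Each control vector would then be injected with the same layer and the same $\beta$ as the language vector, so that any difference between the two effects reflects direction rather than the mere presence of a perturbation.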

## 6 Discussion

Our post\-hoc analyses indicate that activation steering for cultural MCQ/SAQ reasoning yields modest and highly context\-dependent improvements rather than uniform gains\. First, the steering effect is strongly layer\-sensitive, with improvements concentrated in a subset of layers and other layers degrading performance\. Second, per\-config means stay under 0\.5 pp on either track because most layers are neutral, and gains on one track do not predict gains on the other, so a single global steering layer \(e\.g\., the Layer 26 used for the official submission\) cannot be optimal for every \(locale, track\) pair\. Third, prompt design interacts with steering in non\-trivial ways: the cultural prompt used for the official submission and a simpler generic prompt produce different optimal steering layers and different per\-language gains\.

These findings indicate that prompt design and activation steering should be treated as a jointly optimized inference\-time adaptation problem rather than independent components\.

## 7 Limitations and Future Work

- Official evaluation coverage. Our Track 1 (SAQ) submission was not officially evaluated because of a corrupted file. All SAQ results are therefore *post-hoc* offline re-evaluations and not comparable to the official leaderboard. Future submissions should include stricter package validation before upload.
- Scope of empirical comparison. We analyze sensitivity to layer, $\beta$, prompt, model, and locale, but do not exhaustively compare against stronger prompt-only baselines, fine-tuning, or alternative steering methods such as CAA, ReFT, or SAE-based steering. Future work should benchmark DiffMean steering against these adaptation methods under matched compute and evaluation settings.
- Language-derived vectors and cultural conflation. Our vectors are derived from FLORES language-level data rather than culturally annotated or task-specific data. This conflates language identity with culture: several language–region pairs share the same FLORES code, so within-language regional variation is not captured. Future work should compare FLORES-based vectors with culture-specific and task-specific steering directions.
- Single global steering configuration. Our official submission uses one global $(\beta, \text{layer})$ pair, although post-hoc analyses show that locally optimal settings vary across models, prompts, layers, and locales. Future work should explore adaptive per-language, per-locale, or per-prompt steering policies.

## Acknowledgments

This research was supported by the German Federal Ministry of Research, Technology and Space \(BMFTR\) as part of the project TRAILS \(01IW24005\)\.

## References

- J. Dang, S. Singh, D. D’souza, A. Ahmadian, A. Salamanca, M. Smith, A. Peppin, S. Hong, M. Govindassamy, T. Zhao, S. Kublik, M. Amer, V. Aryabumi, J. A. Campos, Y. Tan, T. Kocmi, F. Strub, N. Grinsztajn, Y. Flet-Berliac, A. Locatelli, H. Lin, D. Talupuru, B. Venkitesh, D. Cairuz, B. Yang, T. Chung, W. Ko, S. S. Shi, A. Shukayev, S. Bae, A. Piktus, R. Castagné, F. Cruz-Salinas, E. Kim, L. Crawhall-Stein, A. Morisot, S. Roy, P. Blunsom, I. Zhang, A. Gomez, N. Frosst, M. Fadaee, B. Ermis, A. Üstün, and S. Hooker (2024) Aya expanse: combining research breakthroughs for a new multilingual frontier. ArXiv preprint abs/2412.04261. [Link](https://arxiv.org/abs/2412.04261)
- Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026). Association for Computational Linguistics, San Diego, United States.
- S. Marks and M. Tegmark (2023) The geometry of truth: emergent linear structure in large language model representations of true/false datasets. ArXiv preprint abs/2310.06824. [Link](https://arxiv.org/abs/2310.06824)
- J. Myung, N. Lee, Y. Zhou, J. Jin, R. Putri, D. Antypas, H. Borkakoty, E. Kim, C. Perez-Almendros, A. A. Ayele, et al. (2024) Blend: a benchmark for llms on everyday knowledge in diverse cultures and languages. Advances in Neural Information Processing Systems 37, pp. 78104–78146.
- D. Namazifard and L. G. Poech (2025) Isolating culture neurons in multilingual large language models. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pp. 768–785.
- N. Ousidhoum, J. Myung, C. Perez-Almendros, J. Jin, A. Keleg, M. Beloucif, Y. Zhou, R. Agerri, V. Araujo, N. Baes, J. Barry, J. Boisson, N. F. Chen, C. de Kock, A. Edwards, J. F. de Landa, M. F. Imam, H. Hakami, S. Hsieh, J. M. Imperial, R. K. Lee, C. Lyu, Y. Samih, J. Sjons, B. Tan, A. Ushio, W. Zheng, Z. Liu, A. Oh, and J. Camacho-Collados (2026) SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures. In Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026).
- S. Pawar, J. Park, J. Jin, A. Arora, J. Myung, S. Yadav, F. G. Haznitrama, I. Song, A. Oh, and I. Augenstein (2025) Survey of cultural awareness in language models: text and beyond. Computational Linguistics 51(3), pp. 907–1004.
- N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner (2024) Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 15504–15522. [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.828), [Link](https://aclanthology.org/2024.acl-long.828/)
- D. Romero, C. Lyu, H. A. Wibowo, T. Lynn, I. Hamed, A. N. Kishore, A. Mandal, A. Dragonetti, A. Abzaliev, A. L. Tonja, et al. (2024) CULTURALBENCH: a human-verified benchmark for evaluating cultural knowledge in large language models. ArXiv preprint abs/2410.02677. [Link](https://arxiv.org/abs/2410.02677)
- Y. Tao, O. Viberg, R. S. Baker, and R. F. Kizilcec (2024) Cultural bias and cultural alignment of large language models. PNAS nexus 3(9), pp. pgae346.
- N. Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang (2022) No language left behind: scaling human-centered machine translation. ArXiv preprint abs/2207.04672. [Link](https://arxiv.org/abs/2207.04672)
- Q. Team (2024) Qwen2.5: a party of foundation models. [Link](https://qwenlm.github.io/blog/qwen2.5/)
- Q. Team (2025) Qwen3 technical report. ArXiv preprint abs/2505.09388. [Link](https://arxiv.org/abs/2505.09388)
- Z. Wu, A. Arora, A. Geiger, Z. Wang, J. Huang, D. Jurafsky, C. D. Manning, and C. Potts (2025) Axbench: steering llms? even simple baselines outperform sparse autoencoders. ArXiv preprint abs/2501.17148. [Link](https://arxiv.org/abs/2501.17148)

## Appendix

## Appendix A FLORES Mapping for Shared\-Task Language\-Region Pairs

We map BLEnD language\-region pairs to FLORES language/script identifiers to compute language vectors\. In cases where FLORES does not provide a region\-specific variety, we use the closest available language\-level approximation \(e\.g\., a shared Spanish FLORES code for multiple Spanish\-speaking regions\)\. Table [3](https://arxiv.org/html/2605.23069#A1.T3) provides the complete mapping\.

Table 3: Complete mapping from BLEnD language–region pairs to FLORES language/script identifiers used for language vector computation\. Some mappings are approximations when an exact region\-specific FLORES variety is unavailable\.

## Appendix B FLORES Sample\-Size Convergence Study

We test whether the DiffMean language vectors used in §3\.1 are sensitive to the number of FLORES sentences used for extraction\. For each of the six post\-hoc models \(Qwen2\.5\-72/7B\-Instruct, Qwen3\-32/8B, and Aya\-Expanse\-32/8B\) we re\-estimate vectors for the same 28 FLORES languages at $N \in \{100, 200, \ldots, 1000\}$\. For each language and layer, we compute $v_{N}^{(\ell,l)}$ from the first $N$ FLORES dev sentences, using the same post\-normalization residual\-stream activations and token averaging as in §3\.1\. Since each $N$ is a strict prefix of the $N = 1000$ set, differences from $v_{1000}$ isolate the effect of adding more sentences rather than changing the sample\.

We measure convergence with cosine similarity, $\cos(v_{N}^{(\ell,l)}, v_{1000}^{(\ell,l)})$, for every language–layer–$N$ cell\. Figure [6](https://arxiv.org/html/2605.23069#A2.F6) summarizes the results with per\-language curves, averaged over layers, and a joint median with 25–75% IQR over all language–layer cells\.
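The prefix-based convergence check can be sketched as follows. This is a minimal illustration, not the authors' code: the activations are synthetic Gaussian stand-ins for token-averaged sentence activations, and the helper names `diffmean` and `cosine`, the dimension `d`, and the choice to truncate both the target-language set and the contrast set to the first N rows are all assumptions.

```python
import numpy as np

def diffmean(acts_lang, acts_rest):
    """Unit-norm difference of mean activations (a DiffMean-style direction)."""
    v = acts_lang.mean(axis=0) - acts_rest.mean(axis=0)
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, n_max = 64, 1000

# Synthetic stand-ins for per-sentence activations at one layer (not model data).
shift = rng.normal(size=d)                        # hypothetical language offset
acts_lang = rng.normal(size=(n_max, d)) + shift   # target-language sentences
acts_rest = rng.normal(size=(n_max, d))           # contrast sentences

# Reference direction from all 1000 sentences.
v_full = diffmean(acts_lang, acts_rest)

# Re-estimate from each prefix of size N and compare against the reference.
convergence = {
    n: cosine(diffmean(acts_lang[:n], acts_rest[:n]), v_full)
    for n in range(100, n_max + 1, 100)
}
```

Because every prefix is nested in the full set, `convergence[1000]` is 1 up to floating-point error, and the smaller entries show only the effect of having fewer sentences.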

Across all six models, the joint median is already at least $0.99$ at $N = 100$ and reaches at least $0.999$ by $N = 500$; the IQR is essentially collapsed near $1.0$ from about $N = 300$ onward\. Low\-$N$ outliers are model\-dependent: the worst per\-language layer means at $N = 100$ range from about $0.96$ for Aya models to about $0.86$ for Qwen2\.5\-7B\. Qwen2\.5\-72B does not show worse stability than smaller models, suggesting that greater depth does not require more FLORES data within the tested $[100, 1000]$ range\.

These results indicate that using $1{,}000$ FLORES sentences is conservative for the six models studied here\. Since the downstream intervention uses the unit\-norm direction $v$ with $\beta = 1$, a cosine similarity of $0.99$ corresponds to a steering\-direction change below $\arccos(0.99) \approx 8^{\circ}$, smaller than the layer\-to\-layer variation observed in our steering sweeps\. We therefore do not expect FLORES sample size to be a major source of instability in our reported results\.
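The conversion from cosine similarity to an angle can be checked directly. The helper name `cosine_to_degrees` is ours; the two inputs are the thresholds quoted in this appendix.

```python
import math

def cosine_to_degrees(cos_sim: float) -> float:
    """Angle in degrees between two vectors with the given cosine similarity."""
    return math.degrees(math.acos(cos_sim))

angle_099 = cosine_to_degrees(0.99)    # roughly 8.1 degrees
angle_0999 = cosine_to_degrees(0.999)  # roughly 2.6 degrees
```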

![Refer to caption](https://arxiv.org/html/2605.23069v1/x2.png)

Figure 6: DiffMean vector convergence vs\. FLORES sample size: $\cos(v_{N}, v_{1000})$ as a function of $N$ for all six post\-hoc models over 28 FLORES languages\. Faint grey curves are individual languages, averaged over layers; the bold curve is the joint median over language–layer pairs, with the shaded 25–75% IQR\. The dotted reference line marks $\cos = 0.97$, and the $y$\-axis is zoomed to show sub\-1 variation\.

## Appendix C Post\-processing for SAQ

For SAQ evaluation, we normalize model outputs using simple string cleanup heuristics:

- truncate at <\|end\_of\_text\|\> if present;
- keep only the first line;
- keep text before the first period;
- collapse repeated whitespace;
- remove quotation marks\.

This post\-processing is applied before matching generated answers to the set of human\-annotated acceptable responses\.
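The heuristics above could be implemented roughly as below. This is a sketch under stated assumptions rather than the authors' script: the function name `postprocess_saq` is hypothetical, the steps are applied in the order listed, leading blank lines are skipped before taking the first line, and only double-quote characters are treated as quotation marks so that apostrophes survive.

```python
import re

END_TOKEN = "<|end_of_text|>"
# Assumption: only double-quote characters are stripped; apostrophes are kept.
QUOTES = str.maketrans("", "", "\"“”")

def postprocess_saq(raw: str) -> str:
    """Apply the cleanup heuristics in the order they are listed."""
    text = raw.split(END_TOKEN, 1)[0]         # truncate at the end-of-text marker
    lines = text.lstrip().splitlines()        # assumption: skip leading blank lines
    text = lines[0] if lines else ""          # keep only the first line
    text = text.split(".", 1)[0]              # keep text before the first period
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text.translate(QUOTES).strip()     # remove quotation marks

example = postprocess_saq(
    '  "Rice and  beans". Usually at lunch\nMore text<|end_of_text|>junk'
)
```

One consequence of the "first period" rule is that decimal answers are cut short: under this sketch, `postprocess_saq("3.5 hours")` returns `"3"`.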

## Appendix D Post\-hoc MCQ Analysis Plots

This appendix provides additional post\-hoc analysis plots for all tested models\.

### D\.1 Qwen2\.5\-72B\-Instruct

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen72b_overall_layer_sweep_multi_beta.png)

Figure 7: Post\-hoc overall MCQ layer sweeps for Qwen2\.5\-72B\-Instruct under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable and yields the best overall trade\-off in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_mcq/Qwen__Qwen2p5-72B-Instruct__quen72b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 8: Post\-hoc cross\-prompt MCQ layer sweep for Qwen2\.5\-72B\-Instruct with $\beta = 1$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 2 for the cultural prompt and Layer 3 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen72b_beta1_top_bottom_L2.png)

Figure 9: Top and bottom per\-language MCQ accuracy changes \(steered minus baseline, percentage points\) for Qwen2\.5\-72B\-Instruct at Layer 2 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### D\.2 Qwen2\.5\-7B\-Instruct

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen7b_overall_layer_sweep_multi_beta.png)

Figure 10: Post\-hoc overall MCQ layer sweeps for Qwen2\.5\-7B\-Instruct under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable and yields the best overall trade\-off in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_mcq/Qwen__Qwen2p5-7B-Instruct__quen7b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 11: Post\-hoc cross\-prompt MCQ layer sweep for Qwen2\.5\-7B\-Instruct with $\beta = 1$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 25 for the cultural prompt and Layer 8 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen7b_beta1_top_bottom_L25.png)

Figure 12: Top and bottom per\-language MCQ accuracy changes \(steered minus baseline, percentage points\) for Qwen2\.5\-7B\-Instruct at Layer 25 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### D\.3 Aya Expanse 8B

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/aya8b_overall_layer_sweep_multi_beta.png)

Figure 13: Post\-hoc overall MCQ layer sweeps for Aya Expanse 8B under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable and yields the best overall trade\-off in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_mcq/CohereLabs__aya-expanse-8b__aya8b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 14: Post\-hoc cross\-prompt MCQ layer sweep for Aya Expanse 8B with $\beta = 1$\. The official submission uses the cultural prompt\. Prompt choice affects baseline accuracy but, for this model, yields the same optimal steering layer \(Layer 14 for both the cultural and generic prompts\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/aya8b_beta1_top_bottom_L14.png)

Figure 15: Top and bottom per\-language MCQ accuracy changes \(steered minus baseline, percentage points\) for Aya Expanse 8B at Layer 14 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### D\.4 Aya Expanse 32B

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/aya32b_overall_layer_sweep_multi_beta.png)

Figure 16: Post\-hoc overall MCQ layer sweeps for Aya Expanse 32B under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while matching or improving performance in mid to late layers\. In comparison, $\beta = 1$ remains stable across layers, although $\beta = 5$ yields the best overall accuracy in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_mcq/CohereLabs__aya-expanse-32b__aya32b__beta_5/cross_prompt_overall_layer_sweep.png)

Figure 17: Post\-hoc cross\-prompt MCQ layer sweep for Aya Expanse 32B with $\beta = 5$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 14 for the cultural prompt and Layer 20 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/aya32b_beta1_top_bottom_L23.png)

Figure 18: Top and bottom per\-language MCQ accuracy changes \(steered minus baseline, percentage points\) for Aya Expanse 32B at Layer 23 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### D\.5 Qwen3\-8B

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen3-8b_overall_layer_sweep_multi_beta.png)

Figure 19: Post\-hoc overall MCQ layer sweeps for Qwen3\-8B under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable and yields the best overall trade\-off in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_mcq/Qwen__Qwen3-8B__quen3-8b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 20: Post\-hoc cross\-prompt MCQ layer sweep for Qwen3\-8B with $\beta = 1$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 14 for the cultural prompt and Layer 25 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen3-8b_beta1_top_bottom_L14.png)

Figure 21: Top and bottom per\-language MCQ accuracy changes \(steered minus baseline, percentage points\) for Qwen3\-8B at Layer 14 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### D\.6 Qwen3\-32B

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen3-32b_overall_layer_sweep_multi_beta.png)

Figure 22: Post\-hoc overall MCQ layer sweeps for Qwen3\-32B under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable, although $\beta = 5$ yields the best overall accuracy in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_mcq/Qwen__Qwen3-32B__quen3-32b__beta_5/cross_prompt_overall_layer_sweep.png)

Figure 23: Post\-hoc cross\-prompt MCQ layer sweep for Qwen3\-32B with $\beta = 5$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 23 for the cultural prompt and Layer 55 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_mcq/quen3-32b_beta1_top_bottom_L23.png)

Figure 24: Top and bottom per\-language MCQ accuracy changes \(steered minus baseline, percentage points\) for Qwen3\-32B at Layer 23 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

## Appendix E Post\-hoc SAQ Analysis Plots

This appendix provides additional post\-hoc analysis plots for all tested models\.

### E\.1 Qwen2\.5\-72B\-Instruct

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/quen72b_overall_layer_sweep_multi_beta.png)

Figure 25: Post\-hoc overall SAQ layer sweeps for Qwen2\.5\-72B\-Instruct under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable and yields the best overall trade\-off in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_saq/Qwen__Qwen2p5-72B-Instruct__quen72b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 26: Post\-hoc cross\-prompt SAQ layer sweep for Qwen2\.5\-72B\-Instruct with $\beta = 1$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 8 for the cultural prompt and Layer 7 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/quen72b_beta1_top_bottom_L8.png)

Figure 27: Top and bottom per\-language SAQ accuracy changes \(steered minus baseline, percentage points\) for Qwen2\.5\-72B\-Instruct at Layer 8 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### E\.2 Qwen2\.5\-7B\-Instruct

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/quen7b_overall_layer_sweep_multi_beta.png)

Figure 28: Post\-hoc overall SAQ layer sweeps for Qwen2\.5\-7B\-Instruct under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable and yields the best overall trade\-off in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_saq/Qwen__Qwen2p5-7B-Instruct__quen7b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 29: Post\-hoc cross\-prompt SAQ layer sweep for Qwen2\.5\-7B\-Instruct with $\beta = 1$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 2 for the cultural prompt and Layer 1 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/quen7b_beta1_top_bottom_L2.png)

Figure 30: Top and bottom per\-language SAQ accuracy changes \(steered minus baseline, percentage points\) for Qwen2\.5\-7B\-Instruct at Layer 2 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### E\.3 Aya Expanse 8B

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/aya8b_overall_layer_sweep_multi_beta.png)

Figure 31: Post\-hoc overall SAQ layer sweeps for Aya Expanse 8B under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable and yields the best overall trade\-off in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_saq/CohereLabs__aya-expanse-8b__aya8b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 32: Post\-hoc cross\-prompt SAQ layer sweep for Aya Expanse 8B with $\beta = 1$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 8 for the cultural prompt and Layer 17 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/aya8b_beta1_top_bottom_L8.png)

Figure 33: Top and bottom per\-language SAQ accuracy changes \(steered minus baseline, percentage points\) for Aya Expanse 8B at Layer 8 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### E\.4 Aya Expanse 32B

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/aya32b_overall_layer_sweep_multi_beta.png)

Figure 34: Post\-hoc overall SAQ layer sweeps for Aya Expanse 32B under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially improve performance in early layers, and $\beta = 5$ yields the best overall accuracy in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_saq/CohereLabs__aya-expanse-32b__aya32b__beta_5/cross_prompt_overall_layer_sweep.png)

Figure 35: Post\-hoc cross\-prompt SAQ layer sweep for Aya Expanse 32B with $\beta = 5$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 7 for the cultural prompt and Layer 1 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/aya32b_beta1_top_bottom_L7.png)

Figure 36: Top and bottom per\-language SAQ accuracy changes \(steered minus baseline, percentage points\) for Aya Expanse 32B at Layer 7 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### E\.5 Qwen3\-8B

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/quen3-8b_overall_layer_sweep_multi_beta.png)

Figure 37: Post\-hoc overall SAQ layer sweeps for Qwen3\-8B under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Large steering strengths can substantially degrade performance in early layers, while $\beta = 1$ remains stable and yields the best overall trade\-off in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_saq/Qwen__Qwen3-8B__quen3-8b__beta_1/cross_prompt_overall_layer_sweep.png)

Figure 38: Post\-hoc cross\-prompt SAQ layer sweep for Qwen3\-8B with $\beta = 1$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 3 for the cultural prompt and Layer 7 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/quen3-8b_beta1_top_bottom_L7.png)

Figure 39: Top and bottom per\-language SAQ accuracy changes \(steered minus baseline, percentage points\) for Qwen3\-8B at Layer 7 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

### E\.6 Qwen3\-32B

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/quen3-32b_overall_layer_sweep_multi_beta.png)

Figure 40: Post\-hoc overall SAQ layer sweeps for Qwen3\-32B under different steering strengths ($\beta \in \{1, 3, 5\}$)\. Steering at these strengths shows unstable performance across all layers, while $\beta = 5$ yields the best overall accuracy in our experiments\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_cross_prompt_saq/Qwen__Qwen3-32B__quen3-32b__beta_5/cross_prompt_overall_layer_sweep.png)

Figure 41: Post\-hoc cross\-prompt SAQ layer sweep for Qwen3\-32B with $\beta = 5$\. The official submission uses the cultural prompt\. Prompt choice affects both baseline accuracy and the optimal steering layer \(here, Layer 20 for the cultural prompt and Layer 35 for the generic prompt\)\.

![Refer to caption](https://arxiv.org/html/2605.23069v1/assets/semeval/eval/Qwen2.5-72B-Instruct/figures_main_saq/quen3-32b_beta1_top_bottom_L6.png)

Figure 42: Top and bottom per\-language SAQ accuracy changes \(steered minus baseline, percentage points\) for Qwen3\-32B at Layer 6 with $\beta = 1$ using the cultural prompt\. Steering produces substantially different effects across language\-region pairs, including both strong gains and degradations\.

## Appendix F Random\- vs\. Language\-Vector Steering

This appendix summarizes the random\-vector control used to test whether language\-vector steering effects reflect language\-specific structure or generic activation perturbations\.

#### Setup\.

For each model, layer, prompt, and locale, we compare two unit\-norm steering directions applied with the same intervention

$$\tilde{h}^{(l)} = h^{(l)} + \beta u^{(l)}, \qquad \beta = 1.$$

The language direction is the FLORES DiffMean vector from §3\.1,

$$v_{\ell}^{(l)} = \frac{1}{|D_{\ell}|} \sum_{x \in D_{\ell}} h^{(l)}(x) - \frac{1}{|D_{\neg\ell}|} \sum_{x \in D_{\neg\ell}} h^{(l)}(x),$$

computed from token\-mean post\-normalization residual activations and then L2\-normalized\. The random control samples

$$r_{\ell,k}^{(l)} \sim \mathcal{N}(0, I_{d}), \qquad \hat{r}_{\ell,k}^{(l)} = r_{\ell,k}^{(l)} / \lVert r_{\ell,k}^{(l)} \rVert_{2},$$

with the same dimensionality and norm as the language vector\. Within each draw $k$, the same random vector is reused for all locales sharing the corresponding FLORES language code, mirroring the language\-vector setup\.
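The two directions and the shared intervention can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' implementation: activations are synthetic, the function names are hypothetical, and adding the direction at every token position is our assumption, since this appendix does not specify which positions are steered.

```python
import numpy as np

def unit(v):
    """L2-normalize a vector."""
    return v / np.linalg.norm(v)

def diffmean_direction(acts_lang, acts_rest):
    """Unit-norm difference between the mean activations of D_l and D_not-l."""
    return unit(acts_lang.mean(axis=0) - acts_rest.mean(axis=0))

def random_direction(dim, rng):
    """Unit-norm Gaussian control direction with the same dimensionality."""
    return unit(rng.standard_normal(dim))

def steer(hidden, direction, beta=1.0):
    """Add beta * direction to each row of a (positions, dim) activation array."""
    return hidden + beta * direction

rng = np.random.default_rng(0)
d = 16
# Synthetic token-mean activations per sentence (stand-ins for model activations).
acts_lang = rng.standard_normal((200, d)) + 0.5
acts_rest = rng.standard_normal((300, d))

v = diffmean_direction(acts_lang, acts_rest)                # language vector
random_dirs = [random_direction(d, rng) for _ in range(4)]  # four control draws

hidden = rng.standard_normal((5, d))    # residual stream at one layer
steered = steer(hidden, v, beta=1.0)    # same call is used for each random draw
```

Because both directions have unit norm and share the same `beta`, any difference in downstream accuracy comes from the direction itself rather than from the size of the perturbation.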

#### Bootstrap\.

For the random baseline, we run four independent Gaussian draws and compute

$$\Delta_{\mathrm{random},k} = \mathrm{acc}_{\mathrm{mod},k} - \mathrm{acc}_{\mathrm{base},k}.$$

We then average $\Delta_{\mathrm{random},k}$ over draws and compare it to the single language\-vector delta,

$$\Delta_{\mathrm{language}} = \mathrm{acc}_{\mathrm{lang}} - \mathrm{acc}_{\mathrm{base}}.$$

Both deltas are reported in percentage points\. The comparison is matched at the same model, layer, prompt, locale, and steering strength\.
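For a single matched cell, the comparison reduces to a few lines. The sketch below uses made-up accuracies purely to show the arithmetic; the function names and the representation of each random draw as a `(acc_mod_k, acc_base_k)` pair are assumptions.

```python
from statistics import mean

def delta_pp(acc_mod: float, acc_base: float) -> float:
    """Accuracy change in percentage points, for accuracies given in [0, 1]."""
    return 100.0 * (acc_mod - acc_base)

def compare_deltas(acc_base, acc_lang, random_runs):
    """Return (language delta, mean random delta) for one matched cell.

    random_runs holds one (acc_mod_k, acc_base_k) pair per Gaussian draw k.
    """
    delta_language = delta_pp(acc_lang, acc_base)
    delta_random = mean(delta_pp(mod, base) for mod, base in random_runs)
    return delta_language, delta_random

# Made-up accuracies for one (model, layer, prompt, locale) cell, not paper results.
d_lang, d_rand = compare_deltas(
    acc_base=0.80,
    acc_lang=0.82,
    random_runs=[(0.80, 0.80), (0.81, 0.80), (0.79, 0.80), (0.80, 0.80)],
)
```

In this toy cell the language delta is about 2 points while the four random deltas cancel to roughly zero, which is the pattern the comparison is designed to detect.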

#### Interpretation\.

The Qwen2\.5\-72B comparison in Figure [5](https://arxiv.org/html/2605.23069#S5.F5) shows that averaged random\-vector effects remain concentrated near zero, while language\-vector effects are somewhat more dispersed\. This supports the main\-text conclusion that language\-vector steering is not simply equivalent to a generic Gaussian perturbation, but also does not yield reliably positive gains at the tested layers\.