Hidden Consensus: Preference-Validity Compression in Human Feedback
Summary
This paper argues that standard RLHF's scalarization of human preferences collapses multiple valid interpretations into a single target, mis-measuring alignment in culturally plural societies. Analyzing a Malaysian dataset, they find 79% of prompts have multiple majority-supported responses that single-winner aggregation discards.
# Preference-Validity Compression in Human Feedback
Source: [https://arxiv.org/html/2606.10569](https://arxiv.org/html/2606.10569)
Dorcas Chia Ern Chua¹, Karen Myn Hui Lee¹, Jia Yue Tan¹, Zhen Xue Gue³∘, Norzalena Abdul Hamid¹, Azima Binti Azmi¹, Keat Mei Yeong¹, Aizat Izyani binti Mujab¹, Hafsah Noor Azam¹, Chee Guo Khoo⁴∘, Han Ying Lim¹♢, Chee Seng Chan²♢

¹YTL AI Labs, ²Universiti Malaya, ³Monash University Malaysia, ⁴Universiti Malaysia Sarawak. ∘ Intern (Zhen Xue Gue: intern at Universiti Malaya; Chee Guo Khoo: intern at YTL AI Labs). ♢ Project Lead.
###### Abstract
Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure *Preference-Validity Compression*, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through *preference events* linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures $\arg\max$ acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy *Validity-Preserving Consistency*, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.
## 1 Introduction
Reinforcement learning from human feedback (RLHF) aligns large language models by converting human judgments over model outputs into a scalar reward signal (Christiano et al., [2017](https://arxiv.org/html/2606.10569#bib.bib1); Ouyang et al., [2022](https://arxiv.org/html/2606.10569#bib.bib3)). This scalarization is useful for optimization, but it also creates a measurement problem. Standard reward modeling often treats pooled preference comparisons as observations of a shared latent preference function, with disagreement absorbed as stochastic variation. This assumption is reasonable when disagreement reflects noise or annotation error. It becomes fragile when disagreement reflects more than one valid interpretation of what constitutes an acceptable response.

Figure 1: Preference-Validity Compression. Single-winner aggregation selects “1957” as the reward target, while the preference-event view preserves both “1957” and “1963” as acceptable under different frames.

This distinction matters for pluralistic alignment. Prior work shows that annotator disagreement can encode meaningful semantic, moral, and cultural variation rather than error (Aroyo and Welty, [2015](https://arxiv.org/html/2606.10569#bib.bib4); Sorensen et al., [2024](https://arxiv.org/html/2606.10569#bib.bib12)). Other work shows that aggregated reward models can converge toward dominant preference patterns and encode narrow annotator populations as if they were universal preferences (Casper et al., [2023](https://arxiv.org/html/2606.10569#bib.bib13); Santurkar et al., [2023](https://arxiv.org/html/2606.10569#bib.bib14)). We study this as an alignment-measurement problem. The source of the failure is not disagreement itself, but the aggregation step that turns heterogeneous acceptability judgments into one reward target. When feedback is reduced to a single target, the system does not only summarize human preference. It determines which acceptable responses remain visible to optimization.
We call this failure *Preference-Validity Compression*. It occurs when multiple acceptable response options are collapsed into one dominant optimization target. Figure [1](https://arxiv.org/html/2606.10569#S1.F1) illustrates the idea using the question of Malaysia’s independence. The 1957 frame reflects the independence of the Federation of Malaya, while the 1963 frame reflects the formation of Malaysia, which at the time comprised Malaya, Sabah, Sarawak, and Singapore. A single reward target may preserve one frame while suppressing another frame that remains historically meaningful.
We use Malaysia as a diagnostic setting because it is structurally plural\. Disagreement over acceptable model responses may reflect coexisting cultural, historical, linguistic, religious, and regional interpretive frames rather than random annotator noise\. The category “Malaysian users” should not be assumed to correspond to a single preference signal\. Our goal is not to estimate national preference distributions or train a reward model, but to test whether single\-winner aggregation can conceal supported alternatives within a structurally plural setting\.
We analyze human feedback as*preference events*, context\-dependent judgments formed by a prompt, a model response, and an evaluator’s acceptability decision mediated by intersecting cultural, historical, linguistic, and normative frames\. In our diagnostic study, 20 Malaysian participants judged three responses per prompt as acceptable or not acceptable across 107 trio\-annotated prompts\. This elicitation design allows plural acceptability to surface prior to aggregation and enables us to test whether the supported response set is non\-singleton at the event level\.
The results show that this measurement issue is observable in practice. Across 321 preference events, 79% of prompts contain at least one additional response that reaches the majority acceptance threshold but would be discarded by $\arg\max$ aggregation. Moreover, single-winner aggregation overstates the dominance of the winning response: gaps between top responses diminish substantially when all majority-supported responses are considered. At the annotator level, acceptance behavior also does not conform to a stable one-response-per-prompt pattern, suggesting that evaluators frequently treat multiple responses as simultaneously valid rather than mutually exclusive. Qualitative hidden-consensus examples further show that discarded responses can reflect coherent local, practical, or cultural frames.
These findings reframe pluralistic alignment as a measurement\-validity problem where the aggregation protocol itself becomes the hidden bottleneck when it collapses plural\-valid interpretations into a single reward target\. We argue that future alignment methods should satisfy*Validity\-Preserving Consistency*\. Alignment should remain stable across plural\-valid interpretive frames rather than collapsing them into a dominant reward target\.
This paper makes three contributions\.
- We identify a measurement failure in RLHF-style feedback aggregation, where a single reward target can measure $\arg\max$ acceptability while being interpreted as plural alignment.
- We formalize this failure as *Preference-Validity Compression* and introduce *preference events* as the diagnostic unit linking acceptability judgments to prompts, responses, interpretive frames, and evaluator context.
- Using Malaysia as a diagnostic setting, we provide empirical evidence that plural acceptability persists at both the response-set and annotator levels, causing single-winner aggregation to mis-measure the breadth and structure of supported responses in this corpus.
## 2 Related Work
### 2.1 Annotator Disagreement and Preference Aggregation
NLP research has challenged the view that annotator disagreement is merely measurement error. Disagreement can reflect semantic ambiguity, subjective interpretation, and value pluralism rather than noise (Aroyo and Welty, [2015](https://arxiv.org/html/2606.10569#bib.bib4); Basile et al., [2021](https://arxiv.org/html/2606.10569#bib.bib5)). For RLHF, aggregating plural-valid disagreement determines which interpretations remain visible to the learning system.
Standard RLHF pipelines typically collect human preference comparisons, fit a reward model to pooled judgments, and optimize against the learned scalar reward. Recent work challenges this single-reward assumption. PRISM shows that alignment preferences are subjective and context-dependent across 1,500 participants from 75 countries (Kirk et al., [2024](https://arxiv.org/html/2606.10569#bib.bib9)). Heterogeneous RLHF and MaxMin-RLHF study personalization, aggregation, and group-aware objectives for diverse preferences (Park et al., [2024](https://arxiv.org/html/2606.10569#bib.bib2); Chakraborty et al., [2024](https://arxiv.org/html/2606.10569#bib.bib6)). Operationalizing Pluralistic Values further shows that demographic composition, disagreement handling, rating scales, and optimization choices can change learned model behavior (Ali et al., [2026](https://arxiv.org/html/2606.10569#bib.bib10)). Prior work on annotation aggregation shows that majority voting can under-represent sociodemographic groups and embed representational bias into downstream models (Prabhakaran et al., [2021](https://arxiv.org/html/2606.10569#bib.bib7)), and social-choice perspectives treat preference aggregation as a normative design choice rather than a neutral step (Conitzer et al., [2024](https://arxiv.org/html/2606.10569#bib.bib11)). These works establish that preference heterogeneity matters for alignment, but diversity-aware alignment can still become non-plural if it ultimately collapses acceptable alternatives into one selected target. What remains unaddressed is what the aggregation target actually measures when multiple responses are acceptable for the same prompt.
### 2.2 Cross-Cultural Alignment and Malaysian Representation
Concerns about narrow preference representation build on critiques of WEIRD bias, where narrow populations are often treated as default human subjects (Henrich et al., [2010](https://arxiv.org/html/2606.10569#bib.bib15)). Recent alignment work argues for broader community participation (Mihalcea et al., [2025](https://arxiv.org/html/2606.10569#bib.bib16)). For RLHF, however, inclusion alone is insufficient. A diverse annotator pool can still produce a non-plural reward model if disagreement is collapsed into one scalar target.
Empirical work shows that LLM behavior often reflects dominant cultural frames. Multilingual LLMs may produce responses closer to English-speaking cultural contexts even when queried in other languages, and this tendency can increase after RLHF-style alignment (Wang et al., [2024](https://arxiv.org/html/2606.10569#bib.bib17)). From our perspective, this pattern arises only partly from issues in pretraining coverage. When the reward target privileges one dominant frame, other locally valid interpretations are rendered invisible. The deeper alignment risk thus extends beyond issues of under-representation, as supported non-winner interpretations may disappear entirely when aggregation collapses plural acceptability into a single optimization target.
Recent Southeast Asian benchmarks show that Malaysian linguistic and cultural representation remains challenging. MalayMMLU studies Malaysian knowledge evaluation in Bahasa Melayu (Poh et al., [2024](https://arxiv.org/html/2606.10569#bib.bib18)), while MyCulture examines Malaysian cultural knowledge and the risk of treating nations as homogeneous cultural units (Hew et al., [2025](https://arxiv.org/html/2606.10569#bib.bib19)). These benchmarks ask whether models know Malaysian facts; we ask whether human feedback can preserve plural Malaysian acceptability judgments. A model may know the relevant facts while still being aligned to only one dominant evaluative frame, a failure the benchmarking literature does not currently measure.
## 3 Formalizing Preference-Validity Compression
Let $x$ denote a prompt and let $Y_x = \{y_1, \ldots, y_m\}$ denote the candidate responses. Let $u_i(x, y) \in \{0, 1\}$ indicate whether evaluator $i$ judges response $y$ acceptable for $x$. Because evaluators may accept multiple responses, $u_i(x, y)$ is not a forced-choice ranking. Standard scalar aggregation estimates acceptability as

$$A(x, y) = \frac{1}{n} \sum_{i=1}^{n} u_i(x, y), \tag{1}$$

where $n$ is the number of evaluators, and a single-winner objective selects

$$y^{*} \in \arg\max_{y \in Y_x} A(x, y). \tag{2}$$

When multiple responses tie, $y^{*}$ denotes one selected winner under an arbitrary or implementation-specific tie-breaking rule.
Pluralistic alignment requires a different measurement object. Let $V(x) \subseteq Y_x$ denote the set of plural-valid responses for $x$. A response belongs to $V(x)$ if it is acceptable under a legitimate cultural, historical, linguistic, regional, or normative frame, even when it is not the top-ranked response. Thus, $V(x)$ is a validity set, not a frequency ranking. In our empirical study, we do not observe $V(x)$ directly. We diagnose it through majority-supported acceptability patterns and qualitative justifications.
We define *Preference-Validity Compression* as the failure mode in which scalar aggregation replaces the plural-valid set $V(x)$ with the single-winner target $y^{*}$. The failure is not that a lower-frequency response is always valid. The failure is that scalar aggregation alone cannot distinguish an invalid non-winner from a valid non-winner. A response may disappear from the optimization target not because it is invalid, but because the aggregation operator retains only one selected response.
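The definitions above can be made concrete with a minimal sketch. It computes $A(x, y)$ as in Equation (1), picks a single winner as in Equation (2), and contrasts that winner with the set of majority-supported responses, which is only an observable proxy for $V(x)$ and not $V(x)$ itself. The function names, the vote values, and the extra "other" response are illustrative assumptions, not part of the study data.

```python
from fractions import Fraction


def acceptability(votes):
    """Equation (1): A(x, y) is the mean of the binary judgments u_i(x, y)."""
    return {y: Fraction(sum(u), len(u)) for y, u in votes.items()}


def single_winner(scores):
    """Equation (2): one argmax response; ties fall to dict order (arbitrary)."""
    return max(scores, key=scores.get)


def majority_supported(scores, threshold=Fraction(2, 3)):
    """Proxy for V(x): responses accepted by at least `threshold` of evaluators."""
    return {y for y, a in scores.items() if a >= threshold}


# Hypothetical judgments from three evaluators (assumed values, not study data).
votes = {"1957": [1, 1, 1], "1963": [1, 1, 0], "other": [0, 0, 1]}

scores = acceptability(votes)
winner = single_winner(scores)
supported = majority_supported(scores)
hidden = supported - {winner}  # majority-supported responses the winner hides

print(winner, sorted(supported), sorted(hidden))
```

In this toy case the single-winner target keeps "1957" only, while "1963" also meets the two-out-of-three threshold and is dropped from the optimization target.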
###### Proposition 1\.
If a prompt admits more than one plural-valid response, a single-winner majority objective cannot preserve plural validity as a set. It maps $V(x)$ to one response and omits valid non-winner responses from the optimization target.
The proof is given in Appendix [A](https://arxiv.org/html/2606.10569#A1). The proposition identifies a measurement limitation rather than a claim that every non-winner response is valid. A single scalar ranking does not reveal whether a non-winner response is invalid or valid under another frame. Our empirical study tests this failure in Section [6](https://arxiv.org/html/2606.10569#S6) through three signals: accepted-response multiplicity, majority-compression loss, and non-fixed participant modes.
## 4 Malaysia as a Diagnostic Context
We use Malaysia as a diagnostic context for Preference-Validity Compression because its structural plurality challenges the assumption that pooled population feedback can be reduced to a single preference signal. Malaysian society is commonly described through broad ethnolinguistic categories: "Malay", "Chinese", "Indian", and "Other", the last of which encompasses diverse indigenous communities across Peninsular Malaysia, Sabah, and Sarawak (Department of Statistics Malaysia, [2026](https://arxiv.org/html/2606.10569#bib.bib20)). These categories are historically constructed, but remain socially and institutionally salient in education, economic organisation, language policy, and everyday interaction (Hirschman, [1986](https://arxiv.org/html/2606.10569#bib.bib21); Shamsul, [1996](https://arxiv.org/html/2606.10569#bib.bib23), [2009](https://arxiv.org/html/2606.10569#bib.bib24); Ting, [2014](https://arxiv.org/html/2606.10569#bib.bib22)).
We do not argue that each group has a fixed preference profile, nor do we aim to estimate population-level preference distributions. Rather, Malaysia is a useful diagnostic context because disagreement reflects coexisting, legitimate interpretive frames, not random annotator noise. This aligns with descriptions of Malaysia as a “state in stable tension”, where social order is maintained through negotiation among competing social and political claims instead of settled consensus (Shamsul, [2009](https://arxiv.org/html/2606.10569#bib.bib24)). Thus, feedback aggregation using a single reward target does not recover any hidden national preference but instead determines which interpretive frame remains visible to optimization.
The Merdeka Center’s 2024 National Youth Survey illustrates this structure. The national sample is nearly evenly divided on the issue of ethnic-based Bumiputera privileges versus equal treatment across race and religion, with 49% supporting the former and 48% favouring the latter (Merdeka Center, [2024](https://arxiv.org/html/2606.10569#bib.bib25)). Beneath this near-parity, however, are divergent regional and ethnic patterns: 73% of Malay youth respondents support continued Bumiputera privileges while 65% of East Malaysian respondents support equal treatment (Merdeka Center, [2024](https://arxiv.org/html/2606.10569#bib.bib25)). The point is not that ethnicity or region deterministically predicts preference. Rather, value-laden judgments shaped by different cultural, historical, regional, and normative frames can remain socially intelligible within the same national context.
## 5 Study Design
We present a controlled diagnostic study of model response acceptability in Malaysia. The study tests whether locally relevant prompts can admit multiple acceptable responses, and whether single-winner aggregation would suppress majority-supported non-winner responses. It does so by diagnosing the feedback-elicitation and aggregation step that precedes reward optimization.
### 5.1 Participants and Prompt Assignment
We recruited participants through an online registration form and selected them to provide coverage across Malay, Chinese, Indian, and indigenous groups from Sabah and Sarawak, and across age bands from 18-24 to 65 and above. Participants were compensated in accordance with Malaysian minimum wage requirements. Participants provided informed consent for their responses to be used in this research. Full details on participant recruitment can be found in Appendix [B](https://arxiv.org/html/2606.10569#A2).
Each prompt was assigned to three participants, allowing a minimal majority judgment of two out of three\. Assignment was random rather than demographically stratified within prompt groups\.
Figure 2: What majority aggregation retains compared with the full set of $\geq 2/3$ majority-supported responses across 107 trio-annotated prompts. Each column is one prompt. Dark cells indicate the $\arg\max$ response a single-winner reward model would optimize toward. Orange cells indicate additional responses reaching $\geq 2/3$ acceptance that majority aggregation discards as hidden consensus. Overall, 85 of 107 prompts, or 79%, contain at least one such discarded response.
### 5.2 Prompt and Response Construction
Prompts were grounded in Malaysian online discourse to avoid relying only on researcher assumptions about local disagreement. We systematically collected and processed content from four Malaysian digital platforms (Lowyat.net, Reddit, Threads, and YouTube podcast channels), producing 692,919 utterances across 26,463 source files after segmentation, filtration, and deduplication. Recurring topics from this corpus were used to seed prompt generation with DeepSeek-V4-Flash. The prompts cover value-laden situations where cultural, religious, linguistic, or regional background may plausibly affect what counts as an acceptable response. The initial corpus contained 371 prompts before quality filtering. Full details are provided in Appendix [D](https://arxiv.org/html/2606.10569#A4).

Figure 3: $\arg\max$ compression hides plural acceptability. Panels: (a) Ranking reversal; (b) Discarded support. (a) Under single-winner $\arg\max$, response $A$ appears dominant with 57 prompt-level wins, followed by $B$ with 36 and $C$ with 14. Under the majority-threshold view, $A$ and $B$ are mostly tied at 79-80 prompts, while $C$ remains broadly supported in 73 prompts. (b) At the acceptance-count level, total support ranks responses as $B > A > C$, and many accepted responses are discarded by single-winner aggregation.

For each prompt, three candidate responses were generated using DeepSeek-V4-Flash, GPT-5-Nano, and Gemini-3.1-Flash-Lite-Preview. Responses were intended to represent agreement, disagreement, and balanced positions, but this intended stance scheme was not always realized. We therefore treat response positions $A$, $B$, and $C$ as empirical candidates rather than controlled stance conditions. Model identities were not disclosed to participants. More details are included in Appendix [E](https://arxiv.org/html/2606.10569#A5).
### 5.3 Acceptability Elicitation and Filtering
Instead of forced\-choice preference selection, participants rated each of the three responses independently as acceptable or not acceptable and provided a brief written justification for each rating\. This format allows participants to accept exactly one response, accept more than one response, or reject all responses\. By removing the forced\-choice constraint, the design makes Preference\-Validity Compression diagnosable\. If plural acceptability cannot appear in the data, it cannot be measured\.
After data collection, we applied two filters\. First, prompts were removed when the scenario was internally inconsistent or factually incoherent\. Second, prompts without complete responses from all three assigned participants were removed, ensuring that every retained prompt supported a two\-out\-of\-three majority judgment\. The final analysable dataset comprises 20 participants, 107 prompts, 321 preference events, and 963 response\-level ratings\.
## 6 Results and Analysis
We analyze $N = 321$ preference events from 20 participants across 107 trio-annotated prompts. Each event records one participant’s acceptability decision over three candidate responses, expressed as a binary selection vector over $\{A, B, C\}$. Across 963 possible response-level ratings, participants marked 625 responses as acceptable. The analysis applies the three diagnostics introduced in Section 3: accepted-response multiplicity, majority-compression loss, and non-fixed participant modes. The goal is not to simulate a full RLHF pipeline, but to examine whether majority-supported acceptability distributes across multiple responses in ways that a single-winner reward target would compress.
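One way to picture the unit of analysis is a small record type holding a binary selection vector over the three responses. The sketch below is illustrative only: the class name, field names, and identifiers are assumptions introduced here, and the closing arithmetic simply re-derives the event and rating counts reported above from 107 prompts with three participants each.

```python
from dataclasses import dataclass

RESPONSES = ("A", "B", "C")  # candidate response positions per prompt


@dataclass(frozen=True)
class PreferenceEvent:
    """One participant's acceptability decision over the three responses."""
    prompt_id: str       # hypothetical identifier
    participant_id: str  # hypothetical identifier
    accepted: tuple      # binary selection vector over (A, B, C)

    def accepted_labels(self):
        """Labels of the responses this participant marked acceptable."""
        return [r for r, a in zip(RESPONSES, self.accepted) if a]


# A hypothetical event in which the participant accepts A and B but not C.
event = PreferenceEvent("p001", "u07", (1, 1, 0))
print(event.accepted_labels())

# Corpus bookkeeping: 107 prompts, each judged by three participants.
n_prompts, raters_per_prompt = 107, 3
n_events = n_prompts * raters_per_prompt
n_ratings = n_events * len(RESPONSES)
print(n_events, n_ratings)
```

Storing the full vector rather than a single chosen label is what lets multiple acceptances per event remain visible before any aggregation is applied.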
### 6.1 Accepted-Response Multiplicity
The first diagnostic asks whether a prompt admits more than one acceptable response. Under a single-preference latent utility assumption, acceptability should concentrate around one $y^{*}$ per prompt, with other acceptances treated as residual variation. A plural-acceptability view predicts that more than one response may receive substantial support when responses instantiate coexisting interpretive frames.
For each prompt, we count the number of responses accepted by at least two of three participants, the minimum majority threshold in this annotation design. Figure [2](https://arxiv.org/html/2606.10569#S5.F2) contrasts the $\arg\max$ response retained by single-winner aggregation with the full set of responses reaching $\geq 2/3$ acceptance. We call additional majority-supported non-winners *hidden consensus*.
Across 107 prompts, 85 prompts, or 79.4%, admit two or more responses reaching $\geq 2/3$ acceptance. Only 21 prompts, or 19.6%, exhibit a single majority-supported response, and one prompt, or 0.9%, produces no majority-supported response. Thus, the corpus is not characterized by sparse or noisy consensus. It is characterized by coexisting consensus. The reduction $y^{*} = \arg\max_{y} A(x, y)$ retains only one response and discards other responses that meet the same majority threshold. In most prompts, the empirically supported acceptability set is non-singleton and cannot be recovered from $y^{*}$ alone.
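The multiplicity count described above can be sketched as a tally over prompts. The toy corpus below is invented for illustration and the function names are assumptions; only the final percentage check uses the counts reported in the text (85, 21, and 1 out of 107).

```python
from collections import Counter


def supported_count(events, min_votes=2):
    """Number of responses accepted by at least `min_votes` participants."""
    totals = [sum(column) for column in zip(*events)]  # per-response vote totals
    return sum(1 for t in totals if t >= min_votes)


def multiplicity_profile(corpus, min_votes=2):
    """Classify each prompt by how many responses reach the majority threshold."""
    profile = Counter()
    for events in corpus.values():
        k = supported_count(events, min_votes)
        label = "multiple" if k >= 2 else "single" if k == 1 else "none"
        profile[label] += 1
    return profile


# Hypothetical corpus: each prompt has three selection vectors over (A, B, C).
corpus = {
    "p1": [(1, 1, 0), (1, 1, 1), (0, 1, 0)],  # A:2, B:3, C:1 -> two supported
    "p2": [(1, 0, 0), (1, 0, 0), (0, 0, 1)],  # A:2, B:0, C:1 -> one supported
    "p3": [(1, 0, 0), (0, 1, 0), (0, 0, 1)],  # one vote each -> none supported
}
profile = multiplicity_profile(corpus)
print(dict(profile))

# Shares implied by the reported prompt counts.
reported = {"multiple": 85, "single": 21, "none": 1}
shares = {k: round(100 * v / 107, 1) for k, v in reported.items()}
print(shares)
```

Prompts labeled "multiple" are exactly those with hidden consensus under a single-winner rule, since at most one of their supported responses can be retained.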
### 6.2 Majority-Compression Loss
The second diagnostic asks what is lost when acceptability is reduced to a single $\arg\max$ target. Figure [3](https://arxiv.org/html/2606.10569#S5.F3) shows that the loss is not only quantitative but interpretive: single-winner aggregation changes which response appears most supported.
At the prompt level, Figure [3(a)](https://arxiv.org/html/2606.10569#S5.F3.sf1) shows that single-winner aggregation changes the apparent support structure. Under $\arg\max$ aggregation, response $A$ appears dominant, serving as the single-winner response in 57 prompts, followed by $B$ in 36 prompts and $C$ in 14 prompts. However, when every response reaching the $\geq 2/3$ majority acceptance threshold is counted, $A$ and $B$ reach majority support in 79 and 80 prompts respectively, while $C$ reaches majority support in 73 prompts. Thus, the $\arg\max$ view makes $A$ appear clearly dominant, whereas the majority-threshold view reveals a much flatter support structure. $A$ and $B$ are mostly tied, and $C$ is far less marginal than it appears under single-winner aggregation.

Table 1: Examples of hidden consensus discarded by single-winner aggregation. Each hidden response reached the majority threshold but would be removed by $\arg\max$ aggregation.

Table 2: Robustness checks showing that hidden consensus is not driven by a single threshold, language subset, topic domain, response position, or participant habit.

At the acceptance-count level, Figure [3(b)](https://arxiv.org/html/2606.10569#S5.F3.sf2) decomposes total acceptances into support retained by $\arg\max$ and support discarded by single-winner aggregation. Response $B$ receives the highest total support with 220 acceptances, followed by $A$ with 212 and $C$ with 193. Yet only 99 of the acceptances for $B$, 148 of those for $A$, and 39 of those for $C$ map onto the single-winner signal. Equivalently, the discarded fraction is 55% for $B$, 30% for $A$, and 80% for $C$. Response $C$, which appears marginal under $\arg\max$, loses four out of every five acceptances to compression.
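The discarded fractions follow directly from the reported counts. As a quick arithmetic check, the sketch below recomputes them; the variable names are illustrative, and the totals and retained counts are the ones stated in the paragraph above.

```python
# Total acceptances per response and the portion retained by the argmax signal.
total = {"A": 212, "B": 220, "C": 193}
retained = {"A": 148, "B": 99, "C": 39}

# Discarded fraction = (total - retained) / total for each response.
discarded_fraction = {
    r: (total[r] - retained[r]) / total[r] for r in total
}

for r in sorted(total):
    print(r, total[r] - retained[r], f"{discarded_fraction[r]:.1%}")

# The per-response totals should add up to all acceptances in the corpus.
print(sum(total.values()))
```

The three totals sum to the 625 accepted ratings reported for the corpus, so the decomposition accounts for every acceptance.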
This is the measurement-validity consequence formalized in Proposition 1. The $\arg\max$ reward signal does not merely under-weight weak minority responses. It discards majority-supported alternatives and can invert which responses appear most broadly endorsed. The failure is therefore not only loss of diversity, but loss of measurement validity, where majority aggregation measures $\arg\max$ acceptability, not plural alignment.
To check that hidden consensus is not merely indiscriminate acceptance, we inspect examples where a non-winner response reached the $\geq 2/3$ majority threshold but was discarded by $\arg\max$ aggregation. Table [1](https://arxiv.org/html/2606.10569#S6.T1) shows that these discarded responses can correspond to coherent local, practical, or cultural frames. The examples are illustrative rather than exhaustive, but they show what single-winner aggregation removes at the level of interpretable response frames.
The robustness checks in Table [2](https://arxiv.org/html/2606.10569#S6.T2) support the interpretation that hidden consensus is a broad measurement pattern rather than a single-threshold or single-response artifact.
### 6.3 Non-Fixed Participant Modes
The third diagnostic asks if acceptability behaves as a fixed user-level trait or as a context-dependent preference event. Under a fixed single-selection view, participants should usually accept only one response per prompt. Instead, multi-response selection is the dominant empirical pattern.
Figure [4](https://arxiv.org/html/2606.10569#S7.F4) shows that plural acceptability is visible at the participant-event level, not only after prompt-level aggregation. Across 321 events, 225 events, or 70.1%, involve accepting more than one response. Two-response patterns account for 140 events, accepting all three responses accounts for 85 events, and single-response patterns account for only 90 events. This supports the preference-event view, where participants often judged more than one response acceptable for the same prompt.
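The event-level pattern distribution can be tallied by counting how many responses each selection vector accepts. In the sketch below, the toy events and function name are assumptions for illustration; the reported counts are those given in the text and in the Figure 4 caption, including the 6 events in which all responses were rejected.

```python
from collections import Counter


def pattern_counts(events):
    """Count events by how many of the three responses were accepted (0 to 3)."""
    return Counter(sum(e) for e in events)


# Hypothetical selection vectors over (A, B, C).
toy_events = [(1, 1, 0), (1, 1, 1), (0, 1, 0), (0, 0, 0)]
toy_counts = pattern_counts(toy_events)
print(dict(toy_counts))

# Reported counts, keyed by number of accepted responses.
reported = {0: 6, 1: 90, 2: 140, 3: 85}
n_events = sum(reported.values())
shares = {k: round(100 * v / n_events, 1) for k, v in reported.items()}
multi_share = round(100 * (reported[2] + reported[3]) / n_events, 1)
print(n_events, shares, multi_share)
```

The four categories partition all 321 events, and the two multi-response categories together give the 70.1% share cited above.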
A natural objection is that multi-selection reflects indiscriminate acceptance. We test this in Appendix [G](https://arxiv.org/html/2606.10569#A7). Participant-level selection breadth ranges from 1.33 to 2.69 responses per prompt, while consensus alignment rates remain high, ranging from 71% to 100%. We find no detectable linear association between selection breadth and consensus alignment rate, with Pearson $r = -0.09$ and $p = 0.71$. Participants who accept more responses are therefore not systematically landing on less-supported responses. They are surfacing additional majority-supported responses that single-winner aggregation leaves behind.
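For readers who want to reproduce this kind of check on their own data, the following sketch computes a Pearson correlation from scratch and the corresponding $t$ statistic. It is not the analysis from Appendix G: the participant values are invented, and treating the reported correlation as computed over all 20 participants (so 18 degrees of freedom) is an assumption made here. Converting $t$ to a $p$-value requires the Student $t$ distribution, which is left out to keep the example dependency-free.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n paired observations (n - 2 df)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical per-participant values (assumed, not study data):
# mean responses accepted per prompt, and consensus alignment rate.
breadth = [1.33, 1.80, 2.10, 2.40, 2.69]
alignment = [0.90, 0.71, 1.00, 0.85, 0.88]
print(round(pearson_r(breadth, alignment), 3))

# t statistic implied by the reported r, assuming n = 20 participants.
t = t_statistic(-0.09, 20)
print(round(t, 3))
```

A $t$ statistic this close to zero is in line with the text's reading that no linear association is detectable.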
Taken together, the three diagnostics show that Malaysian human feedback in this corpus cannot be faithfully represented as a single per-prompt majority preference. Majority-supported plural acceptability arises in 79% of prompts. Compression under $\arg\max$ silences four out of every five acceptances for the response that appears marginal under majority aggregation. Multi-selection is the modal annotation behavior, with no detectable consensus-support penalty for participants who accept multiple responses. Majority aggregation therefore measures the most frequently selected response, not plural alignment. RLHF-style feedback aggregation in structurally plural settings should preserve $V(x)$ as a set rather than collapse it into $y^{*}$.
## 7 Discussion
Our results show that Preference-Validity Compression is observable in a diagnostic human evaluation corpus. The central implication is not simply that annotators disagree. It is that RLHF-style feedback aggregation can change what construct is being measured, in this case from plural acceptability to single-winner acceptability.
#### Plural acceptability is the modal empirical pattern\.
The 79% multiplicity rate shows that, for most prompts in this corpus, a single\-winner reward objective would discard at least one response that met the same majority threshold as the retained winner\. Multi\-selection was also the modal annotation behavior across 321 preference events\. Participants who accepted more responses were not accepting less consensus\-supported ones; they were surfacing additional responses that also commanded majority support\. Plural acceptability is therefore not a marginal deviation around a dominant signal in this corpus\. It is the dominant empirical pattern\.
Figure 4:Selection\-pattern distribution across 321 preference events\. Multi\-response selection is the modal behavior\. Two\-response selections account for 140 events, or 43\.6%, while accepting all three responses accounts for 85 events, or 26\.5%\. Single\-response selection accounts for 90 events, or 28\.0%, and rejecting all responses accounts for 6 events, or 1\.9%\.
#### Majority aggregation changes the observed support structure\.
Compression loss \(Section[6](https://arxiv.org/html/2606.10569#S6)\) shows that majority aggregation does not merely discard weak minority signals\. It can also discard majority\-supported non\-winner responses and change which responses appear most broadly supported\. Response $C$, which appears marginal under $\arg\max$ aggregation, loses four out of every five acceptances to compression\. A single\-winner reward target based on this signal will therefore misrepresent the support structure rather than merely simplify it\.
#### Preference\-Validity Compression is not solved by sampling alone\.
A larger or more demographically stratified sample may refine the estimated distribution, but it does not by itself solve the measurement problem\. The same participants produce different selection patterns across prompts, and selection breadth shows no detectable linear association with consensus alignment rate\. The same applies to demographic diversification\. A diverse annotator pool can still produce a non\-plural reward model if disagreement is averaged away rather than preserved \(Casper et al\., [2023](https://arxiv.org/html/2606.10569#bib.bib13); Prabhakaran et al\., [2021](https://arxiv.org/html/2606.10569#bib.bib7)\)\. The problem is therefore not only who is in the pool, but what the aggregation operator does to their judgments\.
#### Forced\-choice formats can manufacture the convergence RLHF assumes\.
Our free\-acceptance format makes plural acceptability detectable\. Standard forced\-choice formats may not merely measure preference convergence; they may also produce it\. When participants must select one response, the data will always contain a winner, and the elicitation instrument that produced this convergence disappears inside the reward signal\. This does not invalidate RLHF as an optimization procedure, but it raises a prior measurement question about what its reward signal represents\.
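The contrast between the two elicitation formats can be illustrated with a toy example\. The sketch below is hypothetical: the three annotators, their acceptable sets, and their first choices are invented for illustration and are not drawn from the study corpus\. It records the same judgments once under forced choice, where only the first choice survives, and once under free acceptance, where every acceptable response is counted\.

```python
from collections import Counter

# Hypothetical annotators: each has a set of acceptable responses and one
# first choice. These values are illustrative, not data from the study.
annotators = [
    {"acceptable": {"A", "B"}, "first_choice": "A"},
    {"acceptable": {"A", "B"}, "first_choice": "A"},
    {"acceptable": {"B", "C"}, "first_choice": "B"},
]

# Forced choice: the instrument records exactly one response per annotator.
forced = Counter(a["first_choice"] for a in annotators)

# Free acceptance: the instrument records every response marked acceptable.
free = Counter(y for a in annotators for y in sorted(a["acceptable"]))

# Forced choice always yields a winner; free acceptance can yield a set.
forced_winner = forced.most_common(1)[0][0]
majority = len(annotators) / 2
free_majority_set = {y for y, c in free.items() if c > majority}

print("forced-choice counts:", dict(forced))
print("free-acceptance counts:", dict(free))
print("forced-choice winner:", forced_winner)
print("majority-supported under free acceptance:", sorted(free_majority_set))
```

In this toy case, response B is acceptable to all three annotators yet looks like a minority option under forced choice, which is the kind of convergence the instrument itself can produce\.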
#### Implications for pluralistic alignment\.
Our contribution is diagnostic\. We identify Preference\-Validity Compression as a measurement\-validity failure rather than propose a replacement algorithm\. Existing approaches such as MaxMin\-RLHF and personalized RLHF address preference heterogeneity by changing the optimization target or modeling user variation \(Chakraborty et al\., [2024](https://arxiv.org/html/2606.10569#bib.bib6); Park et al\., [2024](https://arxiv.org/html/2606.10569#bib.bib2); Chen et al\., [2024](https://arxiv.org/html/2606.10569#bib.bib8)\)\. These directions are important, but they do not by themselves guarantee validity preservation\. A method can model diverse preferences while still selecting a single response per prompt\. Our results suggest that the bottleneck is not only who is in the annotator pool but what the aggregation operator does to their judgments\. Pluralistic alignment should therefore preserve $V(x)$ as the set of responses valid under different interpretive frames, rather than collapse it into a single target $y^{\ast}$, a property we term *Validity\-Preserving Consistency*\.
#### Fairness implication\.
From a fairness perspective, the harm is not only that diversity is reduced\. When hidden consensus is discarded, valid minority or alternative frames are removed from the learning signal and become invisible to optimization\. The resulting reward target may then reinforce dominant\-frame responses while treating less frequent but still valid frames as lower\-quality, less preferred, or less aligned\.
## 8 Conclusion
This paper reframes RLHF\-style feedback aggregation as a measurement\-validity problem\. In structurally plural settings, disagreement may reflect plural\-valid interpretations rather than annotation noise, yet scalar aggregation can collapse these interpretations into one optimization target\. We call this failure *Preference\-Validity Compression*\. Using Malaysia as a diagnostic setting, we show that plural acceptability can appear before aggregation\. Across 321 preference events from 20 participants and 107 trio\-annotated prompts, 79% of prompts contained more than one majority\-supported response, multi\-selection was the most frequent annotation behavior, and $\arg\max$ aggregation inverted the observed support ranking\. These findings suggest that majority aggregation in this corpus measures $\arg\max$ acceptability rather than plural alignment\. Future alignment methods should therefore satisfy *Validity\-Preserving Consistency*, preserving $V(x)$ as a set of plural\-valid responses before optimizing a dominant preference\.
## Limitations
#### Diagnostic scope\.
This study is a controlled diagnostic rather than a nationally representative survey\. The 20\-participant final sample was selected to provide ethnolinguistic and age coverage, not to estimate population\-level preference distributions in Malaysia\. The findings should therefore be read as evidence that Preference\-Validity Compression can arise under controlled plural\-feedback conditions, not as a national prevalence estimate\.
#### Sample size and prompt assignment\.
Each retained prompt was evaluated by three participants, which enables a minimal majority signal but limits the resolution of within\-prompt disagreement\. Because assignment was random and not demographically stratified, prompts on regionally or ethnically specific issues may have been judged by participants outside the relevant frame, which would understate hidden consensus\. The reported 79% is therefore likely a lower bound\.
#### Stance labelling\.
Stance labels were assigned independently by five members of the research team without adjudicating disagreements into a single resolved label\. The schema was developed iteratively during analysis\. For this reason, stance labels are treated as descriptive characterisations of response content, not as ground\-truth categorical measurements\. The main quantitative results rely on acceptability counts rather than stance\-label agreement\.
#### Response generation\.
The three candidate responses were generated with the intention of representing agreement, disagreement, and balanced positions\. In practice, model outputs did not always follow this scheme, especially on sensitive prompts where responses sometimes converged toward safe or hedged positions\. We therefore treat responses $A$, $B$, and $C$ as empirical candidate responses rather than theoretically controlled stance conditions\. The compression analysis concerns whether accepted non\-winner responses are lost under single\-winner aggregation, not whether a specific intended stance category wins\.
#### Generalisability\.
Malaysia was chosen as a diagnostic setting because its structural plurality makes the measurement problem visible\. Whether similar patterns arise in other plural societies, other domains, or larger annotation designs remains an empirical question\. The present study establishes a diagnostic phenomenon and a measurement protocol for studying it\. It does not claim to identify all conditions under which Preference\-Validity Compression will occur\.
## References
- D\. Ali, D\. Zhao, A\. Koenecke, and O\. Papakyriakopoulos \(2026\) Operationalizing pluralistic values in large language model alignment reveals trade\-offs in safety, inclusivity, and model behavior\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 37222–37231\. Cited by: [§2\.1](https://arxiv.org/html/2606.10569#S2.SS1.p2.1)\.
- L\. Aroyo and C\. Welty \(2015\) Truth is a lie: crowd truth and the seven myths of human annotation\. AI magazine 36 \(1\), pp\. 15–24\. Cited by: [§1](https://arxiv.org/html/2606.10569#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.10569#S2.SS1.p1.1)\.
- V\. Basile, M\. Fell, T\. Fornaciari, D\. Hovy, S\. Paun, B\. Plank, M\. Poesio, and A\. Uma \(2021\) We need to consider disagreement in evaluation\. In Proceedings of the 1st workshop on benchmarking: past, present and future, pp\. 15–21\. Cited by: [§2\.1](https://arxiv.org/html/2606.10569#S2.SS1.p1.1)\.
- S\. Casper, X\. Davies, C\. Shi, T\. K\. Gilbert, J\. Scheurer, J\. Rando, R\. Freedman, T\. Korbak, D\. Lindner, P\. Freire, et al\. \(2023\) Open problems and fundamental limitations of reinforcement learning from human feedback\. arXiv preprint arXiv:2307\.15217\. Cited by: [§1](https://arxiv.org/html/2606.10569#S1.p2.1), [§7](https://arxiv.org/html/2606.10569#S7.SS0.SSS0.Px3.p1.1)\.
- S\. Chakraborty, J\. Qiu, H\. Yuan, A\. Koppel, F\. Huang, D\. Manocha, A\. S\. Bedi, and M\. Wang \(2024\) Maxmin\-rlhf: alignment with diverse human preferences\. arXiv preprint arXiv:2402\.08925\. Cited by: [§2\.1](https://arxiv.org/html/2606.10569#S2.SS1.p2.1), [§7](https://arxiv.org/html/2606.10569#S7.SS0.SSS0.Px5.p1.2)\.
- D\. Chen, Y\. Chen, A\. Rege, and R\. K\. Vinayak \(2024\) Pal: pluralistic alignment framework for learning from heterogeneous preferences\. arXiv preprint arXiv:2406\.08469\. Cited by: [§7](https://arxiv.org/html/2606.10569#S7.SS0.SSS0.Px5.p1.2)\.
- P\. F\. Christiano, J\. Leike, T\. Brown, M\. Martic, S\. Legg, and D\. Amodei \(2017\) Deep reinforcement learning from human preferences\. Advances in neural information processing systems 30\. Cited by: [§1](https://arxiv.org/html/2606.10569#S1.p1.1)\.
- V\. Conitzer, R\. Freedman, J\. Heitzig, W\. H\. Holliday, B\. M\. Jacobs, N\. Lambert, M\. Mossé, E\. Pacuit, S\. Russell, H\. Schoelkopf, et al\. \(2024\) Social choice should guide ai alignment in dealing with diverse human feedback\. arXiv preprint arXiv:2404\.10271\. Cited by: [§2\.1](https://arxiv.org/html/2606.10569#S2.SS1.p2.1)\.
- Department of Statistics Malaysia \(2026\) Demographic statistics malaysia, first quarter 2026\. Note: Press release\. Cited by: [§4](https://arxiv.org/html/2606.10569#S4.p1.1)\.
- J\. Henrich, S\. J\. Heine, and A\. Norenzayan \(2010\) The weirdest people in the world? Behavioral and brain sciences 33 \(2\-3\), pp\. 61–83\. Cited by: [§2\.2](https://arxiv.org/html/2606.10569#S2.SS2.p1.1)\.
- Z\. K\. Hew, J\. X\. Low, S\. J\. Yang, and C\. S\. Chan \(2025\) MyCulture: exploring malaysia’s diverse culture under low\-resource language constraints\. arXiv preprint arXiv:2508\.05429\. Cited by: [§2\.2](https://arxiv.org/html/2606.10569#S2.SS2.p3.1)\.
- C\. Hirschman \(1986\) The making of race in colonial malaya: political economy and racial ideology\. In Sociological forum, Vol\. 1, pp\. 330–361\. Cited by: [§4](https://arxiv.org/html/2606.10569#S4.p1.1)\.
- H\. R\. Kirk, A\. Whitefield, P\. Röttger, A\. Bean, K\. Margatina, J\. Ciro, R\. Mosquera, M\. Bartolo, A\. Williams, H\. He, et al\. \(2024\) The prism alignment dataset: what participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models\. Advances in Neural Information Processing Systems 37, pp\. 105236–105344\. Cited by: [§2\.1](https://arxiv.org/html/2606.10569#S2.SS1.p2.1)\.
- Merdeka Center \(2024\) National youth survey 2024\. Note: Poll report\. Cited by: [§4](https://arxiv.org/html/2606.10569#S4.p3.1)\.
- R\. Mihalcea, O\. Ignat, L\. Bai, A\. Borah, L\. Chiruzzo, Z\. Jin, C\. Kwizera, J\. Nwatu, S\. Poria, and T\. Solorio \(2025\) Why ai is weird and shouldn’t be this way: towards ai for everyone, with everyone, by everyone\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 39, pp\. 28657–28670\. Cited by: [§2\.2](https://arxiv.org/html/2606.10569#S2.SS2.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, et al\. \(2022\) Training language models to follow instructions with human feedback\. Advances in neural information processing systems 35, pp\. 27730–27744\. Cited by: [§1](https://arxiv.org/html/2606.10569#S1.p1.1)\.
- C\. Park, M\. Liu, D\. Kong, K\. Zhang, and A\. Ozdaglar \(2024\) Rlhf from heterogeneous feedback via personalization and preference aggregation\. arXiv preprint arXiv:2405\.00254\. Cited by: [§2\.1](https://arxiv.org/html/2606.10569#S2.SS1.p2.1), [§7](https://arxiv.org/html/2606.10569#S7.SS0.SSS0.Px5.p1.2)\.
- S\. C\. Poh, S\. J\. Yang, J\. M\. L\. Tan, L\. L\. T\. Y\. Chieng, J\. X\. Tan, Z\. Yu, C\. M\. Foong, and C\. S\. Chan \(2024\) MalayMMLU: a multitask benchmark for the low\-resource malay language\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 650–669\. Cited by: [§2\.2](https://arxiv.org/html/2606.10569#S2.SS2.p3.1)\.
- V\. Prabhakaran, A\. M\. Davani, and M\. Diaz \(2021\) On releasing annotator\-level labels and information in datasets\. In Proceedings of the joint 15th linguistic annotation workshop \(LAW\) and 3rd designing meaning representations \(DMR\) workshop, pp\. 133–138\. Cited by: [§2\.1](https://arxiv.org/html/2606.10569#S2.SS1.p2.1), [§7](https://arxiv.org/html/2606.10569#S7.SS0.SSS0.Px3.p1.1)\.
- S\. Santurkar, E\. Durmus, F\. Ladhak, C\. Lee, P\. Liang, and T\. Hashimoto \(2023\) Whose opinions do language models reflect? In International conference on machine learning, pp\. 29971–30004\. Cited by: [§1](https://arxiv.org/html/2606.10569#S1.p2.1)\.
- A\. Shamsul \(1996\) Debating about identity in malaysia: a discourse analysis\. Japanese Journal of Southeast Asian Studies 34 \(3\), pp\. 476–499\. Cited by: [§4](https://arxiv.org/html/2606.10569#S4.p1.1)\.
- A\. Shamsul \(2009\) Managing a stable tension: ethnic relations in malaysia reexamined\. Readings on Development: Malaysia 2057, pp\. 39–54\. Cited by: [§4](https://arxiv.org/html/2606.10569#S4.p1.1), [§4](https://arxiv.org/html/2606.10569#S4.p2.1)\.
- T\. Sorensen, J\. Moore, J\. Fisher, M\. Gordon, N\. Mireshghallah, C\. M\. Rytting, A\. Ye, L\. Jiang, X\. Lu, N\. Dziri, et al\. \(2024\) Position: a roadmap to pluralistic alignment\. In Proceedings of the 41st International Conference on Machine Learning, pp\. 46280–46302\. Cited by: [§1](https://arxiv.org/html/2606.10569#S1.p2.1)\.
- H\. Ting \(2014\) Race paradigm and nation\-building in malaysia\. Transforming Malaysia: Dominant and competing paradigms, pp\. 82–110\. Cited by: [§4](https://arxiv.org/html/2606.10569#S4.p1.1)\.
- W\. Wang, W\. Jiao, J\. Huang, R\. Dai, J\. Huang, Z\. Tu, and M\. Lyu \(2024\) Not all countries celebrate thanksgiving: on the cultural dominance in large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 6349–6384\. Cited by: [§2\.2](https://arxiv.org/html/2606.10569#S2.SS2.p2.1)\.
## Appendix A Proof of Proposition
Let $V(x)=\{y_{1},\ldots,y_{k}\}$ be the set of plural\-valid responses for prompt $x$, where $k>1$\. Scalar aggregation assigns each $y_{j}\in V(x)$ an acceptance frequency

$$A(x,y_{j})=\frac{1}{n}\sum_{i=1}^{n}u_{i}(x,y_{j}). \tag{3}$$

The single\-winner objective selects

$$y^{\ast}\in\arg\max_{y\in Y_{x}}A(x,y). \tag{4}$$

The output is one selected response from $Y_{x}$\. Since $|V(x)|>1$, at least one response $y_{j}\in V(x)$ satisfies $y_{j}\neq y^{\ast}$\. By definition, this response remains plural\-valid, but it is not preserved in the single\-winner target\. Therefore, the mapping from $V(x)$ to $y^{\ast}$ compresses a set of valid responses into one selected response\.
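The argument can be traced on a small numerical case\. The sketch below is a minimal illustration under stated assumptions: the binary judgments are hypothetical, ties are broken alphabetically, and the plural\-valid set $V(x)$ is approximated by the responses whose acceptance frequency exceeds one half, which is a proxy chosen here and not the paper's definition of plural validity\.

```python
from typing import Dict, List, Set


def acceptance_frequency(judgments: Dict[str, List[int]]) -> Dict[str, float]:
    """A(x, y_j) = (1/n) * sum_i u_i(x, y_j), with u_i in {0, 1}."""
    return {y: sum(u) / len(u) for y, u in judgments.items()}


def single_winner(freq: Dict[str, float]) -> str:
    """Pick y* from argmax_y A(x, y); ties go to the alphabetically first label
    (a tie-breaking assumption made for this sketch)."""
    return max(sorted(freq), key=lambda y: freq[y])


def majority_supported(freq: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Proxy for V(x): responses whose acceptance frequency exceeds the threshold."""
    return {y for y, a in freq.items() if a > threshold}


# Hypothetical prompt with n = 3 annotators and three candidate responses.
judgments = {"A": [1, 1, 1], "B": [1, 1, 0], "C": [0, 1, 0]}

freq = acceptance_frequency(judgments)
y_star = single_winner(freq)
v_x = majority_supported(freq)
discarded = v_x - {y_star}  # majority-supported responses lost to compression

print(y_star, sorted(v_x), sorted(discarded))
```

Here response B clears the majority threshold but does not survive in the single\-winner target, which is the compression the proposition describes\.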
## Appendix B Participant Recruitment
### B\.1 Recruitment and Selection
In total, 53 individuals registered interest through an online form\. From this pool, 44 were selected to provide coverage across Malay, Chinese, Indian, as well as indigenous communities from Sabah and Sarawak, across age bands from 18\-24 to 65 and above\. After task completion, quality filtering, and linkage to usable evaluation records, the final analytical sample contained 20 unique participants\. Participants received financial compensation aligned with Malaysian minimum wage requirements\.
### B\.2 Sample Demographics
The 20 participants include those from Malay, Chinese, Indian, and several East Malaysian indigenous backgrounds, including Bajau, Kadazan\-Dusun, Kayan, and Melanau\. Some demographic cells, especially older age bands, contain only one participant\. Demographic variables are therefore used only as interpretive context, not as predictors of preference\. The sample is not intended to estimate population\-level Malaysian preference distributions\. Table 4 reports the full demographic breakdown\.
The demographic questionnaire was available in Bahasa Melayu and English\. Where applicable, Bahasa Melayu responses were translated into English for consistency in reporting\.
### B\.3 Research Ethics and Data Privacy
All participants provided informed consent before beginning the task\. All findings are reported in anonymised form\.
Before beginning the task, participants reviewed and agreed to a consent statement that disclosed:
- The purpose of the study and that responses would be used for research and any resulting publication\.
- That personally identifiable information collected during registration would be kept separate from the analytical dataset and not reported\.
- Compensation terms aligned with Malaysian minimum wage requirements\.
- The right to withdraw at any time\.
- Contact information for questions or concerns about the study\.
### B\.4 Content Advisory
Participants were also informed in advance that some prompts touch on sensitive topics relevant to the Malaysian context, including race relations, religious practice, language policy, and regional identity\. Participants were told they could skip any prompt they were uncomfortable evaluating, and could withdraw from the study at any time without forfeiting compensation for completed work\.
## Appendix C Participant Instructions and Task Framing
This section reports the full instructions and task framing shown to participants during evaluation, in the order they appeared\.
### C\.1 Study Overview Shown to Participants
Participants were introduced to the task with the following description:
> You will read a user question and three different AI responses to that question\. Your task is to decide whether each response is acceptable to you\. There is no single correct answer\. Different people may disagree on what is acceptable, and that is expected in this study\. We are interested in your personal judgment and lived experience\.
### C\.2 Rating Instructions
For each prompt, participants were shown the prompt followed by three candidate responses \(labelled A, B, and C\)\. For each response, participants were asked to:
1. Mark the response as either *acceptable* or *not acceptable* as a reply to the prompt in the Malaysian context\.
2. Provide a brief written justification for the rating\.
Participants were explicitly told that they could mark any number of responses as acceptable, including zero, one, two, or all three\. They were instructed that there was no expected answer and no preference for a particular response position\.
## Appendix D Prompt Generation
Prompt generation followed a three\-stage pipeline designed to ground the corpus in authentic Malaysian discourse rather than researcher assumptions about local disagreement\. Each stage is described in the subsections below\.
### D\.1 Data Collection and Processing
Table 3: Data Collection: Files and utterances per platform after segmentation, filtration, and deduplication

Forum discourse was collected from four major Malaysian online platforms, i\.e\., Lowyat\.net, Reddit \(Malaysian communities\), Threads, and YouTube podcast channels\. Raw data underwent multi\-stage processing, including segmentation into utterance\-level chunks, filtration of off\-topic content, and deduplication to remove redundant discourse\. The processing pipeline produced 692,919 distinct utterances from 26,463 source files across all platforms \(refer to Table[3](https://arxiv.org/html/2606.10569#A4.T3)\)\.
YouTube data was sourced from seven prominent Malaysian podcast channels, which are Morning Brief \(BFM\), Keluar Sekejap, Are We OK?, The Sambal Pod, Kinabalu Podcast, BuatSaja Podcast, and Yang Berhenti Menteri \(hosted by Mohd Rafizi bin Ramli\)\.
### D\.2 Topic Identification
Topics were identified from the processed discourse using a systematic approach grounded in locally relevant cultural, linguistic, and policy dimensions\. The resulting topic catalog spans 150 topics organized across three domains: Culture \(55 topics\), Language \(64 topics\), as well as Local governance and policy \(31 topics\)\. Each domain contains multiple thematic dimensions \(refer to Table[5](https://arxiv.org/html/2606.10569#A4.T5)\)\.
Table 4: Full demographic breakdown of the final analytical sample \($N=20$\)\. Free\-text entries \(e\.g\., “Kayan” under *Ethnicity* and “Part Time Admin” under *Employment / Sector*\) are reported as submitted\.

| Category | $n$ | % |
| --- | --- | --- |
| *Total Participants* | | |
| Total | 20 | 100\.0% |
| *Age* | | |
| 18–24 | 6 | 30\.0% |
| 25–34 | 4 | 20\.0% |
| 35–44 | 4 | 20\.0% |
| 45–54 | 4 | 20\.0% |
| 55–64 | 1 | 5\.0% |
| 65\+ | 1 | 5\.0% |
| *Gender* | | |
| Female | 15 | 75\.0% |
| Male | 5 | 25\.0% |
| *Ethnicity* | | |
| Chinese | 9 | 45\.0% |
| Malay | 5 | 25\.0% |
| Indian | 2 | 10\.0% |
| Bajau | 1 | 5\.0% |
| Kadazan\-Dusun | 1 | 5\.0% |
| Kayan∗ | 1 | 5\.0% |
| Melanau | 1 | 5\.0% |
| *Religion* | | |
| Islam | 8 | 40\.0% |
| Buddhism | 5 | 25\.0% |
| Christianity | 3 | 15\.0% |
| Hinduism | 2 | 10\.0% |
| Taoism / Chinese folk religion | 1 | 5\.0% |
| No religion | 1 | 5\.0% |
| *Education* | | |
| Bachelor’s degree | 13 | 65\.0% |
| STPM / A\-level / Diploma | 4 | 20\.0% |
| Postgraduate \(Master’s or PhD\) | 2 | 10\.0% |
| SPM / O\-level | 1 | 5\.0% |
| *Secondary School Type* | | |
| SMK \(National secondary school\) | 14 | 70\.0% |
| Chinese independent school | 2 | 10\.0% |
| SMJK \(C\) \(Chinese\-medium national\-type\) | 2 | 10\.0% |
| Government boarding school \(MRSM, SBP, etc\.\) | 1 | 5\.0% |
| International / private | 1 | 5\.0% |
| *Primary Language \(Daily Use\)* | | |
| English | 7 | 35\.0% |
| Mandarin | 7 | 35\.0% |
| Bahasa Melayu | 5 | 25\.0% |
| Cantonese | 1 | 5\.0% |
| *Employment / Sector* | | |
| Student | 5 | 25\.0% |
| Engineering or technical services | 3 | 15\.0% |
| Government / public service | 3 | 15\.0% |
| Not currently working | 3 | 15\.0% |
| AI / data / tech sector | 1 | 5\.0% |
| Business or management | 1 | 5\.0% |
| Creative, media, and entertainment | 1 | 5\.0% |
| Part Time Admin† | 1 | 5\.0% |
| Self\-employed / own business | 1 | 5\.0% |
| Wildlife conservation | 1 | 5\.0% |
| *State Grown Up In* | | |
| Kuala Lumpur | 3 | 15\.0% |
| Sabah | 3 | 15\.0% |
| Penang | 2 | 10\.0% |
| Perak | 2 | 10\.0% |
| Sarawak | 2 | 10\.0% |
| Selangor | 2 | 10\.0% |
| Johor | 1 | 5\.0% |
| Kedah | 1 | 5\.0% |
| Kelantan | 1 | 5\.0% |
| Negeri Sembilan | 1 | 5\.0% |
| Pahang | 1 | 5\.0% |
| Terengganu | 1 | 5\.0% |
| *State Currently Living In* | | |
| Selangor | 6 | 30\.0% |
| Kuala Lumpur | 4 | 20\.0% |
| Negeri Sembilan | 3 | 15\.0% |
| Sarawak | 2 | 10\.0% |
| Johor | 1 | 5\.0% |
| Kedah | 1 | 5\.0% |
| Melaka | 1 | 5\.0% |
| Putrajaya | 1 | 5\.0% |
| Terengganu | 1 | 5\.0% |

∗ Free\-text entry submitted under the “Other” *Ethnicity* option\.
† Free\-text entry submitted under the “Other” *Employment / Sector* option\.
Table 5: 150 topics organized by domain \(Culture, Language, Local\) and thematic dimension

| Topic ID | Dimension | Count of Topics |
| --- | --- | --- |
| *Culture* | | |
| T0004–T0054 | Values & Norms: Sensitivity & Taboos | 10 |
| T0002–T0055 | Values & Norms: Ethical & Moral Standards | 7 |
| T0041–T0053 | Context: Historical & Societal | 3 |
| T0033–T0051 | Belief Systems: Religious Perspectives | 11 |
| T0001–T0052 | Values & Norms: Social Acceptability | 5 |
| T0022–T0048 | Symbolism: Cultural Meaning | 2 |
| T0003–T0047 | Context: Temporal & Civic | 17 |
| *Language* | | |
| T0102–T0106 | Expression: Formality & Register | 5 |
| T0056–T0063 | Coverage: Language Scope | 8 |
| T0111–T0114 | Interaction: Multilingual | 4 |
| T0115–T0119 | Interpretation: Pragmatic Understanding | 5 |
| T0107–T0110 | Variation: Regional Dialects | 4 |
| T0064–T0101 | Lexicon: Terminology, Slang, Local Expressions | 38 |
| *Local* | | |
| T0120–T0146 | Regulation: Constitutional & Legal Requirements | 11 |
| T0129–T0150 | Policy: National Priorities & Governance | 16 |
| T0132–T0144 | Regulation: Sectoral Requirements | 4 |
### D\.3 Persona Development
From the extracted topics, we developed 20 diverse persona profiles in three stages: \(i\) voice cluster extraction, identifying 1 to 3 distinct discourse voices per topic segment, each characterized by emotional posture, friction triggers, and phrasing patterns; \(ii\) profile bundling, clustering voice patterns by a weighted similarity score \(lexical overlap 55%, emotional proximity 20%, posture alignment 15%, topic dimension match 10%\), with clusters merged when their similarity exceeds a 0\.55 threshold; and \(iii\) refinement, using an LLM to synthesize each bundle into a coherent persona with a handle, core concern, and emotional profile \(see Table[6](https://arxiv.org/html/2606.10569#A4.T6)\)\.
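The weighted similarity used in the bundling stage can be sketched as a weighted sum followed by a threshold test\. This is a minimal illustration, not the authors' implementation: the weights and the 0\.55 threshold come from the text, while the component names, the assumption that each component score lies in $[0,1]$, and the example scores are hypothetical\.

```python
from typing import Dict

# Weights and merge threshold as stated in the text. Component scores are
# assumed to be normalized to [0, 1], which the text does not specify.
WEIGHTS: Dict[str, float] = {
    "lexical_overlap": 0.55,
    "emotional_proximity": 0.20,
    "posture_alignment": 0.15,
    "topic_dimension_match": 0.10,
}
MERGE_THRESHOLD = 0.55


def weighted_similarity(components: Dict[str, float]) -> float:
    """Weighted sum of the four component similarities."""
    return sum(w * components[name] for name, w in WEIGHTS.items())


def should_merge(components: Dict[str, float]) -> bool:
    """Merge two voice clusters when their similarity exceeds the threshold."""
    return weighted_similarity(components) > MERGE_THRESHOLD


# Hypothetical component scores for two pairs of voice clusters.
pair_close = {"lexical_overlap": 0.6, "emotional_proximity": 0.8,
              "posture_alignment": 0.5, "topic_dimension_match": 1.0}
pair_far = {"lexical_overlap": 0.3, "emotional_proximity": 0.5,
            "posture_alignment": 0.5, "topic_dimension_match": 0.0}

print(round(weighted_similarity(pair_close), 3), should_merge(pair_close))
print(round(weighted_similarity(pair_far), 3), should_merge(pair_far))
```

Because lexical overlap carries more than half of the total weight, two clusters with little shared vocabulary cannot reach the threshold unless the remaining components are all near their maximum\.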
Each persona targets specific topics through its distinct emotional stance and friction triggers, enabling generation of locally grounded prompts across multiple valid perspectives\.
### D\.4 Quality Filtering
Two quality filters were applied after data collection\. First, four members of the research team independently reviewed prompt coherence, with difficult cases discussed collectively\. Prompts were removed when the scenario was internally inconsistent or factually incoherent in ways that made acceptability judgments uninterpretable rather than genuinely contestable\. Second, a coverage filter removed prompts that did not receive responses from all three assigned participants\. This ensured that every retained prompt supported a minimal two\-out\-of\-three majority judgment\. Because both filters were applied as an integrated process, filter\-specific attrition is reported only at the aggregate level in Figure[5](https://arxiv.org/html/2606.10569#A4.F5)\.
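The coverage filter can be sketched as a simple group\-and\-count step\. The record layout below \(`prompt_id`, `participant_id`\) and the example events are hypothetical; the source does not describe the data schema, so this only illustrates the rule that a prompt is retained when all three assigned participants responded\.

```python
from collections import defaultdict
from typing import Dict, List


def coverage_filter(events: List[Dict[str, str]], required_raters: int = 3) -> List[Dict[str, str]]:
    """Keep only events whose prompt was rated by the required number of
    distinct participants (field names are assumptions for this sketch)."""
    raters = defaultdict(set)
    for event in events:
        raters[event["prompt_id"]].add(event["participant_id"])
    kept = {p for p, who in raters.items() if len(who) >= required_raters}
    return [event for event in events if event["prompt_id"] in kept]


# Hypothetical events: prompt p1 has three raters, prompt p2 only two.
events = [
    {"prompt_id": "p1", "participant_id": "u1"},
    {"prompt_id": "p1", "participant_id": "u2"},
    {"prompt_id": "p1", "participant_id": "u3"},
    {"prompt_id": "p2", "participant_id": "u1"},
    {"prompt_id": "p2", "participant_id": "u4"},
]

retained = coverage_filter(events)
retained_prompts = sorted({e["prompt_id"] for e in retained})
print(retained_prompts, len(retained))
```

With three raters per retained prompt, 107 prompts correspond to 321 preference events, which matches the counts reported in the paper\.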
Figure 5: Data pipeline from recruitment to the final analytical sample\.

Table 6: Persona Profiles: 20 diverse personas grouped by primary concern, each with associated emotional posture and friction triggers

| ID | Handle | Core Concern | Emotion | Key Triggers |
| --- | --- | --- | --- | --- |
| *Accountability & Justice* | | | | |
| P001 | Outraged Citizen | Public safety & accountability | outraged | Drunk driving, corruption, lack of action |
| P004 | Skeptical Watchdog | Institutional accountability | skeptical | Official statements, safety procedures |
| P008 | Scammed Citizen | Fraud & justice | frustrated | Scamming, ineffective law enforcement |
| P017 | Accountable Parent | Parental responsibility | assertive | Blaming others, negligence |
| *Patriotism & National Identity* | | | | |
| P003 | Indignant Patriot | National pride | indignant | Disrespect for flag, disloyalty |
| P010 | Angry Nationalist | Economic exploitation | angry\-resentful | Disrespect, national symbols, price hikes |
| P011 | Disillusioned Patriot | Ethnic identity | sarcastic | Cultural misidentification, exclusion |
| P019 | Passionate Patriot | National language | passionate | Underestimation, dismissal |
| P020 | Sovereign Advocate | State rights | defiant | Taking control, defending rights |
| *Morality & Standards* | | | | |
| P002 | Moral Guardian | Moral outrage | angry | Irresponsible behavior, disrespect |
| P006 | Prejudice Propagator | Stereotyping | aggressive | Minor incidents, stereotyping |
| P009 | Social Justice Advocate | Labor rights | indignant | Salary cuts, forced resignations |
| *Skepticism & Cynicism* | | | | |
| P007 | Cynical Watchdog | Political corruption | cynical | Political excuses, blame game |
| P012 | Truth Seeker | Hidden truths | suspicious | Suppression of truth, hidden agendas |
| P013 | Cynical Observer | Hypocrisy | cynical | Hypocrisy, double standards |
| P016 | Systemic Critic | Systemic bias | resentful | Racial division, political rhetoric |
| *Pragmatism & Concern* | | | | |
| P005 | Diplomatic Pragmatist | International relations | concerned | Conflicting alliances, diplomatic issues |
| P014 | Privacy Watchdog | Digital privacy | anxious | Identity control, surveillance |
| P015 | Curious Inquirer | Understanding | curious | Lack of context, unexplained claims |
| P018 | Frustrated Malaysian | Fair pay | frustrated | Low salary, dismissive employers |
## Appendix E Response Generation
To generate diverse responses across multiple valid perspectives, we implemented stance\-controlled response generation by injecting three distinct reasoning frameworks into three separate language models\. Each prompt and response was then assigned a stance tag by five members of the research team using a schema developed for this study\. These tags are used to make interpretive frames visible and to support qualitative interpretation, not to support the main quantitative claims\.
### E\.1 Response Generation Models
Three LLMs were assigned one stance each to generate diverse responses across all prompts:
- Response A \(Agree\): DeepSeek\-V4\-Flash
- Response B \(Disagree\): GPT\-5\-Nano
- Response C \(Diplomatic\): Gemini\-3\.1\-Flash\-Lite\-Preview
This assignment is intended to ensure that, for each controversial prompt, annotators encounter one response reinforcing the position, one challenging it, and one analyzing the underlying dilemma, which enables assessment of acceptability across legitimate but distinct interpretive frames\.
One practical deviation from the intended design is worth noting\. The model outputs did not consistently follow the assigned stance orientation, particularly on sensitive prompts where responses sometimes converged toward hedged or non\-committal positions regardless of the injected reasoning framework\. For this reason, the main analysis treats response positions $A$, $B$, and $C$ as empirical candidate responses rather than theoretically controlled stance conditions\.
### E\.2 Response Stance Definitions
The three stances were operationalized as follows:
Agree: The model adopts and reinforces a given position by building upon its logic with supporting evidence and shared principles\. Rather than passive agreement, this stance actively strengthens the reasoning through deeper justifications and a confirmatory tone\.
Disagree: The model presents an alternative perspective grounded in competing priorities and values\. Rather than dismissal, this stance constructs a counter\-argument that acknowledges complexity while maintaining a principled boundary against the original position\.
Diplomatic: The model analyzes the underlying tension between competing values without advocating for either side\. This stance decomposes the dilemma into its constituent trade\-offs, presenting both gains and losses across scenarios in neutral, balanced language\.
## Appendix F Stance Labeling
Each prompt and response was assigned a stance tag from a schema developed for this study\. Labeling was conducted independently by five members of the research team\. No reconciliation procedure was applied, and disagreements between labelers were not adjudicated into a single resolved label\. This choice reflects the descriptive role of the stance labels\. They are used as characterizations of response content rather than ground\-truth categorical measurements\. The main quantitative results rely on acceptability counts, not on stance\-label agreement\.
Examples of stance labels include pro\-multilingual\-inclusive\-education, pro\-Malay\-language\-culture, and anti\-LGBTQ\-symbols\. The labels characterize value orientations embedded in prompts and responses\. They are not used to classify participants ideologically\. Refer to Table 7 for the full list of unique prompt stance labels and Table 8 for the full list of unique response stance labels\.
Table 7: All 116 Unique Prompt Stances

| Stance | Count |
| --- | --- |
| pro\-regional\-dialect | 32 |
| anti\-bureaucratic\-neglect | 28 |
| protect\-privacy | 20 |
| anti\-institution | 19 |
| anti\-selective\-enforcement | 9 |
| parental\-accountability\-gap | 9 |
| anti\-surveillance | 8 |
| anti\-impunity\-for\-the\-powerful | 6 |
| anti\-workplace\-discrimination | 6 |
| cross\-cultural\-inquiry | 6 |
| demand\-accountability\-ordinary\-malaysians | 6 |
| pro\-accountability\-home\-parenting | 6 |
| pro\-national\-language | 6 |
| pro\-parental\_accountability | 5 |
| protect\-workers | 5 |
| east\-malaysian\-autonomy\-inquiry | 3 |
| advocate\-parental\-accountability | 3 |
| anti\-chinese | 3 |
| anti\-indian | 3 |
| anti\-blaming\-parents | 3 |
| anti\-bureaucracy\-political\-rhetoric | 3 |
| anti\-cooptation\-culture | 3 |
| anti\-corporate\-restriction | 3 |
| anti\-cover\-up\-staff\-misconduct | 3 |
| anti\-deception\-dialect\_identity | 3 |
| anti\-disrespect\-islam | 3 |
| anti\-ekyc\-data\-collection | 3 |
| anti\-exploitation | 3 |
| anti\-exploitation\-workplace | 3 |
| anti\-government\-negligence | 3 |
| anti\-intervention—federal\-government | 3 |
| anti\-leniency\-gisb | 3 |
| anti\-manipulasi\-bahasa\-sopan | 3 |
| anti\-manipulation\-of\-religion | 3 |
| anti\-politicisation\-of\-religious\-enforcement | 3 |
| anti\-politicisation\-religion | 3 |
| anti\-politicization\-transliteration | 3 |
| anti\-surveillance\-mykad | 3 |
| anti\-worker\-exploitation | 3 |
| anti\-workplace\-exploitation | 3 |
| anti\-workplace\_discrimination | 3 |
| balance\-accountability\-with\-religious\-principles | 3 |
| cultural\-legitimacy\-consistency\-critique | 3 |
| defend\-islam | 3 |
| defend\-halal\-integrity | 3 |
| demand\-accountability\-governance | 3 |
| demand\-clarity\-in\-bureaucratic\-communication | 3 |
| demand\-justice\-with\-integrity | 3 |
| demand\-systemic\-accountability\-youth\_outcomes | 3 |
| demand\-transparency\-in\-safeguarding\-oversight | 3 |
| demand\-transparency\-platform\_moderation | 3 |
| governance\-compliance\-inquiry | 3 |
| heritage\_narratives\-inquiry | 3 |
| inquire\-parental\_responsibility | 3 |
| inquiry\-language\_identity | 3 |
| language\-norms\-double\-standard\-inquiry | 3 |
| liability\_clarity\_enquiry | 3 |
| multilingual\-codeswitching\-inquiry | 3 |
| preserve\-national\_language | 3 |
| pro\-accountability\-in\-religious\-leadership | 3 |
| pro\-citizenship\-law\-reform\-for\-children\-born\-to\-malaysian\-mothers\-abroad | 3 |
| pro\-due\_diligence | 3 |
| pro\-parental\-accountability | 3 |
| pro\-personal\-accountability\-and\-normative\-internalisation\-of\-responsibility | 3 |
| pro\-privacy\-protection | 3 |
| pro\-regional\-autonomy | 3 |
| pro\-religious\-inclusivity | 3 |
| pro\-respect\-religious\-terms | 3 |
| pro\-space\-design\-accountability | 3 |
| pro\-standardization—address\-abbreviations | 3 |
| pro\-vernacular\_preservation | 3 |
| promote\-consistent\-enforcement | 3 |
| protect\-islam | 3 |
| protect\-regional\-dialect | 3 |
| religious\-economic\-governance\-trade\-off\-evaluation | 3 |
| anti\-federal\-centralization | 2 |
| pro\-islamic\_term\_protection | 2 |
| advocate\-remote\-work | 2 |
| anti\-lgbtq\-symbols | 2 |
| anti\-authority | 2 |
| anti\-blame\-shifting\-accountability | 2 |
| anti\-corruption | 2 |
| anti\-empty\-rhetoric | 2 |
| anti\-evading\-responsibility | 2 |
| anti\-gimmick\-labeling\-musang\-king | 2 |
| anti\-labour\-exploitation | 2 |
| anti\-manipulation\-code\-switching | 2 |
| anti\-non\-malays | 2 |
| anti\-religious\-hypocrisy | 2 |
| anti\-rulebreaking\-traffic\-regulations | 2 |
| anti\-silence\-from\-religious\-bodies | 2 |
| anti\-surveillance\-ekyc | 2 |
| defend\-malay\-dignity | 2 |
| degree\-career\-mismatch\-inquiry | 2 |
| demand\-transparency\-in\-official\-communication | 2 |
| inquire\-parental\_responsibility\_in\_cancel\_culture | 2 |
| neutral\-ma63\-inquiry | 2 |
| perceived\-enforcement\-inequality | 2 |
| pro\-chinese\-malaysian\-heritage | 2 |
| pro\-malay\-language | 2 |
| pro\-malay\-language\-culture | 2 |
| pro\-diversity\_inclusion | 2 |
| pro\-employee\-voice | 2 |
| pro\-heritage\-language | 2 |
| pro\-host\-accountability | 2 |
| pro\-malay\-language | 2 |
| pro\-national\_language | 2 |
| pro\-parental\-empowerment | 2 |
| pro\-parental\-accountability | 2 |
| pro\-regional\-identity | 2 |
| pro\-standardisation\-islamic\-certification | 2 |
| pro\-traditional\-values | 2 |
| pro\-transparency\-in\-governance | 2 |
| reject\-derogatory\-metaphor | 2 |
| religious\-inclusivity\-inquiry | 2 |
| uphold\-religious\-principles\-in\-diplomacy | 2 |

Table 8: All 127 Unique Response Stances

| Stance | Occ\. | Topics |
| --- | --- | --- |
| pluralist | 404 | 23 |
| pragmatist | 310 | 22 |
| institutionalist | 176 | 15 |
| activist | 15 | 4 |
| neutral\-pluralist | 14 | 2 |
| anti\-institution | 13 | 2 |
| pro\-regional\-dialect | 13 | 2 |
| diplomatic | 8 | 2 |
| anti\-surveillance | 6 | 2 |
| pro\-institutional\-harmony\- / \-societal\-unity | 6 | 2 |
| pro\-local\-resilience\-pragmatist | 6 | 2 |
| pro\-national\-language | 6 | 1 |
| anti\-selective\-enforcement | 5 | 1 |
| pro\-parental\_accountability | 5 | 2 |
| anti\-impunity\-for\-the\-powerful | 3 | 1 |
| anti\-digital\-parenting | 3 | 1 |
| anti\-discriminatory\-language | 3 | 1 |
| anti\-institution | 3 | 1 |
| assimilationist | 3 | 1 |
| authenticity\-vs\-institutional\-visibility\-trade\-off | 3 | 1 |
| balance\-standardization\-with\-cultural\-identity\-pragmatist | 3 | 1 |
| communal\-institutionalist | 3 | 1 |
| corporate\-roi\-centric\-pragmatist | 3 | 1 |
| economic\-pragmatist | 3 | 1 |
| efficiency\-vs\-child\-development\-pluralist | 3 | 1 |
| governance\-design\-pragmatist | 3 | 1 |
| liberal\-pragmatist | 3 | 1 |
| national\-language\-primacy\-institutionalist | 3 | 1 |
| neutral\-pluralist | 3 | 1 |
| prioritize\-national\-language\-policy\-institutionalist | 3 | 1 |
| pro\-community\-governance\-pragmatist | 3 | 1 |
| pro\-hindu\-traditions | 3 | 1 |
| pro\-pantai\-timur\-regional\-identity | 3 | 1 |
| pro\-sabah\-language\-education\-inclusion\-pragmatist | 3 | 1 |
| pro\-sabah\-language\-inclusion\-pragmatist | 3 | 1 |
| pro\-sabah\-linguistic\-identity | 3 | 1 |
| pro\-accountability\-home\-parenting | 3 | 1 |
| pro\-accountability\-in\-religious\-leadership | 3 | 1 |
| pro\-bureaucratic\-accountability\-and\-efficiency\-pragmatist | 3 | 1 |
| pro\-community\-empowerment | 3 | 1 |
| pro\-community\-governance\-pragmatist | 3 | 1 |
| pro\-community\-involvement\-pragmatist | 3 | 1 |
| pro\-consistent\-enforcement\-pragmatist | 3 | 1 |
| pro\-governance\-accountability\-pragmatist | 3 | 1 |
| pro\-heritage\_recognition | 3 | 1 |
| pro\-hybrid\-pathways\-pragmatist | 3 | 1 |
| pro\-individual\-and\-parental\-accountability | 3 | 1 |
| pro\-individual\-cultural\-ownership | 3 | 1 |
| pro\-individual\-responsibility | 3 | 1 |
| pro\-institutional\-cultural\-stewardship | 3 | 1 |
| pro\-institutional\-enforcement | 3 | 1 |
| pro\-integration\-pragmatist | 3 | 1 |
| pro\-labur\-reform\-pragmatist | 3 | 1 |
| pro\-moral\-stability | 3 | 1 |
| pro\-multicultural\-understanding | 3 | 1 |
| pro\-multilingual\-inclusive\-education | 3 | 1 |
| pro\-parental\-accountability | 3 | 1 |
| pro\-parental\-accountability\-pragmatist | 3 | 1 |
| pro\-privacy\-pragmatist | 3 | 1 |
| pro\-religious\-governance\-integration | 3 | 1 |
| pro\-social\-cohesion | 3 | 1 |
| pro\-social\-infrastructure\-for\-parental\-accountability | 3 | 1 |
| pro\-space\-redesign | 3 | 1 |
| pro\-state\-enforcement\-institutionalist | 3 | 1 |
| pr
o\-state\-institutional\-legitimacy31pro\-state\-mediated\-religious\-moral\-governance31pro\-structural\-accountability\-pragmatist31pro\-system\-authority\-institutionalist31pro\-system\-efficiency\-pragmatist31pro\-traditional\-family\-values31pro\-transparency\-in\-safeguarding\-oversight31pro\-workplace\-fairness31promote\-technocratic\-cultural\-governance\-framework\-pragmatist31situational\-bilingualism\-pluralist31structured\-family\-accountability\-training\-pragmatist31systemic\-causality\-framing31technocratic\-integrity\-governance\-pragmatist31anti\-top\-down\-cultural\-legitimacy31anti\-blaming\-parents31anti\-bureaucratic\-neglect31anti\-government\-negligence31anti\-leniency\-gisb31anti\-manipulasi\-bahasa\-sopan31anti\-performative\-politics31anti\-politicization\-transliteration31anti\-regional\-dialect31moral\-sovereignty\-based\-on\-local\-values31pragmatic\-balance31pro\-sabah\-autonomy\-for\-economic\-efficiency31pro\-employer\-authority31pro\-ethical\-governance\-\(new\)31pro\-governance\-compliance\-pragmatism31pro\-managed\-linguistic\-pluralism\-pragmatist31pro\-market\-compliance\-&\-efficiency31pro\-multilingualism31pro\-parents\-accountability31pro\-personal\-accountability31pro\-regional\-autonomy31pro\-respect\-islam31pro\-traditional\-education\-legitimacy31epistemic\-methodological\-pragmatist21individualistic21legal\-empirical\-institutionalist21pro\-islamic\_term\_protection21pro\-malay\-dignity21pro\-sabahan\-malay\-inclusion21pro\-communal\-trust21pro\-multicultural\-malaysian\-identity21pro\-privacy\-&\-system\-skeptical21pro\-regional\-rights21pro\-regulatory\-risk\-balancing\-pragmatist21pro\-religious\-integrity\-in\-diplomacy21pro\-structured\-tradition21anti\-lgbtq\-symbols21anti\-authority21anti\-labour\-exploitation21anti\-religious\-hypocrisy21anti\-standardisation\-islamic\-certification21anti\-surveillance\-ekyc21pro\-malay\-language\-culture21pro\-muslim\-friendly\-inclusivity21pro\-community\-conscience21pro\-enforc
ement\-accountability\-pragmatist21pro\-malay\-language21pro\-traditional\-values21pro\-transparency\-in\-governance21sympathetic\-to\-parents21
## Appendix G Participant-Mode Diagnostic
Figure [6](https://arxiv.org/html/2606.10569#A7.F6) reports the participant-level check used to test whether multi-selection reflects indiscriminate acceptance. For each participant, we compute selection breadth as the average number of accepted responses per prompt, and consensus alignment rate as the share of that participant’s acceptances landing on responses that reached $\geq 2/3$ trio consensus.
![[Uncaptioned image]](https://arxiv.org/html/2606.10569v1/x5.png)
Figure 6: Per-participant consensus alignment rate against selection breadth for the 20 participants in the 107-trio subset. The $y$-axis measures the share of each participant’s acceptances that landed on $\geq 2/3$ majority-supported responses. The slope is not statistically significant, with $r = -0.09$ and $p = 0.71$.
Participant-level selection breadth ranges from 1.33 to 2.69 responses per prompt, while consensus alignment rates range from 71% to 100%. The absence of a detectable negative association suggests that broader selection is not simply lower-quality or indiscriminate acceptance. Instead, participants who accept more responses often surface additional majority-supported responses that single-winner aggregation leaves behind.
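The two participant-level quantities defined in this appendix can be illustrated with a minimal sketch. It is not the authors' code: the record layout `(participant, prompt, accepted responses)`, the toy data, and the function names `consensus_sets`, `participant_stats`, and `pearson_r` are all assumptions made for illustration. Only the correlation coefficient is computed; the significance test behind the reported $p$-value is not reproduced.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from math import sqrt

# Hypothetical toy records: (participant, prompt, set of accepted responses).
# Each prompt is judged by a trio of participants, as in the 107-trio subset.
RECORDS = [
    ("p1", "q1", {"A", "B"}),
    ("p2", "q1", {"A"}),
    ("p3", "q1", {"A", "B", "C"}),
    ("p1", "q2", {"B"}),
    ("p2", "q2", {"B", "C"}),
    ("p3", "q2", {"A", "B"}),
]

def consensus_sets(records, threshold=Fraction(2, 3)):
    """Per prompt, the responses accepted by at least `threshold` of its raters."""
    votes = defaultdict(Counter)
    raters = Counter()
    for _, prompt, accepted in records:
        raters[prompt] += 1
        votes[prompt].update(accepted)
    return {
        prompt: {r for r, n in counts.items()
                 if Fraction(n, raters[prompt]) >= threshold}
        for prompt, counts in votes.items()
    }

def participant_stats(records):
    """Map participant -> (selection breadth, consensus alignment rate)."""
    consensus = consensus_sets(records)
    accepted, aligned, prompts = Counter(), Counter(), Counter()
    for participant, prompt, chosen in records:
        prompts[participant] += 1
        accepted[participant] += len(chosen)
        aligned[participant] += len(chosen & consensus[prompt])
    return {
        p: (accepted[p] / prompts[p], aligned[p] / accepted[p])
        for p in prompts
        if accepted[p] > 0  # alignment is undefined with zero acceptances
    }

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

stats = participant_stats(RECORDS)
breadths = [b for b, _ in stats.values()]
alignments = [a for _, a in stats.values()]
r = pearson_r(breadths, alignments)
```

With three toy participants the value of `r` carries no evidential weight. The sketch only shows how breadth and alignment are derived from the same acceptance records before being correlated.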
## Appendix H Additional Result Visualizations
![[Uncaptioned image]](https://arxiv.org/html/2606.10569v1/hidden_consensus_pie_only.png)
Figure 7: Prompt-level prevalence of hidden consensus. Across 107 trio-annotated prompts, 79.4% contain more than one response reaching the $\geq 2/3$ majority acceptance threshold. Only 19.6% contain a single majority-supported response, and 0.9% contain no majority-supported response.
Figure [7](https://arxiv.org/html/2606.10569#A8.F7) summarizes the prevalence of hidden consensus at the prompt level. The dominant pattern is not single-response convergence. Instead, most prompts contain multiple responses reaching the majority acceptance threshold, showing that the supported acceptability set is often non-singleton before aggregation.
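The three-way prompt-level breakdown can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the vote-count layout, the toy prompts, and the names `majority_supported` and `hidden_consensus_breakdown` are hypothetical. Each prompt is assumed to have three annotators, so the $\geq 2/3$ threshold corresponds to at least two acceptances.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy data: prompt -> {response label: number of accepting annotators}.
TOY_VOTES = {
    "q1": {"A": 3, "B": 2, "C": 0},
    "q2": {"A": 2, "B": 2, "C": 2},
    "q3": {"A": 3, "B": 1, "C": 0},
    "q4": {"A": 1, "B": 1, "C": 1},
    "q5": {"A": 0, "B": 3, "C": 2},
}

def majority_supported(votes, n_raters=3, threshold=Fraction(2, 3)):
    """Responses whose acceptance share reaches the threshold (exact arithmetic)."""
    return [r for r, n in votes.items() if Fraction(n, n_raters) >= threshold]

def hidden_consensus_breakdown(prompt_votes, n_raters=3):
    """Share of prompts with multiple, exactly one, or no majority-supported response."""
    counts = Counter()
    for votes in prompt_votes.values():
        k = len(majority_supported(votes, n_raters))
        label = "none" if k == 0 else "single" if k == 1 else "multiple"
        counts[label] += 1
    total = len(prompt_votes)
    return {label: counts[label] / total for label in ("multiple", "single", "none")}

breakdown = hidden_consensus_breakdown(TOY_VOTES)
```

As a consistency check on the reported figures, and as an inference from the percentages rather than counts stated in the text, 79.4%, 19.6%, and 0.9% of 107 prompts correspond to 85, 21, and 1 prompts, which sum to 107. The percentages total 99.9% only because of rounding.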
Figure 8: Ranking reversal under single-winner aggregation. The $\arg\max$ view ranks responses as $A > B > C$, with $A$ appearing as the dominant winner. When all majority-supported responses are counted, the ranking changes to $B > A > C$. This shows that single-winner aggregation changes the apparent support structure rather than merely summarizing it.

Figure [8](https://arxiv.org/html/2606.10569#A8.F8) provides an alternative visualization of the ranking reversal reported in Section [6](https://arxiv.org/html/2606.10569#S6). It is included to make clear that $\arg\max$ aggregation changes the apparent ordering of response support, not only the amount of support retained.
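A toy example shows how such a reversal can arise. The data below are invented to reproduce the qualitative pattern and are not drawn from the corpus. The labels `A`, `B`, `C`, the function names, and the assumption that each prompt has three annotators (so a response is majority-supported with at least two acceptances) are all illustrative. How the paper breaks ties for the single winner is not specified here, so the toy data avoid ties at the top.

```python
from collections import Counter

MIN_VOTES = 2  # assumed: >= 2 of 3 annotators, i.e. the >= 2/3 threshold

# Hypothetical toy data: prompt -> {response label: number of accepting annotators}.
TOY_VOTES = {
    "q1": {"A": 3, "B": 2, "C": 0},
    "q2": {"A": 3, "B": 2, "C": 1},
    "q3": {"A": 1, "B": 3, "C": 2},
    "q4": {"A": 0, "B": 2, "C": 3},
    "q5": {"A": 3, "B": 2, "C": 1},
    "q6": {"A": 1, "B": 3, "C": 0},
}

def argmax_tally(prompt_votes):
    """Single-winner view: credit only the most-accepted response per prompt."""
    tally = Counter()
    for votes in prompt_votes.values():
        winner = max(votes, key=votes.get)  # toy data have a unique maximum
        tally[winner] += 1
    return tally

def majority_tally(prompt_votes, min_votes=MIN_VOTES):
    """Plural view: credit every response that reaches the majority threshold."""
    tally = Counter()
    for votes in prompt_votes.values():
        tally.update(r for r, n in votes.items() if n >= min_votes)
    return tally

def ranking(tally):
    """Labels ordered by descending count, alphabetical on equal counts."""
    return [r for r, _ in sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))]

single_winner_order = ranking(argmax_tally(TOY_VOTES))  # ['A', 'B', 'C']
plural_order = ranking(majority_tally(TOY_VOTES))       # ['B', 'A', 'C']
```

In this toy set, `B` is rarely the top choice but is almost always acceptable to a majority, so it is undercounted whenever only one winner per prompt is retained.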