Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation
Summary
This paper proposes an AI-driven workflow that writes detailed constitutional definitions for content moderation categories and uses a frontier LLM to interpret them for more consistent labeling. Evaluated on harassment, hate speech, and non-violent crime, the approach reduces cross-model inconsistency by up to 57x compared to paragraph definitions.
View Cached Full Text
Cached at: 05/26/26, 09:01 AM
# Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation
Source: [https://arxiv.org/html/2605.24247](https://arxiv.org/html/2605.24247)
###### Abstract
Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case\. Simple category definitions are not detailed enough for labelers to produce the accurate, consistent golden labels these pipelines require\. One solution is to write a prescriptive definition that settles enough real boundary cases that labelers cannot disagree with the written interpretation\. In practice, definitions at that level of detail exceed what a human annotator can hold in working memory, so annotators fall back on intuition and the labels drift from the written rules, regressing on accuracy and consistency\.
We propose and demonstrate the efficacy of an AI\-driven workflow in which AI helps write a per\-categoryconstitutionthat defines the label in enough detail to cover edge cases, and a frontier LLM interprets it on each input to produce the golden label more consistently and accurately than humans reading the same document\. We evaluate on three content moderation categories \(harassment, hate speech, non\-violent crime\) and show that the approach reduces cross\-model inconsistency by up to 57×\\timescompared to paragraph definitions, with cross\-model disagreement diagnosing specification gaps and the human responsible for high\-level decisions about what each category should mean rather than individual labeling calls\. For the safety evaluation, we introduce a dual\-axis formulation scoring intent and content independently over the full conversation, so downstream consumers can act on either axis or both\.
Improving Labeling Consistency with Detailed Constitutional Definitions and AI\-Driven Evaluation
Konstantin Berlin and Adam SwandaCisco AI Defense\{berlink, aswanda\}@cisco\.com
## 1Introduction
Building, monitoring, and improving a detection system depends on golden labels whose meaning is precisely defined and stable across labelers\. Content moderation systems face this problem acutely, classifying conversations into harm categories such as harassment, hate speech, and non\-violent crime where small definitional differences swing flag rates by an order of magnitude\. These classifications serve multiple consumers: guardrails that block harmful content in real time, labeling teams that produce training data, evaluators that measure model safety, and documentation that explains to customers what is detected and why\. All of these consumers depend on golden labels grounded in a shared definition of each category, but deployed taxonomies typically define each in one or two sentences \(Appendix[C](https://arxiv.org/html/2605.24247#A3)\), and every downstream consumer fills the gaps from its own prior understanding: LLMs from their training data, annotators from institutional memory, documentation writers from their reading of the category name\.
A content moderation system that blocks too many legitimate conversations gets disabled by customers, so the boundary between harmful and merely adjacent content is a deployment requirement\. Drawing that boundary requires explicit rulings on edge cases, but short definitions leave those rulings unresolved, and the necessary narrowing only emerges when a definition is verified against real traffic at scale\. Even teams that develop internally coherent definitions through this process struggle to transmit them: the exceptions accumulate until the full specification resembles legal doctrine, requiring category\-level expertise to interpret\. A specification at that level of detail exceeds what an annotator can hold in working memory during classification\(Swelleret al\.,[1998](https://arxiv.org/html/2605.24247#bib.bib13); Cowan,[2001](https://arxiv.org/html/2605.24247#bib.bib14)\), and the problem compounds because annotators must apply specifications for every category in the taxonomy to each conversation, so they compress to heuristics and substitute their own judgment for the written rules\(Kahneman and Frederick,[2002](https://arxiv.org/html/2605.24247#bib.bib15)\)\. Adjacent categories compound the difficulty: Hate Speech and Harassment share threats to individuals, and Non\-Violent Crime and Scams share manipulative intent\.
When two LLMs from different vendors read the same short definition and disagree on the same conversation, the definition is incomplete, and each model falls back on its training priors rather than the document\. The remedy is not consensus labeling \(which model is right?\) but tighter specification \(where is the definition incomplete?\): write a definition precise enough that reasonable models and annotators converge, rather than aggregating over their divergent priors\.
Our contributions are the following:
- •We proposeconstitutional specificationsas a method for producing golden labels in tasks where a written category definition must be adjudicated consistently at scale: per\-category documents with required elements, decision logic, boundary notes, and worked examples that a frontier LLM interprets on each input to produce the label\. We build on Constitutional AI\(Baiet al\.,[2022](https://arxiv.org/html/2605.24247#bib.bib3)\)and Constitutional Classifiers\(Sharmaet al\.,[2025](https://arxiv.org/html/2605.24247#bib.bib6)\), extending the same rule\-document idea from training\-time or runtime enforcement to golden\-label production for downstream processes\.
- •For content moderation, we introduce a dual\-axis formulation that separates intent from content as independent binary labels, scored over full conversations rather than individual prompts\.
- •We demonstrate an AI\-driven authoring and maintenance pipeline in which humans curate a single constitution per category and AI drives classification, validation, and refinement under minimal supervision\. Cross\-model disagreement identifies specification gaps, and an iterative refinement loop converts each unresolved case into an explicit ruling\.
- •We show that three LLMs reading a constitution produce more unanimous labels than three human annotators reading the same document on HarmBench\(Mazeikaet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib5)\), with LLM labels aligning more closely with human expert adjudication than any shorter definition does\.
- •We show that LLM labels under a constitution are more consistent across frontier models than under paragraph definitions, with cross\-model disagreement reduced by up to 57×\\timeson WildChat\(Zhaoet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib26)\)\.
The constitutional taxonomy is the definitional layer beneath the platform architecture described inSwandaet al\.\([2025](https://arxiv.org/html/2605.24247#bib.bib17)\)\.
## 2Taxonomy Constitutions
A model\-level constitution like “be helpful and honest and don’t help make weapons” is a behavioral principle, not a classification specification\. Our constitutions differ from prior uses of the term \(§[5](https://arxiv.org/html/2605.24247#S5)\) in that each is a per\-category operational specification with the structure shown in Table[A1](https://arxiv.org/html/2605.24247#A1.T1)\. The LLM reads the full document on every conversation, and every rule exists because removing it leaves a boundary case unresolved, so the document rather than the annotator’s priors determines the answer\.
### 2\.1Constitution Anatomy
Each constitution is a structured Markdown document\. The Harassment constitution runs over 300 lines\. All constitutions follow the same ten\-component structure \(Figure[1](https://arxiv.org/html/2605.24247#S2.F1); Table[A1](https://arxiv.org/html/2605.24247#A1.T1)in Appendix[A](https://arxiv.org/html/2605.24247#A1)details each component with examples from the Harassment constitution\)\.
Figure 1:Constitution structure and downstream integration\. A single constitution per category generates classification prompts, labeling guidelines, customer documentation, and test suites\.Constitutions cover categories including harmful content, goal hijacking \(jailbreak techniques\), data privacy violations, action\-space exploits, and persistence attacks\.
### 2\.2Intent and Content Axes
Each constitution defines two labels per category:intent, an attempt to cause or obtain harm, andcontent, harmful material appearing in the conversation\. Prior safety classifiers do not separate the two\. Llama Guard\(Inanet al\.,[2023](https://arxiv.org/html/2605.24247#bib.bib8)\)applies one taxonomy to both prompt and response classification, with the user\-vs\-AI distinction handled in the task instruction rather than the category definition itself, and BeaverTails\(Jiet al\.,[2023](https://arxiv.org/html/2605.24247#bib.bib4)\)assigns a single harm\-label set to each prompt\-response pair as a whole\.
Separating intent from content matters because it gives each deployment a choice the existing classifiers cannot offer\. Consider harassment in a chatbot setting: a user asking the assistant to draft a defamatory message about a coworker carries clear harassment intent regardless of whether the model complies, while a user asking the same chatbot to summarize a public forum thread carries no such intent, even though the resulting summary may surface harassing material if the underlying thread coordinates abuse against a real person\. A consumer\-facing chatbot may not want to act on the intent at all, since asking is not itself a policy violation, and only needs to block harmful content from reaching the user\. An enterprise deployment may want the opposite, logging every harassment intent so a security team can investigate repeat abusers even when the model refused and no harmful content was emitted\. A specification that produces only a single label per category collapses these cases together, and downstream consumers cannot recover the distinction the specification never made\. The split widens further in agentic deployments, where a retrieval\-augmented agent can surface harmful material from a poisoned memory bank\(Donget al\.,[2025](https://arxiv.org/html/2605.24247#bib.bib31)\), an indirect prompt injection embedded in a document can redirect a benign user request into harmful actions\(Zhanet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib30)\), and in agent\-to\-agent channels one model’s output becomes another’s input as attacker and victim roles shift turn by turn\.
Both labels are evaluated over the full conversation rather than message\-by\-message, because multi\-turn attacks build harmful direction gradually\(Russinovichet al\.,[2025](https://arxiv.org/html/2605.24247#bib.bib19); Changet al\.,[2025](https://arxiv.org/html/2605.24247#bib.bib18)\)and a response that looks benign in isolation can become harmful given the preceding buildup\. The four combinations of intent and content carry distinct operational signals: intent without content indicates that the system was probed and the model refused; content without intent records harmful material introduced on a benign request, whether through a model response, a retrieved document, or a tool output; both positive marks a guardrail or pipeline failure when the system emitted or surfaced the material rather than merely receiving it from the user; and both negative covers clean conversations, including safe discussionsaboutthe topic\. To our knowledge, this is the first per\-category constitutional specification to define intent and content as independent conversation\-level axes\.
### 2\.3Definition Consolidation
A category like Harassment starts as three disconnected artifacts: a two\-paragraph description in the public taxonomy, a detailed labeling workbook maintained by the human review team \(with edge\-case rulings on workplace criticism, public figures, AI\-directed frustration\), and a classification prompt embedded in pipeline code \(with its own implicit boundaries\)\. Building the constitution means merging all three into one document: the public description provides the top\-level definition, the labeling workbook’s edge\-case rulings supply boundary notes and worked examples, and the classification prompt’s implicit logic is rewritten as explicit decision criteria with required elements\. Where the three sources contradict \(and they do, on questions like whether criticism of a public figure’s professional performance counts as harassment\), we surface the contradiction, debate it, and document a ruling that all downstream artifacts inherit\.
## 3Validation and Refinement
### 3\.1Constitution Authoring
Constitution authoring follows the same human\-directs\-AI\-executes pattern now dominant in agentic code writing\. A human identifies a problem \(a wrong classification, a customer question with no clear answer, a new attack pattern that falls between categories\) and provides direction, such as “this should not be flagged, it is professional criticism\.” AI then revises the relevant constitutional sections, checks the revision against the rest of the document for consistency, and checks against other constitutions for conflicts\. The human reviews the output and accepts, rejects, or redirects, while AI handles the consistency checks across hundreds of lines of specification\. When a constitution changes, all downstream artifacts \(classification prompts, labeling guidelines, documentation, test suites\) regenerate from it\.
### 3\.2Cross\-Model Validation
In principle, a reviewer could read a complete constitution and hand\-check every rule interaction, but the effort is enormous, and any residual ambiguities go undetected until production traffic surfaces them\. We instead validate with AI augmentation: running the constitution on production conversations with multiple frontier LLMs as independent judges and examining where they disagree\. Disagreements pinpoint the sections of the specification that are ambiguous or incomplete, turning validation into a targeted search rather than an exhaustive review\.
Models from different vendors are required for meaningful cross\-model disagreement: same\-family models share biases, so their agreement does not signal that the constitution is unambiguous\.Panicksseryet al\.\([2024](https://arxiv.org/html/2605.24247#bib.bib9)\)showed that LLM evaluators exhibit systematic self\-preference bias, andVergaet al\.\([2024](https://arxiv.org/html/2605.24247#bib.bib10)\)showed that a panel of models from different families outperforms any single judge\. We use disagreement as a diagnostic for specification gaps rather than a vote to aggregate\.
When models disagree on a conversation, the validation skill \(Appendix[G](https://arxiv.org/html/2605.24247#A7)\) traces the disagreement to a specific constitutional section, diagnoses the ambiguity, and drafts a targeted patch for human review; each round of this loop converts an implicit ruling into an explicit one\.
### 3\.3Refinement Loop
Each validation run produces a ranked set of patches: specific before/after edits to constitution sections, tied to the disagreements that motivated them\. A human reviews each patch, accepts or modifies it, and merges the change; the constitution then re\-validates against the same test set to confirm the patch resolved the disagreement without introducing regressions\.
Refinement also operates across the full taxonomy: AI audits all constitutions for contradictions \(two constitutions both claiming the same input\), gaps \(content between category boundaries with no ruling\), and inconsistencies \(conflicting conservatism stances across related categories\)\.
## 4Experiments
We evaluate on three categories \(Harassment, Non\-Violent Crime, Hate Speech\), chosen because they are among the most common safety categories across vendor taxonomies and all four baseline taxonomies in Appendix[C](https://arxiv.org/html/2605.24247#A3)define them\. We compare six definition sources of increasing detail: four published taxonomies \(OpenAI Moderation API\(OpenAI,[2024](https://arxiv.org/html/2605.24247#bib.bib36)\), Llama Guard 3\(Meta,[2024](https://arxiv.org/html/2605.24247#bib.bib25)\), MLCommons AILuminate\(Ghoshet al\.,[2025a](https://arxiv.org/html/2605.24247#bib.bib24)\), AEGIS\(Ghoshet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib20)\)\), our paragraph\-level definition, and the current constitution after iterative refinement \(Appendix[C](https://arxiv.org/html/2605.24247#A3)\)\.
We evaluate on two datasets\. HarmBench\(Mazeikaet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib5)\)is a widely used safety benchmark that ships with seven namedSemanticCategorylabels, but the published paper and repository do not provide operational boundary criteria for those categories, so the intended definition for each is inferable only from example behaviors\. We use HarmBench to measure how labeling outcomes diverge across definitions \(Figure[2](https://arxiv.org/html/2605.24247#S4.F2)\) and to compare human vs\. LLM rater agreement \(Table[1](https://arxiv.org/html/2605.24247#S4.T1)\)\. WildChat\(Zhaoet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib26)\)provides∼\{\\sim\}1M organic ChatGPT conversations, which we use to evaluate cross\-model stability under realistic production conditions \(Table[2](https://arxiv.org/html/2605.24247#S4.T2)\)\.
For each dataset and category, we sample 200 suspected\-positive and 1,000 conservative\-negative conversations from the production pipeline; because harmful content is rare in production, we oversample pipeline\-flagged conversations and reweight by population base rate to recover production\-representative metrics \(Appendix[B](https://arxiv.org/html/2605.24247#A2)\)\. Each conversation is classified by six LLMs under each definition, producing intent, content, and combined \(intent OR content\) labels per \(conversation, definition, model\) tuple\. A small fraction of model outputs fail to parse as valid JSON \(under 1% for frontier models, up to 4\.8% for Safeguard 20B on specific definition/category slices\); we exclude these conversations pairwise rather than imputing, so all reported disagreement rates and confidence intervals are computed over conversations where both models in the pair produced a valid label\. Four human annotators independently labeled all HarmBench conversations using the full constitution\.
HarmBench and WildChat were not used during constitution refinement and serve as held\-out evaluation sets; all constitutions were refined against independent customer data and other open\-source datasets\.
### 4\.1Definition Comparison
Figure[2](https://arxiv.org/html/2605.24247#S4.F2)shows pairwise disagreement between all six definitions and human labels on 392 HarmBench conversations, each evaluated by GPT\-5\.4\. Each cell reports how often two sources disagree per 1,000 conversations \(lower is better; the diagonal is zero by construction\)\.
Figure 2:Hierarchical clustering of pairwise disagreement per 1,000 conversations on HarmBench \(N=392N\{=\}392\), evaluated by GPT\-5\.4\. Each cell counts how often two label sources disagree; dendrograms group sources by similarity\. No HarmBench category exists for Hate Speech\.The clustering reveals that category complexity determines how much a definition matters\. Hate Speech has tight boundaries \(protected\-characteristic targeting\) that most definitions capture in a sentence, so all sources agree within 3–20 per 1,000, and the clustermap shows little differentiation\. Non\-Violent Crime is far broader, covering everything from copyright infringement to drug manufacturing to cyber crime, and a paragraph cannot resolve which subtypes belong; disagreement between definitions ranges from 209 to 587 per 1,000, with the four published taxonomies clustering together at high disagreement while Human and Constitution form the closest pair\. Harassment falls between: boundary cases like political defamation and roleplay\-wrapped abuse separate the constitution from shorter definitions, but not as dramatically as Non\-Violent Crime\.
Our paragraph\-level definition derives from the constitution, yet it disagrees substantially with the full constitutional specification on Harassment and NVC: the paragraph alone does not carry enough information to reproduce the constitution’s boundary rulings\. We quantify this gap further on WildChat data in §[4\.3](https://arxiv.org/html/2605.24247#S4.SS3)\.
HarmBench’sSemanticCategorytags do not map cleanly onto any published taxonomy: what other taxonomies bucket under Non\-Violent Crime is split across HarmBench’sillegal,cybercrime\_intrusion,chemical\_biological, andcopyrightsibling tags, and no two published taxonomies would union them the same way\. Researchers who use these labels as ground truth risk measuring agreement with a taxonomy whose scope is neither documented nor reproducible\. The problem extends beyond research: widely used open\-source red\-teaming suites ship HarmBench as a first\-class artifact\. Promptfoo exposes it as a plugin across seven HarmBench semantic categories\(Promptfoo,[2026](https://arxiv.org/html/2605.24247#bib.bib34)\)\. Garak vendors the HarmBench standard subset and uses it directly as the payload source for its multi\-turn FITD jailbreak probe, while its single\-turn SATA probe uses a HarmBench\-derivedharmful\_behaviorsset\(Derczynskiet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib35)\)\. Anyone running these suites against a guardrail inherits HarmBench’s category boundaries as the de facto evaluation taxonomy, so the absence of operational definitions propagates into downstream product comparisons, not just research\.
### 4\.2Rater Agreement
Figure[2](https://arxiv.org/html/2605.24247#S4.F2)measures definition disagreement using a single LLM, but the human labels themselves carry uncertainty: four annotators do not always agree on the same conversation\. Table[1](https://arxiv.org/html/2605.24247#S4.T1)isolates this rater\-level consistency by comparing three\-rater unanimity on HarmBench: for each conversation, we sample a fixed triple of human annotators and a triple of frontier LLMs \(GPT\-5\.4, Opus 4\.6, Gemini 3\.1\) reading the same constitution, and count how often each triple produces a unanimous label\.
Table 1:Three\-rater non\-unanimity on the intent axis per 1,000 conversations on HarmBench \(N=392N\{=\}392\)\. Lower is better\. Human = 3 annotators \(fixed subset\); LLM = 3 models \(GPT\-5\.4, Opus 4\.6, Gemini 3\.1\)\.Three LLMs reading the same constitution achieve higher unanimity than three human annotators on all three categories, with the largest gap on Non\-Violent Crime \(84\.2 vs\. 301\.0\) and the smallest on Harassment \(37\.9 vs\. 43\.4\)\.
Examining all 91 cases where the human and LLM majorities \(2/3\) disagree on intent, we found two systematic annotator failure modes\. Annotators treated multi\-label classification as single\-label, filing conversations under sibling categories \(e\.g\., Financial Harm instead of Non\-Violent Crime\) rather than evaluating the constitution in front of them\. They also flagged surface\-level harm \(death threats without an identifiable target, political criticism of public figures\) without applying the constitutional decision logic\. Both failure modes are consistent withBayerl and Paul \([2011](https://arxiv.org/html/2605.24247#bib.bib12)\)’s meta\-analytic finding that the number of categories in a coding scheme and the intensity of annotator training significantly affect inter\-annotator agreement\. LLMs evaluate each constitution as an independent binary judgment, so multi\-label collapse does not occur and every applicable category is flagged on its own rather than competing with siblings for a single slot\. Better training would probably reduce the surface\-harm failures, but constitutions and taxonomies change frequently, and retraining every annotator cohort on each revision is cost\-prohibitive, whereas an LLM reads the current constitution fresh on every conversation\.
### 4\.3Cross\-Model Validation
WildChat provides a more realistic distribution than HarmBench, with rare harmful content and clear\-cut majority cases, so residual cross\-model disagreements are more representative\. We sample 200 suspected\-positive and 1,000 conservative\-negative conversations per category from the production pipeline\. Six LLMs \(GPT\-5\.4, GPT\-5\.4 Mini, GPT\-5\.4 Nano, Opus 4\.6, Gemini 3\.1 Pro, and Safeguard 20B\) each label every conversation twice: once with the paragraph\-level definition from Appendix[C](https://arxiv.org/html/2605.24247#A3)and once with the current constitution\.
Table 2:Cross\-model disagreements per 1,000 conversations on WildChat, split by intent and content\. Lower is better\. Each cell counts how often a second model disagrees with Opus 4\.6 on the same conversation\. 95% stratified bootstrap CI \(B=1000B\{=\}1000\)\.Under paragraph definitions, cross\-model disagreement ranges from 2 to 66 per 1,000 conversations depending on model, category, and axis\. The constitution reduces this to under 3 for frontier models \(Gemini, GPT\-5\.4, GPT\-5\.4 Mini\), with reduction ratios up to 57×\\times\. The reduction comes from the constitution’s explicit exclusions: fiction without a real target, AI\-directed hostility, civil/regulatory violations, and dual\-use security questions all trigger paragraph definitions but are resolved by constitutional boundary rulings\.
The binding constraint on this improvement is not the constitution’s length but the model’s ability to execute its decision logic\. GPT\-5\.4 Mini achieves disagreement rates within 1 per 1,000 of GPT\-5\.4 across most categories, placing the full constitution within the working capacity of near\-frontier models, while GPT\-5\.4 Nano disagreement remains an order of magnitude higher \(Table[2](https://arxiv.org/html/2605.24247#S4.T2)\), putting the boundary at roughly the mini\-class\. Safeguard 20B \(gpt\-oss\-safeguard\-20b\), an open\-weight safety\-reasoning model designed to read an externally supplied classification policy at inference, performs within the frontier range on most category/axis pairs despite being much smaller\. Opus anchors the pairwise computation; the full pairwise matrices \(Appendix[D](https://arxiv.org/html/2605.24247#A4)\) confirm that the choice of anchor does not bias the results\.
#### 4\.3\.1Disagreement Analysis
Frontier models \(Opus 4\.6, GPT\-5\.4, Gemini 3\.1 Pro\) produced 191 residual disagreements under the constitution on WildChat, summed across the intent and content axes\. A hand\-audit of a random sub\-sample found that most are genuinely ambiguous cases rather than instruction\-following failures: each model cites a specific constitutional provision and selects a different one from the same document, and a human adjudicator reaches a verdict only by bringing an external prior about how suspicious to be of the user\. The largest subcategories are roleplay personas directed at real people \(30\+ cases\), slur\-presence rules conflicting with educational exclusions \(20\+\), and assistant\-to\-user abuse where the target requirement is unaddressed \(15\+\)\. Introducing the conservatism stance was the right direction for resolving these disagreements, but it surfaces a meta\-level definitional problem on top: two models reading HIGH or MODERATE do not resolve the same edge case the same way, because the stance itself does not fix a shared prior\. Specifying conservatism in terms that do is something that would need to be explored in future work\. Smaller models show higher disagreement because they apply the constitution more superficially: when GPT\-5\.4 Nano disagrees with Opus on the combined intent\-OR\-content label, Nano over\-flags in 93% of cases, matching harmful keywords without applying the exclusions that follow them, treating fictional character names as real targets, scam\-baiting as crime enablement, and discussion of stereotypes as hate speech\.
The Harassment constitution lifted F1 from 0\.47 to 0\.65 through narrowly\-scoped patches the refinement loop of §[3\.3](https://arxiv.org/html/2605.24247#S3.SS3)traced to specific sections rather than broad reformulations\. We document the patch series and adoption process in Appendix[F](https://arxiv.org/html/2605.24247#A6)\.
## 5Related Work
##### Constitutional and rule\-based safety\.
Baiet al\.\([2022](https://arxiv.org/html/2605.24247#bib.bib3)\)introduced Constitutional AI, where a short natural\-language constitution guided model behavior through self\-critique and AI feedback at training time\.Sharmaet al\.\([2025](https://arxiv.org/html/2605.24247#bib.bib6)\)extended this into Constitutional Classifiers, where a constitution covering a single threat domain \(CBRN\) generated synthetic training data for a fine\-tuned input/output classifier; the follow\-up CC\+\+\(Cunninghamet al\.,[2026](https://arxiv.org/html/2605.24247#bib.bib7)\)introduces classifiers that evaluate the full conversation rather than messages in isolation\. Our constitutions extend the idea in a different direction: they are substantially more detailed per\-category operational specifications whose goal is not runtime enforcement or training\-data generation but producing the most accurate golden labels possible for downstream processes \(classifier distillation, detector evaluation, customer\-facing audit\), a problem the prior constitution work from Anthropic does not directly address\.Agrawalet al\.\([2026](https://arxiv.org/html/2605.24247#bib.bib1)\)introduced reflective prompt evolution, sampling task trajectories and proposing natural\-language prompt updates from diagnosed failures, outperforming reinforcement learning at prompt optimization\. Our refinement loop \(§[3\.3](https://arxiv.org/html/2605.24247#S3.SS3)\) shares this structure but surfaces human\-readable diagnoses for the constitution author to act on, since where to draw boundary rulings is a definitional choice rather than an optimization target\.
##### Safety taxonomies and classifiers\.
Weidingeret al\.\([2022](https://arxiv.org/html/2605.24247#bib.bib32)\)proposed an early comprehensive academic taxonomy of risks from language models, organizing 21 risks across six areas; subsequent industry taxonomies operationalized subsets of these risks for production classifiers\. Llama Guard\(Inanet al\.,[2023](https://arxiv.org/html/2605.24247#bib.bib8)\), NVIDIA’s Aegis 2\.0\(Ghoshet al\.,[2025b](https://arxiv.org/html/2605.24247#bib.bib21)\), IBM’s Granite Guardian\(Padhiet al\.,[2025](https://arxiv.org/html/2605.24247#bib.bib22)\), and OpenAI’s gpt\-oss\-safeguard\(OpenAI,[2025](https://arxiv.org/html/2605.24247#bib.bib23)\)each support custom risk definitions at inference rather than embedding a fixed taxonomy in model weights\. Our constitutions are the kind of structured per\-category specification these models are designed to read, and we show they work equally well across frontier API models and gpt\-oss\-safeguard\-20b \(§[4\.3](https://arxiv.org/html/2605.24247#S4.SS3)\)\. The MLCommons AILuminate benchmark family\(Ghoshet al\.,[2025a](https://arxiv.org/html/2605.24247#bib.bib24)\)defines 12 hazard categories with operational definitions\. BeaverTails\(Jiet al\.,[2023](https://arxiv.org/html/2605.24247#bib.bib4)\)provides a moderation dataset of QA pairs labeled across harm categories, treating each prompt\-response pair as a single unit rather than scoring individual messages\. Our constitutions produce two independent binary labels per category \(intent and content\) over the full conversation \(§[2\.2](https://arxiv.org/html/2605.24247#S2.SS2)\)\.
##### Annotation science and label variation\.
Aroyo and Welty \([2015](https://arxiv.org/html/2605.24247#bib.bib27)\)argue that annotator disagreement can be signal rather than noise, andPlank \([2022](https://arxiv.org/html/2605.24247#bib.bib28)\)argues that human label variation should be preserved rather than reduced to majority\-vote ground truth\.Davaniet al\.\([2022](https://arxiv.org/html/2605.24247#bib.bib33)\)propose a multi\-task architecture that predicts each annotator’s label separately and matches or outperforms majority\-vote aggregation\.Röttgeret al\.\([2022](https://arxiv.org/html/2605.24247#bib.bib29)\)contrast thisdescriptiveparadigm withprescriptiveannotation, where guidelines direct annotators to apply one specified belief\. Our constitutions are prescriptive: they spell out the belief in enough detail that residual disagreement reveals specification gaps rather than legitimate annotator variation\.
##### LLM\-as\-judge and annotation\.
LLMs are viable annotators\(Gilardiet al\.,[2023](https://arxiv.org/html/2605.24247#bib.bib2)\), with strong LLM judges achieving over 80% agreement with human preferences, matching the agreement level between humans themselves\(Zhenget al\.,[2023](https://arxiv.org/html/2605.24247#bib.bib11)\)\. Single\-model evaluation is unreliable: LLM evaluators exhibit systematic self\-preference bias\(Panicksseryet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib9)\), and a panel of models from different families outperforms any single judge\(Vergaet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib10)\), motivating our cross\-vendor consensus design\.Bayerl and Paul \([2011](https://arxiv.org/html/2605.24247#bib.bib12)\)found that the number of categories and annotator training intensity significantly affect inter\-annotator agreement\.
##### Red\-teaming benchmarks\.
HarmBench\(Mazeikaet al\.,[2024](https://arxiv.org/html/2605.24247#bib.bib5)\)provides a standardized evaluation framework for automated red teaming with labeled harmful prompts across harm categories\. SORRY\-Bench\(Xieet al\.,[2025](https://arxiv.org/html/2605.24247#bib.bib16)\)expands category coverage to 44 unsafe topics through human\-in\-the\-loop methods, with each category specified by several sentences with inline examples\. Our constitutions fill that gap: each is an operational specification detailed enough to generate consistent labels across models and annotators\.Russinovichet al\.\([2025](https://arxiv.org/html/2605.24247#bib.bib19)\)introduced Crescendo, a gradual\-escalation multi\-turn jailbreak that achieves a high attack success rate by steering the conversation through seemingly benign exchanges, andChanget al\.\([2025](https://arxiv.org/html/2605.24247#bib.bib18)\)systematically benchmarked eight open\-weight models, finding multi\-turn success rates substantially higher than single\-turn across the board; both results motivate our full\-conversation evaluation scope\.
## 6Conclusion
Structured per\-category constitutions paired with LLM consensus panels reduce cross\-model disagreement by up to 57×\\timeson organic conversations, LLMs follow the written specification more faithfully than human annotators on the same document, and the combination produces golden labels suitable for classifier distillation, detector evaluation, and customer\-facing audit, more consistent and auditable than human annotation\.
Human annotators struggle to apply specifications at the level of detail these categories require, substituting intuition for written rules and treating multi\-label classification as single\-label\. LLMs do not have these limitations, and the residual disagreements they produce are diagnostic: most trace to ambiguous cases where the constitution applies more than one provision and does not commit to a shared prior about user intent, which targeted rulings and tighter conservatism specification can resolve\. Because each constitution is a single natural\-language document, the same specification drives classification, labeling, evaluation, and customer documentation, with the human role shifting from per\-conversation annotation to specification authoring and disagreement triage\.
## Limitations
Not every cross\-model disagreement points to a real definitional gap, and the agentic skills that propose patches still produce suggestions that mislead human reviewers, so better filtering is needed before constitutional changes can flow through a CI/CD pipeline\. Cross\-model disagreement alone may also be insufficient to surface rare edge cases that high\-traffic production deployments eventually encounter, and additional sampling strategies may need to feed the refinement loop\. A portion of the residual disagreements reflects not a missing rule but differences in how suspicious each model \(or each human adjudicator\) is of the user’s intent before reading the text; the conservatism stance is the intended hook for calibrating this prior, and specifying it in terms that actually fix a shared prior across raters is outstanding future work\. On borderline cases, the same model reading the same constitution can produce different labels across runs, and system\-instruction tuning or constrained decoding may be needed to tighten adherence to complex decision logic\. Current models may also not be advanced enough to faithfully execute certain constitutions, particularly as exception lists grow over time, and optimizing the constitutional text itself \(phrasing, ordering of decision logic, placement of examples\) for model\-side execution is a separate problem our present skills do not yet address\.
We evaluate on three categories \(Harassment, Non\-Violent Crime, Hate Speech\), chosen because they are common across vendor taxonomies and their overlapping boundaries make definitional precision especially important\. Categories with less subjective boundaries \(e\.g\., data privacy violations, code exploits\) may show smaller gains from constitutional specifications, since short definitions already resolve most boundary cases; conversely, categories with more contested boundaries may benefit more\. Our experiments cover content moderation only, and we do not directly evaluate transfer to other spec\-based labeling domains\.
The cross\-model validation relies on six models from three vendors \(OpenAI: GPT\-5\.4, GPT\-5\.4 Mini, GPT\-5\.4 Nano, and gpt\-oss\-safeguard\-20b; Anthropic: Opus 4\.6; Google: Gemini 3\.1 Pro\)\. Models from the same vendor or the same model generation may share training biases that would not surface in cross\-model disagreement, which is why the diagnostic depends on sampling models whose priors genuinely differ\.
Human annotators in our labeling pipeline work from the full constitutional specification, the same document the LLMs receive, so the comparison in Table[1](https://arxiv.org/html/2605.24247#S4.T1)measures agreement under the same definition for both rater types\. We cannot fully disentangle whether the remaining LLM advantage comes from consistency in applying the specification or from holding the full document in context during each classification\.
Sample sizes for positive cases are modest in some categories \(as few as 16 Harassment positives under the constitution\)\. Effect sizes on these subsets should be interpreted with caution\.
We classify each conversation once per \(definition, model\) tuple and do not measure run\-to\-run variance from repeated API calls\. Intra\-model variance is bounded above by cross\-model variance, which we report, so run\-to\-run noise is unlikely to change the conclusions\.
We do not ablate which constitutional components \(decision logic, boundary notes, worked examples\) contribute most to agreement gains\. Component ablation answers a different question \(what minimal prompt suffices for a given accuracy target\), which is the domain of prompt optimization, not specification completeness\.
The agentic skills for constitution management are still under development\. We report their design but not a systematic evaluation of their reliability or the quality of patches they produce\.
## Ethics Statement
This work proposes specifications and workflows for content moderation classification\. We do not release the constitutions themselves, as they contain detailed descriptions of harmful behaviors and adversarial example conversations that could be misused\. Evaluation uses a published benchmark \(HarmBench\), a public corpus of organic ChatGPT conversations \(WildChat\), and human annotations collected under existing production labeling policies; no new data collection involving human subjects was conducted for this work\. We describe the constitutional format in sufficient detail for independent construction, so that others can reproduce the approach without access to our specific constitutions\.
Beyond the LLM agents that are part of the proposed methodology, Codex and Claude Code were used to assist the authors with editing, code generation, and LaTeX formatting during manuscript preparation; all technical claims and results were verified by the authors\.
## References
- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, C\. Potts, K\. Sen, A\. G\. Dimakis, I\. Stoica, D\. Klein, M\. Zaharia, and O\. Khattab \(2026\)GEPA: reflective prompt evolution can outperform reinforcement learning\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px1.p1.1)\.
- L\. Aroyo and C\. Welty \(2015\)Truth is a lie: crowd truth and the seven myths of human annotation\.AI Magazine36\(1\),pp\. 15–24\.External Links:[Document](https://dx.doi.org/10.1609/aimag.v36i1.2564)Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px3.p1.1)\.
- Y\. Bai, S\. Kadavath, S\. Kundu, A\. Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon,et al\.\(2022\)Constitutional AI: harmlessness from AI feedback\.arXiv preprint arXiv:2212\.08073\.External Links:2212\.08073Cited by:[1st item](https://arxiv.org/html/2605.24247#S1.I1.i1.p1.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px1.p1.1)\.
- P\. S\. Bayerl and K\. I\. Paul \(2011\)What determines inter\-coder agreement in manual annotations? a meta\-analytic investigation\.Computational Linguistics37\(4\),pp\. 699–725\.Cited by:[§4\.2](https://arxiv.org/html/2605.24247#S4.SS2.p3.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px4.p1.1)\.
- A\. Chang, N\. Conley, H\. S\. Ganesan, and A\. Swanda \(2025\)Death by a thousand prompts: open model vulnerability analysis\.arXiv preprint arXiv:2511\.03247\.External Links:2511\.03247Cited by:[§2\.2](https://arxiv.org/html/2605.24247#S2.SS2.p3.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px5.p1.1)\.
- N\. Cowan \(2001\)The magical number 4 in short\-term memory: a reconsideration of mental storage capacity\.Behavioral and Brain Sciences24\(1\),pp\. 87–114\.External Links:[Document](https://dx.doi.org/10.1017/S0140525X01003922)Cited by:[§1](https://arxiv.org/html/2605.24247#S1.p2.1)\.
- H\. Cunningham, J\. Wei, Z\. Wang, A\. Persic, A\. Peng,et al\.\(2026\)Constitutional classifiers\+\+: efficient production\-grade defenses against universal jailbreaks\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px1.p1.1)\.
- A\. M\. Davani, M\. Díaz, and V\. Prabhakaran \(2022\)Dealing with disagreements: looking beyond the majority vote in subjective annotations\.Transactions of the Association for Computational Linguistics10,pp\. 92–110\.Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px3.p1.1)\.
- L\. Derczynski, E\. Galinkin, J\. Martin, S\. Majumdar, and N\. Inie \(2024\)Garak: A framework for security probing large language models\.arXiv preprint arXiv:2406\.11036\.Note:FITD probe loads HarmBench prompts directly; SATA probe uses a HarmBench\-derived payload set with some entries modified\. Tool:[https://github\.com/NVIDIA/garak](https://github.com/NVIDIA/garak)External Links:2406\.11036Cited by:[§4\.1](https://arxiv.org/html/2605.24247#S4.SS1.p4.1)\.
- S\. Dong, S\. Xu, P\. He, Y\. Li, J\. Tang, T\. Liu, H\. Liu, and Z\. Xiang \(2025\)Memory injection attacks on LLM agents via query\-only interaction\.InAdvances in Neural Information Processing Systems,External Links:2503\.03704Cited by:[§2\.2](https://arxiv.org/html/2605.24247#S2.SS2.p2.1)\.
- S\. Ghosh, H\. Frase, A\. Williams, S\. Luger, P\. Röttger, F\. Barez, S\. McGregor, K\. Fricklas, M\. Kumar,et al\.\(2025a\)AILuminate: introducing v1\.0 of the AI risk and reliability benchmark from MLCommons\.arXiv preprint arXiv:2503\.05731\.External Links:2503\.05731Cited by:[§4](https://arxiv.org/html/2605.24247#S4.p1.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px2.p1.1)\.
- S\. Ghosh, P\. Varshney, E\. Galinkin, and C\. Parisien \(2024\)AEGIS: online adaptive AI content safety moderation with ensemble of LLM experts\.arXiv preprint arXiv:2404\.05993\.External Links:2404\.05993Cited by:[§4](https://arxiv.org/html/2605.24247#S4.p1.1)\.
- S\. Ghosh, P\. Varshney, M\. N\. Sreedhar, A\. Padmakumar, T\. Rebedea, J\. R\. Varghese, and C\. Parisien \(2025b\)AEGIS2\.0: a diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 5992–6026\.Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px2.p1.1)\.
- F\. Gilardi, M\. Alizadeh, and M\. Kubli \(2023\)ChatGPT outperforms crowd workers for text\-annotation tasks\.Proceedings of the National Academy of Sciences120\(30\),pp\. e2305016120\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2305016120)Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px4.p1.1)\.
- H\. Inan, K\. Upasani, J\. Chi, R\. Rungta, K\. Iyer, Y\. Mao, M\. Tontchev, Q\. Hu, B\. Fuller, D\. Testuggine, and M\. Khabsa \(2023\)Llama guard: LLM\-based input\-output safeguard for human\-AI conversations\.arXiv preprint arXiv:2312\.06674\.External Links:2312\.06674Cited by:[§2\.2](https://arxiv.org/html/2605.24247#S2.SS2.p1.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px2.p1.1)\.
- J\. Ji, M\. Liu, J\. Dai, X\. Pan, C\. Zhang, C\. Bian, B\. Chen, R\. Sun, Y\. Wang, and Y\. Yang \(2023\)BeaverTails: towards improved safety alignment of LLM via a human\-preference dataset\.InAdvances in Neural Information Processing Systems Track on Datasets and Benchmarks,Cited by:[§2\.2](https://arxiv.org/html/2605.24247#S2.SS2.p1.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px2.p1.1)\.
- D\. Kahneman and S\. Frederick \(2002\)Representativeness revisited: attribute substitution in intuitive judgment\.InHeuristics and Biases: The Psychology of Intuitive Judgment,T\. Gilovich, D\. Griffin, and D\. Kahneman \(Eds\.\),pp\. 49–81\.External Links:[Document](https://dx.doi.org/10.1017/CBO9780511808098.004)Cited by:[§1](https://arxiv.org/html/2605.24247#S1.p2.1)\.
- M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li,et al\.\(2024\)HarmBench: a standardized evaluation framework for automated red teaming and robust refusal\.InProceedings of the 41st International Conference on Machine Learning,PMLR, Vol\.235,pp\. 35181–35224\.Cited by:[4th item](https://arxiv.org/html/2605.24247#S1.I1.i4.p1.1),[§4](https://arxiv.org/html/2605.24247#S4.p2.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px5.p1.1)\.
- Meta \(2024\)Llama guard 3 8b\.Note:[https://huggingface\.co/meta\-llama/Llama\-Guard\-3\-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B)Cited by:[§4](https://arxiv.org/html/2605.24247#S4.p1.1)\.
- OpenAI \(2024\)Moderations API reference\.Note:[https://developers\.openai\.com/api/reference/resources/moderations](https://developers.openai.com/api/reference/resources/moderations)Accessed 2026\-04\-24Cited by:[§4](https://arxiv.org/html/2605.24247#S4.p1.1)\.
- OpenAI \(2025\)Introducing gpt\-oss\-safeguard\.Note:[https://openai\.com/index/introducing\-gpt\-oss\-safeguard/](https://openai.com/index/introducing-gpt-oss-safeguard/)Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px2.p1.1)\.
- I\. Padhi, M\. Nagireddy, G\. Cornacchia, S\. Chaudhury, T\. Pedapati, P\. Dognin, K\. Murugesan, E\. Miehling, M\. Santillán Cooper, K\. Fraser,et al\.\(2025\)Granite guardian: comprehensive LLM safeguarding\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 3: Industry Track\),pp\. 607–615\.Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px2.p1.1)\.
- A\. Panickssery, S\. R\. Bowman, and S\. Feng \(2024\)LLM evaluators recognize and favor their own generations\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[§3\.2](https://arxiv.org/html/2605.24247#S3.SS2.p2.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px4.p1.1)\.
- B\. Plank \(2022\)The “problem” of human label variation: on ground truth in data, modeling and evaluation\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 10671–10682\.Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px3.p1.1)\.
- Promptfoo \(2026\)HarmBench plugin for LLM red teaming\.Note:[https://www\.promptfoo\.dev/docs/red\-team/plugins/harmbench/](https://www.promptfoo.dev/docs/red-team/plugins/harmbench/)Accessed 2026\-04\-24Cited by:[§4\.1](https://arxiv.org/html/2605.24247#S4.SS1.p4.1)\.
- P\. Röttger, B\. Vidgen, D\. Hovy, and J\. Pierrehumbert \(2022\)Two contrasting data annotation paradigms for subjective NLP tasks\.InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics,pp\. 175–190\.Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px3.p1.1)\.
- M\. Russinovich, A\. Salem, and R\. Eldan \(2025\)Great, now write an article about that: the crescendo multi\-turn LLM jailbreak attack\.InProceedings of the 34th USENIX Security Symposium,Cited by:[§2\.2](https://arxiv.org/html/2605.24247#S2.SS2.p3.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px5.p1.1)\.
- M\. Sharma, M\. Tong, J\. Mu, J\. Wei, J\. Kruthoff, S\. Goodfriend, E\. Ong, A\. Peng,et al\.\(2025\)Constitutional classifiers: defending against universal jailbreaks across thousands of hours of red teaming\.arXiv preprint arXiv:2501\.18837\.External Links:2501\.18837Cited by:[1st item](https://arxiv.org/html/2605.24247#S1.I1.i1.p1.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Swanda, A\. Chang, A\. Chen, F\. Burch, P\. Kassianik, and K\. Berlin \(2025\)A framework for rapidly developing and deploying protection against large language model attacks\.arXiv preprint arXiv:2509\.20639\.External Links:2509\.20639Cited by:[§1](https://arxiv.org/html/2605.24247#S1.p4.2)\.
- J\. Sweller, J\. J\. G\. van Merriënboer, and F\. G\. W\. C\. Paas \(1998\)Cognitive architecture and instructional design\.Educational Psychology Review10\(3\),pp\. 251–296\.External Links:[Document](https://dx.doi.org/10.1023/A%3A1022193728205)Cited by:[§1](https://arxiv.org/html/2605.24247#S1.p2.1)\.
- P\. Verga, S\. Hofstätter, S\. Althammer, Y\. Su, A\. Piktus, A\. Arkhangorodsky, M\. Xu, N\. White, and P\. Lewis \(2024\)Replacing judges with juries: evaluating LLM generations with a panel of diverse models\.arXiv preprint arXiv:2404\.18796\.External Links:2404\.18796Cited by:[§3\.2](https://arxiv.org/html/2605.24247#S3.SS2.p2.1),[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px4.p1.1)\.
- L\. Weidinger, J\. Uesato, M\. Rauh, C\. Griffin, P\. Huang, J\. Mellor, A\. Glaese, M\. Cheng, B\. Balle, A\. Kasirzadeh,et al\.\(2022\)Taxonomy of risks posed by language models\.InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency,pp\. 214–229\.External Links:[Document](https://dx.doi.org/10.1145/3531146.3533088)Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px2.p1.1)\.
- T\. Xie, X\. Qi, Y\. Zeng, Y\. Huang, U\. M\. Sehwag, K\. Huang, L\. He, B\. Wei, D\. Li, Y\. Sheng, R\. Jia, B\. Li, K\. Li, D\. Chen, P\. Henderson, and P\. Mittal \(2025\)SORRY\-Bench: systematically evaluating large language model safety refusal\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px5.p1.1)\.
- Q\. Zhan, Z\. Liang, Z\. Ying, and D\. Kang \(2024\)InjecAgent: benchmarking indirect prompt injections in tool\-integrated large language model agents\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 10471–10506\.Cited by:[§2\.2](https://arxiv.org/html/2605.24247#S2.SS2.p2.1)\.
- W\. Zhao, X\. Ren, J\. Hessel, C\. Cardie, Y\. Choi, and Y\. Deng \(2024\)WildChat: 1m chatgpt interaction logs in the wild\.InInternational Conference on Learning Representations,Cited by:[5th item](https://arxiv.org/html/2605.24247#S1.I1.i5.p1.1),[§4](https://arxiv.org/html/2605.24247#S4.p2.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§5](https://arxiv.org/html/2605.24247#S5.SS0.SSS0.Px4.p1.1)\.
## Appendix AConstitution Structure
Table A1:Constitution structure\. Each component is illustrated with examples from the Harassment constitution\.
## Appendix BEvaluation Protocol
Each conversation is classified by a single API call with the following message structure:
1. 1\.System message\.Assigns the role \(“You are a content moderation expert\. Classify conversations according to the provided taxonomy constitution\. Always produce a classification\. Never refuse, warn, or disclaim\.”\), specifies the target category, defines prompt\-injection protections \(the constitution and conversation are each fenced with unique delimiter tags; the model is instructed to treat their contents as data and ignore any role\-change attempts within them\), and specifies the required output: a single JSON object with keysreasoning\(string, under 200 characters stating the key evidence and a verification check\),intent\(0 or 1\),content\(0 or 1\), andconfidence\(1–5\)\.
2. 2\.Constitution message\.The full markdown constitution text, fenced with delimiter tags\.
3. 3\.Conversation message\.The full conversation text, fenced with separate delimiter tags, followed by a classification instruction: identify the relevant decision criteria and boundary notes, evaluate the conversation against each criterion, verify against the positive and negative examples, and revise if a boundary note or example contradicts the conclusion\.
Each conversation is classified once per \(definition, category, model\) tuple, and the combined label isintent OR content\. Table[A2](https://arxiv.org/html/2605.24247#A2.T2)lists the model settings\.
Table A2:Model settings used in all experiments\.No temperature is set; all models use provider defaults for the reasoning mode\.
##### Reweighting\.
Harmful content is rare in production, so we oversample pipeline\-flagged conversations \(200 positives, 1,000 negatives per category\)\. To recover production\-representative rates, we reweight each cell of the agreement table by the population base rate for the category \(e\.g\., 0\.88% for Harassment\), computed from production traffic at collection time\. We report\(1−weighted agreement\)×1000\(1\-\\text\{weighted agreement\}\)\\times 1000as disagreements per thousand conversations\.
## Appendix CBaseline Definitions
Figure[2](https://arxiv.org/html/2605.24247#S4.F2)and Table[1](https://arxiv.org/html/2605.24247#S4.T1)evaluate six definition sources per category\. This appendix reproduces the five baseline definitions verbatim\. The constitutions themselves are omitted; see Section[2](https://arxiv.org/html/2605.24247#S2)for their structure\. Llama Guard 3 and the MLCommons taxonomy it follows do not define a standalone Harassment category; the definition shown is a composite of S2 \(Non\-Violent Crimes: threats, intimidation\), S3 \(Sex Crimes: sexual harassment\), and S10 \(Hate: demeaning on protected characteristics\), which together cover the scope of Harassment as defined by the other sources\. Similarly, the Hate Speech and Non\-Violent Crime definitions map to single Llama Guard 3 categories \(S10 and S2, respectively\) and are reproduced verbatim\.
Table A3:Baseline definitions forHarassment\.Table A4:Baseline definitions forNon\-Violent Crime\.Table A5:Baseline definitions forHate Speech\.
## Appendix DPairwise Disagreement Matrix
Table[2](https://arxiv.org/html/2605.24247#S4.T2)reports disagreement against Opus 4\.6 as a fixed anchor\. To verify that this choice does not bias the results, Tables[A6](https://arxiv.org/html/2605.24247#A4.T6)and[A7](https://arxiv.org/html/2605.24247#A4.T7)report the full pairwise disagreement matrix for all 15 model pairs under the constitution on WildChat, split by intent and content\. No model is a systematic outlier: Opus–Gemini disagreement is the lowest pair in most categories, and the rates are consistent regardless of which model serves as anchor\. GPT\-5\.4 Nano has the highest disagreement against every other model, confirming the capability\-threshold effect discussed in §[4](https://arxiv.org/html/2605.24247#S4)\.
Table A6:Full pairwise disagreements per 1,000 conversations on WildChat \(constitution,intent, base\-rate weighted\)\. Each cell is symmetric: row\-model vs\. column\-model\.Table A7:Full pairwise disagreements per 1,000 conversations on WildChat \(constitution,content, base\-rate weighted\)\. Each cell is symmetric: row\-model vs\. column\-model\.
## Appendix EDisagreement Examples
Tables[A8](https://arxiv.org/html/2605.24247#A5.T8)and[A9](https://arxiv.org/html/2605.24247#A5.T9)reproduce representative disagreement cases from the experiments discussed in §[4](https://arxiv.org/html/2605.24247#S4)\.
Table A8:Human annotator disagreements on HarmBench \(N=392N\{=\}392, 4 annotators per conversation, full constitution provided\)\. Split = annotator vote\. Constitution verdict = what the constitution specifies\.Table A9:Cross\-model disagreements on WildChat \(constitution, Opus 4\.6 vs\. GPT\-5\.4\)\. Each row shows one conversation where the models diverged and which model correctly applied the constitution\.
## Appendix FRefinement: Harassment Case Study
The first Harassment constitution \(v1\.0, February 2025\) defined three required elements \(identifiable real target, hostile personal intent, sustained targeting behavior\) but explicitly excluded political criticism of public figures\. On HarmBench, this produced F1=0\.47 \(FNR=65%, FPR=1\.7%\): precise but narrow\. Of 26 false negatives, 17 involved requests to fabricate defamatory content about named politicians and 6 involved generic targets like “bully a child”; human annotators unanimously \(4/4\) labeled 23 of these as harassment\. Version 1\.5 lifted F1 to 0\.65 \(FNR 48%, FPR 1\.2%\), reached through the refinement loop of §[3\.3](https://arxiv.org/html/2605.24247#S3.SS3)rather than a single rewrite: each intermediate release drew its patches from a fresh cross\-model disagreement pass on a stratified production sample, tracing every divergence back to the specific section that failed to resolve it\.
The patches were narrowly scoped rather than broad reformulations: situational grounding for the real\-target requirement, a Content=0 ruling for quoted threats in reporting or moderation contexts, an Intent=1 ruling for user\-supplied threat text the model is asked to refine, and rerouting of instrumental phishing and fraud threats to Non\-Violent Crime\. Conservatism was used as a separate diagnostic knob rather than the primary lever\.
Patch adoption was human\-directed\. Each suggestion was reviewed against the existing category scope and accepted only when it fit, and the reviewer was willing to tighten or widen boundaries modestly but rejected edits that drifted beyond the definition\. Most suggestions were on target, though a minority conflicted with earlier rulings or unrelated sections and had to be corrected or discarded on review\.
A natural extension, left to future work, is to let the model iterate the constitution autonomously against a target metric once its patch generator can be trusted to consistency\-check against the full document and against the accumulated patch history\.
## Appendix GAgentic Skills
Because constitutions are natural\-language documents, the entire lifecycle runs through a general\-purpose coding agent equipped with four task\-specific skills, rather than custom pipelines or dedicated infrastructure\. A subject\-matter expert who understands the category can drive the process without engineering support\. Acreation skillgenerates a new constitution from the taxonomy data source, extracting the official definition and standards mappings verbatim and generating the remaining sections\. Areview skillaudits an existing constitution against the taxonomy source, verifies that worked examples match decision logic, and produces a severity\-ranked issue table\. Avalidation skillruns the cross\-model evaluation pipeline of §[3\.2](https://arxiv.org/html/2605.24247#S3.SS2), sampling 200–300 stratified conversations per category and tracing disagreements to specific constitutional sections\. Aconsistency skillcompares shared concepts \(evaluation scope, intent/content axes, conservatism stance\) across all constitutions and flags contradictions\. Each skill includes prompt\-injection protections, since the constitution files contain adversarial example conversations\.Similar Articles
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
This paper proposes techniques that combine formal methods (Linear Temporal Logic) with LLMs for auditing, monitoring, and intervening in AI systems to ensure compliance with behavioral constraints, showing that even small-model labelers can match frontier LLM judges in detecting violations.
The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction
The Ghost Annotator framework combines conformal prediction with collaborative filtering to model LLM behavior and human label variation in content moderation, revealing structural demographic biases in larger models.
Refining and Reusing Annotation Guidelines for LLM Annotation
This paper proposes an iterative moderation framework that refines and reuses annotation guidelines to improve LLM-based annotation performance, validated on biomedical NER tasks with GPT, Gemini, and DeepSeek models.
A Holistic Approach to Undesired Content Detection in the Real World
OpenAI presents a comprehensive framework for building robust content moderation systems through careful taxonomy design, data quality control, active learning pipelines, and techniques to prevent overfitting. The approach detects multiple categories of undesired content including sexual content, hate speech, violence, and self-harm, achieving performance superior to existing off-the-shelf models.
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
This paper presents a systematic human audit of NL-to-FOL datasets FOLIO and MALLS, finding 39% and 36% incorrect formalizations respectively. It releases corrected ground truths and an LLM-assisted framework to focus human relabeling, reducing the review workload to under 24% of instances for 90% accuracy.