Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning
Summary
This paper introduces a relevance-sensitive evaluation suite for legal AI, demonstrating that LLMs are overly sensitive to legally irrelevant perturbations, and proposes LexGuard, an adversarial multi-agent framework using formal reasoning to improve legal reasoning reliability.
View Cached Full Text
Cached at: 05/27/26, 09:05 AM
# Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning
Source: [https://arxiv.org/html/2605.26530](https://arxiv.org/html/2605.26530)
Chen Linze1, Cai Yufan1, Hou Zhe2and Dong Jin Song1 1National University of Singapore 2Griffith University e1344652@u\.nus\.edu, cai\_yufan@u\.nus\.edu,z\.hou@griffith\.edu\.au, dcsdjs@nus\.edu\.sg
###### Abstract
Legal reasoning requires distinguishing changes that matter from those that do not\. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points\. We formulate this requirement as a legal\-relevance\-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change\. We introduce a unified evaluation suite covering*should\-change*and*should\-not\-change*evaluation across judicial fairness, robustness, and statute\-confusion scenarios\. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules\. To mitigate these failures, we presentLexGuard, an adversarial multi\-agent framework grounded in formal reasoning\.LexGuardformalizes statutes into executable constraints, uses adversarial agents to extract competing fact–statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency\. Experiments show thatLexGuardimproves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations\. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes\.
## 1Introduction
Large language models \(LLMs\) are increasingly used for legal tasks such as legal question answering, opinion summarization, pleading drafting, and bar\-exam\-style reasoningKatzet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib20)\); Hendryckset al\.\([2021](https://arxiv.org/html/2605.26530#bib.bib19)\)\. However, legal reasoning is a high\-stakes setting where accuracy alone is insufficient\. In rule\-of\-law systems, judicial outcomes are expected to satisfy*formal rationality*: conclusions should be justified by explicit, general, and logically coherent rulesLinna and Linna \([2025](https://arxiv.org/html/2605.26530#bib.bib12)\); Sadowski and Chudziak \([2025b](https://arxiv.org/html/2605.26530#bib.bib13)\), and traceable to governing statutes, interpretations, and precedentsKesariet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib14)\)\. Current LLM\-based legal systems, including domain\-adapted models such asChatLawCuiet al\.\([2023](https://arxiv.org/html/2605.26530#bib.bib23)\),LawLLMShuet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib28)\), andLexiHoet al\.\([2025](https://arxiv.org/html/2605.26530#bib.bib5)\), still remain vulnerable to hallucinated authoritiesHouet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib29)\), confusion among similar statutesSavelka \([2023](https://arxiv.org/html/2605.26530#bib.bib21)\), and sensitivity to irrelevant attributes\.
A central requirement of legal reasoning is deciding*which changes matter*\. Changes to statutory elements, mental state, harm severity, or applicable exceptions may properly alter the outcome\. In contrast, decisions should remain invariant to legally irrelevant perturbations, such as demographic attributes, procedural background, stylistic reformulation, irrelevant expert opinions, adversarial framing, or misleading references to inapplicable statutes[Yiranet al\.](https://arxiv.org/html/2605.26530#bib.bib3)\. This ability to distinguish legally relevant from legally irrelevant changes is fundamental to rule\-of\-law values: similar cases should be treated alike, different cases should be distinguished for legally grounded reasons, and legal conclusions should be traceable to explicit and coherent legal rules\.
Existing evidence suggests that current LLMs often fail precisely at this distinctionHuet al\.\([2025b](https://arxiv.org/html/2605.26530#bib.bib2)\)\. Recent judicial fairness evaluation shows that LLMs acting as judges exhibit pervasive inconsistency, bias, and imbalanced inaccuracy across demographic, substantive, and procedural factors\. LLMs can also be misled by perturbations to the major premise, minor premise, and conclusion\-generation stages of legal syllogistic reasoning, even when the legally operative facts remain unchanged[Yiranet al\.](https://arxiv.org/html/2605.26530#bib.bib3)\.
However, existing evaluations usually study these phenomena in isolation\. We formulate fair and robust legal reasoning as a*legal\-relevance\-sensitive evaluation*problem\. Instead of evaluating only whether LLMs remain unchanged, we ask whether LLMs’ reasoning changes under*should\-change*perturbations and remains stable under*should\-not\-change*perturbations\. We also unify judicial fairness, robustness, adversarial framing, and statute\-confusion evaluation\. Based on this view, we introduce a new evaluation suite covering four perturbation families: \(1\) fairness perturbations over legally relevant or irrelevant attributes; \(2\) robustness perturbations that preserve or change legal meaning; \(3\) adversarial framing perturbations that attempt to manipulate the conclusion; and \(4\) statute\-confusion perturbations involving similar legal rules and elements\.
Our evaluations show that current LLMs are often unstable under the defined perturbations and attacks\. To mitigate these failures, we proposeLexGuard, a solver\-grounded adversarial multi\-agent framework for legal reasoning\.LexGuardfirst autoformalizes statutory provisions into executable constraints capturing legal elements\. It then uses prosecutor\- and defense\-aligned adversarial agents to extract structured fact–statute argument tuples independently from case narratives\. Finally, an SMT solver checks whether the competing arguments satisfy the formalized legal constraints and whether their conclusions are logically consistent\. This design turns relevance\-sensitive legal reasoning into a auditable process: legally material changes should modify the satisfiability of the corresponding legal constraints, whereas irrelevant changes should not\.
We evaluateLexGuardon public legal datasetsLiet al\.\([2023](https://arxiv.org/html/2605.26530#bib.bib7)\); Xueet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib8)\)and our relevance\-sensitive evaluation suite\. Experiments demonstrate improvements along four dimensions: standard legal reasoning performance, including statute selection, verdict prediction, and sentencing quality; fairness under legally relevant and irrelevant perturbations; robustness under benign reformulations and adversarial framing attacks; and resistance to statute\-confusion errors\. Beyond performance gains,LexGuardproduces solver\-checked symbolic justifications, enabling legal conclusions to be audited against explicit statutory constraints\.
Our contributions are summarized as follows:
- •We formulate trustworthy legal reasoning as a*legal\-relevance\-sensitive evaluation*problem and instantiate it with a unified suite of*should\-change*and*should\-not\-change*perturbations spanning fairness, robustness, adversarial attacks, and statute\-confusion scenarios\.
- •We proposeLexGuard, a solver\-grounded adversarial multi\-agent framework that formalizes statutes into executable constraints, extracts competing legal arguments through adversarial agents, and check the judgment with an SMT solver\.
- •We empirically show that existing LLM legal reasoners are sensitive to legal element changes and vulnerable to statute\-confusion and attacks, whileLexGuardimproves verdict accuracy, statute selection, sentencing quality, fairness, robustness, and trustworthiness\.
## 2Legal\-Relevance\-Sensitive Evaluation
Table 1:Overview of the proposed legal\-relevance\-sensitive evaluation framework\. Each evaluation axis corresponds to a class of perturbations targeting a specific component of legal reasoning\.A trustworthy legal model should remain stable under changes that are irrelevant, yet update its decision when a perturbation changes statutory elements, exceptions, or legal consequences\. We propose a*legal\-relevance\-sensitive evaluation*framework that unifies counterfactual fairness evaluation and reasoning\-chain robustness evaluation\. Table[1](https://arxiv.org/html/2605.26530#S2.T1)summarizes the evaluation design\. The upper block contains label\-preserving perturbations, where the model should remain invariant\. The lower block contains label\-changing perturbations, where the model should update its prediction\.
### 2\.1Evaluation Principle
Formally, letxxdenote an original case,ffa legal reasoning model, andτ\\taua perturbation operator that produces a modified caseτ\(x\)\\tau\(x\)\. The model produces:y=f\(x\),y′=f\(τ\(x\)\)\.y=f\(x\),y^\{\\prime\}=f\(\\tau\(x\)\)\.The central question is not simply whetheryyandy′y^\{\\prime\}are identical, but whether the change fromyytoy′y^\{\\prime\}is legally justified\. We classify perturbations into two categories\. A*label\-preserving perturbation*changes only legally irrelevant information while preserving the material facts and applicable law\. For these perturbations, a legally grounded model should satisfy:f\(x\)=f\(τ\(x\)\)\.f\(x\)=f\(\\tau\(x\)\)\.A*label\-changing perturbation*modifies at least one legally material condition\. For these perturbations, the model should update its prediction according to the new legal label:f\(τ\(x\)\)=yτ,f\(\\tau\(x\)\)=y\_\{\\tau\},whereyτy\_\{\\tau\}denotes the new label after the legally material modification\. For label\-preserving perturbations, we measure whether the model remains invariant:Inv\(f\)=𝔼\(x,τ\)\[𝕀\{f\(x\)=f\(τ\(x\)\)\}\]\.\\mathrm\{Inv\}\(f\)=\\mathbb\{E\}\_\{\(x,\\tau\)\}\\left\[\\mathbb\{I\}\\\{f\(x\)=f\(\\tau\(x\)\)\\\}\\right\]\.Lower invariance indicates that the model is sensitive to legally irrelevant changes\. For label\-changing perturbations, we measure change alignment:Align\(f\)=𝔼\(x,τ\)\[𝕀\{f\(τ\(x\)\)≠f\(x\)\}\]\.\\mathrm\{Align\}\(f\)=\\mathbb\{E\}\_\{\(x,\\tau\)\}\\left\[\\mathbb\{I\}\\\{f\(\\tau\(x\)\)\\neq f\(x\)\\\}\\right\]\.More details are in the Appendix\([E\.1](https://arxiv.org/html/2605.26530#A5.SS1)\)\.
### 2\.2Perturbation Taxonomy
As summarized in Table[1](https://arxiv.org/html/2605.26530#S2.T1), our taxonomy covers extra\-legal factors, surface factual expression, legal\-rule selection, fact extraction, conclusion generation, statutory elements, legal applicability conditions, and boundaries between related statutes\.
##### Label\-preserving perturbations\.
These perturbations preserve the legally material facts and applicable law\.*Judicial fairness*targets extra\-legal factors such as defendant demographics, victim attributes, defender attributes, court level, trial publicity, and procedural background\.*Benign robustness*targets the surface form of facts through paraphrases, synonym substitutions, reordered descriptions, stylistic rewriting, and irrelevant narration\.*Major\-premise robustness*targets the governing legal rule by injecting similar but inapplicable statutes, misleading references, fabricated authorities, wrong charge names, or irrelevant retrieved provisions\.*Minor\-premise robustness*targets fact extraction through non\-dispositive factual edits, legally equivalent element descriptions, irrelevant factual additions, and confusing but immaterial details\.*Conclusion\-level robustness*targets final decision generation by adding irrelevant expert opinions, prior unrelated behavior, emotional framing, role hijacking, verdict\-forcing instructions, or format\-mimicking prompt injections\.
##### Label\-changing perturbations\.
These perturbations alter legally material conditions and therefore require the model to update its prediction\.*Statutory\-element sensitivity*changes constitutive elements such as conduct, object, subject identity, role in offense, amount threshold, harm severity, or causation\.*Mental\-state sensitivity*modifies subjective culpability, such as intent, negligence, knowledge, purpose, or awareness of illegality\.*Exception and condition sensitivity*changes legal applicability conditions, including self\-defense, attempt, accomplice status, surrender, recidivism, mitigation, or aggravation\.*Statute\-confusion sensitivity*tests boundaries between related statutes with overlapping surface facts but different applicability conditions\.
### 2\.3Evaluation Protocol
For each original case, we construct paired perturbation cases\. Each perturbation is annotated with two types of metadata: \(i\) whether it is label\-preserving or label\-changing, and \(ii\) which reasoning component it targets\. For label\-preserving perturbations, the gold label remains unchanged\. For label\-changing perturbations, the gold label should reflect the legally modified outcome\. We then evaluate each model on both original and perturbed cases\. The outputs are compared at multiple levels, including final verdict, applicable statute set, general\-provision selection, specific\-provision selection, and sentencing result when available\. This protocol enables us to diagnose four distinct failure modes: unfairness under extra\-legal counterfactuals, instability under benign reformulations, susceptibility to adversarial legal framing, and inability to distinguish legally material from immaterial statutory changes\.
## 3Solver\-grounded Reasoning
Figure 1:Overview ofLexGuard\. \(Top\)*Law Formalization*: statutes and judicial interpretations are translated into SMT\-checkable legal constraints\. \(Left\)*Adversarial Agents*: prosecutor and defense agents independently extract facts and candidate statutes from the same case narrative\. \(Bottom\)*Solver\-grounded Legal Reasoning*: encode the extracted facts and candidate statutes into a unified constraint set\. The SMT solver checks statutory applicability, detects inconsistencies, and the judge returns a formally grounded judgment\.As shown in[Figure 1](https://arxiv.org/html/2605.26530#S3.F1),LexGuardcombines LLM\-based legal interpretation with the formal reasoning\. Given a casexx, the system outputs a judgmenty=⟨statute,clause,penalty,explanation⟩,y=\\langle\\text\{statute\},\\text\{clause\},\\text\{penalty\},\\text\{explanation\}\\rangle,where the explanation consists of checked legal conditions and supporting case facts\. The key idea is to separate*legal proposal*from*legal check*: LLM agents identify potentially relevant facts and statutes, while the SMT solver determines whether those statutes can actually be applied under a formal legal knowledge base\. The detailed formalization is in Appendix[B](https://arxiv.org/html/2605.26530#A2)\.
### 3\.1Law Formalization
We first construct a formal legal knowledge base𝒦\\mathcal\{K\}from statutes and judicial interpretations with the aid of LLMs\. Each legal rule is represented as a conditional constraint\. The conditions specify when a statute or clause applies, and the legal effect specifies the resulting offense classification, liability, or penalty range\. Concretely, each statute article has an article\-level guard, and each clause has a clause\-level guard\. The article guard checks whether the case falls within the general scope of a statute\. The clause guard checks fine\-grained legal requirements such as the actor, conduct, intent, harm, causation, protected legal interest, aggravating factors, mitigating factors, or statutory exceptions\. A simplified rule is:ArticleGuardi∧ClauseGuardi,j⇒Penaltyi,j\.\\textsc\{ArticleGuard\}\_\{i\}\\land\\textsc\{ClauseGuard\}\_\{i,j\}\\Rightarrow\\textsc\{Penalty\}\_\{i,j\}\.The formal knowledge base is automatically generated from natural\-language legal materials and then validated before use\. We apply three validation steps\. First, syntactic checking ensures that all generated rules are well\-formed\. Second, semantic checking detects contradictory, vacuous, or overly broad rules\. Third, case\-level testing checks whether known cases activate the expected statute and clause guards\. Only validated constraints are included in𝒦\\mathcal\{K\}\.
### 3\.2Adversarial Agents
Given the case narrativexx,LexGuarduses two role\-differentiated LLM agents: a prosecutor agent and a defense agent\. Both agents read the same case, but they extract legal information from different argumentative perspectives\. The prosecutor agent focuses on facts supporting liability, while the defense agent focuses on missing elements, exceptions, and alternative interpretations\.
Each agent outputs a structured argument⟨ℱ,ℒ⟩\.\\langle\\mathcal\{F\},\\mathcal\{L\}\\rangle\.The fact setℱ\\mathcal\{F\}contains suspect\-centric facts extracted from the case narrative\. Each fact is typed by a legally meaningful element, including actor, conduct, mental state, protected interest, object, amount, harm, consequence, causation, or exception, and is linked to a supporting text span\. This grounding mitigates the risk of unsupported or hallucinated facts\. The candidate law setℒ\\mathcal\{L\}contains plausible statutes and clauses proposed by the agents\. These proposals are not treated as final legal conclusions\. Instead, we organize statutes into similarity clusters and prompt adversarial agents to debate confusing provisions within the same cluster\. The SMT solver in the next step determines which proposed rules are actually applicable by checking them against the extracted facts and formalized legal constraints\.
### 3\.3Solver\-Centered Adjudication
Given the fact setℱ\\mathcal\{F\}, candidate statutesℒ\\mathcal\{L\}, and the formal legal knowledge base𝒦\\mathcal\{K\},LexGuarddoes not encode all information at once\. Instead, it first searches𝒦\\mathcal\{K\}for the formal constraints corresponding to the proposed statutes:𝒞ℒ=Search\(ℒ,𝒦\)\.\\mathcal\{C\}\_\{\\mathcal\{L\}\}=\\textsc\{Search\}\(\\mathcal\{L\},\\mathcal\{K\}\)\.These constraints specify the statutory elements, applicability conditions, exceptions, and penalty rules of each candidate article and clause\.
Next,LexGuardrefines the extracted facts with respect to the retrieved constraints:𝒞ℱ=Refine\(ℱ,𝒞ℒ\)\.\\mathcal\{C\}\_\{\\mathcal\{F\}\}=\\textsc\{Refine\}\(\\mathcal\{F\},\\mathcal\{C\}\_\{\\mathcal\{L\}\}\)\.Each suspect\-centric fact is matched to a required legal element, such as actor, conduct, mental state, protected interest, object, amount, harm, causation, consequence, or exception\. The refined facts and statute constraints are then encoded into an SMT formula:Φ=Encode\(𝒞ℱ,𝒞ℒ\)\.\\Phi=\\textsc\{Encode\}\(\\mathcal\{C\}\_\{\\mathcal\{F\}\},\\mathcal\{C\}\_\{\\mathcal\{L\}\}\)\.The SMT solver Z3De Moura and Bjørner \([2008](https://arxiv.org/html/2605.26530#bib.bib1)\)checks the satisfiability of article guards, clause guards, and exception guards\. Unsatisfied articles or clauses are discarded, while satisfied guards determine the applicable offense classification and penalty range\. When multiple provisions are satisfiable, priority rules in𝒦\\mathcal\{K\}, including statutory hierarchy, specificity, exception handling, and penalty ordering, select the final applicable rule\. Irrelevant facts are filtered out from formal reasoning\. Therefore, the final judgment is produced by solver\-verified legal applicability rather than direct LLM generation\. If prosecutor and defense agents produce conflicting facts or incompatible statute proposals, the solver returns an unsat core that identifies the conflicting facts, missing elements, or incompatible guards\. This conflict feedback triggers fact re\-grounding or constraint repair\. The reasoning loop continues untilΦ\\Phibecomes satisfiable or all candidate statutes are rejected\.
## 4Experiments
We design the following five research questions\. RQ1–RQ2 measure standard prediction performance and ablation effects, while RQ3–RQ5 test whether predictions remain legally grounded under our legal\-relevance\-sensitive evaluation\.
##### Research Questions\.
- •RQ1: How accurately doesLexGuardpredict applicable statutes and sentencing outcomes compared with legal and general LLM baselines?
- •RQ2: Which components ofLexGuardcontribute most to statute prediction?
- •RQ3: CanLexGuardmake legally appropriate prediction changes under should\-change perturbations?
- •RQ4: CanLexGuardremain stable under should\-not\-change perturbations?
- •RQ5: What types of errors doesLexGuardmake, especially in confusing statute cases?
##### Datasets\.
We use three datasets\. LeCaRDv2Liet al\.\([2023](https://arxiv.org/html/2605.26530#bib.bib7)\)contains 55,192 cases, with an average case\-fact length of 889\.17 Chinese characters and 4\.53 applicable statutes per case, and supports case\-level statute and sentence evaluation\. LEECXueet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib8)\)provides suspect\-level annotations for multi\-defendant cases; we use 9,470 suspect\-level instances, with an average fact length of 654\.49 Chinese characters and 5\.25 applicable statutes per instance\. We further construct an 8,000\-case controlled perturbation set from criminal\-law fact patterns, with an average fact length of 134\.89 Chinese characters and 1\.65 statutes per case\. The artifact is available on the anonymous websiteGitHub \([2026](https://arxiv.org/html/2605.26530#bib.bib11)\)\.
##### Implementation Details\.
We use the z3De Moura and Bjørner \([2008](https://arxiv.org/html/2605.26530#bib.bib1)\)as the formal reasoner\. RQ3–RQ5 instantiate the legal\-relevance\-sensitive axes in Table[1](https://arxiv.org/html/2605.26530#S2.T1)through factual perturbations, prompt\-injection families, and confusing\-statute clusters, with all model outputs normalized to statute identifiers before scoring\. Our legal\-relevance\-sensitive evaluation also follows the work J&HHuet al\.\([2025a](https://arxiv.org/html/2605.26530#bib.bib74)\)\. The baseline LLM\-J&H\-CoT keeps the same output schema but uses a reasoning\-augmented prompt, and LLM\-J&H\-Few\-shot uses example\-calibrated demonstrations under the same output schema\. Unless otherwise specified,LexGuarduses GPT\-5\.2 as the base LLM for agent generation and final judgment, and GPT\-5\.2 is also used as the direct model baseline\. Details appear in Appendix[D](https://arxiv.org/html/2605.26530#A4)and Appendix[E\.1](https://arxiv.org/html/2605.26530#A5.SS1)\.
### 4\.1RQ1: Accuracy on Legal Prediction
Table[2](https://arxiv.org/html/2605.26530#S4.T2)reports provision prediction results on LeCaRDv2 and LEEC\. Across both datasets and provision granularities,LexGuardconsistently achieves the best F1 score\. The gains mainly come from higher precision, indicating that solver\-grounded verification helps suppress irrelevant statute predictions rather than merely increasing recall\. This pattern is particularly clear on LeCaRDv2, where general\-purpose LLMs often over\-predict statutes and obtain moderate recall but low precision\. It indicates that many baseline models fail not because they cannot retrieve any relevant statute, but because they cannot reliably reject similar yet inapplicable provisions\. On LEEC, the gap becomes larger under suspect\-level evaluation, suggesting that provision prediction becomes substantially harder when the model must decompose case facts by individual defendants\.
Table[3](https://arxiv.org/html/2605.26530#S4.T3)further evaluates downstream sentencing and suspect\-level extraction\. Providing golden statutes generally reduces sentencing error, confirming that statute misidentification is a major source of downstream punishment error\.LexGuardstill achieves the lowest RMSE in both the w/o\-golden and w/\-golden settings on LeCaRDv2 and LEEC, while maintaining competitive or best legal validity\. Overall, these results suggest thatLexGuardprovides a more reliable reasoning pipeline: it selects fewer irrelevant provisions, produces legally valid outputs, and transfers these gains to downstream sentencing\.
Table 2:Provision Prediction Performance on LeCaRDv2 and LEEC \(%\)\.Table 3:Sentencing error, legal validity, and suspect\-level performance with or without golden statutes\. RMSE denotes sentencing root mean square error in months\. Valid denotes the proportion of solver\-verified legally valid predictions\. SusF1 denotes suspect\-level F1 score on LEEC\.
### 4\.2RQ2: Ablation Study
Table[4](https://arxiv.org/html/2605.26530#S4.T4)reports the ablation study on LeCaRDv2\. The fullLexGuardachieves the best F1 for both general and specific provisions, indicating that the three components are complementary\. Removing the Z3 reasoner leads to the largest degradation in statute prediction\. This shows that symbolic consistency checking is crucial for filtering legally invalid or internally inconsistent statute candidates\. Removing the debating module also substantially reduces both G\-F1 and S\-F1, confirming that fine\-grained competition among confusable provisions is necessary for accurate statute selection\. Removing the attorney module mainly weakens adversarial coverage: recall drops for both general and specific provisions, suggesting that the prosecutor–defense decomposition helps expose missing candidate statutes\.
Table 4:Ablation study results with different components removed\. G\-P, G\-R, and G\-F1 denote precision, recall, and F1 for general\-provision prediction; S\-P, S\-R, and S\-F1 denote precision, recall, and F1 for specific\-provision prediction\. All methods are based on GPT\-5\.2\.
### 4\.3RQ3: Should\-Change to Perturbations
RQ3 evaluates whether models can make legally appropriate prediction changes when the case facts are modified in ways that should alter statutory applicability\. Table[5](https://arxiv.org/html/2605.26530#S4.T5)reports performance under should\-change factual perturbations\. The results show thatLexGuardhandles should\-change perturbations more reliably than all baselines\. This indicates thatLexGuardis better at recognizing when a factual modification changes the applicable statutory conditions and at updating its statute prediction in the legally expected direction\. The improvement is especially important because should\-change perturbations test legal sensitivity rather than mere stability\.
Table 5:RQ3 and RQ4 results\. RQ3: Overall denotes counterfactual statute\-set score, Align\. denotes change\-alignment score, Sta\. denotes statute correctness score, and Bias denotes perturbation\-factor bias magnitude\. RQ4: ASR denotes attack success rate, CRR denotes clean\-correct retention rate, Inv\. denotes prediction invariance, and F1 denotes attack\-aware prediction F1\. All methods are based on GPT\-5\.2\.
### 4\.4RQ4: Should\-Not\-Change Perturbations
RQ4 evaluates whether models can maintain legally correct predictions under adversarial perturbations that should not change the applicable statute\. As shown in Table[5](https://arxiv.org/html/2605.26530#S4.T5),LexGuardachieves the strong robustness across all metrics\. It is less likely to be misled by injected legal distractions, better preserves originally correct predictions, and produces more stable statute sets under attack\. The improvement in attack\-aware F1 further shows thatLexGuarddoes not merely keep predictions unchanged, but maintains legally accurate decisions when facing should\-not\-change perturbations\.
The J&H prompting baselines provide only limited protection\. CoT and few\-shot prompting improve robustness slightly over the vanilla LLM, but their clean\-correct retention and invariance remain weak, suggesting that prompt\-level reasoning guidance cannot reliably prevent the model from following misleading but legally irrelevant cues\. In contrast,LexGuardverifies extracted facts and candidate statutes against formalized legal constraints, helping distinguish legally operative conditions from adversarial noise\. These results indicate that robustness to should\-not\-change perturbations requires more than prompt engineering; it benefits from solver\-grounded verification that enforces consistency with the governing legal rule\.
### 4\.5RQ5: Distinguishing Confusing Statutes
RQ5 analyzes model errors in legally confusing statute clusters\. Table[6](https://arxiv.org/html/2605.26530#S4.T6)reports both cluster\-level exactness and error types\. The results show thatLexGuardis substantially better at fully recovering applicable statutes when a confusing cluster is relevant, achieving 88\.71% positive exactness compared with 58\.57% for the base LLM\. It also achieves the highest macro cluster exactness, indicating more balanced performance across different similar\-statute clusters\. By contrast, few\-shot prompting reduces omission rate, but increases wrong similar\-statute selection, suggesting that demonstrations make the model more willing to activate cluster statutes without reliably resolving statutory boundaries\.LexGuardreduces both error types, lowering omission and wrong selection\. This suggests that reverse verification helps the system not only select the applicable member of a confusing cluster, but also reject legally similar yet inapplicable alternatives\. The false\-activation diagnostic and cluster\-level breakdowns are provided in Appendix[F](https://arxiv.org/html/2605.26530#A6)\.
Table 6:RQ5 error analysis\. Pos\. denotes exact selection when the gold confusing\-statute cluster is non\-empty; Macro denotes macro\-averaged cluster exactness across statute clusters; Omit denotes gold\-statute omission; Wrong denotes wrong similar\-statute selection\. All methods are based on GPT\-5\.2\.
### 4\.6Limitations
Despite its promising results, our framework still faces three main challenges\. First, its formalization quality is constrained by the accuracy of LLM outputs, so errors in identifying actors or conditions can propagate throughout the reasoning pipeline\. Second, our current formalization system is limited to statutory rules, leaving the extension to case law, open\-textured norms, and evolving jurisprudence as future work\. Third, the system assumes deterministic rule parsing, which prevents it from fully capturing legal provisions that deliberately incorporate normative ambiguity\. In addition, the framework introduces moderate computational overhead due to its multi\-stage LLM\-based reasoning process\. Detailed costs are shown in the appendix \([Table 7](https://arxiv.org/html/2605.26530#A1.T7)\)\.
## 5Related Work
##### Domain\-specific Legal LLMs\.
General\-purpose LLMs often mishandle legal terminology and citation style\.ChatLawcouples a knowledge\-graph\-enhanced mixture\-of\-experts backbone with a multi\-agent pipeline that mirrors law\-firm SOPs, outperforming GPT\-4 on LawBench and national bar examsCuiet al\.\([2023](https://arxiv.org/html/2605.26530#bib.bib23)\)\.Lawyer GPTshows that lightweight domain pre\-training plus retrieval boosts statute\-matching and consultation accuracy while remaining compute\-efficientYaoet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib41)\)\. Gao et al\.Gaoet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib48)\)further show that careful construction of high\-quality synthetic query–candidate pairs can markedly improve legal\-case retrieval\.Agents on the Benchsimulates a collegial bench to improve judgment quality through deliberative votingJiang and Yang \([2024](https://arxiv.org/html/2605.26530#bib.bib40)\), whereasAgentCourtevolves adversarial lawyer agents via long\-horizon self\-play, yielding measurable skill gainsChenet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib42)\)\. These systems, however, lack guarantees on logical soundness\. We retain the agent metaphor and introduce the legal\-relevance\-sensitive evaluation and the*neural–symbolic*pipeline\. Our work goes further by compiling the extracted norms into*executable*formal reasoning\.
##### Legal LLM Evaluation and Benchmarks\.
LegalBenchconstructs a collaboratively designed benchmark of 162 legal\-reasoning tasks and evaluates both open\-source and commercial LLMs across different forms of legal reasoningGuhaet al\.\([2023](https://arxiv.org/html/2605.26530#bib.bib70)\)\.LawBenchfocuses on Chinese legal tasks and organizes 20 tasks according to Bloom’s cognitive taxonomy, providing a more jurisdiction\-specific view of legal knowledge memorization, understanding, and applicationFeiet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib71)\)\.LEXTREMEextends legal NLP evaluation to multilingual and multi\-task settings, covering 11 datasets across 24 languages and showing that legal\-domain evaluation remains challenging even for strong language modelsNiklauset al\.\([2023](https://arxiv.org/html/2605.26530#bib.bib72)\)\. More recently,LegalAgentBenchevaluates LLM agents in Chinese legal scenarios with real\-world corpora, external legal tools, and progress\-based intermediate metricsLiet al\.\([2025](https://arxiv.org/html/2605.26530#bib.bib73)\)\. Our evaluation is complementary: instead of measuring only task\-level accuracy, we stress\-test whether legal predictions change for legally justified reasons\. Unlike J&HHuet al\.\([2025b](https://arxiv.org/html/2605.26530#bib.bib2)\), which only studies label\-preserving knowledge\-injection robustness, our framework additionally evaluates should\-change legal sensitivity\. We introduce relevance\-sensitive evaluation and provide a solver\-grounded mechanism\.
##### Neural Symbolic Methods\.
Early work such as Logic Tensor NetworksBadreddineet al\.\([2022](https://arxiv.org/html/2605.26530#bib.bib49)\)embeds many\-valued fuzzy logic into differentiable architectures, while DeepProbLogManhaeveet al\.\([2018](https://arxiv.org/html/2605.26530#bib.bib50)\)integrates probabilistic logic programming with neural predicates\. More recent studies add large\-scale LLM components: NS\-LCR learns explicit law\- and case\-level rules for explainable case retrievalSunet al\.\([2024](https://arxiv.org/html/2605.26530#bib.bib51)\); Logic\-LM introduces structured prompting plus theorem\-prover feedback to enforce logical soundnessSadowski and Chudziak \([2025a](https://arxiv.org/html/2605.26530#bib.bib52)\); and Kant et al\. combine neuro\-symbolic reasoning with contract analysis to improve coverage decisionsKantet al\.\([2025](https://arxiv.org/html/2605.26530#bib.bib53)\)\. These results show that logic grounding enhances transparency and robustness, which we pursue by marrying adversarial agents with a symbolic solver for trustworthy legal AI\.
## 6Conclusion
In this paper, we frame trustworthy legal AI as a problem of*legal relevance sensitivity*: models should remain stable under legally irrelevant changes and respond appropriately to legally material ones\. We show that existing LLMs often fail to distinguish relevant legal changes from irrelevant perturbations\. To address these failures, we proposeLexGuard, a solver\-grounded multi\-stage reasoning framework that anchors legal decisions in explicit statutory conditions and verifiable reasoning\. Our results suggest that trustworthy legal AI requires not only high accuracy, but also stable, fair, and legally grounded sensitivity to what truly matters under law\.
## References
- \[1\]\(2022\)Logic tensor networks\.Artificial Intelligence303,pp\. 103649\.External Links:[Document](https://dx.doi.org/10.1016/j.artint.2021.103649)Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px3.p1.1)\.
- \[2\]G\. Chen, L\. Fan, Z\. Gong, N\. Xie, Z\. Li, Z\. Liu, C\. Li, Q\. Qu, S\. Ni, and M\. Yang\(2024\)AgentCourt: simulating court with adversarial evolvable lawyer agents\.arXiv preprint arXiv:2408\.08089\.Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px1.p1.1)\.
- \[3\]J\. Cui, Z\. Li, Y\. Yan, B\. Chen, and L\. Yuan\(2023\)ChatLaw: open\-source legal large language model with integrated external knowledge bases\.CoRRabs/2306\.16092\.External Links:[Link](https://arxiv.org/abs/2306.16092)Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1),[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px1.p1.1)\.
- \[4\]L\. De Moura and N\. Bjørner\(2008\)Z3: an efficient smt solver\.InProceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems,TACAS’08/ETAPS’08,Berlin, Heidelberg,pp\. 337–340\.External Links:ISBN 3540787992Cited by:[§3\.3](https://arxiv.org/html/2605.26530#S3.SS3.p2.4),[§4](https://arxiv.org/html/2605.26530#S4.SS0.SSS0.Px3.p1.1)\.
- \[5\]Z\. Fei, X\. Shen, D\. Zhu, F\. Zhou, Z\. Han, A\. Huang, S\. Zhang, K\. Chen, Z\. Yin, Z\. Shen, J\. Ge, and V\. Ng\(2024\)LawBench: benchmarking legal knowledge of large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 7933–7962\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.452),[Link](https://aclanthology.org/2024.emnlp-main.452/)Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px2.p1.1)\.
- \[6\]C\. Gao, C\. Xiao, Z\. Liu, H\. Chen, Z\. Liu, and M\. Sun\(2024\)Enhancing legal case retrieval via scaling high\-quality synthetic query–candidate pairs\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 7086–7100\.Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px1.p1.1)\.
- \[7\]A\. GitHub\(2026\)LexGuard artifact\.Note:[https://sites\.google\.com/view/legalai\-aaai/home](https://sites.google.com/view/legalai-aaai/home)Cited by:[§4](https://arxiv.org/html/2605.26530#S4.SS0.SSS0.Px2.p1.1)\.
- \[8\]N\. Guha, J\. Nyarko, D\. E\. Ho, C\. Ré, A\. Chilton, A\. Narayana, A\. Chohlas\-Wood, A\. Peters, B\. Waldon, D\. Rockmore, D\. Zambrano, D\. Talisman, E\. Hoque, F\. Surani, F\. Fagan, G\. Sarfaty, G\. Dickinson, H\. Porat, J\. Hegland, J\. Wu,et al\.\(2023\)LegalBench: a collaboratively built benchmark for measuring legal reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2023/hash/89e44582fd28ddfea1ea4dcb0ebbf4b0-Abstract-Datasets_and_Benchmarks.html)Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px2.p1.1)\.
- \[9\]D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt\(2021\)Measuring massive multitask language understanding\.InProc\. ICLR 2021,External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[10\]J\. Ho, A\. Colby, and W\. Fisher\(2025\)Incorporating legal structure in retrieval\-augmented generation: a case study on copyright fair use\.External Links:2505\.02164,[Link](https://arxiv.org/abs/2505.02164)Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[11\]A\. B\. Hou, W\. Jurayj, N\. Holzenberger, A\. Blair\-Stanek, and B\. V\. Durme\(2024\)Gaps or hallucinations? gazing into machine\-generated legal analysis for fine\-grained text evaluations\.External Links:2409\.09947,[Link](https://arxiv.org/abs/2409.09947)Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[12\]Y\. Hu, H\. Liu, Q\. Chen, N\. Zheng, C\. Wang, Y\. Liu, C\. L\. A\. Clarke, and W\. Shen\(2025\)J&H: evaluating the robustness of large language models under knowledge\-injection attacks in legal domain\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 28106–28114\.Cited by:[§4](https://arxiv.org/html/2605.26530#S4.SS0.SSS0.Px3.p1.1)\.
- \[13\]Y\. Hu, H\. Liu, Q\. Chen, N\. Zheng, C\. Wang, Y\. Liu, C\. L\. Clarke, and W\. Shen\(2025\)J&h: evaluating the robustness of large language models under knowledge\-injection attacks in legal domain\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 28106–28115\.Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p3.1),[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px2.p1.1)\.
- \[14\]C\. Jiang and X\. Yang\(2024\)Agents on the bench: large language model based multi agent framework for trustworthy digital justice\.arXiv preprint arXiv:2412\.18697\.Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px1.p1.1)\.
- \[15\]M\. Kant, S\. Nabi, M\. Kant, R\. Scharrer, M\. Ma, and M\. Nabi\(2025\)Towards robust legal reasoning: harnessing logical llms in law\.arXiv preprint arXiv:2502\.17638\.Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px3.p1.1)\.
- \[16\]D\. M\. Katz, M\. J\. B\. II, S\. Gao, and P\. Arredondo\(2024\)GPT\-4 passes the bar exam\.Philosophical Transactions of the Royal Society A\.Note:First posted as SSRN 4389233, 2023External Links:[Document](https://dx.doi.org/10.1098/rsta.2023.0254)Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[17\]A\. Kesari, D\. Sele, E\. Ash, and S\. Bechtold\(2024\)A legal framework for explainable artificial intelligence\.Center for Law & Economics Working Paper Series9\.Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[18\]H\. Li, Q\. Ai, Q\. Dong, and Y\. Liu\(2024\)LexiLaw: a scalable legal language model for comprehensive legal understanding\.Note:[https://github\.com/CSHaitao/LexiLaw](https://github.com/CSHaitao/LexiLaw)Cited by:[Appendix D](https://arxiv.org/html/2605.26530#A4.SSx3.p1.1)\.
- \[19\]H\. Li, J\. Chen, J\. Yang, Q\. Ai, W\. Jia, Y\. Liu, K\. Lin, Y\. Wu, G\. Yuan, Y\. Hu, W\. Wang, Y\. Liu, and M\. Huang\(2025\)LegalAgentBench: evaluating llm agents in legal domain\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 2322–2344\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.116),[Link](https://aclanthology.org/2025.acl-long.116/)Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px2.p1.1)\.
- \[20\]H\. Li, Y\. Shao, Y\. Wu, Q\. Ai, Y\. Ma, and Y\. Liu\(2023\)LeCaRDv2: a large\-scale chinese legal case retrieval dataset\.arXiv preprint arXiv:2310\.17609\.External Links:[Link](https://arxiv.org/abs/2310.17609)Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p6.1),[§4](https://arxiv.org/html/2605.26530#S4.SS0.SSS0.Px2.p1.1)\.
- \[21\]E\. Linna and T\. Linna\(2025\)Judicial requirements for generative ai in legal reasoning\.arXiv preprint arXiv:2508\.18880\.Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[22\]R\. Manhaeve, S\. Dumančić, A\. Kimmig, T\. Demeester, and L\. De Raedt\(2018\)DeepProbLog: neural probabilistic logic programming\.InAdvances in Neural Information Processing Systems,Vol\.31\.Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px3.p1.1)\.
- \[23\]J\. Niklaus, V\. Matoshi, P\. Rani, A\. Galassi, M\. Stürmer, and I\. Chalkidis\(2023\)LEXTREME: a multi\-lingual and multi\-task benchmark for the legal domain\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 3016–3054\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.200),[Link](https://aclanthology.org/2023.findings-emnlp.200/)Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px2.p1.1)\.
- \[24\]A\. Sadowski and J\. A\. Chudziak\(2025\)Explainable rule application via structured prompting: a neural–symbolic approach\.arXiv preprint arXiv:2506\.16335\.Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px3.p1.1)\.
- \[25\]A\. Sadowski and J\. A\. Chudziak\(2025\)On verifiable legal reasoning: a multi\-agent framework with formalized knowledge representations\.InProceedings of the 34th ACM International Conference on Information and Knowledge Management,pp\. 2535–2545\.Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[26\]J\. Savelka\(2023\)Unlocking practical applications in the legal domain: evaluation of GPT for zero\-shot semantic annotation of legal texts\.InProc\. ICAIL 2023,pp\. 447–451\.External Links:[Document](https://dx.doi.org/10.1145/3594536.3595161)Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[27\]D\. Shu, H\. Zhao, X\. Liu, D\. Demeter, M\. Du, and Y\. Zhang\(2024\-10\)LawLLM: law large language model for the us legal system\.InProceedings of the 33rd ACM International Conference on Information and Knowledge Management,CIKM ’24,pp\. 4882–4889\.External Links:[Link](http://dx.doi.org/10.1145/3627673.3680020),[Document](https://dx.doi.org/10.1145/3627673.3680020)Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p1.1)\.
- \[28\]Z\. Sun, K\. Zhang, W\. Yu, H\. Wang, and J\. Xu\(2024\)Logic rules as explanations for legal case retrieval \(ns\-lcr\)\.InProceedings of LREC\-COLING 2024,Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px3.p1.1)\.
- \[29\]Z\. Xue, H\. Liu, Y\. Hu, Y\. Qian, Y\. Wang, K\. Kong, C\. Wang, Y\. Liu, and W\. Shen\(2024\)Leec for judicial fairness: a legal element extraction dataset with extensive extra\-legal labels\.InProceedings of the Thirty\-Third International Joint Conference on Artificial Intelligence, IJCAI\-24,pp\. 7527–7535\.Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p6.1),[§4](https://arxiv.org/html/2605.26530#S4.SS0.SSS0.Px2.p1.1)\.
- \[30\]S\. Yao, Q\. Ke, Q\. Wang, K\. Li, and J\. Hu\(2024\)Lawyer gpt: a legal large language model with enhanced domain knowledge and reasoning capabilities\.InProceedings of the 3rd International Symposium on Robotics, Artificial Intelligence and Information Engineering \(RAIIE ’24\),pp\. 108–112\.Cited by:[§5](https://arxiv.org/html/2605.26530#S5.SS0.SSS0.Px1.p1.1)\.
- \[31\]H\. Yiran, Z\. Xue, H\. Li, S\. Zheng, Q\. Chen, S\. Wang, X\. Zhang, N\. Zheng, Y\. Liu, Q\. Ai,et al\.LLMs on trial: evaluating judicial fairness for large language models\.InWorkshop on Socially Responsible Language Modelling Research,Cited by:[§1](https://arxiv.org/html/2605.26530#S1.p2.1),[§1](https://arxiv.org/html/2605.26530#S1.p3.1)\.
- \[32\]S\. Yue, W\. Chen, S\. Wang, B\. Li, C\. Shen, S\. Liu, Y\. Zhou, Y\. Xiao, S\. Yun, X\. Huang, and Z\. Wei\(2023\)DISC\-lawllm: fine\-tuning large language models for intelligent legal services\.arXiv preprint arXiv:2309\.11325\.External Links:[Link](https://arxiv.org/abs/2309.11325)Cited by:[Appendix D](https://arxiv.org/html/2605.26530#A4.SSx3.p1.1)\.
## Appendix ACost Statistics
Table 7:Average cost per case ofLexGuard\.Table[7](https://arxiv.org/html/2605.26530#A1.T7)shows thatLexGuardincurs a moderate computational overhead per case\. On average, each case requires 10\.33 LLM calls and 107\.36 seconds of runtime, with a total cost of only $0\.0819\. This suggests that, despite employing a multi\-stage agent\-and\-solver pipeline,LexGuardremains practically affordable for legal reasoning tasks\. The token statistics further indicate that most of the cost comes from processing relatively long legal inputs, while the overall per\-case monetary expense remains low\.
## Appendix BDetails of Statutory Formalization
This appendix provides the formal specification used to translate statutory provisions into solver\-checkable constraints\. At a high level, each statutory clause is represented as an implication from an article\-level applicability guard and a clause\-level condition guard to a legally admissible penalty range:
ArticleGuard\(a,x\)∧ClauseGuard\(c,x\)⇒Penalty\(c,x\),\\textsc\{ArticleGuard\}\(a,x\)\\wedge\\textsc\{ClauseGuard\}\(c,x\)\\Rightarrow\\textsc\{Penalty\}\(c,x\),whereaadenotes an article,ccdenotes a clause under the article, andxxdenotes a suspect\-centric case representation\. We make this template explicit below\.
### B\.1Typed Case Representation
We represent each case as a tuple of typed legal facts:
x=⟨s,v,A,R,M,Q,E⟩,x=\\langle s,v,A,R,M,Q,E\\rangle,wheressis the suspect,vvis the victim or protected legal interest,AAis the set of acts,RRis the set of harmful results,MMis the mental state,QQis the set of qualifying circumstances, andEEis the set of legally irrelevant or extra\-legal attributes\.
We use the following typed predicates:
Actor\(s\),Victim\(v\),Act\(s,a\),Result\(r\),\\displaystyle\\textsc\{Actor\}\(s\),\\quad\\textsc\{Victim\}\(v\),\\quad\\textsc\{Act\}\(s,a\),\\quad\\textsc\{Result\}\(r\),Causes\(a,r\),MentalState\(s,m\),ProtectedInterest\(v,i\),\\displaystyle\\textsc\{Causes\}\(a,r\),\\quad\\textsc\{MentalState\}\(s,m\),\\quad\\textsc\{ProtectedInterest\}\(v,i\),Amount\(r,n\),Severity\(r,ℓ\),Qualifier\(s,q\),\\displaystyle\\textsc\{Amount\}\(r,n\),\\quad\\textsc\{Severity\}\(r,\\ell\),\\quad\\textsc\{Qualifier\}\(s,q\),ExtraLegal\(e\),HasExtraLegalAttr\(x,e\)\.\\displaystyle\\textsc\{ExtraLegal\}\(e\),\\quad\\textsc\{HasExtraLegalAttr\}\(x,e\)\.
The domains of core symbols are typed as:
s∈𝒮,v∈𝒱,a∈𝒜,r∈ℛ,\\displaystyle s\\in\\mathcal\{S\},\\quad v\\in\\mathcal\{V\},\\quad a\\in\\mathcal\{A\},\\quad r\\in\\mathcal\{R\},m∈\{Intentional,Negligent,Knowing,Unknown\},\\displaystyle m\\in\\\{\\textsf\{Intentional\},\\textsf\{Negligent\},\\textsf\{Knowing\},\\textsf\{Unknown\}\\\},ℓ∈\{Minor,Serious,EspeciallySerious\},\\displaystyle\\ell\\in\\\{\\textsf\{Minor\},\\textsf\{Serious\},\\textsf\{EspeciallySerious\}\\\},q∈𝒬,e∈ℰ\.\\displaystyle q\\in\\mathcal\{Q\},\\quad e\\in\\mathcal\{E\}\.
For implementation, these predicates are encoded as Boolean or integer SMT variables\. For example, a case\-level fact such as “the suspect intentionally caused serious injury” is encoded as:
MentalState\(s,Intentional\)=⊤,Severity\(r,Serious\)=⊤,∃a\[Act\(s,a\)∧Causes\(a,r\)\]\.\\textsc\{MentalState\}\(s,\\textsf\{Intentional\}\)=\\top,\\quad\\textsc\{Severity\}\(r,\\textsf\{Serious\}\)=\\top,\\quad\\exists a\\,\[\\textsc\{Act\}\(s,a\)\\wedge\\textsc\{Causes\}\(a,r\)\]\.
### B\.2Article\-Level Applicability Guard
Each articleaais first associated with an article\-level guard\. This guard specifies the general legal domain in which the article may apply\. It prevents the solver from considering provisions that are categorically irrelevant to the case\.
For an articleAkA\_\{k\}, the article guard is defined as:
ArticleGuardk\(x\)=SubjectGuardk\(x\)∧ObjectGuardk\(x\)∧ConductGuardk\(x\)\.\\textsc\{ArticleGuard\}\_\{k\}\(x\)=\\textsc\{SubjectGuard\}\_\{k\}\(x\)\\wedge\\textsc\{ObjectGuard\}\_\{k\}\(x\)\\wedge\\textsc\{ConductGuard\}\_\{k\}\(x\)\.
More explicitly:
SubjectGuardk\(x\)\\displaystyle\\textsc\{SubjectGuard\}\_\{k\}\(x\):=⋀p∈𝒫ksubjp\(s\),\\displaystyle=\\bigwedge\_\{p\\in\\mathcal\{P\}^\{subj\}\_\{k\}\}p\(s\),ObjectGuardk\(x\)\\displaystyle\\textsc\{ObjectGuard\}\_\{k\}\(x\):=⋁i∈ℐkProtectedInterest\(v,i\),\\displaystyle=\\bigvee\_\{i\\in\\mathcal\{I\}\_\{k\}\}\\textsc\{ProtectedInterest\}\(v,i\),ConductGuardk\(x\)\\displaystyle\\textsc\{ConductGuard\}\_\{k\}\(x\):=∃a∈AActTypek\(a\)\.\\displaystyle=\\exists a\\in A\\;\\textsc\{ActType\}\_\{k\}\(a\)\.
Here,𝒫ksubj\\mathcal\{P\}^\{subj\}\_\{k\}is the set of subject requirements, such as being a natural person, state functionary, employee, or responsible person;ℐk\\mathcal\{I\}\_\{k\}is the set of protected legal interests covered by the article; andActTypek\\textsc\{ActType\}\_\{k\}denotes the conduct type regulated by the article\.
Thus, an article is activated only if:
ApplicableArticlek\(x\)⇔ArticleGuardk\(x\)\.\\textsc\{ApplicableArticle\}\_\{k\}\(x\)\\Leftrightarrow\\textsc\{ArticleGuard\}\_\{k\}\(x\)\.
### B\.3Clause\-Level Guard
A statutory article may contain multiple clauses corresponding to different factual thresholds, aggravating circumstances, mitigating circumstances, or penalty brackets\. For each clauseCk,jC\_\{k,j\}under articleAkA\_\{k\}, we define:
ClauseGuardk,j\(x\)=ElementGuardk,j\(x\)∧ThresholdGuardk,j\(x\)∧ExceptionGuardk,j\(x\)\.\\textsc\{ClauseGuard\}\_\{k,j\}\(x\)=\\textsc\{ElementGuard\}\_\{k,j\}\(x\)\\wedge\\textsc\{ThresholdGuard\}\_\{k,j\}\(x\)\\wedge\\textsc\{ExceptionGuard\}\_\{k,j\}\(x\)\.
The element guard checks whether the necessary legal elements are satisfied:
ElementGuardk,j\(x\)=MentalGuardk,j\(x\)∧ResultGuardk,j\(x\)∧CausationGuardk,j\(x\)\.\\textsc\{ElementGuard\}\_\{k,j\}\(x\)=\\textsc\{MentalGuard\}\_\{k,j\}\(x\)\\wedge\\textsc\{ResultGuard\}\_\{k,j\}\(x\)\\wedge\\textsc\{CausationGuard\}\_\{k,j\}\(x\)\.
The mental\-state guard is:
MentalGuardk,j\(x\)=⋁m∈ℳk,jMentalState\(s,m\),\\textsc\{MentalGuard\}\_\{k,j\}\(x\)=\\bigvee\_\{m\\in\\mathcal\{M\}\_\{k,j\}\}\\textsc\{MentalState\}\(s,m\),whereℳk,j\\mathcal\{M\}\_\{k,j\}is the set of legally admissible mental states for the clause\.
The result guard is:
ResultGuardk,j\(x\)=⋁r∈R\[ResultTypek,j\(r\)∧Severityk,j\(r\)\]\.\\textsc\{ResultGuard\}\_\{k,j\}\(x\)=\\bigvee\_\{r\\in R\}\\left\[\\textsc\{ResultType\}\_\{k,j\}\(r\)\\wedge\\textsc\{Severity\}\_\{k,j\}\(r\)\\right\]\.
The causation guard is:
CausationGuardk,j\(x\)=∃a∈A,r∈R\[Act\(s,a\)∧Result\(r\)∧Causes\(a,r\)\]\.\\textsc\{CausationGuard\}\_\{k,j\}\(x\)=\\exists a\\in A,r\\in R\\;\[\\textsc\{Act\}\(s,a\)\\wedge\\textsc\{Result\}\(r\)\\wedge\\textsc\{Causes\}\(a,r\)\]\.
The threshold guard encodes quantitative requirements, such as amount, number of victims, loss value, injury level, or repetition:
ThresholdGuardk,j\(x\)=⋀t∈𝒯k,jSatisfyThreshold\(x,t\)\.\\textsc\{ThresholdGuard\}\_\{k,j\}\(x\)=\\bigwedge\_\{t\\in\\mathcal\{T\}\_\{k,j\}\}\\textsc\{SatisfyThreshold\}\(x,t\)\.
For an amount\-based thresholdt=\(r,n,⋈,θ\)t=\(r,n,\\bowtie,\\theta\), this becomes:
SatisfyThreshold\(x,t\)⇔Amount\(r,n\)∧n⋈θ\.\\textsc\{SatisfyThreshold\}\(x,t\)\\Leftrightarrow\\textsc\{Amount\}\(r,n\)\\wedge n\\bowtie\\theta\.
The exception guard excludes clauses that are blocked by legally recognized exceptions:
ExceptionGuardk,j\(x\)=¬⋁e∈𝒳k,jExceptione\(x\)\.\\textsc\{ExceptionGuard\}\_\{k,j\}\(x\)=\\neg\\bigvee\_\{e\\in\\mathcal\{X\}\_\{k,j\}\}\\textsc\{Exception\}\_\{e\}\(x\)\.
Therefore, a clause is admissible iff:
AdmissibleClausek,j\(x\)⇔ArticleGuardk\(x\)∧ClauseGuardk,j\(x\)\.\\textsc\{AdmissibleClause\}\_\{k,j\}\(x\)\\Leftrightarrow\\textsc\{ArticleGuard\}\_\{k\}\(x\)\\wedge\\textsc\{ClauseGuard\}\_\{k,j\}\(x\)\.
### B\.4Penalty Encoding
Each admissible clause maps to a legally permitted penalty interval\. We encode the sentence as an integer variable:
y∈ℤ≥0,y\\in\\mathbb\{Z\}\_\{\\geq 0\},whereyydenotes the imprisonment term in months\. Special punishments such as life imprisonment or death penalty can be represented by reserved symbols:
y∈ℤ≥0∪\{Life,Death\}\.y\\in\\mathbb\{Z\}\_\{\\geq 0\}\\cup\\\{\\textsf\{Life\},\\textsf\{Death\}\\\}\.
For ordinary fixed\-term imprisonment, each clauseCk,jC\_\{k,j\}defines a penalty interval:
Penaltyk,j\(y\)⇔Lk,j≤y≤Uk,j,\\textsc\{Penalty\}\_\{k,j\}\(y\)\\Leftrightarrow L\_\{k,j\}\\leq y\\leq U\_\{k,j\},whereLk,jL\_\{k,j\}andUk,jU\_\{k,j\}are the lower and upper statutory bounds\.
The complete clause implication is:
∀x,y\.\[ArticleGuardk\(x\)∧ClauseGuardk,j\(x\)\]⇒Penaltyk,j\(y\)\.\\forall x,y\\;\.\\left\[\\textsc\{ArticleGuard\}\_\{k\}\(x\)\\wedge\\textsc\{ClauseGuard\}\_\{k,j\}\(x\)\\right\]\\Rightarrow\\textsc\{Penalty\}\_\{k,j\}\(y\)\.
Equivalently:
∀x,y\.AdmissibleClausek,j\(x\)⇒Lk,j≤y≤Uk,j\.\\forall x,y\\;\.\\textsc\{AdmissibleClause\}\_\{k,j\}\(x\)\\Rightarrow L\_\{k,j\}\\leq y\\leq U\_\{k,j\}\.
### B\.5Aggravating and Mitigating Circumstances
Some statutory clauses do not directly define a separate offense, but modify the penalty range\. We represent aggravating and mitigating factors as monotonic transformations over the base interval\.
Let:
Bk,j\(x\)=\[Lk,j,Uk,j\]B\_\{k,j\}\(x\)=\[L\_\{k,j\},U\_\{k,j\}\]be the base penalty interval\. Let𝒢\(x\)\\mathcal\{G\}\(x\)be the set of aggravating factors andℳ\(x\)\\mathcal\{M\}\(x\)the set of mitigating factors\.
An aggravating factorggis encoded as:
Aggravatingg\(x\)⇒y≥Lk,j\+Δg−,\\textsc\{Aggravating\}\_\{g\}\(x\)\\Rightarrow y\\geq L\_\{k,j\}\+\\Delta^\{\-\}\_\{g\},whereΔg−\\Delta^\{\-\}\_\{g\}raises the effective lower bound\.
A mitigating factormmis encoded as:
Mitigatingm\(x\)⇒y≤Uk,j−Δm\+,\\textsc\{Mitigating\}\_\{m\}\(x\)\\Rightarrow y\\leq U\_\{k,j\}\-\\Delta^\{\+\}\_\{m\},whereΔm\+\\Delta^\{\+\}\_\{m\}lowers the effective upper bound\.
Thus, the adjusted penalty interval is:
AdjustedPenaltyk,j\(x,y\)⇔Lk,j′\(x\)≤y≤Uk,j′\(x\),\\textsc\{AdjustedPenalty\}\_\{k,j\}\(x,y\)\\Leftrightarrow L^\{\\prime\}\_\{k,j\}\(x\)\\leq y\\leq U^\{\\prime\}\_\{k,j\}\(x\),where:
Lk,j′\(x\)=Lk,j\+∑g∈𝒢\(x\)Δg−,L^\{\\prime\}\_\{k,j\}\(x\)=L\_\{k,j\}\+\\sum\_\{g\\in\\mathcal\{G\}\(x\)\}\\Delta^\{\-\}\_\{g\},and:
Uk,j′\(x\)=Uk,j−∑m∈ℳ\(x\)Δm\+\.U^\{\\prime\}\_\{k,j\}\(x\)=U\_\{k,j\}\-\\sum\_\{m\\in\\mathcal\{M\}\(x\)\}\\Delta^\{\+\}\_\{m\}\.
To avoid inconsistent intervals, we require:
Lk,j′\(x\)≤Uk,j′\(x\)\.L^\{\\prime\}\_\{k,j\}\(x\)\\leq U^\{\\prime\}\_\{k,j\}\(x\)\.
If this constraint is unsatisfiable, the solver rejects the corresponding statutory interpretation\.
### B\.6Invariance to Legally Irrelevant Attributes
To enforce legal relevance, extra\-legal attributes are explicitly excluded from article and clause guards\. For any extra\-legal attributee∈ℰe\\in\\mathcal\{E\}, we require:
e∉Vars\(ArticleGuardk∪ClauseGuardk,j∪Penaltyk,j\)\.e\\notin\\textsc\{Vars\}\\left\(\\textsc\{ArticleGuard\}\_\{k\}\\cup\\textsc\{ClauseGuard\}\_\{k,j\}\\cup\\textsc\{Penalty\}\_\{k,j\}\\right\)\.
Equivalently, for two casesxxandx′x^\{\\prime\}that differ only in extra\-legal attributes:
x≡¬ℰx′⇒AdmissibleClausek,j\(x\)=AdmissibleClausek,j\(x′\)\.x\\equiv\_\{\\neg\\mathcal\{E\}\}x^\{\\prime\}\\Rightarrow\\textsc\{AdmissibleClause\}\_\{k,j\}\(x\)=\\textsc\{AdmissibleClause\}\_\{k,j\}\(x^\{\\prime\}\)\.
The predicted statutory set should therefore remain invariant:
x≡¬ℰx′⇒𝒞∗\(x\)=𝒞∗\(x′\),x\\equiv\_\{\\neg\\mathcal\{E\}\}x^\{\\prime\}\\Rightarrow\\mathcal\{C\}^\{\*\}\(x\)=\\mathcal\{C\}^\{\*\}\(x^\{\\prime\}\),where:
𝒞∗\(x\)=\{Ck,j∣AdmissibleClausek,j\(x\)\}\.\\mathcal\{C\}^\{\*\}\(x\)=\\\{C\_\{k,j\}\\mid\\textsc\{AdmissibleClause\}\_\{k,j\}\(x\)\\\}\.
This constraint is used to distinguish legally irrelevant perturbations from legally material changes\.
### B\.7Solver Objective
Given extracted factsFxF\_\{x\}and formalized statutory knowledge base𝒦\\mathcal\{K\}, the solver searches for admissible clauses and penalty assignments:
𝒦∪Fx⊧AdmissibleClausek,j\(x\)∧AdjustedPenaltyk,j\(x,y\)\.\\mathcal\{K\}\\cup F\_\{x\}\\models\\textsc\{AdmissibleClause\}\_\{k,j\}\(x\)\\wedge\\textsc\{AdjustedPenalty\}\_\{k,j\}\(x,y\)\.
The set of solver\-supported clauses is:
𝒞∗\(x\)=\{Ck,j∣𝒦∪Fx⊧AdmissibleClausek,j\(x\)\}\.\\mathcal\{C\}^\{\*\}\(x\)=\\left\\\{C\_\{k,j\}\\mid\\mathcal\{K\}\\cup F\_\{x\}\\models\\textsc\{AdmissibleClause\}\_\{k,j\}\(x\)\\right\\\}\.
The final penalty set is:
𝒴∗\(x\)=\{y∣∃Ck,j∈𝒞∗\(x\),𝒦∪Fx⊧AdjustedPenaltyk,j\(x,y\)\}\.\\mathcal\{Y\}^\{\*\}\(x\)=\\left\\\{y\\mid\\exists C\_\{k,j\}\\in\\mathcal\{C\}^\{\*\}\(x\),\\mathcal\{K\}\\cup F\_\{x\}\\models\\textsc\{AdjustedPenalty\}\_\{k,j\}\(x,y\)\\right\\\}\.
If multiple clauses are admissible, we apply statutory priority rules\. Specifically, more specific clauses dominate more general clauses:
Ck,j1≻Ck,j2iffClauseGuardk,j1\(x\)⇒ClauseGuardk,j2\(x\)C\_\{k,j\_\{1\}\}\\succ C\_\{k,j\_\{2\}\}\\quad\\text\{iff\}\\quad\\textsc\{ClauseGuard\}\_\{k,j\_\{1\}\}\(x\)\\Rightarrow\\textsc\{ClauseGuard\}\_\{k,j\_\{2\}\}\(x\)and not conversely\. The selected clause set is:
𝒞^\(x\)=\{C∈𝒞∗\(x\)∣∄C′∈𝒞∗\(x\),C′≻C\}\.\\widehat\{\\mathcal\{C\}\}\(x\)=\\left\\\{C\\in\\mathcal\{C\}^\{\*\}\(x\)\\mid\\nexists C^\{\\prime\}\\in\\mathcal\{C\}^\{\*\}\(x\),C^\{\\prime\}\\succ C\\right\\\}\.
The final sentence is then selected from the legally valid interval:
y^∈⋃Ck,j∈𝒞^\(x\)\[Lk,j′\(x\),Uk,j′\(x\)\]\.\\widehat\{y\}\\in\\bigcup\_\{C\_\{k,j\}\\in\\widehat\{\\mathcal\{C\}\}\(x\)\}\[L^\{\\prime\}\_\{k,j\}\(x\),U^\{\\prime\}\_\{k,j\}\(x\)\]\.
### B\.8Example Schema
For illustration, consider a generic criminal articleAkA\_\{k\}with two clauses\. ClauseCk,1C\_\{k,1\}covers ordinary circumstances, while clauseCk,2C\_\{k,2\}covers serious circumstances\.
The article guard is:
ArticleGuardk\(x\)=Actor\(s\)∧∃a∈ARegulatedActk\(a\)∧∃r∈RProtectedResultk\(r\)\.\\textsc\{ArticleGuard\}\_\{k\}\(x\)=\\textsc\{Actor\}\(s\)\\wedge\\exists a\\in A\\;\\textsc\{RegulatedAct\}\_\{k\}\(a\)\\wedge\\exists r\\in R\\;\\textsc\{ProtectedResult\}\_\{k\}\(r\)\.
The ordinary clause is:
ClauseGuardk,1\(x\)=\\displaystyle\\textsc\{ClauseGuard\}\_\{k,1\}\(x\)=MentalState\(s,Intentional\)∧∃a,r\[Act\(s,a\)∧Causes\(a,r\)∧Severity\(r,Minor\)\]\\displaystyle\\textsc\{MentalState\}\(s,\\textsf\{Intentional\}\)\\wedge\\exists a,r\\;\[\\textsc\{Act\}\(s,a\)\\wedge\\textsc\{Causes\}\(a,r\)\\wedge\\textsc\{Severity\}\(r,\\textsf\{Minor\}\)\]∧¬EspeciallySeriousCircumstance\(x\)\.\\displaystyle\\wedge\\neg\\textsc\{EspeciallySeriousCircumstance\}\(x\)\.
Its penalty range is:
Penaltyk,1\(y\)⇔0≤y≤36\.\\textsc\{Penalty\}\_\{k,1\}\(y\)\\Leftrightarrow 0\\leq y\\leq 36\.
The serious clause is:
ClauseGuardk,2\(x\)=\\displaystyle\\textsc\{ClauseGuard\}\_\{k,2\}\(x\)=MentalState\(s,Intentional\)∧∃a,r\[Act\(s,a\)∧Causes\(a,r\)∧Severity\(r,Serious\)\]\\displaystyle\\textsc\{MentalState\}\(s,\\textsf\{Intentional\}\)\\wedge\\exists a,r\\;\[\\textsc\{Act\}\(s,a\)\\wedge\\textsc\{Causes\}\(a,r\)\\wedge\\textsc\{Severity\}\(r,\\textsf\{Serious\}\)\]∨EspeciallySeriousCircumstance\(x\)\.\\displaystyle\\vee\\textsc\{EspeciallySeriousCircumstance\}\(x\)\.
Its penalty range is:
Penaltyk,2\(y\)⇔36<y≤120\.\\textsc\{Penalty\}\_\{k,2\}\(y\)\\Leftrightarrow 36<y\\leq 120\.
The full formalization is therefore:
ArticleGuardk\(x\)∧ClauseGuardk,1\(x\)⇒0≤y≤36,\\displaystyle\\textsc\{ArticleGuard\}\_\{k\}\(x\)\\wedge\\textsc\{ClauseGuard\}\_\{k,1\}\(x\)\\Rightarrow 0\\leq y\\leq 6,ArticleGuardk\(x\)∧ClauseGuardk,2\(x\)⇒36<y≤120\.\\displaystyle\\textsc\{ArticleGuard\}\_\{k\}\(x\)\\wedge\\textsc\{ClauseGuard\}\_\{k,2\}\(x\)\\Rightarrow 6<y\\leq 20\.
This example shows how the compact templateArticleGuard∧ClauseGuard⇒Penalty\\textsc\{ArticleGuard\}\\wedge\\textsc\{ClauseGuard\}\\Rightarrow\\textsc\{Penalty\}is instantiated as a typed, clause\-specific, solver\-checkable representation\. In our framework, LLMs are used only to propose candidate facts and candidate statutory mappings, while the final admissible clauses and penalty ranges must be validated by the solver against the formalized statutory knowledge base\.
## Appendix CDataset Details
### Dataset 1: LeCaRDv2 Subset Dataset
The structure of the original candidate set of the LeCaRDv2 dataset is shown in Table[8](https://arxiv.org/html/2605.26530#A3.T8)\.
Table 8:Structure of a raw LeCaRDv2 case\.We use thefactfield as the model input\. Theresultfield is parsed to extract the sentencing result in months, saved astrue\_sentence\_monthsfor automated evaluation\. We also split thearticlefield intotrue\_general\_articlesandtrue\_specific\_articlesin accordance with Chinese Criminal Law to allow separate evaluation\. The processed structure is shown in Table[9](https://arxiv.org/html/2605.26530#A3.T9)\.
Table 9:Structure of a processed LeCaRDv2 case\.
### Dataset 2: LEEC Suspect\-Level Dataset
We construct a suspect\-level evaluation dataset based on the publicly released LEEC corpus\. We first apply rule\-based regular expressions to recover structured fields, and then parse sentencing decisions into suspect\-level labels, including each suspect’s charge and applicable article list\. The processed suspect\-level format used in our experiments is shown in Table[10](https://arxiv.org/html/2605.26530#A3.T10)\.
Table 10:Structure of the processed LEEC suspect\-level format\.
### Dataset 3: Controlled Perturbation Dataset for RQ3
The RQ3 dataset evaluates whether a legal reasoning system makes the expected prediction change when case facts are modified in legally material ways\. It is stored asperturbed\_cases\_nips\.jsonand contains 8,000 paired criminal\-law cases\. Each pair consists of a base case and a perturbed case, together with structured metadata describing the perturbation rule, perturbation category, label\-change status, and before/after statute labels\. The reported statistics are computed from the perturbed case facts and suspect\-level statute annotations\. This design supports should\-change sensitivity tests and related label\-preserving diagnostics\.
Table 11:Structure of the RQ3 controlled perturbation dataset\.Table 12:Example record from the RQ3 perturbation dataset\.
## Appendix DExperiment Setup Details
### Hardware Environment
All experiments were conducted on a dedicated research server equipped with two NVIDIA RTX 6000 Ada GPUs, each with 48GB VRAM, under CUDA 12\.4 and driver version 550\.144\.03\. The software environment used Anaconda\-managed Python environments for model execution and evaluation\.
### Evaluation Protocol
For RQ1, statute prediction is formulated as multi\-label classification\. We report precision, recall, and F1 for general provisions and specific provisions on LeCaRDv2 and LEEC\. Sentencing is evaluated with sentencing error \(SE\), RMSE, and legal validity, with and without golden statutes\. LEEC additionally reports suspect extraction F1 because each case may contain multiple defendants\. RQ2 disables one major component at a time and evaluates the resulting change in general precision, recall, F1, specific precision, recall, F1\.
For RQ3, every instance contains an original case, a perturbed case, perturbation metadata, and before/after statute labels\. We score predictions against the post\-perturbation gold statute set, focusing on whether the model changes its prediction when the perturbation changes statutory applicability\. For RQ4, adversarial user content is appended to the legal input while the gold legal label remains unchanged, following the should\-not\-change prompt\-injection setting described in the main text\. RQ5 provides a confusing\-statute cluster and requires the model to select only applicable articles inside that cluster, or abstain when the cluster is irrelevant\.
### Baselines and Prompt Variants
We compareLexGuardagainst direct LLM inference and specialized legal LLMs\. GPT\-5\.2, GPT\-4o, GPT o4\-mini, DeepSeek v3, and Claude 4 Sonnet are evaluated through API calls\. DISC\-LawLLM\[[32](https://arxiv.org/html/2605.26530#bib.bib9)\]and LexiLaw\[[18](https://arxiv.org/html/2605.26530#bib.bib10)\]are legal\-domain baselines\. GPT\-5\.2\-J&H\-CoT keeps the same output schema as GPT\-5\.2 but adds an explicit legal reasoning scaffold for rule identification, fact\-to\-element mapping, elimination of inapplicable alternatives, and final prediction\. GPT\-5\.2\-J&H\-Few\-shot uses the same evaluation scripts and API configuration as the corresponding J&H\-CoT runs, but calibrates the prompt with legal input\-output demonstrations rather than a procedural reasoning scaffold\.
### Implementation Details
All systems are evaluated with deterministic scripts and normalized statute identifiers\.LexGuardapplies statute extraction, fact extraction, competitive\-article refinement, and solver\-centric verification before final prediction\. For baseline outputs that are generated in natural language, we use the same structured extraction prompts to recover article identifiers and sentencing outcomes before computing metrics\.
## Appendix EAdditional Experimental Results and Robustness Protocols
### E\.1Metric Definitions
Table[13](https://arxiv.org/html/2605.26530#A5.T13)defines the metrics used in Table[1](https://arxiv.org/html/2605.26530#S2.T1)and in the experimental results\. Predictions and labels are normalized to statute identifiers before evaluation\. Higher values are better unless marked with↓\\downarrow\. The framework table uses the same metric names as the experiment tables\. Abbreviated headers such as Align\., Sta\., Inv\., ASR, CRR, Pos\., Macro, Omit, and Wrong refer to Change Alignment, Statute Correctness, Invariance, Attack Success Rate, Clean\-correct Retention Rate, Positive Exactness, Macro Exactness, Gold Omission, and Wrong Similar Selection, respectively\.
Table 13:Metric definitions for the evaluation framework and experiments\.
## Appendix FBreakdown Results for Robustness Experiments
The main text reports aggregate robustness results\. Tables[14](https://arxiv.org/html/2605.26530#A6.T14)–[16](https://arxiv.org/html/2605.26530#A6.T16)provide compact breakdowns by perturbation category, prompt\-injection template, and confusing\-statute cluster\. These breakdowns are intended to show whether the observed trends are concentrated in a narrow subset of cases or persist across legally distinct stress conditions\.
Table 14:RQ3 results by perturbation category forLexGuard\. Scores are percentages; score columns are averaged over successfully returned cases\.Table 15:RQ4 prompt\-injection breakdown forLexGuard\. Clean and attack accuracies, ASR, invariance, and F1 are reported as percentages\.Table 16:RQ5 cluster\-level breakdown forLexGuardon non\-perfect confusing\-statute clusters\. All rate columns are percentages\.### F\.1Robustness Benchmark Construction
##### Should\-change factual perturbations for RQ3\.
The factual perturbation benchmark is built from paired original and perturbed criminal\-law fact patterns\. Each perturbed instance records whether the legal label should change, the perturbation category, and the before/after statutory labels\. In the main text, RQ3 emphasizes should\-change cases, where a model should update its statute prediction when a factual modification changes statutory applicability\.
##### Should\-not\-change prompt injection for RQ4\.
The prompt\-injection benchmark evaluates whether models preserve legal judgment under adversarial user content\. We instantiate four attack families: fabricated authority, verdict forcing, role hijacking, and format mimicking\. The attacks are appended to legal inputs while the gold legal label remains unchanged, so degradation reflects susceptibility to external instructions rather than a genuine change in case facts\.
##### Confusing statutes\.
The confusing\-statute benchmark groups legally similar provisions into clusters\. Positive cases contain at least one applicable statute from the cluster, while negative cases contain no applicable statute from that cluster\. This setup evaluates both omission errors on relevant clusters and false activation errors on irrelevant clusters\.
### F\.2Reasoning\-Augmented GPT\-5\.2 Baseline
We construct GPT\-5\.2\-J&H\-CoT by adding a task\-level reasoning scaffold to the baseline prompt while keeping the input cases and output schema fixed\. The scaffold is organized into four blocks\. First, an issue\-framing block asks the model to identify the legally relevant question raised by the case\. Second, a rule\-identification block asks for candidate statutory provisions and the legal elements that must be satisfied\. Third, an element\-mapping block requires the model to align concrete case facts with each statutory element and mark unsupported elements\. Fourth, a contrastive\-exclusion block asks the model to compare legally similar alternatives and remove provisions whose elements are not established\. Only after these blocks does the prompt request the final structured prediction\.
The scaffold is specialized for each robustness task\. For RQ3, it requires an explicit comparison between the base and perturbed facts and asks whether the perturbation changes a legally material condition before predicting the post\-perturbation statutes\. For RQ4, it separates adjudicative facts from adversarial instructions and directs the model to ground the decision only in legally relevant case content\. For RQ5, it performs cluster\-level contrastive screening: each candidate provision in the confusing\-statute cluster is checked against its elements, and the model must either select the supported article\(s\) or abstain from activating the cluster\.
This construction does not provide gold statutes, perturbation labels, solver states, or intermediate outputs fromLexGuard\. GPT\-5\.2\-J&H\-CoT is therefore a controlled reasoning\-augmented prompting baseline: it tests whether an explicit legal\-reasoning scaffold can improve a strong LLM under the same structured\-output interface used for direct GPT\-5\.2\.
GPT\-5\.2\-J&H\-Few\-shot uses the same structured output interface and evaluation scripts, but replaces the procedural reasoning scaffold with task\-specific input\-output demonstrations\. For RQ3, examples illustrate label\-preserving and label\-changing factual edits; for RQ4, examples show that adversarial appended instructions should be separated from adjudicative facts; for RQ5, examples demonstrate exact selection inside a confusing\-statute cluster and abstention when the cluster is irrelevant\. The demonstrations do not reveal test labels or solver states, and are used only to calibrate the model’s response format and decision pattern\.
## Appendix GPrompt Details
This section presents the prompt templates used in each major module of our main workflow, covering statute selection, fact extraction, structured schema alignment, formal KB\-based reasoning, and final decision explanation\. All prompts are designed to elicit interpretable and structured outputs from LLMs, facilitating integration with downstream modules and evaluation logic\.
### Statute Selection and Fact Extraction Prompts
#### Attorney Agent Prompts
##### Attorney StatuteSelectorAgent Prompt
Purpose: Select applicable legal provisions from the Criminal Law of the People’s Republic of China fordefensepurposes\.
> You are a legal expert in the field of Chinese Criminal Law\. Based on the following case description, determine the potentially applicable legal provisions from the Criminal Law of the People’s Republic of China\. Divide your output into “General Provisions” and “Specific Provisions”\. Your objective is tosupportthe prosecution’s position in this case\. Case Description:\{case\_text\} Please return strictly in the following JSON format: \{ ”general\_articles”: \[article numbers, e\.g\., 17\], ”specific\_articles”: \[article numbers, e\.g\., 234\] \}
##### Attorney FactExtractorAgent Prompt
Purpose: Extract comprehensive case facts in natural Chinese language fromdefensepurposes\.
> You are a legal fact extraction expert\. Please extract factual information as comprehensively as possible from the following case description\. Return the facts in well\-organized, semantically clear natural Chinese language\. Your goal is todefendthe case\. Case Description:\{case\_text\} Do not return JSON\. Present the facts using paragraph or bullet\-list format\.
#### Prosecutor Agent Prompts
##### Prosecutor StatuteSelectorAgent Prompt
Purpose: Select applicable legal provisions from the Criminal Law of the People’s Republic of China from theprosecutionperspective\.
> You are a legal expert in the field of Chinese Criminal Law\. Based on the following case description, determine the potentially applicable legal provisions from the Criminal Law of the People’s Republic of China\. Divide your output into “General Provisions” and “Specific Provisions”\. Your objective is toaccusein this case\. Case Description:\{case\_text\} Please return strictly in the following JSON format: \{ ”general\_articles”: \[article numbers, e\.g\., 17\], ”specific\_articles”: \[article numbers, e\.g\., 234\] \}
##### Prosecutor FactExtractorAgent Prompt
Purpose: Extract comprehensive case facts in natural Chinese language foraccusationpurposes\.
> You are a legal fact extraction expert\. Please extract as detailed factual information as possible from the following case description\. Return the facts in well\-organized, semantically clear natural Chinese language\. Your goal is toaccusein this case\. Case Description:\{case\_text\} Do not return JSON\. Present the facts using paragraph or bullet\-list format\.
### Structured Schema Extraction Prompt
Purpose: To extract structured factual elements from raw LLM outputs according to the predefined schema in our Formal KB, including both General and Specific fields\. This enables Statute\-Fact alignment in reverse verification\.
> The following two analysts provide descriptions of the case facts\. Based on their descriptions, please extract structured legal elements: Analyst 1:\{fact\_extractor\_output\} Analyst 2:\{fact\_reviewer\_text\} Please extract: 1\.General Provision Fields\(Only return those marked as true\): \{GENERAL\_FIELD\_LIST\} Example output: \{ ”age\_under\_18”: true, ”truthful\_confession\_of\_crime”: true \} 2\.Specific Provision Fields\(select values only from the given options\): \- subject: \{SPECIFIC\_FIELDS\[’subject’\]\} \- action: \{SPECIFIC\_FIELDS\[’action’\]\} \- object: \{SPECIFIC\_FIELDS\[’object’\]\} \- intent: \{SPECIFIC\_FIELDS\[’intent’\]\} Output format: \{ ”general\_facts”: \{ ”age\_under\_18”: true, … \}, ”specific\_facts”: \{ ”subject”: \[…\], ”action”: \[…\], … \} \}
### Law\-Specific Fact Slicing Prompt
Purpose: Given a specific article and its Formal KB\-defined value ranges, this prompt guides the LLM to generate a fully structured fact slice suitable for formal reasoning\.
> You are a legal expert familiar with the Criminal Law of the People’s Republic of China\. Based on the following case description and the value range of Article \{article\}, generate a valid JSON input that matches the requirements of this article\. Requirements: 1\. Output must be strictly in structured JSON format and include only the following fields: \{list\(range\_data\.keys\(\)\)\}\. 2\. Each field must be assigned a single value from the provided range\. 3\. The “Age” field can be an integer or null; all other fields are strings\. 4\. If information is insufficient, choose the most reasonable default\. 5\. Severity\-related fields such as “Condition” and “Quantity” should be chosen conservatively and leniently\. 6\. Do not include code blocks, comments, or non\-JSON content\. Case Description:\{case\_text\} Value Ranges:\{range\_json\} Example Output:\{ ”Actor”: ”person”, ”Age”: null, ”Action”: ”selling”, ”Quantity”: ”small”, ”Condition”: ”none” \}
### Final Judgment Explanation Prompt
Purpose: Used by the Judge LLM to generate a human\-readable final decision summary based on the structured results, without reanalyzing the original case\. This ensures interpretability and transparency\.
> You are a legal expert familiar with the Criminal Law of the People’s Republic of China\. Based on the following judgment results, write a final decision summary strictly in the specified format\. Requirements: \- Only and strictly use the statutes, charges, and penalties mentioned in the judgment result\. \- Do not reanalyze the case description\. \- Base your summary on the “consequences” and “model\_details” fields in the judgment\. \- \{”\.join\(requirements\)\} \- Output must strictly follow the given format and must not contain any extra commentary\. Case Description \(for reference only\):\{case\_text\} Judgment Result:\{results\_json\}
## Appendix HMotivating Example Details
This appendix provides a complete walkthrough of the motivating example discussed in the main text, illustrating how each module processes the case to produce a structured and legally grounded decision\. For completeness, we also include the two tables moved from the main text: \(i\) statute\-specific KB examples used for auto\-formalizing \(Table[17](https://arxiv.org/html/2605.26530#A8.T17)\) and \(ii\) the pseudo\-code and example execution of solver\-based statutory sentencing for drug\-related crimes \(Table[18](https://arxiv.org/html/2605.26530#A8.T18)\)\.
### H\.1Statute\-Specific Knowledge Bases for Auto\-formalizing
This subsection presents representative statute\-specific knowledge bases used in the structured fact extraction and auto\-formalizing stage\. These knowledge bases define the required legal fields and constraints for both general and specific provisions, enabling consistent schema\-based fact extraction and subsequent statute\-specific fact slicing\.
Table 17:Examples of law\-specific knowledge bases for fine\-grained fact extraction\.
### H\.2Formal Solver\-Based Statutory Reasoning
This subsection details the formal reasoning process used for statute verification and sentencing determination\. We provide the pseudo\-code of the Z3\-based solver and a concrete execution trace for Article 347 \(drug trafficking\), illustrating how legal conditions are evaluated in severity order to derive a legally admissible sentencing clause\.
Table 18:Formal reasoning for statutory sentencing: pseudo\-code and drug\-crime example\.AlgorithmExample Execution: Article 347 \(Drug Trafficking\)Input:Fact sliceff, statute\-specific rulesΦ\\Phi
Output:Sentencing clauseCC
1: Initialize solverSS, setC=∅C=\\emptyset
2: Define valid actors, actions, drug quantities, circumstances
3: Encode factsffinto solverSS
4: For each ruleri∈Φr\_\{i\}\\in\\Phi\(high→\\rightarrowlow severity\):
5: Push solver state
6: Encoderir\_\{i\}constraints:
7:r1r\_\{1\}: Large quantity or severe circumstance→\\rightarrow15y/life/death
8:r2r\_\{2\}: Medium quantity \(10g\-50g meth\)→\\rightarrow≥\\geq7y
9:r3r\_\{3\}: Small quantity \+ serious circumstance→\\rightarrow3\-7y
10:r4r\_\{4\}: Small quantity \+ none circumstance→\\rightarrow≤\\leq3y
11: Check satisfiability ofrir\_\{i\}
12: If SAT andC=∅C=\\emptyset:
13:C←C\\leftarrowconsequence linked torir\_\{i\}
14: Pop andbreak
15: Pop solver state
16: IfC=∅C=\\emptyset:C=C=”No applicable clause”
17:returnCC1: Initialize solver with empty state
2: Recognize Actor = person, Action = selling narcotics
3: Encode facts \(Actor, Action, DrugQuantity = heroin/meth<<10g\)
4: Iterate over rulesr1r\_\{1\}tor4r\_\{4\}in severity order
5: Push solver state forr1r\_\{1\}
6: Encode large quantity rule constraints
7: Evaluater1r\_\{1\}⇒\\RightarrowUNSAT \(large quantity condition not met\)
8: Push solver state forr2r\_\{2\}, encode medium quantity rule
9: Evaluater2r\_\{2\}⇒\\RightarrowUNSAT \(medium quantity condition not met\)
10: Push solver state forr3r\_\{3\}, encode small quantity \+ serious circumstance
11: Evaluater3r\_\{3\}⇒\\RightarrowUNSAT \(no serious circumstance\)
12: Push solver state forr4r\_\{4\}, encode small quantity \+ none circumstance
13: Evaluater4r\_\{4\}⇒\\RightarrowSAT \(conditions satisfied\)
14: SetC=C=”Fixed\-term imprisonment≤\\leq3 years”, pop and break
15: Discard other solver states
16: Verify non\-emptyCC, ready for return
17: Return final sentencing clause under Article 347
### H\.3Running Example Outputs
This subsection consolidates the running example into the motivating\-example appendix\. Each stage briefly states its role and then reports the corresponding output produced in the example\.
#### Full Case Description \(Model Input\)
The following is the full text of the original judgment, which serves as the input to the model:
Jiangxi Province Nanchang City Donghu District People’s Court Criminal Judgment \(2020\) Gan 0102 Xingchu No\. 347
Procuratorate: Donghu District People’s Procuratorate, Nanchang City\.
Defendant: Gong Qiang, male, born February 10, 1974, Han ethnicity, junior high school education, employee of Xinchang Power Plant, registered in Donghu District, Nanchang City, Jiangxi Province\.
Previous drug\-related administrative detentions and compulsory detoxifications occurred between 2015 and 2019\. Gong Qiang was criminally detained in November 2019 on suspicion of drug trafficking and formally arrested on November 19, 2019\.
The prosecution accused that on August 30, 2019, at 13:32, the defendant received a WeChat message from Luo requesting to purchase drugs worth RMB 1,000\. After receiving payment via WeChat, at approximately 17:00 the same day, Gong Qiang delivered 2 grams of methamphetamine to Luo\. He was arrested on November 7, 2019, and confessed to the crime\. Drug tests for both Gong and Luo returned positive for methamphetamine\.
#### Outputs from Prosecutor Agent and Attorney Agent
The role\-differentiated legal agents propose candidate statutes and extract supporting facts from different legal perspectives\.
##### Prosecutor Agent Outputs:
Statute Selector output: \{general\_articles: \[65, 67\], specific\_articles: \[347\]\}
Fact Extractor output:
- •1\.1 Defendant: Gong Qiang
- •1\.2 Buyer: Luo \(purchaser of the drugs\)
- •2\.1 August 30, 2019, 13:32: Luo messages Gong via WeChat to purchase drugs
- •2\.2 Payment: RMB 1,000 via WeChat
- •2\.3 17:00: 2g methamphetamine handed over by Gong
- •3\.1 November 7: Gong surrenders at police station
- •3\.2 Gong confesses without resistance
- •4\.1 Urine tests conducted by police
- •4\.2 Tests returned meth\-positive for both parties
- •5\.1 Gong previously convicted multiple times for similar crimes within 5 years
- •6\.1 Transaction conducted via mobile payment; highly concealed
- •6\.2 Gong fully cooperated with the investigation
##### Attorney Agent Outputs:
Statute Selector 2 output: \{general\_articles: \[67, 64\], specific\_articles: \[347\]\}
Fact Extractor 2 output:
- •August 30, 2019, 13:32: Gong receives Luo’s message
- •17:00: Drug delivery \(2g methamphetamine\)
- •November 7: Gong voluntarily turns himself in
- •RMB 1,000 payment confirmed
- •Positive drug tests for both parties
- •Prior convictions within 5 years
- •Continued criminal activity post\-incarceration
#### Autoformalize and Field Intersection Results
The autoformalizer converts the extracted facts into statute\-specific structured fields and intersects them with each candidate article’s trigger conditions\.
General Provision Structured Facts Extracted:
\{
"prior\_sentence\_served\_or\_pardoned":true,
"reoffense\_within\_5\_years":true,
"recidivist\_status":true,
"voluntary\_surrender\_with\_confession":true,
"truthful\_confession\_of\_crime":true,
"provision\_of\_significant\_clue":true,
"multiple\_crimes\_before\_sentencing":true,
"illegal\_proceeds\_obtained":true
\}
Specific Provision Structured Facts Extracted:
\{
"subject":\["person"\],
"action":\["traffic\_narcotics"\],
"object":\["narcotics"\],
"intent":\["intentional"\]
\}
Field Intersections with Trigger Conditions:
Article64required:\["illegal\_proceeds\_obtained",\.\.\.\]
Matched:\{"illegal\_proceeds\_obtained":true\}
Article65required:\["prior\_sentence\_served\_or\_pardoned","reoffense\_within\_5\_years",\.\.\.\]
Matched:\{"prior\_sentence\_served\_or\_pardoned":true,"reoffense\_within\_5\_years":true\}
Article67required:\["voluntary\_surrender\_with\_confession","truthful\_confession\_of\_crime",\.\.\.\]
Matched:\{"voluntary\_surrender\_with\_confession":true,"truthful\_confession\_of\_crime":true\}
Article347required:\["subject","action","object","intent"\]
Matched:\{
"subject":\["person"\],
"action":\["traffic\_narcotics"\],
"object":\["narcotics"\],
"intent":\["intentional"\]
\}
Final Structured Slices:
\{
"general":\[
\{
"article":64,
"fields":\{"illegal\_proceeds\_obtained":true\}
\},
\{
"article":65,
"fields":\{
"prior\_sentence\_served\_or\_pardoned":true,
"reoffense\_within\_5\_years":true
\}
\},
\{
"article":67,
"fields":\{
"voluntary\_surrender\_with\_confession":true,
"truthful\_confession\_of\_crime":true
\}
\}
\],
"specific":\[
\{
"article":347,
"fields":\{
"subject":\["person"\],
"action":\["traffic\_narcotics"\],
"object":\["narcotics"\],
"intent":\["intentional"\]
\}
\}
\]
\}
#### Verified Articles by Z3 Backward Verification
The reverse\-verification solver checks which candidate articles are legally triggerable from the structured fact slices\.
Verified General Articles: \[64, 65, 67\] Verified Specific Articles: \[347\]
#### Statute\-specific Sentencing Outputs from Z3
The statute\-specific solver computes the legal consequences attached to the verified provisions\.
Listing 1:Z3\-based Law\-specific Sentencing Outputs\{
"Article64":\{
"consequences":\["propertyshallbeconfiscatedandturnedovertothestatetreasury"\],
"model\_details":\{
"Condition":"none",
"Actor":"person",
"PropertyType":"contraband"
\}
\},
"Article65":\{
"consequences":\["theoffenderisarecidivistandshallreceiveaheavierpunishmentunderArticle65"\],
"model\_details":\{
"PriorSentenceType":"fixed\-termimprisonment",
"CrimeIntent":"intentional",
"ParoleStatus":"notonparole",
"Condition":"none",
"TimeSinceParoleCompletion":"within5years",
"NewCrimeSentenceType":"fixed\-termimprisonment",
"TimeSincePriorSentence":"within5years"
\}
\},
"Article67":\{
"consequences":\["punishmentmaybemitigated"\],
"model\_details":\{
"CompulsoryMeasure":"none",
"Outcome":"none",
"Actor":"person",
"Action":"voluntarysurrender",
"Condition":"none",
"CrimeStatus":"normal",
"Confession":"truthful"
\}
\},
"Article347":\{
"consequences":\["fixed\-termimprisonmentofnotmorethan3years,criminaldetention,orpublicsurveillance,andafine"\],
"model\_details":\{
"DrugQuantity":"heroin/meth<10g",
"Circumstance":"none",
"Condition":"none",
"Actor":"person",
"Action":"selling",
"Age":"notprovided\(assumed\>=18forArticle347\)"
\}
\}
\}
#### Final Verdict Generated by LLM
The final judge agent renders the verified statutes and statute\-specific consequences into a natural\-language judgment\.
Charge:Drug Trafficking Judgment:Fixed\-term imprisonment of 2 years and a fine of RMB 10,000; confiscation of the illegal substances, which are to be turned over to the state treasury\. Applicable Provisions: Article 64, Article 65, Article 67, Article 347
General Provisions:
- •Article 64: Illegal drugs are contraband and must be confiscated and turned over to the state treasury\.
- •Article 65: Defendant committed the same crime within five years of serving a sentence, thus considered a recidivist and subject to harsher punishment\.
- •Article 67: Defendant voluntarily surrendered and confessed truthfully, qualifying for leniency\.
Specific Provision:
- •Article 347: Selling 2g of methamphetamine constitutes low\-quantity drug trafficking punishable by up to three years’ imprisonment or criminal detention, plus a fine\.
Final Sentencing Outcome:Two years’ fixed\-term imprisonment
Reasoning:The defendant’s act of selling 2g of methamphetamine meets the criteria for Article 347\. As a recidivist under Article 65, he is subject to heavier punishment\. However, under Article 67, his voluntary surrender and truthful confession allow for mitigation\. Article 64 mandates confiscation of contraband\. Thus, the final sentence includes imprisonment, fine, and confiscation\.Similar Articles
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
This paper introduces REFLECT, a meta-evaluation benchmark for assessing the reliability of LLM judges in evaluating deep research agents. Experiments show current LLM judges remain unreliable, with overall accuracies below 55% across reasoning, tool-use, and report-quality failures.
Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law
This paper empirically studies LLMs' legal reasoning in tax law, showing that data contamination inflates performance and that neuro-symbolic hybrid systems offer more reliable and robust generalization than monolithic LLMs.
Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning
This paper identifies a systematic gap between legal interpretation and formal logic in AI legal reasoning, proposes a neuro-symbolic approach to bridge it, and demonstrates substantial label shifts when re-annotating legal NLI data under strict formal entailment.
Why most legal-AI demos fail in production
The article details three common failure modes for legal AI systems in production: treating all sources as equally credible, failing to handle conflicting legal opinions, and lacking firm-specific institutional knowledge. It suggests solutions such as authority weighting, disagreement detection, and annotation layers to build trust and utility.
Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
This paper investigates how incorporating web retrieval into LLM agents can degrade safety alignment, revealing the 'Safe Source Paradox' where even safety-oriented documents increase harmful compliance. It introduces the AgentREVEAL diagnostic framework and HarmURLBench benchmark to analyze and evaluate retrieval-induced safety vulnerabilities.