LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation
Summary
This paper introduces LP-Eval, a rubric and dataset for evaluating legal proposition generation by large language models, with annotations by legal experts. Results show that rubric-guided LLM evaluations align more closely with expert assessments than direct scoring.
# Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

Source: [https://arxiv.org/html/2605.19815](https://arxiv.org/html/2605.19815)

Johan Lindholm (johan.lindholm@umu.se), Umeå University, Umeå, Sweden; Amogh Raina (amogh.raina@di.ku.dk), University of Copenhagen, Copenhagen, Denmark; Henrik Palmer Olsen (henrik@jur.ku.dk), University of Copenhagen, Copenhagen, Denmark; and Daniel Hershcovich (dh@di.ku.dk), University of Copenhagen, Copenhagen, Denmark

###### Abstract

Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts’ annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than for those derived from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.

Keywords: Natural Legal Language Processing, Large Language Models, Legal Proposition, Evaluation Rubric

CCS Concepts: Applied computing: Law; Computing methodologies: Natural language generation; Computing methodologies: Language resources

## 1. Introduction

Legal work, whether in public administration, private corporations, solicitors’ or barristers’ offices, or courts, centers on determining what the law ‘is’. Only if we know what the law is can we apply it to the case before us. Lawyers therefore constantly have to state what the law is. Such statements, often called statements on ‘valid law’, articulate the content of general legal rules and principles in a given jurisdiction. Such claims, which we refer to in this paper as Legal Propositions (LPs), are statements on legal norms (rules or principles; that is, they are normative rather than factual statements) concerning their scope, conditions, or consequences. LPs are foundational to doctrinal legal scholarship: the central aim of legal textbooks and legal commentary (often published as journal articles) is to state what the law is by formulating and refining such propositions.

Automatic Legal Proposition Generation

- Paragraph from CJEU Decision: That article therefore lays down strict conditions for the adaptation, by the competent authority of the executing State, … under Article 8(1) of that framework decision, to recognise a judgment which has been forwarded to it and to execute the sentence …
- Citation Paragraphs: In particular, Article 8 of Framework Decision 2008/909 …
- Legal Proposition generated by GPT-OSS: Article 8(1) of Framework Decision 2008/909 imposes a general duty on the executing State to recognise and execute a forwarded judgment …

Expert Annotation based on LP-Eval Rubric

- Formal Components: Stance ✓, Object ✓, Specification ✓ (VALID)
- Quality Dimensions: Source Independence 2/3, Fact Independence 3/3, Conciseness 3/3, Generality 3/3, Fidelity 3/3
- Overall: 3/3 (Excellent)

Figure 1. Example of legal proposition generation and expert annotation based on the LP-Eval rubric.

Large-scale collections of legal propositions would have substantial practical value. Legal propositions are lawyers’ primary unit of legal normativity; without them, legal analysis and knowledge organization become markedly more difficult. Today, legal propositions are still constructed exclusively by hand, either by legal academics when they write textbooks or legal commentary, or by legal practitioners when they write pleas or decisions. For example, in the U.S., leading providers of legal information use human editors to generate legal propositions for court decisions, which they refer to as Headnotes (Eshelman, [2018](https://arxiv.org/html/2605.19815#bib.bib11)). Automating legal proposition generation, the task of producing legal statements based on case decision paragraphs and citation contexts (see the upper panel of [Fig 1](https://arxiv.org/html/2605.19815#S1.F1)), has the potential to improve both the coverage and quality of a wide range of LegalTech applications and downstream legal reasoning pipelines.

Despite its practical importance, automatic legal proposition generation has received comparatively limited attention in prior Legal NLP and LegalTech research. Much prior work on legal reasoning has focused either on the sources of law (legislation and prior decisions), e.g., precedent case retrieval (T.y.s.s. et al., [2024](https://arxiv.org/html/2605.19815#bib.bib28)), or on the narrative of case facts, e.g., judgment outcome prediction and related classification tasks (Chalkidis et al., [2019](https://arxiv.org/html/2605.19815#bib.bib6)). Legal proposition generation is complementary to these lines of work: it targets the interface between “raw” legal sources and the fact patterns of new disputes by producing normative statements that can be invoked, combined, and applied in subsequent analysis. As such, it sits at the *nexus* between legal authority and factual context, offering a potentially useful intermediate representation for both human and machine legal reasoning.

A key challenge in legal proposition generation is evaluation. As an open-ended task, there is rarely a single “gold” reference: multiple outputs may be valid, while small wording differences can alter legal meaning. Consequently, n-gram-based metrics such as BLEU (Papineni et al., [2002](https://arxiv.org/html/2605.19815#bib.bib23)) and ROUGE (Lin, [2004](https://arxiv.org/html/2605.19815#bib.bib20)) are ill-suited, as they capture surface overlap rather than legal content. Another challenge is grounding: propositions must be faithful to authoritative sources. Recent work explores *LLM-as-judge* for evaluation (Li et al., [2024b](https://arxiv.org/html/2605.19815#bib.bib18)), but this approach remains vulnerable to hallucinations (Li et al., [2024a](https://arxiv.org/html/2605.19815#bib.bib19)), such as accepting unsupported claims or fabricated citations (Dahl et al., [2024](https://arxiv.org/html/2605.19815#bib.bib9)).

These concerns have motivated recent NLP work on rubric-based evaluation protocols that decompose quality into explicit, structured, human-designed criteria (e.g., faithfulness and conciseness) (Hashemi et al., [2024](https://arxiv.org/html/2605.19815#bib.bib15)), rather than a single holistic score. For instance, TN-Eval (Shah et al., [2025](https://arxiv.org/html/2605.19815#bib.bib25)) develops a rubric and associated protocols to assess the quality of behavioral therapy notes and reports improved reliability and interpretability compared to traditional evaluation approaches.

Motivated by these challenges, we investigate the ability of LLMs to generate and evaluate legal propositions from decisions of the Court of Justice of the European Union (CJEU). To our knowledge, this is the first study in legal NLP to systematically examine both the automatic generation and evaluation of legal propositions. Our contributions are as follows (our dataset, code, and appendix are available at [https://github.com/sxu3/LP-Eval](https://github.com/sxu3/LP-Eval)):

1) We introduce LP-Eval, a three-step rubric co-designed with legal experts to support structured and consistent evaluation of generated legal propositions.
2) We release the LP-Eval dataset, consisting of expert annotations for LLM-generated legal propositions following the LP-Eval guidelines.
3) Through quantitative experiments, we show that LLMs can act as both generators and evaluators of legal propositions, though with notable limitations: experts rate propositions from well-established cases higher than those from recent cases, while LLM-based evaluators, despite improved alignment when using LP-Eval, fail to capture such finer-grained distinctions.

## 2. Task Definition

We examine LLMs’ ability to generate and evaluate LPs using decisions of the CJEU. Our experiments comprise two tasks: *LLM as generator* and *LLM as evaluator*.

LLM as generator: We prompt off-the-shelf LLMs to generate LPs using an expert-designed template that provides the task instructions, the definition of an LP, the focus paragraph, and its cited context. [§ 4.2](https://arxiv.org/html/2605.19815#S4.SS2) offers details of the prompt and implementation. Two legal experts evaluate the generated propositions using the LP-Eval rubric ([§ 4.3](https://arxiv.org/html/2605.19815#S4.SS3)); we treat their evaluations as ground truth for generation quality.

LLM as evaluator: We prompt LLMs to evaluate generated propositions using a template derived from the LP-Eval rubric. Evaluator performance is assessed by measuring agreement between LLM scores and expert annotations. Details of the evaluation prompt and implementation are given in [§ 4.4](https://arxiv.org/html/2605.19815#S4.SS4).

## 3. LP-Eval Rubric

Building on prior work in rubric design (Dawson, [2017](https://arxiv.org/html/2605.19815#bib.bib10); Galvan-Sosa et al., [2025](https://arxiv.org/html/2605.19815#bib.bib13)), our rubric comprises two elements: (1) required LP COMPONENTS (Stance, Object, and Specification) and (2) quality DIMENSIONS that distinguish stronger from weaker outputs, namely Source Independence, Fact Independence, Conciseness, Generality, and Fidelity. The lower panel of [Fig 1](https://arxiv.org/html/2605.19815#S1.F1) illustrates the rubric. Due to space constraints, a full description is provided in [Appendix B](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf).

We developed the rubric through a two-step co-design process with legal experts. First, interviews with two European law professors identified key components and dimensions, which were iteratively refined and summarized into an initial draft. Second, experts applied the rubric to a sample of LLM-generated LPs and gave feedback, which we used to assess its discriminative ability and refine it further.
Finally, a computational linguist formalized the evaluation protocol, which we used to develop expert annotation guidelines for human evaluation and a prompt template for LLM-as-judge automatic evaluation.

## 4. LP-Eval Dataset

The dataset consists of 50 paragraphs sampled from 10 CJEU decisions. For each paragraph, we include two LLM-generated LPs, manual annotations of the quality of the generated LPs from two legal professionals following the LP-Eval rubric, and corresponding automatic evaluations produced by GPT-OSS (Agarwal et al., [2025](https://arxiv.org/html/2605.19815#bib.bib3)), an open LLM.

### 4.1. CJEU Cases Sampling

We sampled the 10 CJEU decisions from a proprietary database compiled and curated by LawLibrary.AI (https://www.lawlibrary.ai/) from different official sources. Previous work reveals that LLMs tend to memorize parts of their training data (Carlini et al., [2021](https://arxiv.org/html/2605.19815#bib.bib5)). A possible concern is therefore that LLMs may generate LPs based on their memorization of the pretraining data. This is particularly likely for older and more frequently discussed court decisions. To test for such effects, we selected a combination of *well-established* and *recent* decisions. The *well-established* group consists of five highly cited and widely discussed decisions, spread across time, that were manually selected by the legal experts (the full list of selected cases is available in [Appendix D](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf)). We selected five paragraphs per case, resulting in a dataset of 50 paragraphs (see [Appendix D](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf) for sampling details). Descriptive statistics are reported in [Appendix E](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf).
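
To make the annotation layout concrete, the following is a minimal sketch of how one rubric annotation could be represented. The class name, field names, and the `check` helper are hypothetical and are not taken from the released dataset; the validity rule (all three components present) and the example scores follow the annotated example in Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict

# Illustrative names only; the released dataset may use a different schema.
COMPONENTS = ("stance", "object", "specification")
DIMENSIONS = ("source_independence", "fact_independence",
              "conciseness", "generality", "fidelity")


@dataclass
class RubricAnnotation:
    """One annotator's LP-Eval judgment of a single generated LP."""
    components: Dict[str, bool]                                # Step 1: component present?
    dimensions: Dict[str, int] = field(default_factory=dict)  # Step 2: scores from 1 to 3
    overall: int = 0                                           # Step 3: 1 to 3, 0 if not rated

    @property
    def is_valid(self) -> bool:
        # Assumed rule: an LP is VALID only when all three components are present.
        return all(self.components.get(name, False) for name in COMPONENTS)

    def check(self) -> None:
        # Reject unknown dimensions and scores outside the 1-3 scale.
        for name, score in self.dimensions.items():
            if name not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {name}")
            if score not in (1, 2, 3):
                raise ValueError(f"{name} must be 1, 2 or 3, got {score}")


# The Figure 1 example: all components present, Source Independence 2/3, others 3/3.
figure1 = RubricAnnotation(
    components={name: True for name in COMPONENTS},
    dimensions={"source_independence": 2, "fact_independence": 3,
                "conciseness": 3, "generality": 3, "fidelity": 3},
    overall=3,
)
figure1.check()
print(figure1.is_valid)  # True

# Dataset size implied by the sampling: 10 decisions x 5 paragraphs, 2 LPs each.
print(10 * 5, 10 * 5 * 2)  # 50 100
```

Keeping the component check separate from the dimension scores mirrors the rubric's ordering, in which formal validity is assessed before quality.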
### 4.2. LLM Legal Proposition Generation

We employed three models for proposition generation:

(1) GPT-OSS: a large-scale open-source model with 120 billion parameters (Agarwal et al., [2025](https://arxiv.org/html/2605.19815#bib.bib3)).
(2) OLMo-3-7B-Instruct: a 7-billion-parameter instruction-tuned model (Olmo et al., [2025](https://arxiv.org/html/2605.19815#bib.bib22)).
(3) Saul-7B-Instruct: a 7-billion-parameter model specifically trained on legal data (Colombo et al., [2024](https://arxiv.org/html/2605.19815#bib.bib8)).

All models were configured with a temperature of 0.3 to balance determinism and output diversity. See [Appendix A](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf) for implementation details and the expert-crafted LP generation prompts.

### 4.3. Expert Evaluation Collection

We recruited a final-year law student and a research assistant with a PhD in law to carry out the annotation. Both were paid salaries set by current collective bargaining agreements through their respective universities. They conducted the annotations following the LP-Eval annotation protocol. Due to resource constraints, human evaluation was limited to the LPs generated by GPT-OSS and OLMo-3, which demonstrated the most promising quality in an initial expert assessment. The annotation was done on the Label Studio platform (Tkachenko et al., [2020](https://arxiv.org/html/2605.19815#bib.bib27)).

Inter-Annotator Agreement: We observe a strong skew toward high ratings in the experts’ annotations. Only 5 out of 200 generated LPs are labeled as INVALID because they lack one of the required formal components (i.e., specification, object, or stance). Across the three quality dimensions rated on a 1–3 Likert scale, the annotation distributions exhibit substantial class imbalance, with the highest rating (“3”) assigned to the majority of items in all dimensions.
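
When ratings are this concentrated on the top score, the choice of agreement statistic matters. The sketch below is illustrative only and is not the authors' code: it computes Gwet's AC1 and Cohen's kappa for two raters using their standard two-rater definitions. The function names and the toy ratings are invented for this example and are not drawn from the LP-Eval dataset.

```python
from collections import Counter
from typing import Sequence


def _observed_agreement(r1: Sequence[int], r2: Sequence[int]) -> float:
    if len(r1) != len(r2) or not r1:
        raise ValueError("both raters must rate the same non-empty set of items")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def gwet_ac1(r1: Sequence[int], r2: Sequence[int], categories: Sequence[int]) -> float:
    """Gwet's AC1 for two raters over a fixed set of categories."""
    n, q = len(r1), len(categories)
    pa = _observed_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Average share of ratings falling in each category across both raters.
    pi = [(c1[k] + c2[k]) / (2 * n) for k in categories]
    # Chance agreement under AC1.
    pe = sum(p * (1 - p) for p in pi) / (q - 1)
    return (pa - pe) / (1 - pe)


def cohen_kappa(r1: Sequence[int], r2: Sequence[int], categories: Sequence[int]) -> float:
    """Cohen's kappa for two raters; undefined if chance agreement equals 1."""
    n = len(r1)
    pa = _observed_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from the product of each rater's marginal proportions.
    pe = sum((c1[k] / n) * (c2[k] / n) for k in categories)
    return (pa - pe) / (1 - pe)


# Toy, heavily skewed ratings on a 1-3 scale: the raters agree on 18 of 20 items.
rater_a = [3] * 17 + [3, 2, 2]
rater_b = [3] * 17 + [2, 3, 2]
scale = (1, 2, 3)

print(round(gwet_ac1(rater_a, rater_b, scale), 3))     # approx. 0.89
print(round(cohen_kappa(rater_a, rater_b, scale), 3))  # approx. 0.444
```

On these toy ratings the raters agree on 90% of items, yet kappa is far lower than AC1 because almost all ratings fall in one category.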
Given the pronounced class imbalance, we assess inter-annotator agreement using *Gwet’s AC1* (Gwet, [2008](https://arxiv.org/html/2605.19815#bib.bib14)), consistent with prior work (Battisti et al., [2024](https://arxiv.org/html/2605.19815#bib.bib4)). We prefer AC1 over the more common Cohen’s Kappa due to its robustness against the *prevalence paradox* (Feinstein and Cicchetti, [1990](https://arxiv.org/html/2605.19815#bib.bib12)), which can artificially deflate reliability estimates in skewed distributions despite high observed agreement. We obtained a high agreement score (AC1 = 0.96), indicating strong consistency among annotators and validating the clarity of the LP-Eval rubric guidelines.

### 4.4. Automatic LLM Evaluation

We also assess proposition quality via an LLM-as-a-judge approach, using GPT-OSS, OLMo-3, and Saul-LM as evaluators. All evaluator models were configured with a temperature of 0 to ensure deterministic behavior and maximize reproducibility. The evaluation prompts were constructed using templates derived from the LP-Eval rubric (see [Appendix C](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf)).

### 4.5. Dataset Statistics

Table 1. Dataset statistics of the 100 selected CJEU case text paragraphs and generated LPs (mean ± std).

[Tab 1](https://arxiv.org/html/2605.19815#S4.T1) shows that paragraph lengths are comparable across categories, with *recent* cases averaging 174.56 tokens and *well-established* cases 159.24 tokens (standard deviations 50.75 and 68.79, respectively). Generated LPs are substantially shorter than their source paragraphs for both models: GPT-OSS produces LPs averaging 48.92 tokens, while OLMo-3 produces slightly longer LPs at 51.98 tokens, i.e., roughly 29–31% of the average paragraph length (166.90 tokens).

The length of generated LPs also varies by model and case category. OLMo-3 generates longer LPs than GPT-OSS for *recent* cases (59.12 vs. 52.36 tokens), while their outputs are nearly identical for *well-established* cases (44.84 vs. 45.48). Saul-LM produces slightly longer LPs (53.30 tokens) than GPT-OSS (48.92) and is close to OLMo-3 (51.98). The category trend is consistent: Saul-LM generates longer LPs for *recent* cases (59.84) than for *well-established* cases (46.76). Notably, Saul-LM shows higher variance in LP length, especially for *recent* cases, which suggests less consistent compression behavior than GPT-OSS or OLMo-3.

## 5. Experiments

### 5.1. LLM as LP generator

Expert Evaluation of LLM-Generated Legal Propositions: We evaluate the quality of LLM-generated LPs through expert assessment following our LP-Eval rubric. An LP is considered valid only if it contains all three required components: *Stance*, *Object*, and *Specification*. Across the dataset, experts judged the vast majority of generated LPs to be formally valid: out of 100 LPs, 95 were rated as VALID. At the formal COMPONENT level, experts consistently identified the presence of both *Stance* and *Specification* in all generated LPs. The five invalid cases all failed due to a missing *Object* component, defined as a normative (rather than factual) statement. Furthermore, the *Object* component was a significant source of inter-annotator disagreement. According to the qualitative study by legal experts (see [Appendix F](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf)), legal norms consist of different subtypes, which may have confused the expert annotators.

| fact_ind | fidelity | concise | src_ind | general | overall |
|---|---|---|---|---|---|
| 2.96 ± 0.19 | 2.95 ± 0.23 | 2.48 ± 0.62 | 2.31 ± 0.63 | 2.65 ± 0.64 | 2.50 ± 0.60 |

Table 2. Mean and standard deviation of expert annotations (Likert scores, 1–3).

[Tab 2](https://arxiv.org/html/2605.19815#S5.T2) summarizes expert evaluations at Step 2 (quality DIMENSIONS) and Step 3 (OVERALL quality scores).
Experts assigned consistently high scores for *Fact Independence* (mean = 2.96) and *Fidelity* (mean = 2.95), indicating that the generated propositions are largely abstracted from case-specific facts and are well supported by the focus paragraph and its surrounding context. *Conciseness* and *Source Independence* received slightly lower but still positive average scores (means = 2.48 and 2.31, respectively), reflecting occasional redundancy or explicit references to legal sources. The OVERALL quality score averaged 2.5, which is high on our 3-point Likert scale. Valid propositions frequently received ratings of 3.0 (“Excellent”) or 2.0 (“Satisfactory”), suggesting that while experts prefer more source-independent drafting, they generally regard the generated propositions as competent and legally sound summaries of the underlying norms.

| | recent mean | established mean | t statistic | p value |
|---|---|---|---|---|
| fact_ind | 2.95 | 2.98 | -1.13 | 0.26 |
| fidelity | 2.94 | 2.97 | -0.91 | 0.37 |
| general | 2.57 | 2.73 | -1.77 | 0.08 |
| concise | 2.38 | 2.58 | -2.27 | 2e-02 \*\* |
| src_ind | 2.11 | 2.52 | -4.74 | 4e-06 \*\* |
| overall | 2.35 | 2.66 | -3.79 | 2e-04 \*\* |

Table 3. T-tests comparing LPs generated from *well-established* cases to those from *recent* cases.

Quality Difference in Well-Established vs. Recent Cases: We compared expert evaluation scores for generated LPs derived from *well-established* cases against those from *recent* cases. For each quality dimension, we computed the mean expert score per case-prominence group (established mean vs. recent mean) and performed paired two-sample t-tests to assess differences between the two groups. As reported in [Tab 3](https://arxiv.org/html/2605.19815#S5.T3), statistically significant differences emerge for Conciseness, Source Independence, and OVERALL quality. LPs generated from *well-established* cases were rated as more concise ($p=0.02$) and more source-independent ($p<10^{-5}$).
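
The group comparison can be illustrated with a small sketch. The exact test variant is not specified here beyond "paired two-sample t-tests", so the code below substitutes Welch's unequal-variance t statistic as a stand-in, implemented with the standard library only. The function name and the ratings are hypothetical and are not taken from the dataset.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t, df) for mean(a) - mean(b) using Welch's unequal-variance test."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error of each group mean (sample variance / group size).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    if va + vb == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical 1-3 ratings for one dimension, invented for illustration.
recent = [2, 2, 3, 2, 3, 2, 2, 3, 2, 2]
established = [3, 3, 2, 3, 3, 3, 2, 3, 3, 3]

t_stat, dof = welch_t(recent, established)
print(round(t_stat, 2), round(dof, 1))  # t is negative: the recent mean is lower
```

Computing the difference as recent minus established gives a negative statistic whenever the recent mean is lower, which is consistent with the signs reported in Table 3. A p-value additionally requires the t distribution's CDF; in practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.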
These advantages are reflected in the OVERALL quality scores, which are significantly higher for *well-established* cases (mean = 2.66 vs. 2.35, $p<0.001$). These results suggest that while core properties, such as factual abstraction and contextual alignment, are stable across case types, higher-level qualities related to abstraction and formulation vary with case prominence. A qualitative analysis by legal experts (see [Appendix F](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf)) suggests these disparities may stem from inherent structural differences between the case categories.

### 5.2. LLM as LP Evaluator

Alignment Between LLM and Expert Evaluations: Among the three LLMs tested, OLMo-3 and Saul-LM assigned maximum scores to all propositions (zero variance); we therefore focus on the analysis of GPT-OSS’s evaluations. GPT-OSS achieves Gwet’s AC1 agreement of 0.91 and 0.93 with the two experts. While slightly lower than the inter-expert agreement of 0.94, these values indicate comparable agreement when GPT-OSS follows the LP-Eval protocol. To test the LP-Eval rubric’s efficacy, we prompted GPT-OSS in a *one-go* setting: it evaluates only the OVERALL quality score, skipping all intermediate rubric steps for formal COMPONENTS and quality DIMENSIONS (see [Appendix C](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf)). Agreement drops to 0.85 (a relative decrease of approximately 7%), demonstrating that structured rubric guidance substantially improves alignment with expert judgments.

LLM-Judge Sensitivity to Case Prominence: We conducted a t-test comparing GPT-OSS evaluations of LPs derived from *well-established* and *recent* cases but found no statistically significant differences between the two groups. In contrast, expert evaluators identified statistically significant distinctions across these case categories.
This suggests that while GPT-OSS broadly aligns with experts at an overall level, it is less sensitive to the nuanced qualitative differences captured by human experts.

## 6. Related Work

Legal NLP: Recent work in computational law has explored various Legal NLP tasks such as legal case retrieval (Ma et al., [2021](https://arxiv.org/html/2605.19815#bib.bib21)), judgment prediction (T.y.s.s et al., [2022](https://arxiv.org/html/2605.19815#bib.bib29)), legal case summarization (Agarwal et al., [2022](https://arxiv.org/html/2605.19815#bib.bib2)), vulnerability detection (Xu et al., [2023a](https://arxiv.org/html/2605.19815#bib.bib32)), and drafting support for legislative text (Chouhan and Gertz, [2024](https://arxiv.org/html/2605.19815#bib.bib7)). Across these research lines, proposition-level representations remain largely implicit in statute-based formulations (Holzenberger et al., [2020](https://arxiv.org/html/2605.19815#bib.bib16)) and are only partially captured in case-law settings (Santosh et al., [2025](https://arxiv.org/html/2605.19815#bib.bib24)). Our work targets this gap by focusing on proposition-level representations derived from case law and assessing whether language models can generate and preserve the normative propositional content that supports downstream legal reasoning.

LLM-as-a-Judge: LLM-as-a-judge evaluation is increasingly used as a scalable alternative to expert human assessment for generation tasks, but its reliability depends heavily on the evaluation protocol and prompt design (Zheng et al., [2023](https://arxiv.org/html/2605.19815#bib.bib35)). Recent analyses show that zero-shot LLM evaluators can exhibit systematic biases (Stureborg et al., [2024](https://arxiv.org/html/2605.19815#bib.bib26)). To reduce ambiguity and improve reproducibility, recent work emphasizes more explicit rubric construction.
CheckEval decomposes subjective criteria into Boolean checklist questions, improving reliability and inter-evaluator agreement relative to single scalar ratings (Lee et al., [2024](https://arxiv.org/html/2605.19815#bib.bib17)).

## 7. Limitations

Due to computational constraints, we do not perform a causal analysis of pre-training data and its influence on model outputs. Prior work shows that LLMs can reflect biases in their training data (Xu et al., [2025b](https://arxiv.org/html/2605.19815#bib.bib31)), motivating future work on detecting and mitigating memorized biases. Owing to space limitations, we emphasize quantitative analysis, with detailed qualitative results provided in [Appendix F](https://github.com/sxu3/LP-Eval/raw/main/LP-Eval_appendix.pdf). Finally, while expert annotations show high overall agreement, disagreements persist on abstract components (e.g., *Object*). This is consistent with prior findings on limited inter-expert agreement in legal NLP (Xu et al., [2023b](https://arxiv.org/html/2605.19815#bib.bib34)) and imperfect alignment between models and inherently diverse human judgments (Xu et al., [2024](https://arxiv.org/html/2605.19815#bib.bib33)). Given the high-stakes legal setting, future work should better incorporate annotation variability to preserve pluralistic human values and support human-centered AI systems (Xu et al., [2025a](https://arxiv.org/html/2605.19815#bib.bib30)).

## 8. Conclusion

We investigate the automatic generation and evaluation of legal propositions (LPs), a core yet underexplored unit of legal reasoning. In collaboration with legal experts, we introduce LP-Eval, a rubric-based evaluation framework for assessing both the validity and quality of generated LPs, along with corresponding evaluation protocols for both expert annotation and LLM-as-judge evaluation.
We further release the LP-Eval dataset, a curated corpus of CJEU cases and LLM-generated legal propositions annotated by two legal experts following the LP-Eval rubric. Our experimental results show that modern LLMs can generate generally well-formed and high-quality legal propositions, while exhibiting systematic quality differences across case-prominence types. We also find that rubric-guided LLM evaluation aligns more closely with expert judgments than direct overall scoring, but remains insufficiently sensitive to finer-grained distinctions captured by human evaluators. These findings highlight both the promise and current limitations of LLMs for legal proposition generation and evaluation.

###### Acknowledgements

We thank the anonymous reviewers for valuable comments. This paper is supported by the Independent Research Fund Denmark (DFF) ALIKE grant 426000028B.

## References

- Agarwal et al. (2022) Abhishek Agarwal, Shanshan Xu, and Matthias Grabmair. 2022. Extractive Summarization of Legal Decisions using Multi-task Learning and Maximal Marginal Relevance. In *Findings of EMNLP 2022*, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 1857–1872. [doi:10.18653/v1/2022.findings-emnlp.134](https://doi.org/10.18653/v1/2022.findings-emnlp.134)
- Agarwal et al. (2025) Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. 2025. gpt-oss-120b & gpt-oss-20b model card. *arXiv preprint arXiv:2508.10925* (2025).
- Battisti et al. (2024) Alessia Battisti, Katja Tissi, et al. 2024. Advancing Annotation for Continuous Data in Swiss German Sign Language. In *Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources*. ELRA and ICCL, Torino, Italia, 1–12. [https://aclanthology.org/2024.signlang-1.1/](https://aclanthology.org/2024.signlang-1.1/)
- Carlini et al. (2021) Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Language Models. In *30th USENIX Security Symposium (USENIX Security 21)*. USENIX Association, 2633–2650.
- Chalkidis et al. (2019) Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural Legal Judgment Prediction in English. In *Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 4317–4323. [doi:10.18653/v1/P19-1424](https://doi.org/10.18653/v1/P19-1424)
- Chouhan and Gertz (2024) Ashish Chouhan and Michael Gertz. 2024. LexDrafter: Terminology Drafting for Legislative Documents Using Retrieval Augmented Generation. In *Proceedings of the LREC-COLING 2024*. ELRA and ICCL. [https://aclanthology.org/2024.lrec-main.913/](https://aclanthology.org/2024.lrec-main.913/)
- Colombo et al. (2024) Pierre Colombo, Telmo Pessoa Pires, et al. 2024. SaulLM-7B: A pioneering large language model for law. arXiv (2024).
- Dahl et al. (2024) Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. 2024. Large legal fictions: Profiling legal hallucinations in large language models. *Journal of Legal Analysis* 16, 1 (2024), 64–93.
- Dawson (2017) Phillip Dawson. 2017. Assessment rubrics: towards clearer and more replicable design, research and practice. *Assessment & Evaluation in Higher Education* 42, 3 (2017), 347–360.
- Eshelman (2018) Michael O Eshelman. 2018. A History of the Digests. *Law Library Journal* 110, 2 (2018), 235–260.
- Feinstein and Cicchetti (1990) Alvan R Feinstein and Domenic V Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. *Journal of Clinical Epidemiology* 43, 6 (1990), 543–549.
- Galvan-Sosa et al. (2025) Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, et al. 2025. Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset. In *Proceedings of ACL 2025*. Association for Computational Linguistics, Vienna, Austria, 23800–23839. [doi:10.18653/v1/2025.acl-long.1160](https://doi.org/10.18653/v1/2025.acl-long.1160)
- Gwet (2008) Kilem Li Gwet. 2008. Computing inter-rater reliability and its variance in the presence of high agreement. *Brit. J. Math. Statist. Psych.* 61, 1 (2008), 29–48.
- Hashemi et al. (2024) Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. 2024. LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*. Association for Computational Linguistics. [doi:10.18653/v1/2024.acl-long.745](https://doi.org/10.18653/v1/2024.acl-long.745)
- Holzenberger et al. (2020) Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2020. A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering. In *Proceedings of the Natural Legal Language Processing Workshop 2020*. [https://arxiv.org/abs/2005.05257](https://arxiv.org/abs/2005.05257)
- Lee et al. (2024) Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Jaewook Kang, Pilsung Kang, and Najoung Kim. 2024. CheckEval: A Reliable LLM-as-a-Judge Framework for Evaluating Text Generation Using Checklists. *arXiv preprint arXiv:2403.18771* (2024). [https://arxiv.org/abs/2403.18771](https://arxiv.org/abs/2403.18771)
- Li et al. (2024b) Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024b. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. *arXiv preprint arXiv:2412.05579* (2024). [https://arxiv.org/abs/2412.05579](https://arxiv.org/abs/2412.05579)
- Li et al. (2024a) Junyi Li, Jie Chen, et al. 2024a. The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models. In *Proceedings of ACL 2024*, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 10879–10899. [doi:10.18653/v1/2024.acl-long.586](https://doi.org/10.18653/v1/2024.acl-long.586)
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In *Text Summarization Branches Out*. Association for Computational Linguistics, Barcelona, Spain, 74–81. [https://aclanthology.org/W04-1013/](https://aclanthology.org/W04-1013/)
- Ma et al. (2021) Yixiao Ma, Yunqiu Shao, Yueyue Wu, Yiqun Liu, Ruizhe Zhang, Min Zhang, and Shaoping Ma. 2021. LeCaRD: A Legal Case Retrieval Dataset for Chinese Law System. In *Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21)*. ACM, 2342–2348.
- Olmo et al. (2025) Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. 2025. Olmo 3. *arXiv preprint arXiv:2512.13961* (2025).
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In *Proceedings of ACL 2002*, Pierre Isabelle, Eugene Charniak, and Dekang Lin (Eds.). Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. [doi:10.3115/1073083.1073135](https://doi.org/10.3115/1073083.1073135)
- Santosh et al. (2025) T. Y. S. S. Santosh, Isaac Misael Olguín Nolasco, and Matthias Grabmair. 2025. LeCoPCR: Legal Concept-guided Prior Case Retrieval for European Court of Human Rights cases. In *Findings of NAACL 2025*. Association for Computational Linguistics. [doi:10.18653/v1/2025.findings-naacl.89](https://doi.org/10.18653/v1/2025.findings-naacl.89)
- Shah et al. (2025) Raj Sanjay Shah, Lei Xu, Qianchu Liu, Jon Burnsky, Andrew Bertagnolli, and Chaitanya Shivade. 2025. TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)*, Georg Rehm and Yunyao Li (Eds.). Association for Computational Linguistics, Vienna, Austria, 179–199. [doi:10.18653/v1/2025.acl-industry.14](https://doi.org/10.18653/v1/2025.acl-industry.14)
- Stureborg et al. (2024) Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. 2024. Large Language Models are Inconsistent and Biased Evaluators. *arXiv preprint arXiv:2405.01724* (2024). [https://arxiv.org/abs/2405.01724](https://arxiv.org/abs/2405.01724)
- Tkachenko et al. (2020) Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. 2020. Label Studio: Data labeling software. [https://github.com/HumanSignal/label-studio](https://github.com/HumanSignal/label-studio)
- T.y.s.s. et al. (2024) Santosh T.y.s.s., Rashid Haddad, and Matthias Grabmair. 2024. ECtHR-PCR: A Dataset for Precedent Understanding and Prior Case Retrieval in the European Court of Human Rights. In *Proceedings of the LREC-COLING 2024*. ELRA and ICCL, Torino, Italia, 5473–5483. [https://aclanthology.org/2024.lrec-main.486/](https://aclanthology.org/2024.lrec-main.486/)
- T.y.s.s et al. (2022) Santosh T.y.s.s, Shanshan Xu, Oana Ichim, and Matthias Grabmair. 2022. Deconfounding Legal Judgment Prediction for European Court of Human Rights Cases Towards Better Alignment with Experts. In *Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 1120–1138. [doi:10.18653/v1/2022.emnlp-main.74](https://doi.org/10.18653/v1/2022.emnlp-main.74)
- Xu et al. (2025a) Shanshan Xu, Barbara Plank, et al. 2025a. From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP. *arXiv preprint arXiv:2510.12817* (2025).
- Xu et al. (2025b) Shanshan Xu, TYS Santosh, Yanai Elazar, Quirin Vogel, Barbara Plank, and Matthias Grabmair. 2025b. Better aligned with survey respondents or training data? Unveiling political leanings of LLMs on US Supreme Court cases. *arXiv preprint arXiv:2502.18282* (2025).
- Xu et al. (2023a) Shanshan Xu, Leon Staufer, Santosh T.y.s.s, Oana Ichim, Corina Heri, and Matthias Grabmair. 2023a. VECHR: A Dataset for Explainable and Robust Classification of Vulnerability Type in the European Court of Human Rights. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 11738–11752. [doi:10.18653/v1/2023.emnlp-main.718](https://doi.org/10.18653/v1/2023.emnlp-main.718)
- Xu et al. (2024) Shanshan Xu, Santosh Tyss, Oana Ichim, Barbara Plank, and Matthias Grabmair. 2024. Through the Lens of Split Vote: Exploring Disagreement, Difficulty and Calibration in Legal Case Outcome Classification.
In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*\. 199–216\. - Xu et al\.\(2023b\)Shanshan Xu, Santosh T\.y\.s\.s, Oana Ichim, Isabella Risini, Barbara Plank, and Matthias Grabmair\. 2023b\.From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification\. In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, Houda Bouamor, Juan Pino, and Kalika Bali \(Eds\.\)\. Association for Computational Linguistics, Singapore, 9558–9576\.[doi:10\.18653/v1/2023\.emnlp\-main\.594](https://doi.org/10.18653/v1/2023.emnlp-main.594) - Zheng et al\.\(2023\)Lianmin Zheng, Wei\-Lin Chiang, and et al\. 2023\.Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\. In*Advances in Neural Information Processing Systems*, Vol\. 36\. 46595–46623\.