# Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
Source: [https://arxiv.org/html/2605.08437](https://arxiv.org/html/2605.08437)
(2026)

###### Abstract
Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to *judge* such arguments—weighing competing claims, applying doctrine to facts, and rendering reasoned decisions—is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall's $W = 0.984$; pairwise Kendall's $\tau \geq 0.897$), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code ([https://github.com/maritaca-ai/magis-bench](https://github.com/maritaca-ai/magis-bench)) to support further research on legal AI capabilities.
Keywords: Sentence Drafting, Magistrate-level Legal Tasks, Open-ended Tasks, LLM Judge, Large Language Models
Figure 1. Overview of Magis-Bench: dataset construction and multi-LLM judging pipeline. Data collection from Brazilian judicial exams (2023–2025) yields 58 discursive questions and 16 sentence-drafting tasks (8 civil, 8 criminal), forming the Magis-Bench dataset of 74 tasks with official rubrics. Twenty-three candidate LLMs are evaluated by a multi-judge panel of four models (GPT-5.1, Gemini-2.5-Pro, Gemini-3-Pro, and Claude-4.5-Opus). Each judge receives the question, rubric, and model response and outputs a 0–10 score with a justification. Scores are aggregated to compute mean scores and inter-judge agreement, producing a model ranking for magistrate-level legal tasks.
## 1. Introduction

The growing adoption of LLMs in professional and specialized domains has created an urgent need for evaluation frameworks that can rigorously assess their capabilities beyond general-purpose tasks. Existing open-ended benchmarks in the legal domain focus primarily on tasks where LLMs must produce legal arguments or documents (Pires et al., 2026; Chlapanis et al., 2025; Shi et al., 2026; Fan et al., 2025). Yet in a well-functioning legal system, the capacity to *judge* arguments is as fundamental as advocacy itself. While producing persuasive arguments tests one set of skills, judicial reasoning requires impartiality, comprehensive analysis of both sides, and authoritative resolution grounded in law. This distinction matters as LLMs are increasingly considered for applications involving adjudicative reasoning, from legal research assistance to decision support systems.

We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level legal tasks derived from recent Brazilian competitive examinations for judicial positions. In Brazil, judicial positions are filled through highly competitive public examinations that assess candidates' readiness to serve as judges. Magis-Bench comprises 74 questions from eight such examinations held between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. Each question is accompanied by an official evaluation rubric that specifies the expected legal concepts, analytical steps, and structural elements.

We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall's $W = 0.984$), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10). Even the best-performing models score below 70% of the maximum, indicating that magistrate-level legal reasoning and writing remain challenging for current LLMs.

Our contributions are: (1) Magis-Bench, a benchmark of 74 magistrate-level legal tasks with official rubrics; (2) a multi-judge evaluation methodology that achieves high inter-judge agreement; and (3) a comprehensive evaluation of 23 LLMs, establishing performance baselines for judicial writing tasks. We release the benchmark, model outputs, and evaluation code to support further research.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes Magis-Bench and our evaluation methodology. Section 4 reports experimental results. Section 5 discusses limitations, and Section 6 concludes the paper.
## 2. Related Work

Recent benchmarks assess LLMs' capacity to perform advocacy-oriented legal tasks, such as drafting legal documents and providing legal advice. OAB-Bench (Pires et al., 2026) and Rabula (Pacheco et al., 2025) evaluate LLMs on the Brazilian Bar Examination (OAB), which tests advocacy competencies through essay tasks, using official FGV examination rubrics with LLM-as-a-judge evaluation; OAB-Bench achieves strong correlation with human expert scores. LEXam (Fan et al., 2025) provides a benchmark derived from 340 law school examinations across 116 courses at the University of Zurich, comprising questions in English and German. PLawBench (Shi et al., 2026) evaluates practical legal skills in the Chinese domain through 850 questions across 13 scenarios. Other examination-based benchmarks include GreekBarBench (Chlapanis et al., 2025) for the Greek legal system and KCL (Oh et al., 2025) for Korean canonical legal reasoning, with bar exam questions and instance-level rubrics.

A distinct research direction investigates whether LLMs can replicate judicial reasoning and produce sentencing decisions comparable to human judges. Posner and Saran (2025) tested GPT-4 on a simulated appellate case involving war crimes, manipulating case framing and precedent alignment; the model exhibited formalist behavior, strictly following precedent while remaining insensitive to emotional appeals that influenced human judges. Ayal et al. (2026) compared sentencing decisions from LLMs against 123 retired judges across two criminal cases, finding that LLMs exhibit substantially lower inter-model variability than human judges. In the Chinese legal context, JuDGE (Su et al., 2025) benchmarks complete judgment document generation from factual case descriptions, demonstrating that retrieval-augmented approaches improve performance but substantial room for improvement remains.

While existing benchmarks cover legal reasoning, judgment prediction, document generation, and bar examinations, none evaluates judicial competency using official certification criteria for judges. Magis-Bench fills this gap by using rubrics from competitive magistrate selection examinations in Brazil, testing whether LLMs can reason like judges as evaluated by professional examiners.
## 3. Methodology

This section describes how we constructed Magis-Bench and how we evaluate LLMs on it using rubric-grounded, multi-judge LLM-as-a-judge scoring.

### 3.1. Data Collection

Magis-Bench is constructed from written examinations administered as part of competitive selection processes for substitute judges in Brazil. These examinations constitute the second phase of judicial selection, following the objective (multiple-choice) phase, and are designed to assess candidates' ability to apply legal knowledge in sophisticated writing tasks under examination conditions.

We collected examinations from eight judicial selection processes conducted between 2023 and 2025, covering both federal (TRF1, TRF2, and TRF3) and state (TJMS, TJPE, TJGO, TJAM, and TJSE) courts. An examination was included if: (1) it was conducted in 2023 or later; (2) its written questions and official evaluation rubrics are publicly available; and (3) it includes both discursive questions and practical sentence-drafting exercises. Table 1 summarizes the examinations included in the benchmark.
Table 1. Examinations included in Magis-Bench.

| Court | Organizer | Year | Questions |
|-------|-----------|------|-----------|
| TRF1 | FGV | 2023 | 6 |
| TRF2 | Internal Commission | 2024 | 14 |
| TRF3 | FGV | 2025 | 8 |
| TJMS | FGV | 2023 | 12 |
| TJPE | FGV | 2023 | 10 |
| TJGO | FGV | 2023 | 10 |
| TJAM | FGV | 2025 | 7 |
| TJSE | FGV | 2025 | 7 |
| **Total** | | | **74** |
### 3.2. Magis-Bench

Magis-Bench comprises 74 questions: 58 discursive questions and 16 sentence-drafting exercises (8 civil and 8 criminal). Each question is scored on a scale of 0 to 10 points based on official evaluation rubrics produced by the examination boards. The discursive questions present factual scenarios followed by one or more prompts requiring legal analysis; many are multi-turn (108 total turns across all discursive questions). For evaluation, we present each sub-question sequentially and allow the model to see its previous responses, simulating the examination context, as sketched below.
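To make the multi-turn protocol concrete, the following is a minimal sketch of the sequential presentation loop, assuming an OpenAI-compatible chat API; the field names (`statement`, `sub_questions`) and function name are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the multi-turn discursive-question loop described above.
from openai import OpenAI

client = OpenAI()

def answer_discursive_question(question: dict, model: str) -> list[str]:
    """Present each sub-question in turn, keeping earlier answers in context."""
    messages = [{"role": "user", "content": question["statement"]}]
    answers = []
    for turn in question["sub_questions"]:  # e.g., items (a), (b), (c) of the exam question
        messages.append({"role": "user", "content": turn})
        reply = client.chat.completions.create(model=model, messages=messages)
        text = reply.choices[0].message.content
        # The model's own answer stays in context for subsequent turns.
        messages.append({"role": "assistant", "content": text})
        answers.append(text)
    return answers
```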
The sentence-drafting exercises require candidates to draft complete judicial decisions following the structure prescribed by Brazilian procedural law. Civil sentences involve disputes in areas such as tax, administrative, and social security law, while criminal sentences require analysis of the charges and defenses and, when conviction is warranted, application of sentencing guidelines. Figure 2 shows an example of a practical civil sentence question.
PRACTICAL CIVIL JUDGMENT EXAM

XYZ Comércio Ltda., a business company operating in the retail trade sector, has tax-enforcement-registered debts in the amount of R$ 200,000.00, relating to the contribution levied on payroll in favor of the National Commercial Apprenticeship Service (SENAC). The taxable events for such contributions occurred throughout the year 2016. Such assessments were never challenged either administratively or judicially.
Due to these debts, the company became subject to a tax enforcement proceeding (execução fiscal), filed by the Federal Government on 04/03/2017, to collect the said debt. The case was assigned to the 3rd Federal Tax Enforcement Court of the seat of the Judiciary Section.
…
In its response to the objections, the Federal Government argued:
i) that it is a proper party to collect;
ii) being a proper party, the jurisdiction to process and adjudicate such collection by means of tax enforcement lies with the Federal Judiciary;
iii) the intercurrent statute of limitations was not consummated;
iv) the contributions in favor of the "S System" levied on payroll were received by the 1988 Federal Constitution;
v) there is no limitation of 20 minimum wages on the tax base of the contributions to SENAC.

The records were submitted for judgment.
In view of the data above (to which no facts created by the candidate must be added), *render the judgment* (grounds/reasons and operative part), addressing each of the allegations with the appropriate legal basis and/or the current prevailing understanding in case law. Preparation of the report section is dispensed with.

Figure 2. Example of a practical sentence-drafting question in Magis-Bench: civil sentence from the TRF2 examination.
We evaluate a diverse set of LLMs on Magis-Bench, covering proprietary and open-source models across different sizes and architectures. The evaluated models include offerings from OpenAI, Anthropic, Google, Maritaca AI, and several open-source families.

Model outputs for most models were obtained through OpenRouter ([http://openrouter.ai/](http://openrouter.ai/)), except for OpenAI and Maritaca AI models, which were accessed via their respective APIs. Rather than imposing a uniform temperature across all models, we leave the temperature unspecified so that each model runs with its developer's default, providing a fair assessment of real-world deployment performance. All other generation parameters likewise use each model's default settings. A minimal generation sketch follows.
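As a concrete illustration, here is a minimal sketch of candidate generation through OpenRouter's OpenAI-compatible endpoint; the helper name `generate` and the placeholder API key are assumptions, and temperature is deliberately omitted so the provider default applies.

```python
# Sketch of candidate generation via OpenRouter's OpenAI-compatible API.
# "generate" is an illustrative helper, not the authors' released code.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

def generate(prompt: str, model: str) -> str:
    reply = client.chat.completions.create(
        model=model,  # an OpenRouter model slug
        # temperature intentionally omitted: each model's default applies
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```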
### 3.3. Rubric-Grounded Multi-Judge Evaluation

Evaluating open-ended legal writing at scale is costly and difficult to standardize with human experts. We therefore employ an LLM-as-a-judge methodology (Zheng et al., 2023), wherein strong LLMs evaluate generated responses against the official rubrics and assign a score from 0 to 10 with a brief justification.

For each response, the judge receives: (1) the original question (including the factual scenario and prompt); (2) the official evaluation rubric specifying the expected elements and point allocation; and (3) the model's generated response. Rubric grounding helps reduce reliance on subjective preferences by anchoring evaluation to explicit, pre-defined criteria.

Single-judge evaluation can be sensitive to model-specific biases and inconsistent criterion weighting. To improve robustness, we use four independent frontier models as judges and report both per-judge scores and their aggregate (mean) across all questions. Evaluation is blind: judges receive only the question, rubric, and response, without any identification of the candidate model. Together with rubric grounding, this mitigates stylistic preferences that could favor same-family candidates.
The four judge models used in our evaluation are:

- GPT-5.1 (OpenAI)
- Gemini-2.5-Pro (Google)
- Gemini-3-Pro-Preview (Google)
- Claude-4.5-Opus (Anthropic)
All judge evaluations are conducted with temperature set to 0 to maximize reproducibility and minimize variation in scoring. A sketch of a single judge call appears below.
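The following is a minimal sketch of one rubric-grounded judge call under these settings; the prompt wording and JSON output format are assumptions for illustration, not the authors' exact template.

```python
# Hypothetical judge call: question + rubric + response in, score + justification out.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are grading an answer to a Brazilian judicial examination question.

Question:
{question}

Official rubric (expected elements and point allocation):
{rubric}

Candidate response:
{response}

Return only JSON: {{"score": <float between 0 and 10>, "justification": "<brief reasoning>"}}"""

def judge(question: str, rubric: str, response: str, judge_model: str) -> dict:
    reply = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic-as-possible scoring, as in the paper
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, rubric=rubric, response=response)}],
    )
    return json.loads(reply.choices[0].message.content)
```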
To quantify consensus among judges, we report Kendall's $W$ (Kendall and Smith, 1939) for overall concordance across all four judges and Kendall's $\tau$ (Kendall, 1938) for pairwise rank correlations, both computed over per-judge rankings of the evaluated models. $W$ ranges from 0 to 1 (higher indicates stronger consensus) and $\tau$ from $-1$ to $+1$.
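Both statistics can be computed directly from a judges-by-models score matrix; a minimal sketch follows (the tie correction for Kendall's $W$ is omitted for brevity, an assumption on our part).

```python
# Agreement statistics over a (judges x models) score matrix.
import numpy as np
from itertools import combinations
from scipy.stats import kendalltau, rankdata

def kendalls_w(scores: np.ndarray) -> float:
    """Kendall's coefficient of concordance (no tie correction)."""
    ranks = np.apply_along_axis(rankdata, 1, scores)  # rank models within each judge
    m, n = ranks.shape                                # m judges, n models
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m**2 * (n**3 - n))

def pairwise_taus(scores: np.ndarray) -> list[float]:
    """Kendall's tau for every pair of judges."""
    return [kendalltau(scores[i], scores[j])[0]
            for i, j in combinations(range(scores.shape[0]), 2)]
```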
## 4. Results

This section presents the experimental results of evaluating 23 LLMs on Magis-Bench, including overall performance rankings, inter-judge agreement analysis, and an examination of judge-specific evaluation patterns.

### 4.1. Overall Performance

Table 2 presents the performance of all evaluated models across the four judges. Models are ranked by their average score across all judges, with scores representing the mean over all 74 questions in the benchmark (each scored 0–10).

The results reveal a clear performance hierarchy. Google's Gemini-3-Pro-Preview achieves the highest average score (6.97), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). The top five positions are occupied exclusively by frontier models from Google, Anthropic, and OpenAI, with scores ranging from 6.18 to 6.97. Notably, even the best-performing model achieves less than 70% of the maximum possible score, indicating that Magis-Bench presents a substantial challenge for current LLMs.
The mid-tier models (ranks 6–14) achieve scores between 4.06 and 5.55, including Claude-4.5-Sonnet, GPT-4.1, the Maritaca AI Sabiá models, and various reasoning-enhanced models. The lower tier (ranks 15–23) comprises smaller open-source models, primarily from the Qwen family, with scores ranging from 1.82 to 4.00. Bootstrap 95% confidence intervals (10,000 resamples over per-exam scores) confirm this tier structure: top-5 CIs overlap within the group but are separated from the mid-tier models (e.g., rank 1 [6.34, 7.56] vs. rank 6 [4.89, 6.16]). Averaged across models, discursive questions are the easiest (mean 4.68), followed by criminal sentence drafting (4.39) and civil sentence drafting (3.89).
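For reference, here is a minimal sketch of the percentile-bootstrap procedure described above, assuming per-exam mean scores as the resampling unit (our reading of the setup, not the authors' released code):

```python
# Percentile bootstrap over per-exam scores (10,000 resamples).
import numpy as np

def bootstrap_ci(per_exam_scores, n_resamples: int = 10_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_exam_scores, dtype=float)
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)          # resampled mean per replicate
    return np.percentile(means, [2.5, 97.5])  # 95% CI bounds
```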
Table 2. Model rankings across four LLM judges. In the original, green/red cell shading marks where a judge ranked a model above/below its mean rank (intensity up to ±3 and beyond).

| Rank | Model | GPT-5.1 | Gemini-2.5-Pro | Gemini-3-Pro | Claude-4.5-Opus | AVG |
|------|-------|---------|----------------|--------------|-----------------|-----|
| 1 | Gemini-3-Pro-Preview | 6.45 | 7.29 | 7.43 | 6.69 | 6.97 |
| 2 | Gemini-3-Flash-Preview | 6.08 | 6.86 | 7.29 | 6.46 | 6.67 |
| 3 | Claude-4.5-Opus | 5.89 | 6.74 | 6.97 | 6.24 | 6.46 |
| 4 | GPT-5.1 | 6.06 | 6.60 | 6.49 | 5.75 | 6.23 |
| 5 | Gemini-2.5-Pro | 5.77 | 6.37 | 6.67 | 5.91 | 6.18 |
| 6 | Claude-4.5-Sonnet | 5.30 | 5.86 | 5.63 | 5.39 | 5.55 |
| 7 | GPT-4.1 | 4.68 | 5.39 | 5.73 | 5.01 | 5.20 |
| 8 | Sabiá-4 | 4.46 | 5.26 | 5.60 | 4.79 | 5.03 |
| 9 | Sabiá-3.1 | 3.98 | 4.77 | 4.76 | 4.15 | 4.41 |
| 10 | DeepSeek-V3.2 | 4.04 | 4.70 | 4.90 | 3.73 | 4.34 |
| 11 | Kimi-K2.5 | 3.97 | 4.43 | 4.43 | 3.83 | 4.17 |
| 12 | Kimi-K2-Thinking | 3.91 | 4.26 | 4.59 | 3.80 | 4.14 |
| 13 | GPT-5-Mini | 3.62 | 4.34 | 4.66 | 3.70 | 4.08 |
| 14 | Sabiazinho-4 | 3.65 | 4.39 | 4.49 | 3.70 | 4.06 |
| 15 | Qwen3-235B-Thinking | 3.48 | 4.21 | 4.60 | 3.74 | 4.00 |
| 16 | Qwen3-235B-Instruct | 3.40 | 4.09 | 4.22 | 3.64 | 3.84 |
| 17 | GPT-4.1-Mini | 3.62 | 4.06 | 4.29 | 3.38 | 3.84 |
| 18 | Sabiazinho-3 | 3.11 | 3.13 | 3.50 | 2.83 | 3.14 |
| 19 | Qwen3-30B-Instruct | 2.59 | 3.24 | 3.28 | 2.51 | 2.91 |
| 20 | Qwen2.5-72B-Instruct | 2.58 | 2.83 | 2.92 | 2.16 | 2.62 |
| 21 | Qwen3-30B-Thinking | 2.33 | 2.55 | 2.50 | 2.01 | 2.35 |
| 22 | Qwen2.5-14B-Instruct | 2.10 | 1.95 | 2.30 | 1.82 | 2.04 |
| 23 | Qwen3-8B | 1.86 | 1.81 | 2.13 | 1.46 | 1.82 |
### 4.2. Inter-Judge Agreement

A central question in LLM-as-a-judge evaluation is whether different judge models produce consistent assessments. Our multi-judge methodology enables direct measurement of inter-judge agreement, providing insight into the robustness of the evaluation.

#### Overall Concordance

The four judge models exhibit remarkably high agreement in their rankings. Kendall's coefficient of concordance reaches $W = 0.984$, indicating near-perfect consensus on the relative ordering of models. This level of agreement substantially exceeds what would be expected by chance and suggests that the judges are responding to genuine quality differences in model outputs.
#### Pairwise Correlations

Table 3 presents Kendall's $\tau$ coefficients for all six pairwise combinations of judges. All correlations exceed 0.89, with a mean of $\tau = 0.913$. The highest agreement is observed between GPT-5.1 and Gemini-2.5-Pro ($\tau = 0.945$), while the lowest, though still strong, is between Gemini-3-Pro-Preview and both Gemini-2.5-Pro and Claude-4.5-Opus ($\tau = 0.897$). All p-values are below $10^{-13}$, confirming statistical significance.
Table 3. Pairwise Kendall's $\tau$ correlation between judge models.

| | GPT-5.1 | Gemini-2.5-Pro | Gemini-3-Pro | Claude-4.5-Opus |
|---|---------|----------------|--------------|-----------------|
| GPT-5.1 | 1.000 | 0.945 | 0.905 | 0.913 |
| Gemini-2.5-Pro | 0.945 | 1.000 | 0.897 | 0.921 |
| Gemini-3-Pro | 0.905 | 0.897 | 1.000 | 0.897 |
| Claude-4.5-Opus | 0.913 | 0.921 | 0.897 | 1.000 |
### 4.3. Judge-Specific Patterns

Despite the high overall agreement, the per-judge deviations highlighted in Table 2 reveal subtle but interpretable differences in how judges evaluate certain models.

#### Scoring Tendencies

Gemini-3-Pro-Preview tends to assign relatively higher scores across the board, as reflected in its column showing the highest raw scores for most models. Conversely, Claude-4.5-Opus tends toward slightly more conservative scores, particularly for mid-tier models. GPT-5.1 and Gemini-2.5-Pro occupy intermediate positions.

#### Notable Divergences

A few specific cases warrant attention. Claude-4.5-Opus assigns DeepSeek-V3.2 a notably lower score than the other judges (a rank divergence of $-3$), suggesting this judge may be more critical of certain response patterns exhibited by DeepSeek. Similarly, Gemini-3-Pro-Preview ranks Qwen3-235B-Thinking substantially higher than the consensus ($+3$). These divergences, while informative, do not substantially affect the overall ranking given the high concordance observed.
#### Judge Calibration

To verify that judges can identify optimal performance, we generated responses for all 74 questions using GPT-5.2 (reasoning effort high) with access to the official rubrics as privileged information. All four judges assigned near-perfect scores to these oracle responses (mean 9.957; individual scores $\geq 9.87$), substantially higher than the best-performing evaluated model (6.97), confirming that judges can distinguish truly excellent from typical outputs.
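A sketch of this calibration check, reusing the hypothetical `generate` and `judge` helpers from Section 3; appending the rubric to the prompt stands in for "privileged information", and the prompt construction is our assumption:

```python
# Oracle calibration sketch: rubric-privileged responses should score near 10.
def oracle_calibration(questions, judge_models):
    scores = []
    for q in questions:
        # Give the candidate model the official rubric as privileged context.
        oracle_prompt = q["prompt"] + "\n\nOfficial rubric (privileged):\n" + q["rubric"]
        response = generate(oracle_prompt, model="gpt-5.2")
        scores.append([judge(q["prompt"], q["rubric"], response, jm)["score"]
                       for jm in judge_models])
    return scores  # the paper reports a mean of 9.957 for this check
```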
#### Implications

The high inter-judge agreement validates the use of the LLM-as-a-judge methodology for Magis-Bench evaluation. The consistency across four independent frontier models from three different providers suggests that the evaluation captures genuine quality differences rather than arbitrary preferences. The minor divergences observed provide additional insight into how different judges weight various aspects of legal writing, but do not undermine the reliability of the overall rankings.
### 4.4. Robustness of the Judge Panel

A leave-one-judge-out analysis yields Kendall's $\tau \geq 0.976$ against the full four-judge ranking in all four runs, with no model moving more than two positions. A self-bias check, comparing the score each judge assigns to itself as a candidate with the mean score from the other three judges, finds differences ranging from $-0.29$ (Claude-4.5-Opus) to $+0.62$ (Gemini-3-Pro-Preview), with a mean of $+0.09$. This reflects general leniency rather than self-favoritism: Gemini-3-Pro-Preview assigns the top score to 19 of the 23 models.
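A sketch of the leave-one-judge-out comparison, over the same (judges x models) score matrix used for the agreement statistics above:

```python
# Leave-one-judge-out: drop each judge, re-rank by the remaining judges' mean,
# and compare with the full four-judge ranking via Kendall's tau.
import numpy as np
from scipy.stats import kendalltau

def leave_one_judge_out(scores: np.ndarray) -> list[float]:
    full_means = scores.mean(axis=0)  # mean score per model, all four judges
    taus = []
    for j in range(scores.shape[0]):
        reduced_means = np.delete(scores, j, axis=0).mean(axis=0)
        taus.append(kendalltau(full_means, reduced_means)[0])
    return taus  # the paper reports tau >= 0.976 in all four runs
```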
## 5. Limitations

Magis-Bench provides a rubric-grounded way to study magistrate-level writing, but it has important limitations. First, our main results rely on LLM-as-a-judge scoring (Section 3); although we mitigate model-specific biases by using four independent frontier judges and observe high inter-judge agreement, agreement is not the same as correctness, and judges may systematically reward surface features (e.g., verbosity or formality) or miss subtle doctrinal errors. A natural next step is targeted human validation on a stratified subset of questions and model outputs, both to calibrate absolute score levels and to audit whether judge explanations and scores track the official rubrics as human graders would apply them. Second, because the rubrics reflect examination-board expectations, they may encode particular preferences that differ from real-world best practices; extending the benchmark to additional sources and task types would improve coverage. Third, regarding contamination, the examinations span 2023–2025 (several administered after most models' training cutoffs) and the rubrics are released separately from the questions, though inclusion in proprietary training corpora cannot be verified. Finally, the benchmark uses only publicly released materials and does not advocate deploying LLMs to adjudicate real cases.
## 6. Conclusion

We introduced Magis-Bench, a benchmark for evaluating LLMs on magistrate-level legal tasks derived from recent Brazilian competitive examinations for judicial positions. The benchmark comprises 74 questions from eight examinations held between 2023 and 2025, each accompanied by an official evaluation rubric that enables systematic, criteria-grounded assessment.

Our evaluation of 23 state-of-the-art LLMs using a multi-judge methodology with four frontier models revealed several key findings. First, inter-judge agreement is remarkably high (Kendall's $W = 0.984$), and the judges demonstrate appropriate calibration: when evaluating oracle responses generated with access to the official rubrics, all four judges assigned near-perfect scores, confirming they can reliably distinguish truly excellent responses from typical model outputs. Second, even the best-performing models achieve less than 70% of the maximum possible score, indicating that magistrate-level legal reasoning and writing remain challenging for current LLMs. Third, the performance hierarchy is largely consistent across judges, with Google's Gemini-3-Pro-Preview, Gemini-3-Flash-Preview, and Claude-4.5-Opus occupying the top positions.

Magis-Bench addresses a gap in legal AI evaluation by focusing on magistrate-level legal tasks rather than advocacy-oriented skills. This distinction is important as LLMs are increasingly considered for applications that involve adjudicative reasoning, from legal research assistance to decision support systems.

The average cost to evaluate a single model with a single judge ranged from $4.43 to $5.91 USD across the four judges, demonstrating the feasibility of large-scale automated legal writing evaluation.
## References
- O. G. Ayal, Z. Elyoseph, and A. Solomon (2026). Evaluating large language models as judicial decision-makers. *Justice Quarterly* 0(0), pp. 1–36. [https://doi.org/10.1080/07418825.2026.2618254](https://doi.org/10.1080/07418825.2026.2618254)
- O. S. Chlapanis, D. Galanis, N. Aletras, and I. Androutsopoulos (2025). GreekBarBench: a challenging benchmark for free-text legal reasoning and citations. arXiv preprint arXiv:2505.17267.
- Y. Fan, J. Ni, J. Merane, Y. Tian, Y. Hermstrüwer, Y. Huang, M. Akhtar, E. Salimbeni, F. Geering, O. Dreyer, et al. (2025). LEXam: benchmarking legal reasoning on 340 law exams. arXiv preprint arXiv:2505.12864.
- M. G. Kendall and B. B. Smith (1939). The problem of m rankings. *The Annals of Mathematical Statistics* 10(3), pp. 275–287. [http://www.jstor.org/stable/2235668](http://www.jstor.org/stable/2235668)
- M. G. Kendall (1938). A new measure of rank correlation. *Biometrika* 30(1–2), pp. 81–93. [https://doi.org/10.1093/biomet/30.1-2.81](https://doi.org/10.1093/biomet/30.1-2.81)
- H. Oh, W. Hwang, and K. On (2025). Korean canonical legal benchmark: toward knowledge-independent evaluation of LLMs' legal reasoning capabilities. arXiv preprint arXiv:2512.24572.
- E. C. B. Pacheco, F. M. Suriani, and R. Ribeiro (2025). Rabula: a benchmark for evaluating LLMs in Brazilian legal tasks.
- R. Pires, R. Malaquias Junior, and R. Nogueira (2026). Automatic legal writing evaluation of LLMs. In *Proceedings of the Twentieth International Conference on Artificial Intelligence and Law* (ICAIL '25), New York, NY, USA, pp. 420–424. [https://doi.org/10.1145/3769126.3769227](https://doi.org/10.1145/3769126.3769227)
- E. A. Posner and S. Saran (2025). Judge AI: assessing large language models in judicial decision-making. University of Chicago Coase-Sandor Institute for Law & Economics Research Paper (2503).
- Y. Shi, H. Liu, Y. Hu, G. Song, X. Xu, Y. Ma, T. Tang, L. Zhang, Q. Chen, D. Feng, et al. (2026). PLawBench: a rubric-based benchmark for evaluating LLMs in real-world legal practice. arXiv preprint arXiv:2601.16669.
- W. Su, B. Yue, Q. Ai, Y. Hu, J. Li, C. Wang, K. Zhang, Y. Wu, and Y. Liu (2025). JuDGE: benchmarking judgment document generation for Chinese legal system. In *Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval* (SIGIR '25), New York, NY, USA, pp. 3573–3583. [https://doi.org/10.1145/3726302.3730295](https://doi.org/10.1145/3726302.3730295)
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In *Advances in Neural Information Processing Systems*, Vol. 36, pp. 46595–46623.