VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
Summary
VLegal-Bench is a cognitively grounded benchmark for evaluating large language models on Vietnamese legal reasoning tasks, containing 10,450 expert-annotated samples designed to address the gap in legal benchmarks for civil law systems. The benchmark assesses multiple levels of legal understanding through question answering, multi-step reasoning, and scenario-based problem solving, providing a replicable framework for evaluating LLMs in non-English, codified legal contexts.
# Benchmarking Vietnamese Legal Knowledge of Large Language Models
Source: https://arxiv.org/html/2512.14554
Nguyen Tien Dong1,2,*, Minh-Anh Nguyen1,2,*, Thanh Dat Hoang1, Nguyen Tuan Ngoc1, Dao Xuan Quang Minh1, Phan Phi Hai1, Nguyen Thi Ngoc Anh1,3,†, Binh Vu4,†
1CMC OpenAI, 2VinUniversity, 3HUST, Vietnam
4SRH University Heidelberg, Germany
{dongnt,minhna}@cmcai.vn, {25dong.nt,minh.na2}@vinuni.edu.vn
*Co-first author. †Co-last author
## Abstract
The rapid advancement of large language models (LLMs) has expanded their potential in the legal domain. However, existing legal benchmarks remain largely English-centric and oriented toward common law, leaving a critical gap in evaluating LLMs for civil law systems that govern most jurisdictions worldwide. To address this gap, we introduce Vietnamese Legal Benchmark (VietLegal), a cognitively grounded benchmark designed for the hierarchical and codified structure of Vietnamese law. Although instantiated in Vietnamese legislation, VietLegal provides a replicable evaluation framework for civil law systems characterized by complex statutory hierarchies and frequent amendments. Inspired by Bloom's taxonomy, VietLegal assesses multiple levels of legal understanding through tasks that mirror real-world legal assistant use cases, including legal question answering, multi-step reasoning, and scenario-based problem solving. The benchmark contains 10,450 expert-annotated samples, each cross-validated against authoritative legal sources to ensure fidelity to practical legal workflows. By offering the first standardized legal benchmark for Vietnamese, VietLegal enables systematic assessment of LLMs in civil law contexts and supports the development of more reliable and interpretable AI-assisted legal systems.
## 1 Introduction
The rapid progress of large language models (LLMs) has enabled transformative applications in the legal domain (Homoki and Ződi, 2024; Sun, 2023). While LLMs demonstrate strong performance on general tasks, their effectiveness on legally complex tasks in low-resource languages such as Vietnamese remains largely unexplored. Vietnamese law is characterized by a formal, hierarchical, and continuously evolving statutory system, so specialized evaluation is needed to ensure that model outputs remain legally accurate, consistent, and ethically aligned.
Existing legal NLP benchmarks predominantly target common law in English, emphasizing case-based reasoning (Chalkidis et al., 2022; Guha et al., 2023). This focus overlooks civil law systems, which govern over 60% of global jurisdictions and derive authority from hierarchical statutes rather than judicial precedent (Merryman and Pérez-Perdomo, 2018; JuriGlobe, 2023). Civil law introduces distinct challenges, requiring models to navigate complex statutory interpretations and track temporal validity across frequent amendments. While recent benchmarks have addressed Chinese civil law (Li et al., 2024; Dai et al., 2025; Fei et al., 2024), other codified traditions remain underrepresented. Vietnamese law, specifically, presents unique difficulties due to its heavy reliance on intricate references among Articles, Clauses, and Points, requiring specialized evaluation to ensure legal fidelity.
To address these limitations, we introduce VietLegal, the first comprehensive benchmark designed to evaluate LLMs on Vietnamese legal tasks within a civil law framework. Grounded in Bloom's cognitive taxonomy, VietLegal assesses model capabilities across progressively deeper levels ranging from basic recall to multi-step reasoning and ethical judgment. The benchmark contains 10,450 expert-annotated samples, each cross-validated against authoritative legal sources to ensure fidelity to practical legal workflows. By offering the first standardized legal benchmark for Vietnamese, VietLegal enables systematic assessment of LLMs in civil law contexts and supports the development of more reliable AI-assisted legal systems.
Our main contributions are summarized as follows:
First, we introduce VietLegal, a benchmark for evaluating LLMs on Vietnamese legal tasks with a replicable design applicable to other civil law jurisdictions.
Second, we propose a cognitively grounded evaluation methodology informed by Bloom's taxonomy that enables systematic assessment from basic legal recall to advanced multi-step reasoning.
Third, we release a high-quality dataset of 10,450 expert-verified samples and conduct extensive experiments across 23 diverse LLMs, offering insights into their strengths and limitations in civil law reasoning.
The benchmark and evaluation code are available at an anonymous repository: github.com/CMC-OPENAI/VLegal-Bench
| Level | ID | Task | Purpose | Type | Metric | Test set |
|-------|----|----|---------|------|--------|----------|
| 1. Recognition & Recall | 1.1 | Legal Entity Recognition | To detect and classify named entities, including persons, organizations, monetary amounts, ... within legal documents. | MCQ, NER | Accuracy, EM | 750 |
| | 1.2 | Legal Topic Classification | Classifies legal questions into predefined legal topics. | MCQ, MLQ | Accuracy, Macro-F1 | 700 |
| | 1.3 | Legal Concept Recall | Recalls statutory definitions or meanings of legal terms and concepts. | MCQ | Accuracy | 300 |
| | 1.4 | Article Recall | Retrieves or cites the correct legal article corresponding to a term, concept, or question. | MCQ | Accuracy | 1000 |
| | 1.5 | Legal Schema Recall | Recognizes and recalls hierarchical and temporal relations among legal documents (e.g., amendments, replacements, ...) | MCQ | Accuracy | 800 |
| 2. Understanding & Structuring | 2.1 | Relation Extraction | Extracts the subject, object, and content of a legal relationship from a factual scenario | MCQ | Accuracy | 253 |
| | 2.2 | Legal Element Recognition | Identifies the hypothesis, disposition, and sanction components within a legal provision | MCQ | Accuracy | 300 |
| | 2.3 | Legal Graph Structuring | Convert legal documents into structured knowledge graphs representing entities, relations, and inter-article references. | Generat., MLQ | ROUGE-L, Node-F1, Edge-F1 | 296 |
| | 2.4 | Judgment Verification | Evaluates whether a court's reasoning or statement is consistent with the factual and legal content of the actual judgment. | BC | Accuracy | 600 |
| | 2.5 | User Intent Understanding | Determines the underlying intent or query type of the user when interacting with a legal assistant. | MLC | Macro-F1 | 1359 |
| 3. Reasoning & Inference | 3.1 | Article / Clause Prediction | Predict which legal article or clause applies to a given legal question or short query, instead of a lengthy factual scenario | MCQ | Accuracy | 600 |
| | 3.2 | Legal Court Decision Prediction | Predicts the final court decision or judgment outcome from the factual and legal content of a real case. | MCQ | Accuracy | 600 |
| | 3.3 | Multi-Article Reasoning | Perform multi-step reasoning by connecting several legal provisions or facts to derive a consistent conclusion. | MCQ | Accuracy | 292 |
| | 3.4 | Conflict & Consistency Detection | Identify contradictions or overlaps between different legal clauses or interpretations across statutes or contracts. | BC | Binary F1 | 161 |
| | 3.5 | Penalty / Remedy Estimation | Estimates the appropriate legal penalty or remedy for a given factual situation. | MCQ | Accuracy | 358 |
| 4. Interpretation & Generation | 4.1 | Legal Document Summarization | Generate concise summaries of long legal texts (statutes, judgments, contracts) while preserving key information. | Generat. | ROUGE-L | 384 |
| | 4.2 | Judicial Reasoning Generation | Produce structured reasoning paragraphs based on the IRAC template (Issue - Rule - Application - Conclusion). | Generat. | ROUGE-L | 299 |
| | 4.3 | Objective Legal Opinion Generation | Generate a balanced and impartial legal opinion or advisory text that aligns with statutory interpretation. | Generat. | ROUGE-L | 498 |
| 5. Ethics, Fairness & Bias | 5.1 | Bias Detection | Detect gender, racial, political, or religious bias in generated answers or decisions to ensure fairness. | MCQ | Accuracy | 250 |
| | 5.2 | Privacy & Data Protection | Identify and redact sensitive or personal data in legal texts to ensure privacy compliance. | MCQ | Accuracy | 216 |
| | 5.3 | Ethical Consistency Assessment | Evaluate whether the model's outputs align with professional ethics and moral standards in legal reasoning. | MCQ | Accuracy | 200 |
| | 5.4 | Unfair Contract Detection | Compare model judgments across similar cases or parties to assess impartiality and equitable reasoning. | MCQ | Accuracy | 234 |
**Table 1:** Overview of VietLegal: The benchmark evaluates legal LLMs across five levels, from basic recognition to ethical reasoning, using five question templates: Multiple-Choice Question Answering (MCQ), Multi-Label Classification (MLC), Binary Classification (BC), Named Entity Recognition (NER) and Generation for Vietnamese law.
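As a quick consistency check, the test-set sizes in Table 1 can be tallied per level and compared with the stated total of 10,450 samples. The sketch below transcribes the sizes from the table; the names `TEST_SET_SIZES` and `samples_per_level` are illustrative and not part of the released code.

```python
# Test-set sizes transcribed from Table 1, keyed by task ID.
TEST_SET_SIZES = {
    "1.1": 750, "1.2": 700, "1.3": 300, "1.4": 1000, "1.5": 800,
    "2.1": 253, "2.2": 300, "2.3": 296, "2.4": 600, "2.5": 1359,
    "3.1": 600, "3.2": 600, "3.3": 292, "3.4": 161, "3.5": 358,
    "4.1": 384, "4.2": 299, "4.3": 498,
    "5.1": 250, "5.2": 216, "5.3": 200, "5.4": 234,
}


def samples_per_level(sizes):
    """Sum test-set sizes by level (the part of the task ID before the dot)."""
    totals = {}
    for task_id, count in sizes.items():
        level = int(task_id.split(".")[0])
        totals[level] = totals.get(level, 0) + count
    return totals


per_level = samples_per_level(TEST_SET_SIZES)
total = sum(per_level.values())
print(per_level)  # {1: 3550, 2: 2808, 3: 2011, 4: 1181, 5: 900}
print(total)      # 10450
```

The per-level sums add up to the 10,450 samples reported in the abstract, with Levels 1 and 2 accounting for roughly 60% of the benchmark.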
## 2 Related Work
### Legal LLM Benchmarks
Early legal NLP benchmarks primarily targeted isolated tasks such as judgment prediction or statute classification, exemplified by CaseHOLD (Zheng et al., 2021). More recent efforts have shifted toward multi-task evaluations of general legal intelligence, most notably LexGLUE (Chalkidis et al., 2022) and LegalBench (Guha et al., 2023), which emphasize legal reasoning beyond surface-level language understanding. Parallel developments include benchmarks for legally specialized LLMs (Cui et al., 2023; Yue et al., 2023) and civil law-oriented resources, particularly for Chinese (Fei et al., 2024; Dai et al., 2025). European civil law benchmarks, such as French statutory retrieval (Louis and Spanakis, 2022) and German civil law QA (Büttner and Habernal, 2024), further highlight structural differences in codified legal systems. Despite this progress, low-resource languages and many civil law jurisdictions in the Global South remain underrepresented, leaving Vietnamese law largely unexplored in existing benchmark landscapes.
### Vietnamese Legal NLP
Vietnamese legal NLP research has been driven largely by community-led shared tasks, particularly through the VLSP workshops (Nguyen et al., 2021), which have produced datasets for legal retrieval, entailment, and question answering. Pre-trained models such as PhoBERT (Nguyen and Nguyen, 2020) and ViT5 (Phan et al., 2022) have enabled strong performance on these foundational tasks. However, existing resources are fragmented and predominantly retrieval- or extraction-focused, offering limited evaluation of generative reasoning, multi-step inference, or legislative amendment tracking. Recent work on Vietnamese legal RAG systems (Nguyen et al., 2024) further underscores the lack of a unified, standardized benchmark capable of evaluating realistic legal assistant workflows.
### Cognitive Evaluation and Metrics
Recent benchmarking efforts increasingly draw on cognitive frameworks to distinguish memorization from higher-order reasoning. Bloom's Taxonomy has been adopted to structure task difficulty, while Chain-of-Thought prompting (Wei et al., 2022) has emphasized the importance of evaluating intermediate reasoning. In legal NLP, evaluation remains challenging, as standard generation metrics often correlate poorly with factual correctness (Liu et al., 2023). We therefore adopt a hybrid evaluation strategy, combining extraction-based metrics for lower cognitive levels with generation metrics for higher-level tasks. Crucially, legal reasoning is grounded in legal syllogism and subsumption theory (Alexy, 1989), where correctly applying statutory norms to factual scenarios is central.
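To make the two metric families concrete, the following is a minimal sketch of exact-match accuracy (used for the MCQ tasks) and a sentence-level ROUGE-L F1 (used for the generation tasks). It is not the benchmark's evaluation code: whitespace tokenization and an equal weighting of precision and recall are assumptions, and whitespace splitting yields syllables rather than words for Vietnamese.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


mcq_acc = accuracy(["A", "B", "C"], ["A", "B", "D"])   # 2 of 3 correct
rouge = rouge_l_f1("a b c d", "a c d e")               # LCS is "a c d"
print(round(mcq_acc, 4), round(rouge, 4))
```

Because ROUGE-L rewards only the longest common subsequence of tokens, a fluent answer that cites the wrong article can still score well, which is one reason overlap-based metrics may track factual correctness poorly.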
## 3 VietLegal
### 3.1 Design Principle of VietLegal
VietLegal is organized around a hierarchical cognitive framework inspired by Bloom's taxonomy and adapted to the linguistic and structural properties of Vietnamese law, comprising five levels of legal cognition ranging from factual recognition to advanced legal reasoning. Each task is explicitly designed to reflect challenges inherent to civil-law systems, with Levels 3 and 4 in particular targeting complex statutory reasoning phenomena such as multi-article dependency, hierarchical interpretation across legal instruments, and consistency analysis under overlapping or amended regulations. Although developed in Vietnamese, VietLegal provides a replicable framework for evaluating AI in codified legal systems. Its task design reflects realistic legal assistant use cases and targets core civil-law reasoning patterns rather than case-based analysis, enabling straightforward adaptation to other civil-law languages and jurisdictions.
An overview of the benchmark is shown in Table 1, with detailed task descriptions in Appendix G.
**Level 1 - Recognition & Recall** targets foundational legal literacy in the Vietnamese context. It evaluates whether an LLM can accurately identify and retrieve core legal entities, concepts, and statutory provisions within dense and highly cross-referenced legal texts. These tasks assess basic factual competence, which is a prerequisite for deeper legal understanding, and simulate real-world interactions where users seek clarification of fundamental legal information.
**Level 2 - Understanding & Structuring** examines an LLM's ability to comprehend and organize complex statutory content. Given the hierarchical structure of Vietnamese law and its frequent amendments, this level evaluates whether models can capture relationships among Articles, Clauses, and Points, and represent legal norms as a coherent, evolving system. The tasks reflect practical legal assistant scenarios, including analyzing long legal documents, verifying judicial decisions, and explaining statutory relationships to users.
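For the Legal Graph Structuring task (2.3), Table 1 lists Node-F1 and Edge-F1 as metrics. One plausible reading, sketched below, is a set-based F1 over predicted and gold nodes, and over predicted and gold (head, relation, tail) triples. Exact string matching, the relation names, and the toy graph are all assumptions made for illustration; the paper's matching rules may differ.

```python
def set_f1(predicted, gold):
    """F1 between two collections, treating each as a set of exact items."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted graphs for a single document.
gold_nodes = {"Article 5", "Article 12", "Clause 2"}
pred_nodes = {"Article 5", "Clause 2", "Article 9"}

gold_edges = {
    ("Article 12", "refers_to", "Article 5"),
    ("Clause 2", "part_of", "Article 12"),
}
pred_edges = {("Article 12", "refers_to", "Article 5")}

node_f1 = set_f1(pred_nodes, gold_nodes)
edge_f1 = set_f1(pred_edges, gold_edges)
print(round(node_f1, 3), round(edge_f1, 3))
```

Scoring nodes and edges separately distinguishes a model that finds the right provisions but misses the references between them from one that gets the structure right.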
**Level 3 - Reasoning & Inference** assesses the model's capacity to apply legal provisions to factual scenarios through logical and multi-step reasoning. Tasks at this level require predicting relevant articles, estimating penalties or remedies, synthesizing information across multiple statutes, and resolving conflicts between overlapping or amended legal norms. These skills are essential for realistic legal problem-solving and judicial support.
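Handling amended norms usually begins with identifying which version of a provision was in force on the date of the facts. The sketch below illustrates that lookup with a hypothetical `ProvisionVersion` record and placeholder dates and texts; it is not drawn from the benchmark data or from any actual Vietnamese statute.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ProvisionVersion:
    provision_id: str              # hypothetical identifier
    effective_from: date           # first day this version applies
    effective_to: Optional[date]   # last day it applies; None = still in force
    text: str


def version_in_force(versions: List[ProvisionVersion],
                     on: date) -> Optional[ProvisionVersion]:
    """Return the version applicable on the given date, or None if there is none."""
    candidates = [
        v for v in versions
        if v.effective_from <= on
        and (v.effective_to is None or on <= v.effective_to)
    ]
    if not candidates:
        return None
    # If validity periods overlap, prefer the most recently enacted version.
    return max(candidates, key=lambda v: v.effective_from)


# Placeholder history: one original version and one amendment.
history = [
    ProvisionVersion("Article X", date(2015, 1, 1), date(2020, 12, 31),
                     "original wording (placeholder)"),
    ProvisionVersion("Article X", date(2021, 1, 1), None,
                     "amended wording (placeholder)"),
]

before = version_in_force(history, date(2019, 6, 1))
after = version_in_force(history, date(2022, 6, 1))
none_yet = version_in_force(history, date(2010, 1, 1))
print(before.text, "|", after.text, "|", none_yet)
```

Preferring the latest version when periods overlap is a simplification; real conflict resolution also depends on the rank of the issuing instrument and on transitional provisions.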
**Level 4 - Interpretation & Generation** evaluates higher-order interpretive and generative abilities. This level tests whether an LLM can produce coherent, accurate, and unbiased legal texts, such as statute summaries,