Generating training datasets for legal chatbots in Korean
Summary
This paper presents a method for generating large-scale, labeled training datasets for legal chatbots in Korean using Local Grammar Graphs, achieving 91% F1-score with a DIET classifier.
# Generating training datasets for legal chatbots in Korean

Source: [https://arxiv.org/abs/2605.07432](https://arxiv.org/abs/2605.07432) [View PDF](https://arxiv.org/pdf/2605.07432)

> Abstract: Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user's intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels to the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users' conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached 91% F1-score performance. We implemented a chatbot called LIGA, which uses the results of the model to select a link to a web page that documents similar legal cases.
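
The idea of jointly generating utterances and labels from a graph can be illustrated with a minimal sketch. This is not the authors' LGGs or the Unitex API: the graph structure, the English placeholder phrases (standing in for Korean), the label names, and the `generate` function are all hypothetical, and the sketch assumes a small acyclic graph. Each arc carries some words and, optionally, an output label; enumerating every path yields one labelled utterance per path.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy graph, not taken from the paper.
# Each node maps to a list of arcs: (words, output_label, next_node).
# An empty output_label means the arc contributes no label.
Arc = Tuple[str, str, str]
GRAPH: Dict[str, List[Arc]] = {
    "start": [
        ("my landlord", "", "landlord"),
        ("my employer", "", "employer"),
    ],
    "landlord": [
        ("will not return my deposit", "housing/deposit", "end"),
        ("wants to evict me", "housing/eviction", "end"),
    ],
    "employer": [
        ("has not paid my wages", "labor/unpaid_wages", "end"),
        ("dismissed me without notice", "labor/dismissal", "end"),
    ],
    "end": [],  # terminal node: no outgoing arcs
}


def generate(
    graph: Dict[str, List[Arc]],
    node: str = "start",
    words: Tuple[str, ...] = (),
    labels: Tuple[str, ...] = (),
) -> Iterator[Tuple[str, str]]:
    """Depth-first enumeration of all paths from `node` to a terminal node.

    Yields (utterance, label) pairs. Assumes the graph is acyclic;
    a cyclic graph would make this recursion run forever.
    """
    arcs = graph[node]
    if not arcs:
        # Terminal node reached: emit the accumulated words and labels.
        yield " ".join(words), "|".join(labels)
        return
    for text, label, next_node in arcs:
        next_labels = labels + (label,) if label else labels
        yield from generate(graph, next_node, words + (text,), next_labels)


dataset = list(generate(GRAPH))
for utterance, label in dataset:
    print(f"{label}\t{utterance}")
```

Because the number of paths multiplies at every branching node, even modest graphs can yield very large datasets, which is consistent with the scale reported in the abstract. Each utterance receives its label at generation time, so no separate annotation pass is needed.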
## Submission history

From: Eric Laporte

**[v1]** Fri, 8 May 2026 08:32:56 UTC (825 KB)
Similar Articles
Korean Culture into LLM Alignment: Toward Cultural Coherence
This paper proposes a dataset generation pipeline to align large language models with Korean cultural norms using DPO fine-tuning, improving cultural safety without degrading general performance.
KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems
KG2Cypher presents a data-centric pipeline for building enterprise text-to-Cypher systems from existing knowledge graphs. It uses LLMs to generate natural language question-Cypher pairs, validated by an LLM judge and human review, and achieves significant performance improvements on Korean enterprise datasets with LoRA-based fine-tuning.
Optimizing Korean-Centric LLMs via Token Pruning
This paper presents a systematic benchmark of token pruning—a compression technique that removes tokens and embeddings for irrelevant languages—applied to Korean-centric LLM tasks. The study evaluates popular multilingual models (Qwen3, Gemma-3, Llama-3, Aya) across different vocabulary configurations and finds that token pruning significantly improves generation stability and reduces memory footprint for domain-specific deployments.
LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
Researchers release LegalBench-BR, the first public benchmark for evaluating LLMs on Brazilian legal text classification, showing LoRA-fine-tuned BERTimbau dramatically outperforms GPT-4o mini and Claude 3.5 Haiku.
How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas
NVIDIA's Nemotron-Personas-Korea is a dataset of 6-7 million synthetic personas grounded in official Korean demographic statistics, designed to help build culturally accurate Korean AI agents while complying with Korea's Personal Information Protection Act (PIPA). The tutorial demonstrates how to filter personas and deploy a grounded Korean AI agent using hosted APIs in approximately 20 minutes.