Generating training datasets for legal chatbots in Korean

arXiv cs.CL Papers

Summary

This paper presents a method for generating large-scale, labeled training datasets for Korean legal chatbots from Local Grammar Graphs; a DIET classifier trained on the generated data reaches a 91% F1-score.



# Generating training datasets for legal chatbots in Korean
Source: [https://arxiv.org/abs/2605.07432](https://arxiv.org/abs/2605.07432)
[View PDF](https://arxiv.org/pdf/2605.07432)

> Abstract: Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user's intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels to the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users' conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached 91% f1-score performance. We implemented a chatbot called LIGA, which uses the results of the model to select a link to a web page that documents similar legal cases.
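
The idea of jointly generating utterances and labels from a graph can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use Unitex: the graph encoding, the state names, the English phrases, and the labels `wage_arrears` and `deposit_return` are all hypothetical stand-ins chosen for illustration. Each start-to-end path through a graph yields one utterance, and the graph it came from supplies the label.

```python
from typing import Iterator

# Assumed toy encoding: each graph maps a state to its outgoing edges,
# written as (phrase, next_state). An empty phrase models an optional element.
# States, phrases, and labels are hypothetical, not taken from the paper.
Graph = dict[str, list[tuple[str, str]]]

GRAPHS: dict[str, Graph] = {
    "wage_arrears": {
        "start": [("my employer", "adv"), ("my boss", "adv")],
        "adv": [("still", "pred"), ("", "pred")],
        "pred": [("has not paid", "obj"), ("refuses to pay", "obj")],
        "obj": [("my wages", "end"), ("my overtime pay", "end")],
    },
    "deposit_return": {
        "start": [("my landlord", "pred")],
        "pred": [("will not return", "obj"), ("is keeping", "obj")],
        "obj": [("my deposit", "end"), ("my security deposit", "end")],
    },
}


def paths(graph: Graph, state: str = "start",
          prefix: tuple[str, ...] = ()) -> Iterator[str]:
    """Depth-first enumeration of every start-to-end path as one utterance."""
    if state == "end":
        # Drop empty phrases so optional elements leave no double spaces.
        yield " ".join(phrase for phrase in prefix if phrase)
        return
    for phrase, next_state in graph[state]:
        yield from paths(graph, next_state, prefix + (phrase,))


def generate(graphs: dict[str, Graph]) -> Iterator[tuple[str, str]]:
    """Yield (utterance, label) pairs; the label is the graph's own key."""
    for label, graph in graphs.items():
        for utterance in paths(graph):
            yield utterance, label


dataset = list(generate(GRAPHS))
print(len(dataset))   # number of labelled utterances generated
print(dataset[0])     # first (utterance, label) pair
```

In a graph of this shape, the number of generated utterances is the product of the branching factors along the path, so adding a few alternatives at each state multiplies the output. This is one way to see how hand-built graphs can yield very large labelled datasets without labelling any utterance individually.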

## Submission history

From: Eric Laporte [[view email](https://arxiv.org/show-email/9dabeb0b/2605.07432)] **[v1]** Fri, 8 May 2026 08:32:56 UTC (825 KB)
