Generating training datasets for legal chatbots in Korean

arXiv cs.CL 05/11/26, 04:00 AM Papers

legal-tech dataset-generation korean-nlp chatbot intent-classification arxiv

Summary

This paper presents a method for generating large-scale, labeled training datasets for legal chatbots in Korean using Local Grammar Graphs, achieving 91% F1-score with a DIET classifier.

arXiv:2605.07432v1 Announce Type: new Abstract: Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user's intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels to the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users' conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached 91% f1-score performance. We implemented a chatbot called LIGA, which uses the results of the model to select a link to a web page that documents similar legal cases.

Cached at: 05/11/26, 07:03 AM

# Generating training datasets for legal chatbots in Korean
Source: [https://arxiv.org/abs/2605.07432](https://arxiv.org/abs/2605.07432)
[View PDF](https://arxiv.org/pdf/2605.07432)

> Abstract: Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user's intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels to the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users' conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached 91% f1-score performance. We implemented a chatbot called LIGA, which uses the results of the model to select a link to a web page that documents similar legal cases.
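
The idea of jointly generating utterances and labels can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use Unitex: it assumes a toy grammar in which each intent label owns a linear chain of slots, which is a simplified special case of a graph with branching paths. The labels, the English phrases, and the names `TOY_GRAMMARS` and `generate_labelled` are all hypothetical and chosen only for readability.

```python
from itertools import product

# Hypothetical toy grammars (not taken from the paper): each intent label maps
# to a chain of slots, and each slot lists interchangeable phrases.
# An empty string marks a slot that may be skipped.
TOY_GRAMMARS = {
    "unpaid_wages": [
        ["my employer", "my boss", "the company"],
        ["has not paid", "refuses to pay"],
        ["my wages", "my salary"],
        ["", "for two months"],
    ],
    "deposit_return": [
        ["my landlord", "the owner"],
        ["will not return", "is keeping"],
        ["my deposit", "the deposit"],
    ],
}


def generate_labelled(grammars):
    """Yield (utterance, label) pairs by enumerating every path through each chain."""
    for label, slots in grammars.items():
        for choice in product(*slots):
            # Drop skipped optional slots and join the remaining phrases.
            utterance = " ".join(part for part in choice if part)
            yield utterance, label


dataset = list(generate_labelled(TOY_GRAMMARS))
print(len(dataset))  # 3*2*2*2 + 2*2*2 = 32
print(dataset[0])    # ('my employer has not paid my wages', 'unpaid_wages')
```

Because every path inherits the label of the grammar it came from, no separate annotation pass is needed, and the number of utterances grows multiplicatively with the alternatives in each slot. That combinatorial growth is one plausible reading of how a set of hand-built graphs can yield a volume on the order of the 700 million utterances reported in the abstract.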

## Submission history

From: Eric Laporte \[[view email](https://arxiv.org/show-email/9dabeb0b/2605.07432)\] **\[v1\]** Fri, 8 May 2026 08:32:56 UTC (825 KB)

Similar Articles

Korean Culture into LLM Alignment: Toward Cultural Coherence

KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems

Optimizing Korean-Centric LLMs via Token Pruning

LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas
