EmbGen: Teaching with Reassembled Corpora

arXiv cs.CL 05/20/26, 04:00 AM Papers

synthetic-data fine-tuning instruction-tuning domain-adaptation embedding question-answering llm

Summary

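EmbGen is a synthetic data generation pipeline that decomposes a domain corpus into entity-description pairs, reassembles them using embedding similarity, and generates diverse QA pairs for fine-tuning small language models on specialized domains. On the most heterogeneous of three evaluation datasets, it improves Binary Accuracy over the strongest baseline by 12.5% at a 5M-token budget and 88.9% at a 20M-token budget.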

arXiv:2605.19394v1 Announce Type: new

Abstract: Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which are expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). For evaluation, we use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composite metric combining Factual Accuracy and Completeness. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at the 5M-token budget and 88.9% at the 20M-token budget, relative to the strongest baseline, while remaining competitive on the other, less heterogeneous datasets.
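The three sampling modes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy entities, the hand-written 3-dimensional embeddings, and the pre-assigned cluster labels are all assumptions made for illustration. A real pipeline would take embeddings from an embedding model and labels from a clustering step, then pass each sampled group to a teacher LLM with a cluster-specific system prompt.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def proximity_sample(anchor, embeddings, k=2):
    """Return the anchor entity plus its k nearest neighbours by cosine similarity."""
    others = [e for e in embeddings if e != anchor]
    others.sort(key=lambda e: cosine(embeddings[anchor], embeddings[e]), reverse=True)
    return [anchor] + others[:k]


def group_by_cluster(labels):
    """Invert an entity -> cluster-label mapping into label -> list of entities."""
    clusters = defaultdict(list)
    for entity, label in labels.items():
        clusters[label].append(entity)
    return clusters


def intra_cluster_sample(clusters, label, size, rng):
    """Draw up to `size` distinct entities from a single cluster."""
    members = clusters[label]
    return rng.sample(members, min(size, len(members)))


def inter_cluster_sample(clusters, rng, per_cluster=1):
    """Draw entities from two different clusters (requires at least two clusters)."""
    first, second = rng.sample(sorted(clusters), 2)
    return (intra_cluster_sample(clusters, first, per_cluster, rng)
            + intra_cluster_sample(clusters, second, per_cluster, rng))


# Toy data (hypothetical): entity name -> embedding, and entity name -> cluster label.
EMBEDDINGS = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "paracetamol": [0.7, 0.3, 0.0],
    "contract": [0.1, 0.9, 0.2],
    "tort": [0.0, 0.8, 0.3],
}
LABELS = {"aspirin": 0, "ibuprofen": 0, "paracetamol": 0, "contract": 1, "tort": 1}

if __name__ == "__main__":
    rng = random.Random(0)
    clusters = group_by_cluster(LABELS)
    print("proximity:", proximity_sample("aspirin", EMBEDDINGS, k=2))
    print("intra-cluster:", intra_cluster_sample(clusters, 0, 2, rng))
    print("inter-cluster:", inter_cluster_sample(clusters, rng))
```

In this sketch, proximity sampling groups an entity with its nearest neighbours, intra-cluster sampling groups entities from one topic, and inter-cluster sampling pairs entities from different topics. The last mode is one plausible way to target the cross-document dependencies the abstract says existing pipelines miss.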

Cached at: 05/20/26, 08:25 AM

# EmbGen: Teaching with Reassembled Corpora
Source: [https://arxiv.org/abs/2605.19394](https://arxiv.org/abs/2605.19394)
[View PDF](https://arxiv.org/pdf/2605.19394)

## Submission history

From: Anna Leontjeva

**[v1]** Tue, 19 May 2026 05:40:12 UTC (2,573 KB)

Similar Articles

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

BeLink: Biomedical Entity Linking Meets Generative Re-Ranking
