EmbGen: Teaching with Reassembled Corpora

arXiv cs.CL Papers

Summary

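EmbGen is a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using embedding similarity, and generates diverse QA pairs for fine-tuning small language models on specialized domains. On the most heterogeneous of three evaluation datasets, it improves Binary Accuracy over the strongest baseline by 12.5% at a 5M-token budget and by 88.9% at a 20M-token budget.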

arXiv:2605.19394v1 (announce type: new)

Abstract: Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). We use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composed metric combining Factual Accuracy and Completeness for evaluation. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at 5M and 88.9% at 20M tokens budget, relative to the strongest baseline, while remaining competitive across other datasets with lower heterogeneity.

Cached at: 05/20/26, 08:25 AM

# EmbGen: Teaching with Reassembled Corpora
Source: [https://arxiv.org/abs/2605.19394](https://arxiv.org/abs/2605.19394)
[View PDF](https://arxiv.org/pdf/2605.19394)

> Abstract: Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). We use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composed metric combining Factual Accuracy and Completeness for evaluation. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at 5M and 88.9% at 20M tokens budget, relative to the strongest baseline, while remaining competitive across other datasets with lower heterogeneity.
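As a rough illustration of the three sampling modes the abstract names (proximity, intra-cluster, and inter-cluster), the sketch below pairs up entity descriptions that a teacher LLM could then be prompted with. It is not the authors' implementation. The abstract does not specify the clustering or sampling algorithms, so the greedy threshold clustering, the toy 2-D embeddings, the entity names, and every function name here are assumptions made purely for illustration.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_clusters(embeddings, threshold=0.8):
    """Hypothetical clustering: join the first cluster whose seed is similar enough."""
    seeds, labels = [], []
    for vec in embeddings:
        for cid, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(cid)
                break
        else:
            # No existing seed is close enough, so this vector starts a new cluster.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


def proximity_pairs(embeddings, k=1):
    """Pair each item with its k nearest neighbours, ignoring cluster labels."""
    pairs = set()
    for i, vec in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: cosine(vec, embeddings[j]), reverse=True)
        for j in others[:k]:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def intra_cluster_pairs(labels):
    """All index pairs whose members share a cluster."""
    return [(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]]


def inter_cluster_pairs(labels, n, rng):
    """Up to n randomly chosen index pairs whose members come from different clusters."""
    candidates = [(i, j) for i, j in combinations(range(len(labels)), 2)
                  if labels[i] != labels[j]]
    return rng.sample(candidates, min(n, len(candidates)))


# Toy entity-description pairs with made-up 2-D embeddings (illustration only).
entities = [
    ("entity_a", "description of a", (1.0, 0.1)),
    ("entity_b", "description of b", (0.9, 0.2)),
    ("entity_c", "description of c", (0.1, 1.0)),
    ("entity_d", "description of d", (0.2, 0.9)),
]
embeddings = [emb for _, _, emb in entities]

labels = greedy_clusters(embeddings)
near = proximity_pairs(embeddings, k=1)
intra = intra_cluster_pairs(labels)
inter = inter_cluster_pairs(labels, n=2, rng=random.Random(0))

for name, pairs in [("proximity", near), ("intra-cluster", intra), ("inter-cluster", inter)]:
    print(name, [(entities[i][0], entities[j][0]) for i, j in pairs])
```

In this toy setup the proximity and intra-cluster pairs coincide because the two clusters are tight and well separated. On a real corpus the three modes would generally select different pairs, with inter-cluster pairs being the ones that link otherwise unrelated passages or documents.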

## Submission history

From: Anna Leontjeva [[view email](https://arxiv.org/show-email/cfdcde18/2605.19394)] **[v1]** Tue, 19 May 2026 05:40:12 UTC (2,573 KB)
