EmbGen: Teaching with Reassembled Corpora
Summary
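EmbGen is a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using embedding similarity, and generates diverse QA pairs for fine-tuning small language models on specialized domains. On the most heterogeneous of its three evaluation datasets, it improves Binary Accuracy over the strongest baseline by 12.5% at a 5M-token budget and 88.9% at a 20M-token budget.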
Cached at: 05/20/26, 08:25 AM
# EmbGen: Teaching with Reassembled Corpora

Source: [https://arxiv.org/abs/2605.19394](https://arxiv.org/abs/2605.19394) [View PDF](https://arxiv.org/pdf/2605.19394)

> Abstract: Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). We use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composed metric combining Factual Accuracy and Completeness for evaluation. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at 5M and 88.9% at 20M tokens budget, relative to the strongest baseline, while remaining competitive across other datasets with lower heterogeneity.

## Submission history

From: Anna Leontjeva

**[v1]** Tue, 19 May 2026 05:40:12 UTC (2,573 KB)
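The abstract names three ways of sampling entity-description pairs before QA generation: proximity, intra-cluster, and inter-cluster. The sketch below illustrates one plausible reading of those three modes on toy data. It is not the authors' implementation: the entity names, the 2-D embeddings, the greedy threshold clustering, and every function name here are assumptions made for illustration, and the teacher-LLM call and cluster-specialized system prompts are left out entirely.

```python
import math
import random
from itertools import combinations

# Hypothetical entities with toy 2-D embeddings (assumed, not from the paper).
ENTITIES = {
    "aspirin": [1.0, 0.1],
    "ibuprofen": [0.9, 0.2],
    "naproxen": [0.8, 0.3],
    "tort": [0.1, 1.0],
    "contract": [0.2, 0.9],
    "lien": [0.3, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_by_threshold(embs, threshold=0.9):
    """Greedy stand-in for clustering: an entity joins the first cluster
    whose seed (first member) is at least `threshold` similar to it."""
    clusters = []
    for name, vec in embs.items():
        for cluster in clusters:
            if cosine(vec, embs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def proximity_pairs(embs, k=1):
    """Pair each entity with its k nearest neighbours by cosine similarity."""
    pairs = set()
    for name, vec in embs.items():
        others = sorted(
            (n for n in embs if n != name),
            key=lambda n: cosine(vec, embs[n]),
            reverse=True,
        )
        for neighbour in others[:k]:
            pairs.add(tuple(sorted((name, neighbour))))
    return sorted(pairs)


def intra_cluster_pairs(clusters):
    """All unordered pairs of entities that share a cluster."""
    return [pair for cluster in clusters for pair in combinations(cluster, 2)]


def inter_cluster_pairs(clusters, rng, n=3):
    """Draw n pairs whose two members come from different clusters."""
    pairs = []
    for _ in range(n):
        first, second = rng.sample(clusters, 2)
        pairs.append((rng.choice(first), rng.choice(second)))
    return pairs


if __name__ == "__main__":
    clusters = cluster_by_threshold(ENTITIES)
    print("clusters:", clusters)
    print("proximity:", proximity_pairs(ENTITIES))
    print("intra-cluster:", intra_cluster_pairs(clusters))
    print("inter-cluster:", inter_cluster_pairs(clusters, random.Random(0)))
```

In a pipeline of this shape, each sampled pair (with its descriptions) would be handed to a teacher model to write a QA pair. Inter-cluster pairs are the ones most likely to produce questions that span passages or documents, which is the dependency the abstract says existing pipelines do not consistently capture.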
Similar Articles
SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia
SEA-Embedding presents a fully open and reproducible text embedding pipeline for Southeast Asian languages, trained solely on public data, achieving state-of-the-art results on the SEA-BED benchmark.
What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework
This paper presents a corpus-centric diagnostic framework for analyzing biomedical NER and EL benchmarks, revealing substantial differences across nine corpora and arguing that standard statistics are insufficient for characterizing evaluation demands.
MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A
MM-BizRAG is a multimodal retrieval-augmented generation system for enterprise Q&A that uses document structure-aware splitting and layout-aware parsing to outperform vision-centric baselines by up to 32% on heterogeneous enterprise documents. The paper also introduces FastRAGEval, a cost-efficient LLM-based evaluation metric with stronger human alignment than RAGChecker.
Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training
Q-RAG introduces a reinforcement learning-based fine-tuning approach for embedder models to enable efficient multi-step retrieval, achieving state-of-the-art results on long-context benchmarks up to 10M tokens. This method provides a resource-efficient alternative to fine-tuning small LLMs for complex multi-step search tasks.
BeLink: Biomedical Entity Linking Meets Generative Re-Ranking
BeLink introduces a set-wise instruction-tuning formulation for generative re-ranking in biomedical entity linking, achieving 3-24% accuracy improvements and faster inference compared to state-of-the-art systems.