Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
Summary
DOMINO is a novel framework that learns minimal sufficient domain representations from reference examples to synthesize domain-specific data for LLMs, improving code benchmark performance without requiring explicit domain descriptions.
Cached at: 06/03/26, 03:38 PM
Source: https://huggingface.co/papers/2605.30039
Abstract
DOMINO enables domain-specific data synthesis through an inductive approach that learns domain representations from reference examples, improving code benchmark performance without requiring explicit domain descriptions.
Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.
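The abstract reports gains in Pass@1 accuracy but does not describe the evaluation protocol. As a rough illustration only, the sketch below uses the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), which for k = 1 reduces to the fraction of correct samples. Whether the paper uses this estimator is an assumption; the function names and the per-problem numbers are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples generated, samples passing).
results = [(10, 3), (10, 0), (10, 10), (10, 5)]
score = mean_pass_at_k(results, k=1)
print(f"pass@1 = {score:.3f}")
```

Under this estimator, a reported improvement in Pass@1 corresponds to a higher average fraction of first-attempt solutions that pass the benchmark's tests.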
Similar Articles
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
This paper introduces SOMA, a framework for efficient multi-turn LLM serving that uses small language models adapted via soft prompts and LoRA fine-tuning to reduce latency and cost.
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning
This paper presents Connect the Dots (CoD), a framework for training LLMs via reinforcement learning to develop meta-capabilities for long-lifecycle agents, enabling continuous learning and cross-domain generalization.
DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
DALM proposes a domain-algebraic language model that generates text under exact structural constraints derived from a domain lattice, addressing hallucination by organizing knowledge into separate domain fibers with algebraic guarantees. The model uses three-phase structured denoising (domain → relation → concept) with domain-annotated training data to prevent cross-domain contamination.
Auditing Training Data in Domain-adapted LLMs: LoRA-MINT
LoRA-MINT is a methodology for membership inference testing on LLMs fine-tuned with LoRA, achieving high precision in determining if data was used in training, outperforming baselines.
An Information-Theoretic Criterion for Efficient Data Synthesis
This paper provides an information-theoretic account of when synthetic data improves or degrades LLM training, distinguishing between information-open and information-closed generation loops and explaining collapse via the data processing inequality.