Tag
ReGeN is a reference-guided generative pipeline for multivariate time series that decomposes observed sequences into a periodic backbone, stochastic residuals, and cross-variable dependencies in order to synthesize controllable data. The authors report that the generated data can substitute for real data in forecasting tasks and that it outperforms prior synthetic data generators.
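To make the three-way decomposition concrete, here is a minimal sketch of the general idea, not ReGeN's actual method. It assumes a fixed, known period, uses a per-phase mean as the "backbone", and approximates cross-variable dependencies with the residual covariance. The function names `decompose` and `synthesize` and the `noise_scale` control are hypothetical.

```python
import numpy as np

def decompose(series, period):
    """Split a (T, D) array into a periodic profile, residuals, and residual covariance.

    Assumes T >= period and D >= 2 (illustrative simplifications).
    """
    T, _ = series.shape
    phases = np.arange(T) % period
    # Periodic backbone: average value of each variable at each phase of the cycle.
    profile = np.stack([series[phases == p].mean(axis=0) for p in range(period)])
    # Stochastic residuals: what is left after removing the backbone.
    residuals = series - profile[phases]
    # Cross-variable dependencies, approximated by the residual covariance matrix.
    cov = np.cov(residuals, rowvar=False)
    return profile, residuals, cov

def synthesize(profile, cov, length, noise_scale=1.0, seed=0):
    """Recombine the backbone with freshly sampled, correlated Gaussian noise."""
    rng = np.random.default_rng(seed)
    period, dim = profile.shape
    backbone = profile[np.arange(length) % period]
    noise = rng.multivariate_normal(np.zeros(dim), cov, size=length)
    # noise_scale is the (hypothetical) control knob: 0.0 yields the pure backbone.
    return backbone + noise_scale * noise
```

In this sketch, controllability comes only from `noise_scale` and the choice of `length`; a real pipeline would presumably expose richer controls.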
GenesisFunc is an automated multi-agent pipeline for generating high-quality, diverse synthetic training data for LLM function calling. Fine-tuning an 8B model on this data achieves strong in-domain and out-of-domain performance, rivaling some API-based models.
This paper introduces TabKG, a knowledge-graph-guided framework for generating logically consistent synthetic supply chain tabular data. It uses an LLM ensemble to discover operational dependencies and a latent diffusion model to generate independent columns, achieving high logical consistency while preserving statistical fidelity.
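One way to read the split between independent and dependent columns is sketched below. This is an illustration under stated assumptions, not TabKG's implementation: the independent columns are taken as given (standing in for the generative model's output), and each dependent column is computed from a hand-written rule applied in dependency order. The helper `fill_dependent_columns` and the column names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def fill_dependent_columns(row, rules):
    """Derive dependent columns of one row from its independent columns.

    rules maps a derived column name to (parent column names, function of the row).
    """
    # Build the dependency graph: each derived column points at its parents.
    graph = {col: set(parents) for col, (parents, _) in rules.items()}
    # static_order() yields parents before the columns that depend on them.
    for col in TopologicalSorter(graph).static_order():
        if col in rules:
            row[col] = rules[col][1](row)
    return row

# Hypothetical supply chain rules: totals follow from quantity and price.
rules = {
    "total": (("quantity", "unit_price"), lambda r: r["quantity"] * r["unit_price"]),
    "total_with_tax": (("total",), lambda r: r["total"] * 1.1),
}
row = fill_dependent_columns({"quantity": 5, "unit_price": 2.0}, rules)
```

Because derived values are computed rather than sampled, every generated row satisfies the rules by construction, which is one plausible route to the logical consistency the blurb describes.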
Hugging Face shared slides detailing how they generated 1 trillion tokens of synthetic data for training foundation models.
This paper proposes Multi-Stage In-Flight Rejection (MSIFR), a training-free framework that reduces token waste in LLM-based synthetic data generation by detecting and terminating low-quality generation trajectories at intermediate checkpoints. Across five models and seven benchmarks, MSIFR reduces token consumption by 11–77% as a standalone method and up to 78.2% when combined with early-exit methods, while preserving or improving accuracy.
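The checkpoint-and-terminate idea can be sketched as follows. This is a simplified illustration, not the MSIFR algorithm itself: it assumes tokens arrive from a lazy stream, that some scorer can rate a partial output, and that a single fixed threshold applies at every checkpoint. The names `generate_with_rejection` and `score_partial` are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

def generate_with_rejection(
    token_stream: Iterable[str],
    score_partial: Callable[[List[str]], float],
    checkpoints: Iterable[int],
    threshold: float,
) -> Tuple[Optional[List[str]], int]:
    """Return (tokens, consumed), with tokens set to None if the run was rejected."""
    checkpoint_set = set(checkpoints)
    tokens: List[str] = []
    for token in token_stream:
        tokens.append(token)
        # At each checkpoint, score the partial trajectory and stop early if it is weak.
        if len(tokens) in checkpoint_set and score_partial(tokens) < threshold:
            return None, len(tokens)
    return tokens, len(tokens)

# Toy scorer (an assumption): fraction of tokens that are not the literal "bad".
def toy_score(tokens: List[str]) -> float:
    return sum(t != "bad" for t in tokens) / len(tokens)

rejected, used_rejected = generate_with_rejection(
    iter(["ok", "bad", "bad", "ok", "ok", "ok"]), toy_score, [3], 0.5
)
accepted, used_accepted = generate_with_rejection(
    iter(["ok"] * 6), toy_score, [3], 0.5
)
```

Because the stream is consumed lazily, a rejected trajectory stops drawing tokens at the checkpoint, which is where the savings would come from.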
Abliteration launches a made-to-order synthetic training data workflow that generates negative, rare, and adversarial examples for classifiers. Each dataset comes with a schema, real-world facts, labels, and provenance, and can be exported to platforms like Hugging Face.