data-generation

#data-generation

REGEN: Reference-Guided Synthetic Multivariate Time Series Generation for Forecasting

arXiv cs.LG · 21h ago

ReGeN is a reference-guided generative pipeline for multivariate time series. It decomposes observed sequences into a periodic backbone, stochastic residuals, and cross-variable dependencies, and uses these components to generate controllable synthetic data. The paper reports that the generated data can substitute for real data in forecasting tasks and that it outperforms prior synthetic data generators.
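The decomposition idea in the summary can be illustrated with a minimal sketch. This is not the paper's method: it assumes a known fixed period, uses a per-phase mean as the "periodic backbone", and stands in a Gaussian with the residual covariance for the "cross-variable dependencies". The function names `decompose` and `synthesize` and the `noise_scale` knob are hypothetical.

```python
import numpy as np

def decompose(series: np.ndarray, period: int):
    """Split a (T, D) reference series into a per-phase mean backbone,
    residuals, and the residual covariance across the D variables.
    Assumes T >= period so every phase has at least one sample."""
    T, D = series.shape
    phases = np.arange(T) % period
    # Backbone: average of all observations that share the same phase.
    backbone = np.stack([series[phases == p].mean(axis=0) for p in range(period)])
    residuals = series - backbone[phases]
    # Cross-variable dependency, approximated here by residual covariance.
    cov = np.cov(residuals, rowvar=False).reshape(D, D)
    return backbone, residuals, cov

def synthesize(backbone, cov, length, noise_scale=1.0, seed=0):
    """Tile the backbone and add correlated Gaussian noise.
    noise_scale is an illustrative control over residual strength."""
    rng = np.random.default_rng(seed)
    period, D = backbone.shape
    phases = np.arange(length) % period
    noise = rng.multivariate_normal(np.zeros(D), cov, size=length)
    return backbone[phases] + noise_scale * noise

# Toy reference: two periodic variables sharing one noise source.
t = np.arange(240)
shared = np.random.default_rng(1).normal(size=240)
reference = np.column_stack([
    np.sin(2 * np.pi * t / 24) + 0.1 * shared,
    np.cos(2 * np.pi * t / 24) + 0.1 * shared,
])
backbone, residuals, cov = decompose(reference, period=24)
synthetic = synthesize(backbone, cov, length=48)
noise_free = synthesize(backbone, cov, length=48, noise_scale=0.0)
```

With `noise_scale=0.0` the output is the tiled backbone alone, which is one simple way to picture the "controllable" aspect mentioned in the summary.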



GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

arXiv cs.CL · 2026-05-29

GenesisFunc is an automated multi-agent pipeline for generating high-quality, diverse synthetic training data for function-calling in LLMs. Fine-tuning an 8B model on this data achieves strong in-domain and out-of-domain performance, rivaling some API-based models.



Generating Logically Consistent Synthetic Supply Chain Data with LLM-Driven Knowledge Graph Reasoning

arXiv cs.CL · 2026-05-27

This paper introduces TabKG, a knowledge-graph-guided framework for generating logically consistent synthetic supply chain tabular data. It uses an LLM ensemble to discover operational dependencies and a latent diffusion model to generate independent columns, achieving high logical consistency while preserving statistical fidelity.



@yacinelearning: very awesome resource from hugging face with available slides about how they generated 1T synthetic data a really cool …

X AI KOLs Following · 2026-05-26

Hugging Face shared slides detailing how they generated 1 trillion tokens of synthetic data for training foundation models.



Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

arXiv cs.AI · 2026-05-15

This paper proposes Multi-Stage In-Flight Rejection (MSIFR), a training-free framework that reduces token waste in LLM-based synthetic data generation by detecting and terminating low-quality generation trajectories at intermediate checkpoints. Across five models and seven benchmarks, MSIFR reduces token consumption by 11–77% as a standalone method and up to 78.2% when combined with early-exit methods, while preserving or improving accuracy.
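The checkpoint-and-terminate idea can be sketched as follows. This is only an illustration of rejecting a generation partway through, not MSIFR itself: the scorer, checkpoint positions, and thresholds are hypothetical placeholders, and the "model" is a toy token iterator.

```python
from typing import Callable, Dict, Iterator, List, Optional

def generate_with_rejection(
    token_stream: Iterator[str],
    score_partial: Callable[[List[str]], float],
    checkpoints: List[int],
    thresholds: List[float],
) -> Optional[List[str]]:
    """Pull tokens one at a time. At each checkpoint length, score the
    partial output and abandon it if the score is below that stage's
    threshold. Returns None for a rejected trajectory."""
    stages = dict(zip(checkpoints, thresholds))
    tokens: List[str] = []
    for token in token_stream:
        tokens.append(token)
        threshold = stages.get(len(tokens))
        if threshold is not None and score_partial(tokens) < threshold:
            return None  # stop early; the remaining tokens are never produced
    return tokens

def toy_stream(words: List[str], counter: Dict[str, int]) -> Iterator[str]:
    """Stand-in for a model's decoder; counts tokens actually generated."""
    for word in words:
        counter["generated"] += 1
        yield word

def toy_score(tokens: List[str]) -> float:
    """Hypothetical quality score: fraction of tokens that are not filler."""
    return sum(tok != "um" for tok in tokens) / len(tokens)

good_words = "the cat sat on the mat".split()
bad_words = "um um the um um um um um".split()

good_counter = {"generated": 0}
bad_counter = {"generated": 0}
kept = generate_with_rejection(
    toy_stream(good_words, good_counter), toy_score, [2, 4], [0.5, 0.5]
)
rejected = generate_with_rejection(
    toy_stream(bad_words, bad_counter), toy_score, [2, 4], [0.5, 0.5]
)
```

In this toy run the low-quality stream is cut off at the first checkpoint, so only 2 of its 8 tokens are ever generated, while the good stream runs to completion.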



What matters when synthetic training data is generated on demand?

Reddit r/ArtificialInteligence · 2026-05-14

Abliteration launches a made-to-order synthetic training data workflow that generates negative, rare, and adversarial examples for classifiers. Outputs come with a schema, real-world facts, labels, and provenance, and can be exported to platforms like Hugging Face.

