This paper presents RaDaR, a 32B open-source reasoning LLM trained on public and synthetic rare disease cases, which outperforms larger models such as DeepSeek-R1 on diagnosis benchmarks and improves physician accuracy by 21.44 percentage points in a randomized trial.
Natolambert announces a new lecture covering synthetic data and the history of distillation, from Hinton 2015 to modern on-policy distillation, with over 7 hours of video content.
NVIDIA announces new AI agents and tools for telecom operations, including synthetic data generation and secure agent runtimes, showcased at DTW Ignite 2026. The platform aims to enable autonomous networks by combining domain-specific models, privacy-safe synthetic data, and policy-based guardrails.
A tweet highlights Joël Niklaus's Hugging Face article on the Synthetic Data Playbook, which inspired the text-albumentations library.
This paper presents a framework for financial sentiment analysis that uses synthetic data to distill knowledge from a large teacher into compact student models, with clustering-based seed selection for efficient low-resource domain adaptation.
This paper investigates activation steering as an alternative to few-shot prompting for generating synthetic data in low-resource languages. The authors propose LanguageSteering and QualitySteering strategies, showing that steering on early layers improves diversity and downstream model performance.
PSyGenTAB is a privacy-preserving framework that uses constrained optimization to generate synthetic clinical tabular data, balancing privacy and utility while preserving clinical relationships and minority-class patterns.
A small team trained a frontier-level Deep Research Agent on an academic budget using only 32 H100s and 8K synthetic samples, releasing fully open weights, code, and paper for models from 2B to 35B that match or beat closed frontier agents on key benchmarks.
Yu Su's team trained a frontier Deep Research Agent on an academic budget using 8K synthetic samples and RL, releasing fully open training infrastructure and models from 2B to 35B parameters.
This paper measures information degradation in AI-rewritten radiology reports, finding that the rewriting tasks that produce cleaner text for multimodal training also cause greater cross-modal alignment loss, a phenomenon termed the 'slop paradox'.
This paper presents a diffusion-based approach for generating irregular clinical time series that jointly models laboratory values and their observation patterns, evaluated on the DACMI benchmark from MIMIC-III. The model captures clinically meaningful dependencies between patient physiology and testing behavior under MNAR-like (missing not at random) missingness.
This paper diagnoses and repairs shape-prior shortcuts in learning-based long-range single-shot fringe projection profilometry, using mechanistic interpretability and conformal uncertainty quantification. The proposed PhiCalNet architecture achieves a 3.3x reduction in object MAE by replacing depth regression with wrapped-phase output and a differentiable calibration layer.
Joël Niklaus of Hugging Face will give a live stream on the role of synthetic data in advancing pretraining; the team has also published a playbook on the topic.
VeriGeo introduces a controllable geometry question generation framework that uses verification-guided reflection to ensure numerical and analytical consistency. The method produces high-quality synthetic data, achieving state-of-the-art results on GeoQA and strong performance on PGPS9K and MathVista-GPS.
This paper demonstrates that data selection in low-resource verification regimes, where verifiers have access only to fragmented and biased slices of the target distribution, can paradoxically accelerate model collapse by pruning globally relevant tail modes. The authors provide a theoretical proof and propose a collaborative proxy reference mechanism as a mitigation strategy.
This paper presents a synthetic data generation method for fine-tuning small LLMs to convert natural language to Cypher queries for property graphs, achieving performance competitive with large proprietary models while enabling local deployment and data sovereignty.
ProCUA-SFT is a large-scale synthetic dataset of 3.1M step-level SFT samples for training computer-use agents, produced by an automated pipeline that uses a single VLM (Kimi-K2.5). Fine-tuning UI-TARS 7B on it achieves 45.0% on OSWorld, an 18.7-point improvement over the base model.
TrajGenAgent proposes a hierarchical LLM agent framework that decouples macro-level activity planning from micro-level spatiotemporal instantiation for realistic human mobility trajectory generation without fine-tuning. It also introduces an anomaly-detection-based evaluation for behavioral fidelity.
This paper introduces HieraRAG, a hierarchical framework for determining optimal granularity in RAG benchmarks. It generates 5,872 synthetic QA pairs across three dimensions and finds that ideal granularity varies by dimension, offering a portable procedure for practitioners.
This paper systematically evaluates 11 synthetic time-series generators for foundation model pretraining and finds that generator rankings are not stable across architectures, but an equal-weight mixture of all generators matches or beats the best individual. Blending this mixture with real data yields the strongest pretraining corpora, reframing synthetic pretraining as a corpus composition problem rather than a generator selection problem.
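As a rough illustration of the corpus-composition idea in the last item, the sketch below builds a pretraining corpus from an equal-weight mixture of synthetic generators and then blends in real series. Everything here is an assumption made for illustration: the three toy generators, the `build_corpus` helper, and the `real_fraction` parameter are hypothetical and are not the paper's 11 generators or its actual mixing procedure.

```python
import math
import random

# Hypothetical toy generators standing in for real synthetic time-series generators.
def sine_wave(rng, length):
    freq = rng.uniform(0.05, 0.5)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [math.sin(freq * t + phase) for t in range(length)]

def random_walk(rng, length):
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        series.append(value)
    return series

def white_noise(rng, length):
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

GENERATORS = [
    ("sine_wave", sine_wave),
    ("random_walk", random_walk),
    ("white_noise", white_noise),
]

def build_corpus(generators, real_series, n_total, real_fraction, length=64, seed=0):
    """Return a shuffled list of (source_name, series) pairs.

    Synthetic slots are assigned round-robin, so every generator contributes
    the same number of series (up to a difference of one). The remaining
    slots are filled by sampling from the real series with replacement.
    """
    rng = random.Random(seed)
    n_real = round(n_total * real_fraction)
    n_synth = n_total - n_real
    corpus = []
    for i in range(n_synth):
        name, generate = generators[i % len(generators)]
        corpus.append((name, generate(rng, length)))
    for _ in range(n_real):
        corpus.append(("real", rng.choice(real_series)))
    rng.shuffle(corpus)
    return corpus
```

Calling `build_corpus(GENERATORS, real_series, n_total=100, real_fraction=0.4)` with a non-empty `real_series` yields 40 real series and 20 from each toy generator. The point of the sketch is only that the mixing weights, not the choice of a single generator, are the knob being tuned.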