synthetic-data

A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial

arXiv cs.AI · 9h ago

This paper presents RaDaR, a 32B open-source reasoning LLM trained on public and synthetic rare disease cases. It outperforms larger models such as DeepSeek-R1 on diagnosis benchmarks and improves physician accuracy by 21.44 percentage points in a randomized trial.

@natolambert: New lecture for the book! Nominally about synthetic data, but mostly is a walk through of the distillation literature f…

X AI KOLs Timeline · 21h ago

Nathan Lambert (@natolambert) announces a new lecture covering synthetic data and the history of distillation, from Hinton 2015 to modern on-policy distillation, with over 7 hours of video content.

NVIDIA Brings Trusted, 24/7 AI Agents to Telecom Operations

NVIDIA Blog · yesterday

NVIDIA announces new AI agents and tools for telecom operations, including synthetic data generation and secure agent runtimes, showcased at DTW Ignite 2026. The platform aims to enable autonomous networks by combining domain-specific models, privacy-safe synthetic data, and policy-based guardrails.

@neural_avb: Btw Joel is the author of the great huggingface article on the Synthetic Data Playbook. It's a marathon survey everyone…

X AI KOLs Timeline · 5d ago

A tweet highlighting Joel Niklaus's Hugging Face article on the Synthetic Data Playbook, which inspired the text-albumentations library.

Efficient Financial Language Understanding via Distillation with Synthetic Data

arXiv cs.CL · 6d ago

Presents a framework for financial sentiment analysis using distillation with synthetic data, transferring knowledge from a large teacher to compact student models, with clustering-based seed selection for efficient low-resource domain adaptation.

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

arXiv cs.CL · 6d ago

This paper investigates activation steering as an alternative to few-shot prompting for generating synthetic data in low-resource languages. The authors propose LanguageSteering and QualitySteering strategies, showing that steering on early layers improves diversity and downstream model performance.

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

arXiv cs.LG · 6d ago

PSyGenTAB is a privacy-preserving framework that uses constrained optimization to generate synthetic clinical tabular data, balancing privacy and utility while preserving clinical relationships and minority-class patterns.

@Ex0byt: A must bookmark.. tiny cracked team, 4 H100 nodes, open source 3 stage recipe, trained on 8k synthetic rubric tasks, fu…

X AI KOLs Timeline · 6d ago

A small team trained a frontier-level Deep Research Agent on an academic budget using only 32 H100s and 8K synthetic samples, releasing fully open weights, code, and paper for models from 2B to 35B that match or beat closed frontier agents on key benchmarks.

@KaiZhang_CS: Check out one of the best open-source search agents trained by @jianxie_ !! glad to see early experience methods work o…

X AI KOLs Timeline · 6d ago

Yu Su's team trained a frontier Deep Research Agent on an academic budget using 8K synthetic samples and RL, releasing fully open training infrastructure and models from 2B to 35B parameters.

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

arXiv cs.CL · 2026-06-17

This paper measures information degradation in AI-rewritten radiology reports, finding that tasks producing cleaner text for multimodal training cause greater cross-modal alignment loss, a phenomenon termed the 'slop paradox'.

Informative Missingness to Generate Irregular Clinical Time Series

arXiv cs.LG · 2026-06-17

Presents a diffusion-based approach for generating irregular clinical time series that jointly models laboratory values and their observation patterns, using the DACMI benchmark from MIMIC-III. The model captures clinically meaningful dependencies between patient physiology and testing behavior under MNAR-like missingness.

Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry

arXiv cs.LG · 2026-06-17

This paper diagnoses and repairs shape-prior shortcuts in learning-based long-range single-shot fringe projection profilometry, using mechanistic interpretability and conformal uncertainty quantification. The proposed PhiCalNet architecture achieves a 3.3x reduction in object MAE by replacing depth regression with wrapped-phase output and a differentiable calibration layer.

@yacinelearning: okay folks buckle up because this thursday we have @joelniklaus from @huggingface that will join us on stream to teach …

X AI KOLs Timeline · 2026-06-15

Joel Niklaus from Hugging Face will join a live stream to teach the role of synthetic data in advancing pretraining; his team has also published a playbook on the topic.

VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

arXiv cs.AI · 2026-06-15

VeriGeo introduces a controllable geometry question generation framework that uses verification-guided reflection to ensure numerical and analytical consistency. The method produces high-quality synthetic data, achieving state-of-the-art results on GeoQA and strong performance on PGPS9K and MathVista-GPS.

When Sample Selection Bias Precipitates Model Collapse

arXiv cs.AI · 2026-06-15

This paper demonstrates that data selection in low-resource verification regimes, where verifiers only have access to fragmented and biased slices of the target distribution, can paradoxically accelerate model collapse by pruning globally relevant tail modes. The authors provide a theoretical proof and propose a collaborative proxy reference mechanism as a mitigation strategy.

Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

arXiv cs.CL · 2026-06-15

This paper presents a synthetic data generation method for fine-tuning small LLMs to convert natural language to Cypher queries for property graphs, achieving competitive performance with large proprietary models while enabling local deployment and data sovereignty.

ProCUA-SFT Technical Report

Hugging Face Daily Papers · 2026-06-15

ProCUA-SFT is a large-scale synthetic dataset of 3.1M step-level SFT samples for training computer-use agents, produced via an automated pipeline using a single VLM (Kimi-K2.5). Fine-tuning UI-TARS 7B on it achieves 45.0% on OSWorld, an 18.7-point improvement over the base model.

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

arXiv cs.AI · 2026-06-12

TrajGenAgent proposes a hierarchical LLM agent framework that decouples macro-level activity planning from micro-level spatiotemporal instantiation for realistic human mobility trajectory generation without fine-tuning. It also introduces an anomaly-detection-based evaluation for behavioral fidelity.

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

arXiv cs.CL · 2026-06-12

This paper introduces HieraRAG, a hierarchical framework for determining optimal granularity in RAG benchmarks. It generates 5,872 synthetic QA pairs across three dimensions and finds that ideal granularity varies by dimension, offering a portable procedure for practitioners.

Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining

arXiv cs.LG · 2026-06-10

This paper systematically evaluates 11 synthetic time-series generators for foundation model pretraining and finds that generator rankings are not stable across architectures, but an equal-weight mixture of all generators matches or beats the best individual generator. Blending this mixture with real data yields the strongest pretraining corpora, reframing synthetic pretraining as a corpus composition problem rather than a generator selection problem.
