TabPFN-3 is a new foundation model for tabular data, pretrained on synthetic data, that scales to 1M training rows while reducing training and inference time, achieving state-of-the-art performance on tabular prediction, time series, and relational data.
Proposes Video2GUI, a framework to automatically extract GUI interaction trajectories from unlabeled instructional videos, building the WildGUI dataset with 12M trajectories across 1,500+ apps. Pre-training on this data yields 5–20% improvements on GUI grounding and action benchmarks.
This paper proposes a context-aware synthetic augmentation framework combined with a hybrid classification model to address data scarcity and class imbalance in classifying psychological defense mechanisms from text. The method achieves significant improvements on the PsyDefDetect shared task benchmark.
This paper proposes Multi-Stage In-Flight Rejection (MSIFR), a training-free framework that reduces token waste in LLM-based synthetic data generation by detecting and terminating low-quality generation trajectories at intermediate checkpoints. Across five models and seven benchmarks, MSIFR reduces token consumption by 11–77% as a standalone method and up to 78.2% when combined with early-exit methods, while preserving or improving accuracy.
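A minimal sketch of the checkpoint-based rejection idea follows; it is not the paper's implementation, and `generate_chunk`, `quality_score`, the checkpoint sizes, and the thresholds are illustrative assumptions.

```python
# Illustrative sketch of multi-stage in-flight rejection (MSIFR-style).
# NOT the paper's implementation: generate_chunk, quality_score, and the
# checkpoint/threshold values are hypothetical placeholders.

def generate_with_rejection(prompt, generate_chunk, quality_score,
                            checkpoints=(64, 128, 256), thresholds=(0.2, 0.35, 0.5)):
    """Generate text in chunks, aborting early if intermediate quality is low."""
    text = ""
    for n_tokens, min_score in zip(checkpoints, thresholds):
        text += generate_chunk(prompt + text, max_new_tokens=n_tokens)
        if quality_score(prompt, text) < min_score:
            return None  # reject the trajectory now, saving the remaining tokens
    # passed all intermediate gates; finish the sample
    text += generate_chunk(prompt + text, max_new_tokens=512)
    return text
```

The token savings come from the early returns: a trajectory that would have been discarded after full generation is instead cut off at the first failing checkpoint.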
Abliteration launches a made-to-order synthetic training data workflow that generates negative, rare, and adversarial examples for classifiers, complete with schemas, real-world facts, labels, and provenance, plus export to platforms like Hugging Face.
Investigates whether synthetic layered data can improve graphic design decomposition, finding that synthetic data outperforms non-scalable datasets and enables balanced layer-count distributions.
The article describes a fine-tuning dataset generation pipeline using Codex 5.5 as orchestrator and Deepseek v4 Pro as generator, with autonomous quality gates and iterative improvement for high-quality synthetic data at low cost.
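A generic sketch of the orchestrator/generator loop with quality gates is below; the article's model names are used only as labels, and `call_orchestrator`, `call_generator`, and `passes_quality_gates` are hypothetical stand-ins rather than real APIs.

```python
# Generic sketch of an orchestrator/generator pipeline with quality gates and
# iterative improvement; all three callables are hypothetical placeholders.

def build_dataset(task_specs, call_orchestrator, call_generator,
                  passes_quality_gates, max_revisions=3):
    dataset = []
    for spec in task_specs:
        prompt = call_orchestrator(spec)       # orchestrator writes the generation prompt
        sample = call_generator(prompt)        # generator drafts a synthetic example
        for _ in range(max_revisions):         # iterative improvement loop
            ok, feedback = passes_quality_gates(spec, sample)
            if ok:
                dataset.append(sample)
                break
            sample = call_generator(prompt + "\n\nRevise per feedback:\n" + feedback)
    return dataset
```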
IndicMedDialog is a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages, with a fine-tuned model for personalized symptom elicitation. The dataset is derived from MDDial, enhanced with LLM-generated synthetic consultations and expert verification, supporting multilingual healthcare AI.
The article introduces DataArc-SynData-Toolkit, an open-source framework designed to simplify multi-path, multimodal, and multilingual synthetic data generation. It aims to lower technical barriers and improve usability for training large language models through a unified, configuration-driven pipeline.
This paper introduces LychSim, a controllable simulation framework built on Unreal Engine 5 to facilitate vision research, synthetic data generation, and agentic LLM evaluation via MCP integration.
This research paper investigates privacy leakage in tabular diffusion models, quantifying how training setups, synthesis choices, and attacker knowledge impact privacy risks. It reveals that adversaries can succeed without perfect knowledge or massive resources and highlights pitfalls in heuristic privacy metrics.
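For context, one of the heuristic privacy metrics the paper cautions about is distance-to-closest-record (DCR); the sketch below shows the general idea and is illustrative only, not the paper's attack methodology.

```python
# Sketch of the distance-to-closest-record (DCR) heuristic commonly used to
# eyeball memorization in synthetic tabular data; the paper's point is that
# comfortable-looking heuristic scores can still understate real attack risk.
import numpy as np

def dcr(synthetic: np.ndarray, train: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to its nearest training row."""
    # (n_syn, n_train) pairwise distances; fine for small arrays
    d = np.linalg.norm(synthetic[:, None, :] - train[None, :, :], axis=-1)
    return d.min(axis=1)

# Very small DCR values suggest near-copies of training rows, but a large DCR
# does not by itself rule out successful membership inference.
```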
This paper presents a system for the EEUCA 2026 shared task on toxicity detection in gaming chat, achieving 4th place by fine-tuning Llama 3.1 8B with synthetic data augmentation. It highlights a 'validation trap' phenomenon where high validation scores do not correlate with test performance due to dataset distribution shifts.
A weekly roundup of top AI research papers covering topics such as Conductor, HeavySkill, Horizon Generalization, synthetic computers, self-improving pretraining, and AlphaZero for Connect Four.
When2Speak is a synthetic dataset and pipeline for training LLMs to decide when to speak in multi-party conversations. Fine-tuning on this dataset significantly improves turn-taking, with reinforcement learning reducing missed interventions from 50% to ~20%.
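A toy sketch of the underlying decision follows; `speak_probability` is a hypothetical placeholder (e.g., a classifier fine-tuned on When2Speak-style data), not the paper's released model, and the threshold is an assumption.

```python
# Toy sketch of a "when to speak" decision for a multi-party chat.
# speak_probability and the 0.6 threshold are illustrative assumptions.

def should_intervene(history, speak_probability, threshold=0.6):
    """Decide whether the assistant should take the next turn."""
    transcript = "\n".join(f"{turn['speaker']}: {turn['text']}" for turn in history)
    return speak_probability(transcript) >= threshold

history = [
    {"speaker": "Alice", "text": "Can anyone remind me when the report is due?"},
    {"speaker": "Bob", "text": "I think Friday, but I'm not sure."},
]
# should_intervene(history, speak_probability) -> True when the model is
# confident an intervention is warranted, False when it should stay silent.
```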
This paper introduces ScaleLogic, a framework demonstrating that RL training compute scales as a power law with reasoning depth in LLMs. It highlights that logical expressiveness is key to improving downstream transfer and training efficiency.
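As a quick illustration of what a power-law relation between RL training compute and reasoning depth means in practice, the sketch below fits compute ≈ a · depth^b in log space; the (depth, compute) measurements would come from your own runs, not from numbers in the paper.

```python
# Fitting a power law compute ≈ a * depth**b by least squares on a log-log scale.
import numpy as np

def fit_power_law(depths, compute):
    """Return (a, b) such that compute ≈ a * depths**b."""
    b, log_a = np.polyfit(np.log(depths), np.log(compute), deg=1)
    return np.exp(log_a), b
```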
Manufacturers are adopting a simulation-first approach using NVIDIA Omniverse and OpenUSD for physical AI, with case studies from ABB Robotics and JLR showing significant improvements in accuracy, cycle time reduction, and cost savings.
OpenSimula, an open-source Python implementation of the Simula mechanism-design recipe for controlled synthetic-data generation, has been added to the AfterImage dataset tool.
Researchers from KAIST propose a framework that uses persona-guided LLM agents to synthesize diverse harmful content for stress-testing detection systems, addressing limitations of static benchmarks such as scalability, diversity, and data contamination. Both human and LLM evaluations confirm the synthetic scenarios are harder to detect than existing benchmarks while maintaining linguistic and topical diversity.
NVIDIA's Nemotron-Personas-Korea is a dataset of 6-7 million synthetic personas grounded in official Korean demographic statistics, designed to help build culturally accurate Korean AI agents while complying with Korea's Personal Information Protection Act (PIPA). The tutorial demonstrates how to filter personas and deploy a grounded Korean AI agent using hosted APIs in approximately 20 minutes.
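A minimal sketch of the persona-filtering step, assuming the dataset is hosted on the Hugging Face Hub; the repo id and column names below are assumptions for illustration, so check the actual dataset card for the real schema.

```python
# Filtering a persona dataset from the Hugging Face Hub.
# The repo id and the "age"/"region" column names are assumed, not verified.
from datasets import load_dataset

personas = load_dataset("nvidia/Nemotron-Personas-Korea", split="train")  # assumed repo id

# Keep, say, working-age personas from a particular region (schema assumed).
subset = personas.filter(lambda row: 25 <= row["age"] <= 40 and row["region"] == "Seoul")
print(len(subset))
```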
LACE introduces a lattice attention mechanism that enables concurrent reasoning paths in LLMs to share intermediate insights and correct errors during inference, improving reasoning accuracy by over 7 points compared to standard isolated parallel sampling.
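The sketch below is a prompt-level analogue of the idea of paths exchanging intermediate insights between rounds; it does not reproduce the lattice attention mechanism itself, and `continue_path` and `summarize` are hypothetical placeholders.

```python
# Prompt-level analogue of parallel reasoning paths that exchange intermediate
# notes between rounds; NOT LACE's attention mechanism, just the general idea.

def reason_in_parallel(question, continue_path, summarize, n_paths=4, n_rounds=3):
    paths = [question for _ in range(n_paths)]
    for _ in range(n_rounds):
        # extend every path by one reasoning step
        paths = [continue_path(p) for p in paths]
        # broadcast short summaries so each path can borrow or correct ideas
        notes = "\n".join(summarize(p) for p in paths)
        paths = [p + "\n\n[Insights from sibling paths]\n" + notes for p in paths]
    return paths
```

Standard parallel sampling keeps the paths isolated until a final vote; the point of cross-path sharing is that errors can be caught mid-inference instead of only at aggregation time.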