TabPFN-3 is a new foundation model for tabular data, pretrained on synthetic data, that scales to 1M training rows while reducing training and inference time, achieving state-of-the-art performance on tabular prediction, time series, and relational data.
Proposes Video2GUI, a framework to automatically extract GUI interaction trajectories from unlabeled instructional videos, building the WildGUI dataset with 12M trajectories across 1,500+ apps. Pre-training on this data yields 5–20% improvements on GUI grounding and action benchmarks.
This paper proposes a context-aware synthetic augmentation framework combined with a hybrid classification model to address data scarcity and class imbalance in classifying psychological defense mechanisms from text. The method achieves significant improvements on the PsyDefDetect shared task benchmark.
This paper proposes Multi-Stage In-Flight Rejection (MSIFR), a training-free framework that reduces token waste in LLM-based synthetic data generation by detecting and terminating low-quality generation trajectories at intermediate checkpoints. Across five models and seven benchmarks, MSIFR reduces token consumption by 11–77% as a standalone method and up to 78.2% when combined with early-exit methods, while preserving or improving accuracy.
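A minimal sketch of the checkpoint-based rejection idea follows; it is not the paper's implementation, and `generate_chunk`, `quality_score`, the checkpoint sizes, and the thresholds are illustrative assumptions.

```python
# Illustrative sketch of multi-stage in-flight rejection (MSIFR-style).
# NOT the paper's implementation: generate_chunk, quality_score, and the
# checkpoint/threshold values are hypothetical placeholders.

def generate_with_rejection(prompt, generate_chunk, quality_score,
                            checkpoints=(64, 128, 256), thresholds=(0.2, 0.35, 0.5)):
    """Generate text in chunks, aborting early if intermediate quality is low."""
    text = ""
    for n_tokens, min_score in zip(checkpoints, thresholds):
        text += generate_chunk(prompt + text, max_new_tokens=n_tokens)
        if quality_score(prompt, text) < min_score:
            return None  # reject the trajectory now, saving the remaining tokens
    # passed all intermediate gates; finish the sample
    text += generate_chunk(prompt + text, max_new_tokens=512)
    return text
```

The token savings come from the early returns: a trajectory that would have been discarded after full generation is instead cut off at the first failing checkpoint.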
Abliteration launches a made-to-order synthetic training data workflow that generates negative, rare, and adversarial examples for classifiers, complete with schemas, real-world facts, labels, and provenance, plus export to platforms like Hugging Face.
Investigates whether synthetic layered data can improve graphic design decomposition, finding that synthetic data outperforms non-scalable datasets and enables balanced layer-count distributions.
The article describes a fine-tuning dataset generation pipeline using Codex 5.5 as orchestrator and Deepseek v4 Pro as generator, with autonomous quality gates and iterative improvement for high-quality synthetic data at low cost.
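A generic sketch of the orchestrator/generator loop with quality gates is below; the article's model names are used only as labels, and `call_orchestrator`, `call_generator`, and `passes_quality_gates` are hypothetical stand-ins rather than real APIs.

```python
# Generic sketch of an orchestrator/generator pipeline with quality gates and
# iterative improvement; all three callables are hypothetical placeholders.

def build_dataset(task_specs, call_orchestrator, call_generator,
                  passes_quality_gates, max_revisions=3):
    dataset = []
    for spec in task_specs:
        prompt = call_orchestrator(spec)       # orchestrator writes the generation prompt
        sample = call_generator(prompt)        # generator drafts a synthetic example
        for _ in range(max_revisions):         # iterative improvement loop
            ok, feedback = passes_quality_gates(spec, sample)
            if ok:
                dataset.append(sample)
                break
            sample = call_generator(prompt + "\n\nRevise per feedback:\n" + feedback)
    return dataset
```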
IndicMedDialog is a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages, with a fine-tuned model for personalized symptom elicitation. The dataset is derived from MDDial, enhanced with LLM-generated synthetic consultations and expert verification, supporting multilingual healthcare AI.
The article introduces DataArc-SynData-Toolkit, an open-source framework designed to simplify multi-path, multimodal, and multilingual synthetic data generation. It aims to lower technical barriers and improve usability for training large language models through a unified, configuration-driven pipeline.
This paper introduces LychSim, a controllable simulation framework built on Unreal Engine 5 to facilitate vision research, synthetic data generation, and agentic LLM evaluation via MCP integration.
This research paper investigates privacy leakage in tabular diffusion models, quantifying how training setups, synthesis choices, and attacker knowledge impact privacy risks. It reveals that adversaries can succeed without perfect knowledge or massive resources and highlights pitfalls in heuristic privacy metrics.
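For context, one of the heuristic privacy metrics the paper cautions about is distance-to-closest-record (DCR); the sketch below shows the general idea and is illustrative only, not the paper's attack methodology.

```python
# Sketch of the distance-to-closest-record (DCR) heuristic commonly used to
# eyeball memorization in synthetic tabular data; the paper's point is that
# comfortable-looking heuristic scores can still understate real attack risk.
import numpy as np

def dcr(synthetic: np.ndarray, train: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to its nearest training row."""
    # (n_syn, n_train) pairwise distances; fine for small arrays
    d = np.linalg.norm(synthetic[:, None, :] - train[None, :, :], axis=-1)
    return d.min(axis=1)

# Very small DCR values suggest near-copies of training rows, but a large DCR
# does not by itself rule out successful membership inference.
```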
This paper presents a system for the EEUCA 2026 shared task on toxicity detection in gaming chat, achieving 4th place by fine-tuning Llama 3.1 8B with synthetic data augmentation. It highlights a 'validation trap' phenomenon where high validation scores do not correlate with test performance due to dataset distribution shifts.
A weekly roundup of top AI research papers covering topics such as Conductor, HeavySkill, Horizon Generalization, synthetic computers, self-improving pretraining, and AlphaZero for Connect Four.
When2Speak is a synthetic dataset and pipeline for training LLMs to decide when to speak in multi-party conversations. Fine-tuning on this dataset significantly improves turn-taking, with reinforcement learning reducing missed interventions from 50% to ~20%.
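A toy sketch of the underlying decision follows; `speak_probability` is a hypothetical placeholder (e.g., a classifier fine-tuned on When2Speak-style data), not the paper's released model, and the threshold is an assumption.

```python
# Toy sketch of a "when to speak" decision for a multi-party chat.
# speak_probability and the 0.6 threshold are illustrative assumptions.

def should_intervene(history, speak_probability, threshold=0.6):
    """Decide whether the assistant should take the next turn."""
    transcript = "\n".join(f"{turn['speaker']}: {turn['text']}" for turn in history)
    return speak_probability(transcript) >= threshold

history = [
    {"speaker": "Alice", "text": "Can anyone remind me when the report is due?"},
    {"speaker": "Bob", "text": "I think Friday, but I'm not sure."},
]
# should_intervene(history, speak_probability) -> True when the model is
# confident an intervention is warranted, False when it should stay silent.
```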
This paper introduces ScaleLogic, a framework demonstrating that RL training compute scales as a power law with reasoning depth in LLMs. It highlights that logical expressiveness is key to improving downstream transfer and training efficiency.
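As a quick illustration of what a power-law relation between RL training compute and reasoning depth means in practice, the sketch below fits compute ≈ a · depth^b in log space; the (depth, compute) measurements would come from your own runs, not from numbers in the paper.

```python
# Fitting a power law compute ≈ a * depth**b by least squares on a log-log scale.
import numpy as np

def fit_power_law(depths, compute):
    """Return (a, b) such that compute ≈ a * depths**b."""
    b, log_a = np.polyfit(np.log(depths), np.log(compute), deg=1)
    return np.exp(log_a), b
```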
Manufacturers are adopting a simulation-first approach using NVIDIA Omniverse and OpenUSD for physical AI, with case studies from ABB Robotics and JLR showing significant improvements in accuracy, cycle time reduction, and cost savings.
OpenSimula, an open-source Python implementation of the Simula mechanism-design recipe for controlled synthetic-data generation, has been added to the AfterImage dataset tool.
Researchers from KAIST propose a framework that uses persona-guided LLM agents to synthesize diverse harmful content for stress-testing detection systems, addressing limitations of static benchmarks such as scalability, diversity, and data contamination. Both human and LLM evaluations confirm the synthetic scenarios are harder to detect than existing benchmarks while maintaining linguistic and topical diversity.
NVIDIA's Nemotron-Personas-Korea is a dataset of 6-7 million synthetic personas grounded in official Korean demographic statistics, designed to help build culturally accurate Korean AI agents while complying with Korea's Personal Information Protection Act (PIPA). The tutorial demonstrates how to filter personas and deploy a grounded Korean AI agent using hosted APIs in approximately 20 minutes.
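A minimal sketch of the persona-filtering step, assuming the dataset is hosted on the Hugging Face Hub; the repo id and column names below are assumptions for illustration, so check the actual dataset card for the real schema.

```python
# Filtering a persona dataset from the Hugging Face Hub.
# The repo id and the "age"/"region" column names are assumed, not verified.
from datasets import load_dataset

personas = load_dataset("nvidia/Nemotron-Personas-Korea", split="train")  # assumed repo id

# Keep, say, working-age personas from a particular region (schema assumed).
subset = personas.filter(lambda row: 25 <= row["age"] <= 40 and row["region"] == "Seoul")
print(len(subset))
```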
LACE introduces a lattice attention mechanism that enables concurrent reasoning paths in LLMs to share intermediate insights and correct errors during inference, improving reasoning accuracy by over 7 points compared to standard isolated parallel sampling.
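The sketch below is a prompt-level analogue of the idea of paths exchanging intermediate insights between rounds; it does not reproduce the lattice attention mechanism itself, and `continue_path` and `summarize` are hypothetical placeholders.

```python
# Prompt-level analogue of parallel reasoning paths that exchange intermediate
# notes between rounds; NOT LACE's attention mechanism, just the general idea.

def reason_in_parallel(question, continue_path, summarize, n_paths=4, n_rounds=3):
    paths = [question for _ in range(n_paths)]
    for _ in range(n_rounds):
        # extend every path by one reasoning step
        paths = [continue_path(p) for p in paths]
        # broadcast short summaries so each path can borrow or correct ideas
        notes = "\n".join(summarize(p) for p in paths)
        paths = [p + "\n\n[Insights from sibling paths]\n" + notes for p in paths]
    return paths
```

Standard parallel sampling keeps the paths isolated until a final vote; the point of cross-path sharing is that errors can be caught mid-inference instead of only at aggregation time.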