Tag
WeightsLab is an open-source, PyTorch-native tool that allows teams to pause training, inspect live loss signals, and catch data issues like mislabels and class imbalance before they affect model performance. It is designed for computer vision engineers working with images, videos, and LiDAR point clouds.
This paper proposes PreUnlearn, a framework for auditing collateral knowledge damage in LLM unlearning before the unlearning is executed, using data-centric analysis to predict downstream damage across semantic layers.
This paper introduces LIMMT, a data-centric study showing that training on small, high-quality subsets of motion data (under 3% of AMASS) outperforms training on the full dataset for physics-based humanoid motion tracking. It defines motion data quality in terms of physics feasibility, diversity, and complexity.
This survey reframes the alignment tuning of large language models as a data pipeline design problem, decomposing it into three stages: response synthesis, preference evaluation, and preference instantiation. It identifies design trade-offs and failure modes, and outlines open challenges such as prompt-level alignment and agentic settings.
This paper challenges the belief that code improves reasoning in language models, finding through controlled pretraining experiments that code alone primarily enhances programming ability, while reasoning gains come from structured reasoning traces like code-text and math-text mixtures.
This paper presents EMA, a model adaptation system for learning-based systems that reduces training and labeling costs while improving system performance in evolving environments.
This paper investigates whether synthetic layered data can improve graphic design decomposition, finding that synthetic data outperforms non-scalable datasets and enables balanced layer-count distributions.