data-centric

#data-centric

Data-centric debugging for teams training neural nets [P]

Reddit r/MachineLearning · 5d ago

WeightsLab is an open-source, PyTorch-native tool that allows teams to pause training, inspect live loss signals, and catch data issues like mislabels and class imbalance before they affect model performance. It is designed for computer vision engineers working with images, videos, and LiDAR point clouds.


PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

arXiv cs.CL · 2026-06-18

This paper proposes PreUnlearn, a framework for auditing collateral knowledge damage in LLM unlearning before execution, using data-centric analysis to predict downstream damage across semantic layers.


LIMMT: Less is More for Motion Tracking

Hugging Face Daily Papers · 2026-06-05

This paper introduces LIMMT, a data-centric study showing that training with high-quality, minimal subsets of motion data (under 3% of AMASS) outperforms using the full dataset for physics-based humanoid motion tracking, defining motion data quality through physics feasibility, diversity, and complexity.


Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

arXiv cs.CL · 2026-05-27

This survey reframes the alignment tuning of large language models as a data pipeline design problem, decomposing it into three stages: response synthesis, preference evaluation, and preference instantiation. It identifies design trade-offs and failure modes, and outlines open challenges such as prompt-level alignment and agentic settings.


What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

arXiv cs.AI · 2026-05-20

This paper challenges the belief that code improves reasoning in language models, finding through controlled pretraining experiments that code alone primarily enhances programming ability, while reasoning gains come from structured reasoning traces like code-text and math-text mixtures.


EMA: Efficient Model Adaptation for Learning-based Systems

arXiv cs.LG · 2026-05-15

This paper presents EMA, a model adaptation system that reduces training and labeling costs for learning-based systems while improving their performance in evolving environments.


Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Hugging Face Daily Papers · 2026-05-14

This paper investigates whether synthetic layered data can improve graphic design decomposition, finding that synthetic data outperforms non-scalable datasets and enables balanced layer-count distributions.

