data-cleaning

#data-cleaning

I had my AI assistant turn 6 months of Apple Watch sleep data into the diary my sleep clinic asked for. The data gotchas were brutal.

Reddit r/openclaw · 4d ago

A user details the challenges of using an AI assistant to convert 6 months of Apple Watch sleep data into a sleep clinic's diary format, including timezone conversions, date offsets, and fabricated values. The post shares lessons on correctly interpreting health data sources for medical forms.


DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors

arXiv cs.LG · 4d ago

DeMix is a novel framework that detects erroneous training samples and identifies their specific error types (label errors, feature errors, spurious correlations) by analyzing influence vectors, achieving a 22.61% improvement in debugging F1-score and 9.32% gain in task performance after data repair.


When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

arXiv cs.AI · 2026-06-03

This paper investigates when multi-agent debate helps or hurts data cleaning, finding that debate degrades generation due to critique-induced confusion but improves error detection. It proposes a debate benefit condition and shows that adversarial separation with code-execution grounding produces the first configuration to significantly exceed single-agent performance on a generative task.


Getting good predictions without data cleaning (Why "Garbage In, Garbage Out" is sometimes a trap)

Reddit r/artificial · 2026-05-13

This arXiv preprint challenges the "Garbage In, Garbage Out" heuristic, arguing that aggressive manual data cleaning can limit predictive performance on high-dimensional tabular data by reducing the dimensionality needed to triangulate latent drivers.
