Before we spend months processing open-source robotics datasets, tell us why this is a bad idea [D]

Reddit r/MachineLearning 05/30/26, 12:18 PM News

robotics datasets data-interoperability open-source machine-learning research

Summary

The authors, a couple of ML students, ask the robotics community about data interoperability problems and propose an experiment: normalize and enrich public robotics datasets so they are easier to reuse.

P.S. Not pitching anything; just trying to understand where reality differs from the narrative.

We're a couple of ML students who have mostly worked on ML/software before, but over the last few months we've been playing with VLAs and robot datasets, and trying to understand where the field is heading.

After spending a few weeks downloading robotics datasets, we were surprised by how much effort went into just getting the data into a usable format. Maybe we're missing something, but it felt like every dataset had different assumptions, schemas, sensors, coordinate frames, metadata standards, and tooling.

That got us wondering: How do robotics teams actually think about data sharing? Do people genuinely want access to more robot data, or is the industry moving toward "collect your own data because nobody else's transfers"?

Our current (possibly very wrong) hypothesis is: The robotics ecosystem doesn't have a data scarcity problem. It has a data interoperability problem.

We're considering running a pretty large experiment: take essentially every public robot-learning dataset we can get our hands on, normalize it into a common schema, enrich it with metadata, and see how much of it is actually reusable across tasks, embodiments, and learning pipelines.

Before we spend months doing that, we'd love to hear from people actually building in robotics. Where is this hypothesis wrong? Is finding data not actually a problem? Is embodiment mismatch the real blocker? Is quality the issue? Is labeling the issue? Is everyone just collecting their own data anyway? Would you ever use robot data collected by another team?

If we gave you access tomorrow to every public robotics dataset through one API, what would you actually do with it? Or would you ignore it completely?

Edit: One clarification. We're not thinking about a marketplace, proprietary format, or closed platform. The experiment we're considering is much simpler: take as much public robotics data as possible, normalize it, enrich it with metadata/quality signals, make it searchable, and release it back to the community in an open format. Would that actually be useful to practitioners?
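
To make "normalize it into a common schema and make it searchable" a bit more concrete, here is a minimal sketch of the shape this could take. Everything in it is an assumption for illustration: the `Episode` fields, the two raw layouts, and the `normalize`/`search` helpers are hypothetical and do not correspond to any existing dataset or library.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Episode:
    """Hypothetical common record; field names are illustrative, not a standard."""
    source: str
    embodiment: str
    action_space: str
    control_hz: float
    num_steps: int
    has_language: bool


# Two made-up raw layouts standing in for "every dataset has a different schema".
def from_layout_a(raw: dict[str, Any]) -> Episode:
    return Episode(
        source="dataset_a",
        embodiment=raw["robot"],
        action_space=raw["action_type"],
        control_hz=float(raw["hz"]),
        num_steps=len(raw["actions"]),
        has_language=bool(raw.get("instruction")),
    )


def from_layout_b(raw: dict[str, Any]) -> Episode:
    meta = raw["meta"]
    return Episode(
        source="dataset_b",
        embodiment=meta["platform"],
        action_space=meta["ctrl"],
        # This layout stores the control period in seconds, not a rate in Hz.
        control_hz=1.0 / float(meta["dt"]),
        num_steps=len(raw["traj"]),
        has_language=bool(raw.get("lang")),
    )


# One adapter per source layout; adding a dataset means adding one function.
ADAPTERS: dict[str, Callable[[dict[str, Any]], Episode]] = {
    "a": from_layout_a,
    "b": from_layout_b,
}


def normalize(layout: str, raw: dict[str, Any]) -> Episode:
    """Convert one raw episode into the common record."""
    return ADAPTERS[layout](raw)


def search(
    episodes: list[Episode],
    embodiment: Optional[str] = None,
    action_space: Optional[str] = None,
    needs_language: bool = False,
) -> list[Episode]:
    """Filter normalized episodes on metadata; None means 'do not filter'."""
    return [
        e
        for e in episodes
        if (embodiment is None or e.embodiment == embodiment)
        and (action_space is None or e.action_space == action_space)
        and (e.has_language or not needs_language)
    ]
```

A sketch like this only covers metadata. The hard parts raised above (coordinate frames, sensor differences, whether data transfers across embodiments) are exactly what it leaves out, which is part of what the experiment would have to measure.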
