data-collection

#data-collection

Indian housewives are training next wave of humanoids through their chores

Reddit r/ArtificialInteligence ↗ · 2d ago

Indian housewives are contributing to the training of humanoid robots by performing household chores, providing valuable data for AI learning.

#data-collection

Imagine you're building a RAG chatbot that is trained on an entire website. How would you crawl the entire site?

Reddit r/AI_Agents ↗ · 2d ago

A discussion on how to crawl an entire website to train a RAG chatbot, covering strategies and challenges.

#data-collection

The 'papers, please' era of the internet will decimate your privacy

Hacker News Top ↗ · 5d ago Cached

An opinion piece arguing that mandatory age verification laws, such as Australia's social media ban for under-16s, threaten privacy by forcing users to hand over sensitive data like IDs to third parties, and that the law itself is ineffective.

#data-collection

@swyx: on their @latentspacepod we covered how @pimdewitte accidentally made the PERFECT world model data collection business …

X AI KOLs Timeline ↗ · 6d ago Cached

This tweet highlights how Pim de Witte accidentally built a world model data collection business by assembling the largest dataset of trainable (video, action) pairs, and announces a $320M Series A at a $2.3B valuation.

#data-collection

How to Opt Out of Google Search’s New AI Data Training Feature

Wired ↗ · 6d ago Cached

Google is rolling out a new Search Services History setting that saves users' uploaded media for AI training, enabled by default. This article explains how to opt out and highlights privacy concerns.

#data-collection

@NFTCPS: Finally found out where those repost accounts on X get their content! It's this tool MediaCrawler, a single tool that covers Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. It can scrape public content, comments, likes, and reposts. The best part is it doesn't need JS reverse engineering—it uses browser login state to get signatures directly, …

X AI KOLs Timeline ↗ · 2026-06-23 Cached

MediaCrawler is a multi-platform social media scraping tool that crawls public content from Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. It avoids the need for JS reverse engineering by reusing the browser's login state, which lowers the technical barrier.

#data-collection

Petition against Meta's employee training data collection for ML models

Hacker News Top ↗ · 2026-06-21 Cached

Meta employees are petitioning against the Model Capability Initiative (MCI), which collects computer-use data like keystrokes, mouse movements, and screen content for AI training, raising serious privacy and regulatory concerns.

#data-collection

@WY_mask: MediaCrawler: Open-source web scraping tool for Xiaohongshu, Douyin, Weibo, Bilibili, Kuaishou. Supports scraping videos, images, comments, likes, reposts, etc. https://github.com/NanmiCoder/MediaCrawler…

X AI KOLs Timeline ↗ · 2026-06-21 Cached

MediaCrawler is an open-source, multi-platform social media data collection tool that scrapes public information from Xiaohongshu, Douyin, Weibo, Bilibili, Kuaishou, and other platforms. It is built on Playwright browser automation and requires no JS reverse engineering.

#data-collection

The inevitable weakness of metrics

MIT Technology Review ↗ · 2026-06-19 Cached

A reflective essay on the pitfalls of self-quantification, arguing that while metrics can reveal useful information, they often obscure or corrupt deeper self-knowledge.

#data-collection

After Senate vote, Trump admin backs off plans to kill ocean monitoring

Ars Technica ↗ · 2026-06-18 Cached

After Senate opposition, the Trump administration reversed its decision to dismantle the Ocean Observatories Initiative, a $350 million ocean monitoring network used for climate tracking, weather forecasting, and fisheries management.

#data-collection

Collecting robot training data is dirty, unglamorous work. Some AI labs are already paying XDOF to do it.

TechCrunch AI ↗ · 2026-06-17 Cached

XDOF, a startup emerging from stealth, has raised $70M to build data pipelines and tools for robot training, addressing the bottleneck of physical interaction data. The company is releasing ABC, a large dataset of robot manipulation trajectories, to accelerate robotics AI.

#data-collection

The beautiful ugly shape

Reddit r/artificial ↗ · 2026-06-16

An analysis of X's platform architecture that describes how Grok AI ties into X Premium, behavioral data, and targeted advertising, and suggests that users serve as both the product and a source of training data.

#data-collection

Ai is scaring tf out of me.

Reddit r/artificial ↗ · 2026-06-15

A user describes how Google's AI overview can analyze Instagram profiles and find all interactions, raising privacy concerns about permanent online footprints.

#data-collection

@VaibhavSisinty: A 25-year-old housewife in Chennai earns ₹250/hour ($3) just by doing her normal housework. She wears a phone on her he…

X AI KOLs Timeline ↗ · 2026-06-12 Cached

A 25-year-old housewife in Chennai earns ₹250/hour filming her daily housework for AI companies training humanoid robots, as part of a growing gig economy where thousands in India record everyday tasks to train future robots.

#data-collection

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

Hugging Face Daily Papers ↗ · 2026-06-12 Cached

HyVLA-0.5 is an end-to-end robotic learning system that integrates data collection, model design, pre-training, fine-tuning, and reinforcement learning for real-world deployment.

#data-collection

Google will save your Lens photos, Search Live recordings, and Translate audio for AI training

The Verge ↗ · 2026-06-10 Cached

Google is introducing a new Search Services History setting that saves images, audio, and video from Lens, Search Live, and Translate to improve its AI models and personalization, with an option to opt out.

#data-collection

How AI Agents Collect Data in 2026

Reddit r/AI_Agents ↗ · 2026-06-10

This article explains how AI agents in 2026 collect data from websites and APIs, and discusses key challenges like rate limits, CAPTCHAs, and IP blocking.

#data-collection

Why Proxies Are Essential for Your AI Agents

Reddit r/AI_Agents ↗ · 2026-06-05

This article explains why proxies are essential for AI agents to avoid rate limits, CAPTCHAs, and geo-restrictions when collecting data at scale, and covers common use cases and types of proxies.

#data-collection

AI Makes Large-Scale Web Scraping Accessible. Is That a Problem?

Reddit r/ArtificialInteligence ↗ · 2026-06-02

The article discusses how AI coding assistants make large-scale web scraping accessible to ordinary people, raising ethical concerns about ignoring robots.txt and rate limits, and questions the responsibility of AI providers.

#data-collection

How does AI follow ethical guidelines in Data Collection?

Reddit r/artificial ↗ · 2026-06-02

A commentary on the ethical problem of AI assistants generating scrapers that ignore website rules such as robots.txt, and on the responsibility of AI providers to add guardrails without hurting product usability.
