data-processing

#data-processing

With screen-aware AI the privacy question isn't just "what does it see." It's where what it sees goes.

Reddit r/ArtificialInteligence ↗ · 2026-05-29

An article exploring privacy concerns with AI tools that read screens. It asks whether screen content leaves the user's machine and argues for local-only processing or clear disclosures.


#data-processing

@lhoestq: You don't know you actually need local Common Crawl

X AI KOLs Timeline ↗ · 2026-05-22

Learn how to set up and use Common Crawl data locally for web data processing tasks.


#data-processing

@VikParuchuri: We'll process ~1B pages this week. The team at @datalabto has done incredible work orchestrating our models across thou…

X AI KOLs Following ↗ · 2026-05-11

The DataLab team is orchestrating its AI models across thousands of GPUs to process approximately one billion pages this week, an example of document processing at very large scale.


#data-processing

@rwayne: Absolutely impressive for building local knowledge bases with academic papers—the bottleneck has always been cleanly converting PDFs to Markdown. OpenDataLoader-PDF achieves a 0.907 accuracy rate, ranking first on the open-source PDF parsing leaderboard, all under Apache 2.0. Key metrics from a test set of 200 real papers: Overall score 0…

X AI KOLs Timeline ↗ · 2026-05-10

OpenDataLoader-PDF is an open-source PDF parsing tool that achieves a high accuracy rate of 0.907 in tests with real academic papers. It efficiently converts complex PDF documents (including tables, formulas, and scanned images) into Markdown and JSON, making it ideal for local knowledge bases and RAG applications.


#data-processing

@cmpatino_: I’ve been using ml-intern for a while, and it genuinely changed my workflow. It's super good at: - Model/Dataset discov…

X AI KOLs Following ↗ · 2026-04-21

A developer praises the ml-intern tool for streamlining model/dataset discovery, post-training iteration, and data workflows.

