Tag
OmniParse is a local platform that ingests and parses unstructured data (documents, images, video, audio, web) into structured JSON optimized for LLM applications like RAG and fine-tuning.
The author finds AI agents useful for repetitive data prep work: they use Pandada to clean and standardize raw files, which reduces manual effort and mistakes.
DataFlow is an open-source tool that uses visual, low-code pipelines to generate, clean, and prepare high-quality LLM training datasets from raw data. An accompanying technical report is available on arXiv.
DataFlow is an LLM-driven framework for automated data preparation and workflow engineering, featuring nearly 200 reusable operators and six domain-general pipelines that improve LLM performance across tasks like math, code, and Text-to-SQL.
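To make the idea of reusable operators composed into a pipeline concrete, here is a minimal generic sketch in Python. It does not use DataFlow's actual API: every name below (`strip_whitespace`, `drop_empty`, `deduplicate`, `run_pipeline`) is hypothetical and only illustrates the general pattern of small, single-purpose steps applied in sequence.

```python
from typing import Callable

# Hypothetical types for illustration only; not DataFlow's API.
Record = dict[str, str]
Operator = Callable[[list[Record]], list[Record]]


def strip_whitespace(records: list[Record]) -> list[Record]:
    """Trim leading and trailing whitespace from each record's text."""
    return [{**r, "text": r["text"].strip()} for r in records]


def drop_empty(records: list[Record]) -> list[Record]:
    """Remove records whose text is empty."""
    return [r for r in records if r["text"]]


def deduplicate(records: list[Record]) -> list[Record]:
    """Keep the first occurrence of each distinct text, preserving order."""
    seen: set[str] = set()
    unique: list[Record] = []
    for r in records:
        if r["text"] not in seen:
            seen.add(r["text"])
            unique.append(r)
    return unique


def run_pipeline(records: list[Record], operators: list[Operator]) -> list[Record]:
    """Apply each operator in order, feeding its output to the next one."""
    for op in operators:
        records = op(records)
    return records


raw = [
    {"text": "  2 + 2 = 4 "},
    {"text": ""},
    {"text": "2 + 2 = 4"},
    {"text": "SELECT 1;"},
]
cleaned = run_pipeline(raw, [strip_whitespace, drop_empty, deduplicate])
print(cleaned)
```

Because each operator takes and returns a list of records, the same steps can be reordered or reused across different pipelines, which is the property the summary above attributes to reusable operators.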