@wsl8297: When building RAG / data agents, the easiest place to get stuck is turning a pile of scattered files into a trackable, queryable, reusable dataset. With PDFs, images, logs, and annotation files in S3 / GCS / Azure in particular, management and iteration start to spiral out of control once the scale grows. https:/…
Summary
DataChain is a Python library that adds a context layer to unstructured files in S3, GCS, and Azure, turning them into versionable, queryable typed datasets with support for parallel processing, incremental updates, and agent workflow integration.
Cached at: 06/02/26, 05:35 PM
When building RAG / data agents, the hardest bottleneck is this: how to turn scattered files into trackable, queryable, reusable datasets. Especially PDFs, images, logs, and annotation files in S3 / GCS / Azure — once scale hits, management and iteration quickly spiral out of control. https://github.com/datachain-ai/datachain…
Recently I came across a Python library called DataChain with a clear approach: add a context layer to unstructured data, turning file assets in cloud storage into versioned, typed datasets. After that, you can filter, join, and run similarity search as you would in a data warehouse, and reuse the results directly in Agent / RAG pipelines.
Key features:
- Covers file data from object stores like S3, GCS, Azure;
- Manages structured fields with Pydantic schemas while retaining file pointers and lineage;
- Supports parallel / distributed Python processing, checkpointing, and incremental updates;
- Can export datasets as a Markdown knowledge base for easy consumption by humans and LLMs;
- Provides MCP / agent harness for integration with toolchains like Claude Code, Cursor, Codex, and Copilot.
If you’re working on multimodal datasets, internal enterprise knowledge bases, RAG evaluation sets, or data cleaning pipelines, this is worth a look.
datachain-ai/datachain
Source: https://github.com/datachain-ai/datachain
DataChain: The Context Layer for Unstructured Data
PyPI (https://pypi.org/project/datachain/) Python Version (https://pypi.org/project/datachain) Codecov (https://codecov.io/gh/datachain-ai/datachain) Tests (https://github.com/datachain-ai/datachain/actions/workflows/tests.yml) DeepWiki (https://deepwiki.com/datachain-ai/datachain)
A Python library that turns files in S3, GCS, and Azure into versioned, typed datasets, queryable at warehouse speed.
- Compute Engine: parallel and distributed Python over files. Async I/O, checkpoint recovery, incremental updates.
- Dataset DB: Pydantic schemas, versioning, file pointers, automatic lineage. Sub-second filter, join, and similarity search over hundreds of millions of records.
Optional, for agent workflows:
- Knowledge Base: markdown summaries derived from the Dataset DB and enriched by LLM. Readable by humans and LLMs.
- Agent Harness: skill and MCP server that plug all three into Claude Code, Cursor, Codex, GitHub Copilot, and Pi, so they understand your data.
Bytes never leave your storage. Every run deposits a typed dataset the next pipeline (or agent) reads instead of recomputing.
1. Install
```bash
pip install datachain
```
To add the agent skill (Knowledge Base + code generation):
```bash
datachain skill install --target claude
# also: cursor, codex, copilot, pi
```
Works with S3, GCS, Azure, and local filesystems.
2. Quickstart: agent-driven pipeline
Task: find dogs in S3 similar to a reference image, filtered by breed, mask availability, and image dimensions.
Grab a reference image and run Claude Code (or other agent):
```bash
datachain cp --anon s3://dc-readme/fiona.jpg .
claude
```
Prompt:
```prompt
Find dogs in s3://dc-readme/oxford-pets-micro/ similar to ./fiona.jpg:
- Pull breed metadata and mask files from annotations/
- Exclude images without mask
- Exclude Cocker Spaniels
- Only include images wider than 400px
```
Result:
```
┌──────┬───────────────────────────────────┬────────────────────────────┬──────────┐
│ Rank │ Image                             │ Breed                      │ Distance │
├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤
│ 1    │ shiba_inu_52.jpg                  │ shiba_inu                  │ 0.244    │
├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤
│ 2    │ shiba_inu_53.jpg                  │ shiba_inu                  │ 0.323    │
├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤
│ 3    │ great_pyrenees_17.jpg             │ great_pyrenees             │ 0.325    │
└──────┴───────────────────────────────────┴────────────────────────────┴──────────┘
```

Fiona's closest matches are shiba inus (both top spots), which makes sense given her tan coloring and pointed ears.
The agent decomposed the task into steps - embeddings, breed metadata, mask join, quality filter - and saved each as a named, versioned dataset. Next time you ask a related question, it starts from what’s already built.
The datasets are registered in a Knowledge Base optimized for both agents and humans:
```bash
dc-knowledge
├── buckets
│   └── s3
│       └── dc_readme.md
├── datasets
│   ├── oxford_micro_dog_breeds.md
│   ├── oxford_micro_dog_embeddings.md
│   └── similar_to_fiona.md
└── index.md
```
Browse it as markdown files, navigate with wikilinks, or open in Obsidian (https://obsidian.md/):
3. Data Harness
Code harnesses (Claude Code, Cursor, Codex, GitHub Copilot, Pi) give agents repo context, dedicated tools, and memory across sessions. DataChain adds the same for data: typed datasets the agent reads, chain operations the agent calls (read_storage, map, save), a Dataset DB where its results persist.
A dataset is the unit of work - a named, versioned result of a pipeline step, referenced as name@version. Every .save() registers one.
For the data-flow architecture (Compute Engine, Dataset DB, Knowledge Base) and how the components connect, see Architecture (https://docs.datachain.ai/architecture/).
4. Core concepts
4.1. Dataset
A dataset is a versioned data reasoning step - what was computed, from what input, producing what schema. DataChain indexes your storage into one: no data copied, just typed metadata and file pointers. Re-runs only process new or changed files.
Create a dataset manually in create_dataset.py:

```python
import io

from PIL import Image
from pydantic import BaseModel

import datachain as dc


class ImageInfo(BaseModel):
    width: int
    height: int


def get_info(file: dc.File) -> ImageInfo:
    # Read the file bytes from storage and extract the image dimensions
    img = Image.open(io.BytesIO(file.read()))
    return ImageInfo(width=img.width, height=img.height)


ds = (
    dc.read_storage(
        "s3://dc-readme/oxford-pets-micro/images/**/*.jpg",
        anon=True,
        update=True,
        delta=True,  # re-runs skip unchanged files
    )
    .settings(prefetch=64)
    .map(info=get_info)
    .save("pets_images")
)
ds.show(5)
```
The saved pets_images dataset, at its current version, is now the shared reference to this data - schema, version, lineage, and metadata. Every .save() registers the dataset in the Dataset DB, DataChain’s persistent store for schemas, versions, lineage, and processing state, kept locally in a SQLite database at .datachain/db. Pipelines reference datasets by name, not by path. When the code or input data changes, the next run bumps the dataset version. This is what makes a dataset a management unit: owned, versioned, and queryable by everyone on the team.
4.2. Schemas and types
DataChain uses Pydantic to define the shape of every column. The return type of your UDF becomes the dataset schema - each field a queryable column in the Dataset DB. show() in the previous script renders nested fields as dotted columns:
```bash
                                          file    file   info    info
                                          path    size  width  height
0  oxford-pets-micro/images/Abyssinian_141.jpg  111270    461     500
1  oxford-pets-micro/images/Abyssinian_157.jpg  139948    500     375
2  oxford-pets-micro/images/Abyssinian_175.jpg   31265    600     234
3  oxford-pets-micro/images/Abyssinian_220.jpg   10687    300     225
4    oxford-pets-micro/images/Abyssinian_3.jpg   61533    600     869
[Limited by 5 rows]
```

.print_schema() renders its schema:

```bash
file: File@v1
    source: str
    path: str
    size: int
    version: str
    etag: str
    is_latest: bool
    last_modified: datetime
    location: Union[dict, list[dict], NoneType]
info: ImageInfo
    width: int
    height: int
```
Models can be arbitrarily nested - a BBox inside an Annotation, a List[Citation] inside an LLM Response - every leaf field stays queryable the same way. The schema lives in the Dataset DB and is enforced at dataset creation time.
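As a rough illustration of how nested models end up as dotted leaf columns, the sketch below walks a UDF's return annotation and lists every leaf field with its type. It uses standard-library dataclasses as a stand-in for Pydantic models to stay dependency-free; `annotate`, `leaf_columns`, and the `ann` column name are hypothetical and not part of DataChain.

```python
from dataclasses import dataclass, is_dataclass
from typing import get_type_hints


@dataclass
class BBox:
    x: int
    y: int
    w: int
    h: int


@dataclass
class Annotation:
    label: str
    bbox: BBox


def annotate(data: bytes) -> Annotation:
    """Hypothetical UDF; only its return annotation matters here."""
    return Annotation(label="unknown", bbox=BBox(0, 0, 0, 0))


def leaf_columns(model, prefix):
    """Map dotted column names to leaf types for a (possibly nested) dataclass."""
    columns = {}
    for name, typ in get_type_hints(model).items():
        if is_dataclass(typ):
            # Nested model: recurse and extend the dotted prefix
            columns.update(leaf_columns(typ, f"{prefix}.{name}"))
        else:
            columns[f"{prefix}.{name}"] = typ
    return columns


# The UDF's return type drives the schema of the new "ann" column
schema = leaf_columns(get_type_hints(annotate)["return"], "ann")
for column, typ in schema.items():
    print(column, typ.__name__)
```

Each printed name (for example `ann.bbox.w`) corresponds to the kind of dotted path used in filters such as `info.width` elsewhere in this README.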
The Dataset DB handles datasets of any size - hundreds of millions of files, hundreds of metadata rows - without loading anything into memory. Pandas is limited by RAM; DataChain is not. Export to pandas when you need it, on a filtered subset:

```python
import datachain as dc

df = dc.read_dataset("pets_images").filter(dc.C("info.width") > 500).to_pandas()
print(df)
```
4.3. Fast queries
Filters, aggregations, and joins run as vectorized operations directly against the Dataset DB - metadata never leaves your machine, no files downloaded.
```python
import datachain as dc

cnt = (
    dc.read_dataset("pets_images")
    .filter(
        (dc.C("info.width") > 400)
        & ~dc.C("file.path").ilike("%cocker_spaniel%")  # case-insensitive
    )
    .count()
)
# The ~ negates the match, so Cocker Spaniels are excluded from the count
print(f"Large images excluding Cocker Spaniels: {cnt}")
```

Milliseconds, even at 100M-file scale.

Large images excluding Cocker Spaniels: 6
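To make "runs against the Dataset DB" more concrete, here is a minimal sketch of what such a filter amounts to when metadata sits in a SQL table. It assumes a flat, hypothetical `pets` table in an in-memory SQLite database with made-up rows; DataChain's real table layout and generated SQL are not shown in this README and may differ.

```python
import sqlite3

# Hypothetical metadata rows: (file path, image width). No image bytes involved.
rows = [
    ("images/shiba_inu_52.jpg", 500),
    ("images/english_cocker_spaniel_7.jpg", 640),
    ("images/Abyssinian_3.jpg", 600),
    ("images/great_pyrenees_17.jpg", 300),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (file_path TEXT, info_width INTEGER)")
conn.executemany("INSERT INTO pets VALUES (?, ?)", rows)

# Width filter plus a negated pattern match, evaluated entirely inside the database.
# SQLite's LIKE is case-insensitive for ASCII by default, similar to ilike().
(cnt,) = conn.execute(
    "SELECT COUNT(*) FROM pets "
    "WHERE info_width > 400 AND file_path NOT LIKE '%cocker_spaniel%'"
).fetchone()
print(cnt)
conn.close()
```

The query touches only the metadata table, which is why no files need to be downloaded to answer it.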
5. Resilient Pipelines
When computation is expensive, bugs and new data are both inevitable. DataChain tracks processing state in the Dataset DB - so crashes and new data are handled automatically, without changing how you write pipelines.
5.1. Data checkpoints
Save to embed.py:
```python
import io

import open_clip
import torch
from PIL import Image

import datachain as dc

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", "laion2b_s34b_b79k")
model.eval()

counter = 0


def encode(file: dc.File, model, preprocess) -> list[float]:
    global counter
    counter += 1
    if counter > 236:                # ← bug: remove these two lines
        raise Exception("some bug")  # ←
    img = Image.open(io.BytesIO(file.read())).convert("RGB")
    with torch.no_grad():
        return model.encode_image(preprocess(img).unsqueeze(0))[0].tolist()


(
    dc.read_dataset("pets_images")
    .settings(batch_size=100)
    .setup(model=lambda: model, preprocess=lambda: preprocess)
    .map(emb=encode)
    .save("pets_embeddings")
)
```
It fails due to a bug in the code:
Exception: some bug
Remove the two marked lines and re-run. DataChain resumes from image 201, the start of the last uncommitted batch (the first two batches of 100 were already committed):

```
$ python embed.py
UDF 'encode': Continuing from checkpoint
```
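The arithmetic behind "resumes from image 201" can be checked with a small simulation. This is only a sketch of batch-level checkpointing under the assumption that a batch is persisted all at once when it completes; the `run` helper, the 1,000-item input, and the doubling "UDF" are hypothetical and do not reflect DataChain internals.

```python
def run(items, batch_size, committed, fail_after=None):
    """Process items in batches; results are persisted one whole batch at a time.

    `committed` plays the role of the checkpoint: results saved so far.
    Returns the index the run started from.
    """
    start = len(committed)  # resume right after the last committed batch
    seen = 0                # items handled in this run, like `counter` in embed.py
    for lo in range(start, len(items), batch_size):
        batch = []
        for item in items[lo:lo + batch_size]:
            seen += 1
            if fail_after is not None and seen > fail_after:
                raise RuntimeError("some bug")  # partial batch is lost
            batch.append(item * 2)              # stand-in for the expensive UDF
        committed.extend(batch)                 # commit the finished batch
    return start


items = list(range(1, 1001))  # hypothetical images numbered 1..1000
checkpoint = []

try:
    run(items, 100, checkpoint, fail_after=236)  # first run crashes on item 237
except RuntimeError:
    pass
after_crash = len(checkpoint)

resumed_at = run(items, 100, checkpoint)  # second run, bug removed
print(after_crash, items[resumed_at], len(checkpoint))  # 200 201 1000
```

Items 201 to 236 are recomputed on the second run: that is the cost of committing per batch rather than per item.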
5.2. Similarity search
The vectors live in the Dataset DB alongside all the metadata, stored as a list[float] type in Pydantic schemas. Querying them is instant - no files are re-read, and vector search can be combined with non-vector filters like info.width:
Prepare data:
```bash
datachain cp s3://dc-readme/fiona.jpg .
```
similar.py:
```python
import io

import open_clip
import torch
from PIL import Image

import datachain as dc

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", "laion2b_s34b_b79k")
model.eval()

# Embed the reference image with the same model used for the dataset
ref_emb = model.encode_image(
    preprocess(Image.open("fiona.jpg")).unsqueeze(0)
)[0].tolist()

(
    dc.read_dataset("pets_embeddings")
    .filter(dc.C("info.width") > 500)  # from pets_images - no re-read
    .mutate(dist=dc.func.cosine_distance(dc.C("emb"), ref_emb))
    .order_by("dist")
    .limit(3)
    .show()
)
```
Under a second - everything runs against the Dataset DB.
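For readers who want to see what the ranking computes, the following pure-Python sketch reproduces the same filter, distance, sort, and limit steps on four made-up records. It assumes cosine distance is defined as 1 minus cosine similarity, which is the usual convention; the records, two-dimensional vectors, and helper name are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for identical directions, 1.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Hypothetical rows: metadata and embedding stored side by side
records = [
    {"path": "a.jpg", "width": 600, "emb": [1.0, 0.0]},
    {"path": "b.jpg", "width": 300, "emb": [1.0, 0.0]},  # perfect match, but too narrow
    {"path": "c.jpg", "width": 640, "emb": [0.0, 1.0]},
    {"path": "d.jpg", "width": 520, "emb": [1.0, 1.0]},
]
ref_emb = [1.0, 0.0]

top = sorted(
    (
        {"path": r["path"], "dist": cosine_distance(r["emb"], ref_emb)}
        for r in records
        if r["width"] > 500  # non-vector filter applied before ranking
    ),
    key=lambda r: r["dist"],
)[:3]

for row in top:
    print(row["path"], round(row["dist"], 3))
```

Note that `b.jpg` never appears in the result even though its vector matches the reference exactly, because the width filter removes it before ranking.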
5.3. Incremental updates
The bucket in this walkthrough is static, so there’s nothing new to process. But in production, when new images land in your bucket, re-run the same scripts unchanged. delta=True in the original dataset ensures that only new files are processed end to end, while the whole dataset is updated to a new version of pets_images:
```bash
$ python create_dataset.py
# 500 new images arrived
Skipping 10,000 unchanged · indexing 500 new
Saved [email protected] (+500 records)

# Next day:
$ python create_dataset.py
Skipping 10,000 unchanged · processing 500 new
Saved [email protected] (+500 records)
```
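The skip-or-process decision can be sketched as a comparison between the previous version's file index and a fresh bucket listing. This assumes change detection keyed on path and etag (both fields appear in the File schema shown earlier); the `split_delta` helper and the synthetic listing are hypothetical, and DataChain's actual comparison keys may differ.

```python
def split_delta(previous, listing):
    """Split a fresh listing into files to process and files to skip.

    `previous` maps path -> etag as recorded in the last saved version;
    `listing` maps path -> etag as seen in the bucket right now.
    """
    to_process = {p: e for p, e in listing.items() if previous.get(p) != e}
    unchanged = {p: e for p, e in listing.items() if previous.get(p) == e}
    return to_process, unchanged


# Last saved version: 10,000 files
previous = {f"img_{i}.jpg": f"etag-{i}" for i in range(10_000)}

# Bucket today: the same files, 500 new ones, and one overwritten file
listing = dict(previous)
listing.update({f"img_{i}.jpg": f"etag-{i}" for i in range(10_000, 10_500)})
listing["img_0.jpg"] = "etag-0-modified"

to_process, unchanged = split_delta(previous, listing)
print(len(unchanged), len(to_process))  # 9999 501
```

A new file has no previous etag and a changed file has a different one, so both land in `to_process`, while everything else is skipped.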
6. Knowledge Base
DataChain maintains two layers. The Dataset DB is the ground truth: schemas, processing state, lineage, the vectors themselves. The Knowledge Base is derived from it: structured markdown for humans and agents to read. Because it’s derived, it’s always accurate.
The Knowledge Base is stored in dc-knowledge/. Ask the agent to build it (from Claude Code, Cursor, Codex, GitHub Copilot, or Pi):
```bash
claude
```

Prompt:

```prompt
Build a Knowledge Base for my current datasets
```
The skill generates the dc-knowledge/ directory from the Dataset DB, with one file per dataset and bucket.
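To give a feel for what "derived from the Dataset DB" could mean in practice, the sketch below renders one markdown page from a dataset's name, version, schema, and parent datasets. The page layout, the `render_dataset_page` helper, and the `1.0.0` version string are assumptions made for illustration; the files the skill actually writes are not reproduced in this README.

```python
def render_dataset_page(name, version, schema, parents):
    """Build a markdown summary for one dataset from its recorded metadata."""
    lines = [f"# {name}", "", f"Version: {version}", "", "## Schema", ""]
    lines += [f"- `{column}`: {typ}" for column, typ in schema.items()]
    lines += ["", "## Lineage", ""]
    # Wikilinks let a reader (or an agent) hop to the parent dataset's page
    lines += [f"- [[{parent}]]" for parent in parents] or ["- (no parent datasets)"]
    return "\n".join(lines) + "\n"


page = render_dataset_page(
    name="pets_embeddings",
    version="1.0.0",  # hypothetical version string
    schema={"file.path": "str", "info.width": "int", "emb": "list[float]"},
    parents=["pets_images"],
)
print(page)
```

Because every line of the page comes from recorded metadata rather than free text, regenerating it after a new .save() keeps it in sync with the underlying dataset.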
7. AI-Generated Pipelines
The skill gives the agent data awareness: it reads dc-knowledge/ to understand what datasets exist, their schemas, which fields can be joined - and the meaning of columns inferred from the code that produced them.
See section 2, Quickstart: agent-driven pipeline, above. Every step written by hand in sections 4 and 5 could instead be generated by the agent.
8. Team and cloud: Studio
Data context built locally stays local. DataChain Studio makes it shared.
```bash
datachain auth login
datachain job run --workers 20 --cluster gpu-pool caption.py
# ✓ Job submitted → studio.datachain.ai/jobs/1042
# Resuming from checkpoint (4,218 already done)…
# Saved [email protected] (3,182 processed)
```
Studio adds: shared dataset registry, access control, UI for video/DICOM/NIfTI/point clouds, lineage graphs, reproducible runs. Bring Your Own Cloud - all data and compute stay in your infrastructure. AWS, GCP, Azure, on-prem Kubernetes. → studio.datachain.ai (https://studio.datachain.ai)
9. Contributing
Contributions are very welcome. To learn more, see the Contributor Guide (https://docs.datachain.ai/contributing).
10. Community and Support
- Report an issue (https://github.com/datachain-ai/datachain/issues) if you encounter any problems
- Docs (https://docs.datachain.ai/)
- Twitter (https://twitter.com/datachain_ai)