Sharing "cull" : my open-source dataset tool for image scraping & classification & captioning pipeline

Reddit r/LocalLLaMA Tools

Summary

Cull is an open-source machine curation engine for AI image datasets that automates scraping, classification, and captioning to prepare data for training LoRAs or fine-tuning models.

I *open-sourced* a tool I built and am maintaining called **Cull**. It's a machine curation engine for AI image datasets, the kind of work that eats hours every time you want to train a LoRA, build a reference library, or just sort an archive into something that isn't a 100,000-file mess.

# What it does, end to end

* Scrapes from Civitai (.com and .red), X/Twitter, Reddit, Discord, plus any URL gallery-dl supports (Pixiv, DeviantArt, the booru family, ArtStation, Tumblr, FurAffinity / e621, Imgur, Flickr, and ~340 others).
* Drops every image plus its source-side prompt into a local queue. Per-source dedup, no database.
* Classifies each image with a vision-language model (multiple LM Studio instances for local, Groq for cloud, anything OpenAI-compatible) using a strict 17-field JSON schema, so you don't get free-text replies you have to regex into shape.
* Sorts the keepers into category folders next to their .txt prompt and a .vision.json audit record. Two score gates (overall quality + topic relevance) that you tune in the UI.
* Surfaces everything through a Flask + Alpine dashboard: start/stop, source toggles, gallery, prompt editor, ZIP export, per-source stats.

(There are illustrative sketches of several of these pieces at the bottom of this post.)

# Two example use cases I actually used it for

* LoRA (300 images) and fine-tune (100,000 images) dataset prep:
  * Give it a topic such as "Female Influencer" or "{artist} style art".
  * Set AUTO_CAPTION_ENABLED=true if you want it to caption images, or false if you just want it to scrape (it still stores any prompts found on the posts it scraped from), and set whatever style of prompting you want.
  * Walk away.
  * Come back to a folder of triaged images split by quality and category, each with a generated SD-prompt .txt next to it.
  * ZIP-export the filtered view straight into your trainer.
* Ingesting a prompt-less archive:
  * Point LOCAL_IMPORT_DIR at a folder of bare JPEGs (or paste a gallery-dl URL list).
  * Toggle off the prompt requirement, turn on auto-captioning.
  * Every image is classified and sorted, and gets an SD-prompt / booru-tags / natural-language caption written by the same vision call that classifies it.
  * So you can train on a years-old archive without curating prompts by hand.

# Links

Repo: [https://github.com/tlennon-ie/cull](https://github.com/tlennon-ie/cull)

Screenshots: [https://imgur.com/a/kSvsAW9](https://imgur.com/a/kSvsAW9)

The roadmap will keep refining around what people actually use it for. On my list:

- more vision-worker backends
- a proper requeue UI
- a small headless CLI
- video scraping, classification, etc.

# A few things worth mentioning

- The vision worker is pluggable via a registry: subclass BaseVisionWorker, register it, done. Two LM Studio endpoints can run in parallel; there's a keepalive worker that pings every 15s if your local server has aggressive idle-unload, and an idle-unloader for when you want the VRAM back.
- It ships with a Claude Code skill bundle in .claude/skills/ (cull-helper, lmstudio-vision, metadata-schema) and three sub-agents in .claude/agents/. If you use Claude Code, Cursor, Aider, Codex, or anything that respects those files, your AI assistant knows Cull's load-bearing seams (categories, the queue Protocol, the vision-worker base class, the strict-output schema) before it touches anything.
- A self-updater is built in: a toast appears in the dashboard, you click Update, and it pulls from origin/main and relaunches.

Stack: Python 3.10+, Flask, Alpine.js, Pillow, Playwright (for the X scraper), gallery-dl. Single machine. No Redis, no DB, no Docker required. MIT licensed.
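A few sketches of the moving parts, for the curious. The scrape step for the long tail of sites is essentially a shell-out to gallery-dl; a minimal version might look like this (the queue directory layout is a placeholder, not Cull's actual format):

```python
import subprocess
from pathlib import Path

def scrape(url: str, queue_dir: Path) -> None:
    """Pull a gallery into the local queue; --write-metadata drops a .json
    next to each file, which is where source-side prompts can be recovered."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["gallery-dl", "--write-metadata", "-d", str(queue_dir), url],
        check=True,  # raise if the extractor fails
    )
```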
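The "strict 17-field JSON schema" part works because OpenAI-compatible servers (LM Studio included) accept a json_schema response_format that forces schema-conformant output. Here's a trimmed-down sketch of such a call; the endpoint, model name, and four-field schema are illustrative stand-ins, not Cull's real schema:

```python
import base64
import json
from pathlib import Path

import requests

ENDPOINT = "http://localhost:1234/v1/chat/completions"  # LM Studio default

# Stand-in for the real 17-field schema.
SCHEMA = {
    "type": "object",
    "properties": {
        "category":        {"type": "string"},
        "quality_score":   {"type": "number"},
        "relevance_score": {"type": "number"},
        "sd_prompt":       {"type": "string"},
    },
    "required": ["category", "quality_score", "relevance_score", "sd_prompt"],
    "additionalProperties": False,
}

def classify(image_path: Path) -> dict:
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = requests.post(ENDPOINT, json={
        "model": "qwen2-vl-7b-instruct",  # whatever vision model is loaded
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify this image per the schema."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        # Structured output: the server must emit schema-conformant JSON,
        # so there is nothing to regex into shape afterwards.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "image_audit", "strict": True,
                            "schema": SCHEMA},
        },
    }, timeout=120)
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```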
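The two score gates are then just threshold checks on that returned JSON before a file is sorted; conceptually the triage step is something like this (thresholds and folder layout are my illustration, not Cull's exact code):

```python
import json
import shutil
from pathlib import Path

MIN_QUALITY = 6.5    # "overall quality" gate, tuned in the dashboard
MIN_RELEVANCE = 7.0  # "topic relevance" gate

def triage(image: Path, audit: dict, prompt: str, out_root: Path) -> bool:
    """Keep an image only if it clears both gates, then write its sidecars."""
    if audit["quality_score"] < MIN_QUALITY or audit["relevance_score"] < MIN_RELEVANCE:
        return False  # culled
    dest = out_root / audit["category"]
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(image), dest / image.name)
    # The .txt prompt and .vision.json audit record sit next to the image.
    (dest / f"{image.stem}.txt").write_text(prompt)
    (dest / f"{image.stem}.vision.json").write_text(json.dumps(audit, indent=2))
    return True
```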
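For the two config switches named in the use cases, I'd expect plain environment variables; whether they live in a .env file or the shell is an assumption on my part, and the values below are examples:

```
AUTO_CAPTION_ENABLED=true
LOCAL_IMPORT_DIR=/data/old_archive
```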
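On the pluggable vision worker: the post names the seam (subclass BaseVisionWorker, register it) but not the registration API, so take this as the generic registry pattern rather than Cull's exact interface:

```python
from abc import ABC, abstractmethod

WORKERS: dict[str, type] = {}  # name -> BaseVisionWorker subclass

def register(name: str):
    """Class decorator that adds a worker to the registry."""
    def wrap(cls: type) -> type:
        WORKERS[name] = cls
        return cls
    return wrap

class BaseVisionWorker(ABC):
    @abstractmethod
    def classify(self, image_path: str) -> dict:
        """Return a dict conforming to the strict output schema."""

@register("my-backend")
class MyBackendWorker(BaseVisionWorker):
    def classify(self, image_path: str) -> dict:
        # Call your VLM of choice and return schema-shaped fields.
        return {}
```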
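And the keepalive worker for servers with aggressive idle-unload is just a daemon thread hitting a cheap endpoint on an interval; a minimal take (the /v1/models URL is LM Studio's default, the 15s interval matches the post):

```python
import threading
import time

import requests

def start_keepalive(url: str = "http://localhost:1234/v1/models",
                    every: float = 15.0) -> None:
    """Ping the server every 15s so it never idles long enough to unload."""
    def loop() -> None:
        while True:
            try:
                requests.get(url, timeout=5)
            except requests.RequestException:
                pass  # server down or restarting; keep trying
            time.sleep(every)
    threading.Thread(target=loop, daemon=True, name="keepalive").start()
```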