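@gkxspace: I found a wild open-source tool: you type one sentence describing the data you want, it dispatches a swarm of AI agents to research sites across the web in parallel, and a few minutes later it hands you the results as a structured table. The data is all sitting on the web, but turning it into a usable table has always been grunt work. It used to be an engineering project: stitch together search, write scrapers…
Summary
BigSet is an open-source tool: you describe the data you need in one sentence, and it dispatches multiple AI agents to research the web in parallel, then infers the schema, deduplicates, verifies, and produces a structured table, with support for scheduled refresh.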
Cached: 2026/06/03 05:44
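I found a wild open-source tool: you type one sentence describing the data you want, it dispatches a swarm of AI agents to research sites across the web in parallel, and a few minutes later it hands you the results as a structured table.
The data is all sitting on the web, but turning it into a usable table has always been grunt work. It used to be an engineering project: stitch together search, write scrapers, design a schema, deduplicate, then add a scheduled job to keep it fresh, and start over for every dataset.
Exa raised 250 million, which validated one thing: that distance can be shortened to a single sentence. But Websets currently covers only companies, people, and papers, and it is all cached data.
@Tiny_Fish open-sourced something called BigSet that does the same thing, with no restriction on topic.
I tried one sentence: “Popular B2B SaaS tools that offer a free tier or freemium plan, with name, category, free-tier summary, and pricing page.” A few minutes later I had a complete table.
Behind it is a real multi-agent architecture: an orchestrator agent discovers entities, parallel sub-agents each research one, the tool budget is capped at 6 calls, and data sources are traceable. The entire codebase is open source under AGPL-3.0 and self-hosted, using your own keys.
If BigSet is useful to you, remember to star the GitHub repo → http://github.com/tinyfish-io/bigset…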
tinyfish-io/bigset
Source: https://github.com/tinyfish-io/bigset
Build and maintain any dataset from the live web, that refreshes regularly
⚠️ BigSet is experimental. It works, sometimes surprisingly well, but expect rough edges. We’re building in the open and shipping fast. Things will break, improve, and change. Issues and feedback are very welcome.
What Is BigSet?
You type a sentence:
“YC companies that are currently hiring engineers, with their funding stage, location, and number of open roles.”
BigSet infers the schema automatically, sends autonomous agents to research it on the live web, verifies what they find against real sources, deduplicates, and hands you a structured dataset. Download as CSV or XLSX. Set a refresh cadence (30 min, 6 hours, 12 hours, daily, weekly) and the agents re-run on schedule, pulling fresh data so the dataset never goes stale.
Any topic. GPU prices. Competitor features. Research papers. Restaurant menus. Insurance quotes. Whatever you type, it builds. And keeps current.
You don’t pick a scraper, write selectors, or point it at a URL. You just describe the data you care about, set a refresh cadence, and BigSet handles the rest.
Built on TinyFish APIs.
✨ Why BigSet?
At the end of the day, every interaction with the web, whether it’s you or your AI agent, ultimately comes down to data. Prices, companies, jobs, research, availability, inventory. The web has all of it, scattered across millions of pages.
There are great tools out there for parts of this problem. Scraping frameworks that extract content from URLs you point them at. Search APIs that return ranked results. Pre-built actors for specific sites. Lead gen platforms that produce verified lists of people and companies. They work, and they work well for what they do.
But the moment you need something that cuts across those categories, or something none of them cover, you’re back to square one. Stitching together search, extraction, schema design, deduplication, verification, and a cron job to keep it fresh. For every dataset. Every time. The data is right there on the web. Getting it into a table you can use is still a project.
BigSet closes that gap. One sentence in, verified structured data out, refreshed on whatever cadence you set. Your agents get live data to reason over; you get a table you can actually use.
Any dataset. Any source. Always fresh. That’s the idea.
How It Works
- You describe the dataset in plain English, as vague or specific as you like
- AI infers the schema: column names, types, primary keys, where to look on the web
- An orchestrator agent discovers entities via web search
- Sub-agents fan out in parallel: each one investigates a single entity, fetches real data, and inserts a verified row
- You get a structured table: browse it in the UI, export CSV or XLSX
- Set a refresh cadence and the agents re-run on schedule, keeping the dataset current automatically
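The flow above can be sketched in a few lines. This is a minimal illustration and not BigSet's actual code: `discover_entities`, `research_entity`, and `build_dataset` are hypothetical names, the web calls are replaced by placeholders, and the six-call tool budget is taken from the post quoted at the top of this page (how it is enforced in the real agents is an assumption).

```python
import asyncio

TOOL_BUDGET = 6  # assumption: per-sub-agent cap on tool calls


async def discover_entities(prompt: str) -> list[str]:
    """Hypothetical orchestrator step; a real one would run web searches."""
    # "acme" duplicates "Acme" on purpose so the dedup step has work to do.
    return ["Acme", "Globex", "acme", "Initech"]


async def research_entity(name: str) -> dict:
    """Hypothetical sub-agent: researches one entity within the tool budget."""
    calls_used = 0
    row = {"name": name.strip().title(), "source": None}
    while calls_used < TOOL_BUDGET and row["source"] is None:
        calls_used += 1
        await asyncio.sleep(0)  # placeholder for a real search/fetch call
        row["source"] = f"https://example.com/{name.lower()}"
    row["tool_calls"] = calls_used
    return row


async def build_dataset(prompt: str) -> list[dict]:
    entities = await discover_entities(prompt)
    # Fan out: one sub-agent per entity, all running concurrently.
    rows = await asyncio.gather(*(research_entity(e) for e in entities))
    # Deduplicate on the primary key (here, the normalized name).
    unique: dict[str, dict] = {}
    for row in rows:
        unique.setdefault(row["name"], row)
    return list(unique.values())


rows = asyncio.run(build_dataset("B2B SaaS tools with a free tier"))
```

Keeping the `source` URL on each row is what makes a result traceable; deduplicating after the fan-out keeps the sub-agents independent of one another.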
Things to Know Before You Start
- It’s experimental. Expect rough edges; schema inference isn’t always perfect, and some topics work better than others.
- Dataset generation takes 2-5 minutes. The agents are doing real web research: searching, fetching pages, verifying data. It’s not instant, but the output is real.
- It works best for topics with publicly available web data. If the information exists on public web pages, BigSet can probably find it. Data behind logins or paywalls is out of reach for now.
- Scheduled refresh keeps datasets current. Set a cadence (30 min to weekly) and the agents re-run automatically. No manual re-runs.
- Datasets are downloadable, not queryable. You can browse in the UI and export CSV/XLSX. SQL query support is on the roadmap.
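As a rough illustration of how a refresh cadence could translate into a schedule, the sketch below maps the five cadences mentioned above to time intervals. The labels (`"30min"`, `"6h"`, and so on) and both function names are hypothetical; BigSet's real scheduler may work differently.

```python
from datetime import datetime, timedelta, timezone

# The five cadences listed in this README; the string keys are made-up labels.
CADENCES = {
    "30min": timedelta(minutes=30),
    "6h": timedelta(hours=6),
    "12h": timedelta(hours=12),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_run(last_run: datetime, cadence: str) -> datetime:
    """Return the time at which the dataset should next be refreshed."""
    return last_run + CADENCES[cadence]


def is_due(last_run: datetime, cadence: str, now: datetime) -> bool:
    """True once the cadence interval has fully elapsed since the last run."""
    return now >= next_run(last_run, cadence)


last = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
```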
🚀 Quick Start
Prerequisites: Docker and Make
You’ll also need API keys from three services (all free to set up):
| Service | What it’s for | Get your key |
|---|---|---|
| TinyFish | Web search + page fetching | tinyfish.ai/api-keys |
| OpenRouter | LLM calls (schema inference + agents) | openrouter.ai/settings/keys |
| Clerk | User authentication | dashboard.clerk.com |
Step 1: Clone the repo
git clone https://github.com/tinyfish-io/bigset.git
cd bigset
cp .env.example .env
Step 2: Set up TinyFish (web access)
TinyFish powers all web search and page fetching. Search and Fetch have generous rate limits.
- Go to tinyfish.ai and create an account
- Go to API Keys and create a key
- Paste it as `TINYFISH_API_KEY` in `.env`
Step 3: Set up OpenRouter (LLM)
OpenRouter routes LLM calls to Claude Sonnet (schema inference) and Qwen (agents). It’s pay-as-you-go; a dataset costs a few dollars in LLM usage.
- Go to openrouter.ai and create an account
- Go to Settings → Keys and create an API key
- Paste it as `OPENROUTER_API_KEY` in `.env`
- Add some credits; $5-10 is plenty to start
Step 4: Set up Clerk (auth)
Clerk handles user sign-in. The setup takes ~2 minutes:
- Go to dashboard.clerk.com and create a new application
- Pick a sign-in method (email, Google, GitHub, whatever you prefer)
- Once created, go to Configure → API Keys in the sidebar
  - Copy Publishable Key → paste as `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY` in `.env`
  - Copy Secret Key → paste as `CLERK_SECRET_KEY` in `.env`
- Go to Configure → JWT Templates in the sidebar
- Click New template → select the Convex template → click Save
- Go to Configure → Settings (or Domains)
  - Find your Issuer URL (looks like `https://your-app-name.clerk.accounts.dev`)
  - Paste it as `CLERK_JWT_ISSUER_DOMAIN` in `.env`
Step 5: Start everything
make dev
This installs dependencies, builds and starts all Docker services (Postgres, Convex, frontend, backend, Mastra), and deploys the Convex schema. On first run, it automatically generates the Convex admin key — no manual steps needed. See How make dev Works for the full breakdown.
Once everything is ready, you’ll see:
| Service | URL |
|---|---|
| BigSet app | localhost:3500 |
| Convex dashboard | localhost:6791 |
| Mastra Studio (workflow inspector) | localhost:4111 |
Open localhost:3500 and click Get started to sign in.
Note: root `.env` is the only local env file. If you edit Convex functions in `frontend/convex/`, run `make convex-push` to deploy the changes.
Free tier: each signed-in account gets 2,500 row operations per calendar month (resets on the 1st, UTC). The header shows a live usage badge; system-owned curated datasets bypass the quota.
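To make the quota rule concrete, here is a small sketch of a monthly counter that resets on the 1st (UTC) and lets system-owned datasets through. `QuotaTracker`, `month_key`, and the in-memory storage are hypothetical stand-ins; the real quota helpers live in the Convex functions and may be structured differently.

```python
from datetime import datetime, timezone

MONTHLY_ROW_OPS = 2_500  # free-tier limit stated above


def month_key(now: datetime) -> str:
    """Bucket usage by UTC calendar month, so it resets on the 1st (UTC)."""
    return now.astimezone(timezone.utc).strftime("%Y-%m")


class QuotaTracker:
    """In-memory stand-in for a per-account usage counter."""

    def __init__(self) -> None:
        self._used: dict[tuple[str, str], int] = {}

    def remaining(self, user_id: str, now: datetime) -> int:
        return MONTHLY_ROW_OPS - self._used.get((user_id, month_key(now)), 0)

    def try_consume(self, user_id: str, ops: int, now: datetime,
                    system_owned: bool = False) -> bool:
        if system_owned:  # curated datasets bypass the quota
            return True
        if ops > self.remaining(user_id, now):
            return False
        key = (user_id, month_key(now))
        self._used[key] = self._used.get(key, 0) + ops
        return True


tracker = QuotaTracker()
jan = datetime(2026, 1, 31, 23, 59, tzinfo=timezone.utc)
feb = datetime(2026, 2, 1, 0, 0, tzinfo=timezone.utc)
```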
Step 6 (optional): Load curated datasets
BigSet includes 9 curated public datasets (AI companies hiring, GPU prices, model pricing, etc.) that show on the landing page:
make seed-public-datasets
This is idempotent; safe to run multiple times.
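Idempotent here means a second run leaves the same state as the first. One common way to get that property is to upsert by a stable key, as in the sketch below; the `slug` field, the `seed` function, and the sample entries are illustrative assumptions, not the actual seed script.

```python
def seed(store: dict[str, dict], curated: list[dict]) -> int:
    """Upsert each curated dataset by slug; return how many were newly created."""
    created = 0
    for dataset in curated:
        if dataset["slug"] not in store:
            created += 1
        store[dataset["slug"]] = dataset  # overwrite instead of appending
    return created


# Hypothetical sample entries, loosely named after the examples above.
CURATED = [
    {"slug": "gpu-prices", "title": "GPU prices"},
    {"slug": "model-pricing", "title": "Model pricing"},
]

store: dict[str, dict] = {}
first = seed(store, CURATED)
second = seed(store, CURATED)
```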
How make dev Works
make dev is designed to handle everything — first run, subsequent runs, and recovery from bad state. You should never need to run any other setup command. Here’s what it does, in order:
- Validates your `.env`: checks that all required API keys are set (Clerk, OpenRouter, TinyFish). Stops with a clear error if anything is missing.
- Installs dependencies: runs `npm install` in both `frontend/` and `backend/`. Silent if already up to date.
- Starts the database layer: brings up Postgres and Convex (self-hosted) first, since other services depend on them.
- Waits for Convex: polls the Convex health endpoint until it’s ready (up to 120s).
- Ensures the admin key: if `CONVEX_SELF_HOSTED_ADMIN_KEY` is empty in `.env`, generates one automatically and writes it. If a key exists, validates it against the running Convex instance. If the key is stale (e.g. you ran `make clean` and wiped the database), it detects the mismatch and regenerates.
- Pushes Convex config: sets the Clerk JWT issuer URL in Convex so auth tokens are validated correctly.
- Deploys Convex schema: pushes the table schema and functions from `frontend/convex/` to the running instance.
- Starts remaining services: brings up the frontend, backend, and Mastra. These read the now-populated `.env` including the admin key.
- Streams logs: tails all container logs so you can see what’s happening. `Ctrl+C` to stop watching (containers keep running).
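Two of these steps, waiting for Convex and ensuring the admin key, follow patterns that are easy to sketch. The code below is an illustration under assumptions: the Makefile itself is not shown here, and `wait_until_ready` and `admin_key_action` are hypothetical names. The health check and key validation are passed in as callables, so nothing here talks to a real Convex instance.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool], timeout_s: float = 120.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` until it succeeds or the timeout (120s by default) elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def admin_key_action(current_key: str, is_valid: Callable[[str], bool]) -> str:
    """Decide what to do with the admin key: generate, regenerate, or keep."""
    if not current_key:
        return "generate"  # first run: nothing in .env yet
    if not is_valid(current_key):
        return "regenerate"  # stale key, e.g. after the database was wiped
    return "keep"


# Simulated health endpoint that becomes ready on the third poll.
attempts = iter([False, False, True])
became_ready = wait_until_ready(lambda: next(attempts), sleep=lambda s: None)

# Simulated clock that advances one second per reading, to show the timeout path.
ticks = iter(range(1000))
timed_out = not wait_until_ready(lambda: False, timeout_s=3,
                                 sleep=lambda s: None, clock=lambda: next(ticks))
```

Treating an empty key and a stale key as two distinct cases is what lets a single command cover both a first run and a recovery after `make clean`.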
Commands
You only need three commands:
| Command | What it does |
|---|---|
| `make dev` | Start everything (or recover from any broken state) |
| `make down` | Stop all containers (data is preserved) |
| `make clean` | Stop containers, delete all data, and clear the admin key |
Other commands you might use during development:
| Command | What it does |
|---|---|
| `make convex-push` | Deploy Convex schema changes (run after editing `frontend/convex/`) |
| `make seed-public-datasets` | Load 9 curated public datasets for the landing page |
What if something goes wrong?
make dev is self-healing. If you hit a problem, the fix is almost always just running make dev again.
| Problem | What happens |
|---|---|
| Missing `.env` | Error: “Run: `cp .env.example .env`” |
| Missing API key | Error tells you exactly which key to set |
| Stale admin key (after `make clean`) | Detected automatically, regenerated |
| Containers already running | No-op for running services, starts any that are missing |
| Convex won’t start | Error after 120s timeout — check Docker is running |
If you want a completely fresh start: make clean then make dev.
Your .env at a Glance
| Variable | Required | Where to get it |
|---|---|---|
| `TINYFISH_API_KEY` | ✅ | tinyfish.ai → API Keys |
| `OPENROUTER_API_KEY` | ✅ | openrouter.ai → Settings → Keys |
| `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY` | ✅ | Clerk dashboard → API Keys |
| `CLERK_SECRET_KEY` | ✅ | Clerk dashboard → API Keys |
| `CLERK_JWT_ISSUER_DOMAIN` | ✅ | Clerk dashboard → Settings/Domains |
| `CONVEX_SELF_HOSTED_ADMIN_KEY` | Auto | Auto-generated by `make dev` on first run |
| `RESEND_API_KEY` | Optional | For “dataset ready” emails. Leave blank to skip. |
| `NEXT_PUBLIC_POSTHOG_KEY` | Optional | For product analytics. Leave blank to disable. |
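If you want to check a `.env` yourself before running anything, a check along these lines reports which required variables are unset. It is a sketch, not the validation `make dev` actually performs: `parse_env` and `missing_keys` are hypothetical helpers, the parser handles only simple `KEY=value` lines, and the sample values are placeholders.

```python
# The five variables marked as required in the table above.
REQUIRED_KEYS = [
    "TINYFISH_API_KEY",
    "OPENROUTER_API_KEY",
    "NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY",
    "CLERK_SECRET_KEY",
    "CLERK_JWT_ISSUER_DOMAIN",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> list[str]:
    """Return required keys that are absent or set to an empty value."""
    env = parse_env(text)
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Placeholder values for illustration only.
SAMPLE = """
# example values only
TINYFISH_API_KEY=tf_example
OPENROUTER_API_KEY=
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY="pk_test_example"
CLERK_SECRET_KEY=sk_test_example
CONVEX_SELF_HOSTED_ADMIN_KEY=
"""

missing = missing_keys(SAMPLE)
```

`CONVEX_SELF_HOSTED_ADMIN_KEY` is left out of the required list on purpose, since the table marks it as auto-generated.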
🛠 Tech Stack
| Layer | Tech |
|---|---|
| Frontend | Next.js 16, React 19, Tailwind 4 |
| Backend | Fastify, TypeScript (agent runner) |
| Auth | Clerk |
| Database | Convex (self-hosted) |
| Data Collection | TinyFish APIs (Search, Fetch, Browser) |
| AI orchestration | Mastra workflows + Vercel AI SDK + OpenRouter → Claude Sonnet (schema inference + populate agent) |
| Table view | TanStack Table + react-window virtualization |
| Exports | CSV (built-in) + XLSX (SheetJS, dynamic-imported) |
| Analytics | PostHog — events, session replay, error tracking (optional) |
📁 Project Structure
bigset/
├── frontend/ Next.js 16 — UI + Convex schema & functions
│ ├── convex/ Convex functions, schema, authz + quota helpers
├── backend/ Fastify + Mastra — schema inference + populate agent
│ ├── src/pipeline/ Pure pipelines: schema inference + populate context
│ ├── src/mastra/ Mastra workflows, agents, and tools (Studio at :4111 in dev)
│ ├── src/email/ Transactional email (Resend) — sends "dataset ready" notifications
│ └── src/analytics/ Server-side PostHog wrapper for backend-only events
├── scripts/ One-off scripts (e.g. verify-authz.sh)
├── .env Local env for frontend, backend, Convex CLI, and Docker (not committed)
├── docker-compose.dev.yml
└── Makefile
🛣️ Roadmap
We’re building BigSet in the open. Here’s what’s coming:
- TinyFish Browser + Agent integration — For JS-heavy sites, SPAs, and pages that need interaction to reveal data.
- Agent-native API — So your agents can create, query, and consume BigSet datasets programmatically. Build datasets on the fly, export them, feed them to your agents today. Next up: agents generate and query datasets directly.
- SQL query layer — Query your datasets with SQL instead of just exporting.
- Per-cell source provenance — Click any cell to see exactly where the data came from.
- Healer agents — Automatically detect and fix broken or stale rows.
- Incremental updates — Refresh only what changed instead of rebuilding the whole dataset.
🏗 Building in Public
BigSet is a work in progress. We’re building in the open because the best ideas come from the people who actually want to use the thing.
We’d love your feedback, ideas, or help building — come say hi:
- 🐦 Twitter: @Tiny_Fish for project updates
- 🗣 Twitter: @not_simantak for the unfiltered version
- 🐛 GitHub Issues: Report bugs or request features
🤝 Contributing
Contributions are very welcome — whether it’s code, feedback, or just telling us what datasets you’d want to build.
- Fork the repo
- Create a branch (
git checkout -b my-feature) - Make your changes
- Run
bash scripts/verify-authz.shto confirm the authorization layer still holds - Open a PR
If you’re not sure where to start, open an issue or come say hi.
📄 License
TinyFish (@Tiny_Fish): What if you and your agent had all the data that always stays fresh?
Structured, on demand, never stale.
Introducing BigSet.
Describe the data you need in plain English → get a structured dataset built from the live web, that refreshes regularly.
It’s live and open-source.