Built a Fetch API that returns page labels, not just markdown

Reddit r/AI_Agents Products

Summary

The author introduces a Fetch API for RAG and web ingestion that returns page labels (dead link, content category, page structure) to help filter low-value pages before indexing. They seek feedback on what additional fields would be useful.

I'm working on a Fetch API for RAG, agents, and web ingestion workflows. Think Firecrawl/Jina Reader-style URL-to-markdown or clean-text API, but with one extra signal layer: page labels for content category and page structure.

The pain point: fetching is only the first step. You still need to decide whether a page is useful, relevant, and worth sending into indexing, embedding, or an LLM pipeline.

Examples of labels we return:

- dead link / main content missing → skip low-value pages early
- homepage / index page vs content page → avoid mixing navigation/listing pages with real content
- content category → keep vertical pipelines from indexing out-of-scope pages, e.g. a finance workflow pulling in random entertainment/forum pages

Our category labels cover broad areas like Finance, Health, News, Ecommerce, Education, Jobs, Travel, and more.

A couple of open questions:

- If you've already built filtering logic on top of a fetch API (skipping listing pages, filtering by topic, dropping dead links), I'm curious what that looks like in your pipeline. Does moving this upstream actually save work, or just add a layer you'd rather control yourself?
- Beyond category and page structure, what other fields or labels would actually be useful in a fetch API response? Author, publish date, sentiment, product pricing, freshness signals...? Curious what's missing from current fetch tools for your pipeline.

Happy to share access if you want to try it. New signups get $5 credit, around 5k pages.
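To make the label idea concrete, here is a minimal sketch of how a pipeline might use the labels described above to decide what gets indexed. The response shape is an assumption for illustration only: `FetchResult`, its field names (`dead_link`, `main_content_missing`, `page_structure`, `category`), and the structure values `"homepage"`, `"index"`, and `"content"` are hypothetical and not the API's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class FetchResult:
    """Hypothetical shape of one fetch response; field names are assumed."""
    url: str
    markdown: str
    dead_link: bool
    main_content_missing: bool
    page_structure: str  # assumed values: "content", "homepage", "index"
    category: str        # e.g. "Finance", "Health", "News"


# Structures treated as navigation/listing pages rather than real content.
NON_CONTENT_STRUCTURES = {"homepage", "index"}


def skip_reason(page: FetchResult, allowed_categories: Set[str]) -> Optional[str]:
    """Return why a page should be skipped, or None if it should be indexed."""
    if page.dead_link:
        return "dead link"
    if page.main_content_missing:
        return "main content missing"
    if page.page_structure in NON_CONTENT_STRUCTURES:
        return f"{page.page_structure} page"
    if page.category not in allowed_categories:
        return f"out-of-scope category: {page.category}"
    return None


def filter_for_indexing(
    pages: List[FetchResult], allowed_categories: Set[str]
) -> Tuple[List[FetchResult], List[Tuple[str, str]]]:
    """Split pages into those to index and (url, reason) pairs for the rest."""
    kept: List[FetchResult] = []
    skipped: List[Tuple[str, str]] = []
    for page in pages:
        reason = skip_reason(page, allowed_categories)
        if reason is None:
            kept.append(page)
        else:
            skipped.append((page.url, reason))
    return kept, skipped
```

In a finance workflow this might be called as `filter_for_indexing(pages, {"Finance"})`, with only the kept pages passed on to embedding. Recording the skip reason is a design choice that makes it easier to audit what the filter dropped.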
