My client didn't want to add FAQs manually, so I built a system that crawls their website and generates the knowledge base automatically

Reddit r/artificial 06/14/26, 11:12 AM Tools

web-crawler faq-generation ai-agent knowledge-base beautifulsoup automation embeddings

Summary

Describes building a web crawler that extracts content from hotel websites, uses an AI agent to generate structured FAQs, and stores them in a vector database for automatic knowledge base creation.

Building on a hotel email AI system I shipped recently (500 properties, ~15k emails/day). The client had a requirement that turned into the most interesting part of the build.

They did not want to manually add FAQs to the database for every hotel they onboard. With 500+ properties and new ones being added regularly, hand-entering FAQs would be a full-time job by itself. So they asked for two things: feed the system a hotel's website URL OR a PDF, and have it automatically extract all the relevant information and generate the FAQ knowledge base.

Here's how the website crawler works:

It starts at the hotel's URL and hits their sitemap first to discover pages. It maintains a set of visited URLs so it never crawls the same page twice. It caps at 50 pages because most of the useful information lives in the first few pages. Crawling the entire site adds hours of processing time for almost no extra value.

The junk filtering was important. The crawler skips paths like booking, reserve, login, careers, legal, checkout, cart, admin. These pages have no FAQ-relevant content. It only follows links that look like they lead to useful info (amenities, FAQs, policies, etc).

For content extraction it uses BeautifulSoup and strips out script, style, nav, footer, and header elements before grabbing the text. The footer and nav are pure noise that would pollute the knowledge base if included.

It crawls deeper by following relevant internal links from the first page, so it captures subsequent pages like /amenities or /faq, not just the landing page.

Here's the part that makes it actually useful:

After crawling and cleaning the content, it doesn't just dump raw website text into the vector database. A separate AI agent reads the cleaned content and generates structured FAQs from it, as question and answer pairs. Then those get embedded and stored.

So the flow is: website URL → crawl relevant pages → clean the content → AI generates FAQs from content → embed and store.
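To make the crawl rules above concrete, here is a minimal sketch of the visited set, the 50-page cap, and the junk-path filter. It is not the code from the build: `normalize`, `is_crawlable`, `crawl`, and the `get_links` callback are hypothetical names, the substring matching rule is an assumption, and the sitemap lookup and the "looks useful" link check are left out. Fetching and link extraction are passed in as a function so the traversal logic runs with only the standard library.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Path fragments the post lists as junk. Substring matching is an assumption.
SKIP_PARTS = ("booking", "reserve", "login", "careers", "legal",
              "checkout", "cart", "admin")
MAX_PAGES = 50  # page cap mentioned in the post


def normalize(base_url, href):
    """Resolve a link against its page, then drop the #fragment and trailing slash."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute.rstrip("/")


def is_crawlable(url, root_host):
    """Keep same-host http(s) links whose path contains no junk fragment."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or parts.netloc != root_host:
        return False
    path = parts.path.lower()
    return not any(skip in path for skip in SKIP_PARTS)


def crawl(start_url, get_links, max_pages=MAX_PAGES):
    """Breadth-first crawl. get_links(url) returns the hrefs found on that page."""
    root_host = urlparse(start_url).netloc
    start = normalize(start_url, "")
    visited = {start}          # every URL already queued or crawled
    queue = deque([start])
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for href in get_links(url):
            link = normalize(url, href)
            if link not in visited and is_crawlable(link, root_host):
                visited.add(link)
                queue.append(link)
    return crawled
```

In a real run, `get_links` would fetch the page and return its anchor hrefs (for example with BeautifulSoup, as in the post). Plain substring matching is blunt: "cart" would also skip a path like /a-la-carte, so matching on whole path segments may be the safer choice.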
The client just pastes a URL and the entire knowledge base builds itself. When the same URL gets crawled again, the old data for that hotel gets deleted and replaced with fresh data, so re-crawling updates the knowledge base instead of duplicating it.

The system prompt for the FAQ generation agent was the most critical piece. I gave it explicit rules, guardrails, and 11 worked examples. Garbage in, garbage out. If the FAQ generation hallucinates wrong information (like a wrong price or a wrong policy), it could cost the client real money and trust. I've seen reports of AI agents quoting customers wrong prices because of sloppy system prompts.

I recorded a full walkthrough of how I built the crawler and FAQ generation if anyone wants to see the actual code: [here](https://www.youtube.com/watch?v=G3g8q_oPx0Q)

Happy to answer questions about the crawling or FAQ generation approach.
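For anyone wondering what the delete-and-replace behavior on re-crawl can look like, here is a rough sketch using an in-memory list in place of the vector database. `Faq`, `FaqStore`, `replace_hotel`, and the `hotel_id` key are hypothetical names for illustration, not the actual schema or any real database client, and the embedding step is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Faq:
    question: str
    answer: str


@dataclass
class FaqStore:
    """In-memory stand-in for the vector database (hypothetical, not a real client)."""
    rows: list = field(default_factory=list)  # (hotel_id, Faq) pairs

    def replace_hotel(self, hotel_id, faqs):
        """Delete every row for this hotel, then insert the fresh FAQs."""
        self.rows = [(h, f) for h, f in self.rows if h != hotel_id]
        self.rows.extend((hotel_id, f) for f in faqs)

    def for_hotel(self, hotel_id):
        """Return the FAQs currently stored for one hotel."""
        return [f for h, f in self.rows if h == hotel_id]


store = FaqStore()
store.replace_hotel("hotel-a", [Faq("Is parking free?", "Yes."),
                                Faq("Is there a pool?", "Yes.")])
store.replace_hotel("hotel-b", [Faq("Are pets allowed?", "No.")])

# Re-crawling hotel-a replaces its rows and leaves hotel-b untouched.
store.replace_hotel("hotel-a", [Faq("Is parking free?", "No, it is paid.")])
```

With a real database, it is probably worth doing the delete and insert in one transaction (or inserting the new rows before deleting the old ones) so a failed re-crawl does not leave a hotel with an empty knowledge base.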


