How Šéfbot turned complex municipal websites into reliable chatbot knowledge bases

Šéfbot cut their chatbot dataset size by 70%, from 400 MB to 120 MB, while dramatically improving data quality. The key was gaining complete control over what content gets scraped from municipal websites using Website Content Crawler.

Situation

Šéfbot develops tailored AI chatbots across multiple industries, including automotive, B2B, e-commerce, municipalities, real estate, skilled trades, and healthcare. Their collaboration with Apify specifically focused on municipalities, which present unique challenges due to poorly structured websites. They needed a solution that could automatically transform disorganized content into clean, structured knowledge bases for their AI systems.

Šéfbot initiated a collaboration to use Website Content Crawler - Apify's web scraping tool that automatically extracts and cleans content from websites, filtering out irrelevant elements like navigation bars and ads to deliver structured, AI-ready data. They used it to extract content from their clients' websites, especially the large and unstructured ones, and feed that data into LLMs powering their chatbots.
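
In practice, that hand-off can be driven entirely through Apify's API. The sketch below uses the apify-client Python package to start a Website Content Crawler run and read the cleaned pages back from its dataset; the input and output field names follow the Actor's public schema, but the start URL and token are placeholders, and exact fields should be checked against the current documentation.

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")  # placeholder token

# Start a Website Content Crawler run and wait for it to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://www.example-municipality.cz/"}],  # hypothetical site
        "saveMarkdown": True,   # clean, LLM-friendly output
        "maxCrawlPages": 500,   # keep the crawl bounded
    }
)

# Each dataset item holds the cleaned content of one page, ready to be
# chunked, embedded, and loaded into a chatbot knowledge base.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], len(item.get("markdown") or item.get("text", "")))
```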

Although Šéfbot was already using another scraping platform, complex municipal websites called for a more flexible tool, one that would make the entire process of managing their knowledge bases more automated and keep it under their control.

With the previous solution, the dataset was 400 MB; with Apify, it's 120 MB. Now we can be sure the data doesn't contain any unwanted photos, videos, or other irrelevant content that shouldn't be there.

-- Radek Bacík, Conversation Expert at Šéfbot


Like any new tool, it took some getting used to. But Apify's team provided hands-on onboarding support and free consultation sessions to help Šéfbot get everything configured just right.

Problem

Šéfbot knew how to build great chatbots, but getting them reliable information to work with was the real challenge. Municipal websites seemed designed to hide important details - permits buried in sub-menus, office hours scattered across different pages, and current information mixed with content from years ago.

Their previous scraping solution lacked the precision they needed. Without data preview capabilities or automated scheduling, Šéfbot couldn't control what content was being collected. This led to bloated datasets filled with irrelevant information, and when chatbots provided incorrect responses, identifying and fixing the underlying data issues became a complex, time-intensive process.

Core issues they faced:

  • Underestimated scale and complexity: Some websites require more manual intervention than initially anticipated. Municipal websites are typically unstructured, cluttered, and not maintained with AI consumption in mind.
  • Poor content organization: Important information is often hidden in obscure places or scattered across subpages.
  • Limited automation and control: The previous scraping solution was inefficient because it lacked key features such as scheduling and data preview, which made it difficult to control what was collected and keep the knowledge base updated.
  • Risk of outdated responses: Without automatic ingestion from live websites, bots risk serving stale information.
  • Frequent site updates: Constant changes make it hard to keep data fresh, and full re-scrapes are time-consuming.
  • Difficulty with PDFs: Scraping and integrating PDF content with the rest of the website data proved challenging.

Solution

Apify’s Website Content Crawler provided a way forward. Instead of hoping the correct data would be collected, Šéfbot could finally decide what would be crawled, how it should be cleaned, and when it should be updated. By combining precise configuration with automation, they could control the quality and the freshness of their chatbot knowledge bases.

With Apify, you have much more control over what content is collected and how it’s used. The old solution lacked a data validation step; we could only confirm data quality during AI testing.

-- Radek Bacík, Conversation Expert at Šéfbot


Now Šéfbot had the tools to:

  • Precisely define the scope of scraping by combining Starting URLs, Include/Exclude globs, and URL patterns, so only relevant sections of the site are crawled (a configuration sketch follows this list).
  • Easily exclude irrelevant content areas like navbars, footers, modals, and cookie banners by targeting them via CSS selectors.
  • Use the Adaptive crawler to handle dynamically generated content and modern front-end frameworks.
  • Create tasks and schedules to organize and automate scraping flows across multiple sites.
  • Leverage clear logging and run history to debug problems and track changes across runs.
  • Output scraped data in Markdown or HTML format, with options to download raw HTML and PDF versions.
  • Enable proxy usage to bypass geo-blocking or anti-scraping protections when necessary.
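
A configuration sketch of how those capabilities map onto the Actor's input is shown below. The field names are taken from the Website Content Crawler input schema, while the URLs, globs, and CSS selectors are hypothetical examples rather than Šéfbot's actual settings; verify names and allowed values against the current schema before relying on them.

```python
# Hypothetical Website Content Crawler input; swap in real URLs and selectors.
run_input = {
    # Scope the crawl precisely: start points plus include/exclude globs.
    "startUrls": [{"url": "https://www.example-town.cz/"}],
    "includeUrlGlobs": [{"glob": "https://www.example-town.cz/residents/**"}],
    "excludeUrlGlobs": [{"glob": "https://www.example-town.cz/photo-gallery/**"}],

    # Strip irrelevant page areas before the text ever reaches the LLM.
    "removeElementsCssSelector": "nav, footer, .cookie-banner, .modal",

    # Let the adaptive crawler decide when a headless browser is needed
    # for JavaScript-heavy pages.
    "crawlerType": "playwright:adaptive",

    # Keep both Markdown and raw HTML versions of each page.
    "saveMarkdown": True,
    "saveHtml": True,

    # Route requests through Apify Proxy when a site blocks scrapers.
    "proxyConfiguration": {"useApifyProxy": True},
}
```

Tasks and schedules then wrap an input like this so the same crawl reruns automatically; those are typically set up in the Apify Console or via the API rather than inside the input itself.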

Result

Šéfbot now has full control over their data pipeline, with the ability to preview, filter, and structure content before feeding it into LLMs. The crawler delivers cleaner, higher-quality outputs on a regular schedule, so the knowledge base stays fresh and reliable.

  • Full control: Šéfbot can now fine-tune what data is crawled and how it’s processed.
  • Improved content quality: Outputs are cleaned and structured for LLM consumption.
  • Automation-friendly: The crawler runs on a regular schedule, feeding fresh content directly into vector storage (that hand-off is sketched after this list).
  • Transparent pipeline: Unlike the previous solution’s workflows, Apify’s crawler provides visibility into what was collected.
  • Cost-effective and fast: Runs quickly even on complex websites, at a predictable cost.
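
The hand-off into vector storage can look roughly like the sketch below, assuming the apify-client Python package; embed_and_upsert is a hypothetical stand-in for whatever chunking, embedding, and vector-database steps a given chatbot stack uses, not part of Šéfbot's or Apify's tooling.

```python
from apify_client import ApifyClient


def embed_and_upsert(source_url: str, content: str) -> None:
    """Hypothetical placeholder for the chunk -> embed -> vector-store step."""
    print(f"Would index {len(content)} characters from {source_url}")


client = ApifyClient("<APIFY_API_TOKEN>")  # placeholder token

# Pull the most recent successful run of the scheduled crawl and walk its
# dataset of cleaned pages.
last_run = client.actor("apify/website-content-crawler").last_run(status="SUCCEEDED")
for item in last_run.dataset().iterate_items():
    text = item.get("markdown") or item.get("text", "")
    embed_and_upsert(source_url=item["url"], content=text)
```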

What really stood out to me is how flexible the crawler setup is. Apify was continuously incorporating our feedback, adding new filters and improvements as we needed them. That kind of adaptability made a big difference.

-- Radek Bacík, Conversation Expert at Šéfbot


Value benchmark

| Metric | Previous solution | Apify Website Content Crawler |
| --- | --- | --- |
| Content control | Limited: no data preview, no fine-grained filtering | Full inclusion/exclusion rules, CSS-based filtering, and full preview |
| Data output | Structured output, less transparent | Markdown, HTML, text, JSON, with preview & export options |
| Structured cleanup | Minimal | Automated cleanup (e.g., navbars, footers, modals removed) |
| Automation & scheduling | Limited scheduling of scraping jobs | Full scheduling and task orchestration built-in |
| Debugging and transparency | Limited visibility | Clear logging, run history, and full transparency |
| Crawl performance | Basic scraping, less adaptable | Adaptive crawler for JS-heavy and dynamic sites |
