Situation
Šéfbot develops tailored AI chatbots across multiple industries, including automotive, B2B, e-commerce, municipalities, real estate, skilled trades, and healthcare. Their collaboration with Apify specifically focused on municipalities, which present unique challenges due to poorly structured websites. They needed a solution that could automatically transform disorganized content into clean, structured knowledge bases for their AI systems.
Šéfbot initiated a collaboration to use Website Content Crawler - Apify's web scraping tool that automatically extracts and cleans content from websites, filtering out irrelevant elements like navigation bars and ads to deliver structured, AI-ready data. They used it to extract content from their clients' websites, especially the large and unstructured ones, and feed that data into LLMs powering their chatbots.
Although Šéfbot was already using another scraping platform, it needed a more flexible tool for complex municipal websites, one that would make the whole process of managing its knowledge bases more automated and easier to control.
With the previous solution, the dataset was 400 MB; with Apify, it's 120 MB. Now we can be sure the data doesn't contain any unwanted photos, videos, or other irrelevant content that shouldn't be there.
-- Radek Bacík, Conversation Expert at Šéfbot
Like any new tool, it took some getting used to. But Apify's team provided hands-on onboarding support and free consultation sessions to help Šéfbot get everything configured just right.
Problem
Šéfbot knew how to build great chatbots, but getting them reliable information to work with was the real challenge. Municipal websites seemed designed to hide important details - permits buried in sub-menus, office hours scattered across different pages, and current information mixed with content from years ago.
Their previous scraping solution lacked the precision they needed. Without data preview capabilities or automated scheduling, Šéfbot couldn't control what content was being collected. This led to bloated datasets filled with irrelevant information, and when chatbots provided incorrect responses, identifying and fixing the underlying data issues became a complex, time-intensive process.
Core issues they faced:
- Underestimated scale and complexity: Some websites require more manual intervention than initially anticipated. Municipal websites are typically unstructured, cluttered, and not maintained with AI consumption in mind.
- Poor content organization: Important information is often hidden in obscure places or scattered across subpages.
- Limited automation and control: The previous scraping solution was inefficient because it lacked key features such as scheduling and data preview, which made it difficult to control what was collected and keep the knowledge base updated.
- Risk of outdated responses: Without automatic ingestion from live websites, bots risk serving stale information.
- Frequent site updates: Constant changes make it hard to keep data fresh, and full re-scrapes are time-consuming.
- Difficulty with PDFs: Scraping and integrating PDF content with the rest of the website data proved challenging.
Solution
Apify’s Website Content Crawler provided a way forward. Instead of hoping the correct data would be collected, Šéfbot could finally decide what would be crawled, how it should be cleaned, and when it should be updated. By combining precise configuration with automation, they could control the quality and the freshness of their chatbot knowledge bases.
With Apify, you have much more control over what content is collected and how it’s used. The old solution lacked a data validation step; we could only confirm data quality during AI testing.
-- Radek Bacík, Conversation Expert at Šéfbot
Now Šéfbot had the tools to:
- Precisely define the scope of scraping by combining Starting URLs, Include/Exclude globs, and URL patterns, so only relevant sections of the site are crawled (see the configuration sketch after this list).
- Easily exclude irrelevant content areas like navbars, footers, modals, and cookie banners by targeting them via CSS selectors.
- Use the Adaptive crawler to handle dynamically generated content and modern front-end frameworks.
- Create tasks and schedules to organize and automate scraping flows across multiple sites.
- Leverage clear logging and run history to debug problems and track changes across runs.
- Output scraped data in Markdown or HTML format, with options to download raw HTML and PDF versions.
- Enable proxy usage to bypass geo-blocking or anti-scraping protections when necessary.
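To make the list above concrete, here is a minimal sketch of how a crawl along these lines could be configured and started through Apify's Python API client. The actor ID is real, but the start URL, globs, and CSS selectors are illustrative placeholders, and the input field names follow the Website Content Crawler's public input schema rather than Šéfbot's actual configuration, so check them against the current actor documentation before reuse.

```python
from apify_client import ApifyClient

# Authenticate with an Apify API token (placeholder value).
client = ApifyClient("<APIFY_API_TOKEN>")

# Illustrative input for apify/website-content-crawler; the domain, globs,
# and selectors below are hypothetical examples, not Šéfbot's real setup.
run_input = {
    "startUrls": [{"url": "https://www.example-municipality.cz/"}],
    # Limit the crawl to relevant sections of the site.
    "includeUrlGlobs": [{"glob": "https://www.example-municipality.cz/urad/**"}],
    "excludeUrlGlobs": [{"glob": "https://www.example-municipality.cz/archiv/**"}],
    # Strip boilerplate page elements before the text is extracted.
    "removeElementsCssSelector": "nav, footer, .cookie-banner, .modal",
    # Let the adaptive crawler switch between plain HTTP and a headless browser.
    "crawlerType": "playwright:adaptive",
    # Output formats for the knowledge base.
    "saveMarkdown": True,
    "saveHtml": False,
    # Route requests through Apify Proxy when a site blocks scrapers.
    "proxyConfiguration": {"useApifyProxy": True},
}

# Start the crawl and wait for it to finish.
run = client.actor("apify/website-content-crawler").call(run_input=run_input)

# Iterate over the cleaned pages stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], len(item.get("markdown", "")))
```

In practice, a configuration like this would be saved as a reusable task and attached to a schedule, so the same crawl can be repeated without re-entering the settings.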
Result
Šéfbot now has full control over their data pipeline, with the ability to preview, filter, and structure content before feeding it into LLMs. The crawler now delivers cleaner, higher-quality outputs on a regular schedule, so the knowledge base stays fresh and reliable.
- Full control: Šéfbot can now fine-tune what data is crawled and how it’s processed.
- Improved content quality: Outputs are cleaned and structured for LLM consumption.
- Automation-friendly: The crawler runs on a regular schedule, feeding fresh content directly into vector storage (a sketch of this refresh step follows the list).
- Transparent pipeline: Unlike the previous solution’s workflows, Apify’s crawler provides visibility into what was collected.
- Cost-effective and fast: Runs quickly even on complex websites, at a predictable cost.
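As an illustration of the automation point above, the following sketch shows what the consuming side of a scheduled refresh could look like: a saved crawler task is run via the Python client and the resulting dataset items are handed to an indexing step. The task ID and the index_page() helper are hypothetical placeholders; the actual chunking, embedding, and vector-storage logic depends on the chatbot stack and is not described in this case study.

```python
from apify_client import ApifyClient

# Hypothetical identifiers; the saved task would wrap a Website Content Crawler
# configuration like the one sketched earlier and be triggered on a schedule
# (via Apify Schedules or an external cron job).
APIFY_TOKEN = "<APIFY_API_TOKEN>"
TASK_ID = "<SAVED_TASK_ID>"

client = ApifyClient(APIFY_TOKEN)


def refresh_knowledge_base() -> None:
    """Run the saved crawl task and hand the cleaned pages to vector storage."""
    run = client.task(TASK_ID).call()  # waits for the crawl to finish
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        index_page(url=item["url"], markdown=item.get("markdown", ""))


def index_page(url: str, markdown: str) -> None:
    # Placeholder: chunk the Markdown, embed it, and upsert it into whatever
    # vector store backs the chatbot. This is not Šéfbot's actual pipeline.
    print(f"indexing {url} ({len(markdown)} characters)")


if __name__ == "__main__":
    refresh_knowledge_base()
```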
What really stood out to me is how flexible the crawler setup is. Apify was continuously incorporating our feedback, adding new filters and improvements as we needed them. That kind of adaptability made a big difference.
-- Radek Bacík, Conversation Expert at Šéfbot
Value benchmark
| Metric | Previous solution | Apify Website Content Crawler |
|---|---|---|
| Content control | Limited: no data preview, no fine-grained filtering | Full inclusion/exclusion rules, CSS-based filtering, and full preview |
| Data output | Structured output, less transparent | Markdown, HTML, text, JSON, with preview & export options |
| Structured cleanup | Minimal | Automated cleanup (e.g., navbars, footers, modals removed) |
| Automation & scheduling | Limited scheduling of scraping jobs | Full scheduling and task orchestration built in |
| Debugging and transparency | Limited visibility | Clear logging, run history, and full transparency |
| Crawl performance | Basic scraping, less adaptable | Adaptive crawler for JS-heavy and dynamic sites |