Hey, we're Apify. You can build, deploy, share, and monitor any data extraction tool on the Apify platform. Check us out.
If you’re thinking of gathering many articles about one or several topics (say, the latest news on the economy) and then building a corpus from them, doing a Google search on a selected website every time is highly impractical.
A faster and more efficient way to extract content from a website for analysis and research is with a tool designed to collect and download news articles. One such tool is Smart Article Extractor, and in this article, we'll show you how to use it.
We'll break it up into two sections:
- A detailed breakdown of the features and how to configure Smart Article Extractor.
- A simple step-by-step demonstration of Smart Article Extractor downloading news articles.
How to extract text and download news articles
First, we'll take you through the configuration options of the extractor, and then we'll show you a real-world example of scraping and downloading data from a website. So, let’s go through the different options step by step.
1. Choose start URLs or article URLs
Once you've logged in to your Apify account, you can go to Smart Article Extractor and configure the scraper in Apify Console.
Begin by choosing start URLs in the **website/category URLs** input field. Article pages are detected and crawled from these, and they can be any category or subpage URL, for example, https://www.bbc.com/.
Alternatively, you can insert **article URLs** in the second input field. These are direct URLs of the articles to be extracted, for example, https://www.bbc.com/uk-62836057. No extra pages are crawled from article pages.
Use the **advanced options** to select the HTTP method used to request the URLs and the payload sent with the request. There are also headers and user data options where you can insert a JSON object.
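If you'd rather skip the UI entirely, you can also pass these inputs programmatically. Below is a minimal sketch using the apify-client package; the actor ID and the startUrls/articleUrls field names are assumptions based on the labels above, so verify them against the actor's input schema in Apify Console.

```typescript
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Actor ID and field names assumed from the option labels -- verify in Console.
const run = await client.actor('lukaskrivka/article-extractor-smart').call({
    // Category/website pages from which article links are detected and crawled...
    startUrls: [{ url: 'https://www.bbc.com/' }],
    // ...or direct article URLs (no extra pages are crawled from these):
    // articleUrls: [{ url: 'https://www.bbc.com/uk-62836057' }],
});

console.log(`Run ${run.id} finished with status ${run.status}`);
```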
2. Select optional Booleans
There are two **only new articles** options: one for small runs and a **per domain** version for using the extractor at scale. With either option, the extractor will only scrape new articles each time you run it. For small runs, scraped URLs are saved in a dataset, while the **per domain** option saves scraped articles in a dataset and compares them with new ones.
If you go with the default **only inside domain articles** option, the extractor will only scrape articles on the domain they are linked from. If the domain links to articles on a different domain, e.g., https://www.bbc.com/ vs. https://www.bbc.co.uk, the extractor will not scrape them.
The **enqueue articles from articles** option allows the scraper to extract articles linked within other articles. Otherwise, it will only scrape articles from category pages.
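Set in code, these booleans might look like the fragment below. The field names are our guesses from the option labels, not the actor's confirmed schema, so treat them as placeholders.

```typescript
// Hypothetical input fragment -- field names inferred from the option labels above.
const deduplicationOptions = {
    onlyNewArticles: false,          // skip already-scraped URLs (small runs)
    onlyNewArticlesPerDomain: true,  // per-domain deduplication (large-scale runs)
    onlyInsideArticles: true,        // default: stay on the linking domain
    enqueueFromArticles: false,      // also follow article links found inside articles
};
```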
With the **find articles in sitemaps** option, the extractor will scan the sitemaps it discovers from the initial URLs. Because this can load a vast amount of data, including old articles, it will increase the time and cost of the run. Instead, we recommend using the optional **sitemap URLs** array below.
If you’re not sure what a sitemap URL is, it points to an XML file that lists the URLs of a site. In most cases, you can find it by appending /sitemap.xml to the domain URL.
With the **sitemap URLs** option, you can provide selected sitemap URLs that include the articles you need to scrape. Let’s say you want the sitemap URL for apify.com: just insert https://apify.com/sitemap.xml.
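As a rough sketch, the two sitemap options could be combined in the input like this (again, the field names are assumptions based on the labels above):

```typescript
// Hypothetical input fragment -- field names inferred from the option labels.
const sitemapOptions = {
    findArticlesInSitemaps: false,  // scanning all sitemaps can get slow and costly
    sitemapUrls: ['https://apify.com/sitemap.xml'],  // the targeted alternative
};
```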
You can choose to save the **full HTML** of the article page, but keep in mind that this will make the data less readable. The **use Googlebot headers** option allows you to bypass protection and paywalls on some websites, but it increases your chances of getting blocked, so use it with caution.
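In the input JSON, these two switches might look like this (field names assumed, not confirmed):

```typescript
// Hypothetical input fragment -- field names inferred from the option labels.
const pageOptions = {
    saveHtml: false,             // store the full page HTML alongside the parsed text
    useGoogleBotHeaders: false,  // may bypass some paywalls, but raises the block risk
};
```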
3. Choose what articles you want to extract
The default **minimum words** value is 150, which is typically sufficient for article recognition.
You can also use the **date** option to tell the scraper to extract only articles from a specific date onward. Otherwise, it will scrape all articles. The option accepts two formats: an absolute date, YYYY-MM-DD, e.g., 2019-12-31, or a relative period, e.g., 1 week or 20 days.
The default **must have date** value tells the extractor to scrape only articles that have a publication date.
In the **is the URL an article?** option, you can input JSON settings that define which URLs the scraper should treat as articles. If any of the checks evaluates to true, the scraper will open the link and extract the article.
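Put together, the article-filtering settings might look like the sketch below. Both the field names and the shape of the URL-detection object are assumptions drawn from the option labels, so check the actor's README before relying on them.

```typescript
// Hypothetical input fragment -- field names and the isUrlArticleDefinition
// shape are assumptions; verify against the actor's input schema.
const filterOptions = {
    minWords: 150,       // articles shorter than this are skipped
    dateFrom: '1 week',  // also accepts absolute dates like '2019-12-31'
    mustHaveDate: true,  // skip pages without a publication date
    isUrlArticleDefinition: {
        minDashes: 4,              // e.g., treat URLs with 4+ dashes as articles
        linkIncludes: ['article'], // e.g., treat URLs containing 'article' as articles
    },
};
```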
4. Custom enqueuing and pseudo URLs
You can use the **pseudo URLs** function in the custom enqueuing box to include more links, such as pagination or category pages. Read more about pseudo URLs here.
Use the **link selector** option to limit which links will be enqueued. To activate this option, you need to add a CSS selector such as a.some-class.
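Here's a rough sketch of the custom enqueuing options in the input. The field names and the pseudo-URL item shape are assumptions based on common Apify crawler inputs, not this actor's confirmed schema.

```typescript
// Hypothetical input fragment -- field names and item shape assumed.
const enqueuingOptions = {
    // Enqueue pagination or category pages matching a pattern; the [.*]
    // section is standard Apify pseudo-URL syntax for a regex part.
    pseudoUrls: [{ purl: 'https://www.bbc.com/news/[.*]' }],
    // Only follow links matched by this CSS selector.
    linkSelector: 'a.some-class',
};
```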
The **max depth** input controls the depth of crawling, i.e., how many times the scraper follows links to other web pages. If you enter a number of total pages to be crawled in the **max pages per crawl** box, the extractor will stop automatically after reaching that number. The total includes the home page, pagination pages, and invalid articles.
The **max articles per crawl** option sets the maximum number of valid articles the extractor will scrape; it will stop automatically after reaching that number.
Use the **max concurrency** option to limit the speed of the scraper and avoid getting blocked.
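Together, the limits might be expressed like this (field names assumed from the option labels):

```typescript
// Hypothetical input fragment -- field names inferred from the option labels.
const limitOptions = {
    maxDepth: 2,              // how many link hops from the start URLs
    maxPagesPerCrawl: 100,    // total pages, including home, pagination, and invalid ones
    maxArticlesPerCrawl: 50,  // stop after this many valid articles
    maxConcurrency: 10,       // lower this to slow the scraper down and avoid blocking
};
```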
5. Proxy configuration
The default input is **automatic proxy**. If you want to use your own proxies, use the ProxyConfigurationOptions.proxyUrls option, and the configuration will rotate through your list of proxy URLs.
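In the input, the proxy block follows the platform's usual ProxyConfigurationOptions shape; a minimal sketch:

```typescript
// Sketch of the proxy block. useApifyProxy/proxyUrls follow the standard
// Apify ProxyConfigurationOptions shape; the proxy URL itself is a placeholder.
const proxyOptions = {
    proxyConfiguration: {
        useApifyProxy: true,
        // Or rotate your own proxies instead:
        // proxyUrls: ['http://user:pass@my.proxy.example:8000'],
    },
};
```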
6. Browser options
The **use browser (Puppeteer)** option is more expensive, but it allows the extractor to evaluate JavaScript and wait for dynamically loaded data.
The **wait on each page (ms)** value is the number of milliseconds the extractor will wait on each page before scraping the data. **Wait for selector on each page** is an optional string telling the extractor which selector to wait for on each page before scraping.
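A sketch of the browser options in the input (field names assumed from the labels above):

```typescript
// Hypothetical input fragment -- field names inferred from the option labels.
const browserOptions = {
    useBrowser: true,                   // render pages with Puppeteer (costs more)
    pageWaitMs: 2000,                   // wait 2 s on each page before extraction...
    pageWaitSelector: '.article-body',  // ...or wait for this selector to appear
};
```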
7. Extend output function
This function allows you to merge your own custom extraction with the default one. The function must return an object, which will be merged with (and can overwrite) the default output for each article.
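For illustration, an extend output function could look like the sketch below. It is passed as a string in the input; the exact signature (here assumed to receive a Cheerio handle `$`) and the selector are assumptions, so check the actor's README for the real contract.

```typescript
// Sketch of an extend output function, passed as a string in the actor input.
// The `$` parameter is assumed to be a Cheerio handle; 'h2.subtitle' is a
// placeholder selector for whatever extra field you want to capture.
const extendOutputFunction = `($) => {
    return {
        // Merged into (and may overwrite) the default output of each article.
        subtitle: $('h2.subtitle').text().trim(),
    };
}`;
```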
8. Compute units and notifications
With these options, you can tell the scraper to stop running after consuming a certain number of compute units (CUs) and to send notifications to specified email addresses when that number is reached.
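In the input, that might look like this (both field names are guesses from the option labels):

```typescript
// Hypothetical input fragment -- field names inferred from the option labels.
const costOptions = {
    stopAfterCUs: 10,                        // abort once 10 compute units are spent
    notificationEmails: ['me@example.com'],  // placeholder address to notify
};
```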
9. Options
Finally, you can use the last box of options to set the tag or number of the build you want to run (something like latest, beta, or 1.2.34), the number of seconds after which the scraper should time out (a value of zero means it will run until completion, potentially forever), and the RAM allocated to the extractor in megabytes.
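When you start a run via the API, these three settings travel as run options rather than actor input. A minimal sketch assuming the apify-client v2 call() signature (the actor ID is an assumption, as before):

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Run options are passed separately from the actor input.
const run = await client.actor('lukaskrivka/article-extractor-smart').call(
    { startUrls: [{ url: 'https://www.bbc.com/' }] },
    {
        build: 'latest', // build tag or number, e.g. 'beta' or '1.2.34'
        timeout: 0,      // seconds; zero means no timeout
        memory: 4096,    // RAM for the run, in megabytes
    },
);
```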
How to download news articles (example)
Now you've seen how Smart Article Extractor works, let's do a quick and simple step-by-step demonstration of the tool in action.
Step 1. Go to Smart Article Extractor on the Apify platform
If you still haven't signed up, don't forget to set up a free Apify account first. Then go to Smart Article Extractor in Apify Console.
Step 2. Add URLs for the articles you want to download
We’ll scrape the start URL https://theconversation.com/global. If you keep the remaining default values, all you need to do is click Save & Start (step 4).
Step 3. Choose your settings (optional)
You can configure the tool for your specific case by following the step-by-step guide we covered earlier. Here are three of the most important options to keep in mind:
- You can select the publication dates from which you want articles to be extracted
- You can extend the search to pseudo URLs. Read more about pseudo URLs here.
- You can choose the minimum word count per article (the default 150 is the recommended minimum for article recognition)
Step 4. Start downloading articles
Once you’re happy with your configuration, or if you’re using the default settings, just click the Save & Start button. The extractor will now begin collecting articles. You will see the data in the log while the tool is running, but wait until the status has changed to succeeded before you try to download the information.
Step 5. Export and download the text data
Once the article extractor has finished, click on the Storage tab to download the information in any of the available formats.
Let’s go with CSV.
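If you'd rather pull the results into your own code than download them from the Storage tab, here's a minimal sketch with apify-client (the actor ID is an assumption, as noted earlier):

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the run and wait for it to finish.
const run = await client.actor('lukaskrivka/article-extractor-smart').call({
    startUrls: [{ url: 'https://theconversation.com/global' }],
});

// Fetch the scraped articles from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Downloaded ${items.length} articles`);
```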
Start downloading articles
We barely scratched the surface with that example, so we suggest you get started with your own text-scraping tasks and enjoy discovering what else Smart Article Extractor can do. If you have any trouble, you can reach out to us, and we'll be happy to help.
You'll also find a broad range of web scrapers already configured for extracting news articles from specific websites in Apify Store, such as New York Times Scraper, Washington Post Scraper, Google News Scraper, and more.