Why do I need a tool to download articles?
If you’re thinking of gathering many articles about one or several topics (say, the latest news on the economy) and then building a corpus from them, simply doing a Google search on a selected website every time is highly impractical.
A faster and more efficient way to extract content from a website for analysis and research is with a tool designed to collect and download news articles. One such tool is Smart Article Extractor, and in this article, we'll show you how to use it.
It is perfectly legal to extract publicly available texts from the web, but remember that many of them are protected by copyright law. That means you should not publish articles you have collected without prior permission. If you’re simply web scraping for research and citations for a dissertation, you won’t have any problems, but make sure you don’t republish intellectual property without consent.
Extract and download news articles
Collect articles from any scientific, academic, or news website with just one click.
We’ll show you how to scrape text from news articles and download the data with Smart Article Extractor.
First, we'll take you through the extractor's configuration options step by step, and then we'll show you a real-world example of scraping and downloading data from a website.
1. Choose start URLs or article URLs
You can configure the scraper by choosing start URLs in the website/category URLs input field. These can be any category or subpage URL, for example, https://www.bbc.com/. Article pages are detected and crawled from them.
Alternatively, you can insert article URLs in the second input field. These are direct URLs for the articles to be extracted, for example, https://www.bbc.com/news/uk-62836057. No extra pages are crawled from article pages.
Use the advanced options to select the HTTP method used to request the URLs and the payload sent with the request. You can also insert JSON objects in the headers and user data fields.
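To make this concrete, here's a minimal sketch of how these two fields might look in the actor's JSON input. The field names (startUrls, articleUrls) are assumptions modeled on the UI labels, so check the actor's input schema to confirm them:

```js
// A minimal input sketch for section 1 (field names assumed from the UI labels).
const input = {
    // Category or homepage URLs - article links are detected and crawled from these
    startUrls: [{ url: 'https://www.bbc.com/' }],
    // Direct article URLs - scraped as-is, with no extra crawling
    articleUrls: [{ url: 'https://www.bbc.com/news/uk-62836057' }],
};
```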
2. Select optional Booleans
There are two only new articles options: a basic one for small runs and a saved per domain variant for using the extractor at scale. With either option, the extractor will only scrape articles it hasn't seen in previous runs. The basic option saves scraped URLs in a dataset, while the per domain variant saves scraped articles in a dataset and compares new ones against them.
If you go with the default only inside domain articles option, the extractor will only scrape articles on the domain from which they are linked. If the domain links to articles on different domains, e.g., https://www.bbc.com/ vs. https://www.bbc.co.uk, the extractor will not scrape them.
The enqueue articles from articles option allows the scraper to extract articles linked within articles. Otherwise, it will only scrape articles from category pages.
With the find articles in sitemaps option, the extractor will scan the sitemaps it finds for the initial URL. Because this can load a vast amount of data, including old articles, it increases the time and cost of the run, so we recommend using the optional sitemap URLs array below instead.
If you're not sure what a sitemap is, it's an XML file that lists the URLs of a site. On many sites, you can find it by appending /sitemap.xml to the domain URL.
With the sitemap URLs option, you can provide selected sitemap URLs that include the articles you need to scrape. Let’s say you want the sitemap URL for apify.com. Just insert https://apify.com/sitemap.xml.
You can choose to save the full HTML of the article page, but keep in mind that this will make the data less readable. The use Googlebot headers option allows you to bypass protection and paywalls on some websites, but this increases your chances of getting blocked, so use it with caution.
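Here's a hedged sketch of how these Boolean switches might look together in the input. All names below are assumptions modeled on the UI labels, so verify them against the actor's input schema:

```js
// Boolean options sketch (names assumed from the UI labels).
const input = {
    onlyNewArticles: true,            // skip articles scraped in previous runs (small runs)
    onlyNewArticlesPerDomain: false,  // large-scale variant, saved per domain
    onlyInsideArticles: true,         // stay on the domain the article is linked from
    enqueueArticles: true,            // follow article links found inside articles
    scanSitemaps: false,              // find articles via sitemaps (can get slow and costly)
    saveHtml: false,                  // store the full page HTML with each article
    useGoogleBotHeaders: false,       // may bypass some paywalls, but raises blocking risk
};
```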
3. Choose what articles you want to extract
The default minimum word value is 150. This is typically sufficient for article recognition.
You can also use the date option to restrict extraction to articles from a specific date; otherwise, the extractor will scrape all articles. The option accepts two formats: an absolute date in YYYY-MM-DD form, e.g., 2019-12-31, or a relative value, e.g., 1 week or 20 days.
The default must have date value tells the extractor to scrape only articles with a publication date.
In the is the URL an article? option, you can input JSON settings that define which URLs the scraper should consider articles. If any of the conditions evaluates to true, the scraper will open the link and extract the article.
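Putting this subsection together, a minimal input sketch might look like the one below. The isUrlArticleDefinition shape is illustrative only, and all field names are assumptions to check against the actor's input schema:

```js
// Article-recognition options sketch (field names and shapes assumed).
const input = {
    minWords: 150,           // minimum word count for a page to count as an article
    dateFrom: '2019-12-31',  // absolute date, or a relative value such as '1 week'
    mustHaveDate: true,      // skip pages without a detectable publication date
    isUrlArticleDefinition: {
        minDashes: 4,                           // article slugs tend to contain many dashes
        linkIncludes: ['/article/', '/news/'],  // substrings that mark a URL as an article
    },
};
```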
4. Custom enqueuing and pseudo URLs
You can use the pseudo URLs function in the custom enqueuing box to include more links, such as pagination or category pages. You can read more about pseudo URLs in the Apify documentation.
Use the link selector option to limit which links are enqueued. It takes a CSS selector, e.g., a.some-class, and only links matching it will be followed.
The max depth input sets the depth of crawling, i.e., how many link hops from the start URLs the scraper will follow. If you input a number of total pages to be crawled in the max pages per crawl box, the extractor will stop automatically after reaching that number. The total includes the home page, pagination pages, and invalid articles.
The max articles per crawl option sets the maximum number of valid articles the extractor will scrape; it stops automatically after reaching that number.
Use the max concurrency option to limit the speed of the scraper to avoid getting blocked.
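As a quick recap of this section, here's a sketch of the enqueuing and limit options side by side. Again, the names (including the pseudo URL object shape) are assumptions to verify against the actor's input schema:

```js
// Enqueuing and limit options sketch (names assumed from the UI labels).
const input = {
    pseudoUrls: [{ purl: 'https://www.bbc.com/news/[.+]' }],  // the bracketed part is a regex
    linkSelector: 'a.some-class',  // only enqueue links matching this CSS selector
    maxDepth: 2,                   // how many link hops to follow from the start URLs
    maxPagesPerCrawl: 500,         // hard stop on total pages, incl. pagination and invalid articles
    maxArticlesPerCrawl: 200,      // hard stop on valid articles saved
    maxConcurrency: 10,            // throttle parallel requests to avoid getting blocked
};
```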
5. Proxy configuration
The default input is automatic proxy. If you want to use your own proxies, use the ProxyConfigurationOptions.proxyUrls option, and the configuration will rotate your list of proxy URLs.
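For example, the proxy settings typically live under a proxyConfiguration object. A minimal sketch, assuming the standard Apify proxy fields:

```js
// Proxy configuration sketch using the standard Apify fields.
const input = {
    proxyConfiguration: {
        useApifyProxy: true,  // default: let Apify Proxy rotate IPs automatically
        // Or supply your own proxies instead; the list is rotated for you:
        // proxyUrls: ['http://user:pass@myproxy.example.com:8000'],
    },
};
```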
6. Browser options
The use browser (Puppeteer) option is more expensive, but it allows you to evaluate JavaScript and wait for dynamically loaded data.
The wait on each page (ms) value is the number of milliseconds the extractor will wait on each page before scraping data. The wait for selector on each page option is an optional string telling the extractor which CSS selector to wait for on each page before scraping the data.
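A short sketch of the browser options for JavaScript-heavy pages, with names assumed from the UI labels (the selector itself is a hypothetical example):

```js
// Browser options sketch (names assumed from the UI labels).
const input = {
    useBrowser: true,                   // render pages with Puppeteer (slower and costlier)
    pageWaitMs: 2000,                   // wait two seconds on each page before extraction
    pageWaitSelector: '.article-body',  // or wait until this (hypothetical) selector appears
};
```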
7. Extend output function
This function allows you to merge your custom extraction with the default one. It must return an object, which will be merged with (and can overwrite) the default output for each article.
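Here's a minimal sketch of such a function, assuming the extractor passes it a Cheerio handle ($), as is common for Apify extractors; the custom field name is hypothetical:

```js
// Extend output function sketch: the returned object is merged into the default output.
const extendOutputFunction = ($) => {
    return {
        // Hypothetical custom field: pull the author from a meta tag, if present
        customAuthor: $('meta[name="author"]').attr('content') || null,
    };
};
```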
8. Compute units and notifications
With the above options, you can tell the scraper to stop running after consuming a certain number of compute units (CUs) and to send notifications to specified email addresses when that number is reached.
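Hypothetically, those guardrails might look like this in the input; both field names below are assumptions, so check the actor's input schema before relying on them:

```js
// Compute-unit guardrails sketch (both field names are assumptions).
const input = {
    stopAfterCUs: 5,                          // stop the run once it has consumed 5 CUs
    notificationEmails: ['you@example.com'],  // where to send the notification
};
```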
9. Options
Finally, you can use the last box of options to set the tag or number of the build you want to run (e.g., latest, beta, or 1.2.34), the timeout in seconds after which the run should abort (a zero value means no timeout, so the run continues until it finishes), and the RAM allocated to the extractor in megabytes.
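If you prefer the API to the Console, the same run options can be passed programmatically. Here's a minimal sketch using the apify-client package; the actor ID and the exact option names are assumptions, so double-check them against the client docs:

```js
// Sketch: starting the extractor with explicit run options via apify-client.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

const run = await client.actor('lukaskrivka/article-extractor-smart').call(
    { startUrls: [{ url: 'https://www.bbc.com/' }] },
    {
        build: 'latest',  // tag or number of the build to run
        timeout: 300,     // seconds; zero means no timeout
        memory: 1024,     // RAM allocated to the run, in megabytes
    },
);
console.log(`Run finished with status: ${run.status}`);
```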
How to download news articles (example)
Now that you've seen how Smart Article Extractor works, let's do a quick and simple step-by-step demonstration of the tool in action.
Step 1. Go to Smart Article Extractor on the Apify platform
First, go to Smart Article Extractor on the Apify platform and click Try for free.
You’ll be redirected to sign up first if you don’t have an account (you don't need a credit card, and there's no time limit on your free subscription). Otherwise, you can get started right away.
Step 2. Add URLs for the articles you want to download
We’ll scrape the start URL https://theconversation.com/global. If you keep the remaining default values, all you need to do is click Save & Start (step 4).
Step 3. Choose your settings (optional)
You can configure the tool for your specific case by following the step-by-step guide we covered earlier. Here are two of the most important options to keep in mind:
You can select the publication dates from which you want articles to be extracted
You can choose the minimum word count per article (the default 150 is the recommended minimum for article recognition)
Step 4. Start downloading articles
Once you’re happy with your configuration, or if you’re using the default settings, just click the Save & Start button. The extractor will now begin collecting articles. You will see the data in the log while the tool is running, but wait until the status has changed to succeeded before you try to download the information.
Step 5. Export and download the text data
Once the article extractor has finished, click on the Storage tab to download the information in any of the available formats.
Let’s go with CSV. Here's the dataset for this run:
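Prefer to grab the data programmatically? The Apify API's dataset items endpoint accepts a format parameter, so a short Node.js (18+) sketch like this can save the CSV directly. Replace DATASET_ID with the ID from your run's Storage tab, and append &token=... for private datasets:

```js
// Sketch: downloading the run's dataset as CSV via the Apify API.
import { writeFile } from 'node:fs/promises';

const datasetId = 'DATASET_ID'; // from the run's Storage tab
const res = await fetch(`https://api.apify.com/v2/datasets/${datasetId}/items?format=csv`);
const csv = await res.text();
await writeFile('articles.csv', csv);
console.log('Saved articles.csv');
```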
Start downloading articles
We barely scratched the surface with that example, so we suggest you get started with your own text-scraping tasks and enjoy discovering what else Smart Article Extractor can do. If you run into any trouble, you can reach out to us, and we’ll be happy to help.
You'll also find a broad range of web scrapers already configured for extracting news articles from specific websites in Apify Store, such as New York Times Scraper, Washington Post Scraper, Google News Scraper, and more.
Download news articles from the NYT
If you want a simple, straightforward option just for extracting and downloading articles from the New York Times, this scraper was designed exactly for that.