Why do I need a tool to download articles?
If you’re thinking of gathering many articles about one or several topics (say, the latest news on the economy) and then building a corpus from them, simply doing a Google search on a selected website every time is highly impractical.
A faster and more efficient way to extract content from a website for analysis and research is to use a tool designed to collect and download news articles. One such tool is Smart Article Extractor. Extracting data from the web usually requires technical skills, but with this article download tool, you only need to enter the URLs of the websites you want to search and click a few buttons. The result is a dataset of news articles that you can then export and download in multiple formats.
Is extracting news articles legal?
It is perfectly legal to extract publicly available texts from the web, but remember that many of them are protected by copyright law. That means you should not publish articles you have collected without prior permission. If you’re simply collecting data for research and citations for a dissertation, you won’t have any problems, but make sure you don’t republish intellectual property without consent.
How to download news articles
Step 1. Go to Smart Article Extractor on the Apify platform
Go to the Smart Article Extractor page and click Try for free. If you don’t already have an Apify account, you’ll be redirected to sign up for free. You don’t need a credit card, so you can quickly sign up with your email address, Google, or GitHub, without worrying about being charged anything if you simply want to try it out.
Step 2. Add URLs for the articles you want to download
You can add more than one URL per run by clicking on the +Add button. If this is your first time using Smart Article Extractor, you can just use the default settings to test the scraper quickly. Once you feel more confident, you can configure it to your requirements with the many options provided. In our example, we will extract and download articles from the Economist and the New York Times.
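If you would rather prepare the list of URLs in a script than click +Add repeatedly, the idea can be sketched in Python. This is only an illustration: the `startUrls` field name and the list-of-objects shape are assumptions about the extractor's JSON input, and `build_start_urls` is a hypothetical helper, so check the tool's input tab before relying on it.

```python
from urllib.parse import urlparse


def build_start_urls(urls):
    """Validate and de-duplicate URLs, then wrap them as a run input dict.

    The "startUrls" key is an assumed field name, not confirmed by this tutorial.
    """
    seen = set()
    start_urls = []
    for raw in urls:
        url = raw.strip()
        parsed = urlparse(url)
        # Only accept absolute http(s) URLs with a host name.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {raw!r}")
        if url not in seen:
            seen.add(url)
            start_urls.append({"url": url})
    return {"startUrls": start_urls}


run_input = build_start_urls([
    "https://www.economist.com/",
    "https://www.nytimes.com/",
])
```

Rejecting malformed URLs up front is a design choice for this sketch: it is cheaper to catch a typo before a run starts than to discover an empty dataset afterwards.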
Step 3. Choose your settings (optional)
If you’re using the default settings, you can go straight to step 4. If you want to configure the tool for your specific case, there are plenty of options to choose from. Here are three of the most important options to keep in mind:
- You can select the publication dates from which you want articles to be extracted
- You can extend the search to pseudo URLs (e.g., https://www.bbc.com/ to crawl articles from a variety of URLs on bbc.com)
- You can choose the minimum word count per article (the default 150 is the recommended minimum for article recognition)
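To make the three options above more concrete, here is a minimal Python sketch of what each one does conceptually. It is not the extractor's own code. The pseudo URL `https://www.bbc.com/[.*]` is an illustrative example that follows the Apify convention of putting a regular expression inside square brackets, and the `text` and `date` keys are hypothetical field names chosen for this example.

```python
import re
from datetime import date


def pseudo_url_to_regex(pseudo_url: str) -> re.Pattern:
    """Turn a pseudo URL such as 'https://www.bbc.com/[.*]' into a regex.

    Text inside [...] is treated as a regular expression; everything else
    is matched literally. Nested square brackets are not handled here.
    """
    # With one capture group, re.split alternates: literal, regex, literal, ...
    parts = re.split(r"\[(.*?)\]", pseudo_url)
    pattern = "".join(
        f"(?:{part})" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    return re.compile(f"^{pattern}$")


def keep_article(article: dict, date_from: date, min_words: int = 150) -> bool:
    """Apply the publication-date and minimum-word-count filters.

    'text' and 'date' (ISO format, YYYY-MM-DD) are hypothetical field names.
    """
    word_count = len(article["text"].split())
    published = date.fromisoformat(article["date"])
    return word_count >= min_words and published >= date_from


bbc_pages = pseudo_url_to_regex("https://www.bbc.com/[.*]")
is_bbc_page = bool(bbc_pages.match("https://www.bbc.com/news/business"))
```

The default of 150 words mirrors the recommended minimum mentioned in the list; lowering it would let shorter pages through at the cost of more non-article results.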
Step 4. Click Start to begin extracting news articles
Once you’re happy with your configuration, or if you’re using the default settings, just click the Save & Start button. The extractor will now begin collecting articles. You will see the data in the log while the tool is running, but wait until the status has changed to succeeded before you try to download the information.
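The same "start, then wait for succeeded" flow can be sketched in Python for readers who prefer a script to the web console. This sketch assumes the third-party `apify-client` package is installed and that the extractor's actor ID is `lukaskrivka/article-extractor-smart`; both are assumptions to verify against your own Apify account. `wait_for_run` is a hypothetical helper written for this example, not part of any library.

```python
import time
from typing import Callable

# Assumed actor ID; confirm it on the extractor's page before use.
ACTOR_ID = "lukaskrivka/article-extractor-smart"

# Statuses after which a run will not change any more.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def wait_for_run(get_status: Callable[[], str],
                 poll_seconds: float = 5.0,
                 timeout_seconds: float = 3600.0) -> str:
    """Poll get_status() until the run reaches a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status} after {timeout_seconds} s")
        time.sleep(poll_seconds)


def start_and_wait(token: str, run_input: dict) -> dict:
    """Start the extractor and return the finished run's details."""
    # Imported lazily so the helper above works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).start(run_input=run_input)
    status = wait_for_run(lambda: client.run(run["id"]).get()["status"])
    if status != "SUCCEEDED":
        raise RuntimeError(f"run finished with status {status}")
    return client.run(run["id"]).get()
```

Passing the status check in as a function keeps the polling logic separate from the network call, so it can be tried out with a fake status sequence before spending any platform credits.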
Step 5. Export and download the article data
Once the article extractor has finished, click on the Storage tab to download the information in any of the available formats. Here’s a sample of the dataset from this tutorial in Excel.
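If you download the dataset as JSON, converting it to other formats yourself is straightforward. The following Python sketch uses only the standard library; the field names (`url`, `title`, `date`, `text`) are hypothetical and should be replaced with the columns that actually appear in your dataset.

```python
import csv
import io
import json

# Hypothetical column names; adjust to match your exported dataset.
DEFAULT_FIELDS = ("url", "title", "date", "text")


def articles_to_csv(items, fields=DEFAULT_FIELDS) -> str:
    """Serialize a list of article dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells; extra fields are dropped.
        writer.writerow({field: item.get(field, "") for field in fields})
    return buffer.getvalue()


def articles_to_json(items) -> str:
    """Serialize the same items to pretty-printed JSON, keeping non-ASCII text."""
    return json.dumps(items, ensure_ascii=False, indent=2)


sample = [
    {"url": "https://example.com/a", "title": "First", "date": "2023-05-01",
     "text": "Body, with a comma", "extra": "ignored in CSV"},
    {"url": "https://example.com/b", "title": "Second", "text": "No date here"},
]
csv_text = articles_to_csv(sample)
json_text = articles_to_json(sample)
```

To save the result, write `csv_text` to a file opened with `newline=""` and an explicit encoding such as UTF-8, which avoids doubled line endings on Windows.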
Congratulations! You’ve just extracted many news articles you can download and use for your project. This makes it easy to identify relevant articles, cite texts for your books or essays, and keep track of your source materials. So, how about trying it again on your own? Just choose some URLs you want to search, select your preferred settings, and collect and download more online news articles.