The definitive guide to text scraping

Text scraping and computational text mining and analysis methods are becoming increasingly popular among students, scholars, and researchers in multiple academic fields. We show you how to extract text data from websites with Smart Article Extractor.


What is text scraping?

Text scraping is a use case of web scraping, an automated method of extracting data from websites. If you’re extracting text or article URLs from the web, then text scraping is what you’re doing. The process often begins with web crawling, an automated method of discovering web pages by starting from a list of URLs and following the links found on them, so that the pages can be queued for extraction.

Web scraping has become a popular method of text data extraction among students, scholars, and researchers. Web scraping software recognizes types of content on a website and can be configured to crawl and scrape types of content specified by the user. For example, if you wish to extract article URLs, titles, or authors from a news website, text scraping radically reduces the time you’d spend searching for sources.

Smart Article Extractor extracts articles from any scientific, academic, or news website. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as JSON, HTML table, Excel, RSS feed, and more.

What is text mining?

Text mining, also known as text data mining, is the next step after text scraping. It is closely related to text analytics: both involve deriving high-quality information from existing text data, which lets you discover previously unknown information in online resources. While text analysis grew out of manual analysis in the humanities, the terms text mining and text analysis are now used largely interchangeably. Both refer to computational methods that rely on automated web crawling and scraping to search, retrieve, and analyze text data.

How can you scrape text from a URL?

A great tool to scrape text from a URL is Smart Article Extractor. This scraper crawls a whole website and automatically distinguishes articles from other web pages. You can configure it for your purposes, and extract information from news websites based on the publication date, word count, pseudo URLs, and more.

Scraping data that is publicly available online is generally legal, since it merely automates a task that a human could do manually. Just make sure you don’t collect sensitive information such as personal data or copyrighted content, which are protected by various international regulations.

⚖️
To learn more about the legal side of web scraping, read our blog post on the subject ➜

Step-by-step guide to scraping text from a website

We’ll show you how to scrape text from a website with Smart Article Extractor. You can test the scraper by using the default inputs. The default setting is configured this way:

{
    "enqueueFromArticles": false,
    "extendOutputFunction": "($) => {\n    const result = {};\n    // Uncomment to add a title to the output\n    // result.pageTitle = $('title').text().trim();\n\n    return result;\n}",
    "isUrlArticleDefinition": {
        "minDashes": 4,
        "hasDate": true,
        "linkIncludes": [
            "article",
            "storyid",
            "?p=",
            "id=",
            "/fpss/track",
            ".html",
            "/content/"
        ]
    },
    "mustHaveDate": true,
    "onlyInsideArticles": true,
    "onlyNewArticles": false,
    "onlyNewArticlesPerDomain": false,
    "proxyConfiguration": {
        "useApifyProxy": true
    }
}

If you want to explore the other options, let’s go through them step by step:

1. Choose start URLs or article URLs

You can configure the scraper by entering start URLs in the website/category URLs input field. These can be any category or subpage URL, for example, https://www.bbc.com/. The extractor detects the article pages linked from them and crawls those.

Text scraping with start URLs or article URLs

Alternatively, you can insert article URLs in the second input field. These are direct URLs for the articles to be extracted, for example, https://www.bbc.com/uk-62836057. No extra pages are crawled from article pages.

Use the advanced options to select the HTTP method to request the URLs and the payload sent with the HTTP request. You also have header and data user options where you can insert a JSON object.

Text scraping HTTP request

2. Select optional Booleans

Text scraping booleans

There are two only new articles options: a basic one for small runs and a saved per domain one for using the extractor on a large scale. With either option, the extractor will only scrape articles it has not scraped in previous runs. For small runs, scraped URLs are saved in a dataset, while the per domain option saves scraped articles in a dataset and compares newly found articles against them.
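
The idea behind both options can be sketched in a few lines of JavaScript. This is only an illustration, not the extractor's actual implementation: `filterNewArticles` is a hypothetical helper, and an in-memory `Set` stands in for the saved dataset.

```javascript
// Hypothetical sketch: keep only URLs that were not scraped in earlier runs.
// A real run would load `seenUrls` from persistent storage; a Set stands in here.
function filterNewArticles(candidateUrls, seenUrls) {
    const fresh = [];
    for (const url of candidateUrls) {
        if (!seenUrls.has(url)) {
            fresh.push(url);
            seenUrls.add(url); // remember it for the next run
        }
    }
    return fresh;
}

const seen = new Set(['https://example.com/news/old-story']);
const newOnes = filterNewArticles(
    ['https://example.com/news/old-story', 'https://example.com/news/new-story'],
    seen,
);
console.log(newOnes); // [ 'https://example.com/news/new-story' ]
```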

If you go with the default only inside domain articles option, the extractor will only scrape articles on the domain from where they are linked. If the domain presents links to articles on different domains, e.g., https://www.bbc.com/ vs. https://www.bbc.co.uk, the extractor will not scrape them.
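
A minimal sketch of such a domain check, assuming that "same domain" means an exact hostname match (the extractor may compare domains differently); `isSameDomain` is a hypothetical helper:

```javascript
// Hypothetical sketch: treat two URLs as "same domain" when their hostnames match exactly.
function isSameDomain(linkUrl, pageUrl) {
    return new URL(linkUrl).hostname === new URL(pageUrl).hostname;
}

console.log(isSameDomain('https://www.bbc.com/news', 'https://www.bbc.com/')); // true
console.log(isSameDomain('https://www.bbc.co.uk/news', 'https://www.bbc.com/')); // false
```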

The enqueue articles from articles option allows the scraper to extract articles linked within articles. Otherwise, it will only scrape articles from category pages.

With the find articles in sitemaps option, the extractor will scan the sitemaps it discovers from the initial article URL. Because this can load a vast amount of data, including old articles, it increases the time and cost of the run. Instead, we recommend using the optional sitemap URLs array below.

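If you’re not sure what a sitemap is, it's an XML file that lists the URLs for a site. Many sites publish it at /sitemap.xml, so you can usually get the sitemap URL by appending /sitemap.xml to the domain URL. If that doesn't work, check the site's robots.txt file, which often lists its sitemaps.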

With the sitemap URLs option on Smart Article Extractor, you can provide selected sitemap URLs that include the articles you need to scrape. Let’s say you want the sitemap URL for apify.com. Just insert https://apify.com/sitemap.xml.
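
If you need to build sitemap URLs for several sites, the same convention can be sketched in JavaScript. `sitemapUrlFor` is a hypothetical helper, and it assumes the site serves its sitemap at /sitemap.xml, which is common but not guaranteed.

```javascript
// Build the conventional sitemap location for a site.
// Assumption: the site serves its sitemap at /sitemap.xml.
function sitemapUrlFor(siteUrl) {
    // An absolute path replaces any existing path and query string.
    return new URL('/sitemap.xml', siteUrl).href;
}

console.log(sitemapUrlFor('https://apify.com')); // https://apify.com/sitemap.xml
console.log(sitemapUrlFor('https://apify.com/store?search=news')); // https://apify.com/sitemap.xml
```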

Smart Article Extractor Sitemap URLs

You can choose to save the full HTML of the article page, but keep in mind that this will make the data less readable. The use Googlebot headers option allows you to bypass protection and paywalls on some websites, but this increases your chances of getting blocked, so use it with caution.

3. Choose what articles you want to extract

Choose what articles to extract to scrape text from a website

The default minimum word value is 150. This is typically sufficient for article recognition.

You can also use the date option to tell the scraper to extract only articles published from a specific day onward. If you leave it empty, it will scrape all articles. The option accepts two formats: an absolute date as YYYY-MM-DD, e.g., 2019-12-31, or a relative period, e.g., 1 week or 20 days.
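
One way to read the two formats is as a cutoff date. The sketch below is an assumption about how such a value could be interpreted, not the extractor's own parser; `parseDateFrom` is a hypothetical helper that handles only days and weeks.

```javascript
// Hypothetical sketch: turn a date filter such as "2019-12-31", "1 week" or "20 days"
// into a cutoff Date. Units other than days and weeks are not handled here.
function parseDateFrom(value, now = new Date()) {
    const trimmed = value.trim();

    // Absolute form: YYYY-MM-DD, interpreted as midnight UTC.
    if (/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) {
        return new Date(`${trimmed}T00:00:00.000Z`);
    }

    // Relative form: "<number> day(s)" or "<number> week(s)" before `now`.
    const match = /^(\d+)\s*(day|week)s?$/i.exec(trimmed);
    if (!match) {
        throw new Error(`Unsupported date filter: ${value}`);
    }
    const amount = Number(match[1]);
    const daysPerUnit = match[2].toLowerCase() === 'week' ? 7 : 1;
    const msPerDay = 24 * 60 * 60 * 1000;
    return new Date(now.getTime() - amount * daysPerUnit * msPerDay);
}

const referenceNow = new Date('2022-09-19T00:00:00.000Z');
console.log(parseDateFrom('2019-12-31').toISOString()); // 2019-12-31T00:00:00.000Z
console.log(parseDateFrom('1 week', referenceNow).toISOString()); // 2022-09-12T00:00:00.000Z
console.log(parseDateFrom('20 days', referenceNow).toISOString()); // 2022-08-30T00:00:00.000Z
```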

The default must have date value lets the extractor know that it should only scrape articles with publication dates.

In the is the URL an article? option, you can enter JSON settings that define which URLs the scraper should treat as articles. If a URL satisfies any of the conditions, the scraper will open the link and extract the article.
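
To make the "any condition" rule concrete, here is a rough sketch. The field names (`minDashes`, `hasDate`, `linkIncludes`) come from the default input shown earlier, but the matching logic and the `isUrlArticle` helper are assumptions for illustration, not the extractor's real code.

```javascript
// Hypothetical sketch of an "is the URL an article?" check.
// Field names follow the default input; the matching logic is assumed.
function isUrlArticle(url, definition) {
    const { minDashes, hasDate, linkIncludes = [] } = definition;
    const path = new URL(url).pathname;

    // Condition 1: the path contains at least `minDashes` dashes (long, slug-like titles).
    const dashCount = (path.match(/-/g) || []).length;
    const enoughDashes = typeof minDashes === 'number' && dashCount >= minDashes;

    // Condition 2: the path contains a date such as /2022/09/19/ or 2022-09-19.
    const containsDate = hasDate === true && /\d{4}[\/-]\d{1,2}[\/-]\d{1,2}/.test(path);

    // Condition 3: the full URL contains one of the configured substrings.
    const includesKeyword = linkIncludes.some((part) => url.includes(part));

    // Any single satisfied condition marks the URL as an article.
    return enoughDashes || containsDate || includesKeyword;
}

const articleDefinition = { minDashes: 4, hasDate: true, linkIncludes: ['article', '.html', '/content/'] };
console.log(isUrlArticle('https://example.com/news/electric-planes-are-coming-soon', articleDefinition)); // true
console.log(isUrlArticle('https://example.com/2022/09/19/story', articleDefinition)); // true
console.log(isUrlArticle('https://example.com/about', articleDefinition)); // false
```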

4. Custom enqueuing and pseudo URLs

You can use the pseudo URLs function in the custom enqueuing box to include more links like pagination or categories. Read more about pseudo URLs here.

Scrape text from websites with pseudo URLs

Use the link selector option to limit which link elements are enqueued. To activate this option, enter a CSS selector such as a.some-class.

The max depth input is for the depth of crawling, i.e., how many times the scraper picks up a link to other web pages. If you input a number of total pages to be crawled in the max pages per crawl box, the extractor will stop automatically after reaching that number. The maximum number of total pages crawled includes the home page, pagination pages, and invalid articles.

The max articles per crawl option sets the maximum number of valid articles to scrape; the extractor will stop automatically after reaching that number.
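
How the three limits interact is easier to see on a toy example. The sketch below runs a breadth-first crawl over an in-memory link graph instead of real pages; the `crawl` function, the `site` object, and the exact stopping rules are assumptions for illustration.

```javascript
// Hypothetical sketch: a breadth-first crawl over an in-memory link graph that honors
// three limits. `pages` maps a URL to { isArticle, links }; no network requests are made.
function crawl(pages, startUrl, { maxDepth, maxPagesPerCrawl, maxArticlesPerCrawl }) {
    const queue = [{ url: startUrl, depth: 0 }];
    const visited = new Set([startUrl]);
    const articles = [];
    let pagesCrawled = 0;

    while (queue.length > 0) {
        // Stop as soon as either budget is used up.
        if (pagesCrawled >= maxPagesPerCrawl || articles.length >= maxArticlesPerCrawl) break;

        const { url, depth } = queue.shift();
        const page = pages[url];
        if (!page) continue; // unknown URL in this toy graph

        pagesCrawled += 1; // every opened page counts, article or not
        if (page.isArticle) articles.push(url);

        // Only follow links while we are below the depth limit.
        if (depth >= maxDepth) continue;
        for (const link of page.links) {
            if (!visited.has(link)) {
                visited.add(link);
                queue.push({ url: link, depth: depth + 1 });
            }
        }
    }
    return { pagesCrawled, articles };
}

const site = {
    '/': { isArticle: false, links: ['/world', '/a-1'] },
    '/world': { isArticle: false, links: ['/a-2', '/a-3'] },
    '/a-1': { isArticle: true, links: ['/a-4'] },
    '/a-2': { isArticle: true, links: [] },
    '/a-3': { isArticle: true, links: [] },
    '/a-4': { isArticle: true, links: [] },
};

console.log(crawl(site, '/', { maxDepth: 1, maxPagesPerCrawl: 10, maxArticlesPerCrawl: 10 }));
// { pagesCrawled: 3, articles: [ '/a-1' ] }
console.log(crawl(site, '/', { maxDepth: 2, maxPagesPerCrawl: 10, maxArticlesPerCrawl: 2 }));
// { pagesCrawled: 4, articles: [ '/a-1', '/a-2' ] }
```

In the first run the depth limit keeps the crawl from following links on /world and /a-1; in the second, the article limit ends the crawl before the remaining queued pages are opened.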

Use the max concurrency option to limit the speed of the scraper to avoid getting blocked.

5. Proxy configuration

Use a proxy for text scraping

The default input is automatic proxy. If you want to use your own proxies, use the ProxyConfigurationOptions.proxyUrls option, and the configuration will rotate your list of proxy URLs.

6. Browser options

Use browser (Puppeteer) for text scraping

The use browser (Puppeteer) option is more expensive, but it allows you to evaluate JavaScript and wait for dynamically loaded data.

The wait on each page (ms) value is the number of milliseconds the extractor will wait on each page before scraping data. Wait for selector on each page is an optional string that tells the extractor which selector to wait for on each page before scraping the data.

7. Extend output function

Extend output function for text scraping

This function allows you to merge your custom extraction with the default one. It must return an object. For each article, that object is merged into the default output, and any field with the same name as a default field overwrites it.
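
As an illustration, the snippet below pairs an extend output function in the same shape as the default input with a hypothetical `mergeOutput` helper. It assumes `$` behaves like a Cheerio-style selector function and that the merge is a shallow one; a small stand-in for `$` lets the example run without loading a page.

```javascript
// An extend output function in the shape used by the default input: it receives `$`
// (assumed here to be a Cheerio-style selector function) and returns a plain object.
const extendOutputFunction = ($) => {
    const result = {};
    result.pageTitle = $('title').text().trim();
    return result;
};

// Hypothetical sketch of the merge step: custom fields are added to the default output,
// and a custom field with the same name as a default one replaces it.
function mergeOutput(defaultOutput, customOutput) {
    return { ...defaultOutput, ...customOutput };
}

// A tiny stand-in for `$`, so the example runs without loading a real page.
const fake$ = () => ({ text: () => '  Example page title  ' });

const merged = mergeOutput(
    { url: 'https://example.com/story', title: 'Example' },
    extendOutputFunction(fake$),
);
console.log(merged.pageTitle); // Example page title
console.log(mergeOutput({ title: 'Default' }, { title: 'Custom' }).title); // Custom
```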

8. Compute units and notifications

Choose compute units and notifications options for text scraping

With the above options you can command the scraper to stop running after reaching a certain number of compute units, and to send notifications to specified email addresses when the number of CUs is reached.

9. Options

Build, timeout, and memory options for text scraping

Finally, the last box of options lets you set the tag or number of the build you want to run (for example latest, beta, or 1.2.34), the number of seconds after which the scraper should time out (zero means it will run until it completes, with no time limit), and the RAM allocated to the extractor in megabytes.

Scraping text from a website with Smart Article Extractor (example)

Let’s do a basic text scraping run to demonstrate Smart Article Extractor. First, go to Smart Article Extractor on the Apify platform and click Try for free.

Scrape text from websites and try Smart Article Extractor for free

You’ll be redirected to sign up first if you don’t have an account (you don't need a credit card and there's no time limit on your free subscription). Otherwise, you can get started right away.

Scrape URLs to extract text data

We’ll scrape the start URL https://theconversation.com/global. We’ll keep the remaining default values. So, all we need to do is click Start.

Text scraping run with Smart Article Extractor

Now that the extractor has finished, we can view and download the data in multiple formats by clicking on the Storage tab.

Scrape and download articles from websites in JSON and other formats
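
If you would rather fetch the results from a script than from the Storage tab, the download link can be built programmatically. This sketch assumes the Apify API's dataset items endpoint and its `format` query parameter; `datasetItemsUrl` is a hypothetical helper, `YOUR_DATASET_ID` is a placeholder, and any authentication a private dataset may need is left out.

```javascript
// Build the Apify API URL for downloading a dataset's items in a given format.
// `datasetId` is a placeholder you would copy from the run's Storage tab.
function datasetItemsUrl(datasetId, format = 'json') {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    return url.href;
}

console.log(datasetItemsUrl('YOUR_DATASET_ID'));
// https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=json
console.log(datasetItemsUrl('YOUR_DATASET_ID', 'rss'));
// https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=rss
```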

Let’s go with JSON. Here's the data for the first of the 45 results we got:

{
  "url": "https://theconversation.com/electric-planes-are-coming-short-hop-regional-flights-could-be-running-on-batteries-in-a-few-years-190098",
  "loadedUrl": "https://theconversation.com/electric-planes-are-coming-short-hop-regional-flights-could-be-running-on-batteries-in-a-few-years-190098",
  "loadedDomain": "theconversation.com",
  "title": "Short-hop regional flights could be running on batteries in a few years",
  "softTitle": "Electric planes are coming: Short-hop regional flights could be running on batteries in a few years",
  "date": "2022-09-19T12:21:03.000Z",
  "author": [
    "Gökçin Çınar"
  ],
  "publisher": "The Conversation",
  "copyright": "2010–2022",
  "favicon": "https://cdn.theconversation.com/static/tc/@theconversation/ui/dist/esm/logos/favicon-cdcdc0dd51ffe5238483c3f27fd2eb57.ico",
  "description": "Air Canada and United Airlines both have orders for hybrid electric 30-seaters. An aerospace engineer explains where electrification, hydrogen and sustainable aviation fuels are headed.",
  "lang": "en",
  "canonicalLink": "https://theconversation.com/electric-planes-are-coming-short-hop-regional-flights-could-be-running-on-batteries-in-a-few-years-190098",
  "tags": [],
  "image": "https://images.theconversation.com/files/484695/original/file-20220914-9158-ybu2z4.jpg?ixlib=rb-1.1.0&rect=0%2C528%2C5043%2C2521&q=45&auto=format&w=1356&h=668&fit=crop",
  "videos": [
    {
      "height": "400",
      "width": "100%"
    },
    {
      "src": "https://datawrapper.dwcdn.net/5mb3z/6/",
      "height": "400px",
      "width": "100%"
    }
  ],
  "links": [
    {
      "text": "quietly buzzing around Europe",
      "href": "https://investor.textron.com/news/news-releases/press-release-details/2022/Textron-Completes-Acquisition-of-Pipistrel/default.aspx"
    },
    {
      "text": "electric sea planes",
      "href": "https://harbourair.com/harbour-air-and-magnix-announce-successful-flight-of-worlds-first-commercial-electric-airplane/"
    },
    {
      "text": "Air Canada",
      "href": "http://heartaerospace.com/heart-aerospace-unveils-new-airplane-design-confirms-air-canada-and-saab-as-new-shareholders/"
    },
    {
      "text": "first hybrid electric 50- to 70-seat",
      "href": "https://www.nrel.gov/docs/fy22osti/80220.pdf"
    },
    {
      "text": "could be ready",
      "href": "https://www.electricaviationgroup.com/electric-flight/"
    },
    {
      "text": "three to five times more",
      "href": "https://www.nrel.gov/docs/fy22osti/80220.pdf"
    },
    {
      "text": "Gökçin Çınar",
      "href": "https://scholar.google.com/citations?user=KIbLE10AAAAJ&hl=en"
    },
    {
      "text": "electric alternative",
      "href": "https://www.mdpi.com/2071-1050/14/10/5880"
    },
    {
      "text": "cut fuel use by about 10%",
      "href": "https://arc.aiaa.org/doi/10.2514/1.C036919"
    },
    {
      "text": "make more use of regional airports",
      "href": "https://sacd.larc.nasa.gov/sacd/wp-content/uploads/sites/167/2021/04/2021-04-20-RAM.pdf"
    },
    {
      "text": "corn, oilseeds",
      "href": "https://www.energy.gov/eere/bioenergy/2016-billion-ton-report"
    },
    {
      "text": "algae",
      "href": "https://biomassmagazine.com/articles/18484/honeywell-technology-enables-jet-flights-with-saf-from-algal-oil"
    },
    {
      "text": "by around 80%",
      "href": "https://www.iata.org/en/programs/environment/sustainable-aviation-fuels/"
    },
    {
      "text": "route planning",
      "href": "https://theconversation.com/why-the-aviation-industry-must-look-beyond-carbon-to-get-serious-about-climate-change-186947"
    },
    {
      "text": "green hydrogen",
      "href": "https://www.energy.gov/eere/fuelcells/hydrogen-production-electrolysis"
    },
    {
      "text": "still takes up more space",
      "href": "https://www.iata.org/contentassets/d13875e9ed784f75bac90f000760e998/fact_sheet7-hydrogen-fact-sheet_072020.pdf"
    },
    {
      "text": "aiming to have mature technology by 2025",
      "href": "https://www.airbus.com/en/innovation/zero-emission/hydrogen/zeroe"
    },
    {
      "text": "testing a 34-seat, hydrogen-electric airplane",
      "href": "https://australianaviation.com.au/2022/07/rex-to-trial-electric-planes-on-short-routes-in-2024/"
    },
    {
      "text": "International Civil Aviation Organization",
      "href": "https://www.icao.int/about-icao/Pages/default.aspx"
    },
    {
      "text": "cut net carbon dioxide emissions 50%",
      "href": "https://www.icao.int/Meetings/2022-ICAO-LTAG-GLADS/Pages/default.aspx"
    }
  ],
  "text": "Electric planes might seem futuristic, but they aren’t that far off, at least for short hops.\n\nTwo-seater Velis Electros are already quietly buzzing around Europe, electric sea planes are being tested in British Columbia, and larger planes are coming. Air Canada announced on Sept. 15, 2022, that it would buy 30 electric-hybrid regional aircraft from Sweden’s Heart Aerospace, which expects to have its 30-seat plane in service by 2028. Analysts at the U. S. National Renewable Energy Lab note that the first hybrid electric 50- to 70-seat commuter plane could be ready not long after that. In the 2030s, they say, electric aviation could really take off.\n\nThat matters for managing climate change. About 3% of global emissions come from aviation today, and with more passengers and flights expected as the population expands, aviation could be producing three to five times more carbon dioxide emissions by 2050 than it did before the COVID-19 pandemic.\n\nAerospace engineer and assistant professor Gökçin Çınar develops sustainable aviation concepts, including hybrid-electric planes and hydrogen fuel alternatives, at the University of Michigan. We asked her about the key ways to cut aviation emissions today and where technologies like electrification and hydrogen are headed.\n\nAircraft are some of the most complex vehicles out there, but the biggest problem for electrifying them is the battery weight.\n\nIf you tried to fully electrify a 737 with today’s batteries, you would have to take out all the passengers and cargo and fill that space with batteries just to fly for under an hour.\n\nJet fuel can hold about 50 times more energy compared to batteries per unit mass. So, you can have 1 pound of jet fuel or 50 pounds of batteries. To close that gap, we need to either make lithium-ion batteries lighter or develop new batteries that hold more energy. 
New batteries are being developed, but they aren’t yet ready for aircraft.\n\nEven though we might not be able to fully electrify a 737, we can get some fuel burn benefits from batteries in the larger jets by using hybrid propulsion systems. We are trying to make that happen in the short term, with a 2030-2035 target for smaller regional planes. The less fuel burned during flight, the fewer greenhouse gas emissions.\n\nHybrid electric aircraft are similar to hybrid electric cars in that they use a combination of batteries and aviation fuels. The problem is that no other industry has the weight limitations that we do in the aerospace industry.\n\nThat’s why we have to be very smart about how and how much we are hybridizing the propulsion system.\n\nUsing batteries as a power assist during takeoff and climb are very promising options. Taxiing to the runway using just electric power could also save a significant amount of fuel and reduce the local emissions at airports. There is a sweet spot between the added weight of the battery and how much electricity you can use to get net fuel benefits. This optimization problem is at the center of my research.\n\nHybrids would still burn fuel during flight, but it could be considerably less than just relying entirely on jet fuel.\n\nI see hybridization as a mid-term option for larger jets, but a near-term solution for regional aircraft.\n\nFor 2030 to 2035, we’re focused on hybrid turboprops, typically regional aircraft with 50-80 passengers or used for freight. These hybrids could cut fuel use by about 10%.\n\nWith electric hybrids, airlines could also make more use of regional airports, reducing congestion and time larger planes spend idling on the runway.\n\nShorter term we’ll see more use of sustainable aviation fuels, or SAF. With today’s engines, you can dump sustainable aviation fuel into the same fuel tank and burn it. 
Fuels made from corn, oilseeds, algae and other fats are already being used.\n\nSustainable aviation fuels can reduce an aircraft’s net carbon dioxide emissions by around 80%, but supply is limited, and using more biomass for fuel could compete with food production and lead to deforestation.\n\nA second option is using synthetic sustainable aviation fuels, which involves capturing carbon from the air or other industrial processes and synthesizing it with hydrogen. But that’s a complex and costly process and does not have a high production scale yet.\n\nAirlines can also optimize their operations in the short term, such as route planning to avoid flying nearly empty planes. That can also reduce emissions.\n\nHydrogen fuel has been around a very long time, and when it’s green hydrogen – produced with water and electrolysis powered by renewable energy – it doesn’t produce carbon dioxide. It can also hold more energy per unit of mass than batteries.\n\nThere are two ways to use hydrogen in an airplane: either in place of regular jet fuel in an engine, or combined with oxygen to power hydrogen fuel cells, which then generate electricity to power the aircraft.\n\nThe problem is volume – hydrogen gas takes up a lot of space. That’s why engineers are looking at methods like keeping it very cool so it can be stored as liquid until it’s burned as a gas. It still takes up more space than jet fuel, and the storage tanks are heavy, so how to store, handle or distribute it on aircraft is still being worked out.\n\nAirbus is doing a lot of research on hydrogen combustion using modified gas turbine engines with an A380 platform, and aiming to have mature technology by 2025. 
Australia’s Rex airline expects to start testing a 34-seat, hydrogen-electric airplane for short hops in the next few years.\n\nDue to the variety of options, I see hydrogen as one of the key technologies for sustainable aviation.\n\nThe problem with aviation emissions isn’t their current levels – it’s the fear that their emissions will increase rapidly as demand increases. By 2050, we could see three to five times more carbon dioxide emissions from aviation than before the pandemic.\n\nThe International Civil Aviation Organization, a United Nations agency, generally defines the industry’s goals, looking at what’s feasible and how aviation can push the boundaries.\n\nIts long-term goal is to cut net carbon dioxide emissions 50% by 2050 compared with 2005 levels. Getting there will require a mix of different technologies and optimization. I don’t know if we’re going to be able to reach it by 2050, but I believe we must do everything we can to make future aviation environmentally sustainable."
}
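
Once you have the JSON, a first text mining step can be as simple as counting words and tallying where an article links to. The sketch below uses the field names from the result above (`title`, `author`, `text`, `links`); `summarizeArticle` is a hypothetical helper, and the sample item is a shortened stand-in for a real result.

```javascript
// Hypothetical sketch: derive a few simple statistics from one extracted article.
// Field names (title, author, text, links) follow the sample output above.
function summarizeArticle(item) {
    // Count whitespace-separated tokens as words.
    const wordCount = item.text.split(/\s+/).filter(Boolean).length;

    // Tally outbound links per hostname.
    const linkDomains = {};
    for (const { href } of item.links) {
        const host = new URL(href).hostname;
        linkDomains[host] = (linkDomains[host] || 0) + 1;
    }

    return { title: item.title, authors: item.author, wordCount, linkDomains };
}

// A shortened item with the same shape as the result above.
const sample = {
    title: 'Short-hop regional flights could be running on batteries in a few years',
    author: ['Gökçin Çınar'],
    text: 'Electric planes might seem futuristic, but they aren’t that far off.',
    links: [
        { text: 'first hybrid electric 50- to 70-seat', href: 'https://www.nrel.gov/docs/fy22osti/80220.pdf' },
        { text: 'three to five times more', href: 'https://www.nrel.gov/docs/fy22osti/80220.pdf' },
        { text: 'green hydrogen', href: 'https://www.energy.gov/eere/fuelcells/hydrogen-production-electrolysis' },
    ],
};

const summary = summarizeArticle(sample);
console.log(summary.wordCount); // 11
console.log(summary.linkDomains); // { 'www.nrel.gov': 2, 'www.energy.gov': 1 }
```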

Start text scraping

We barely scratched the surface with that example, so we suggest you get started with your own text scraping tasks and enjoy discovering what Smart Article Extractor can do. If you run into any trouble, reach out to us, and we’ll be happy to help.
