Python functions in web scraping

Functions make code easier to manage, debug, and understand. Learn how to use them in Python.

Functions are a core component of programming. Inspired by their mathematical counterparts, functions do much more than take inputs and return outputs. They make code easier to manage, debug, and understand.

Regardless of whether one is using functional or some other programming paradigm, functions are central to every programming language, including Python. In this article, we'll talk about Python functions, especially in the context of scraping data from the web.

We'll begin with a quick overview of Python functions, followed by how to use them for web scraping in Python.

Functions in Python

A function is a block of code that can be defined once and then used ("called") anywhere we want in the code. For example, we can define the function calculate_tax() and call it whenever we have to calculate an employee's tax.

Before we explain how to define and use a Python function, here are three points worthy of attention.

  • Similar to a mathematical function, a function may take some inputs and return an output. The aforementioned function to calculate tax is a good example.
  • A function may take no input and still return an output. For example, the built-in localtime() function in the time module returns the current local time without requiring any input.
  • Python's return behavior might surprise someone familiar with C/C++, C#, or Java: a Python function can return multiple outputs.
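As a quick illustration of the second point, localtime() takes no (required) input yet returns the current local time:

import time

# localtime() is called with no arguments and returns a struct_time object
now = time.localtime()
print(now.tm_year, now.tm_mon, now.tm_mday)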
💡
Most of these general facts about Python functions hold true for other languages as well.

Creating functions

A Python function can be defined as:

def sample_function(some_input):
    """
    Some function body
    """

For example, the aforementioned function to calculate tax:

def calculate_tax(salary):
    if salary < 12000:
        print('No tax applicable')
    else:
        tax = salary * 0.17
        print('Tax deductible for given salary is: ', tax)

The return keyword

While the above function doesn't return any output, many functions do. For such functions, we use the return keyword. For example, here's a simple function for converting degrees into radians:

import math

def deg_to_radians(deg):
    rad = deg / 180 * math.pi
    return rad

Now we've defined a function. But a function is almost pointless until it’s used. That brings us neatly to calling functions.

Calling functions

To use a function, we "call" it by its name and provide the inputs (if any):

function_name(inputs)

These inputs can either be variables or direct values. Both ways of calling it are valid:

s = 35000
calculate_tax(s)      # prints: Tax deductible for given salary is:  5950.0
# or
calculate_tax(35000)  # same output

Functions with multiple outputs

As I mentioned above, Python differs from other common languages in its support for multiple outputs. It's pretty simple: when returning, you can specify as many outputs as you like. Here's a quick example:

import math

def circle_attributes(radius):
    c = 2 * math.pi * radius        # circumference
    a = math.pi * radius * radius   # area
    return a, c

They can be collected (in the same order) while calling the function.

area, circumference = circle_attributes(3)
print(area, circumference)

We can even collect them in a single variable.

a = circle_attributes(4.5)

Multiple return values are actually packed into a single tuple, which we can confirm:

print(type(a))   # <class 'tuple'>

Pure and impure functions

We have seen that a function may or may not take input and yet return output(s). But can you think of a situation where a function doesn't return any output at all?

Consider a function that takes your date of birth to determine whether you're eligible for a driving license. Ideally, this function, which we could call validate_license_age(), would return a clear yes or no. However, it could also be designed to simply print() the result, without returning any explicit output.

In more advanced scenarios, a function might receive information for credit card or bank loan approval. After processing the inputs—whether through a fixed formula or a machine learning algorithm—the function might email its recommendations directly to the relevant department, without returning anything.

These kinds of actions, where a function performs tasks like printing or sending emails, are known as side effects. Another example of a side effect is when a function modifies the value of a variable within the program.

This distinction brings us to the concepts of pure and impure functions:

  • Pure functions do not change the program state. They rely only on the function's inputs to produce an output.
  • Impure functions can modify the program state and may depend on factors beyond the function's inputs, such as internal program state or I/O operations.

Functional programming encourages the use of pure functions to ensure predictable and reliable code.

💡
In practice, Python's functional programming often allows a relaxed form of side effects, where variable values remain unchanged, but I/O operations are permitted.
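To make the distinction concrete, here's a minimal sketch (the function names are ours, purely for illustration):

import math

# Pure: the output depends only on the input, and nothing outside
# the function is read or modified
def circle_area(radius):
    return math.pi * radius * radius

counter = 0

# Impure: it modifies a global variable and prints - both are side effects
def count_and_report(step):
    global counter
    counter += step
    print('Running total:', counter)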

Python functions for web scraping

Now that we've covered the basics of Python functions, let's look at how they're used in web scraping, beginning with two of the most prominent libraries in the Python ecosystem: Beautiful Soup and Scrapy.

Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML or XML pages. We can install it from PyPI:

pip install beautifulsoup4

To use it, we instantiate a BeautifulSoup object, passing it the markup to parse:

from bs4 import BeautifulSoup
bs_obj = BeautifulSoup(html_doc)  # html_doc holds the HTML markup

There are a number of useful methods, like:

  • find() - looks up the first matching tag within the page.
  • find_all() - finds all matching tags within the page. For example, passing 'a' returns all the anchor (link) tags in the page.
  • get_text() - extracts the plain (non-HTML) text from the page.
💡
When instantiating the BeautifulSoup object without a parser argument, it implicitly picks an already installed parser. If we want a specific parser, we can specify it as an optional argument to the constructor.
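Here's a tiny sketch putting these together, naming the built-in html.parser explicitly (the HTML string is just an example):

from bs4 import BeautifulSoup

html = "<p>Hello, <a href='https://example.com'>world</a>!</p>"

# Name the parser explicitly instead of letting Beautiful Soup pick one
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('a').get('href'))   # https://example.com
print(soup.get_text())              # Hello, world!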

We need to remember that Beautiful Soup's job is restricted to pulling data out of HTML (or XML) pages; we have to fetch the page ourselves, for example with the requests library. Here we're using find_all() to collect all the URLs from a page:

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Cosmopolitanism"
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

Scrapy

Scrapy is a complete Python framework that provides comprehensive support for web scraping. Its core component is the Spider class, which we subclass to define how a site should be crawled. It comes with a number of useful methods, like:

  • start_requests() - begins crawling and is called only once by the crawler.
  • parse() - processes the response and returns the scraped data.

To install Scrapy, we can simply use pip:

pip install scrapy

To confirm it's correctly installed, an import statement will do.

import scrapy   # should execute without any trouble if it's installed

Scrapy also provides a powerful CLI. For example, here we'll use it to create a project, generate a spider, and run it (the genspider and crawl commands run inside the project directory):

scrapy startproject Proj1
cd Proj1
scrapy genspider Spider1 wikiwand.com
scrapy crawl Spider1

We can also add -o when crawling to save the scraped output to a file, e.g. scrapy crawl Spider1 -o output.json.
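For reference, a filled-in version of the generated spider might look something like this; it's a minimal sketch, and the link-extraction logic in parse() is just an assumption for illustration:

import scrapy

class Spider1(scrapy.Spider):
    name = "Spider1"
    allowed_domains = ["wikiwand.com"]
    start_urls = ["https://www.wikiwand.com"]

    def parse(self, response):
        # Yield every link found on the page as an item
        for href in response.css("a::attr(href)").getall():
            yield {"link": href}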

How to import functions in Python

We've been using pre-defined functions from various libraries in this article so far. Importing them provides abstraction and a neat way of calling them. For example, if you use a function, say NumPy's arange(), you usually aren't concerned with the definition details and can call it without any fuss.

import numpy as np
a = np.arange(1,10)

Can we do the same for our user-defined functions too?

Luckily, it's pretty straightforward: define the functions in a Python file, import it (like we do for the libraries), and call them.

As an example, I'll define some functions in a file.

Module file

This is a separate file, sample_module.py, containing a couple of functions:

import math

def circumference(r):
    return 2*math.pi*r

def area(r):
    return math.pi*r*r

Calling code

Once the Python file (in this case sample_module.py) is saved, we can easily import it from either a Python file or a Jupyter notebook in the same directory:

import sample_module as sm

print(sm.circumference(2))   # 12.566...
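We can also import specific names directly, skipping the module prefix:

from sample_module import area

print(area(2))   # 12.566...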

Browser automation with functions

So far, we've learned how to use Python functions for fetching data from a website. But there's more to it; we can also automate a browser using Python.

When it comes to scraping dynamically loaded content (common on most modern websites), we need browser automation tools.

Two of the most powerful and popular libraries for the job are Selenium and Playwright.

The methodology is common to both libraries: we instantiate a browser of our choice (Chrome and Firefox being the common choices) and use it to either scrape or enter data.

Selenium

I won't take up your precious time with introductions to Selenium and Playwright. If you want to learn more about them, check out Playwright vs. Selenium.

Let's get straight into using Selenium to instantiate a Firefox web driver.

from selenium import webdriver

driver = webdriver.Firefox()

This driver has some useful methods, like:

  • get() - retrieves the given URL.
  • find_element(By.ID, 'element_id') - finds a particular element in the page by its id, element_id. Make sure to import By from selenium.webdriver.common.by.

Using these methods, let's make a function for logging into a page:

from selenium import webdriver
from selenium.webdriver.common.by import By

def automate_login(url, user_name, password):
    # Instantiate the Firefox WebDriver
    driver = webdriver.Firefox()

    try:
        # Navigate to the URL
        driver.get(url)

        # Find the elements and perform actions
        driver.find_element(By.ID, 'email').send_keys(user_name)
        driver.find_element(By.ID, 'pass').send_keys(password)
        driver.find_element(By.ID, 'loginbutton').click()

    finally:
        # Close the browser after the operations
        driver.quit()

As you can see, there are a couple of notes worthy of attention here:

  • send_keys() is called on the element returned by find_element() and is used to send the respective data. The same goes for click(), which allows us to automate the button press.
  • The IDs we used ('email', 'pass', etc.) aren't universal across websites, so this function can't be used with every website.

Playwright

Playwright’s working unit is a Page. A Page refers to a single tab or a popup window within a browser context. It has a number of functions (methods), like:

  • goto() - navigates to a page.
  • locator() - returns a locator for a particular element on the page.
  • screenshot() - grabs a screenshot of the current page. This method takes the path to save the screenshot to. There are also some optional parameters: clipping coordinates, whether to capture the full page, whether to disable animations, etc. More details can be found in the documentation.

Let's make a function using Playwright that navigates to the given URL and takes a screenshot.

from playwright.sync_api import sync_playwright

def take_screenshot(url, path):
    with sync_playwright() as p:
        # Launch a Chromium browser (headless by default) and open a page
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=path)
        browser.close()

Now let's take a screenshot of Wikipedia’s page:

take_screenshot("https://www.wikipedia.org", 'wikipediaSS.png')

Best practices and advanced function techniques

Before we wrap up, let me leave you with a couple of best practices for writing Python functions:

  • Use recursion carefully: recursion is a trapped beast. If used properly, it can be quite useful, but if mishandled, it can be troublesome. Personally, I've found recursive functions quite difficult to dry-run and debug due to the call stack and backtracking.
  • Avoid over-modularization: that is, don't overuse modules or functions. Whenever possible, it's a good idea to use anonymous functions (lambda expressions) instead of making trivial functions like a sum of two numbers, as shown below.
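For instance, a lambda can replace a trivial one-off helper:

numbers = [3, 1, 4, 1, 5]

# An inline lambda stands in for a trivial named squaring function
squared = list(map(lambda x: x * x, numbers))
print(squared)   # [9, 1, 16, 1, 25]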

Conclusion

We've reviewed pre-defined functions and defined some ourselves to build a clear understanding of Python functions, especially in the context of web scraping. I enjoyed writing (and coding) it. Hopefully, you'll appreciate what you've learned here when you try using these functions yourself.

Talha Irfan
I love reading Russian Classics, History, Cricket, Nature, Philosophy, and Science (especially Physics)— a lifelong learner. My goal is to learn and facilitate the learners.
