Functions are a core component of programming. Inspired by their mathematical counterparts, functions do much more than take inputs and return outputs. They make code easier to manage, debug, and understand.
Regardless of whether one follows a functional or some other programming paradigm, functions are central to every programming language, including Python. In this article, we'll talk about Python functions, especially in the context of scraping data from the web.
We'll begin with a quick overview of Python functions, followed by how to use them for web scraping in Python.
Functions in Python
A function is a code snippet that can be defined once and then used ("called") anywhere we want in the code. For example, we can define the function calculate_tax() and call it whenever we need to calculate an employee's tax.
Before we explain how to define and use a Python function, here are 3 points worthy of attention.
- Similar to a mathematical function, a function may take some inputs and return an output. The aforementioned function to calculate tax is a good example.
- A function may take no input and still return an output, e.g., the built-in localtime() function in the time module (see the example after this list).
- A Python function can return multiple outputs, which might surprise someone familiar with C/C++, C#, or Java.
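For instance, time.localtime() can be called with no arguments and still returns a value:

import time

now = time.localtime()  # no inputs; returns the current local time as a struct_time
print(now.tm_hour, now.tm_min)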
Creating functions
A Python function can be defined as:
def sample_function(inputs):
    """
    Some function body
    """
For example, the aforementioned function to calculate tax:
def calculate_tax(salary):
    if salary < 12000:
        print('No tax applicable')
    else:
        tax = salary * 0.17
        print('Tax deductible for given salary is: ', tax)
While the above function doesn't produce any output, other functions may. For such functions, we use the return keyword. For example, here's a simple function for converting degrees into radians:
import math

def deg_to_radians(deg):
    rad = deg / 180 * math.pi
    return rad
Now we've defined a function. But a function is almost pointless until it’s used. That brings us neatly to calling functions.
Calling functions
To use a function, we "call" it by writing its name followed by parentheses containing the inputs (if any): function_name(inputs). These inputs can either be variables or direct values. Both ways of calling are valid:
s = 35000
calculate_tax(s)
# or
calculate_tax(35000)
Functions with multiple outputs
As I mentioned above, Python differs from other common languages in supporting multiple outputs. It's pretty simple: in the return statement, you can specify as many values as you like. Here's a quick example:
import math

def circle_attributes(radius):
    c = 2 * math.pi * radius
    a = math.pi * radius * radius
    return a, c
They can be collected (in the same order) while calling the function.
area, circumference = circle_attributes(3)
print(area, circumference)
We can even collect them in a single variable.
a = circle_attributes(4.5)
Multiple return values are packed into a tuple, which we can confirm:
print(type(a))
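Since it's an ordinary tuple, the individual values can also be accessed by index:

print(a[0])  # area
print(a[1])  # circumference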
Pure and impure functions
We've seen that a function may or may not take inputs and still return output(s). But can you think of a situation where a function doesn't return any output at all?
Consider a function that takes your date of birth to determine whether you're eligible for a driving license. Ideally, this function, which we could call validate_license_age(), would return a clear yes or no. However, it could also be designed to simply print() the result, without returning any explicit output.
In more advanced scenarios, a function might receive information for credit card or bank loan approval. After processing the inputs—whether through a fixed formula or a machine learning algorithm—the function might email its recommendations directly to the relevant department, without returning anything.
These kinds of actions, where a function performs tasks like printing or sending emails, are known as side effects. Another example of a side effect is when a function modifies the value of a variable within the program.
This distinction brings us to the concepts of pure and impure functions:
- Pure functions do not change the program state. They rely only on the function's inputs to produce an output.
- Impure functions can modify the program state and may depend on factors beyond the function's inputs, such as internal program state or I/O operations.
Functional programming encourages the use of pure functions to ensure predictable and reliable code.
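Here's a minimal sketch contrasting the two (the function names are just for illustration):

def add(a, b):
    # pure: the output depends only on the inputs; no external state is touched
    return a + b

total = 0

def add_to_total(x):
    # impure: modifies the global variable total, which is a side effect
    global total
    total += x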
Python functions for web scraping
Now that we've covered the basics of Python functions, let's look at how they're used in web scraping, beginning with two of the most prominent libraries in the Python ecosystem: Beautiful Soup and Scrapy.
Beautiful Soup
Beautiful Soup is a Python library for extracting data from HTML or XML pages. We can install it from PyPI:
pip install beautifulsoup4
To use it, we instantiate a BeautifulSoup object, passing in the markup to parse (and, optionally, the parser to use):

from bs4 import BeautifulSoup
bs_obj = BeautifulSoup('<html>...</html>', 'html.parser')
It provides a number of useful functions, like:
- find() - looks up the first occurrence of a particular tag within the page.
- find_all() - finds all matching tags within the page. For example, if we specify 'a' as a parameter, it will get all the URLs (anchor tags) within a page.
- get_text() - extracts the plain (non-HTML) text from the page.
We need to remember that Beautiful Soup's job is restricted to pulling data out of HTML (or XML) pages; we have to fetch the page ourselves. We can use a library like requests for that. For example, here we're using find_all() to fetch all the URLs from a page:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Cosmopolitanism"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))
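On the same soup object, we could also try find() and get_text(), for example, to print the text of the page's first heading (assuming the page has an h1 tag):

heading = soup.find('h1')
print(heading.get_text())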
Scrapy
Scrapy is a complete Python framework that provides comprehensive support for web scraping. Its core component is the Spider object, which we can use for web crawling. It comes with a number of useful methods, like:
- start_requests() - begins the crawl; it's called only once, by the crawler.
- parse() - processes the response and returns the scraped data.
To install Scrapy, we can simply use pip:
pip install scrapy
To confirm it's correctly installed, an import statement will do:

import scrapy  # should execute without any trouble if it's installed
Scrapy also provides a powerful CLI. For example, here we'll use it to create a spider and run it:
scrapy startproject Proj1
scrapy genspider Spider1 wikiwand.com
scrapy crawl Spider1
We can also add the -o flag to the crawl command to save the output to a file (e.g., scrapy crawl Spider1 -o output.json).
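To make the Spider object more concrete, here's a minimal sketch of what Spider1 might look like once we fill in its parse() method; the CSS selector and output fields are assumptions for illustration:

import scrapy

class Spider1(scrapy.Spider):
    name = "Spider1"
    start_urls = ["https://www.wikiwand.com"]

    def parse(self, response):
        # yield every link found on the page as a scraped item
        for href in response.css("a::attr(href)").getall():
            yield {"link": href}

With this parse() in place, the crawl command above would yield one item per link found.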
How to import functions in Python
So far in this article, we've been using pre-defined functions from various libraries. This provides abstraction and a neat way of calling them. For example, if you use a function, say NumPy's arange(), you usually aren't concerned with its definition details and can call it without any fuss.
import numpy as np
a = np.arange(1,10)
Can we do it for our user-defined functions too?
Luckily, it's pretty straightforward: define the functions in a Python file, import that file (like we do for libraries), and call them.
As an example, I'll define some functions in a file.
Module file
This is a separate Python file containing the function definitions. Here's one with a couple of functions:
import math

def circumference(r):
    return 2 * math.pi * r

def area(r):
    return math.pi * r * r
Calling code
Once the Python file (in this case sample_module.py) is saved, we can easily import it from either a Python file or a Jupyter notebook (in the same directory):
import sample_module as sm
sm.circumference(2)
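We can also import specific functions directly, which lets us drop the module prefix:

from sample_module import circumference
circumference(2)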
Browser automation with functions
So far, we've learned how to use Python functions for fetching data from a website. But there's more to it; we can also automate a browser using Python.
When it comes to web scraping dynamically loaded content (which is often the case with most modern websites), we need browser automation tools.
Two of the most powerful and popular libraries for the job are Selenium and Playwright.
The methodology is common in both libraries: we instantiate a browser of our choice (Chrome and Firefox being the common choices) and use it to either scrape or enter the data.
Selenium
I won't take up your precious time with introductions to Selenium and Playwright. If you want to learn more about them, check out Playwright vs. Selenium.
Let's get straight into using Selenium to instantiate a Firefox web driver.
from selenium import webdriver
driver = webdriver.Firefox()
This driver has some useful methods, like:
- get() - navigates to the given URL.
- find_element(By.ID, 'element_id') - finds a particular element in the page by its id, element_id. Please make sure to import By from selenium.webdriver.common.by.
Using these methods, we can make a function for logging into a page:
from selenium import webdriver
from selenium.webdriver.common.by import By

def automate_login(url, user_name, password):
    # Instantiate the Firefox WebDriver
    driver = webdriver.Firefox()
    try:
        # Navigate to the URL
        driver.get(url)
        # Find the elements and perform actions
        driver.find_element(By.ID, 'email').send_keys(user_name)
        driver.find_element(By.ID, 'pass').send_keys(password)
        driver.find_element(By.ID, 'loginbutton').click()
    finally:
        # Close the browser after the operations
        driver.quit()
There are a couple of notes worthy of attention here:
- send_keys() and click() are methods of the element returned by find_element(); the former types the given data into the element, and the latter automates the button click.
- The IDs we used (email, pass, etc.) aren't ubiquitous across websites, so this function can't be used with every website.
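Calling the function is then a one-liner; the URL and credentials below are just placeholders:

automate_login('https://example.com/login', 'user@example.com', 'not-a-real-password')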
Playwright
Playwright's working unit is a Page. A Page refers to a single tab or a popup window within a browser context. It has a number of functions (methods), like:
- goto() - navigates to a page.
- locator() - creates a locator for finding a particular element on the page.
- screenshot() - grabs a screenshot of the current page. This method takes the path to save the screenshot, plus some optional parameters: clipping coordinates, whether to capture the full page, how to handle animations, etc. More details can be found in the documentation.
Let's make a function using Playwright that navigates to the given URL and takes a screenshot:
from playwright.sync_api import sync_playwright

def take_screenshot(url, path):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=path)
        browser.close()
Now let's take a screenshot of Wikipedia’s page:
take_screenshot("https://www.wikipedia.org", 'wikipediaSS.png')
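In the same spirit, here's a quick sketch of locator() in action, assuming the target page has a single h1 element:

from playwright.sync_api import sync_playwright

def print_heading(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # locator() points at the element; inner_text() reads its visible text
        print(page.locator('h1').inner_text())
        browser.close()

print_heading('https://www.wikipedia.org')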
Best practices and advanced function techniques
Before we wrap up, let me leave you with a couple of best practices for writing Python functions:
- Use recursion carefully: Recursion is a trapped beast. Used properly, it can be quite useful, but mishandled, it can be troublesome. Personally, I've found recursive functions quite difficult to dry-run and debug due to the call stack and backtracking.
- Avoid over-modularization: In this context, over-modularization means overusing modules or functions. Whenever possible, it's a good idea to use anonymous expressions (lambdas) instead of making trivial functions like a sum of two numbers, as shown in the sketch below.
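For example, instead of defining a trivial helper function to square numbers, an anonymous lambda can be used inline:

# instead of a trivial named helper...
def square(x):
    return x * x

# ...a lambda does the job inline
squares = list(map(lambda x: x * x, range(5)))
print(squares)  # [0, 1, 4, 9, 16]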
Conclusion
We've reviewed pre-defined functions and defined some of our own to build a clear understanding of Python functions, especially in the context of web scraping. I enjoyed writing (and coding) this article. Hopefully, you'll appreciate what you've learned here when you try using these functions yourself.