How to use cURL in Python

Many developers use cURL on the command line every day because of the control and flexibility it offers. Let's see how we can use cURL from Python and explore some of its advanced use cases.

What is cURL?

cURL stands for Client URL, a lightweight command line tool for making network requests from the client side. It's suitable for many use cases, like making HTTP requests and testing APIs. It provides support for using different network protocols such as HTTP, HTTPS, FTP, and many more.

Let's say we have an API with several endpoints that we want to test. Rather than using a web-based API client, we can use cURL to hit those endpoints directly.

cURL is undoubtedly a powerful tool. To use it inside other applications, we can turn to PycURL, a Python library that combines the control and features of cURL with the simplicity and flexibility of Python. The main goal of this tutorial is to explore the different features provided by PycURL.

How to use PycURL

Setting up the environment for cURL requests in Python

Before making cURL requests in Python, we must ensure that we have Python and PycURL installed.

So, we first need to install Python on our machine. We can easily download Python from the official Python website depending on our machine. Once we download the Python installer, we can install Python and check the version by entering the following command on the command line.

python3 --version

We'll see the version of Python that's installed on our system.

📔
Note: We use the python3 command here because we installed the latest version of Python, which is invoked as python3 rather than plain python.

Installing the PycURL library

Now we have Python installed on our system. We can install pycurl using Python's package manager pip3.

pip3 install pycurl

Once the command is executed, we can confirm the installation by simply running the following command:

pip3 show pycurl

This should print the name of the library with other information. Once we're done with the installation, we're ready to make our requests using PycURL.
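
The same check can also be made from inside Python. The snippet below is a minimal sketch rather than part of the original setup: the helper name is_installed is hypothetical, and it relies on importlib.util.find_spec from the standard library so that it does not fail when PycURL is missing.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pycurl"):
    import pycurl

    # pycurl.version is a string naming the PycURL and libcurl versions
    print("PycURL is available:", pycurl.version)
else:
    print("PycURL is not installed; run: pip3 install pycurl")
```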

📔
Note: We need to be careful when installing packages through pip. If we're using Python 3, we should use pip3 so that the package is installed for the right interpreter. Otherwise, we may run into version mismatches.

How to make a GET request with PycURL

Before making GET requests, let's first see how a basic GET request works. When we make a GET request, we basically ask a server to give us a specific resource. The resource could be a file, an HTML page, a JSON object, or other data. It's called a GET request because it gets a resource from the server.

We can make GET requests with the pycurl library in a few steps. Let's look at the code first and then walk through it in detail.

import pycurl
from io import BytesIO
# Create a Curl object
curl = pycurl.Curl()
# Create a BytesIO object to store the response
responseBuffer = BytesIO()

# Set the URL
curl.setopt(curl.URL, 'https://www.google.com')
# Set the option to write the response to the buffer
curl.setopt(curl.WRITEDATA, responseBuffer)
# Make the request
curl.perform()
# Fetch the response body
response = responseBuffer.getvalue()

# Print the response
print(response.decode('utf-8'))
# Close the Curl object
curl.close()
# Close the buffer
responseBuffer.close()

In the code above, we first create a Curl object using pycurl.Curl(). Then we set the URL we want to fetch using curl.setopt(curl.URL, 'https://www.google.com'). We can also set other options using curl.setopt(), such as storing the response data in the responseBuffer object.

We then perform the request using curl.perform(). This sends the HTTP request to the URL and retrieves the response. We get the response body using responseBuffer.getvalue(), which returns a bytes object. We decode this response using response.decode('utf-8') to convert it to a string.

Finally, we close the Curl and buffer objects using curl.close() and responseBuffer.close() respectively.
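
The steps above can be wrapped in a reusable function. The following is a hedged sketch, not the tutorial's own code: the names charset_from_content_type and fetch_text are hypothetical, and it assumes the server declares its encoding in the Content-Type header (falling back to UTF-8 otherwise). PycURL is imported inside fetch_text so that the small parsing helper can be tried without it.

```python
from email.message import Message
from io import BytesIO


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type value."""
    if not content_type:
        return default
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset(default)


def fetch_text(url):
    """Fetch a URL with PycURL and decode the body using the declared charset."""
    import pycurl  # imported here so the helper above works without PycURL

    response_buffer = BytesIO()
    curl = pycurl.Curl()
    try:
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.WRITEDATA, response_buffer)
        curl.perform()
        content_type = curl.getinfo(pycurl.CONTENT_TYPE)
    finally:
        # Release the handle even if perform() raises
        curl.close()
    charset = charset_from_content_type(content_type)
    return response_buffer.getvalue().decode(charset, errors="replace")


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
```

Calling fetch_text('https://www.google.com') would then return the page as a string. The try/finally block is a design choice that keeps the handle from leaking when the request fails.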

When running this code on macOS, you may encounter an error caused by mismatched versions of curl or OpenSSL. If so, you can try the following solution:

# Uninstall curl

brew uninstall curl

# Uninstall openssl

brew uninstall openssl

# Uninstall curl-openssl

brew uninstall curl-openssl

# Install openssl

brew install openssl

# Install curl

brew install curl

echo 'export PATH="/usr/local/opt/openssl@1.1/bin:$PATH"' >> ~/.zshrc

echo 'export PATH="/usr/local/opt/curl/bin:$PATH"' >> ~/.zshrc

source ~/.zshrc

source ~/.virtualenvs/foo_venv/bin/activate # or whichever venv you're using

export PYCURL_SSL_LIBRARY=openssl

export LDFLAGS="-L/usr/local/opt/curl/lib"

export CPPFLAGS="-I/usr/local/opt/curl/include"

# The command below prints the location where openssl is installed

brew --prefix openssl

# For example, /usr/local/opt/openssl@3 (use the path printed on your machine)

ls -la /usr/local/opt/openssl@3

# Pass that location as --openssl-dir in the next command, replacing the example path shown here
# Note: pip 23.1 and later no longer accept --install-option, so this command needs an older pip

pip3 install --no-cache-dir --compile --ignore-installed --install-option="--with-openssl" --install-option="--openssl-dir=/usr/local/opt/openssl@1.1" pycurl

You may be wondering why the syntax is so verbose and why we need so many steps to make a single request.

Why the syntax of the PycURL library is hard to understand

The PycURL library provides a low-level interface to cURL, giving us more control and flexibility. As we get closer to any computer system, we attain more control, but the syntax gets harder and less readable for humans.

The steps we have seen earlier, like importing BytesIO, creating objects, and decoding responses, are necessary because PycURL operates at a lower level and provides us with direct access to the raw HTTP response.
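
To illustrate what this direct access looks like, here is a hedged sketch of a header callback. The name make_header_collector is hypothetical. The callback would be registered with curl.setopt(pycurl.HEADERFUNCTION, on_header), after which PycURL calls it once per raw header line, passed as bytes.

```python
def make_header_collector():
    """Return a dict plus a callback that fills it from raw header lines."""
    headers = {}

    def on_header(line):
        # Header lines arrive as bytes; HTTP headers are ISO-8859-1 by default
        text = line.decode("iso-8859-1")
        if ":" not in text:
            return  # status line or the blank line that ends the headers
        name, value = text.split(":", 1)
        headers[name.strip().lower()] = value.strip()

    return headers, on_header


# Simulate the lines PycURL would pass to the callback
headers, on_header = make_header_collector()
for raw in (b"HTTP/1.1 200 OK\r\n",
            b"Content-Type: text/html; charset=utf-8\r\n",
            b"\r\n"):
    on_header(raw)
print(headers)
```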

How to make a POST request with PycURL

The POST request is very similar to the GET request. The only difference is that we add the data we want to send with the request, and that data must be encoded first. Let's see how things work.

import pycurl
# Import the urllib for encoding
import urllib.parse
from io import BytesIO

# Create a Curl object
curl = pycurl.Curl()
responseBuffer = BytesIO()
# Set the URL
curl.setopt(curl.URL, 'http://httpbin.org/post')
# Set the method to POST
curl.setopt(curl.POST, True)

# Data
data = {'name': 'John', 'age': '30'}
# Encode the data
dataString = urllib.parse.urlencode(data)
# Add data as POSTFIELDS
curl.setopt(curl.POSTFIELDS, dataString)
# Set the variable that will store data
curl.setopt(curl.WRITEDATA, responseBuffer)

# Make the request
curl.perform()
# Get the responseCode
responseCode = curl.getinfo(curl.RESPONSE_CODE)
print('Response Code:', responseCode)
# Get the responseBody
responseBody = responseBuffer.getvalue()
print('Response Body:', responseBody.decode('utf-8'))
# Close the object
curl.close()
# Close the buffer
responseBuffer.close()

We first create a Curl object, and then we set the URL. We set the request method to POST using curl.setopt(curl.POST, True). After that, we define the data as a dictionary and encode it using the urllib.parse.urlencode() function. We set the encoded data as the request body using curl.setopt(curl.POSTFIELDS, dataString). After performing the request, we print the response code and the response body. Finally, we close the Curl and buffer objects.
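
To make the encoding step concrete, the sketch below shows what urlencode() produces for the dictionary used above, next to a JSON body as an alternative. The helper names form_body and json_request are hypothetical, and sending JSON is an assumption about what a given API accepts, not something the example endpoint requires.

```python
import json
from urllib.parse import urlencode


def form_body(data):
    """Encode a dict as an application/x-www-form-urlencoded string."""
    return urlencode(data)


def json_request(data):
    """Return a JSON body and the header list that should accompany it."""
    body = json.dumps(data)
    headers = ["Content-Type: application/json"]
    return body, headers


print(form_body({'name': 'John', 'age': '30'}))  # name=John&age=30
print(form_body({'q': 'a b&c'}))                 # q=a+b%26c

body, headers = json_request({'name': 'John', 'age': '30'})
print(body, headers)
```

With PycURL, the JSON variant would be sent by passing body to curl.POSTFIELDS and headers to curl.HTTPHEADER.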

Advanced concepts of PycURL library

We have covered some basic concepts of the PycURL library and how it works. Now, let's cover some more advanced topics, including adding custom headers, handling redirects, authenticating requests, and handling errors.

How to add headers in the cURL request using Python

Headers let us send additional information with a request, such as the user agent, content type, or authorization credentials. Since the .setopt() method of PycURL is how we configure every aspect of a request, we use it to add headers as well. Let's see how it's done.

import pycurl
from io import BytesIO

# Create a new Curl object
curl = pycurl.Curl()
# Create a BytesIO object
responseBuffer = BytesIO()

# Set the URL to fetch
curl.setopt(curl.URL, 'https://httpbin.org/headers')

# Set custom headers
customHeaders = ['Authorization: Bearer mytoken', 'User-Agent: MyCustomAgent']
curl.setopt(curl.HTTPHEADER, customHeaders)

# Set the WRITEFUNCTION option to redirect the response body to the responseBuffer
curl.setopt(curl.WRITEFUNCTION, responseBuffer.write)

# Make the request
curl.perform()

# Get the HTTP response code
responseCode = curl.getinfo(curl.RESPONSE_CODE)
print('Response Code:', responseCode)

# Get the response body from the responseBuffer
responseBody = responseBuffer.getvalue()

# Print the body of the response
print('Response Body:', responseBody.decode('utf-8'))

# Close the Curl object and the responseBuffer
curl.close()
responseBuffer.close()

We set two custom headers (Authorization and User-Agent) by setting the curl.HTTPHEADER option to a list of strings containing the header names and values. We then perform the request and retrieve the response body as before.
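
Because curl.HTTPHEADER expects a list of "Name: value" strings, it can be convenient to keep headers in a dictionary and convert them just before the request. The helper below is a small sketch. The name headers_to_list is hypothetical, and the line-break check is an added precaution rather than something PycURL requires.

```python
def headers_to_list(headers):
    """Convert a dict of headers into the list of strings HTTPHEADER expects."""
    lines = []
    for name, value in headers.items():
        # Reject line breaks so one value cannot smuggle in extra headers
        if any(ch in name + value for ch in "\r\n"):
            raise ValueError(f"Header {name!r} contains a line break")
        lines.append(f"{name}: {value}")
    return lines


customHeaders = headers_to_list({
    'Authorization': 'Bearer mytoken',
    'User-Agent': 'MyCustomAgent',
})
print(customHeaders)
```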

How to handle redirects in cURL requests with Python

An HTTP redirect tells the client to request a different URL from the one it originally asked for. In simple terms, when a user tries to access a specific URL, the server automatically sends them to an alternative URL. This happens, for example, when the owner of a website changes a URL or wants the user to see a newer version of a page. In that case, the website remains reachable through both the old and the new URL.

PycURL does not follow redirects automatically. We can configure it to do so using the curl.setopt() method.

import pycurl
from io import BytesIO

# Create a new Curl object
curl = pycurl.Curl()
# Create a BytesIO object
responseBuffer = BytesIO()

# Set the URL to fetch
curl.setopt(curl.URL, 'http://httpbin.org/redirect-to?url=https%3A%2F%2Fwww.google.com')
# Follow redirects, but no more than 2 of them
curl.setopt(curl.FOLLOWLOCATION, True)
curl.setopt(curl.MAXREDIRS, 2)
# Write the data in the responseBuffer
curl.setopt(curl.WRITEDATA, responseBuffer)

# Perform the request
curl.perform()
# Get the response body
response = responseBuffer.getvalue()
# Print the response
print(response.decode('utf-8'))

# Close the Curl object
curl.close()
# Close the buffer
responseBuffer.close()

We enable the curl.FOLLOWLOCATION option to instruct cURL to follow redirects. This option is an on/off switch, so any non-zero value turns it on. The number of redirects that cURL will follow is limited separately through the curl.MAXREDIRS option, for example by setting it to 2.
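
To see where a redirected request actually ended up, PycURL exposes the final URL and the number of redirects through getinfo(). The sketch below is a hedged illustration: the function names are hypothetical, resolve_redirect only shows how a Location header value is resolved against the requested URL, and fetch_final_url needs network access to run.

```python
from io import BytesIO
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location)


def fetch_final_url(url, max_redirects=2):
    """Follow redirects with PycURL and report where the request ended up."""
    import pycurl  # imported here so resolve_redirect works without PycURL

    response_buffer = BytesIO()
    curl = pycurl.Curl()
    try:
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.FOLLOWLOCATION, True)
        curl.setopt(pycurl.MAXREDIRS, max_redirects)
        curl.setopt(pycurl.WRITEDATA, response_buffer)
        curl.perform()
        return (curl.getinfo(pycurl.EFFECTIVE_URL),
                curl.getinfo(pycurl.REDIRECT_COUNT))
    finally:
        curl.close()


# A relative Location is resolved against the original URL
print(resolve_redirect("http://example.com/a/b", "/c"))
# An absolute Location replaces it entirely
print(resolve_redirect("http://example.com/a/b", "https://www.google.com"))
```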

How to handle cookies in cURL requests with Python

Cookies are small pieces of text used to save a user's state or information about the user. PycURL gives us an easy way to manage them: we pass a text file to the setopt() method through two options, COOKIEJAR and COOKIEFILE. COOKIEFILE tells PycURL to read cookies from a file before making a request, and COOKIEJAR tells it to save the cookies received in the response to a file.

import pycurl
from io import BytesIO
# Create a Curl object
curl = pycurl.Curl()
# Create a buffer object
responseBuffer = BytesIO()
# Set the URL
curl.setopt(curl.URL, 'https://stackoverflow.com/')
# Set the buffer to receive data
curl.setopt(curl.WRITEDATA, responseBuffer)
# Save cookies to a file
curl.setopt(curl.COOKIEJAR, 'cookies.txt')
# Load cookies from a file
curl.setopt(curl.COOKIEFILE, 'cookies.txt')

# Perform the request
curl.perform()

# Read cookies
cookies = curl.getinfo(pycurl.INFO_COOKIELIST)

# Print the cookies
print("Cookies are:")
for cookie in cookies:
    print(cookie)

# Close the Curl object
curl.close()
# Close the buffer
responseBuffer.close()

In this example, we specify the cookie file through setopt(). If the file does not exist yet, it is created and the cookies are saved into it when the Curl object is closed. We read the cookies with curl.getinfo(pycurl.INFO_COOKIELIST) and print them.
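
Each entry returned by INFO_COOKIELIST is a single tab-separated line in the Netscape cookie file format, which is also what ends up in cookies.txt. The helper below is a sketch of how such a line can be split into named fields. The function name parse_cookie_line and the sample cookie are made up for illustration.

```python
def parse_cookie_line(line):
    """Split one Netscape-format cookie line into a dict of named fields."""
    # HttpOnly cookies are marked with a prefix in front of the domain
    http_only = line.startswith("#HttpOnly_")
    if http_only:
        line = line[len("#HttpOnly_"):]
    domain, subdomains, path, secure, expires, name, value = line.split("\t", 6)
    return {
        "domain": domain,
        "include_subdomains": subdomains == "TRUE",
        "path": path,
        "secure": secure == "TRUE",
        "expires": int(expires),  # Unix timestamp; 0 means a session cookie
        "name": name,
        "value": value,
        "http_only": http_only,
    }


sample = "#HttpOnly_.example.com\tTRUE\t/\tTRUE\t0\tsession\tabc123"
print(parse_cookie_line(sample))
```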

Authenticating cURL requests with Python

Some websites require authentication before they let us access their content. PycURL can handle this situation through the .setopt() method as well: we URL-encode the credentials and send them in a POST request. Let's see the code in detail.

import pycurl
from io import BytesIO
from urllib.parse import urlencode

# Create a new Curl object
curl = pycurl.Curl()

# Create a buffer object
responseBuffer = BytesIO()

# Enable cookie handling
curl.setopt(curl.COOKIEJAR, 'cookies.txt')
curl.setopt(curl.COOKIEFILE, 'cookies.txt')

# Set the login URL
curl.setopt(curl.URL, 'https://newsapi.org/login')

# Set the request method to POST
curl.setopt(curl.POST, 1)

# Add the data 
postData = {'email': 'yourEmail', 'password': 'yourPassword'}
# Encode the data
postfields = urlencode(postData)
# Add the post fields
curl.setopt(curl.POSTFIELDS, postfields)

# Add the buffer variable
curl.setopt(curl.WRITEDATA, responseBuffer)
# Perform the login request
curl.perform()

# Get the HTTP response code
responseCode = curl.getinfo(curl.RESPONSE_CODE)
print('Response Code:', responseCode)

# Clear the response buffer
responseBuffer.truncate(0)
responseBuffer.seek(0)

# Switch back to GET and set the URL to the account page
curl.setopt(curl.HTTPGET, True)
curl.setopt(curl.URL, 'https://newsapi.org/account')

# Make the request to the account page
curl.perform()

# Fetch the response body
responseBody = responseBuffer.getvalue()

# Print the body of the response
print('Response Body:', responseBody.decode('utf-8'))

# Close the Curl object, and the responseBuffer
curl.close()
responseBuffer.close()

In the code above, we import the urlencode function to encode the credentials. We enable cookies to keep the session and save them in a file. The line curl.setopt(curl.POSTFIELDS, postfields) adds the credentials to the request body. After that, we make the POST request and print the response code, which should be 200 if the login succeeds. After clearing the responseBuffer, we make a GET request to fetch the account page.
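
Hard-coding credentials in a script is easy to get wrong, so one option is to read them from environment variables and build the POST body from those. The following is a minimal sketch under stated assumptions: the variable names APP_EMAIL and APP_PASSWORD and the helper build_login_body are hypothetical, and the field names email and password simply mirror the example above.

```python
import os
from urllib.parse import urlencode


def build_login_body(env=os.environ):
    """Build the URL-encoded login body from environment variables."""
    email = env.get("APP_EMAIL")
    password = env.get("APP_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set APP_EMAIL and APP_PASSWORD before running")
    return urlencode({"email": email, "password": password})


# A plain dict stands in for os.environ in this demonstration
demo_env = {"APP_EMAIL": "me@example.com", "APP_PASSWORD": "p w"}
print(build_login_body(demo_env))
```

The returned string can be passed to curl.setopt(curl.POSTFIELDS, ...) in place of postfields.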

How to handle errors in cURL requests with Python

When we write code, we cannot always be sure that an operation will succeed. For example, a request may fail because the server is unreachable, or a file we want to read may not exist. In such cases, an error is raised. To handle these situations with PycURL, we use try/except blocks.

Here's an example of using a try/except block around a cURL request.

import pycurl
# Create a new Curl object
curl = pycurl.Curl()
# Set the URL
curl.setopt(curl.URL, 'https://www.googlecom')
try:
    # Perform the request
    curl.perform()
except pycurl.error as error:
    # Handle the error
    errorNumber, errorString = error.args
    print('Error: %s %s' % (errorNumber, errorString))
# Close the Curl object
curl.close()

In this example, we have intentionally set an invalid URL to demonstrate how to handle errors. We have used a try except block to catch any pycurl.error exceptions that may be raised during the request. We then extract the error number and error string from the exception using error.args and print an error message.
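
Some failures, such as a timeout, are temporary, so a common extension of this pattern is to retry the request a few times before giving up. The helper below is a generic sketch rather than a PycURL feature: the name with_retries and its parameters are hypothetical. With PycURL it could be called as with_retries(curl.perform, retryable=(pycurl.error,)).

```python
import time


def with_retries(action, retryable=(Exception,), attempts=3, delay=0.0):
    """Call action(), retrying on the given exceptions up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retryable:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            # Wait a little longer after each failed attempt
            time.sleep(delay * attempt)


# Demonstration with a function that fails twice and then succeeds
calls = []

def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("temporary failure")
    return "ok"

print(with_retries(flaky, retryable=(ValueError,), attempts=3))
```

When retrying curl.perform() with a response buffer attached, the buffer should be cleared between attempts, because a failed attempt may already have written part of a response into it.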

More advanced topics in cURL requests with Python

In this section, we'll cover some even more advanced topics in cURL requests with Python, including performing file uploads and working with SSL/TLS certificates.

How to perform file uploads in cURL requests with Python

We may need to upload a file along with our HTTP request, such as when working with file storage or API endpoints that accept file uploads. PycURL provides us with an easy way to upload files using the same method setopt(). We just need to set our request to POST and specify the path of the file with its type, and we are good to go. Let's see how easy this process is.

import pycurl
from io import BytesIO

# Create a new Curl object
curl = pycurl.Curl()
# Create a buffer
responseBuffer = BytesIO()
# Set the URL for the file upload
curl.setopt(curl.URL, 'https://httpbin.org/post')
# Set the file to be uploaded and other options
curl.setopt(curl.HTTPPOST, [('file', (curl.FORM_FILE, '/content/cookies.txt'))])
# Specify the buffer to receive response
curl.setopt(curl.WRITEDATA, responseBuffer)
# Apply the try block
try:
    # Perform the request
    curl.perform()
except pycurl.error as error:
    # Handle the error
    errorNumber, errorString = error.args
    print('Error: %s %s' % (errorNumber, errorString))
# Close the Curl object
curl.close()
# Decode and print the response
response = responseBuffer.getvalue().decode('utf-8')
print(response)
# Close the buffer
responseBuffer.close()

Most of this code is the same as in the previous examples. The only line that needs explanation is the one that sets curl.HTTPPOST, where we switch the request to POST and specify the file. curl.HTTPPOST sets the HTTP method to POST and takes a list describing the form fields. file is the name of the form field under which the server will receive the uploaded file. (curl.FORM_FILE, '/content/cookies.txt') is a tuple that specifies the type of the form field and the path of the file to be uploaded.
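
The form tuple can carry more than the file path. As a hedged sketch, the code below adds an explicit content type through pycurl.FORM_CONTENTTYPE, guessed from the file extension with the standard mimetypes module. The helper names guess_upload_type and build_upload_form are hypothetical, and PycURL is imported inside the second function so the first can be tried on its own.

```python
import mimetypes


def guess_upload_type(path, default="application/octet-stream"):
    """Guess a MIME type from the file extension, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default


def build_upload_form(field_name, path):
    """Build the value for curl.HTTPPOST with an explicit content type."""
    import pycurl  # imported here so guess_upload_type works without PycURL

    return [(field_name, (
        pycurl.FORM_FILE, path,
        pycurl.FORM_CONTENTTYPE, guess_upload_type(path),
    ))]


print(guess_upload_type("cookies.txt"))
```

The list returned by build_upload_form('file', '/content/cookies.txt') would be passed to curl.setopt(curl.HTTPPOST, ...).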

How to handle SSL/TLS certificates in PycURL

Secure Sockets Layer (SSL) and its successor, Transport Layer Security (TLS), are protocols for creating secure connections over the internet. They protect the connection between the browser and the server by encrypting the data transferred between them. SSL/TLS certificates are issued to websites by organizations called Certificate Authorities (CAs), which vouch that a website is owned by the organization it claims to belong to and can be trusted.

When we visit a website over SSL/TLS, our web browser checks the site's certificate, which acts as a digital stamp of approval verifying the authenticity and trustworthiness of the website. Because the set of trusted CA certificates changes over time, PycURL does not provide one itself. We can point it to certificates stored in a local directory, but if we don't have any, we can use the certifi Python package.

Before using this package, we need to install it with the following command.

pip3 install certifi

Now, we can use this package in our code.

import pycurl
import certifi
# Create a Curl object
curl = pycurl.Curl()
# Set the URL
curl.setopt(curl.URL, 'https://blog.apify.com/')
# Check the CA certificates through certifi
curl.setopt(curl.CAINFO, certifi.where())
# Perform the request
curl.perform()
# Retrieve the response code
responseCode = curl.getinfo(curl.RESPONSE_CODE)
print(f'Response code: {responseCode}')
# Close the Curl object
curl.close()

In the code above, we use an additional package that provides the CA certificates needed to verify a website's certificate. The curl.CAINFO option tells cURL which CA bundle file to verify against, and certifi.where() returns the path to the bundle shipped with the package.
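
If a script has to run on machines where certifi may not be installed, one possible approach is to fall back to the CA file that Python's own ssl module reports for the system. This is a sketch under that assumption. The helper name find_ca_bundle is hypothetical, and the fallback can be None on systems that do not expose a single CA file.

```python
import importlib.util
import ssl


def find_ca_bundle():
    """Return a CA bundle path: certifi's if available, else the system default."""
    if importlib.util.find_spec("certifi") is not None:
        import certifi

        return certifi.where()
    # May be None when the system has no single CA file configured
    return ssl.get_default_verify_paths().cafile


ca_path = find_ca_bundle()
print("CA bundle:", ca_path)
```

When ca_path is not None, it can be passed to curl.setopt(curl.CAINFO, ca_path) exactly as in the example above.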

Final code

Throughout this tutorial, we have covered the basics and some advanced concepts of cURL combined with the simplicity and ease of use of Python. This combination allows you to perform complex web operations with fewer lines of code and provides better control over HTTP requests.

Let’s put everything we’ve learned from this tutorial together and look at a script that covers most aspects.

In this script, we will also use Beautiful Soup to parse and extract information from the content we get through our requests. If you’re not familiar with it, you should read more about web scraping with Beautiful Soup.

Install Beautiful Soup by entering the command pip3 install beautifulsoup4, then run the script below.

import pycurl
import certifi
from io import BytesIO
# Import BeautifulSoup
from bs4 import BeautifulSoup
# Make a new curl object
curl = pycurl.Curl()
# Make a new Buffer
responseBuffer = BytesIO()
# Set the URL
curl.setopt(curl.URL, "https://blog.apify.com/")
# Check the certificates
curl.setopt(curl.CAINFO, certifi.where())
# Follow redirects, but no more than 3 of them
curl.setopt(curl.FOLLOWLOCATION, True)
curl.setopt(curl.MAXREDIRS, 3)
# Save the response data
curl.setopt(curl.WRITEDATA, responseBuffer)
# Make the request
curl.perform()
# Decode the response
htmlResponse = responseBuffer.getvalue().decode('utf-8')
# Add the information to the parser
soup = BeautifulSoup(htmlResponse, 'html.parser')
# Extract articles
articles = soup.find_all('div', class_='post-info-wrap')  # Use class_ instead of class
# Loop through all the articles
for article in articles:
    title = article.find('h2', class_='post-title').text.strip()  # Use class_ instead of class
    author = article.find('a', class_='post-author').text.strip()  # Use class_ instead of class

    print("Title:", title)
    print("Author:", author)
    print("-" * 25)

curl.close()
responseBuffer.close()

In this script, we make a GET request to the Apify blog. We use setopt() to verify the SSL/TLS certificate and to follow redirects. In the end, we use Beautiful Soup to parse the HTML response. First, we retrieve the posts on the page, and then we extract the title and author of each post.
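
One thing to watch for in scripts like this: Beautiful Soup's find() returns None when no element matches, so calling .text on the result raises an AttributeError if the page layout changes. A small guard avoids that. The helper name text_or_default is hypothetical, and the FakeNode class only stands in for a Beautiful Soup element so the sketch runs without the package installed.

```python
def text_or_default(node, default="N/A"):
    """Return the stripped text of a node, or a default if the node is None."""
    if node is None:
        return default
    return node.get_text(strip=True)


class FakeNode:
    """Stand-in for a Beautiful Soup element, used only for this demonstration."""

    def get_text(self, strip=False):
        return "Hello" if strip else "  Hello  "


print(text_or_default(FakeNode()))
print(text_or_default(None))
```

In the script above, it would be used as text_or_default(article.find('h2', class_='post-title')).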

I hope this tutorial has helped you understand how to use PycURL! For more information and other tutorials about using Python for web scraping, check out the articles below.

📝 How to parse JSON with Python

📝 What are the best Python web scraping libraries?

📝 Web scraping with Python

📝 Web scraping with Python Requests

📝 Web scraping with Scrapy

📝 Web scraping with Beautiful Soup and Requests

📝 Web scraping with Selenium and Python
