The terms concurrency and parallelism are often used interchangeably in the world of computer science and programming, particularly in the context of multithreaded programs. However, while they do share the common goal of achieving more efficient and faster execution of tasks, they’re distinct in their approaches and applications. It’s important to know the difference if you want to make informed decisions about whether and how to apply them in the context of web scraping.
What follows will help you understand the meaning of these concepts. We’ll explore the distinctions between them, their relationship to multithreading and web scraping, and how they can be combined.
Concurrency and parallelism in multithreading
What is multithreading?
Multithreading is a programming and execution technique that allows a single process to run multiple threads of execution. In simpler terms, it enables a program to work on multiple tasks or operations within the same process, with all threads sharing that process’s memory.
A thread is the smallest sequence of instructions that an operating system’s scheduler can manage independently. Each thread represents an independent sequence of instructions that can run alongside other threads in the same program.
Multithreading, therefore, lets a single program, with the support of the operating system, handle multiple requests or operations in overlapping time periods.
What is concurrency?
Concurrency means that an application is making (or at least appears to be making) progress on more than one task at a time. A CPU switches between different tasks during execution to make this possible. With a single CPU core, tasks can’t literally run at the same instant, so each task advances a little at a time as the CPU moves between them.
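To make the task-switching idea concrete, here is a minimal sketch using Python's asyncio on a single thread. The `worker` and `run_demo` names are hypothetical and exist only for illustration. Each worker gives control back to the event loop after every step, so the two tasks take turns rather than running at the same instant.

```python
import asyncio


async def worker(name: str, steps: int, log: list[str]) -> None:
    """Hypothetical task that yields to the event loop after every step."""
    for step in range(1, steps + 1):
        log.append(f"{name}:{step}")
        # Awaiting hands control back to the event loop so another task can run
        await asyncio.sleep(0)


async def run_demo() -> list[str]:
    log: list[str] = []
    # Both workers are in progress together, but only one runs at any instant
    await asyncio.gather(worker("A", 3, log), worker("B", 3, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The printed log mixes entries from A and B, which shows that both tasks were in progress during the same period even though everything ran on one thread.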
What is parallelism?
Parallelism is closely related to concurrency, but it isn’t the same thing. It refers to an application splitting tasks up into smaller subtasks that are processed at the same time. For parallelism to work, your application needs multiple threads or processes, with each one running on a separate CPU core.
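As a rough illustration of splitting one task into subtasks, the sketch below counts prime numbers by dividing a range into chunks and handing each chunk to a separate worker process with Python's `concurrent.futures.ProcessPoolExecutor`. The helper names (`is_prime`, `count_primes`, `split_range`) and the range size are assumptions chosen for the example. Processes are used rather than threads because, in standard CPython, threads generally don't run CPU-bound Python code on several cores at once.

```python
from concurrent.futures import ProcessPoolExecutor


def is_prime(n: int) -> bool:
    """Trial division; slow on purpose so the work is CPU-bound."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(bounds: tuple[int, int]) -> int:
    """Subtask: count the primes in the half-open range [start, stop)."""
    start, stop = bounds
    return sum(1 for n in range(start, stop) if is_prime(n))


def split_range(start: int, stop: int, parts: int) -> list[tuple[int, int]]:
    """Split a non-empty range [start, stop) into contiguous chunks."""
    step = -(-(stop - start) // parts)  # ceiling division
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]


if __name__ == "__main__":
    chunks = split_range(0, 200_000, parts=4)
    # Each chunk is sent to a worker process that the OS can place on its own core
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(count_primes, chunks))
    print(total)
```

Whether this is actually faster than a plain loop depends on how many cores are available and on the cost of starting processes and moving data between them.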
When to use concurrency and when to use parallelism
Concurrency is primarily concerned with enabling efficient task-switching and resource-sharing in order to make the most of a single processor core. Tasks are interleaved or run in overlapping time intervals, which creates the illusion of simultaneous execution.
Concurrency is often used to enhance the responsiveness of applications, such as handling user interactions in graphical interfaces or managing multiple network connections.
Parallelism, on the other hand, is the simultaneous execution of multiple tasks with the explicit goal of speeding up computation. It relies on multiple processor cores, hardware threads, or even separate machines to work on different parts of a problem at the same time.
Parallelism is often used in computationally intensive tasks like data processing, scientific simulations, and rendering, where breaking a problem into smaller, parallelizable subtasks can lead to substantial performance gains.
Concurrency and parallelism combinations
While concurrency and parallelism are distinct, they’re not mutually exclusive. In fact, they can be combined to achieve optimal performance in some scenarios. Here are some combinations to consider:
Concurrent, but not parallel
When an application is concurrent but not parallel, it makes progress on more than one task during the same period, but never at the same instant. The application switches between the tasks, advancing each one in turn, until all of them are completed.
Parallel, but not concurrent
When an application is parallel but not concurrent, it means that it processes one task at a time in sequence. A task may be broken down into subtasks that are processed in parallel, but each task is completed before the next one starts.
Concurrent and parallel
When an application is both concurrent and parallel, it means that it processes multiple tasks or subtasks concurrently and executes them in parallel. There are two ways applications can be both concurrent and parallel:
- The application executes multiple threads on multiple CPUs.
- The application simultaneously works on multiple tasks and breaks down each task into subtasks for parallel execution.
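The second combination can be sketched in Python by letting asyncio juggle several tasks while each task hands its CPU-heavy part to an executor. Everything here is an assumption made for illustration: `checksum` stands in for real computation, `handle` for a real task, and the short sleep for real I/O. Passing a `ProcessPoolExecutor` gives parallel execution across cores, while passing `None` falls back to asyncio's default thread pool.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Optional


def checksum(text: str) -> int:
    """Hypothetical CPU-bound subtask: a simple checksum of the text."""
    return sum(ord(ch) for ch in text) % 65_536


async def handle(name: str, executor: Optional[Executor]) -> tuple[str, int]:
    # Stand-in for waiting on I/O; other tasks make progress during this await
    await asyncio.sleep(0.01)
    payload = name * 1_000
    loop = asyncio.get_running_loop()
    # Offload the CPU-bound part so the event loop stays free for other tasks
    return name, await loop.run_in_executor(executor, checksum, payload)


async def main(executor: Optional[Executor] = None) -> dict[str, int]:
    # Concurrency: all three tasks are in progress at once
    results = await asyncio.gather(*(handle(n, executor) for n in ("a", "b", "c")))
    return dict(results)


if __name__ == "__main__":
    # Parallelism: the checksum calls run in separate worker processes
    with ProcessPoolExecutor() as pool:
        print(asyncio.run(main(pool)))
```

Keeping the executor as a parameter makes it easy to switch between threads and processes and to compare the two on a real workload.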
Using concurrency and parallelism in web scraping
Web scraping involves extracting data from websites and often includes multiple requests and data processing. Here's how concurrency and parallelism can help with web scraping:
Concurrency is useful when scraping multiple websites or pages, because a scraper spends most of its time waiting for network responses. It allows you to issue multiple requests without waiting for each one to finish before sending the next, and to handle the responses as they arrive. Libraries like asyncio for Python and Crawlee for Node.js enable concurrent programming for web scraping.
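Here is a minimal asyncio sketch of issuing several requests concurrently while capping how many are in flight at once. To stay self-contained, the hypothetical `fetch` function sleeps instead of making a real HTTP request, and the example.com URLs and the `max_concurrency` parameter are placeholders. In a real scraper you would swap in an asynchronous HTTP client of your choice.

```python
import asyncio


async def fetch(url: str) -> str:
    """Hypothetical fetch: sleeps instead of making a real HTTP request."""
    await asyncio.sleep(0.05)
    return f"<html><title>{url}</title></html>"


async def fetch_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    # The semaphore caps how many requests are in flight at the same time
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    pages = asyncio.run(fetch_all(urls))
    print(len(pages))
```

Limiting concurrency matters in practice because sending too many requests at once can overload the target site or get your scraper blocked.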
Parallelism comes into play when you need to process large amounts of data obtained from web scraping. You can use parallelism to speed up parsing, data extraction, and storage processes, especially if the data requires complex transformation or analysis.
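As a sketch of the processing side, the example below extracts page titles from already-downloaded HTML in parallel, using only Python's standard library. `TitleParser` and `extract_title` are hypothetical helpers, and the sample pages are made up. A real project would likely use a dedicated parsing library and much larger inputs, which is where spreading the work across processes starts to pay off.

```python
from concurrent.futures import ProcessPoolExecutor
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Parse one page and return its title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


if __name__ == "__main__":
    pages = [f"<html><head><title>Page {i}</title></head></html>" for i in range(100)]
    # Each worker process parses its own batch of pages
    with ProcessPoolExecutor() as pool:
        titles = list(pool.map(extract_title, pages, chunksize=25))
    print(titles[:3])
```

For small pages like these, the overhead of sending data to worker processes can outweigh the gain, so it's worth measuring before adopting this approach.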
Which one should you choose?
Both concurrency and parallelism can speed up the web scraping process and can come in useful when you want to scale your crawlers and scrapers. But given that every project is unique and complexity varies from one project to the next, the only reasonable answer to the question of which one you should use is ‘it depends’.
Combining concurrency and parallelism can help in some cases, but it can also lead to performance loss or minimal performance gain. It can also make code too complex. So make sure you weigh up the pros and cons before you adopt any particular concurrent/parallel model.
You can learn more about concurrency and parallelism for scaling your web scraping projects in the Crawlee documentation. Crawlee is an open-source web scraping and browser automation library that helps you build reliable crawlers fast.