Web scraping - the process of acquiring large amounts of data from online sources such as websites - can be incredibly useful for tasks like tracking competitor prices and analyzing market trends.
How exactly does Puppeteer interact with Headless Chrome? Essentially, Puppeteer is a Node.js library that provides a high-level API to control Chromium over the DevTools Protocol. Headless Chrome, on the other hand, operates without a GUI, making it possible to run background processes, including automated tasks.
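To make this concrete, here is a minimal sketch of driving Headless Chrome through Puppeteer (this assumes Puppeteer is installed via `npm install puppeteer` and an ES-module context; it is an illustration, not part of the Actor template itself):

```javascript
// Minimal sketch: launch Headless Chrome via Puppeteer and read a page title.
// Assumes `npm install puppeteer` and an ESM context (top-level await).
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true }); // no GUI
const page = await browser.newPage();
await page.goto('https://example.com');
console.log(await page.title()); // title read from the headless session
await browser.close();
```

Behind the scenes, each of these calls is translated into DevTools Protocol messages sent to the browser process.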
The storage section makes it easy to store and manage arbitrary amounts of data generated from the different automation processes.
Setting up the template in your development environment
To scrape stock data from Nasdaq, we’ll set up the template locally, make several code changes, and deploy it to the Apify platform.
Now, let's make a few changes to the Actor's main code files: src/main.js and src/routes.js. Generally, the code in src/main.js creates an instance of the PuppeteerCrawler class, specifies the start URLs targeted for scraping, and configures a dynamic proxy setup. The crawler instance then manages how HTTP requests are handled, ensuring that data is extracted and stored efficiently.
Since we want to scrape Nasdaq for stock data, the only change we will make to this file is updating the default start URL that the Actor will crawl through—go ahead and make this change to your code.
The startUrls variable is determined by the input provided to the Actor. If an input object exists, the Actor uses the URL specified in it; otherwise, it falls back to a predetermined default URL. In practice, the input object is included when a user specifies a URL in the input console on the platform.
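To make the fallback concrete, here is a self-contained sketch of that logic (the default URL and field names here are assumptions for illustration, not the template's exact values):

```javascript
// Hypothetical default; the template ships its own value.
const DEFAULT_START_URLS = [{ url: 'https://www.nasdaq.com/market-activity/stocks/aapl' }];

// Mirrors the template's pattern: prefer startUrls from the Actor input,
// otherwise fall back to the predetermined default.
function resolveStartUrls(input) {
    return input?.startUrls?.length ? input.startUrls : DEFAULT_START_URLS;
}

console.log(resolveStartUrls(null)[0].url); // the default Nasdaq URL
console.log(resolveStartUrls({ startUrls: [{ url: 'https://www.nasdaq.com/market-activity/stocks/msft' }] })[0].url);
```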
With that, let's proceed to update the src/routes.js file. This file contains the code that defines how the Actor handles and processes HTTP requests during the web scraping operation.
The code initializes a router instance and defines a default route handler using the router.addDefaultHandler method. This handler enqueues URLs based on specified glob patterns, which are strings that indicate precisely which URLs should be added to the request queue for crawling.
Now, to make the Actor crawl Nasdaq-related URLs, we need to update the globs property within the enqueueLinks function by adding the Nasdaq stock route URL.
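The glob below is an assumption about the Nasdaq paths you might target; in src/routes.js it would be passed as `globs: [NASDAQ_GLOB]` inside the `enqueueLinks` call. The tiny matcher is for illustration only, to show which URLs such a pattern admits (Crawlee performs the real matching internally):

```javascript
// Assumed glob for Nasdaq stock pages; adjust to the routes you actually want.
const NASDAQ_GLOB = 'https://www.nasdaq.com/market-activity/stocks/**';

// Illustrative matcher: treat '**' as "match anything". Crawlee's own glob
// handling is more complete; this only demonstrates the filtering idea.
function matchesGlob(glob, url) {
    const parts = glob.split('**').map((s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
    return new RegExp(`^${parts.join('.*')}$`).test(url);
}

console.log(matchesGlob(NASDAQ_GLOB, 'https://www.nasdaq.com/market-activity/stocks/aapl')); // true
console.log(matchesGlob(NASDAQ_GLOB, 'https://www.nasdaq.com/news/some-article'));           // false
```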
Once the data has been obtained, the Actor uses the Dataset.pushData method to store the results in the default dataset. With these modifications, you are now ready to deploy the Actor.
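For context, the stored object is plain JSON. A hypothetical record builder (field names are illustrative, not the template's actual output) might look like this before the result is handed to `Dataset.pushData`:

```javascript
// Hypothetical record builder; field names are illustrative only.
function buildStockRecord({ symbol, price, change }) {
    return {
        symbol,
        price: Number(price),   // normalize scraped strings to numbers
        change: Number(change),
        scrapedAt: new Date().toISOString(),
    };
}

const record = buildStockRecord({ symbol: 'AAPL', price: '189.84', change: '-1.23' });
console.log(record.symbol, record.price); // AAPL 189.84
// Inside the route handler this would then be stored with:
//   await Dataset.pushData(record);
```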
Deploying and running the Actor
To deploy the Actor on Apify, you need to first log in to Apify via the command line. To do that, you will need to obtain your Apify API token. Then, run the following command in your terminal to input the token and initiate the login process:
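With the Apify CLI installed (`npm install -g apify-cli`), logging in with a token looks like this (the placeholder stands in for the token found in your Apify Console account settings):

```shell
# Log in to the Apify platform; <YOUR_APIFY_TOKEN> is a placeholder
# for your personal API token from Apify Console.
apify login -t <YOUR_APIFY_TOKEN>
```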
Once logged in, use the following command to deploy the Actor on the Apify platform.
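Run this from the root of the Actor project directory:

```shell
# Build and upload the Actor from the current directory to the Apify platform
apify push
```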
You can then find your newly created Actor under the Actors tab on Apify Console.
Running the Actor on Apify
Now, simply click the Start button to build and run your Actor. While the Actor is running, you can view the scraped data in the Actor’s Output tab.
To access all the datasets, simply navigate to the Storage tab. You can choose to download the data, view the data in another tab, or preview the data. Additionally, you can export the data in any of the supported data formats.
Regardless of the project you're working on, the time it takes to ship a specific feature or an entire application is an essential factor. By utilizing this Actor, you can greatly improve your development workflow and accelerate the deployment process. For instance, consider a scenario where you're developing a stock prediction application. With this template, you can focus on building and training the prediction ML model using the scraped data rather than creating an entirely new service from scratch to handle this task. This approach saves time and resources, allowing you to optimize your efforts and expedite the time-to-market for your application.
Explore alternative automation and web scraping SDKs and templates
And that is it! You can now go ahead and explore other interesting ways in which you can customize the Puppeteer and Headless Chrome Actor to better suit your scraping needs. Also, consider exploring other available templates and SDKs to test out various web scraping and automation features. You might like to start by learning how to build a web scraper with the Apify CLI and Python templates.