An alternative way to do web scraping with Node.js
If you search for web scraping with Node.js, Puppeteer examples and articles will probably come up. It is an awesome library for complex web scraping because you are actually automating a browser when you use it. With that said, I think it's overkill for simpler web scraping. So in this article, we'll look at how we can scrape data from the web without using Puppeteer.
Getting Started
To do this we need to solve two problems. The first is how to get the website's HTML code. Once that's solved, the second is how to get the actual data that we need out of that HTML.
Let’s start coding! First, scaffold a new Node project by running
yarn init -y
Now that we have a project ready to use, let’s install some dependencies
yarn add axios cheerio
Axios
You might be familiar with this package because it's a popular choice for making HTTP requests. Nowadays we usually use it to talk to an API and get the result back as JSON, but there's a setting (the responseType request option) that we can tweak so the response comes back as HTML text instead of JSON.
Cheerio
Taken from their npm package description, it's a "Fast, flexible & lean implementation of core jQuery designed specifically for the server". I think that explains it really well. Basically, with this package we can run jQuery commands on the server.
Building The Scraper
We'll be using the https://books.toscrape.com/ website to test our scraper. First off, create a file called index.js in your project folder root. We'll use this file to build our scraper.
From the list of books on the website we'll grab a couple of things including:
Title
Price
Cover Image
Rating
Availability
URL
Let's get coding!
First, we import both axios and cheerio, and then we create an async function called scrape.
Now let's grab the HTML code from the website using axios and load it into cheerio so we can query the data.
After inspecting the website we can see that the book listing looks like this. This will help us get the data.
With that information, let's grab the book elements first. We can do that by using cheerio like this
Alright, we've got the books. Now it's time to grab the simple data first: the values that we can read directly from the element
After that's done, we can also grab the data that's a bit more complicated, like the rating, availability, and url.
First off, for the rating we can grab the p element and check its class, because the class name contains the book's rating as a word (e.g. Three). Next up, for the availability we can check whether there is any element with the classes .instock.availability. We query for both classes at once to make sure that the .instock class really belongs to the availability element, and that the .availability element has the .instock class to show that the book is available.
All done! This is what the complete code looks like
Conclusion
I think this is the simplest way to do web scraping, and there are some pros and cons of doing it this way.
Pros
Simpler to build
Fewer resources needed (a library like Puppeteer has to download and run a Chromium browser)
Smaller package size
Cons
Cannot scrape a website where navigation is needed (sign in, scroll, etc.)
Cannot take a screenshot of the page
In the end, it depends on which website you want to scrape and what data you want to get. If you want to get something from a complex website then yes, use something like Puppeteer! It has a powerful API and lets you interact with a complex website. But if you need something simple, then axios and cheerio might be a better choice.
Resources
Here are some resources for all the things that I've mentioned in this tutorial
Cheerio: https://github.com/cheeriojs/cheerio
Puppeteer: https://github.com/puppeteer/puppeteer