Lighter Web Scraping Using NodeJS
Easily build a lightweight web scraper using NodeJS
An alternative way to do web scraping using NodeJS
If you search for web scraping using NodeJS, Puppeteer examples and articles will probably come up. It is an awesome library for complex web scraping because you are actually automating a browser when you use it. That said, I think it’s overkill for simpler web scraping. So in this article, we’ll look at how to scrape data from the web without using Puppeteer.
To do this we need to solve two problems. The first is how to get the website’s HTML code. Once that’s solved, the second is how to extract the actual data we need from that HTML.
Let’s start coding! First, scaffold a new Node project by running
yarn init -y
Now that we have a project ready to use, let’s install some dependencies
yarn add axios cheerio
The first one is axios. You might be familiar with this package because it’s a popular choice for making HTTP requests. Nowadays we usually use it to interact with an API and get the result as JSON, but there’s a setting we can tweak so the response comes back as HTML instead of JSON.
The second one is cheerio. Taken from its NPM package description, it’s a “Fast, flexible & lean implementation of core jQuery designed specifically for the server”. I think that explains it really well. Basically, with this package, we can run jQuery-style queries on the server.
Building The Scraper
We'll be using the https://books.toscrape.com/ website to test our scraper. First off, create a file called index.js in your project's root folder. We’ll use this file to build our scraper.
From the list of books on the website we'll grab a few details about each book, including its rating and availability.
Let's get coding!
First, we import both axios and cheerio, and then we create an async function to hold our scraping logic.
Now let's grab the HTML code from the website using axios and load it into cheerio so we can query the data.
Inspecting the website shows how the book listing is marked up. This will help us get the data.
With that information, let's grab the book elements first. We can do that by using cheerio like this
Alright, we got the books. Now it's time to grab the simple data first: the values we can read directly from the element.
After that's done, we can grab the data that's a bit more complicated: the rating and the availability.
First off, for the rating we can grab the p element and check its class, because the class contains the book's star rating as a word (e.g. Three). Next up, for the availability we can check whether there is any element with a class of .instock.availability. We query for both classes: the .availability class to make sure the element really is about availability, and the .instock class to show that the book is available.
All done! This is what the complete code looks like
I think this is the simplest way to do web scraping, and there are some pros and cons of doing it this way.
Pros
Simpler to build
Fewer resources needed (a library like Puppeteer needs to install Chromium to run)
Smaller package size
Cons
Cannot scrape a website where navigation is needed (sign in, scroll, etc.)
Cannot take a screenshot of the page
In the end, it depends on which website you want to scrape and what data you want to get. If you want to get something from a complex website then yes, use something like Puppeteer! It has a powerful API and lets you interact with a complex website. But if you need something simple, then cheerio might be a better choice.