Corona Data Scraping with Scrapy Python

Krishna Gaire
4 min read · Apr 29, 2020

We are going to scrape coronavirus data from the Worldometer website.

The Worldometers page we are about to scrape

Web scraping is the extraction of data from websites, such as text, videos, images, and emails. In this article, we use the Python framework Scrapy to scrape data from the Worldometer website.

There are other Python tools used for scraping, such as the Requests library and BeautifulSoup, but they can't stand alone for complex tasks; they are usually used for simpler jobs. This is where Scrapy shines.

Scrapy has 5 main components:

The 5 main components of Scrapy
  1. Spiders

Spiders are classes that define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e., follow links) and how to extract structured data from its pages (i.e., scrape items). In other words, spiders are where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

We have different types of spiders, such as scrapy.Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, and SitemapSpider.

2. Pipelines: Pipelines process the data we extract: cleaning it, removing duplicates, and storing it (see the sketch after this list).

3. Middlewares: Middlewares handle everything we do with the requests we send to the website and the responses we get back from it.

4. Engine: The engine is responsible for coordinating all the other components. In other words, it ensures the consistency of every operation that happens.

5. Scheduler: The scheduler is responsible for preserving the order of operations. Technically speaking, it is a simple data structure: a queue, which follows the FIFO (First In, First Out) principle.
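As a concrete illustration of a pipeline, here is a minimal sketch of one that cleans numeric fields and drops duplicate countries. The class name and item field names are hypothetical; they simply match the fields we will scrape later in this article.

from scrapy.exceptions import DropItem


class CleanCasesPipeline:
    def __init__(self):
        self.seen_countries = set()

    def process_item(self, item, spider):
        # Drop duplicate rows for the same country.
        country = item.get('country')
        if country in self.seen_countries:
            raise DropItem(f'Duplicate entry for {country}')
        self.seen_countries.add(country)
        # Turn numbers like '1,234' into plain integers.
        for field in ('total_cases', 'total_deaths', 'total_recovered'):
            if item.get(field):
                item[field] = int(item[field].replace(',', ''))
        return item

For Scrapy to run a pipeline, it must be enabled under ITEM_PIPELINES in the project's settings.py.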

Getting started with scraping:

  1. Create a virtual environment for this scraping project.
  2. Install dependencies like Scrapy, pylint, and autopep8, which can all be installed at once with the following command.

pip3 install scrapy pylint autopep8

We are done with the installation process, and now it's time to explore Scrapy. When we run the scrapy command in our terminal, the first line of output is:

Scrapy 2.0.1 - no active project

Scrapy 2.0.1 is the version of Scrapy we are using, and since we haven't created a project yet, it shows "no active project". After that we can see the usage line:

Scrapy <command> [options] [args]
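Below the usage line, Scrapy lists the commands available outside a project. The output looks roughly like this (abridged):

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy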

Creating the project:

Our main objective is to extract each country's name along with its total cases, total deaths, total recovered cases, active cases, and serious/critical cases into a CSV file.

  1. To create a basic project template, we run the following command.

scrapy startproject worldometer
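This command creates Scrapy's standard project layout, roughly:

worldometer/
    scrapy.cfg            # deploy configuration file
    worldometer/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # folder where our spiders will live
            __init__.py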

A new Scrapy project named 'worldometer' is created from the basic template, as shown above. Now we have to create a spider, which is where we actually write the code that extracts data from the website. To generate the spider, we run the following commands in the command prompt.

cd worldometer

scrapy genspider covid www.worldometers.info/coronavirus

genspider: the command that generates a new spider using a pre-defined template,

covid: the name of the spider we are about to create,

And at the end, we give the link of the page we are about to scrape. The original link is https://www.worldometers.info/coronavirus/; we remove the https:// from the front and the trailing / from the back to get www.worldometers.info/coronavirus, as in the following figure.

The basic spider template that genspider generates.
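For reference, the generated covid.py from Scrapy's basic template looks roughly like this:

import scrapy


class CovidSpider(scrapy.Spider):
    name = 'covid'
    allowed_domains = ['www.worldometers.info/coronavirus']
    start_urls = ['http://www.worldometers.info/coronavirus/']

    def parse(self, response):
        pass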

As we can see, the three attributes inside the CovidSpider class are:

name: the name of the spider, which must be unique within the project.

allowed_domains: originally you will see the allowed domain as generated above, but you should change it to just the domain, like ['www.worldometers.info'].

start_urls: you also have to modify this, e.g. to ['https://www.worldometers.info/coronavirus/'].

In the CovidSpider class, we have a parse method in which the outcome of each XPath expression (you need a good knowledge of XPath expressions for scraping) is stored in a variable, which is then yielded.

You need to provide the XPath expressions inside the spider.
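Here is a minimal sketch of what the finished spider could look like. The table id main_table_countries_today and the column positions are assumptions based on how the Worldometer page was laid out at the time of writing; verify them in your browser's inspector, since the page structure changes.

import scrapy


class CovidSpider(scrapy.Spider):
    name = 'covid'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/coronavirus/']

    def parse(self, response):
        # Each row of the main table is one country. The table id and the
        # column indexes below are assumptions; check them in the page source.
        rows = response.xpath('//table[@id="main_table_countries_today"]//tbody/tr')
        for row in rows:
            yield {
                'country': row.xpath('normalize-space(.//td[1])').get(),
                'total_cases': row.xpath('normalize-space(.//td[2])').get(),
                'total_deaths': row.xpath('normalize-space(.//td[4])').get(),
                'total_recovered': row.xpath('normalize-space(.//td[6])').get(),
                'active_cases': row.xpath('normalize-space(.//td[7])').get(),
                'serious_critical': row.xpath('normalize-space(.//td[8])').get(),
            }

Once the spider is ready, we run the crawl: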

scrapy crawl covid -o CoronaData_final.csv

The above command stores the scraped data in a CSV file. As we can see, the command is basically: scrapy crawl [spider name] -o [filename (.csv or .json)]

When you open the file you created, it will look like this. We got what we wanted.

Scraped Worldometer Data

Thank you, and happy coding!
