

Saturday, 7 May 2022

Web Scraping, Data Wrangling/Analysis with Pandas (Python) is as easy as 1-2-3

Introduction

Pandas is so amazing for data wrangling and analysis. I have written about Pandas several times before, as it is the de facto data wrangling and analysis tool in the Python Data Science space. In one past post – Data Analytics: Why I chose Python instead of R – I articulated the reasons why I chose Python over R for Data Analytics. I could just as well have titled that post “Data Analytics: Why I chose Pandas instead of R”.

Much earlier, in another post – Web Scraping with Python (BeautifulSoup and Pandas) – I cited some use cases of web scraping with Python, i.e. harvesting, analysing and plotting Forex rates and weather data from the web. These are practical use cases, and there are countless others, usually for monitoring something, be it prices, news, political sentiment, epidemics, etc. In that post I compared BeautifulSoup and Pandas, but without code samples.

Today, I want to be more specific and show you how simply you can extract an HTML table from a web page using Pandas, and then do some wrangling and analysis. Code snippets are included.

We shall see examples of scraping Coronavirus and Forex rates data with some wrangling and analysis.


First understanding Web Scraping


Let us first understand what web scraping is. In simple terms, web scraping is the process of extracting data from web pages using some automated means.

Some websites like Twitter already provide APIs that allow one to extract the data. In such cases, you can simply use the provided APIs. But other websites do not have such APIs, which means that we have to find another way of extracting the data.

In the event that we are using our own custom code to extract web data, we have to be mindful of possible restrictions on the website. Does the website allow us to scrape it? How often can we scrape it? Remember that an automated scraping tool might bring a website down due to excessive activity. A simple check is the site's robots.txt, if it exists: go to https://url/robots.txt, e.g. https://wikipedia.org/robots.txt, and search for the word “crawl” for some useful pointers.
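As a quick illustration of the robots.txt check above, Python's standard library ships a parser for it. The sketch below parses a made-up robots.txt from a string so nothing is fetched here; against a real site you would call rp.set_url(...) and rp.read() instead:

```python
from urllib.robotparser import RobotFileParser

# A tiny, made-up robots.txt, parsed from a string for illustration;
# in practice: rp.set_url('https://example.com/robots.txt'); rp.read()
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "/private/page.html"))  # False: disallowed
print(rp.can_fetch("*", "/public/page.html"))   # True: allowed
print(rp.crawl_delay("*"))                      # 5 (seconds between requests)
```

Honouring the Crawl-delay value in your scraper is an easy way to stay on the right side of a site's rules.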

You need to understand a website well before you can do any scraping. Read any policies that may put you at legal crossroads with the website owners. Also understand the structure of the website well by inspecting it in the browser so that you know what and how to extract.


Example 1


The first example is to scrape this site, https://www.worldometers.info/coronavirus/, which contains updates of coronavirus data. Within the site, there is a table like below:



I would like to scrape the above table using its table id (main_table_countries_today) with the help of pd.read_html() as in the following code:

import pandas as pd
import requests

url = 'https://www.worldometers.info/coronavirus/'
# Some sites reject requests without a browser-like User-Agent
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

# Fetch the page, then let Pandas parse the table with the given id
df = pd.read_html(
    requests.get(url, headers=header).text,
    displayed_only=False,
    attrs={'id': 'main_table_countries_today'},
)[0]

df = format_df(df)  # a custom formatting function of mine, not shown here


The formatted output would be as follows, but you can now do further analysis if required. Note that this particular output is NOT taken from a Jupyter Notebook (the de facto interactive environment for data science); it is from within a custom web app.
 


As you can see above, the code is quite simple, yet powerful. Now the onus is upon you to do any further data analysis of the scraped data using Pandas.
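For instance, deriving a new column from the scraped table is a one-liner in Pandas. The sketch below uses toy figures in place of the live table (the column names match the scraped one), computing cases per million inhabitants:

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative figures only)
df = pd.DataFrame({
    'Country,Other': ['A', 'B'],
    'TotalCases': [1000, 5000],
    'Population': [1_000_000, 2_000_000],
})

# Derive cases per million inhabitants
df['CasesPerMillion'] = df['TotalCases'] / df['Population'] * 1_000_000

print(df['CasesPerMillion'].tolist())  # [1000.0, 2500.0]
```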


Example 2

Let's now look at a second example: scraping the following URL for Forex rates using the "class" attribute. This one is simpler and more straightforward because we do not have to provide additional header information as in the first example.

With the following code:

url = 'https://www.bou.or.ug/bou/bouwebsite/BOU-HOME'
df = pd.read_html(url, attrs={'class': 'myTable'})[0]
df = format_df(df)  # a custom formatting function of mine, not shown here

df

we get the output below:



Very simple stuff! Imagine scraping this data once a day and storing it cumulatively; you could then automate an informative time-series analysis of Forex rates.
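A minimal sketch of that daily capture could look like this (the file path, the helper name, and the idea of stamping each snapshot with today's date are my own assumptions, not part of the original code):

```python
import os
from datetime import date

import pandas as pd


def append_daily_rates(df, path='forex_history.csv'):
    """Stamp today's rates with the date and append them to a cumulative CSV."""
    snapshot = df.copy()
    snapshot['Date'] = date.today().isoformat()
    # Write the header only the first time the file is created
    snapshot.to_csv(path, mode='a', header=not os.path.exists(path), index=False)


# e.g. run once per day (cron / Task Scheduler) after scraping:
# append_daily_rates(df)
```

Once a few weeks of rows have accumulated, pd.read_csv(path, parse_dates=['Date']) gives you a ready-made time series to plot.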

Before we move on to the next section, let's see a slightly processed version of the above, where we now set the header and index as we read the HTML. We also cast the column 'Buying' to int, and 'Selling' to Decimal with open precision but a scale of 3, i.e. '{0: .3f}'.

from decimal import Decimal

url = 'https://www.bou.or.ug/bou/bouwebsite/BOU-HOME'
# Use the first row as the header and the first column as the index
df = pd.read_html(url, attrs={'class': 'myTable'}, header=[0], index_col=[0])[0]

df.fillna(0, inplace=True)
df = df.astype({'Buying': 'int64'})
df['Selling'] = df['Selling'].apply(Decimal).apply(lambda x: '{0: .3f}'.format(x))


and the output is:





Wrapping up with some Gymnastics (Data Wrangling and Analysis)

Let's wrap up with some gymnastics on the scraped Coronavirus data from Example 1. Note that the formatted output in Example 1 was displayed within a web app. To remove the formatting (styling), we can simply use df.data (the data attribute of the Styler object), which gives back the raw/original dataframe that can then be analysed.

Doing the following:

raw_df = df.data #retrieve the original dataframe from the styler object using the "data" attribute
raw_df.to_csv('D:/Temp/coronavirus.csv') #let's also save raw_df to csv for future use
raw_df


will generate the following (non-styled dataframe):




Let's now try some minimal data wrangling. We will select only the columns important for our analysis, drop rows with missing data, fill remaining missing values, set an appropriate index, and then sort the data. That is enough data wrangling for our purposes now, allowing us to do some charting/plotting at the end.

#select only these columns (note .copy() so later edits do not touch raw_df)
new_data = raw_df[['Country,Other', 'TotalCases', 'NewCases', 'TotalDeaths', 'NewDeaths', 'TotalRecovered', 'NewRecovered', 'Population']].copy()

#drop rows with missing (nan) population
#new_data.dropna(subset=['Population'], inplace=True)
new_data = new_data[new_data['Population'].notna()]

#fill cells with missing values
new_data.fillna(0, inplace=True)

#rename column 'Country,Other'
new_data.rename(columns={'Country,Other': 'Country'}, inplace=True)

#make 'Country' the index
new_data.set_index(['Country'], drop=True, inplace=True)

#sort by Country name, alphabetical ascending
#new_data.sort_index(ascending=True, inplace=True)

#sort by length of Country name, ascending
#new_data.sort_index(key=lambda x: x.str.len(), ascending=True, inplace=True)

#sort by Population, descending
new_data.sort_values(['Population'], ascending=False, inplace=True)

#list the index
#new_data.index

new_data

Our output will be:




Let us now do a bit of crude plotting. We shall make a horizontal bar chart (barh) of new cases by country, but only for the five most populated countries.

# horizontal bar chart
new_data['NewCases'].iloc[:5].plot(kind='barh', title='New Cases for five most populated countries')

# pie chart
# new_data['NewCases'].iloc[:5].plot(kind='pie', title='New Cases for five most populated countries')


The output will be:




For better presentation-quality charts, we can use specialized charting Python libraries such as Seaborn or Plotly. They are quite easy to use.

That is fair enough for today, as we are just scratching the surface of Pandas. Please read more about Pandas by searching my earlier posts, or posts elsewhere, then come back here and do further analysis with the scraped data.

As with anything else in programming, doing something every day will quickly make you an expert in it. That is the only sure way to master the art.
