Pandas is a great Python library for data wrangling and analysis. Consider Pandas an indispensable toolkit for Data Science within the Python space. For those with SQL-phobia, you can easily skip SQL for data manipulation and use Pandas instead, provided the raw data has already been extracted from the database or another source.
One common step in data wrangling is filtering data, also called subsetting (getting a subset). There are many ways to filter data in Pandas, and we shall look at them in this post. In SQL, we filter data with the WHERE clause. Both SQL and Pandas filters can get really complex depending on the requirement.
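As a quick taste of the SQL-to-Pandas translation, the sketch below shows a WHERE clause and its Pandas boolean-mask equivalent (the table and column names here are made up purely for illustration):

```python
import pandas as pd

# Hypothetical scores table, invented for this example
scores = pd.DataFrame({'name': ['Ann', 'Ben', 'Col'],
                       'math': [40, 80, 90]})

# SQL:    SELECT * FROM scores WHERE math > 50
# Pandas: a boolean mask inside square brackets
passed = scores[scores['math'] > 50]
print(passed['name'].tolist())  # ['Ben', 'Col']
```

The mask `scores['math'] > 50` is a Series of True/False values, and indexing the DataFrame with it keeps only the True rows.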
It is important to remember that the data format in Pandas is DataFrame (rows and columns) and Series (rows of a single column). Pandas works with structured data in rows and columns.
For anything in Pandas, make sure Pandas is installed on your system using
pip install pandas
Sample DataFrame
Now, as an example, suppose you have the following DataFrame.
# Loading a Sample Pandas Dataframe
import pandas as pd
df = pd.DataFrame.from_dict({ #works the same even without from_dict
'Student': ['Peter', 'Anderson', 'Kevin', 'Richards', 'Jane', 'Brown', 'Jackie'],
'Gender': ['M', 'M', 'F', 'M', 'F', 'M', 'F'],
'English': [90, 95, 75, 93, 60, 85, 75],
'Chemistry': [95, 95, 75, 65, 50, 85, 100],
'Math': [100, 95, 50, 75, 90, 50, 80]
})
df
... and the output is as follows

    Student Gender  English  Chemistry  Math
0     Peter      M       90         95   100
1  Anderson      M       95         95    95
2     Kevin      F       75         75    50
3  Richards      M       93         65    75
4      Jane      F       60         50    90
5     Brown      M       85         85    50
6    Jackie      F       75        100    80
Then you may want to filter only records (rows) that satisfy a certain condition, or you may want to filter only certain fields (columns).
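To make the rows-versus-columns distinction concrete, here is a minimal sketch using a cut-down copy of the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Peter', 'Anderson', 'Kevin', 'Richards', 'Jane', 'Brown', 'Jackie'],
    'Gender': ['M', 'M', 'F', 'M', 'F', 'M', 'F'],
    'English': [90, 95, 75, 93, 60, 85, 75],
})

rows = df[df['English'] > 90]       # filter records (rows) by a condition
cols = df[['Student', 'English']]   # filter fields (columns) by name

print(rows['Student'].tolist())     # ['Anderson', 'Richards']
print(list(cols.columns))           # ['Student', 'English']
```

Note the double brackets in the column case: a list of column names returns a DataFrame, while a single name returns a Series.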
Filtering a DataFrame
Let me now show a basic example of filtering based on the above sample.
Filter records with Gender as Male, using the simplest form
df_males = df[df['Gender']=='M']
... and the result is

    Student Gender  English  Chemistry  Math
0     Peter      M       90         95   100
1  Anderson      M       95         95    95
3  Richards      M       93         65    75
5     Brown      M       85         85    50
The following code snippet illustrates the various methods of filtering data in Pandas. Study each example closely, and you will see how straightforward they are once you use them consistently.
# Filters
##### iloc - integer position based indexing - not very flexible for filtering
#df_new = df.iloc[:, :] #all rows, columns
#df_new = df.iloc[:-2] #except last 2 rows
#df_new = df.iloc[-2:, [2,3]] #last 2 rows; third and fourth columns
##### loc - label based indexing (more powerful for filtering)
#df_new = df[df['Gender'] == 'M'] #Males only
#df_new = df[~(df['Gender'] == 'M')] #reverse using the tilde symbol (not Males) - note the parentheses, needed because ~ binds tighter than ==
#df_new = df.loc[df['Gender'] == 'M'] #Males only
#df_new = df.loc[df['Gender'] == 'M', ['English']] #Column English only for Males
#df_new = df.loc[((df['Gender'] == 'M') & (df['English'] > 90)), ['English']] #Column English only for Males with English > 90
#mask = (df['Gender'] == 'M') & (df['English'] > 90) #males who scored over 90 in English
#df_new = df.loc[mask, ['Gender', 'English']]
#df_new = df.loc[df.Student.str.startswith('J')] #name starts with J
#df_new = df.loc[df['Student'].str.startswith('J'), ['Gender']] #name starts with J - show Gender only
#df_new = df.loc[df['Student'].str.contains('J|A')] #name contains J or A
#df_new = df.loc[df['Student'].notnull()] #where name is not null
#df_new = df.loc[df.Student.isnull()] #where name is null
##### query - also flexible like loc
df_new = df.query('Gender == "M" & Math > 0')
#compare df_new = df.iloc[0:2] (2 rows, end position excluded) and df_new = df.loc[0:2] (3 rows - loc slices by label, and the end label is included)
df_new
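That last comment about iloc versus loc slicing is worth seeing in action. A minimal sketch on a small throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Peter', 'Anderson', 'Kevin'],
                   'Math': [100, 95, 50]})

# iloc slices like a Python list: the end position is EXCLUDED
by_position = df.iloc[0:2]   # rows at positions 0 and 1 -> 2 rows

# loc slices by label: the end label is INCLUDED
by_label = df.loc[0:2]       # labels 0, 1 and 2 -> 3 rows

print(len(by_position), len(by_label))  # 2 3
```

This difference matters most when the index is the default 0..n-1 range, because positions and labels then look identical while behaving differently.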
The above samples look easy, but the onus is on you to translate a given user requirement into code. Some requirements can get really complicated. As a Data Scientist or Data Engineer (what's in a name?), the only limitation is your imagination as you deal with coding requirements.
Wrap-up
As an interesting wrap-up, try out the following and see what it returns
df_new = df.loc[~df['English'].duplicated(keep='last')] #exclude (using ~), where English is duplicated, but keep the last record as 'not duplicated'
... and the output is (Kevin drops out, because his English score of 75 is duplicated by Jackie's, and keep='last' keeps hers)

    Student Gender  English  Chemistry  Math
0     Peter      M       90         95   100
1  Anderson      M       95         95    95
3  Richards      M       93         65    75
4      Jane      F       60         50    90
5     Brown      M       85         85    50
6    Jackie      F       75        100    80
And that wraps up this important post.