Pandas Quick Start Guide
What is Pandas?
Pandas is a Python library that used to read, clean, visualize, and process a dataset. Pandas uses a high-performance data analysis functions to manipulate data a structures which called data frame. You can compare this to a table in tools like Excel. Learn Python with Codecademy. Master the skill that your future employer is looking for. Sign up for Codecademy Pro today and save 50% on an annual plan!
How to Install Pandas?
To install pandas all that you need to install Python. If you havent, check out the link here https://www.python.org/downloads/. Once installed, use pip to install Pandas like that. If the alert “Requirement already satisfied” is triggered then Pandas already installed in your device.
!pip install pandas
Importing Pandas library.
You need to import the library using the “import ” command combined with the name of the library. Typically, you can add an alias to the library name for simplicity like “import pandas as pd”. Now, the alias can be used in place of the function name. For example, which is “pd.function_name()”.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Reading the data using Pandas DataFrame
Pandas uses a high-performance data structure to store the data in it. This “DataFrame” is a two dimension which can handle data from different formats like json, xlxs, CSV, and much more. The data frame is similar in structure to tables in Database Management Systems (DBMS). To read the dataset in the data frame a read function is used. For example, read_csv() function if your data in CSV format or read_json(), read_excel(), read_hdf() etc. Typically it follows according to the data format.
#csv_data = pd.read_csv("file_name.csv")
#excel_data = pd.read_excel("file_name.excel")
#HtMl_data = pd.read_html("file_name.html")
Build a Dummy DataFrame using python Dictionary
Before working on a real-life dataset sometimes we need to build a dummy data frame to try in it before applying the methods to the real DataFrame to do this all that we need is to use Dataframe() function in pandas and give it the dictionary as we can see in the snippet code below.
data_dict = {"id":[1,2,3,4,5,6,7,8,9],
"age":[21,34,np.nan,25,50,46,34,20,44],
"sex":['M','F','F','M','M','F','M','F','M'],
"Status":['married','married','divorced','single','single','single','married','divorced','married'],
"Salary":['100k','50k','60k','120k','90k',np.nan,'30k','160k','200k']}
df = pd.DataFrame(data_dict)
df
Cleaning the Dataset using Pandas
First, we need to define what we mean by “cleaning the Data”. All the real-world datasets have problems like missing in data, incorrect naming of features, inappropriate values of the feature like time in minus, The insufficient number of features and much more. So, we need to clean this dataset to be able to start using it. Pandas and DataFrames have a numerous number of functions for handling these problems in data like isna() and isnull() functions in Pandas which check the existence of missing values and dropna(), fillna() which handle these missing variables in the dataset. So, we can choose between to drop these missing values or fill it with an appropriate value like the median of the values of the feature that has this missing.
df.isnull().any()
Here we used fillna() function to fill the missing value in this salary column with the value ’70k’.
Notice : We can use the median value of the feature to fill the missing values in these features. But the feature values must be numerical to be able to calculate this fea
df['Salary'].fillna('70k')
The second way to handle the missing values is to drop any rows have any missing in the data using dropna() function. like the way below.
df = df.dropna()
As we can notice in the DataFrame below the index value is not sequenced after dropping the row that was had missing. So, to make the index sequenced like 1,2,3,4,5, etc. We need to use reset_index() function.
df.reset_index()
Replacing the String or object column to a numerical
As we have mentioned above we need to fill the missing values with the Median Salary. So, we need to convert the String column into a numerical column. But the process is not straightforward because we cannot convert the column directly because of the existence of ‘k’ that represents 1000. So, we need to convert the k first to 000 this can be handled easily using replace() function as below.
df['Salary'] = df['Salary'].str.replace('k','000')
df['Salary']
Now, we can easily convert the string into numerical using astype() function.
Hint: astype() function takes one parameter that indicates the type which we want to convert to it like that astype(float) that convert the column into the float.
df['Salary'] = df['Salary'].astype(float)
df = df.reset_index()
Now, we can fill the missing values of the salary column with the median value of the feature using the fillna() function we mentioned before.
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
Dropping Columns
There many cases we need to drop a column in the data that represent redundancy in our data or if we implement the machine learning model we need to drop the target feature to train the model on it. So, Dropping the column from DataFrame is an essential task in cleaning the data. So, in pandas, we have two ways of dropping the column.
1- Using drop() function this way the DataFrame make a copy from the original DataFrame without the dropped column
2- The Second way is to use ‘del’ to remove the column and this removes the column permanently from the DataFrame and we will need to read the data again from the CSV file to restore the column.
df
As we can see in the DataFrame above we have a redundant column called index result from applying reset_index(). So, we need to drop this column. The first method we use drop() function as below.
Hint: We need to specify the axis we want to drop from it
axis = 0 ->> represent rows
axis = 1 ->> represent columns
data_copy = df.drop('index', axis = 1)
Now, data_copy has a copy of the df without the index column as we can see below.
Hint head() function display the first 5 rows of the data. We can determine the number of the displayed column by sending the number to the function as a parameter.
data_copy.head()
df Dataframe still as it the drop function didn’t impact it. As we can see below the index column still exist.
df.head()
Now, we will try using del to remove the index column from the df Dataframe.
del df['index']
As we can see in the DataFrame below the index column has removed from our original DataFrame (df).
df
df['Data_Of_Birth'] = ['01-05-1998', '10-09-1985', '08-02-1994','09-08-1969','02-02-1985','02-09-1999','11-11-1975']
df
Now, we want to check the data types of Columns in DataFrame using dtypes and if we found the Data_Of_Birth column has not the date type we will convert it using astype(). As we can see the data type of column Date_Of_Brith is object so we need to convert it to a date data type.
df.dtypes
df['Data_Of_Birth'] = df['Data_Of_Birth'].astype('datetime64[ns]')
Working with date columns
Now, we can find that the column data type converted into the date.
df.dtypes
df
We can get the name of the day of the week from the date column using dt.dayofweek function. This function returns the days of the week as numerical values like 0, 1, 2, 3, 4, 5, 6. So, we need to map each value of these numbers into the name of weekdays.
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df['weekday'] = df['Data_Of_Birth'].dt.dayofweek.map(dayOfWeek)
Here we have generated the weekday column from the date column and in the same way we can split the data column into a number of features.
df
Slicing the DataFrame using loc and iloc functions
Selecting specific rows from the data frame is one of the most important tasks as well. This used to be able to choose specific rows based on the index or based on the positions. First, we need to explain the difference between the two functions loc and iloc.
iloc
iloc is used to slice the data frame based on the position that you specify. The square brackets with the iloc[] function allow you to select columns and rows. The comma separates the the colum and rows and a colon allows youto select between the columns or rows.
For example
df.iloc [:4] this means you want the first four rows regardless the index.
df.iloc[-4:] this means you want to get the last four rows of the data.
df.iloc[:3 , :3] this means you want to get the first 3×3 matrix of the data.
loc
2- loc: loc depends on the index and columns names in slicing the DataFrame for Example if “weekend” was the last column. You can get all the rows for the last columns by using loc as follows df.loc[:, ‘weekday’] whereas the same task can be achieved using iloc like that df.iloc(:,6). Also if our index are like the follow a,b,c,d,etc, to access the first row via loc we need to do that df.loc[‘a’] but using iloc we can do this as df.iloc[0]. Let’s try this using coding.
df.iloc[:3 , :3]
Here we can find that df.iloc[-4:] return the same results as df.loc[3:6] because df.loc[3:6] get the rows starting from index 3 to the index 6 and this the same we got using df.iloc[-4:] that return the last four rows. I hope the difference is clear now.
df.iloc[-4:]
df.loc[3:6]
df.loc[:, 'weekday']