Pandas Cut – Continuous to Categorical
The pandas cut function, pd.cut(), is a convenient way to transform continuous data into categorical data. Why would you want to do this? Here are a few reasons you might use it.
Reasons to Cut and Bin Your Continuous Data into Categories
- Wide range of numerical data that will be more readable in groups
- Need for statistical analysis of groups for better insight
If you have continuous ages, you can create groupings or categories such as infants, children, adults and elderly. If you have thousands of observations, each with its own distinct value, it is better to group them into categorical bins.
The pd.cut Function
The pd.cut function has three main parts: the column to cut, the bins, which hold the cutoff points for the continuous data, and the labels. You will need to populate two lists, one with the cutoff points for your bins and one with the labels. The key here is that the number of labels must always be one less than the number of cutoff points. Each pair of consecutive numbers in the bins list defines one bin: the first number is the start point of the bin and the next number is its cutoff point.
pd.cut(column, bins=[ ], labels=[ ])
pd.cut(df.Age,bins=[0,2,17,65,99],labels=['Toddler/Baby','Child','Adult','Elderly'])
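From the code above you can see that the bins are as follows. By default each bin excludes its left edge and includes its right edge, so an age of exactly 0, or anything above 99, falls outside every bin and becomes NaN:
- Over 0, up to and including 2 = ‘Toddler/Baby’
- Over 2, up to and including 17 = ‘Child’
- Over 17, up to and including 65 = ‘Adult’
- Over 65, up to and including 99 = ‘Elderly’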
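To check how the bin edges behave, here is a minimal sketch using a handful of made-up ages (not the Titanic data) placed on and around the cutoff points. The variable names are only for illustration. It also shows the include_lowest parameter, which is one way to pull an age of exactly 0 into the first bin.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0, 0.5, 2, 2.5, 17, 18, 65, 66, 99])

bins = [0, 2, 17, 65, 99]
labels = ['Toddler/Baby', 'Child', 'Adult', 'Elderly']

# Default right=True: the intervals are (0, 2], (2, 17], (17, 65], (65, 99].
groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True makes the first interval also contain its left edge, 0.
groups_with_zero = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    'age': ages,
    'group': groups,
    'group_incl_0': groups_with_zero,
}))
```

With the default settings the age 0 comes back as NaN, while 2.5 lands in ‘Child’ rather than in a gap between 2 and 3.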
For this particular example, we are going to be using the Titanic dataset that you can find on Kaggle. This dataset has the age of the passengers.
Our goal is to convert continuous ages into categorical groups.
There are quite a few NaN values in the Age column, so to prepare the dataset you should remove or fill them. The next step is to isolate the “Age” column using df.Age notation. We then save the result in a variable and add that variable back to our data frame using the insert function.
# Add a new category column next to the Age column.
category = pd.cut(df.Age,bins=[0,2,17,65,99],labels=['Toddler/baby','Child','Adult','Elderly'])
df.insert(5,'Age Group',category)
The insert function places the new column at the position you specify. Here that is position 5, so that the new column sits next to the Age column.
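The same steps can be sketched on a tiny stand-in data frame. The rows below are invented for illustration and are not taken from the Titanic file. Instead of hard-coding the position, this version looks up where the Age column sits with columns.get_loc, which is an alternative to the fixed number used above and assumes the column names are unique.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic rows, including one missing age.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [22.0, np.nan, 1.0, 70.0],
    'Survived': [0, 1, 1, 0],
})

# Missing ages are not assigned to any bin; they stay NaN in the result.
category = pd.cut(
    df['Age'],
    bins=[0, 2, 17, 65, 99],
    labels=['Toddler/Baby', 'Child', 'Adult', 'Elderly'],
)

# Find the position of 'Age' and insert the new column right after it.
position = df.columns.get_loc('Age') + 1
df.insert(position, 'Age Group', category)

print(df)
```

If you skip the cleanup step, rows with a missing age simply keep a missing Age Group, so they drop out of later counts unless you fill or remove them first.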
Analyze your Categories
Now that we have this data in categories, we can analyze the categories. For example, value_counts with normalize=True returns the share of passengers in each age group.
df['Age Group'].value_counts(normalize=True)
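As one possible next step, the age groups can be used as a grouping key. The sketch below uses made-up rows with a Survived column (the name matches the Kaggle Titanic data, but the values here are invented) and computes both the share of each group and the mean of Survived per group.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    'Age': [1, 10, 30, 40, 70, 25],
    'Survived': [1, 1, 0, 1, 0, 0],
})
df['Age Group'] = pd.cut(
    df['Age'],
    bins=[0, 2, 17, 65, 99],
    labels=['Toddler/Baby', 'Child', 'Adult', 'Elderly'],
)

# Share of rows in each age group.
shares = df['Age Group'].value_counts(normalize=True)

# Mean of Survived per age group; observed=False keeps every category,
# including any with no rows.
survival_rate = df.groupby('Age Group', observed=False)['Survived'].mean()

print(shares)
print(survival_rate)
```

Because Survived is coded as 0 or 1, its mean per group reads directly as the fraction of that group that survived.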