Carrying on from the last blog post, I am now going to shift attention to plotting categorical data with Seaborn. So let’s write our first few lines of code, which deal with importing the various packages and loading our Excel file into a DataFrame. The Excel file we are using can be downloaded by clicking the download link below.
import pandas as pd
import seaborn as sns
# if using Jupyter Notebooks, the line below allows us to display charts in the browser
%matplotlib inline
# load our data into a Pandas DataFrame
df = pd.read_excel('Financial Sample.xlsx')
# set the style we wish to use for our plots
sns.set_style("darkgrid")
# print the first 5 rows of data to ensure it is loaded correctly
df.head()
For categorical plots we are mainly concerned with seeing the distribution of a categorical column with reference to either one of the numerical columns or another categorical column. Let’s go ahead and plot the most basic categorical plot, which is a “barplot”. We need to pass in our x and y column names as arguments, along with the relevant DataFrame we are referring to.
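A minimal sketch of that call is below. The tiny DataFrame is made up so the snippet runs on its own (only the column names match the spreadsheet); in practice you would pass the df loaded earlier.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the spreadsheet, using the same column names
df = pd.DataFrame({
    "Country": ["Canada", "Canada", "France", "France", "Germany", "Germany"],
    "Units Sold": [1500, 2500, 1000, 3000, 800, 1200],
})

# One bar per country; the bar height is the mean of "Units Sold"
ax = sns.barplot(x="Country", y="Units Sold", data=df)
```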
A barplot is just a general plot which allows us to aggregate the data based on some function – the default function in this case is the mean. You can see from the plot above that we have chosen “Country” as the categorical column and “Units Sold” as the column whose mean (i.e. average) is presented. So we can now see the average “Units Sold” by “Country”.
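If you want to sanity-check the bar heights, the same averages can be computed directly with a pandas groupby. This is just a sketch on made-up numbers, not the real spreadsheet:

```python
import pandas as pd

# Made-up stand-in for the spreadsheet, using the same column names
df = pd.DataFrame({
    "Country": ["Canada", "Canada", "France", "France", "Germany", "Germany"],
    "Units Sold": [1500, 2500, 1000, 3000, 800, 1200],
})

# The default barplot estimator is the mean, so these values should match the bars
means = df.groupby("Country")["Units Sold"].mean()
print(means)
```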
We can change the “estimator object”, that is, the function by which we aggregate the data, by setting the estimator to a statistical function. Let’s import numpy and plot the standard deviation of the data based on the categorical variable “Country”.
import numpy as np
sns.barplot(x="Country", y="Units Sold", data=df, estimator=np.std)
As a quick note, the black line that you see crossing through the top of each data bar is actually the confidence interval for that data, with the default being the 95% confidence interval. If you are unsure about what confidence intervals are and need a quick brush up – please find some relevant info here.
Let’s now move on to a “countplot” – this is in essence the same as a barplot except the estimator is explicitly counting the number of occurrences. For that reason we only set the x data.
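A minimal sketch of a countplot call, again on a made-up stand-in DataFrame so it runs on its own. The value_counts line is only there as a cross-check of what the bars represent.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the spreadsheet; only the column name matches
df = pd.DataFrame({
    "Segment": ["Government", "Government", "Government",
                "Midmarket", "Midmarket", "Enterprise"],
})

# Only x is needed: the bar height is the number of rows in each category
ax = sns.countplot(x="Segment", data=df)

# The same counts computed directly with pandas
counts = df["Segment"].value_counts()
print(counts)
```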
Here we can see a countplot for the categorical “Segment” DataFrame column.
Now we can move on to boxplots and violinplots. These types of plots are used to show the distribution of a numerical variable across the levels of a categorical one. A boxplot is also sometimes called a “box and whisker” plot. It shows the distribution of quantitative data in a way that hopefully facilitates comparison between variables. Let’s create a box plot…
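Here is a minimal sketch of a boxplot of “Profit” by “Segment”. The data is randomly generated purely for illustration (the segment names and figures are assumptions, not the spreadsheet’s contents).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# One box per segment, summarising the distribution of "Profit"
ax = sns.boxplot(x="Segment", y="Profit", data=df)
```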
The boxplot shows the quartiles of the dataset, while the whiskers extend to show the rest of the distribution. The dots that appear outside of the whiskers are deemed to be outliers.
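To make “outlier” concrete, here is a small worked example, assuming the default whisker rule of 1.5 times the interquartile range (the “whis” default in Seaborn’s boxplot). The numbers are made up for illustration.

```python
import pandas as pd

# Made-up profit figures with one obviously extreme value
profit = pd.Series([10, 12, 14, 15, 16, 18, 20, 100])

q1 = profit.quantile(0.25)   # lower quartile (bottom of the box)
q3 = profit.quantile(0.75)   # upper quartile (top of the box)
iqr = q3 - q1                # interquartile range (height of the box)

# Points beyond these fences are drawn as individual dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = profit[(profit < lower_fence) | (profit > upper_fence)]
print(q1, q3, iqr, outliers.tolist())
```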
We can split up these boxplots even further based on another categorical variable, by introducing a “hue” element to the plot.
Now I see the profit split by “Segment” and also split by “Year”. This is really the power of Seaborn – to be able to add this whole new layer of data very quickly and very smoothly.
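A minimal sketch of adding “hue” to the boxplot, using “Year” as the second categorical variable. As before, the data is randomly generated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Year": np.repeat([2013, 2014], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# Each segment now gets one box per year, drawn side by side
ax = sns.boxplot(x="Segment", y="Profit", hue="Year", data=df)
```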
Let’s go on now to speak about violin plots. Let’s create a violin plot below:
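A minimal sketch of the violin plot call, on randomly generated stand-in data (the values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# Same arguments as the boxplot; each "violin" is a mirrored density estimate
ax = sns.violinplot(x="Segment", y="Profit", data=df)
```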
It’s very similar to a boxplot and takes exactly the same arguments. Unlike the boxplot, whose components all correspond to actual data points, the violinplot is essentially showing a kernel density estimate (KDE) of the underlying distribution. If we split the “violin” in half and lay it on its side, that is the KDE representation of the underlying distribution.
FYI the violinplot also allows you to add the “hue” element. However what it also allows you to do, which a box plot doesn’t, is to split the violin plot to show a different hue on each side. Let me show you below and it will become a lot clearer:
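A minimal sketch of a split violin plot. It assumes the hue column has exactly two levels (here two made-up years), so that each half of the violin can show one of them.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Year": np.repeat([2013, 2014], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# split=True draws one year on the left half and the other on the right half
ax = sns.violinplot(x="Segment", y="Profit", hue="Year", data=df, split=True)
```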
Let’s now move on to the “stripplot”. This is a scatter plot where one variable is categorical.
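A minimal sketch of a stripplot on randomly generated stand-in data. I pass jitter=False explicitly here so that the points sit in a single vertical line per category, since I believe more recent Seaborn releases switch jitter on by default.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# One dot per row, stacked in a single column above each segment
ax = sns.stripplot(x="Segment", y="Profit", data=df, jitter=False)
```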
One problem here is that it’s not always easy to see exactly how many individual points there are stacked up, as when they get too close to each other they merge together. One way to combat this is to add the “jitter” parameter as follows:
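A minimal sketch of the same stripplot with jitter switched on, again using randomly generated stand-in data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# jitter=True nudges each point sideways by a small random amount,
# so overlapping points become easier to tell apart
ax = sns.stripplot(x="Segment", y="Profit", data=df, jitter=True)
```

The jitter is applied along the categorical axis only, so the “Profit” values themselves are not changed.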
You can also use the “hue” and “split” parameters, similar to the boxplots and violin plots (in newer Seaborn releases the stripplot version of “split” is called “dodge”).
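A minimal sketch of a stripplot with “hue”. It assumes a Seaborn version where the parameter that separates the hue levels is named “dodge” (0.8 or later, as far as I’m aware); on older versions the equivalent was “split”. The data is randomly generated for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Year": np.repeat([2013, 2014], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# hue colours the points by year; dodge=True places the years side by side
ax = sns.stripplot(x="Segment", y="Profit", hue="Year", data=df, dodge=True)
```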
Another useful plot that kind of combines a stripplot and a violin plot, is a swarmplot. It’s probably just easiest to show you an example and you will no doubt understand what I mean.
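A minimal sketch of a swarmplot, on a small randomly generated stand-in dataset (kept deliberately small, as the points have to be arranged so they don’t overlap).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# Points are spread sideways so none overlap, which traces out the
# shape of the distribution much like a violin plot does
ax = sns.swarmplot(x="Segment", y="Profit", data=df)
```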
As an FYI swarmplots probably aren’t a great choice for really large datasets, as it’s quite computationally expensive to arrange the data points and it can also become quite difficult to fit all the data points on the chart – the swarm plots can become very wide!
Finally let’s look at “factorplots” – these are the most generic of the categorical plots we have come across (note that Seaborn 0.9 renamed “factorplot” to “catplot”). Using factorplots you can pass in your data and parameters and then specify the “kind” of plot that you want, whether that be e.g. a bar plot or a violin plot. I will show two quick examples of how to create a bar plot and a violin plot below.
# create bar plot with factorplot method
sns.factorplot(x="Segment", y="Profit", data=df, kind="bar")
# create violin plot with factorplot method
sns.factorplot(x="Segment", y="Profit", data=df, kind="violin")
I prefer to call the plot itself specifically, but just be aware that you can use “factorplot” and then specify the “kind”.
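If you are on Seaborn 0.9 or later, the same idea can be sketched with “catplot”, which is the newer name for this function. The data below is randomly generated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; fixed seed so the figure is repeatable
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "Segment": np.tile(["Government", "Midmarket"], n // 2),
    "Profit": rng.normal(20000, 8000, size=n),
})

# catplot returns a FacetGrid rather than a single Axes
bar_grid = sns.catplot(x="Segment", y="Profit", data=df, kind="bar")
violin_grid = sns.catplot(x="Segment", y="Profit", data=df, kind="violin")
```

Because a grid object is returned, the underlying Axes for a single-panel plot is available as bar_grid.ax if you need to tweak it further.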