Carrying on from the last blog post, I am now going to shift attention to plotting categorical data with Seaborn. So let’s write our first few lines of code, which deal with importing the various packages and loading our Excel file into a DataFrame. The Excel file we are using can be downloaded by clicking the download link below.
import pandas as pd
import seaborn as sns

#if using Jupyter Notebooks the below line allows us to display charts in the browser
%matplotlib inline

#load our data in a Pandas DataFrame
df = pd.read_excel('Financial Sample.xlsx')

#set the style we wish to use for our plots
sns.set_style("darkgrid")

#print first 5 rows of data to ensure it is loaded correctly
df.head()
For categorical plots we are going to be mainly concerned with seeing the distribution of a categorical column with reference to either one of the numerical columns or another categorical column. Let’s go ahead and plot the most basic categorical plot, which is a “barplot”. We need to pass in our x and y column names as arguments, along with the relevant DataFrame we are referring to.
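The barplot call described here might look like the sketch below. So that it runs without the Excel file, it builds a small made-up DataFrame that reuses the “Country” and “Units Sold” column names; the values are invented purely for illustration.

```python
import pandas as pd
import seaborn as sns

# hypothetical stand-in for the Financial Sample data (values are made up)
df = pd.DataFrame({
    "Country": ["Canada", "Canada", "France", "France", "Germany", "Germany"],
    "Units Sold": [1000.0, 2000.0, 1500.0, 2500.0, 500.0, 1500.0],
})

# one bar per country; bar height is the mean of "Units Sold" (the default estimator)
ax = sns.barplot(x="Country", y="Units Sold", data=df)
```

With the real data you would simply pass the DataFrame loaded from the Excel file instead of the toy one.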
A barplot is just a general plot which allows us to aggregate the data based on some function – the default function in this case is the mean. You can see from the plot above that we have chosen the “Country” column as the categorical column, and the “Units Sold” column as the column for which we present the mean (i.e. average) of the relevant data held in the “Units Sold” column. So we can now see the average “Units Sold” by “Country”.
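As a rough cross-check of what the barplot is aggregating, the same per-country means can be computed directly with pandas. This again uses a small invented DataFrame rather than the real file, so the numbers are illustrative only.

```python
import pandas as pd

# hypothetical toy data with the same column names as the Excel file
df = pd.DataFrame({
    "Country": ["Canada", "Canada", "France", "France", "Germany", "Germany"],
    "Units Sold": [1000.0, 2000.0, 1500.0, 2500.0, 500.0, 1500.0],
})

# mean "Units Sold" per country, i.e. the values a default barplot draws as bar heights
means = df.groupby("Country")["Units Sold"].mean()
print(means)
```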
We can change the “estimator” – that is, the function by which we aggregate the data – by setting the estimator argument to a statistical function. Let’s import numpy and plot the standard deviation of the data based on the categorical variable “Country”.
import numpy as np
sns.barplot(x="Country",y="Units Sold",data=df,estimator=np.std)
As a quick note, the black line that you see crossing through the top of each data bar is actually the confidence interval for that data, with the default being the 95% confidence interval. If you are unsure about what confidence intervals are and need a quick brush up – please find some relevant info here.
Let’s now move on to a “countplot” – this is in essence the same as a barplot except the estimator is explicitly counting the number of occurrences. For that reason we only set the x data.
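A minimal sketch of a countplot on the “Segment” column follows. The DataFrame is a made-up stand-in for the Excel data, so only the column name matches the real file.

```python
import pandas as pd
import seaborn as sns

# hypothetical toy data: three Government rows, two Midmarket, one Enterprise
df = pd.DataFrame({
    "Segment": ["Government", "Government", "Government",
                "Midmarket", "Midmarket", "Enterprise"],
})

# only x is given; the bar height is the number of rows in each segment
ax = sns.countplot(x="Segment", data=df)
```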
Here we can see a countplot for the categorical “Segment” DataFrame column.
Now we can move on to boxplots and violinplots. These types of plots are used to show the distribution of a quantitative variable across the levels of a categorical variable. A boxplot is also sometimes called a “box and whisker” plot. It shows the distribution of quantitative data in a way that hopefully facilitates comparison between variables. Let’s create a box plot…
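A box plot of “Profit” by “Segment” could be produced along these lines. The data here is randomly generated as a placeholder for the Excel file, so the shapes will not match the real plot.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data: 20 random profit values for each of three segments
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 20),
    "Profit": rng.normal(10000, 4000, size=60),
})

# one box per segment summarising the distribution of "Profit"
ax = sns.boxplot(x="Segment", y="Profit", data=df)
```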
The boxplot shows the quartiles of the dataset, while the whiskers extend to show the rest of the distribution. The dots that appear outside of the whiskers are deemed to be outliers.
We can split up these boxplots even further based on another categorical variable, by introducing a “hue” element to the plot.
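Adding the “hue” element is a one-argument change, sketched below on invented data. The “Year” column is stored as strings in this toy example so that it is clearly treated as a category; that is an assumption of the sketch rather than a description of the Excel file.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data: three segments, two years, random profits
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 20),
    "Year": np.tile(["2013", "2014"], 30),
    "Profit": rng.normal(10000, 4000, size=60),
})

# each segment now gets one box per year, drawn side by side in different colours
ax = sns.boxplot(x="Segment", y="Profit", hue="Year", data=df)
```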
Now I see the profit split by “Segment” and also split by “Year”. This is really the power of Seaborn – to be able to add this whole new layer of data very quickly and very smoothly.
Let’s go on now to speak about violin plots. Let’s create a violin plot below:
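A basic violin plot takes the same x, y and data arguments as the boxplot. The sketch below uses randomly generated placeholder data, not the real Excel file.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data: 20 random profit values for each of three segments
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 20),
    "Profit": rng.normal(10000, 4000, size=60),
})

# one violin per segment; its outline is a kernel density estimate of "Profit"
ax = sns.violinplot(x="Segment", y="Profit", data=df)
```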
It’s very similar to a boxplot and takes exactly the same arguments. In a boxplot, all of the plot components correspond to actual data points; the violinplot instead shows a kernel density estimate (KDE) of the underlying distribution. If we split the “violin” in half and lay it on its side – that is the KDE representation of the underlying distribution.
FYI the violinplot also allows you to add the “hue” element. However what it also allows you to do, which a box plot doesn’t, is to split the violin plot to show a different hue on each side. Let me show you below and it will become a lot clearer:
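A split violin plot might be drawn as in this sketch, which uses made-up data with exactly two “Year” values so that each half of a violin shows one year. Storing “Year” as strings is an assumption made for the toy data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data: three segments, two years, random profits
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 20),
    "Year": np.tile(["2013", "2014"], 30),
    "Profit": rng.normal(10000, 4000, size=60),
})

# split=True draws one year on the left half of each violin and the other on the right
ax = sns.violinplot(x="Segment", y="Profit", hue="Year", data=df, split=True)
```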
Let’s now move on to the “stripplot”. This is a scatter plot where one variable is categorical.
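A stripplot uses the same calling pattern as the plots above. Here is a minimal sketch on invented data standing in for the Excel file.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data: 20 random profit values for each of three segments
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 20),
    "Profit": rng.normal(10000, 4000, size=60),
})

# every row becomes one point, positioned by segment on x and profit on y
ax = sns.stripplot(x="Segment", y="Profit", data=df)
```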
One problem here is that it’s not always easy to see exactly how many individual points are stacked up, as when they get too close to each other they merge together. One way to combat this is to add the “jitter” parameter as follows:
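The effect of “jitter” can be seen by drawing the same data twice, once with it switched off and once with it switched on, as in the sketch below. The data is invented, with profits rounded so that many points coincide. Note that recent Seaborn versions enable jitter by default, so there you would need jitter=False to reproduce the overlapping version.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data; profits are rounded to the nearest 1000 so points overlap
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 20),
    "Profit": rng.normal(10000, 4000, size=60).round(-3),
})

fig, (ax_plain, ax_jitter) = plt.subplots(1, 2, sharey=True)

# without jitter all points in a segment sit on one vertical line
sns.stripplot(x="Segment", y="Profit", data=df, jitter=False, ax=ax_plain)

# with jitter each point gets a small random horizontal offset
sns.stripplot(x="Segment", y="Profit", data=df, jitter=True, ax=ax_jitter)
```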
You can also use the “hue” and “split” parameters, similar to the boxplots and violin plots. (In more recent Seaborn releases the stripplot “split” parameter has been renamed “dodge”.)
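A sketch of a stripplot with a hue is shown below. It assumes a Seaborn version in which the parameter that separates the hue levels is called “dodge” (older releases named it “split”), and it uses made-up data with “Year” stored as strings.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data: three segments, two years, random profits
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 20),
    "Year": np.tile(["2013", "2014"], 30),
    "Profit": rng.normal(10000, 4000, size=60),
})

# colour the points by year and place the two years side by side within each segment
ax = sns.stripplot(x="Segment", y="Profit", hue="Year", data=df, dodge=True)
```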
Another useful plot, which kind of combines a stripplot and a violin plot, is a swarmplot. It’s probably just easiest to show you an example and you will no doubt understand what I mean.
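A swarmplot example could look like the following sketch. The data is a small random sample used as a placeholder for the Excel file, kept deliberately small so the points are easy to arrange.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data: 15 random profit values for each of three segments
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 15),
    "Profit": rng.normal(10000, 4000, size=45),
})

# points are nudged sideways so they do not overlap, tracing out the shape of the distribution
ax = sns.swarmplot(x="Segment", y="Profit", data=df)
```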
As an FYI swarmplots probably aren’t a great choice for really large datasets as it’s quite computationally expensive to arrange the data points and also it can become quite difficult to fit all the data points on the chart – the swarm plots can become very wide!
Finally let’s look at “factorplots” – these are the most generic of the categorical plots we have come across. Using factorplots you can pass in your data and parameters and then specify the “kind” of plot that you want – whether that be, for example, a bar plot or a violin plot. Note that from Seaborn 0.9 onwards “factorplot” has been renamed “catplot”, so on a recent install you will need to use that name instead. I will show two quick examples of how to create a bar plot and a violin plot below.
#create bar plot with factorplot method
sns.factorplot(x="Segment",y="Profit",data=df,kind="bar")

#create violin plot with factorplot method
sns.factorplot(x="Segment",y="Profit",data=df,kind="violin")
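On Seaborn 0.9 or later, where “factorplot” was renamed “catplot”, the equivalent calls might look like this sketch. It runs on invented data rather than the Excel file, and assumes a Seaborn version that provides catplot.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical toy data: 20 random profit values for each of three segments
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Segment": np.repeat(["Government", "Midmarket", "Enterprise"], 20),
    "Profit": rng.normal(10000, 4000, size=60),
})

# create bar plot with catplot (each call returns a FacetGrid and opens its own figure)
g_bar = sns.catplot(x="Segment", y="Profit", data=df, kind="bar")

# create violin plot with catplot
g_violin = sns.catplot(x="Segment", y="Profit", data=df, kind="violin")
```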
I prefer to call the plot itself specifically, but just be aware that you can use “factorplot” and then specify the “kind”.
Completely off topic but I thought I would be more likely to get a reply on your most recent post.
I have a solid understanding of the basics and have completed the courses on Coursera that you suggested.
Your blog posts are very useful, however I find that just following what you do isn’t as helpful as discovering it for myself. You never really mention how you actually learnt the processes outlined in your blog posts.
Hi Alex – thanks for your comment, I’m always especially interested to hear opinions such as these relating more to the overall design and delivery of content, rather than relating to the content itself (although of course I am also very interested in those comments too!)
You raise an interesting point here, and actually one that I have thought about myself many times over the couple of years since I began writing this blog. At the very start I did indeed promise that I would concentrate as much on explaining and documenting the learning process itself, as I would on other content.
I agree with you that I have strayed a little from that path, but now is as good a time as any to address that. May I ask – what do you think would be most useful for me to present to do this? Do you mean you would like to see more recommendations of online courses/resources and why and when to use each one? Or do you mean potentially writing more subjectively about my learning process/state of mind and what I did and why I did those things?
Please do let me know what kind of things you want to see, and I will be more than happy to oblige. After all, I write things hoping they will be read…
Thanks for the reply, I think both of the suggestions you offered would be very helpful. Just to give you some background, I have a job in the investment industry, although it does not require very much programming at all. When I ask friends how they taught themselves programming they always reply saying that they learnt by needing to use programming to solve a problem, but since my work isn’t very heavily programming orientated I find it difficult to do this at work. Therefore, I look to online resources, such as your blog, for guidance on what I could potentially work on and learn.
Hence the thought processes behind why, how and where you learnt to do the processes in each blog post would be useful.