This article shows how to replicate some common Excel tasks using Python and the Pandas library. The point-and-click interface of Excel means the learning curve is somewhat less
steep than it is for Pandas – however, once a certain level of proficiency with Pandas is reached, the possibilities it presents are far greater than anything you would find in Excel (at least, that’s what I believe).
I have downloaded some test data from Microsoft in “.xlsx” format, which is available below if you wish to follow along.
Let us first read our data into a Pandas DataFrame and visualise the first 5 rows of data, just to see what we are playing with.
#import pandas library
import pandas as pd

#read data into a DataFrame
df = pd.read_excel('Financial Sample.xlsx')

#visualise the first 5 rows - pass a different number within the parentheses to display a different number of rows (the default is 5)
df.head()
This will look like the following:
So we can see we have “Excel like” data, presented in rows and columns.
If we wanted to sum the “Profit” column in Excel, we would enter a formula into a cell – something along the lines of “=SUM(L2:L30)” to sum up the total for column L, from rows 2 to 30.
With Pandas, we can just write:
total_profit = df['Profit'].sum()
print(total_profit)
This gets us:
If we want to add a row at the bottom of the DataFrame that sums up a few select columns, this can be achieved as follows; first we need to create a Series that holds the
data for the sums of the columns in question.
sum_row = df[['Gross Sales','Sales','COGS','Profit']].sum()
If we print out the variable “sum_row” it appears as such:
Gross Sales    1.279316e+08
Sales          1.187264e+08
COGS           1.018326e+08
Profit         1.689370e+07
dtype: float64
Before we can append this to the bottom of our main DataFrame, we need to transpose the data and turn it into a DataFrame object.
sum_row = pd.DataFrame(sum_row).T
If displayed, “sum_row” would now look like this:
The final thing we need to do before adding the totals back is to add the missing columns. We use the “reindex” method to do this for us. The trick is to supply all of our
columns and then allow Pandas to fill in the values that are missing.
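A minimal sketch of that reindex step, using a small stand-in DataFrame in place of the sample file (which has many more columns):

```python
import pandas as pd

# stand-in data with a similar column layout to the sample file
df = pd.DataFrame({
    'Country': ['Canada', 'France'],
    'Gross Sales': [100.0, 200.0],
    'Sales': [90.0, 180.0],
    'COGS': [50.0, 120.0],
    'Profit': [40.0, 60.0],
})

# sum the columns of interest and transpose into a one-row DataFrame
sum_row = pd.DataFrame(df[['Gross Sales', 'Sales', 'COGS', 'Profit']].sum()).T

# reindex against the full column list; Pandas fills the
# missing columns ('Country' here) with NaN
sum_row = sum_row.reindex(columns=df.columns)
```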
It would now look like this:
The final step is to now append it to the main DataFrame, and show the last 5 rows to make sure it has worked properly:
#note: DataFrame.append was removed in pandas 2.0 - pd.concat does the same job
df_final = pd.concat([df, sum_row])
df_final.tail()
Great, we can see it has worked as required, and we now have a final row that shows the sums we requested. If you want to quickly send the output back into either
“.csv” or “.xlsx” format, you can just run one of the following:
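A minimal sketch, using a small stand-in DataFrame in place of “df_final” (note that “to_excel” additionally requires the openpyxl package to be installed):

```python
import pandas as pd

# stand-in for the final DataFrame built above
df_final = pd.DataFrame({'Profit': [40.0, 60.0, 100.0]})

# write to CSV in the current working directory
# (index=False drops the DataFrame index from the output)
df_final.to_csv('output.csv', index=False)

# the Excel equivalent (requires the openpyxl package):
# df_final.to_excel('output.xlsx', index=False)
```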
That will output the relevant file in the current working directory. If you want the output to appear somewhere else, you can just specify the full file path instead of
just the name of the file.
So it might currently seem like using Pandas is a bit of overkill – why wouldn’t you just use an Excel sum formula instead of the 5-6 step process we have outlined above?
Well, that’s not a bad question – the answer will become evident as we investigate how to undertake more and more complex data manipulation operations. You will see that what takes multiple, convoluted
steps in Excel can be replicated in Python and Pandas in a much simpler way. Let’s take the next step, and replicate an Excel pivot table where we are looking to sum and group by
the month and year, for various data columns.
profit_per_month = df[['Gross Sales','COGS','Profit','Month Name']].groupby('Month Name').sum()
This will create a table which sums the ‘Gross Sales’,’COGS’,’Profit’ columns and groups them by “Month Name”.
This is great and all, but you may have noticed that when we group by just “Month Name”, it doesn’t distinguish by year, and groups together the same month across multiple years – this is
usually not exactly what we want. Usually we would like to group by month AND year. To achieve this we need to create a new column which concatenates together the “Month Name” column and
the “Year” column.
df['Month Year'] = df['Month Name'] + " " + df['Year'].map(str)
Because “Month Name” is already a string, and “Year” is an integer, we need to change the datatype of “Year” from an integer to a string in order for the string concatenation to work.
That’s what the “.map(str)” part of the code does.
We can now use the “groupby” method to create a pivot table style output that sorts by month AND year.
profit_per_month = df[['Gross Sales','COGS','Profit','Month Year']].groupby('Month Year').sum()
This looks like the following:
Ok, we’re getting there, but the first column showing our month and year doesn’t seem to be in the right chronological order. That’s a problem!
The reason for this is that the “Month Year” column (which is currently our DataFrame index) isn’t of type “datetime”. It currently exists just as datatype “object”.
We can see this by running the code:

profit_per_month.index
Which gets us:
Index(['April 2014', 'August 2014', 'December 2013', 'December 2014', 'February 2014', 'January 2014', 'July 2014', 'June 2014', 'March 2014', 'May 2014', 'November 2013', 'November 2014', 'October 2013', 'October 2014', 'September 2013', 'September 2014'], dtype='object', name='Month Year')
There you can see at the bottom “dtype='object'”. We need to convert this to a “datetime” object so that it can be sorted correctly by Pandas.
profit_per_month.index = pd.to_datetime(profit_per_month.index, format="%B %Y")
The format code %B refers to the month name, and %Y refers to the year. This tells Pandas what format to expect our “dates” in so that it can correctly carry out the conversion. The full list
of format codes can be found here.
Now if we call:

profit_per_month.index
We can see that the datatype has changed to “‘datetime64[ns]'”
DatetimeIndex(['2014-04-01', '2014-08-01', '2013-12-01', '2014-12-01', '2014-02-01', '2014-01-01', '2014-07-01', '2014-06-01', '2014-03-01', '2014-05-01', '2013-11-01', '2014-11-01', '2013-10-01', '2014-10-01', '2013-09-01', '2014-09-01'], dtype='datetime64[ns]', name='Month Year', freq=None)
And a call to:

profit_per_month

now shows the date column sorted correctly in chronological order.
Ooh…so close! The date column now seems to have changed back to a “YYYY-MM-DD” format. If you’re happy with that, then fantastic, we’re done. If you would prefer it
in a more easy-to-read format, then we can do something along the following lines.
profit_per_month.index = profit_per_month.index.strftime("%B %Y")
profit_per_month
The fact that our pivot table index is now in chronological order means that we are now able to create some plots of our data with minimal code.
We can create a line chart of the data as follows:
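A sketch of that call, using stand-in monthly data in place of the grouped table we built above (plotting requires the matplotlib package):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so no window is needed
import pandas as pd

# stand-in for the profit_per_month table built above
profit_per_month = pd.DataFrame(
    {'Gross Sales': [100.0, 120.0], 'COGS': [50.0, 55.0], 'Profit': [50.0, 65.0]},
    index=pd.to_datetime(['2013-09-01', '2013-10-01']),
)

# Pandas plots every numeric column as its own line
ax = profit_per_month.plot()
```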
Or we can change this to a bar chart very simply:
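Again sketched with stand-in data – passing kind="bar" to the “plot” method does the job (matplotlib is required here too):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so no window is needed
import pandas as pd

# stand-in for the profit_per_month table built above
profit_per_month = pd.DataFrame(
    {'Gross Sales': [100.0, 120.0], 'COGS': [50.0, 55.0], 'Profit': [50.0, 65.0]},
    index=['September 2013', 'October 2013'],
)

# one group of bars per row, one bar per column
ax = profit_per_month.plot(kind='bar')
```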
Although we used the “groupby” method above to create a “pivot table like” output, Pandas does actually have a dedicated “pivot_table” method. If for example we wanted to see the
average Sale Price by Country and Product, we can do so as follows:
The “margins=True” argument adds subtotals to the rows and columns of the table. The output can be rounded to show a certain number of decimal places (in this case 2) by adding “.round(2)” to the end.
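A sketch of that “pivot_table” call, with a small stand-in DataFrame in place of the sample file:

```python
import pandas as pd

# stand-in data for illustration
df = pd.DataFrame({
    'Country': ['Canada', 'Canada', 'France', 'France'],
    'Product': ['Montana', 'Paseo', 'Montana', 'Paseo'],
    'Sale Price': [12.0, 20.0, 14.0, 22.0],
})

# average Sale Price by Country and Product, with subtotals
# added for the rows and columns via margins=True
pivot = pd.pivot_table(
    df,
    index='Country',
    columns='Product',
    values='Sale Price',
    aggfunc='mean',
    margins=True,
).round(2)
```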
Filtering DataFrames can also be done very easily – assume we wish to filter the results to show only data where the country is Canada. We can do this a couple of ways – we can do it in
one line of code which will immediately output the filtered data.
df[df['Country'] == "Canada"]
Or we can do it in a couple of steps, firstly by setting up the filter and then applying the filter to the data.
#set up filter
df_canada = df['Country'] == "Canada"

#apply filter to data
df[df_canada]
This allows us to set up multiple filters and then apply them all at once to the data – below we add a filter that the product must be “Montana” and then we apply both filters to the data.
#set up filters
df_canada = df['Country'] == "Canada"
df_montana = df['Product'] == "Montana"

#apply filters to data
df[df_canada & df_montana]
Technically this could also be done in one line of code, but you may find it a bit neater and easier to track what you are doing if you do it step by step – to each their own.
If you did want to do it in one line of code:
df[(df['Country'] == "Canada") & (df['Product'] == "Montana")]
To filter by multiple values within a single column, you can use the “isin” method – say we want data where the country is either Canada or France:
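Sketched with stand-in data, the “isin” filter looks like this:

```python
import pandas as pd

# stand-in data for illustration
df = pd.DataFrame({
    'Country': ['Canada', 'France', 'Germany', 'Mexico'],
    'Profit': [40.0, 60.0, 30.0, 20.0],
})

# keep only the rows where Country is in the given list
filtered = df[df['Country'].isin(['Canada', 'France'])]
```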
And just for completeness, we can also use the same logic to apply multiple filters to multiple columns at once – say we wish to filter by country as either Canada or France,
and Product as either Montana or Paseo:
df[(df["Country"].isin(["Canada","France"])) & (df["Product"].isin(["Montana","Paseo"]))]
Another thing we can do with Pandas is apply conditional formatting – say, for example, in the filtered data we just created above, we wish to
highlight in red any value in the “Units Sold” column under 1000. We first need to create a simple function and then apply that to the DataFrame style.
def colour_red(val):
    color = 'red' if val < 1000 else 'black'
    return 'color: %s' % color

#filter DataFrame
df_canada = df['Country'] == "Canada"
df_montana = df['Product'] == "Montana"
df_filtered = df[df_canada & df_montana]

#apply style to "Units Sold" column
#(note: newer pandas versions rename Styler.applymap to Styler.map)
df_filtered.style.applymap(colour_red, subset=pd.IndexSlice[:, ['Units Sold']])
You can also apply some other pretty snazzy formatting styles – here’s just one example below – the full details of what can be achieved can be found here.
So hopefully you’ll agree that, while Pandas may indeed have a steeper learning curve than Excel, once a certain level of proficiency is reached, the possibilities really are fantastic.
There’s PLENTY more functionality offered by Pandas that I haven’t mentioned – in fact I’ve barely scratched the surface – so do feel free to investigate the module further! If you have any specific
questions on how to replicate particular Excel functionality, please do ask and I will do my best to show you how it is done.