
Stock Clusters Using K-Means Algorithm in Python

by s666

For this post, I will be creating a script to download pricing data for the S&P 500 stocks, calculate their historic returns and volatilities, and then use the K-Means clustering algorithm to divide the stocks into distinct groups based on those returns and volatilities.

So why would we want to do this, you ask? Well, dividing stocks into groups with “similar characteristics” can help in portfolio construction by ensuring we choose a universe of stocks with sufficient diversification between them.

The concept behind K-Means clustering is explained here far more succinctly than I ever could, so please visit that link for more details on the concept and algorithm.

I’ll deal instead with the actual Python code needed to carry out the necessary data collection, manipulation and analysis.

First things first, we need to collect the data – let's run our imports and create a simple data download script that scrapes the web to collect the tickers for all the individual stocks within the S&P 500.

from pylab import plot, show
import numpy as np
from scipy.cluster.vq import kmeans, vq
import pandas as pd
import pandas_datareader as dr
from math import sqrt
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt


sp500_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

#read in the url and scrape ticker data
data_table = pd.read_html(sp500_url)

#the first table on the page holds the constituents; the tickers
#sit in the 'Symbol' column (indexing raw column positions can
#raise a KeyError with newer pandas versions)
tickers = data_table[0]['Symbol'].tolist()
prices_list = []
for ticker in tickers:
    try:
        prices = dr.DataReader(ticker,'yahoo','01/01/2017')['Adj Close']
        prices = pd.DataFrame(prices)
        prices.columns = [ticker]
        prices_list.append(prices)
    except Exception:
        #skip any ticker whose data fails to download
        pass

#concatenate all the individual price series once, after the loop
prices_df = pd.concat(prices_list, axis=1)

prices_df.sort_index(inplace=True)

prices_df.head()

This gives us something resembling the following:

We can now start to analyse the data and begin our K-Means investigation…

Our first decision is how many clusters we actually want to separate the data into. Rather than make some arbitrary choice, we can use an “Elbow Curve” to highlight the relationship between the number of clusters we choose and the Sum of Squared Errors (SSE) resulting from that choice.

We then plot this relationship to help us identify the optimal number of clusters – we would prefer a lower number of clusters, but we would also prefer the SSE to be lower, so this trade-off needs to be taken into account.

Let's run the code for our Elbow Curve plot.

#Calculate average annual percentage return and volatilities over a theoretical one year period
returns = prices_df.pct_change().mean() * 252
returns = pd.DataFrame(returns)
returns.columns = ['Returns']
returns['Volatility'] = prices_df.pct_change().std() * sqrt(252)

#format the data as a numpy array to feed into the K-Means algorithm
data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T

X = data
distortions = []
for k in range(2, 20):
    k_means = KMeans(n_clusters=k)
    k_means.fit(X)
    distortions.append(k_means.inertia_)

fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distortions)
plt.grid(True)
plt.title('Elbow curve')
plt.show()

The resulting plot with the above data is as follows:

So we can see that once the number of clusters reaches 5 (on the x-axis), the reduction in the SSE begins to slow down with each further increase in cluster number. This leads me to believe that the optimal number of clusters for this exercise lies around the 5 mark – so let's use 5.

# computing K-Means with K = 5 (5 clusters)
centroids,_ = kmeans(data,5)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'oy',
     data[idx==2,0],data[idx==2,1],'or',
     data[idx==3,0],data[idx==3,1],'og',
     data[idx==4,0],data[idx==4,1],'om')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()

This gives us the output:

Ok, so it looks like we have an outlier in the data which is skewing the results and making it difficult to actually see what is going on for all the other stocks. Let’s take the easy route and just delete the outlier from our data set and run this again.

#identify the outlier
print(returns.idxmax())

Returns       BHF
Volatility    BHF
dtype: object

Ok, so let's drop the stock 'BHF' and recreate the necessary data arrays.

#drop the relevant stock from our data
returns.drop('BHF',inplace=True)

#recreate data to feed into the algorithm
data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T

So now running the following piece of code:

# computing K-Means with K = 5 (5 clusters)
centroids,_ = kmeans(data,5)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'oy',
     data[idx==2,0],data[idx==2,1],'or',
     data[idx==3,0],data[idx==3,1],'og',
     data[idx==4,0],data[idx==4,1],'om')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()

gets us a much clearer visual representation of the clusters as follows:

Finally, to get the details of which stock is actually in which cluster, we can run the following list comprehension to create a list of tuples in the (Stock Name, Cluster Number) format:

details = [(name,cluster) for name, cluster in zip(returns.index,idx)]

for detail in details:
    print(detail)

This will print out something resembling the below (I haven't included all the results, for brevity):
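As an optional extra step (my own addition rather than something from the original post), the same (ticker, cluster) pairs can be loaded into a pandas DataFrame, which makes it easy to sort or filter the stocks by cluster – a minimal sketch:

#collect the (ticker, cluster) pairs into a DataFrame for easier
#inspection - 'clusters_df' is just an illustrative name
clusters_df = pd.DataFrame(details, columns=['Ticker', 'Cluster'])

#for example, show all the stocks assigned to cluster 0
print(clusters_df[clusters_df['Cluster'] == 0])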

So there you have it – we now have a list of each of the stocks in the S&P 500, along with which of the 5 clusters they belong to, the clusters being defined by their return and volatility characteristics. We also have a visual representation of the clusters in chart format.

If anyone has any questions or comments, as always feel free to leave them below.

Cheers!


10 comments

Dhruv July 30, 2018 - 6:48 am

Sir, could you please tell me what the x and y labels are in the last graph? Thank you

Reply
s666 August 8, 2018 - 11:28 am

Do you mean the cluster scatter chart? If so, the x-axis is returns and the y-axis is volatility. If you mean the Elbow Curve chart, then the x-axis is the number of clusters and the y-axis is the SSE (Sum of Squared Errors).
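If it helps, the labels can also be added to the chart itself before calling show() – a quick sketch using the matplotlib pyplot import from the script above (the label text is just a suggestion):

#label the axes of the cluster scatter chart; pylab's plot() and
#pyplot share the same active figure, so this works before show()
plt.xlabel('Returns')
plt.ylabel('Volatility')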

Reply
Wannapa Phaob September 19, 2018 - 5:02 pm

Hi, do you have any research relating to this article? I'd really appreciate it if you could share it with me. Thank you very much.

Reply
K-Means: Clasificación y agrupamiento con Minería de datos [Introducción] January 8, 2019 - 1:07 pm

[…] Separating groups of stocks according to their characteristics to improve portfolio diversification: a very good example using Python at Python for Finance […]

Reply
John March 22, 2019 - 8:25 pm

Hello,

Thank you for this! I have tried it for my specific set of securities, and I found it weird that with the same dataset, it gives me different clusters by security every time I run the notebook.

Any insight as to why ? Is it the same for your dataset ?

Thanks!

Reply
s666 March 26, 2019 - 3:57 pm

Hi John,

This is actually expected behaviour to a degree, as KMeans by default selects the initial cluster centroid positions at random and optimises from there. The results can indeed depend on the location of those initial random centroid placements.

You can read more and how to overcome it by setting your “random_state” here:

https://stackoverflow.com/questions/25921762/changes-of-clustering-results-after-each-time-run-in-python-scikit-learn
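For example, a minimal sketch using scikit-learn's KMeans (the seed value of 42 is an arbitrary choice):

#fixing random_state makes the cluster assignments reproducible
#across runs of the notebook
k_means = KMeans(n_clusters=5, random_state=42)
k_means.fit(data)
idx = k_means.labels_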

Reply
Mike June 9, 2019 - 6:20 pm

The stock BHF was removed from the S&P 500 in 2019. Therefore, your example shows the following for outliers:
Returns NKTR
Volatility CTVA

In this case, the k-means step would have to be executed again removing each outlier? This would be repeated until no outliers exist? Given this result, would adding another (unbiased) feature help to improve the clustering?

Reply
Tushar Gupta October 19, 2019 - 8:26 am

It's showing me a value error: n_samples=2 should be >= n_clusters=3, on the line KMeans(n_clusters = k).
I'm not able to understand – can you please help?

Reply
s666 October 19, 2019 - 8:40 am

Sounds like you perhaps only have 2 data points in your sample, but you are trying to split them into 3 clusters. The number of clusters you use has to be less than or equal to the number of observations/data points you have.
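A minimal illustrative guard, assuming X holds your samples:

#never request more clusters than there are data points
k = min(3, len(X))
k_means = KMeans(n_clusters=k)
k_means.fit(X)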

Reply
Kyle November 30, 2019 - 6:15 am

Hello, I am not sure why, but in the first part, when scraping data from the web, I get a KeyError: 0.

I am new to python and would love some help to understand what is happening and why it is not working.

Reply
