In this post, I will create a script that downloads pricing data for the S&P 500 stocks, calculates their historical returns and volatility, and then uses the K-Means clustering algorithm to divide the stocks into distinct groups based on those returns and volatilities.

So why would we want to do this, you ask? Well, dividing stocks into groups with “similar characteristics” can help in portfolio construction, ensuring that we choose a universe of stocks with sufficient diversification between them.

The concept behind K-Means clustering is explained here far more succinctly than I ever could, so please visit that link for more details on the concept and algorithm.
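For anyone who just wants a rough feel for the idea first, here is a minimal sketch of the two steps K-Means repeats: assign each point to its nearest centroid, then move each centroid to the mean of its members. It is plain standard-library Python rather than the library code used later in the post, and the function name `kmeans`, its parameters and the six (return, volatility) pairs are all made up purely for illustration.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Toy K-Means (Lloyd's algorithm) on a list of (x, y) tuples."""
    rng = random.Random(seed)
    # Start from k data points picked at random (illustrative choice of init).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # nothing moved, so the assignments are stable
        centroids = new_centroids
    return labels, centroids


# Made-up (annual return, annual volatility) pairs: three calm, three wild.
points = [
    (0.05, 0.12), (0.06, 0.14), (0.04, 0.13),
    (0.30, 0.45), (0.28, 0.50), (0.35, 0.48),
]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On toy data like this, the three low-volatility points should land in one cluster and the three high-volatility points in the other. That is all I will say about the mechanics of the algorithm itself.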

I’ll deal instead with the actual Python code needed to carry out the necessary data collection, manipulation and analysis.

First things first, we need to collect the data – let's run our imports and create a simple download script that scrapes the web to collect the tickers for all the individual stocks within the S&P 500.