In this article we are going to revisit the concept of building a trading strategy backtest based on mean reverting, co-integrated pairs of stocks. So to restate the theory, stocks that are statistically co-integrated move in a way that means when their prices start to diverge by a certain amount (i.e. the spread between the 2 stocks prices increases), we would expect that divergence to

eventually revert back to the mean. In this instance we would look to sell the outperforming stock,and buy the under performing stock in our expectance that the under performing stock would eventually “catch up” with the overpeforming stock and rise in price, or vice versa the overperforming stock would in time suffer from the same downward pressure of the underperforming stock and fall in relative value.

Hence, pairs trading is a market neutral trading strategy enabling traders to profit from virtually any market conditions: uptrend, downtrend, or sideways movement.

So in our search for co-integrated stocks, economic theory would suggest that we are more likley to find pairs of stocks that are driven by the same factors, if we search for pairs that are drawn from similar/the same industry. After all, it is logical to expect

2 stocks in the technology sector that produce similar products, to be at the mercy of the same general ups and downs of the industry environment. So for this particular backtest I will be scraping a load of tech stock tickers from the web and then using Pandas data-reader to download daily data for those stocks. (n.b. Pandas data-reader has been facing some serious issues recently and in effect the Yahoo and Google APIs are no longer fit for purpose and are no longer working. Instead I shall use “iex” provider, which offers daily data for a maximum of a 5 year historical period. I see 5 years as being more than long enough for our purposes.)

I started this blog a few years ago, and one of my very first blog series was on this exact subject matter – mean reversion based pairs trading. So why approach it again and repeat myself? Well this time I am going to add a few more elements that were not present in the initial blog series.

I am going to

- Create a heatmap of co-integrated pairs so we can visually see the level of cointegration between any and all pairs that we are concerning ourselves with.
- Introduce the concept of a “Kalman Filter” when considering the spread series which will give us our trading signal.
- Add the concept of a “training set” of data, and a “test set” of data – seperating the two. What this helps us avoid is “look back” bias, whereby we would incorrectly test co-integration over the full set of data that we have, and also run our backtest of trading over the same data. Afetr all, how would we be able to both

run a co-integration test over a time period to select our trading pairs, and then travel back in time and carry out trades over the same time period. That’sjust not possible, however it is one subtle pitfall that many, many systematic traders fall into.

So what is a Kalman Filter? Well I this site (click here) explains the concept and shows examples in the clearest manner that I have yet to find while searching online. It states the following:

You can use a Kalman filter in any place where you have uncertain information about some dynamic system, and you can make an educated guess about what the system is going to do next. Even if messy reality comes along and interferes with the clean motion you guessed about, the Kalman filter will often do a very good job of figuring out what actually happened. And it can take advantage of correlations between crazy phenomena that you maybe wouldn’t have thought to exploit!

Kalman filters are ideal for systems which are continuously changing. They have the advantage that they are light on memory (they don’t need to keep any history other than the previous state), and they are very fast, making them well suited for real time problems and embedded systems.

So lets start to import the relevant modules we will need for our strategy backtest:

##### import the necessary modules and set chart style#### import numpy as np import pandas as pd import seaborn as sns import matplotlib as mpl mpl.style.use('bmh') import pandas_datareader.data as web import matplotlib.pylab as plt from datetime import datetime import statsmodels.api as sm from pykalman import KalmanFilter from math import sqrt

And lets use the Pandas and the data-reader module to scrape the relevant tech stock tickers from the www.marketwatch.com website.

#scrape html from website and store 3rd DataFrame as our stock tickers - this is dictated to us by the structure of the html stock_list = pd.read_html("https://www.marketwatch.com/tools/industry/stocklist.asp?bcind_ind=9535&bcind_period=3mo")[3] #convert the DataFrame of stocks into a list so we can easily iterate over it stocks = stock_list[1].dropna()[1:].tolist() #set empty list o hold the stock price DataFrames that we can later concatenate into a master frame df_list = [] #not all stocks will return data so set up an empty list to store the stock tickers that actually successfully returns data used_stocks = [] #iterate over stock tickers in list and download relevant data, storing said data and successfully downloaded tickers along the way for stock in stocks: try: data = pd.DataFrame(web.DataReader(stock,data_source='iex',start='01/01/2013')['close']) data.columns = [stock] df_list.append(data) used_stocks.append(stock) except: pass #concatenate list of individual tciker price DataFrames into one master DataFrame df = pd.concat(df_list,axis=1)

Lets plot the resulting DataFrame of price data just to make sure we have what we need and as a quick sanity check:

df.plot(figsize=(20,10))

Ok so it looks from the chart as if we have around price data downloaded for around 25-30 stocks; this should be more than enough to find at least a couple of co-integrated pairs to run our backtest over.

We will now define a quick function that will run our stocks, combining them into pairs one by one and running co-integration tests on each pair. That result will then be stored in a matrix that we initialise,

and then we will be able to plot that matrix as a heatmap. Also, if the co-integration test meets our threshold statistical significance (in our case 5%), then that pair of stock tciokers will be stored in a list for later retrieval.

#NOTE CRITICAL LEVEL HAS BEEN SET TO 5% FOR COINTEGRATION TEST def find_cointegrated_pairs(dataframe, critial_level = 0.05): n = dataframe.shape[1] # the length of dateframe pvalue_matrix = np.ones((n, n)) # initialize the matrix of p keys = dataframe.columns # get the column names pairs = [] # initilize the list for cointegration for i in range(n): for j in range(i+1, n): # for j bigger than i stock1 = dataframe[keys[i]] # obtain the price of "stock1" stock2 = dataframe[keys[j]]# obtain the price of "stock2" result = sm.tsa.stattools.coint(stock1, stock2) # get conintegration pvalue = result[1] # get the pvalue pvalue_matrix[i, j] = pvalue if pvalue < critial_level: # if p-value less than the critical level pairs.append((keys[i], keys[j], pvalue)) # record the contract with that p-value return pvalue_matrix, pairs

Let’s now run our data through our function, save the results and plot the heatmap:

#set up the split point for our "training data" on which to perform the co-integration test (the remaining dat awill be fed to our backtest function) split = int(len(df) * .4) #run our dataframe (up to the split point) of ticker price data through our co-integration function and store results pvalue_matrix,pairs = find_cointegrated_pairs(df[:split]) #convert our matrix of stored results into a DataFrame pvalue_matrix_df = pd.DataFrame(pvalue_matrix) #use Seaborn to plot a heatmap of our results matrix fig, ax = plt.subplots(figsize=(15,10)) sns.heatmap(pvalue_matrix_df,xticklabels=used_stocks,yticklabels=used_stocks,ax=ax)

So we can see from the very dark red squares that it looks as though there are indeed a few pairs of stocks who’s co-integration score is below the 5% threshold

hardcoded into the function we defined. To see more explicitly which pairs these are, let’s print out our list of stored pairs that was part of the fucntion results we stored:

for pair in pairs: print("Stock {} and stock {} has a co-integration score of {}".format(pair[0],pair[1],round(pair[2],4)))

Stock AKAM and stock TCX has a co-integration score of 0.027 Stock AKAM and stock YNDX has a co-integration score of 0.0484 Stock BIDU and stock WEB has a co-integration score of 0.0377 Stock WIFI and stock JCOM has a co-integration score of 0.0039 Stock WIFI and stock LLNW has a co-integration score of 0.0187 Stock WIFI and stock NTES has a co-integration score of 0.0215 Stock WIFI and stock SIFY has a co-integration score of 0.0026 Stock WIFI and stock YNDX has a co-integration score of 0.0092 Stock ENV and stock SINA has a co-integration score of 0.0294 Stock IPAS and stock LLNW has a co-integration score of 0.0199 Stock IPAS and stock EGOV has a co-integration score of 0.0405 Stock JCOM and stock SIFY has a co-integration score of 0.0388 Stock LLNW and stock NTES has a co-integration score of 0.0109 Stock LLNW and stock TCX has a co-integration score of 0.032

We will now use the “pykalman” module to set up a couple of functions that will allow us to generate Kalman filters which we will apply to our data and in turn our regression that is fed the said data.

I will also define a function for “Halflife” which just recycles some tof the code from my mean reversion pairs trading blog post from a couple of years ago, which can be found

here.

def KalmanFilterAverage(x): # Construct a Kalman filter kf = KalmanFilter(transition_matrices = [1], observation_matrices = [1], initial_state_mean = 0, initial_state_covariance = 1, observation_covariance=1, transition_covariance=.01) # Use the observed values of the price to get a rolling mean state_means, _ = kf.filter(x.values) state_means = pd.Series(state_means.flatten(), index=x.index) return state_means # Kalman filter regression def KalmanFilterRegression(x,y): delta = 1e-3 trans_cov = delta / (1 - delta) * np.eye(2) # How much random walk wiggles obs_mat = np.expand_dims(np.vstack([[x], [np.ones(len(x))]]).T, axis=1) kf = KalmanFilter(n_dim_obs=1, n_dim_state=2, # y is 1-dimensional, (alpha, beta) is 2-dimensional initial_state_mean=[0,0], initial_state_covariance=np.ones((2, 2)), transition_matrices=np.eye(2), observation_matrices=obs_mat, observation_covariance=2, transition_covariance=trans_cov) # Use the observations y to get running estimates and errors for the state parameters state_means, state_covs = kf.filter(y.values) return state_means def half_life(spread): spread_lag = spread.shift(1) spread_lag.iloc[0] = spread_lag.iloc[1] spread_ret = spread - spread_lag spread_ret.iloc[0] = spread_ret.iloc[1] spread_lag2 = sm.add_constant(spread_lag) model = sm.OLS(spread_ret,spread_lag2) res = model.fit() halflife = int(round(-np.log(2) / res.params[1],0)) if halflife <= 0: halflife = 1 return halflife

Now let us define our main “Backtest” function that we will run our data through. The fucntion takes one pair of tickers at a time, and then returns several outputs, namely the DataFrame of cumulative returns,

the Sharpe Ratio and the Compound Annual Growth Rate (CAGR). Once we have defined our function, we can iterate over our list of pairs and feed the relevant data, pair by pair, into the function, storing the outputs for each pair for

later use and retrieval.

def backtest(df,s1, s2): ############################################################# # INPUT: # DataFrame of prices # s1: the symbol of contract one # s2: the symbol of contract two # x: the price series of contract one # y: the price series of contract two # OUTPUT: # df1['cum rets']: cumulative returns in pandas data frame # sharpe: Sharpe ratio # CAGR: Compound Annual Growth Rate x = df[s1] y = df[s2] # run regression (including Kalman Filter) to find hedge ratio and then create spread series df1 = pd.DataFrame({'y':y,'x':x}) df1.index = pd.to_datetime(df1.index) state_means = KalmanFilterRegression(KalmanFilterAverage(x),KalmanFilterAverage(y)) df1['hr'] = - state_means[:,0] df1['spread'] = df1.y + (df1.x * df1.hr) # calculate half life halflife = half_life(df1['spread']) # calculate z-score with window = half life period meanSpread = df1.spread.rolling(window=halflife).mean() stdSpread = df1.spread.rolling(window=halflife).std() df1['zScore'] = (df1.spread-meanSpread)/stdSpread ############################################################## # trading logic entryZscore = 2 exitZscore = 0 #set up num units long df1['long entry'] = ((df1.zScore < - entryZscore) & ( df1.zScore.shift(1) > - entryZscore)) df1['long exit'] = ((df1.zScore > - exitZscore) & (df1.zScore.shift(1) < - exitZscore)) df1['num units long'] = np.nan df1.loc[df1['long entry'],'num units long'] = 1 df1.loc[df1['long exit'],'num units long'] = 0 df1['num units long'][0] = 0 df1['num units long'] = df1['num units long'].fillna(method='pad') #set up num units short df1['short entry'] = ((df1.zScore > entryZscore) & ( df1.zScore.shift(1) < entryZscore)) df1['short exit'] = ((df1.zScore < exitZscore) & (df1.zScore.shift(1) > exitZscore)) df1.loc[df1['short entry'],'num units short'] = -1 df1.loc[df1['short exit'],'num units short'] = 0 df1['num units short'][0] = 0 df1['num units short'] = df1['num units short'].fillna(method='pad') df1['numUnits'] = df1['num units long'] + df1['num units short'] df1['spread pct ch'] = (df1['spread'] - df1['spread'].shift(1)) / ((df1['x'] * abs(df1['hr'])) + df1['y']) df1['port rets'] = df1['spread pct ch'] * df1['numUnits'].shift(1) df1['cum rets'] = df1['port rets'].cumsum() df1['cum rets'] = df1['cum rets'] + 1 ############################################################## try: sharpe = ((df1['port rets'].mean() / df1['port rets'].std()) * sqrt(252)) except ZeroDivisionError: sharpe = 0.0 ############################################################## start_val = 1 end_val = df1['cum rets'].iat[-1] start_date = df1.iloc[0].name end_date = df1.iloc[-1].name days = (end_date - start_date).days CAGR = round(((float(end_val) / float(start_val)) ** (252.0/days)) - 1,4) df1[s1+ " "+s2] = df1['cum rets'] return df1[s1+" "+s2], sharpe, CAGR

So now let’s run our full list of pairs through our Backtest function, and print out some results along the way, and finally after storing the equity curve for each pair,

produce a chart that plots out each curve.

results = [] for pair in pairs: rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1]) results.append(rets) print("The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}".format(pair[0],pair[1],round(sharpe,2),round(CAGR,4))) rets.plot(figsize=(20,15),legend=True)

The pair AKAM and TCX produced a Sharpe Ratio of 1.69 and a CAGR of 0.0701 The pair AKAM and YNDX produced a Sharpe Ratio of 1.95 and a CAGR of 0.0974 The pair BIDU and WEB produced a Sharpe Ratio of 0.48 and a CAGR of 0.0163 The pair WIFI and JCOM produced a Sharpe Ratio of 1.31 and a CAGR of 0.0866 The pair WIFI and LLNW produced a Sharpe Ratio of 1.22 and a CAGR of 0.13 The pair WIFI and NTES produced a Sharpe Ratio of 0.36 and a CAGR of 0.0385 The pair WIFI and SIFY produced a Sharpe Ratio of 1.7 and a CAGR of 0.0942 The pair WIFI and YNDX produced a Sharpe Ratio of 1.05 and a CAGR of 0.065 The pair ENV and SINA produced a Sharpe Ratio of 0.83 and a CAGR of 0.0163 The pair IPAS and LLNW produced a Sharpe Ratio of -0.72 and a CAGR of -0.1943 The pair IPAS and EGOV produced a Sharpe Ratio of 1.25 and a CAGR of 0.1504 The pair JCOM and SIFY produced a Sharpe Ratio of 0.0 and a CAGR of 0.0 The pair LLNW and NTES produced a Sharpe Ratio of 0.24 and a CAGR of 0.0338 The pair LLNW and TCX produced a Sharpe Ratio of 0.95 and a CAGR of 0.1073

Now we run a few extra lines of code to combine, equally weight, and print our our final equity curve:

#concatenate together the individual equity curves into a single DataFrame results_df = pd.concat(results,axis=1).dropna() #equally weight each equity curve by dividing each by the number of pairs held in the DataFrame results_df /= len(results_df.columns) #sum up the equally weighted equity curves to get our final equity curve final_res = results_df.sum(axis=1) #plot the chart of our final equity curve final_res.plot(figsize=(20,15))

#calculate and print our some final stats for our combined equity curve sharpe = (final_res.pct_change().mean() / final_res.pct_change().std()) * (sqrt(252)) start_val = 1 end_val = final_res.iloc[-1] start_date = final_res.index[0] end_date = final_res.index[-1] days = (end_date - start_date).days CAGR = round(((float(end_val) / float(start_val)) ** (252.0/days)) - 1,4) print("Sharpe Ratio is {} and CAGR is {}".format(round(sharpe,2),round(CAGR,4)))

Sharpe Ratio is 2.14 and CAGR is 0.06

## 68 comments

Hi, nice post!

PS: the link to Kalman filter does not work unfortunately.

That’s strange, it works for me…make sure you click the word “here” rather than “click”.

Great article!

I have found one issue: The first (halflife -1) entries in the meanSpread to be nan’s. This causes the first entries of df1.zScore to be nan’s and therefore the comparison with the entryZscore fails. How can I make this work?

Thanks and keep up the good work!

JC

I dont understand why you define and use 2 kalman fileter functions? I apreciatte your answer!

nice!

Great article! But the hedge ratio is changing every day, and in real situation, the hedge ratio is fixed while executing buy and sell trading, until long or short exit.

The hedge ratio should be online(should change every day)

Hello S666,

Firstly I would like to thank you for your very interesting posts on pair trading. I have two questions regarding your implementation:

1. You calculate the daily return when in position as: (spread – spread.lag(1)) / (x * hr + y). Can you please explain where it comes from and which position sizing you are assuming for each leg of the pair?

2. Your implementation of the Kalman Filter is to first filter x and y through a Kalman average (works like some sort of a moving average) and then feed the result to the main Kalman filter that calculates the hedge ratio and intercept. Could you please explain why is the hedge ration calculated on the smoothed prices rather than the true prices?

Thank you,

Nathan

Apologies for the delay – I shall get to this question and reply shortly!

Hi

I’m trying to build the spread slightly differently by adding the intercept as well. (i.e spread = stock1 – beta*stock2 -alpha). Just to confirm my alpha would be contained in state_means[:,1] is it? I’d assume so but wanted to double check

There is an error in the backtest function related to calculation of hedge ratio. It recalculates at each timestamp, i.e. it is assumed that position sizes are added/reduced every day (if it is a daily data). Though when you open the trades you fix the hedge ratio until you close them. It gives you an extra income. This error presents also in the source of your code (QI) as well. The true backtesting will not like the current one at all, unforunately. Don’t fall into that trap.

Hello, I am trying to replicate the portfolio as a way to improve my programming. However the download of the prices from yhaoo I think has been desabled.

What tools are your using to download the data now?

I have come across a module called https://github.com/JECSand/yahoofinancials but the way it downloads the data has a very complicated formatting and I am struggling to get the adjusted closes.

Thank you very much.

Hi Nick,

I think the Pandas Datareader Yahoo download has been “fixed” somewhat. Please refer to this post (http://pythonforfinance.net//2019/05/30/python-monte-carlo-vs-bootstrapping/) which is the latest on the blog – it uses Datareader to pull down the Adjusted Close for a number of tickers in one go.

Hopefully that gets you what you want. If you are still experiencing issues, let me know.

Hi, thanks for getting back to me.

I am using a list of tickers for all the technology stocks from the nasdaq. and I am using the formula

asset_universe = pd.DataFrame([web.DataReader(ticker, ‘yahoo’, start, end).loc[:, ‘Adj Close’] for ticker in clean_names],index=clean_names).T.fillna(method=’ffill’)

I get

Keyerror Data

Have you ever had the same error?

Hi Nick,

It’s a bit difficult to debug without having the full list of tickers you are using (so I can try to recreate the problem), or having the full error message. Do you have a ticker in your list named “Data” by any chance?

If you could post the full error message and also perhaps paste your list of tickers I can take a closer look.

I found this link on Google: https://github.com/pydata/pandas-datareader/issues/487

It suggests using the “fix_yahoo_finance” package to solves the problem – although the official fix should have been integrated into pandas_datareader.

You could either try updating your pandas_datareader with the following command in the command prompt:

Or you could follow the advice on the above link and add the below lines and your script should work.

Make sure you have pip installed fix_yahoo_finance already.

Hi and thank you for your post, it is very interesting approach!

Have you tried working with Futures? I would like to apply a similar logic to oil futures. For example you have the prices for September and December as pair AND you get the data for the Sep-Dec 2018,2017,2016 contracts and so on.

I would like to use for example the 2013-2017 historical timeseries as training set and then the 2018 timeseries as a test set. The problem is this is not a continuous timeseries, ie the 2013 might close with oil at Sep, Dec= 60, 55 and the 2014 might start at Sep, Dec= 80,75.

How would you merge and normalize these series together before feeding them into your model?

Thanks!

Hi Pete, thanks for your comment and thanks for the kind words – its nice to hear you find it of interest.

There are a number of ways to deal with creating a “continuous” futures contract but they all have their pros and cons – with one of the methods perhaps being seen as the “best” way forward (that would be the “perpetual” method).

There is a great blog post doing just what you are looking for – it runs through an example of combining oil futures contracts together – take a look here https://www.quantstart.com/articles/Continuous-Futures-Contracts-for-Backtesting-Purposes and if you have any follow up questions I would be more than happy to assist.

Thanks very much for your article, great material !

There is however one line I don’t understand:

where does this come from ?

Why not: (df1[‘spread’] – df1[‘spread’].shift(1)) / (df1[‘spread’].shift(1)) ?

Many thanks !

Hi, I am having trouble pulling down the data.

Has the syntax changed?

#scrape html from website and store 3rd DataFrame as our stock tickers – this is dictated to us by the structure of the html

stock_list = pd.read_html(“https://www.marketwatch.com/tools/industry/stocklist.asp?bcind_ind=9535&bcind_period=3mo”)[3]

#convert the DataFrame of stocks into a list so we can easily iterate over it

stocks = stock_list[1].dropna()[1:].tolist()

IndexError Traceback (most recent call last)

in

1 #scrape html from website and store 3rd DataFrame as our stock tickers – this is dictated to us by the structure of the html

—-> 2 stock_list = pd.read_html(“https://www.marketwatch.com/tools/industry/stocklist.asp?bcind_ind=9535&bcind_period=3mo”)[3]

3 #convert the DataFrame of stocks into a list so we can easily iterate over it

4 stocks = stock_list[1].dropna()[1:].tolist()

IndexError: list index out of range

Hi Andrew, I’m afraid so… The html structure of the page has changed numerous times and it’s difficult to keep the code updated.

from pandas_datareader import data as pdr

import yfinance as yf

yf.pdr_override() # <== that’s all it takes 🙂

NYSE

url_nyse = “http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download”

Nasdaq

url_nasdaq = “http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download”

AMEX

url_amex = “http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=amex&render=download”

import pandas as pd

df = pd.DataFrame.from_csv(url_nyse)

stocks = df.index.tolist()

The above is how to get the stocklist-

I just cant port it to your code.

The below is what i have tried-

from pandas_datareader import data as pdr

import yfinance as yf

yf.pdr_override() # <== that’s all it takes 🙂

NYSE

url_nyse = “http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download”

Nasdaq

url_nasdaq = “http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download”

AMEX

url_amex = “http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=amex&render=download”

import pandas as pd

df = pd.DataFrame.from_csv(url_nyse)

stocks = df.index.tolist()

df_list = []

used_stocks = []

for stock in stocks:

try:

data = pdr.get_data_yahoo(stock, start=”2017-01-01″, end=”2017-04-30″)

data_close=data[‘Close’]

df_list.append(data_close)

used_stocks.append(stock)

except (KeyError, ValueError):

print(“Error”)

df_list

Ah right apologies I am replying on my phone and I thought this comment had been made on a different blog post… Yeah the yahoo download should be easy to fix as you mention. How is your code not working? Are you getting any error messages? As I said, I’m currently typing on my mobile phone so can’t run the code myself just now.

when there is no data for the query. it comes up with a traceback error rather than catching the error.

I am pretty close, i am just not sure how to catch the traceback error.

cheers,

Andrew

You could just use “pass” instead of catching it… Might get you up and running for the mean time

Hi yer, I tried pass but for some reason it kept coming up with a traceback error.

I thought it was pretty strange behaviour.

Best,

Andrew

Hi @S666,

I was wondering if you could show were to add transaction fees in the back test.

If we could just do a simple fee per trade that would account for slippage and transaction costs

it would bring more realism to the back test.

Best,

Andrew

Hi @S666

I was wondering how do we put a fee per trade made in the back test section.

We could use the fee to account for slippage and trading costs.

It would make the back test more realistic.

Best,

Andrew

Hi Andrew,

I’ll have to have a think about this one as the strategy logic wasn’t really designed or built with the inclusion of commissions and slippage etc in mind. If I can think of a decent quick fix, I’ll try to find time to post it. I may actually make my next post all about those “extra” bits that go into a backtest that are usually ommited and most people tend to ignore…things precisely like slippage and commissions

Hi So I was able the data issue.

Now i am running into problems with the trading logic-

results = []

for pair in pairs:

rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1])

results.append(rets)

print(“The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}”.format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))

rets.plot(figsize=(20,15),legend=True)

home/andrewcz/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.

return ptp(axis=axis, out=out, **kwargs)

/home/andrewcz/.local/lib/python3.7/site-packages/ipykernel_launcher.py:37: SettingWithCopyWarning:

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

KeyError Traceback (most recent call last)

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)

2656 try:

-> 2657 return self._engine.get_loc(key)

2658 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ‘short entry’

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)

in

1 results = []

2 for pair in pairs:

—-> 3 rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1])

4 results.append(rets)

5 print(“The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}”.format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))

in backtest(df, s1, s2)

38 df1[‘num units long’] = df1[‘num units long’].fillna(method=’pad’) #set up num units short df1[‘short entry’] = ((df1.zScore > entryZscore) & ( df1.zScore.shift(1) < entryZscore))

39 df1[‘short exit’] = ((df1.zScore < exitZscore) & (df1.zScore.shift(1) > exitZscore))

—> 40 df1.loc[df1[‘short entry’],’num units short’] = -1

41 df1.loc[df1[‘short exit’],’num units short’] = 0

42 df1[‘num units short’][0] = 0

~/.local/lib/python3.7/site-packages/pandas/core/frame.py in

getitem(self, key)2925 if self.columns.nlevels > 1:

2926 return self._getitem_multilevel(key)

-> 2927 indexer = self.columns.get_loc(key)

2928 if is_integer(indexer):

2929 indexer = [indexer]

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)

2657 return self._engine.get_loc(key)

2658 except KeyError:

-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))

2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

2661 if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ‘short entry’

Hi there – I have had a quick look and it is due to some incorrect formatting in the code above – there are some “new line” breaks that aren’t being recognised – let me fix it now and I will message again when done.

Ok try cutting and pasting the code again – I believe I have corrected the problem. If it still doesn’t work, let me know.

Ah cheers mate much appreciated!

Did you also change the formatting in the cell above with the back test?

Many thanks,

Best,

Andrew

The cell with the backtest function was the cell I changed, only that one…

Mate you are awesome!

worked like a charm.

A question, how would I add realistic fee’s to add further accuracy to the back test?

Best,

Andrew

No probs – glad that worked 😉

In terms of adding a “fees” component, it can be done a number of ways…I guess it depends on which assets you are planning to trade and how ttheir real life fees/commissions etc are structured. Which assets are you considering?

Well, I was thinking of just adding a general cost that would take care of slippage and transaction costs.

If I could just add the cost into the backtest if would just give a general idea of how costs would affect the sharp ratio.

I was just wondering on what line i would add the cost component.

Best,

Andrew

Also in the back test, where is the line that sets the initial value for the portfolio?

Best,

Andrew

This is great, thank you. Looking forward to testing. I’m having the syntax issue Andrew Czeizler had with fetching urls. I’m very new to coding and not sure how to get there. I created my own watch list on MarketWatch as well as trying the exchange downloads as Andrew suggested but with no progress. The MarketWatch list returns an error with ‘no tables found’. The links Andrew tried return with a syntax error for each of the urls, ‘invalid character in identifier’. Any tips would be greatly appreciated.

David

Hi David, when you just run the code as is on the site, what error message do you get?

In cell 2 (scrape html from website), I get ‘IndexError: list index out of range’ when copied/pasted. I haven’t gotten beyond that point.

Hi there,

Ok try this – replace the code in teh second cell down to line6 with the following:

That scrapes all the NYSE stock tickers – its a LOT of tickers so you may like to be a bit more selective as it will take quite a long time to run that many stocks through the data download.

Hmm same error. I wonder if there’s a module I have not imported or installed. Maybe something so common that you wouldn’t have needed to specify it. I added all code into Jupyter and have the following: Cell 2: list index out of range. Cell 3: name ‘df’ is not defined. Cell 5: name ‘df’ is not defined. Cell 6: name ‘pairs’ is not defined. Cell 9: name ‘pairs’ is not defined. Cell 10: No objects to concatenate. Cell 11: name ‘final_res’ is not defined.

Its a little bit trickey with not being able to see your code or error messages, but perhaps try doing what Andrew did in the above comments and change the download provider for “iex” to “yahoo” and see if that gets you any further.

the below code downloads the ticker data.

Hope this helps.

Best,

Andrew

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib as mpl

mpl.style.use(‘bmh’)

import pandas_datareader.data as web

import matplotlib.pylab as plt

from datetime import datetime

import statsmodels.api as sm

from pykalman import KalmanFilter

from math import sqrt

from pandas_datareader import data as pdr

import yfinance as yf

yf.pdr_override() # <== that’s all it takes 🙂

import pandas as pd

data = pd.read_html(‘https://en.wikipedia.org/wiki/List_of_S%26P_500_companies’)

table = data[0]

table.head()

sliced_table = table[1:]

header = table.iloc[0]

corrected_table = sliced_table.rename(columns=header)

corrected_table

tickers = corrected_table[‘MMM’].tolist()

print(tickers)

tickers=tickers[0:30]

#dowload ticker data and get closing prices

data = yf.download(tickers, start=”2014-01-01″, end=”2019-04-30″)

df=data[‘Close’]

df.plot(figsize=(20,10))

Many thanks for adding that and contributing! Very much appreciated…

mate your blog is awesome!

thank you!

Super excited about future articles.

I was just wondering if there could be articles on transaction costs and running an algorithm live.

These two topics seem very difficult to find good, practical information.

Best,

Andrew

Will do mate, I’ll make those both the subject of my next post 😀

Thank you both!

You mentioned being a bit more selective rather than looking at all tickers on an exchange. How should I do this? Having trouble understanding which pair is being referred to in the final equity curve.

I guess because I have an error with the heat map not printing. NameError: name ‘used_stocks’ is not defined.

df1[‘spread pct ch’] = (df1[‘spread’] – df1[‘spread’].shift(1))

Re above, I think there is forward bias here. Spread here is based on the hedge ratio which is updated on daily basis. So the daily mark to market pnl should be based on spread by t – 1 hedge ratio but not on t ratio, I.e settle your existing pair portfolio before getting into a new one.

When I run the following code:

results = []

for pair in pairs:

rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1])

results.append(rets)

print(“The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}”.format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))

rets.plot(figsize=(20,15),legend=True)

I get the error

TypeError Traceback (most recent call last)

in

2

3 for pair in pairs:

—-> 4 rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1])

5 results.append(rets)

6 print(“The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}”.format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))

TypeError: cannot unpack non-iterable NoneType object

Do you not what is wrong?

Thanks in advance!

So it looks like your backtest function is returning “None” instead of the 3 variables it is supposed to. Have you altered the last line of the backtest function that deals with the return statement at all?

highly recommend you translate the strategy into shares and using round lots. You will find the results will be completely different. You will find the majority of winning trades will actually be significant losses.

Absolutely agree, the results will change fundemantally once the strategy logic is refined further to include those kinds of “pesky realities”!! Things such as having to trade round lots, not having an endless pit of money to keep altering position sizes with no idea of total inflow needed, having to cross bid/offer spread, slippage and brokerage costs/commissions are just a few examples off the top of my head…

Nicely done 🙂

So what would be the calculation for the forecast error here? Would this simply be the spread?

Hi! quick question! I’m trying to implement the program but the cointegration function seems to give different output. You have any idea why is this happening? Any suggestion would be highly appreciated. Thanks in advance!

Hi Vinayak – may I ask, when you say it gives “different output” may I ask what exactly is being returned and how is it different?

Hi Stuart – Thanks for the great posts. I’m really enjoying this one in particular 🙂 However, I’m getting the pesky “SeettingWithCopyWarning” on every pair when I run the backtest function. Has something changed in Pandas that would trigger this error? Thank you

Hi,

I liked the blog and the content above “MEAN REVERSION PAIRS TRADING WITH INCLUSION OF A KALMAN FILTER”.

However, I am new to Python and I want to make sure that I am not lost during the flow. ı would like to especially understand why you used -1.4 below in CAGR calculation:

CAGR = round(((float(end_val) / float(start_val)) ** (252.0/days)) – 1,4)

Thanks in advance for taking time to reply.

Best,

Huseyin

Hi Huseyin, it is actually “-1, 4” not “-1.4” – i.e. with a comma, not a point. The 4 is an input to the “round” function, meaning I am just rounding the value to 4 decimal places. The “-1” is part of the return calculation.

Hi,

I liked the blog and the content above “MEAN REVERSION PAIRS TRADING WITH INCLUSION OF A KALMAN FILTER”.

Do you have an idea to optimize the “entryZscore” level?

regards

SR

The value of 2 for the entryZscore represents 2 standard deviations, which is representative of capturing around 95% of the expected values within the +/- 2SD range (it is actually nearer 1.96 to be exact but 2 is often used). This is based on the characteristics of the Standard Normal Distribution (https://en.wikipedia.org/wiki/1.96). As we are dealing here with the “mean of the sample” values – then the Central Limit Theorem applies (https://en.wikipedia.org/wiki/Central_limit_theorem).

If you are asking more about the structure of how to code the logic to optimise the input value, then the easiest way is to just brute force it through some for loops, changing the value each time and recording the outcomes (although you must be aware of over-fitting this way and understand the pitfalls of oversampling etc.)

This is a very through and practical implementation of pair trading!

Very useful and well-written framework!

Thank you for the nice work!

Not sure I’m following the rationale here. I’m understanding the comment to ask whether uneven (volatility-adjusted) position sizing should be detrimental to profitability – is that correct? Shouldn’t it basically be a coin flip in terms of which stock drives most of the profitability? In other words, if you’re forced to have uneven exposure, it will be 50/50 in terms of whether this will hurt you or not. Maybe at worst, you could always err on the side of having more long exposure since stocks tend to go up over time.