Hi all, welcome back. This blog post is going to deal with creating the initial stages of our **Python backtesting mean reversion** script – we’re going to leave the “symbol pairs” function we created in the last post behind for a bit (we’ll come back to it a bit later) and use a single pair of symbols to run our first few stages of the backtest to keep it simple.

Once we have the script working for a single input of one pair of symbols, we can very easily adapt it at the end to work with the symbol pairs function created previously.

To give you all a brief outline of what we will be tackling in this post, here’s a quick list of steps:

1) Define our symbol pair, download the relevant price data from Yahoo Finance and make sure the data downloaded for each symbol is of the same length.

2) Plot the two ETF price series against each other to get a visual representation, then run a Seaborn “jointplot” to analyse the strength of correlation between the two series.

3) Run an Ordinary Least Squares regression on the closing prices to calculate a hedge ratio. Use the hedge ratio to generate the spread between the two prices, and then plot this to see if it looks in any way mean reverting.

4) Run an Augmented Dickey Fuller test on the spread to confirm statistically whether the series is mean reverting or not. We will also calculate the Hurst exponent of the spread series.

5) Run an Ordinary Least Squares regression on the spread series and a lagged version of the spread series in order to then use the coefficient to calculate the half-life of mean reversion.

Right, now let’s get to some code. Time to import the relevant modules we will need, set our ETF ticker symbols and download the price data from Yahoo Finance.

```python
#import needed modules
from datetime import datetime
from pandas_datareader import data
import pandas as pd
import numpy as np
from numpy import log, polyfit, sqrt, std, subtract
import statsmodels.tsa.stattools as ts
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import pprint

#choose ticker pair for our testing
symbList = ['EWA','EWC']
start_date = '2012/01/01'
end_date = datetime.now()

#download data from Yahoo Finance
y = data.DataReader(symbList[0], "yahoo", start=start_date, end=end_date)
x = data.DataReader(symbList[1], "yahoo", start=start_date, end=end_date)

#rename column to make it easier to work with later
y.rename(columns={'Adj Close':'price'}, inplace=True)
x.rename(columns={'Adj Close':'price'}, inplace=True)

#make sure DataFrames cover the same date range
min_date = max(df.dropna().index[0] for df in [y, x])
max_date = min(df.dropna().index[-1] for df in [y, x])
y = y[(y.index >= min_date) & (y.index <= max_date)]
x = x[(x.index >= min_date) & (x.index <= max_date)]
```


Now that we have our price data stored in DataFrames, let’s just bring up a quick plot of the two series to see what information we can gather

```python
plt.plot(y.price, label=symbList[0])
plt.plot(x.price, label=symbList[1])
plt.ylabel('Price')
plt.xlabel('Time')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
```


It looks like the prices are definitely correlated to a degree, and generally move in the same direction. But just how strong is this correlation? Well there’s an easy way to get a clearer visual representation of this…we can just use a Seaborn “jointplot” as follows:

```python
sns.jointplot(y.price, x.price, color='b')
plt.show()
```


We can see from the information provided in the jointplot that the Pearson Correlation coefficient is 0.87 – so we can see that there is definitely a pretty strong correlation here between the price series. This sets the pair up as a potentially good fit for a mean reversion strategy.
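As a quick cross-check, the same Pearson coefficient can be computed directly in pandas; the two series below are synthetic stand-ins so the snippet runs on its own (with the real data you would pass y['price'] and x['price']):

```python
import numpy as np
import pandas as pd

# synthetic stand-ins for the two ETF price series (illustrative only):
# a random-walk "price" and a noisy linear function of it
rng = np.random.default_rng(42)
y_price = pd.Series(100 + np.cumsum(rng.normal(0, 1, 500)))
x_price = 0.9 * y_price + pd.Series(rng.normal(0, 1, 500))

# Pearson correlation coefficient - the same figure Seaborn
# annotates on the jointplot
corr = y_price.corr(x_price)
print('Pearson correlation =', round(corr, 2))
```

Seaborn uses the Pearson coefficient by default, so this should match the number shown on the jointplot.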

What we need to do now is create the spread series between the two prices by first running a linear regression analysis between the two price series.

```python
#run Ordinary Least Squares regression to find hedge ratio
#and then create spread series
df1 = pd.DataFrame({'y': y['price'], 'x': x['price']})
est = sm.OLS(df1.y, df1.x)
est = est.fit()
df1['hr'] = -est.params.iloc[0]
df1['spread'] = df1.y + (df1.x * df1.hr)
```


Right so we have now managed to run the regression between the ETF price series; the beta coefficient from this regression was then used as the hedge ratio to create the spread series of the two prices.

If we plot the spread series we get the following:

```python
plt.plot(df1.spread)
plt.show()
```


So it looks relatively mean reverting. But sometimes looks can be deceiving, so really it would be great if we could run some statistical tests on the spread series to get a better idea. The test we will be using is the Augmented Dickey Fuller test. You can have a quick read up about it here if you need to refresh your memory:

```python
cadf = ts.adfuller(df1.spread)
print('Augmented Dickey Fuller test statistic =', cadf[0])
print('Augmented Dickey Fuller p-value =', cadf[1])
print('Augmented Dickey Fuller 1%, 5% and 10% test statistics =', cadf[4])
```


This gets us the following:

```
Augmented Dickey Fuller test statistic = -3.21520854685
Augmented Dickey Fuller p-value = 0.0191210549486
Augmented Dickey Fuller 1%, 5% and 10% test statistics =
{'5%': -2.86419037625175, '1%': -3.436352507699052, '10%': -2.5681811468354598}
```

From this we can see that the test statistic of -3.215 is larger in absolute terms than the 10% critical value of -2.568 and the 5% critical value of -2.864, but not the 1% critical value of -3.436. This means we can reject the null hypothesis that there is a unit root in the spread series (i.e. that the series is not mean reverting) at the 10% and 5% significance levels, but not at the 1% level.

The p-value of 0.0191 means we can reject the null hypothesis at any significance level down to 1.91%. That’s pretty good in terms of statistical significance, and from this we can be fairly confident that the spread series does in fact possess mean-reverting qualities.

The last thing we will do is run a quick function to calculate the Hurst exponent of the spread series.

For info on the Hurst Exponent please refer to: this article

To simplify things, the important info to remember here is that a time series can be characterised in the following manner with regard to the Hurst exponent (H):

H < 0.5 – The time series is mean reverting

H = 0.5 – The time series is a random walk (e.g. a Geometric Brownian Motion)

H > 0.5 – The time series is trending

I have “borrowed” a code snippet of a Hurst Exponent function found on the www.quantstart.com blog (great site by the way – definitely worth checking out).

The post containing the “borrowed” code can be found here:

and here is the code:

```python
def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts"""
    # Create the range of lag values
    lags = range(2, 100)
    # Calculate the array of the variances of the lagged differences
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    # Use a linear fit to estimate the Hurst Exponent
    poly = polyfit(log(lags), log(tau), 1)
    # Return the Hurst exponent from the polyfit output
    return poly[0]*2.0
```

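As a quick sanity check on the function (along the lines of the original QuantStart post), we can run it on synthetic series whose behaviour we know in advance; the three series below are illustrative stand-ins, not data from our backtest:

```python
import numpy as np
from numpy import log, polyfit, sqrt, std, subtract

def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts"""
    lags = range(2, 100)
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    poly = polyfit(log(lags), log(tau), 1)
    return poly[0] * 2.0

rng = np.random.default_rng(1)
gbm = log(np.cumsum(rng.normal(0, 1, 10000)) + 1000)  # random walk: H ~ 0.5
mr = log(rng.normal(0, 1, 10000) + 1000)              # mean reverting: H ~ 0
tr = log(np.cumsum(rng.normal(1, 1, 10000)) + 1000)   # trending: H ~ 1

print('Hurst (random walk) =', round(hurst(gbm), 2))
print('Hurst (mean reverting) =', round(hurst(mr), 2))
print('Hurst (trending) =', round(hurst(tr), 2))
```

The estimates land roughly where the characterisation above says they should, which gives us some confidence before pointing the function at our spread.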

Now we run the function on the spread series:

```python
print("Hurst Exponent =", round(hurst(df1.spread), 2))
```


Running this, the Hurst Exponent comes in under the 0.5 value of a random walk, so we can conclude that the series is mean reverting, which backs up our conclusion from the Augmented Dickey Fuller test. Good! This means the spread series looks like a definite candidate for a mean reversion strategy.

However just because a time series displays mean reverting properties, it doesn’t necessarily mean that we can trade it profitably – there’s a difference between a series that deviates and mean reverts every week and one that takes 10 years to mean revert. I’m not sure too many traders would be willing to sit and wait around for 10 years to close out a trade profitably.

To get an idea of how long each mean reversion is going to take, we can look into the “half-life” of the time series. Please click here for more info on half-life.

We can calculate this by running a linear regression between the spread series and a lagged version of itself. The Beta coefficient produced by this regression can then be incorporated into the Ornstein-Uhlenbeck process to calculate the half-life.

Please see here for some more info.

The code to calculate this is as follows:

```python
#Run OLS regression on spread series and lagged version of itself
spread_lag = df1.spread.shift(1)
spread_lag.iloc[0] = spread_lag.iloc[1]
spread_ret = df1.spread - spread_lag
spread_ret.iloc[0] = spread_ret.iloc[1]
spread_lag2 = sm.add_constant(spread_lag)
model = sm.OLS(spread_ret, spread_lag2)
res = model.fit()
halflife = round(-np.log(2) / res.params.iloc[1], 0)
print('Halflife =', halflife)
```


According to this result, the half-life of mean reversion is 40 days. That’s not too bad, and not so long as to automatically exclude the pair from consideration for a mean reversion strategy. Ideally the half-life would be as short as possible so as to provide us with more frequent trading opportunities, but there you have it: 40 days is what we have.

OK, I think I’ll cut it here as this is getting a little long. Next post we will work on producing a normalised “Z-Score” series that lets us see the deviation away from the local mean in terms of standard deviations. We will then begin the actual backtest itself using Pandas and see if we can produce something that is in any way profitable.
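As a rough taste of what that will look like, a rolling Z-Score can be sketched as follows; the spread here is synthetic so the snippet stands alone, and the 40-day window is just an illustrative choice (mirroring the half-life we found), not a tuned parameter:

```python
import numpy as np
import pandas as pd

# synthetic oscillating "spread" standing in for df1.spread
rng = np.random.default_rng(2)
spread = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.2, 500))

window = 40  # illustrative lookback; could be tied to the half-life
mean = spread.rolling(window=window).mean()
stdev = spread.rolling(window=window).std()

# deviation from the local mean, in units of local standard deviation
zscore = (spread - mean) / stdev
print(zscore.dropna().tail())
```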

If anyone has any comments, please leave them below – always eager to hear the thoughts of others.

Until next time!