I thought I’d finish off this short series of Python Backtesting Mean Reversion by providing a full, executable script that incorporates the use of SQL queries to extract our ticker symbols from the SQLite database we created in an earlier post (This can be found here)
In this particular example I have decided to run a series of backtests on ticker symbols from the database, based upon a “Niche” of “Gold Miners”. In theory this selection of tickers based upon some non-arbitrary, meaningful grouping criteria should allow us to focus in on pairs of symbols that are more likely to have statistically meaningful co-integration of prices series.
To test other groups of symbols, all we have to do is update the SQL query based upon whichever criteria we are interested in….just as an example, we could change the query to select symbols based upon an “Asset Class” of “Fixed Income”. The possible combinations are obviously plentiful, and we could even incorporate multiple criteria at once to hone our focus even further.
So, after following the instructions to create our database, one should be able to just copy and paste the entire code below into a python script and it will run through all the backtests in one go:
#import needed modules #import needed modules from datetime import datetime from pandas_datareader import data import pandas as pd import numpy as np from numpy import log, polyfit, sqrt, std, subtract import statsmodels.tsa.stattools as ts import statsmodels.api as sm import matplotlib.pyplot as plt import seaborn as sns import pprint import sqlite3 as db #set the database file path we wish to connect to #this will obviously be unique to wherever you created #the SQLite database on your local system database = 'C:\Users\S666\sqlite_databases\etfs.db' #this is the SQL statement containing the information #regarding which tickers we want to pull from the database #As an example I have chosen to pull down all tickers which #have their "Focus" listed as being "Silver" sql = 'SELECT Ticker FROM etftable WHERE Niche = "Value";' #create a connection to the database specified above cnx = db.connect(database) cur = cnx.cursor() #execute the SQL statement and place the results into a #variable called "tickers" tickers = pd.read_sql(sql, con=cnx) #create an empty list symbList = [] #iterate over the DataFrame and append each item into the empty list for i in xrange(len(tickers)): symbList.append(tickers.ix[i][0]) def get_symb_pairs(symbList): '''symbList is a list of ETF symbols This function takes in a list of symbols and returns a list of unique pairs of symbols''' symbPairs = [] i = 0 #iterate through the list and create all possible combinations of #ticker pairs - append the pairs to the "symbPairs" list while i < len(symbList)-1: j = i + 1 while j < len(symbList): symbPairs.append([symbList[i],symbList[j]]) j += 1 i += 1 #iterate through the newly created list of pairs and remove any pairs #made up of two identical tickers for i in symbPairs: if i[0] == i[1]: symbPairs.remove(i) #create a new empty list to store only unique pairs symbPairs2 = [] #iterate through the original list and append only unique pairs to the #new list for i in symbPairs: if i not in symbPairs2: symbPairs2.append(i) return symbPairs2 symbPairs = get_symb_pairs(symbList) def backtest(symbList): start_date = '2012/01/01' end_date = datetime.now() #download data from Yahoo Finance y=data.DataReader(symbList[0], "yahoo", start=start_date, end=end_date) x=data.DataReader(symbList[1], "yahoo", start=start_date, end=end_date) #rename column to make it easier to work with later y.rename(columns={'Adj Close':'price'}, inplace=True) x.rename(columns={'Adj Close':'price'}, inplace=True) #make sure DataFrames are the same length min_date = max(df.dropna().index[0] for df in [y, x]) max_date = min(df.dropna().index[-1] for df in [y, x]) y = y[(y.index >= min_date) & (y.index <= max_date)] x = x[(x.index >= min_date) & (x.index <= max_date)] ############################################################ plt.plot(y.price,label=symbList[0]) plt.plot(x.price,label=symbList[1]) plt.ylabel('Price') plt.xlabel('Time') plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) plt.show() ############################################################# sns.jointplot(y.price, x.price ,color='b') plt.show() ############################################################# #run Odinary Least Squares regression to find hedge ratio #and then create spread series df1 = pd.DataFrame({'y':y['price'],'x':x['price']}) est = sm.OLS(df1.y,df1.x) est = est.fit() df1['hr'] = -est.params[0] df1['spread'] = df1.y + (df1.x * df1.hr) ############################################################## plt.plot(df1.spread) plt.show() ############################################################## cadf = ts.adfuller(df1.spread) print 'Augmented Dickey Fuller test statistic =',cadf[0] print 'Augmented Dickey Fuller p-value =',cadf[1] print 'Augmented Dickey Fuller 1%, 5% and 10% test statistics =',cadf[4] ############################################################## def hurst(ts): """Returns the Hurst Exponent of the time series vector ts""" # Create the range of lag values lags = range(2, 100) # Calculate the array of the variances of the lagged differences tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags] # Use a linear fit to estimate the Hurst Exponent poly = polyfit(log(lags), log(tau), 1) # Return the Hurst exponent from the polyfit output return poly[0]*2.0 ############################################################## print "Hurst Exponent =",round(hurst(df1.spread),2) ############################################################## #Run OLS regression on spread series and lagged version of itself spread_lag = df1.spread.shift(1) spread_lag.ix[0] = spread_lag.ix[1] spread_ret = df1.spread - spread_lag spread_ret.ix[0] = spread_ret.ix[1] spread_lag2 = sm.add_constant(spread_lag) model = sm.OLS(spread_ret,spread_lag2) res = model.fit() halflife = round(-np.log(2) / res.params[1],0) if halflife <= 0: halflife = 1 print 'Halflife = ',halflife ############################################################## meanSpread = df1.spread.rolling(window=halflife).mean() stdSpread = df1.spread.rolling(window=halflife).std() df1['zScore'] = (df1.spread-meanSpread)/stdSpread df1['zScore'].plot() plt.show() ############################################################## entryZscore = 2 exitZscore = 0 #set up num units long df1['long entry'] = ((df1.zScore < - entryZscore) & ( df1.zScore.shift(1) > - entryZscore)) df1['long exit'] = ((df1.zScore > - exitZscore) & (df1.zScore.shift(1) < - exitZscore)) df1['num units long'] = np.nan df1.loc[df1['long entry'],'num units long'] = 1 df1.loc[df1['long exit'],'num units long'] = 0 df1['num units long'][0] = 0 df1['num units long'] = df1['num units long'].fillna(method='pad') #set up num units short df1['short entry'] = ((df1.zScore > entryZscore) & ( df1.zScore.shift(1) < entryZscore)) df1['short exit'] = ((df1.zScore < exitZscore) & (df1.zScore.shift(1) > exitZscore)) df1.loc[df1['short entry'],'num units short'] = -1 df1.loc[df1['short exit'],'num units short'] = 0 df1['num units short'][0] = 0 df1['num units short'] = df1['num units short'].fillna(method='pad') df1['numUnits'] = df1['num units long'] + df1['num units short'] df1['spread pct ch'] = (df1['spread'] - df1['spread'].shift(1)) / ((df1['x'] * abs(df1['hr'])) + df1['y']) df1['port rets'] = df1['spread pct ch'] * df1['numUnits'].shift(1) df1['cum rets'] = df1['port rets'].cumsum() df1['cum rets'] = df1['cum rets'] + 1 ############################################################## try: sharpe = ((df1['port rets'].mean() / df1['port rets'].std()) * sqrt(252)) except ZeroDivisionError: sharpe = 0.0 plt.plot(df1['cum rets']) plt.xlabel(i[1]) plt.ylabel(i[0]) plt.show() ############################################################## start_val = 1 end_val = df1['cum rets'].iat[-1] start_date = df1.iloc[0].name end_date = df1.iloc[-1].name days = (end_date - start_date).days CAGR = round(((float(end_val) / float(start_val)) ** (252.0/days)) - 1,4) print "CAGR = {}%".format(CAGR*100) print "Sharpe Ratio = {}".format(round(sharpe,2)) print 100*"----" for i in symbPairs: backtest(i)
So hopefully this makes it as easy as possible for anyone reading this to just copy and paste and run this whole thing locally.
I must admit, the backtest results are far from amazing, and many, many of them end in negative territory. However there are a couple of pairs in there that generate quite a smooth equity curve with an upward trajectory (and this is just from the “Niche” = “Gold Miners” run, returning about 4-5% CAGR with Sharpe rations of 1.9 – 2.5. For an example, check out [GDX, RING] and [GDX, PSAU] ticker pairs.
Of course, we havn’t made any allowance for transaction costs so those small gains could well end up in negative territory also, depending on how many trades need to be carried out.
I think I will leave it here for this series; I think I’ve covered pretty much everything that i wanted to cover and have ended up with a fully working, executable script to carry out basic, initial tests on ETF pairs to identify any possible candidates for a mean-reversion strategy.
Onto something new for next time…until then!
15 comments
Apologies, there were some copy and paste issues with the code, for example “&” was showing up in its “amp” format – this has been rectified now and the above code should work straight out of the box.
I also added a “Try/Except” for the calculation of the Sharpe ratio which will deal with the situation where there were no trades made at all, the standard deviation is therefore 0, and avoids a “division by Zero” error.
Hope the above is helpful.
Hey, very useful code. I am trying to replicate the same and noticed a symbol that might have been missed. In the entry-exit part of the code, could you confirm if you have missed a ” – ” before the ‘exitZscore’?
df1[‘long exit’] = ((df1.zScore > – exitZscore) & (df1.zScore.shift(1) < exitZscore))
^
Good spot, I think you’re right…although my choice to have 0 as the exit zScore means that in this particular case, the results aren’t affected as -0 is the same as 0.
If the exit zScore was set to anything other than 0 then yes, there should be a minus sign in front of the code as you pointed out.
In fact if you wanted more flexibility regarding the entry and exit zScores (for example if you wanted the exit zScore to be a different sign than the entry zScore for either long or short trades), perhaps it would be better to set up specific entry and exit zScores for long positions and short positions separately, i.e. have 4 separate zScore levels.
In its current format the flexibility is limited.
Thanks!
I am kind of stuck at the signals part of the code and it’s hard to explain it over here. Any advice from your side would be of great help to me. If it isn’t too much trouble, do let me know how I can contact you or drop me mail at sai_prem2000@yahoo.com.
No problem, I’ve sent you an email 😀
Hi,
new to python. Tried running your script on two securities in spyder (python 2.7 windows) and ran into a value error (details below) after printing halflife at:
meanSpread = df1.spread.rolling(window=halflife).mean()
File “C:\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 866, in runfile
execfile(filename, namespace)
File “C:\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 87, in execfile
exec(compile(scripttext, filename, ‘exec’), glob, loc)
File “…/Mean Reversion Testing_v2.py”, line 108, in
meanSpread = df1.spread.rolling(window=halflife).mean()
File “C:\Anaconda2\lib\site-packages\pandas\core\generic.py”, line 5182, in rolling
center=center, win_type=win_type, axis=axis)
File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 1524, in rolling
return Rolling(obj, **kwds)
File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 59, in __init__
self.validate()
File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 843, in validate
raise ValueError(“window must be an integer”)
ValueError: window must be an integer
Any suggestion for fixing this problem would be appreciated. Thanks
Hi there, I have rerun my code and I am getting the same error – luckily it is a very easy fix! The reason you are getting the error message is that the “rolling mean” Pandas function only accepts an integer as a window, whereas we are currently passing it a float with multiple decimal places.
The way to fix this is by changing the line:
halflife = round(-np.log(2) / res.params[1],0)
to:
halflife = int(round(-np.log(2) / res.params[1],0))
This will cast the halflife variable as an integer and that should then work when passed to the rolling mean function.
Let me know if that solves your problem 😀
Hi, great work, I really enjoy reading your blog and to simulate the code.
I am getting rolling error as well but a different one. any insight?
—————————————————————————
AttributeError Traceback (most recent call last)
in ()
—-> 1 meanSpread = df1.spread.rolling(window=halflife).mean()
2 stdSpread = df1.spread.rolling(window=halflife).std()
3
4 df1[‘zScore’] = (df1.spread-meanSpread)/stdSpread
5
/usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2244 return self[name]
2245 raise AttributeError(“‘%s’ object has no attribute ‘%s'” %
-> 2246 (type(self).__name__, name))
2247
2248 def __setattr__(self, name, value):
AttributeError: ‘Series’ object has no attribute ‘rolling’
Hi there – this is a really strange error message as a Pandas Series object does have a “rolling” attribute. Could you run the following code for me please and let me know the output:
It seems there is an error in your code.
def get_symb_pairs(symbList):
symbPairs = []
i=0
while i < len(symbList)-1:
j=i+1
while j <len(symbList):
symbPairs.append([symbList[i], symbList[j] ])
j += 1
i += 1
Does not calculate "all possible combinations of picker pairs". I checked the list symbPairs. Happy coding!
Is there any reason why you used:
df1[‘spread pct ch’] = (df1[‘spread’] – df1[‘spread’].shift(1)) / ((df1[‘x’] * abs(df1[‘hr’])) + df1[‘y’])
instead of simply df1[‘spread’].pct_change(1)? Also, why do you need the absolute value for the hedge ratio ‘abs(df1[‘hr’])’ in the above formula? Isn’t the spread defined as y + hr*x where hr=-beta? I thought the pct ch is defined as (V_t – V_t-1) / V_t
Another question, in these lines:
#run Odinary Least Squares regression to find hedge ratio
#and then create spread series
df1 = pd.DataFrame({‘y’:y[‘price’],’x’:x[‘price’]})
est = sm.OLS(df1.y,df1.x)
est = est.fit()
df1[‘hr’] = -est.params[0]
df1[‘spread’] = df1.y + (df1.x * df1.hr)
Can I ask why you didn’t add a constant to the x before running the linear regression? I would expect you need to add a constant like you did later on for ‘spread_lag2 = sm.add_constant(spread_lag)’?
Hi Jamieson –
thanks for your work so far, i find it very useful and practical above all.
I am trying to run your code above and I encountered some issues when calculating the halflife spread. I am looking at a different pair of stock ‘AMG’ and ‘AFL’. The hallife of the spread between the two is negative and equal to -405days. Now if I set a negative halflife equal to to 1 (As per your code above) I have no rolling standard deviation. All values will be equal “nan” and it wont be possible to compute the Z-score.
Does it make sense to have a negative half life? Shouldn’t this be always positive representing a number of days? If so why would you set half life equal to 1 for negative values when the real half life might be bigger (like -405 days in my example)?
Look forward to hearing your opinion.
Thanks a lot again
I’m just wondering if there is look ahead bias introduced in the backtest.
You calculate cadf and hedge ratio from the full sample.
However you execute trades before cadf and hedge ratio are known.
I presume you are aware this issue but this is important for your readers who may use this model for real trading.
Your computation of the returns seems a bit off. I would suggest to use the computation of a nav