Home Trading Strategy Backtest Python Backtesting Mean Reversion – Part 4

Python Backtesting Mean Reversion – Part 4

by s666

I thought I’d finish off this short series of Python Backtesting Mean Reversion by providing a full, executable script that incorporates the use of SQL queries to extract our ticker symbols from the SQLite database we created in an earlier post (This can be found here)

In this particular example I have decided to run a series of backtests on ticker symbols from the database, based upon a “Niche” of “Gold Miners”. In theory this selection of tickers based upon some non-arbitrary, meaningful grouping criteria should allow us to focus in on pairs of symbols that are more likely to have statistically meaningful co-integration of prices series.

To test other groups of symbols, all we have to do is update the SQL query based upon whichever criteria we are interested in….just as an example, we could change the query to select symbols based upon an “Asset Class” of “Fixed Income”. The possible combinations are obviously plentiful, and we could even incorporate multiple criteria at once to hone our focus even further.

So, after following the instructions to create our database, one should be able to just copy and paste the entire code below into a python script and it will run through all the backtests in one go:

#import needed modules
#import needed modules
from datetime import datetime
from pandas_datareader import data
import pandas as pd
import numpy as np
from numpy import log, polyfit, sqrt, std, subtract
import statsmodels.tsa.stattools as ts
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
import sqlite3 as db


#set the database file path we wish to connect to
#this will obviously be unique to wherever you created
#the SQLite database on your local system
database = 'C:\Users\S666\sqlite_databases\etfs.db'
 
#this is the SQL statement containing the information
#regarding which tickers we want to pull from the database
#As an example I have chosen to pull down all tickers which
#have their "Focus" listed as being "Silver"
sql = 'SELECT Ticker FROM etftable WHERE Niche = "Value";'
 
#create a connection to the database specified above
cnx = db.connect(database)
cur = cnx.cursor()
 
#execute the SQL statement and place the results into a 
#variable called "tickers"
tickers = pd.read_sql(sql, con=cnx)

 
#create an empty list
symbList = []
 
#iterate over the DataFrame and append each item into the empty list
for i in xrange(len(tickers)):
    symbList.append(tickers.ix[i][0])
    
    
def get_symb_pairs(symbList):
    '''symbList is a list of ETF symbols
       This function takes in a list of symbols and 
       returns a list of unique pairs of symbols'''
 
    symbPairs = []
 
    i = 0
 
    #iterate through the list and create all possible combinations of
    #ticker pairs - append the pairs to the "symbPairs" list
    while i < len(symbList)-1:
        j = i + 1
        while j < len(symbList): 
            symbPairs.append([symbList[i],symbList[j]]) 
            j += 1 
            i += 1 
            
    #iterate through the newly created list of pairs and remove any pairs #made up of two identical tickers 
    for i in symbPairs: 
        if i[0] == i[1]: 
            symbPairs.remove(i) 
    #create a new empty list to store only unique pairs 
    symbPairs2 = [] 
    #iterate through the original list and append only unique pairs to the 
    #new list 
    for i in symbPairs: 
        if i not in symbPairs2: 
            symbPairs2.append(i) 

    return symbPairs2 



symbPairs = get_symb_pairs(symbList) 


def backtest(symbList): 
    start_date = '2012/01/01' 
    end_date = datetime.now() 
    #download data from Yahoo Finance 
    y=data.DataReader(symbList[0], "yahoo", start=start_date, end=end_date) 
    x=data.DataReader(symbList[1], "yahoo", start=start_date, end=end_date) 

    #rename column to make it easier to work with later 
    y.rename(columns={'Adj Close':'price'}, inplace=True) 
    x.rename(columns={'Adj Close':'price'}, inplace=True) 

    #make sure DataFrames are the same length 
    min_date = max(df.dropna().index[0] for df in [y, x]) 
    max_date = min(df.dropna().index[-1] for df in [y, x]) 
    y = y[(y.index >= min_date) & (y.index <= max_date)] 
    x = x[(x.index >= min_date) & (x.index <= max_date)]
    
    ############################################################
    
    plt.plot(y.price,label=symbList[0])
    plt.plot(x.price,label=symbList[1])
    plt.ylabel('Price')
    plt.xlabel('Time')
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.show()
    
    #############################################################
    
    sns.jointplot(y.price, x.price ,color='b')
    plt.show()
    
    #############################################################
    
    #run Odinary Least Squares regression to find hedge ratio
    #and then create spread series
    df1 = pd.DataFrame({'y':y['price'],'x':x['price']})
    est = sm.OLS(df1.y,df1.x)
    est = est.fit()
    df1['hr'] = -est.params[0]
    df1['spread'] = df1.y + (df1.x * df1.hr)
    
    ##############################################################
    
    plt.plot(df1.spread)
    plt.show()
    
    ##############################################################
    
    cadf = ts.adfuller(df1.spread)
    print 'Augmented Dickey Fuller test statistic =',cadf[0]
    print 'Augmented Dickey Fuller p-value =',cadf[1]
    print 'Augmented Dickey Fuller 1%, 5% and 10% test statistics =',cadf[4]
    
    ##############################################################
    
    def hurst(ts):
    	"""Returns the Hurst Exponent of the time series vector ts"""
    	# Create the range of lag values
    	lags = range(2, 100)
     
    	# Calculate the array of the variances of the lagged differences
    	tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
     
    	# Use a linear fit to estimate the Hurst Exponent
    	poly = polyfit(log(lags), log(tau), 1)
     
    	# Return the Hurst exponent from the polyfit output
    	return poly[0]*2.0
     
    
    ##############################################################
    
    print "Hurst Exponent =",round(hurst(df1.spread),2)
    
    ##############################################################
    
    #Run OLS regression on spread series and lagged version of itself
    spread_lag = df1.spread.shift(1)
    spread_lag.ix[0] = spread_lag.ix[1]
    spread_ret = df1.spread - spread_lag
    spread_ret.ix[0] = spread_ret.ix[1]
    spread_lag2 = sm.add_constant(spread_lag)
     
    model = sm.OLS(spread_ret,spread_lag2)
    res = model.fit()
     
     
    halflife = round(-np.log(2) / res.params[1],0)
    
    if halflife <= 0:
        halflife = 1
     
    print  'Halflife = ',halflife
    
    ##############################################################
    
    
    meanSpread = df1.spread.rolling(window=halflife).mean()
    stdSpread = df1.spread.rolling(window=halflife).std()
        
    df1['zScore'] = (df1.spread-meanSpread)/stdSpread
        
    df1['zScore'].plot()
    plt.show()
    
    ##############################################################
    
    entryZscore = 2
    exitZscore = 0
    
    #set up num units long             
    df1['long entry'] = ((df1.zScore < - entryZscore) & ( df1.zScore.shift(1) > - entryZscore))
    df1['long exit'] = ((df1.zScore > - exitZscore) & (df1.zScore.shift(1) < - exitZscore)) df1['num units long'] = np.nan df1.loc[df1['long entry'],'num units long'] = 1 df1.loc[df1['long exit'],'num units long'] = 0 df1['num units long'][0] = 0 df1['num units long'] = df1['num units long'].fillna(method='pad') #set up num units short df1['short entry'] = ((df1.zScore >  entryZscore) & ( df1.zScore.shift(1) < entryZscore))
    df1['short exit'] = ((df1.zScore < exitZscore) & (df1.zScore.shift(1) > exitZscore))
    df1.loc[df1['short entry'],'num units short'] = -1
    df1.loc[df1['short exit'],'num units short'] = 0
    df1['num units short'][0] = 0
    df1['num units short'] = df1['num units short'].fillna(method='pad')
    
    df1['numUnits'] = df1['num units long'] + df1['num units short']
    df1['spread pct ch'] = (df1['spread'] - df1['spread'].shift(1)) / ((df1['x'] * abs(df1['hr'])) + df1['y'])
    df1['port rets'] = df1['spread pct ch'] * df1['numUnits'].shift(1)
    
    df1['cum rets'] = df1['port rets'].cumsum()
    df1['cum rets'] = df1['cum rets'] + 1
    
    ##############################################################
    
    try:
        sharpe = ((df1['port rets'].mean() / df1['port rets'].std()) * sqrt(252)) 
    except ZeroDivisionError:
        sharpe = 0.0
    
    
    
    plt.plot(df1['cum rets'])
    plt.xlabel(i[1])
    plt.ylabel(i[0])
    plt.show()
    
    ##############################################################
    
    
        
    start_val = 1
    end_val = df1['cum rets'].iat[-1]
        
    start_date = df1.iloc[0].name
    end_date = df1.iloc[-1].name
    days = (end_date - start_date).days
        
    CAGR = round(((float(end_val) / float(start_val)) ** (252.0/days)) - 1,4)
    
    
    print "CAGR = {}%".format(CAGR*100)
    print "Sharpe Ratio = {}".format(round(sharpe,2))
    print 100*"----"
    
for i in symbPairs:
    backtest(i)

So hopefully this makes it as easy as possible for anyone reading this to just copy and paste and run this whole thing locally.

I must admit, the backtest results are far from amazing, and many, many of them end in negative territory. However there are a couple of pairs in there that generate quite a smooth equity curve with an upward trajectory (and this is just from the “Niche” = “Gold Miners” run, returning about 4-5% CAGR with Sharpe rations of 1.9 – 2.5. For an example, check out [GDX, RING] and [GDX, PSAU] ticker pairs.

Of course, we havn’t made any allowance for transaction costs so those small gains could well end up in negative territory also, depending on how many trades need to be carried out.

I think I will leave it here for this series; I think I’ve covered pretty much everything that i wanted to cover and have ended up with a fully working, executable script to carry out basic, initial tests on ETF pairs to identify any possible candidates for a mean-reversion strategy.

Onto something new for next time…until then!

You may also like

15 comments

s666 July 13, 2016 - 5:04 pm

Apologies, there were some copy and paste issues with the code, for example “&” was showing up in its “amp” format – this has been rectified now and the above code should work straight out of the box.

I also added a “Try/Except” for the calculation of the Sharpe ratio which will deal with the situation where there were no trades made at all, the standard deviation is therefore 0, and avoids a “division by Zero” error.

Hope the above is helpful.

Reply
Sai Prem August 28, 2016 - 1:53 pm

Hey, very useful code. I am trying to replicate the same and noticed a symbol that might have been missed. In the entry-exit part of the code, could you confirm if you have missed a ” – ” before the ‘exitZscore’?

df1[‘long exit’] = ((df1.zScore > – exitZscore) & (df1.zScore.shift(1) < exitZscore))
^

Reply
s666 August 28, 2016 - 4:25 pm

Good spot, I think you’re right…although my choice to have 0 as the exit zScore means that in this particular case, the results aren’t affected as -0 is the same as 0.

If the exit zScore was set to anything other than 0 then yes, there should be a minus sign in front of the code as you pointed out.

In fact if you wanted more flexibility regarding the entry and exit zScores (for example if you wanted the exit zScore to be a different sign than the entry zScore for either long or short trades), perhaps it would be better to set up specific entry and exit zScores for long positions and short positions separately, i.e. have 4 separate zScore levels.

In its current format the flexibility is limited.

Reply
Sai Prem August 28, 2016 - 5:52 pm

Thanks!

I am kind of stuck at the signals part of the code and it’s hard to explain it over here. Any advice from your side would be of great help to me. If it isn’t too much trouble, do let me know how I can contact you or drop me mail at [email protected].

Reply
s666 August 29, 2016 - 10:18 pm

No problem, I’ve sent you an email 😀

Reply
george November 17, 2016 - 8:00 pm

Hi,
new to python. Tried running your script on two securities in spyder (python 2.7 windows) and ran into a value error (details below) after printing halflife at:

meanSpread = df1.spread.rolling(window=halflife).mean()

File “C:\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 866, in runfile
execfile(filename, namespace)
File “C:\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 87, in execfile
exec(compile(scripttext, filename, ‘exec’), glob, loc)
File “…/Mean Reversion Testing_v2.py”, line 108, in
meanSpread = df1.spread.rolling(window=halflife).mean()
File “C:\Anaconda2\lib\site-packages\pandas\core\generic.py”, line 5182, in rolling
center=center, win_type=win_type, axis=axis)
File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 1524, in rolling
return Rolling(obj, **kwds)
File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 59, in __init__
self.validate()
File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 843, in validate
raise ValueError(“window must be an integer”)
ValueError: window must be an integer

Any suggestion for fixing this problem would be appreciated. Thanks

Reply
s666 November 27, 2016 - 11:26 am

Hi there, I have rerun my code and I am getting the same error – luckily it is a very easy fix! The reason you are getting the error message is that the “rolling mean” Pandas function only accepts an integer as a window, whereas we are currently passing it a float with multiple decimal places.

The way to fix this is by changing the line:

halflife = round(-np.log(2) / res.params[1],0)

to:

halflife = int(round(-np.log(2) / res.params[1],0))

This will cast the halflife variable as an integer and that should then work when passed to the rolling mean function.

Let me know if that solves your problem 😀

Reply
wilkins January 18, 2017 - 6:44 am

Hi, great work, I really enjoy reading your blog and to simulate the code.

I am getting rolling error as well but a different one. any insight?

—————————————————————————
AttributeError Traceback (most recent call last)
in ()
—-> 1 meanSpread = df1.spread.rolling(window=halflife).mean()
2 stdSpread = df1.spread.rolling(window=halflife).std()
3
4 df1[‘zScore’] = (df1.spread-meanSpread)/stdSpread
5

/usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2244 return self[name]
2245 raise AttributeError(“‘%s’ object has no attribute ‘%s'” %
-> 2246 (type(self).__name__, name))
2247
2248 def __setattr__(self, name, value):

AttributeError: ‘Series’ object has no attribute ‘rolling’

Reply
s666 January 19, 2017 - 8:35 am

Hi there – this is a really strange error message as a Pandas Series object does have a “rolling” attribute. Could you run the following code for me please and let me know the output:

df = pd.DataFrame({'a':[1,2,3,4,5,6,7,8,9,10], 
                    'b':[1,2,3,4,5,6,7,8,9,10]})

print df.a.rolling(window=5).mean()
Reply
jan123 August 23, 2017 - 4:51 pm

It seems there is an error in your code.

def get_symb_pairs(symbList):
symbPairs = []
i=0
while i < len(symbList)-1:
j=i+1
while j <len(symbList):
symbPairs.append([symbList[i], symbList[j] ])
j += 1
i += 1
Does not calculate "all possible combinations of picker pairs". I checked the list symbPairs. Happy coding!

Reply
tsando January 12, 2018 - 2:53 pm

Is there any reason why you used:

df1[‘spread pct ch’] = (df1[‘spread’] – df1[‘spread’].shift(1)) / ((df1[‘x’] * abs(df1[‘hr’])) + df1[‘y’])

instead of simply df1[‘spread’].pct_change(1)? Also, why do you need the absolute value for the hedge ratio ‘abs(df1[‘hr’])’ in the above formula? Isn’t the spread defined as y + hr*x where hr=-beta? I thought the pct ch is defined as (V_t – V_t-1) / V_t

Reply
tsando January 15, 2018 - 10:12 am

Another question, in these lines:

#run Odinary Least Squares regression to find hedge ratio
#and then create spread series
df1 = pd.DataFrame({‘y’:y[‘price’],’x’:x[‘price’]})
est = sm.OLS(df1.y,df1.x)
est = est.fit()
df1[‘hr’] = -est.params[0]
df1[‘spread’] = df1.y + (df1.x * df1.hr)

Can I ask why you didn’t add a constant to the x before running the linear regression? I would expect you need to add a constant like you did later on for ‘spread_lag2 = sm.add_constant(spread_lag)’?

Reply
small August 18, 2018 - 11:45 pm

Hi Jamieson –

thanks for your work so far, i find it very useful and practical above all.
I am trying to run your code above and I encountered some issues when calculating the halflife spread. I am looking at a different pair of stock ‘AMG’ and ‘AFL’. The hallife of the spread between the two is negative and equal to -405days. Now if I set a negative halflife equal to to 1 (As per your code above) I have no rolling standard deviation. All values will be equal “nan” and it wont be possible to compute the Z-score.

Does it make sense to have a negative half life? Shouldn’t this be always positive representing a number of days? If so why would you set half life equal to 1 for negative values when the real half life might be bigger (like -405 days in my example)?

Look forward to hearing your opinion.

Thanks a lot again

Reply
Piotr Arendarski April 16, 2020 - 7:51 pm

I’m just wondering if there is look ahead bias introduced in the backtest.
You calculate cadf and hedge ratio from the full sample.
However you execute trades before cadf and hedge ratio are known.

I presume you are aware this issue but this is important for your readers who may use this model for real trading.

Reply
Henribu May 3, 2020 - 4:15 pm

Your computation of the returns seems a bit off. I would suggest to use the computation of a nav

Reply

Leave a Reply

%d bloggers like this: