Python Backtesting Mean Reversion – Part 4

Python Backtesting Mean Reversion – Part 4

Categories Trading Strategy Backtest

I thought I’d finish off this short series of Python Backtesting Mean Reversion by providing a full, executable script that incorporates the use of SQL queries to extract our ticker symbols from the SQLite database we created in an earlier post (This can be found here)

In this particular example I have decided to run a series of backtests on ticker symbols from the database, based upon a “Niche” of “Gold Miners”. In theory this selection of tickers based upon some non-arbitrary, meaningful grouping criteria should allow us to focus in on pairs of symbols that are more likely to have statistically meaningful co-integration of prices series.

To test other groups of symbols, all we have to do is update the SQL query based upon whichever criteria we are interested in….just as an example, we could change the query to select symbols based upon an “Asset Class” of “Fixed Income”. The possible combinations are obviously plentiful, and we could even incorporate multiple criteria at once to hone our focus even further.

So, after following the instructions to create our database, one should be able to just copy and paste the entire code below into a python script and it will run through all the backtests in one go:

 

#import needed modules
#import needed modules
from datetime import datetime
from pandas_datareader import data
import pandas as pd
import numpy as np
from numpy import log, polyfit, sqrt, std, subtract
import statsmodels.tsa.stattools as ts
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
import sqlite3 as db
 
 
#set the database file path we wish to connect to
#this will obviously be unique to wherever you created
#the SQLite database on your local system
database = 'C:\Users\Stuart\sqlite_databases\etfs.db'
 
#this is the SQL statement containing the information
#regarding which tickers we want to pull from the database
#As an example I have chosen to pull down all tickers which
#have their "Focus" listed as being "Silver"
sql = 'SELECT Ticker FROM etftable WHERE Niche = "Value";'
 
#create a connection to the database specified above
cnx = db.connect(database)
cur = cnx.cursor()
 
#execute the SQL statement and place the results into a 
#variable called "tickers"
tickers = pd.read_sql(sql, con=cnx)
 
 
#create an empty list
symbList = []
 
#iterate over the DataFrame and append each item into the empty list
for i in xrange(len(tickers)):
    symbList.append(tickers.ix[i][0])
 
 
def get_symb_pairs(symbList):
    '''symbList is a list of ETF symbols
       This function takes in a list of symbols and 
       returns a list of unique pairs of symbols'''
 
    symbPairs = []
 
    i = 0
 
    #iterate through the list and create all possible combinations of
    #ticker pairs - append the pairs to the "symbPairs" list
    while i < len(symbList)-1:
        j = i + 1
        while j < len(symbList):
            symbPairs.append([symbList[i],symbList[j]])
            j += 1
        i += 1
 
    #iterate through the newly created list of pairs and remove any pairs
    #made up of two identical tickers
    for i in symbPairs:
        if i[0] == i[1]:
            symbPairs.remove(i)
 
    #create a new empty list to store only unique pairs        
    symbPairs2 = []
 
    #iterate through the original list and append only unique pairs to the 
    #new list 
    for i in symbPairs:
        if i not in symbPairs2:
            symbPairs2.append(i)
 
    return symbPairs2
 
 
symbPairs = get_symb_pairs(symbList)
 
 
 
 
 
 
def backtest(symbList):
    start_date = '2012/01/01'
    end_date = datetime.now()
 
    #download data from Yahoo Finance
    y=data.DataReader(symbList[0], "yahoo", start=start_date, end=end_date)
    x=data.DataReader(symbList[1], "yahoo", start=start_date, end=end_date)
 
    #rename column to make it easier to work with later
    y.rename(columns={'Adj Close':'price'}, inplace=True)
    x.rename(columns={'Adj Close':'price'}, inplace=True)
 
    #make sure DataFrames are the same length
    min_date = max(df.dropna().index[0] for df in [y, x])
    max_date = min(df.dropna().index[-1] for df in [y, x])
 
    y = y[(y.index >= min_date) & (y.index <= max_date)]
    x = x[(x.index >= min_date) & (x.index <= max_date)]
 
    ############################################################
 
    plt.plot(y.price,label=symbList[0])
    plt.plot(x.price,label=symbList[1])
    plt.ylabel('Price')
    plt.xlabel('Time')
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.show()
 
    #############################################################
 
    sns.jointplot(y.price, x.price ,color='b')
    plt.show()
 
    #############################################################
 
    #run Odinary Least Squares regression to find hedge ratio
    #and then create spread series
    df1 = pd.DataFrame({'y':y['price'],'x':x['price']})
    est = sm.OLS(df1.y,df1.x)
    est = est.fit()
    df1['hr'] = -est.params[0]
    df1['spread'] = df1.y + (df1.x * df1.hr)
 
    ##############################################################
 
    plt.plot(df1.spread)
    plt.show()
 
    ##############################################################
 
    cadf = ts.adfuller(df1.spread)
    print 'Augmented Dickey Fuller test statistic =',cadf[0]
    print 'Augmented Dickey Fuller p-value =',cadf[1]
    print 'Augmented Dickey Fuller 1%, 5% and 10% test statistics =',cadf[4]
 
    ##############################################################
 
    def hurst(ts):
    	"""Returns the Hurst Exponent of the time series vector ts"""
    	# Create the range of lag values
    	lags = range(2, 100)
 
    	# Calculate the array of the variances of the lagged differences
    	tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
 
    	# Use a linear fit to estimate the Hurst Exponent
    	poly = polyfit(log(lags), log(tau), 1)
 
    	# Return the Hurst exponent from the polyfit output
    	return poly[0]*2.0
 
 
    ##############################################################
 
    print "Hurst Exponent =",round(hurst(df1.spread),2)
 
    ##############################################################
 
    #Run OLS regression on spread series and lagged version of itself
    spread_lag = df1.spread.shift(1)
    spread_lag.ix[0] = spread_lag.ix[1]
    spread_ret = df1.spread - spread_lag
    spread_ret.ix[0] = spread_ret.ix[1]
    spread_lag2 = sm.add_constant(spread_lag)
 
    model = sm.OLS(spread_ret,spread_lag2)
    res = model.fit()
 
 
    halflife = round(-np.log(2) / res.params[1],0)
 
    if halflife <= 0:
        halflife = 1
 
    print  'Halflife = ',halflife
 
    ##############################################################
 
 
    meanSpread = df1.spread.rolling(window=halflife).mean()
    stdSpread = df1.spread.rolling(window=halflife).std()
 
    df1['zScore'] = (df1.spread-meanSpread)/stdSpread
 
    df1['zScore'].plot()
    plt.show()
 
    ##############################################################
 
    entryZscore = 2
    exitZscore = 0
 
    #set up num units long             
    df1['long entry'] = ((df1.zScore < - entryZscore) & ( df1.zScore.shift(1) > - entryZscore))
    df1['long exit'] = ((df1.zScore > - exitZscore) & (df1.zScore.shift(1) < - exitZscore))
    df1['num units long'] = np.nan
    df1.loc[df1['long entry'],'num units long'] = 1
    df1.loc[df1['long exit'],'num units long'] = 0
    df1['num units long'][0] = 0
    df1['num units long'] = df1['num units long'].fillna(method='pad')
 
    #set up num units short
    df1['short entry'] = ((df1.zScore >  entryZscore) & ( df1.zScore.shift(1) < entryZscore))
    df1['short exit'] = ((df1.zScore <  exitZscore) & (df1.zScore.shift(1) > exitZscore))
    df1.loc[df1['short entry'],'num units short'] = -1
    df1.loc[df1['short exit'],'num units short'] = 0
    df1['num units short'][0] = 0
    df1['num units short'] = df1['num units short'].fillna(method='pad')
 
    df1['numUnits'] = df1['num units long'] + df1['num units short']
    df1['spread pct ch'] = (df1['spread'] - df1['spread'].shift(1)) / ((df1['x'] * abs(df1['hr'])) + df1['y'])
    df1['port rets'] = df1['spread pct ch'] * df1['numUnits'].shift(1)
 
    df1['cum rets'] = df1['port rets'].cumsum()
    df1['cum rets'] = df1['cum rets'] + 1
 
    ##############################################################
 
    try:
        sharpe = ((df1['port rets'].mean() / df1['port rets'].std()) * sqrt(252)) 
    except ZeroDivisionError:
        sharpe = 0.0
 
 
 
    plt.plot(df1['cum rets'])
    plt.xlabel(i[1])
    plt.ylabel(i[0])
    plt.show()
 
    ##############################################################
 
 
 
    start_val = 1
    end_val = df1['cum rets'].iat[-1]
 
    start_date = df1.iloc[0].name
    end_date = df1.iloc[-1].name
    days = (end_date - start_date).days
 
    CAGR = round(((float(end_val) / float(start_val)) ** (252.0/days)) - 1,4)
 
 
    print "CAGR = {}%".format(CAGR*100)
    print "Sharpe Ratio = {}".format(round(sharpe,2))
    print 100*"----"
 
for i in symbPairs:
    backtest(i)

So hopefully this makes it as easy as possible for anyone reading this to just copy and paste and run this whole thing locally.

I must admit, the backtest results are far from amazing, and many, many of them end in negative territory. However there are a couple of pairs in there that generate quite a smooth equity curve with an upward trajectory (and this is just from the “Niche” = “Gold Miners” run, returning about 4-5% CAGR with Sharpe rations of 1.9 – 2.5. For an example, check out [GDX, RING] and [GDX, PSAU] ticker pairs.

Of course, we havn’t made any allowance for transaction costs so those small gains could well end up in negative territory also, depending on how many trades need to be carried out.

I think I will leave it here for this series; I think I’ve covered pretty much everything that i wanted to cover and have ended up with a fully working, executable script to carry out basic, initial tests on ETF pairs to identify any possible candidates for a mean-reversion strategy.

Onto something new for next time…until then!

It's only fair to share...Share on FacebookShare on Google+Tweet about this on TwitterShare on LinkedInEmail this to someonePin on PinterestShare on Reddit

9 thoughts on “Python Backtesting Mean Reversion – Part 4

  1. Apologies, there were some copy and paste issues with the code, for example “&” was showing up in its “amp” format – this has been rectified now and the above code should work straight out of the box.

    I also added a “Try/Except” for the calculation of the Sharpe ratio which will deal with the situation where there were no trades made at all, the standard deviation is therefore 0, and avoids a “division by Zero” error.

    Hope the above is helpful.

  2. Hey, very useful code. I am trying to replicate the same and noticed a symbol that might have been missed. In the entry-exit part of the code, could you confirm if you have missed a ” – ” before the ‘exitZscore’?

    df1[‘long exit’] = ((df1.zScore > – exitZscore) & (df1.zScore.shift(1) < exitZscore))
    ^

    1. Good spot, I think you’re right…although my choice to have 0 as the exit zScore means that in this particular case, the results aren’t affected as -0 is the same as 0.

      If the exit zScore was set to anything other than 0 then yes, there should be a minus sign in front of the code as you pointed out.

      In fact if you wanted more flexibility regarding the entry and exit zScores (for example if you wanted the exit zScore to be a different sign than the entry zScore for either long or short trades), perhaps it would be better to set up specific entry and exit zScores for long positions and short positions separately, i.e. have 4 separate zScore levels.

      In its current format the flexibility is limited.

      1. Thanks!

        I am kind of stuck at the signals part of the code and it’s hard to explain it over here. Any advice from your side would be of great help to me. If it isn’t too much trouble, do let me know how I can contact you or drop me mail at sai_prem2000@yahoo.com.

  3. Hi,
    new to python. Tried running your script on two securities in spyder (python 2.7 windows) and ran into a value error (details below) after printing halflife at:

    meanSpread = df1.spread.rolling(window=halflife).mean()

    File “C:\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 866, in runfile
    execfile(filename, namespace)
    File “C:\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 87, in execfile
    exec(compile(scripttext, filename, ‘exec’), glob, loc)
    File “…/Mean Reversion Testing_v2.py”, line 108, in
    meanSpread = df1.spread.rolling(window=halflife).mean()
    File “C:\Anaconda2\lib\site-packages\pandas\core\generic.py”, line 5182, in rolling
    center=center, win_type=win_type, axis=axis)
    File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 1524, in rolling
    return Rolling(obj, **kwds)
    File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 59, in __init__
    self.validate()
    File “C:\Anaconda2\lib\site-packages\pandas\core\window.py”, line 843, in validate
    raise ValueError(“window must be an integer”)
    ValueError: window must be an integer

    Any suggestion for fixing this problem would be appreciated. Thanks

    1. Hi there, I have rerun my code and I am getting the same error – luckily it is a very easy fix! The reason you are getting the error message is that the “rolling mean” Pandas function only accepts an integer as a window, whereas we are currently passing it a float with multiple decimal places.

      The way to fix this is by changing the line:

      halflife = round(-np.log(2) / res.params[1],0)

      to:

      halflife = int(round(-np.log(2) / res.params[1],0))

      This will cast the halflife variable as an integer and that should then work when passed to the rolling mean function.

      Let me know if that solves your problem 😀

  4. Hi, great work, I really enjoy reading your blog and to simulate the code.

    I am getting rolling error as well but a different one. any insight?

    —————————————————————————
    AttributeError Traceback (most recent call last)
    in ()
    —-> 1 meanSpread = df1.spread.rolling(window=halflife).mean()
    2 stdSpread = df1.spread.rolling(window=halflife).std()
    3
    4 df1[‘zScore’] = (df1.spread-meanSpread)/stdSpread
    5

    /usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
    2244 return self[name]
    2245 raise AttributeError(“‘%s’ object has no attribute ‘%s'” %
    -> 2246 (type(self).__name__, name))
    2247
    2248 def __setattr__(self, name, value):

    AttributeError: ‘Series’ object has no attribute ‘rolling’

    1. Hi there – this is a really strange error message as a Pandas Series object does have a “rolling” attribute. Could you run the following code for me please and let me know the output:

      df = pd.DataFrame({'a':[1,2,3,4,5,6,7,8,9,10], 
                          'b':[1,2,3,4,5,6,7,8,9,10]})
       
      print df.a.rolling(window=5).mean()

Leave a Reply