
Mean Reversion Pairs Trading With Inclusion of a Kalman Filter

by s666

In this article we are going to revisit the concept of building a trading strategy backtest based on mean reverting, co-integrated pairs of stocks. To restate the theory: stocks that are statistically co-integrated move in a way that means when their prices start to diverge by a certain amount (i.e. the spread between the two stocks' prices increases), we would expect that divergence to
eventually revert back to the mean. In this instance we would look to sell the outperforming stock and buy the underperforming stock, in the expectation that the underperforming stock would eventually "catch up" with the outperforming stock and rise in price, or vice versa, that the outperforming stock would in time suffer from the same downward pressure as the underperforming stock and fall in relative value.

Hence, pairs trading is a market neutral trading strategy enabling traders to profit from virtually any market conditions: uptrend, downtrend, or sideways movement.

So in our search for co-integrated stocks, economic theory would suggest that we are more likely to find pairs of stocks driven by the same factors if we search for pairs drawn from similar, or the same, industries. After all, it is logical to expect
two stocks in the technology sector that produce similar products to be at the mercy of the same general ups and downs of the industry environment. So for this particular backtest I will be scraping a load of tech stock tickers from the web and then using Pandas data-reader to download daily data for those stocks. (N.B. Pandas data-reader has been facing some serious issues recently; in effect the Yahoo and Google APIs are no longer fit for purpose and are no longer working. Instead I shall use the "iex" provider, which offers daily data for a maximum of a 5 year historical period. I see 5 years as being more than long enough for our purposes.)

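For example, pulling a single ticker's daily closes from the "iex" provider via the data-reader looks like this (a minimal sketch; 'AAPL' is just an example symbol):

import pandas_datareader.data as web

#daily closes for one example ticker from the iex provider (max c.5 years of history)
aapl_close = web.DataReader('AAPL', data_source='iex', start='01/01/2013')['close']
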
I started this blog a few years ago, and one of my very first blog series was on this exact subject matter – mean reversion based pairs trading. So why approach it again and repeat myself? Well this time I am going to add a few more elements that were not present in the initial blog series.
I am going to

  1. Create a heatmap of co-integrated pairs so we can visually see the level of cointegration between any and all pairs that we are concerning ourselves with.
  2. Introduce the concept of a “Kalman Filter” when considering the spread series which will give us our trading signal.
  3. Add the concept of a "training set" of data and a "test set" of data, separating the two. What this helps us avoid is "look-ahead" bias, whereby we would incorrectly test co-integration over the full set of data that we have, and also run our backtest of trading over the same data. After all, how would we be able to both
    run a co-integration test over a time period to select our trading pairs, and then travel back in time and carry out trades over the same time period? That's just not possible; however, it is one subtle pitfall that many, many systematic traders fall into. (A minimal sketch of this split follows below.)

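To make that training/test split concrete before we begin, here is a minimal sketch of the idea using made-up prices (the tickers and numbers below are purely illustrative):

import numpy as np
import pandas as pd

#hypothetical price history - two random walks standing in for real stock data
idx = pd.date_range("2013-01-01", periods=1000, freq="B")
prices = pd.DataFrame(np.random.randn(1000, 2).cumsum(axis=0) + 100,
                      index=idx, columns=["AAA", "BBB"])

#select pairs on the first 40% of the history only...
split = int(len(prices) * 0.4)
train = prices.iloc[:split]
#...and reserve the remaining 60% for the backtest - no overlap, no look-ahead
test = prices.iloc[split:]
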
So what is a Kalman Filter? Well, this site (click here) explains the concept and shows examples in the clearest manner I have yet found while searching online. It states the following:

You can use a Kalman filter in any place where you have uncertain information about some dynamic system, and you can make an educated guess about what the system is going to do next. Even if messy reality comes along and interferes with the clean motion you guessed about, the Kalman filter will often do a very good job of figuring out what actually happened. And it can take advantage of correlations between crazy phenomena that you maybe wouldn’t have thought to exploit!

Kalman filters are ideal for systems which are continuously changing. They have the advantage that they are light on memory (they don’t need to keep any history other than the previous state), and they are very fast, making them well suited for real time problems and embedded systems.
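
To make that concrete, here is a minimal sketch using the "pykalman" module (the same one we use later in this post) to track a noisy, made-up series with a simple 1-D random walk model:

import numpy as np
from pykalman import KalmanFilter

np.random.seed(0)
true_level = np.cumsum(np.random.randn(200) * 0.1) #the hidden state: a slowly drifting level
observed = true_level + np.random.randn(200) * 0.5 #what we actually see: the state plus noise

kf = KalmanFilter(transition_matrices=[1],    #the state is assumed to carry over each step...
                  observation_matrices=[1],   #...and to be observed directly
                  initial_state_mean=0,
                  initial_state_covariance=1,
                  observation_covariance=1,   #how noisy we believe the observations to be
                  transition_covariance=0.01) #how much the state itself can drift each step

#only the previous state estimate is carried forward at each step - hence "light on memory"
state_means, _ = kf.filter(observed)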

So let's start by importing the relevant modules we will need for our strategy backtest:

#### import the necessary modules and set chart style ####
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
mpl.style.use('bmh')
import pandas_datareader.data as web
import matplotlib.pylab as plt
from datetime import datetime
import statsmodels.api as sm
from pykalman import KalmanFilter
from math import sqrt

And let's use Pandas and the data-reader module to scrape the relevant tech stock tickers from the www.marketwatch.com website.

#scrape html from website and store 3rd DataFrame as our stock tickers - this is dictated to us by the structure of the html
stock_list = pd.read_html("https://www.marketwatch.com/tools/industry/stocklist.asp?bcind_ind=9535&bcind_period=3mo")[3]

#convert the DataFrame of stocks into a list so we can easily iterate over it
stocks = stock_list[1].dropna()[1:].tolist()

#set empty list to hold the stock price DataFrames that we can later concatenate into a master frame
df_list = []

#not all stocks will return data, so set up an empty list to store the tickers that actually return data successfully
used_stocks = []

#iterate over stock tickers in list and download relevant data, storing said data and successfully downloaded tickers along the way
for stock in stocks:
    try:
        data = pd.DataFrame(web.DataReader(stock,data_source='iex',start='01/01/2013')['close'])
        data.columns = [stock]
        df_list.append(data)
        used_stocks.append(stock)
    except Exception:
        #skip any ticker that fails to download
        pass

#concatenate list of individual ticker price DataFrames into one master DataFrame
df = pd.concat(df_list,axis=1)

Let's plot the resulting DataFrame of price data just to make sure we have what we need, as a quick sanity check:

df.plot(figsize=(20,10))

Ok, so it looks from the chart as if we have price data downloaded for around 25-30 stocks; this should be more than enough to find at least a couple of co-integrated pairs to run our backtest over.

We will now define a quick function that will run through our stocks, combining them into pairs one by one and running a co-integration test on each pair. Each result will be stored in a matrix that we initialise,
and then we will be able to plot that matrix as a heatmap. Also, if a co-integration test meets our threshold statistical significance (in our case 5%), then that pair of stock tickers will be stored in a list for later retrieval.

#NOTE CRITICAL LEVEL HAS BEEN SET TO 5% FOR COINTEGRATION TEST
def find_cointegrated_pairs(dataframe, critical_level = 0.05):
    n = dataframe.shape[1] # the number of stock columns in the dataframe
    pvalue_matrix = np.ones((n, n)) # initialize the matrix of p-values
    keys = dataframe.columns # get the column names
    pairs = [] # initialize the list of cointegrated pairs
    for i in range(n):
        for j in range(i+1, n): # for j bigger than i
            stock1 = dataframe[keys[i]] # obtain the price series of "stock1"
            stock2 = dataframe[keys[j]] # obtain the price series of "stock2"
            result = sm.tsa.stattools.coint(stock1, stock2) # run the cointegration test
            pvalue = result[1] # get the p-value
            pvalue_matrix[i, j] = pvalue
            if pvalue < critical_level: # if p-value less than the critical level
                pairs.append((keys[i], keys[j], pvalue)) # record the pair and its p-value
    return pvalue_matrix, pairs
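
As a quick sanity check of the function on made-up data: if we build one series as a random walk and a second as that same walk plus stationary noise, the pair is cointegrated by construction and should come back with a p-value well below our 5% threshold.

import numpy as np
import pandas as pd

np.random.seed(1)
walk = np.cumsum(np.random.randn(500)) + 50 #a random walk
fake_df = pd.DataFrame({'A': walk, 'B': walk + np.random.randn(500)}) #B = A plus stationary noise

_, test_pairs = find_cointegrated_pairs(fake_df)
print(test_pairs) #expect [('A', 'B', <p-value well below 0.05>)]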

Let’s now run our data through our function, save the results and plot the heatmap:

#set up the split point for our "training data" on which to perform the co-integration test (the remaining data will be fed to our backtest function)
split = int(len(df) * .4)

#run our dataframe (up to the split point) of ticker price data through our co-integration function and store results
pvalue_matrix,pairs = find_cointegrated_pairs(df[:split])

#convert our matrix of stored results into a DataFrame
pvalue_matrix_df = pd.DataFrame(pvalue_matrix)

#use Seaborn to plot a heatmap of our results matrix
fig, ax = plt.subplots(figsize=(15,10))
sns.heatmap(pvalue_matrix_df,xticklabels=used_stocks,yticklabels=used_stocks,ax=ax)

So we can see from the very dark red squares that it looks as though there are indeed a few pairs of stocks whose co-integration score is below the 5% threshold
hardcoded into the function we defined. To see more explicitly which pairs these are, let's print out the list of stored pairs that was part of the function results we stored:

for pair in pairs:
    print("Stock {} and stock {} has a co-integration score of {}".format(pair[0],pair[1],round(pair[2],4)))
Stock AKAM and stock TCX has a co-integration score of 0.027
Stock AKAM and stock YNDX has a co-integration score of 0.0484
Stock BIDU and stock WEB has a co-integration score of 0.0377
Stock WIFI and stock JCOM has a co-integration score of 0.0039
Stock WIFI and stock LLNW has a co-integration score of 0.0187
Stock WIFI and stock NTES has a co-integration score of 0.0215
Stock WIFI and stock SIFY has a co-integration score of 0.0026
Stock WIFI and stock YNDX has a co-integration score of 0.0092
Stock ENV and stock SINA has a co-integration score of 0.0294
Stock IPAS and stock LLNW has a co-integration score of 0.0199
Stock IPAS and stock EGOV has a co-integration score of 0.0405
Stock JCOM and stock SIFY has a co-integration score of 0.0388
Stock LLNW and stock NTES has a co-integration score of 0.0109
Stock LLNW and stock TCX has a co-integration score of 0.032

We will now use the "pykalman" module to set up a couple of functions that will allow us to generate the Kalman filters we will apply to our price data, and in turn to the regression that is fed that data.

I will also define a function for "half-life", which just recycles some of the code from my mean reversion pairs trading blog post from a couple of years ago, which can be found
here.

def KalmanFilterAverage(x):
    # Construct a Kalman filter
    kf = KalmanFilter(transition_matrices = [1],
                      observation_matrices = [1],
                      initial_state_mean = 0,
                      initial_state_covariance = 1,
                      observation_covariance = 1,
                      transition_covariance = .01)

    # Use the observed values of the price to get a rolling mean
    state_means, _ = kf.filter(x.values)
    state_means = pd.Series(state_means.flatten(), index=x.index)
    return state_means

# Kalman filter regression
def KalmanFilterRegression(x,y):
    delta = 1e-3
    trans_cov = delta / (1 - delta) * np.eye(2) # how much the random walk wiggles
    obs_mat = np.expand_dims(np.vstack([[x], [np.ones(len(x))]]).T, axis=1)

    kf = KalmanFilter(n_dim_obs=1, n_dim_state=2, # y is 1-dimensional, (alpha, beta) is 2-dimensional
                      initial_state_mean=[0,0],
                      initial_state_covariance=np.ones((2, 2)),
                      transition_matrices=np.eye(2),
                      observation_matrices=obs_mat,
                      observation_covariance=2,
                      transition_covariance=trans_cov)

    # Use the observations y to get running estimates and errors for the state parameters
    state_means, state_covs = kf.filter(y.values)
    return state_means
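
# note: given the [x, 1] observation matrix above, column 0 of state_means holds
# the running slope estimate (used below as our hedge ratio) and column 1 the
# running intercept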

def half_life(spread):
    spread_lag = spread.shift(1)
    spread_lag.iloc[0] = spread_lag.iloc[1]
    spread_ret = spread - spread_lag
    spread_ret.iloc[0] = spread_ret.iloc[1]
    spread_lag2 = sm.add_constant(spread_lag)
    model = sm.OLS(spread_ret,spread_lag2)
    res = model.fit()
    halflife = int(round(-np.log(2) / res.params[1],0))

    if halflife <= 0:
        halflife = 1
    return halflife
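
As a quick sanity check that the half-life function behaves sensibly, we can feed it a synthetic mean reverting series with a known speed of reversion (an AR(1) process, made up purely for illustration):

import numpy as np
import pandas as pd

np.random.seed(42)
phi = 0.9 #AR(1) coefficient - implies a half-life of roughly -log(2)/(phi - 1), i.e. about 7 periods
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = phi * x[t-1] + np.random.randn()

print(half_life(pd.Series(x))) #should print a value close to 7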

Now let us define our main "backtest" function that we will run our data through. The function takes one pair of tickers at a time and returns several outputs, namely the DataFrame of cumulative returns,
the Sharpe Ratio and the Compound Annual Growth Rate (CAGR). Once we have defined our function, we can iterate over our list of pairs and feed the relevant data, pair by pair, into the function, storing the outputs for each pair for
later use and retrieval.

def backtest(df,s1, s2):
    #############################################################
    # INPUT:
    # DataFrame of prices
    # s1: the symbol of contract one
    # s2: the symbol of contract two
    # x: the price series of contract one
    # y: the price series of contract two
    # OUTPUT:
    # df1['cum rets']: cumulative returns in pandas data frame
    # sharpe: Sharpe ratio
    # CAGR: Compound Annual Growth Rate

    x = df[s1]
    y = df[s2]

    # run regression (including Kalman Filter) to find hedge ratio and then create spread series
    df1 = pd.DataFrame({'y':y,'x':x})
    df1.index = pd.to_datetime(df1.index)
    state_means = KalmanFilterRegression(KalmanFilterAverage(x),KalmanFilterAverage(y))

    df1['hr'] = - state_means[:,0]
    df1['spread'] = df1.y + (df1.x * df1.hr)

    # calculate half life
    halflife = half_life(df1['spread'])

    # calculate z-score with window = half life period
    meanSpread = df1.spread.rolling(window=halflife).mean()
    stdSpread = df1.spread.rolling(window=halflife).std()
    df1['zScore'] = (df1.spread-meanSpread)/stdSpread

    ##############################################################
    # trading logic
    entryZscore = 2
    exitZscore = 0

    #set up num units long
    df1['long entry'] = ((df1.zScore < - entryZscore) & (df1.zScore.shift(1) > - entryZscore))
    df1['long exit'] = ((df1.zScore > - exitZscore) & (df1.zScore.shift(1) < - exitZscore))
    df1['num units long'] = np.nan
    df1.loc[df1['long entry'],'num units long'] = 1
    df1.loc[df1['long exit'],'num units long'] = 0
    df1.loc[df1.index[0],'num units long'] = 0
    df1['num units long'] = df1['num units long'].fillna(method='pad')

    #set up num units short
    df1['short entry'] = ((df1.zScore > entryZscore) & (df1.zScore.shift(1) < entryZscore))
    df1['short exit'] = ((df1.zScore < exitZscore) & (df1.zScore.shift(1) > exitZscore))
    df1['num units short'] = np.nan
    df1.loc[df1['short entry'],'num units short'] = -1
    df1.loc[df1['short exit'],'num units short'] = 0
    df1.loc[df1.index[0],'num units short'] = 0
    df1['num units short'] = df1['num units short'].fillna(method='pad')

    df1['numUnits'] = df1['num units long'] + df1['num units short']
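    # note: one way to read the denominator below is as the approximate capital
    # deployed across both legs (abs(hr) units of x plus one unit of y), making
    # 'spread pct ch' an approximate per-dollar return on the pair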
    df1['spread pct ch'] = (df1['spread'] - df1['spread'].shift(1)) / ((df1['x'] * abs(df1['hr'])) + df1['y'])
    df1['port rets'] = df1['spread pct ch'] * df1['numUnits'].shift(1)

    df1['cum rets'] = df1['port rets'].cumsum()
    df1['cum rets'] = df1['cum rets'] + 1

    ##############################################################

    try:
        sharpe = ((df1['port rets'].mean() / df1['port rets'].std()) * sqrt(252))
    except ZeroDivisionError:
        sharpe = 0.0

    ##############################################################
    start_val = 1
    end_val = df1['cum rets'].iat[-1]

    start_date = df1.iloc[0].name
    end_date = df1.iloc[-1].name

    days = (end_date - start_date).days

    CAGR = round(((float(end_val) / float(start_val)) ** (252.0/days)) - 1,4)

    df1[s1+ " "+s2] = df1['cum rets']

    return df1[s1+" "+s2], sharpe, CAGR

So now let's run our full list of pairs through our backtest function, print out some results along the way and finally, after storing the equity curve for each pair,
produce a chart that plots out each curve.

results = []

for pair in pairs:
    rets, sharpe,  CAGR = backtest(df[split:],pair[0],pair[1])
    results.append(rets)
    print("The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}".format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))
    rets.plot(figsize=(20,15),legend=True)
The pair AKAM and TCX produced a Sharpe Ratio of 1.69 and a CAGR of 0.0701
The pair AKAM and YNDX produced a Sharpe Ratio of 1.95 and a CAGR of 0.0974
The pair BIDU and WEB produced a Sharpe Ratio of 0.48 and a CAGR of 0.0163
The pair WIFI and JCOM produced a Sharpe Ratio of 1.31 and a CAGR of 0.0866
The pair WIFI and LLNW produced a Sharpe Ratio of 1.22 and a CAGR of 0.13
The pair WIFI and NTES produced a Sharpe Ratio of 0.36 and a CAGR of 0.0385
The pair WIFI and SIFY produced a Sharpe Ratio of 1.7 and a CAGR of 0.0942
The pair WIFI and YNDX produced a Sharpe Ratio of 1.05 and a CAGR of 0.065
The pair ENV and SINA produced a Sharpe Ratio of 0.83 and a CAGR of 0.0163
The pair IPAS and LLNW produced a Sharpe Ratio of -0.72 and a CAGR of -0.1943
The pair IPAS and EGOV produced a Sharpe Ratio of 1.25 and a CAGR of 0.1504
The pair JCOM and SIFY produced a Sharpe Ratio of 0.0 and a CAGR of 0.0
The pair LLNW and NTES produced a Sharpe Ratio of 0.24 and a CAGR of 0.0338
The pair LLNW and TCX produced a Sharpe Ratio of 0.95 and a CAGR of 0.1073

Now we run a few extra lines of code to combine, equally weight, and print out our final equity curve:

#concatenate together the individual equity curves into a single DataFrame
results_df = pd.concat(results,axis=1).dropna()

#equally weight each equity curve by dividing each by the number of pairs held in the DataFrame
results_df /= len(results_df.columns)

#sum up the equally weighted equity curves to get our final equity curve
final_res = results_df.sum(axis=1)

#plot the chart of our final equity curve
final_res.plot(figsize=(20,15))
#calculate and print our some final stats for our combined equity curve
sharpe = (final_res.pct_change().mean() / final_res.pct_change().std()) * (sqrt(252))
start_val = 1
end_val = final_res.iloc[-1]

start_date = final_res.index[0]
end_date = final_res.index[-1]

days = (end_date - start_date).days

CAGR = round(((float(end_val) / float(start_val)) ** (252.0/days)) - 1,4)
print("Sharpe Ratio is {} and CAGR is {}".format(round(sharpe,2),round(CAGR,4)))
Sharpe Ratio is 2.14 and CAGR is 0.06


59 comments

LS July 5, 2018 - 9:10 pm

Hi, nice post!
PS: the link to Kalman filter does not work unfortunately.

S666 July 8, 2018 - 11:53 am

That’s strange, it works for me…make sure you click the word “here” rather than “click”.

Jan-Carlo Jillissen July 11, 2018 - 6:40 pm

Great article!

I have found one issue: the first (halflife - 1) entries in meanSpread are NaNs. This causes the first entries of df1.zScore to be NaNs and therefore the comparison with the entryZscore fails. How can I make this work?

Thanks and keep up the good work!
JC

Diego August 22, 2018 - 8:20 pm

I don't understand why you define and use 2 Kalman filter functions? I appreciate your answer!

TitoBielsa September 22, 2018 - 8:04 pm

nice!

CT Lee November 11, 2018 - 8:56 am

Great article! But the hedge ratio is changing every day, whereas in a real situation the hedge ratio is fixed while the buy and sell trades are on, until the long or short exit.

Saurabh Shukla November 24, 2018 - 4:29 am

The hedge ratio should be online (should change every day)

Nathan Sela January 20, 2019 - 5:25 pm

Hello S666,
Firstly I would like to thank you for your very interesting posts on pair trading. I have two questions regarding your implementation:
1. You calculate the daily return when in position as: (spread – spread.lag(1)) / (x * hr + y). Can you please explain where it comes from and which position sizing you are assuming for each leg of the pair?
2. Your implementation of the Kalman Filter is to first filter x and y through a Kalman average (works like some sort of a moving average) and then feed the result to the main Kalman filter that calculates the hedge ratio and intercept. Could you please explain why is the hedge ration calculated on the smoothed prices rather than the true prices?
Thank you,
Nathan

s666 January 29, 2019 - 11:53 am

Apologies for the delay – I shall get to this question and reply shortly!

Dexter February 25, 2019 - 2:36 am

Hi

I'm trying to build the spread slightly differently by adding the intercept as well (i.e. spread = stock1 - beta*stock2 - alpha). Just to confirm, my alpha would be contained in state_means[:,1], is it? I'd assume so but wanted to double check.

Roman May 22, 2019 - 3:23 pm

There is an error in the backtest function related to the calculation of the hedge ratio. It recalculates at each timestamp, i.e. it is assumed that position sizes are added/reduced every day (if it is daily data). Though when you open the trades you fix the hedge ratio until you close them. It gives you an extra income. This error is present in the source of your code (QI) as well. The true backtest will not look like the current one at all, unfortunately. Don't fall into that trap.

Nick June 10, 2019 - 3:23 pm

Hello, I am trying to replicate the portfolio as a way to improve my programming. However I think the download of the prices from Yahoo has been disabled.
What tools are you using to download the data now?
I have come across a module called https://github.com/JECSand/yahoofinancials but the way it downloads the data has a very complicated formatting and I am struggling to get the adjusted closes.
Thank you very much.

s666 June 10, 2019 - 4:03 pm

Hi Nick,

I think the Pandas Datareader Yahoo download has been “fixed” somewhat. Please refer to this post (https://pythonforfinance.net//2019/05/30/python-monte-carlo-vs-bootstrapping/) which is the latest on the blog – it uses Datareader to pull down the Adjusted Close for a number of tickers in one go.

Hopefully that gets you what you want. If you are still experiencing issues, let me know.

Nick June 10, 2019 - 5:42 pm

Hi, thanks for getting back to me.
I am using a list of tickers for all the technology stocks from the Nasdaq, and I am using the formula

asset_universe = pd.DataFrame([web.DataReader(ticker, 'yahoo', start, end).loc[:, 'Adj Close'] for ticker in clean_names], index=clean_names).T.fillna(method='ffill')

I get

KeyError: Data

Have you ever had the same error?

s666 June 13, 2019 - 4:35 pm

Hi Nick,

It’s a bit difficult to debug without having the full list of tickers you are using (so I can try to recreate the problem), or having the full error message. Do you have a ticker in your list named “Data” by any chance?

If you could post the full error message and also perhaps paste your list of tickers I can take a closer look.

s666 June 13, 2019 - 4:39 pm

I found this link on Google: https://github.com/pydata/pandas-datareader/issues/487

It suggests using the "fix_yahoo_finance" package to solve the problem – although the official fix should have been integrated into pandas_datareader.

You could either try updating your pandas_datareader with the following command in the command prompt:

pip install --upgrade pandas-datareader

Or you could follow the advice on the above link and add the below lines and your script should work.

import fix_yahoo_finance as yf
yf.pdr_override()

Make sure you have pip installed fix_yahoo_finance already.

Pete June 25, 2019 - 7:08 pm

Hi and thank you for your post, it is very interesting approach!

Have you tried working with Futures? I would like to apply a similar logic to oil futures. For example you have the prices for September and December as a pair, and you get the data for the Sep-Dec 2018, 2017, 2016 contracts and so on.

I would like to use for example the 2013-2017 historical timeseries as a training set and then the 2018 timeseries as a test set. The problem is this is not a continuous timeseries, i.e. 2013 might close with oil at Sep, Dec = 60, 55 and 2014 might start at Sep, Dec = 80, 75.
How would you merge and normalize these series together before feeding them into your model?

Thanks!

s666 June 26, 2019 - 4:16 pm

Hi Pete, thanks for your comment and thanks for the kind words – it's nice to hear you find it of interest.

There are a number of ways to deal with creating a “continuous” futures contract but they all have their pros and cons – with one of the methods perhaps being seen as the “best” way forward (that would be the “perpetual” method).

There is a great blog post doing just what you are looking for – it runs through an example of combining oil futures contracts together – take a look here https://www.quantstart.com/articles/Continuous-Futures-Contracts-for-Backtesting-Purposes and if you have any follow up questions I would be more than happy to assist.

rafidou July 17, 2019 - 7:53 pm

Thanks very much for your article, great material !
There is however one line I don’t understand:

df1['spread pct ch'] = (df1['spread'] - df1['spread'].shift(1)) / ((df1['x'] * abs(df1['hr'])) + df1['y'])

Where does this come from?
Why not: (df1['spread'] - df1['spread'].shift(1)) / (df1['spread'].shift(1)) ?

Many thanks !

andrewcz July 24, 2019 - 5:48 am

Hi, I am having trouble pulling down the data.
Has the syntax changed?

#scrape html from website and store 3rd DataFrame as our stock tickers - this is dictated to us by the structure of the html
stock_list = pd.read_html("https://www.marketwatch.com/tools/industry/stocklist.asp?bcind_ind=9535&bcind_period=3mo")[3]

#convert the DataFrame of stocks into a list so we can easily iterate over it
stocks = stock_list[1].dropna()[1:].tolist()

IndexError Traceback (most recent call last)
in
1 #scrape html from website and store 3rd DataFrame as our stock tickers - this is dictated to us by the structure of the html
----> 2 stock_list = pd.read_html("https://www.marketwatch.com/tools/industry/stocklist.asp?bcind_ind=9535&bcind_period=3mo")[3]
3 #convert the DataFrame of stocks into a list so we can easily iterate over it
4 stocks = stock_list[1].dropna()[1:].tolist()

IndexError: list index out of range

s666 July 25, 2019 - 5:54 am

Hi Andrew, I’m afraid so… The html structure of the page has changed numerous times and it’s difficult to keep the code updated.

Andrew Czeizler July 25, 2019 - 5:55 am

from pandas_datareader import data as pdr

import yfinance as yf
yf.pdr_override() # <== that's all it takes 🙂

# NYSE
url_nyse = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download"

# Nasdaq
url_nasdaq = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download"

# AMEX
url_amex = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=amex&render=download"

import pandas as pd

df = pd.DataFrame.from_csv(url_nyse)
stocks = df.index.tolist()

Andrew Czeizler July 25, 2019 - 5:56 am

The above is how to get the stocklist-
I just cant port it to your code.

Andrew Czeizler July 25, 2019 - 5:57 am

The below is what I have tried-

from pandas_datareader import data as pdr

import yfinance as yf
yf.pdr_override() # <== that's all it takes 🙂

# NYSE
url_nyse = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download"

# Nasdaq
url_nasdaq = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download"

# AMEX
url_amex = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=amex&render=download"

import pandas as pd

df = pd.DataFrame.from_csv(url_nyse)
stocks = df.index.tolist()

df_list = []
used_stocks = []

for stock in stocks:
    try:
        data = pdr.get_data_yahoo(stock, start="2017-01-01", end="2017-04-30")
        data_close = data['Close']
        df_list.append(data_close)
        used_stocks.append(stock)
    except (KeyError, ValueError):
        print("Error")

df_list

s666 July 25, 2019 - 6:02 am

Ah right apologies I am replying on my phone and I thought this comment had been made on a different blog post… Yeah the yahoo download should be easy to fix as you mention. How is your code not working? Are you getting any error messages? As I said, I’m currently typing on my mobile phone so can’t run the code myself just now.

Andrew Czeizler July 25, 2019 - 6:20 am

When there is no data for the query, it comes up with a traceback error rather than catching the error.
I am pretty close, I am just not sure how to catch the traceback error.
cheers,
Andrew

s666 July 25, 2019 - 6:22 am

You could just use "pass" instead of catching it… Might get you up and running in the meantime.

Andrew Czeizler July 25, 2019 - 6:23 am

Hi yer, I tried pass but for some reason it kept coming up with a traceback error.
I thought it was pretty strange behaviour.
Best,
Andrew

airflowdirect August 4, 2019 - 4:50 am

Hi @S666,
I was wondering if you could show where to add transaction fees in the back test.
If we could just do a simple fee per trade that would account for slippage and transaction costs
it would bring more realism to the back test.
Best,
Andrew

s666 August 7, 2019 - 1:28 pm

Hi Andrew,

I'll have to have a think about this one as the strategy logic wasn't really designed or built with the inclusion of commissions and slippage etc in mind. If I can think of a decent quick fix, I'll try to find time to post it. I may actually make my next post all about those "extra" bits that go into a backtest that are usually omitted and that most people tend to ignore…things precisely like slippage and commissions.

Andrew Czeizler July 25, 2019 - 7:34 am

Hi, so I was able to fix the data issue.
Now I am running into problems with the trading logic-

results = []
for pair in pairs:
    rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1])
    results.append(rets)
    print("The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}".format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))
    rets.plot(figsize=(20,15),legend=True)

home/andrewcz/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
return ptp(axis=axis, out=out, **kwargs)
/home/andrewcz/.local/lib/python3.7/site-packages/ipykernel_launcher.py:37: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

KeyError Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2656 try:
-> 2657 return self._engine.get_loc(key)
2658 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'short entry'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
1 results = []
2 for pair in pairs:
----> 3 rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1])
4 results.append(rets)
5 print("The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}".format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))

in backtest(df, s1, s2)
38 df1['num units long'] = df1['num units long'].fillna(method='pad') #set up num units short df1['short entry'] = ((df1.zScore > entryZscore) & ( df1.zScore.shift(1) < entryZscore))
39 df1['short exit'] = ((df1.zScore < exitZscore) & (df1.zScore.shift(1) > exitZscore))
---> 40 df1.loc[df1['short entry'],'num units short'] = -1
41 df1.loc[df1['short exit'],'num units short'] = 0
42 df1['num units short'][0] = 0

~/.local/lib/python3.7/site-packages/pandas/core/frame.py in getitem(self, key)
2925 if self.columns.nlevels > 1:
2926 return self._getitem_multilevel(key)
-> 2927 indexer = self.columns.get_loc(key)
2928 if is_integer(indexer):
2929 indexer = [indexer]

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2657 return self._engine.get_loc(key)
2658 except KeyError:
-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2661 if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'short entry'

s666 July 25, 2019 - 9:09 am

Hi there – I have had a quick look and it is due to some incorrect formatting in the code above – there are some “new line” breaks that aren’t being recognised – let me fix it now and I will message again when done.

s666 July 25, 2019 - 9:14 am

Ok try cutting and pasting the code again – I believe I have corrected the problem. If it still doesn’t work, let me know.

Andrew Czeizler July 25, 2019 - 9:32 am

Ah cheers mate much appreciated!
Did you also change the formatting in the cell above with the back test?
Many thanks,
Best,
Andrew

s666 July 25, 2019 - 9:34 am

The cell with the backtest function was the cell I changed, only that one…

Andrew Czeizler July 25, 2019 - 9:45 am

Mate you are awesome!
worked like a charm.
A question, how would I add realistic fees to add further accuracy to the back test?
Best,
Andrew

s666 July 25, 2019 - 9:56 am

No probs – glad that worked 😉

In terms of adding a "fees" component, it can be done a number of ways…I guess it depends on which assets you are planning to trade and how their real-life fees/commissions etc are structured. Which assets are you considering?

Andrew Czeizler July 25, 2019 - 10:00 am

Well, I was thinking of just adding a general cost that would take care of slippage and transaction costs.
If I could just add the cost into the backtest it would give a general idea of how costs would affect the Sharpe ratio.
I was just wondering on what line I would add the cost component.
Best,
Andrew

Andrew Czeizler July 25, 2019 - 10:03 am

Also in the back test, where is the line that sets the initial value for the portfolio?
Best,
Andrew

David August 3, 2019 - 10:19 pm

This is great, thank you. Looking forward to testing. I’m having the syntax issue Andrew Czeizler had with fetching urls. I’m very new to coding and not sure how to get there. I created my own watch list on MarketWatch as well as trying the exchange downloads as Andrew suggested but with no progress. The MarketWatch list returns an error with ‘no tables found’. The links Andrew tried return with a syntax error for each of the urls, ‘invalid character in identifier’. Any tips would be greatly appreciated.

David

s666 August 3, 2019 - 10:23 pm

Hi David, when you just run the code as is on the site, what error message do you get?

David August 4, 2019 - 12:42 am

In cell 2 (scrape html from website), I get ‘IndexError: list index out of range’ when copied/pasted. I haven’t gotten beyond that point.

s666 August 4, 2019 - 1:28 am

Hi there,

Ok try this – replace the code in the second cell down to line 6 with the following:

import string
df_list = []

for letter in string.ascii_uppercase:
    url = f"http://eoddata.com/stocklist/NYSE/{letter}.htm"
    df_temp = pd.read_html(url)[4]
    df_list.append(df_temp)

df_master = pd.concat(df_list)
stocks = df_master['Code'].to_list()

That scrapes all the NYSE stock tickers – it's a LOT of tickers so you may like to be a bit more selective, as it will take quite a long time to run that many stocks through the data download.

David August 4, 2019 - 1:53 am

Hmm same error. I wonder if there’s a module I have not imported or installed. Maybe something so common that you wouldn’t have needed to specify it. I added all code into Jupyter and have the following: Cell 2: list index out of range. Cell 3: name ‘df’ is not defined. Cell 5: name ‘df’ is not defined. Cell 6: name ‘pairs’ is not defined. Cell 9: name ‘pairs’ is not defined. Cell 10: No objects to concatenate. Cell 11: name ‘final_res’ is not defined.

s666 August 7, 2019 - 1:07 pm

It's a little bit tricky without being able to see your code or error messages, but perhaps try doing what Andrew did in the above comments and change the download provider from "iex" to "yahoo" and see if that gets you any further.

andrewcz August 7, 2019 - 1:45 pm

The below code downloads the ticker data.
Hope this helps.
Best,
Andrew

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
mpl.style.use('bmh')
import pandas_datareader.data as web
import matplotlib.pylab as plt
from datetime import datetime
import statsmodels.api as sm
from pykalman import KalmanFilter
from math import sqrt
from pandas_datareader import data as pdr

import yfinance as yf
yf.pdr_override() # <== that's all it takes 🙂

import pandas as pd
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
table = data[0]
table.head()

sliced_table = table[1:]
header = table.iloc[0]
corrected_table = sliced_table.rename(columns=header)
corrected_table
tickers = corrected_table['MMM'].tolist()
print(tickers)

tickers = tickers[0:30]
#download ticker data and get closing prices
data = yf.download(tickers, start="2014-01-01", end="2019-04-30")
df = data['Close']

df.plot(figsize=(20,10))

s666 August 7, 2019 - 4:13 pm

Many thanks for adding that and contributing! Very much appreciated…

andrewcz August 8, 2019 - 2:52 am

mate your blog is awesome!
thank you!
Super excited about future articles.
I was just wondering if there could be articles on transaction costs and running an algorithm live.
These two topics seem very difficult to find good, practical information.
Best,
Andrew

s666 August 13, 2019 - 4:06 am

Will do mate, I’ll make those both the subject of my next post 😀

David August 8, 2019 - 3:10 am

Thank you both!

David August 8, 2019 - 11:49 pm

You mentioned being a bit more selective rather than looking at all tickers on an exchange. How should I do this? Having trouble understanding which pair is being referred to in the final equity curve.

David August 9, 2019 - 12:09 am

I guess because I have an error with the heat map not printing. NameError: name ‘used_stocks’ is not defined.

Jay August 13, 2019 - 11:35 pm

df1['spread pct ch'] = (df1['spread'] - df1['spread'].shift(1))
Re above, I think there is forward bias here. The spread is based on the hedge ratio, which is updated on a daily basis. So the daily mark-to-market pnl should be based on the spread using the t-1 hedge ratio, not the t ratio, i.e. settle your existing pair portfolio before getting into a new one.

Kristian N. Kristensen September 22, 2019 - 10:33 am

When I run the following code:

results = []
for pair in pairs:
    rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1])
    results.append(rets)
    print("The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}".format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))
    rets.plot(figsize=(20,15),legend=True)

I get the error

TypeError Traceback (most recent call last)
in
2
3 for pair in pairs:
----> 4 rets, sharpe, CAGR = backtest(df[split:],pair[0],pair[1])
5 results.append(rets)
6 print("The pair {} and {} produced a Sharpe Ratio of {} and a CAGR of {}".format(pair[0],pair[1],round(sharpe,2),round(CAGR,4)))

TypeError: cannot unpack non-iterable NoneType object

Do you know what is wrong?

Thanks in advance!

s666 September 22, 2019 - 11:05 am

So it looks like your backtest function is returning “None” instead of the 3 variables it is supposed to. Have you altered the last line of the backtest function that deals with the return statement at all?

mathew October 22, 2019 - 7:25 am

Highly recommend you translate the strategy into shares and use round lots. You will find the results will be completely different. You will find the majority of winning trades will actually be significant losses.

s666 October 22, 2019 - 7:32 am

Absolutely agree, the results will change fundamentally once the strategy logic is refined further to include those kinds of "pesky realities"!! Things such as having to trade round lots, not having an endless pit of money to keep altering position sizes with no idea of the total inflow needed, having to cross the bid/offer spread, slippage and brokerage costs/commissions are just a few examples off the top of my head…

Johnny Que November 20, 2019 - 8:57 pm

Nicely done 🙂
So what would be the calculation for the forecast error here? Would this simply be the spread?

