Equities Market Intraday Momentum Strategy in Python – Part 1

by s666

For this post, I want to take a look at the concept of intra-day momentum and investigate whether we are able to identify any positive signs of such a phenomenon occurring across a (quite large) universe of NYSE stocks. It has been suggested that, for the wider market in general at least, there is a statistically significant intra-day momentum effect: a positive relationship between the direction of returns seen during the first half hour of the trading day (taking the previous day’s closing price as the “starting value”) and the last half hour of the day’s session. That is to say, a stock/index which displays a positive return early in the trading session may be more likely to experience a positive return over the last part of the session.

The effect seems to have been first identified/posited by Gao, Han, Li and Zhou in their 2015 research paper (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2440866). In that paper, they specifically look at high-frequency data on the S&P 500 ETF, and they test over 20 years’ worth of data – so it’s worth pointing out that I am going the “other way” somewhat. Where their study lacked breadth (number of instruments studied), my data contains around 3000 individual stocks; however, where they tested over a long time period (20 years), my data spans only 1 year.

I have a feeling already that the mechanisms and forces that move the “overall market” and result in certain price patterns and behaviours, may not necessarily translate exactly over into the individual constituent stocks. We can but try…

The chart below shows what we are looking for – a daily price path that displays the same overall direction in the first 30 minutes as it does in the last 30 minutes. (30 minutes is our starting point for a reasonable window, as this is the window used in the aforementioned research paper – we can perhaps play around with this value at some point.) The overall return for the two window periods can be either up or down, as long as both moves are in the same direction. I think I’ve said that enough times now…!
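To make the definition concrete, here is a minimal sketch for a single stock on a single day. The four prices are made-up numbers purely for illustration, and I have used log returns for convenience.

```python
import numpy as np

# Hypothetical prices for one stock on one day (illustrative values only).
prev_close = 100.0   # yesterday's 4:00 pm price
price_1000 = 100.8   # today's 10:00 am price
price_1530 = 101.1   # today's 3:30 pm price
price_1600 = 101.5   # today's 4:00 pm price

# The first half hour is measured from the previous day's close.
first_ret = np.log(price_1000 / prev_close)
# The last half hour runs from 3:30 pm to the 4:00 pm close.
last_ret = np.log(price_1600 / price_1530)

# "Intra-day momentum" on this day means both returns share a sign.
same_direction = bool(np.sign(first_ret) == np.sign(last_ret))
print(first_ret, last_ret, same_direction)
```

Both windows are up in this toy case, so the day would count as a “momentum” day; two down windows would count just the same.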

As in the previous post I shall be using data sourced from AlgoSeek.com. I have found the data quality so far to be way “above and beyond” when compared to the freely available sources I have mainly used up until now. Even aside from the gulf in quality, it is next to impossible to source intraday stock data for free (at least in my experience). I believe AlphaVantage still has an API that allows intra-day downloads, but when I used it for various pet projects and research efforts I quickly realised my results were being badly affected by the dubious quality.

I have found that the more granular the time-frame you are working with, invariably the more your results and models/backtests are susceptible to data flaws, such as bum prints, missing data points, random zero values etc. I came across them a LOT when using AlphaVantage intraday data. But hey, sometimes you can’t argue with free!! I would just gently suggest that if and when you come to a more “serious” stage of investigation with a particular strategy regarding backtesting and model evaluation/validation you would be well advised just to at least be aware of the difference your input data can make.

Ok so…onto the “strategy”…

As always we begin with our module imports. I have also set the default matplotlib figure size to 12 x 8, as I find the normal default to be too small for my liking.

import os
import pandas as pd
import numpy as np
import statsmodels.api as sm

import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

mpl.rcParams['figure.figsize'] = (12., 8.)

Seeing as I was dealing with 1000s of stocks over a 1-year period (2017), each one containing minute-by-minute data and an accompanying 57-odd columns of data per 1-minute bar (i.e. a LOT of data), I had to download, extract and wrangle it in a few distinct steps. It was downloaded directly from the AlgoSeek Amazon Web Server API and in zipped CSV file format sat at 45 GB in size. Once extracted and moved across into a series of SQLite databases it grew roughly 10x in size and currently sits at around 432 GB. And that is just 2017’s data! I also signed up for the 2018 period data set – so that’s probably the same size again, nearing 1 TB of minute bar data for the NYSE listed stocks.

Currently, it’s more a case of struggling to manoeuvre around such a large data set but definitely watch this space, I want to try to extract all I can from it. As I said earlier, it’s not easy to find high-quality intraday equity data. I may take the opportunity soon to write a post on the difficulties and approaches regarding the use of such a large data set…1 TB isn’t going to fit in your local memory so adjustments and various specialised libraries usually come into play (e.g. Dask, use of specialised Pandas arguments and methods to deal with limitations, paying special attention to data types used to store data etc).
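As a small taste of the kind of adjustments I mean, here is a rough sketch of reading a CSV in chunks while forcing a smaller data type. The column names “bid” and “ask” and the tiny in-memory CSV are assumptions made up for the example, not the real AlgoSeek layout.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a large minute-bar CSV (made-up values).
csv_text = "bid,ask\n" + "\n".join(f"{100 + i},{100.5 + i}" for i in range(10))

chunks = []
# chunksize makes read_csv yield DataFrames of at most 4 rows at a time;
# dtype stores each column as float32 rather than the float64 default.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4,
                         dtype={"bid": "float32", "ask": "float32"}):
    chunks.append(chunk)

small_df = pd.concat(chunks, ignore_index=True)
n_bytes = small_df.memory_usage(index=False).sum()  # bytes used by the two columns
print(small_df.dtypes)
print(n_bytes)
```

In a real workflow each chunk would be filtered or aggregated before being kept, so that the full file never has to sit in memory at once.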

One storage file type I have come to rely on in instances such as these is the “feather” file type. When datasets start to get “large”, and read/write times can themselves become a bit of a time sink when carried out multiple times (as is usually the case when exploring and carrying out the initial research phases of a project), it can really make a noticeable difference to pay attention to details such as these.

The feather Github can be found here (https://github.com/wesm/feather).

If we just run a few simple tests and time each one, we can get an idea of the speed up we can expect by substituting in feather files for the bog-standard CSV files we all usually default to.

I have a Pandas DataFrame that currently holds 240851 rows and 1844 columns, so 444,129,244 cells in total. It is showing as being 1.7 GB when displaying the output of a “df.info()” call. That’s not HUGE by any means, but it’s starting to creep towards a size whereby we wouldn’t want to be reading and writing it to file and back out again too many times if we can possibly help it.
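As a quick sanity check on those numbers, the arithmetic can be sketched as below. It assumes every cell is a 4-byte “float32” value (the dtype this DataFrame holds) and ignores the index, so treat it as a rough estimate.

```python
rows, cols = 240_851, 1_844
cells = rows * cols

bytes_float32 = cells * 4   # float32 uses 4 bytes per value
bytes_float64 = cells * 8   # float64 would need twice as much

gib_float32 = bytes_float32 / 1024**3
gib_float64 = bytes_float64 / 1024**3
print(cells)
print(round(gib_float32, 2), round(gib_float64, 2))
```

The float32 estimate comes out at roughly 1.65, which matches the reported 1.7 GB if that figure is in binary gigabytes (my assumption about how “df.info()” formats sizes).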

The data held in the DataFrame is all of the “float32” datatype, and when I write the data to CSV on my hard drive using the basic:

df.to_csv('df.csv')

The first attempt registered at 6 minutes and 1 second for the complete write time. That’s not great…

How about reading it back in?

df2 = pd.read_csv('df.csv')

The clock registered 42.2 seconds. Better than the 6 minutes taken to write the file to disk, but if you keep reading it in, again and again, that time is going to stack up! So onto our “feather” files…are they really going to be that much better than the stock Pandas “read_csv” method? Let’s have a go and see.

First, let’s write the exact same DataFrame to file and time it.

df.to_feather('df.file')

And….1.07 seconds!! That’s quite something. And finally, let’s read the DataFrame back in from a feather file:

df2 = pd.read_feather('df.file')

Again…just 1.16 seconds. So it’s safe to say that working with feather files can definitely add value and cut down on the read/write times you have to sit there and suffer with a large dataset. On another positive note Feather currently supports a relatively wide range of data types (info can be found at the Github repo link pasted earlier).
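If you want to run this sort of comparison yourself, a small timing helper along the following lines is one way to go about it. The helper name “timed”, the demo DataFrame and the file name are all made up for this sketch, and it only times the CSV round trip so that it runs without any extra dependencies.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Small random float32 frame standing in for the real data (hypothetical size).
demo_df = pd.DataFrame(np.random.rand(1000, 20).astype("float32"),
                       columns=[f"c{i}" for i in range(20)])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    _, write_secs = timed(demo_df.to_csv, csv_path, index=False)
    loaded_df, read_secs = timed(pd.read_csv, csv_path)

print(f"CSV write: {write_secs:.3f}s, read: {read_secs:.3f}s")
```

The same helper can wrap “demo_df.to_feather” and “pd.read_feather” for a like-for-like comparison, provided the pyarrow package is installed.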

Ok so onto the intraday momentum analysis!

I had previously run some scripts which extracted the relevant data I wanted from the series of SQLite databases, stripped out what I didn’t want and then saved the results in feather files for speedy retrieval later. The code below defines the data folder where the feather files are stored, then iterates through all files in the directory (which I made sure contains only the relevant feather files, with no other files hidden away amongst them to cause an error). Each file is read into a DataFrame, its contents are cast to “float32” (to save space/memory vs “float64”) and it is appended to an initially empty list. Finally, the list is concatenated into one giant master DataFrame.

# a raw string cannot end in a backslash; os.path.join adds the separator for us
feather_path = r'G:\AlgoSeek\Data'

df_list = []

for filename in os.listdir(feather_path):
    if filename.endswith('.file'):
        df = pd.read_feather(os.path.join(feather_path, filename))
        df.set_index('index', inplace=True)
        df = df.astype('float32')
        df_list.append(df)

df = pd.concat(df_list, axis=1)

The resulting “head” of the DataFrame looks as follows (with many columns not showing of course…there are 3865 columns in total!)

You can see that the structure of the data is such that the columns represent pairs of Bid/Offer prices as we move from left to right, so each stock is represented by two columns.

Next, we have to write the code to enact the following logic:

1) Iterate through the DataFrame columns, selecting 2 columns at a time and extracting only three rows per day from that data. The 3 rows we are interested in are those corresponding to the 10 am minute bar, the 3:30 pm minute bar and the 4 pm close-of-trading minute bar. As a reminder, the rows in the DataFrame represent the close of that particular minute bar.

2) Calculate the “mid-price” between the bid and offer price at the 3 times mentioned above (this is perhaps not an ideal way to calculate the “price” at each particular time as it is susceptible to being skewed by anomalously wide bid/offer spreads – but just for the moment let’s stick with it and see how we get on).

3) Calculate the percentage change in our calculated “mid-price” between each of the 3 times (the code uses the difference in log prices) – this represents the change in price between 10 am and 3:30 pm, the change between 3:30 pm and the close of trading at 4 pm, and finally the change between the close of trading at 4 pm and 10 am the NEXT DAY.

4) We don’t actually need the values relating to the percentage change between 10 am and 3:30 pm, so we ignore those and extract only the values for the other 2 periods (3:30 pm to 4 pm and 4 pm to next day at 10 am). We then quickly drop any data corresponding to the first day in the DataFrame as of course it will be incomplete (no comparison to “yesterday” is possible if we don’t have “yesterday’s” data).

5) We are now left with a DataFrame for the current stock in question which is made up of 2 rows of data per day, which we append to a list for storage purposes. Once the loop through all the stocks has finished, the list of results is then concatenated together into a master DataFrame.

6) Once we have this master DataFrame, we then create a new DataFrame which is filled with boolean values representing whether each row’s percentage change value is of the same sign as the next row’s value. i.e. are they both positive or both negative…if so then that is stored as a True value. If the signs differ (i.e. one is positive and one is negative, then the value is of course stored as False).

7) The final couple of steps involved changing the Trues and Falses into 1s and 0s (for use later) and then extracting only the rows we are interested in. That is achieved by indexing the DataFrame starting at the second row and “jumping” 2 rows each time – basically, we are dropping the first row and then only selecting every second row after that.

The reason for the above indexing method relates to the fact we only want to be left with a single entry for each day – either 1 or 0, or in other words either True or False, hence we don’t need the rows relating to the 10 am bar each day anymore.
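On the caveat in step 2 about wide bid/offer spreads, one possible safeguard is to blank out the mid-price whenever the spread looks unreasonable. This is only a sketch and not something used in the analysis: the quotes are made up and “max_rel_spread” is a hypothetical threshold I picked for illustration.

```python
import pandas as pd

# Made-up quotes: the third row has an anomalously wide spread.
quotes = pd.DataFrame({"bid": [10.00, 10.02, 5.00, 10.05],
                       "ask": [10.02, 10.04, 15.00, 10.07]})

mid = quotes[["bid", "ask"]].mean(axis=1)
rel_spread = (quotes["ask"] - quotes["bid"]) / mid

# max_rel_spread is a hypothetical threshold, not a value from the analysis.
max_rel_spread = 0.01
# Keep the mid-price where the spread is acceptable, NaN otherwise.
clean_mid = mid.where(rel_spread <= max_rel_spread)
print(clean_mid)
```

Any return calculated from a blanked-out price then becomes NaN and would simply drop out of the later averages and regression.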

results_list = []

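# Step through the columns two at a time (one bid/offer pair per stock).
# The "+ 1" makes sure the final pair is included when the column count is even.
for i in range(2, len(df.columns) + 1, 2):
    # .copy() avoids a SettingWithCopyWarning when we add columns below
    df_temp = df[df.columns[i-2:i]].copy()
    # keep only the 10:00, 15:30 and 16:00 minute bars
    df_temp = df_temp[((df_temp.index.hour == 16) & (df_temp.index.minute == 0)) | \
        ((df_temp.index.hour == 10) & (df_temp.index.minute == 0)) | \
        ((df_temp.index.hour == 15) & (df_temp.index.minute == 30))]

    df_temp['mid_price'] = df_temp.mean(axis=1)

    # log return between consecutive retained bars
    df_temp['pct_change'] = np.log(df_temp['mid_price']).diff()

    df_temp = df_temp[((df_temp.index.hour == 16) & (df_temp.index.minute == 0)) | \
            ((df_temp.index.hour == 10) & (df_temp.index.minute == 0))]

    # drop the first calendar day; comparing full dates rather than day-of-month
    # avoids also dropping the early days of every later month
    df_temp = df_temp[df_temp.index.normalize() > df_temp.index[0].normalize()]

    results_list.append(df_temp['pct_change'])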

pct_change_df = pd.concat(results_list, axis=1)

sign_df = np.sign(pct_change_df)

sign_change_df = np.sign(sign_df) == np.sign(sign_df.shift(1))
sign_change_df = sign_change_df.astype(int)

final_df = sign_change_df.iloc[1::2]
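To check the row alignment, the same logic can be run on a single synthetic stock where the answer is obvious by eye. The prices below are made up, and the first-day filter compares full dates, so take this as a sketch of the technique rather than a copy of the loop.

```python
import numpy as np
import pandas as pd

# Made-up mid-prices for one stock at 10:00, 15:30 and 16:00 over three days.
idx = pd.to_datetime([
    "2017-01-03 10:00", "2017-01-03 15:30", "2017-01-03 16:00",
    "2017-01-04 10:00", "2017-01-04 15:30", "2017-01-04 16:00",
    "2017-01-05 10:00", "2017-01-05 15:30", "2017-01-05 16:00",
])
mid = pd.Series([100.0, 101.0, 100.5,
                 101.0, 101.5, 102.0,
                 101.0, 100.0, 100.4], index=idx)

log_ret = np.log(mid).diff()
# Keep the 10:00 row (move since the prior close) and the 16:00 row (last 30 mins).
keep = (idx.minute == 0) & ((idx.hour == 10) | (idx.hour == 16))
log_ret = log_ret[keep]
# Drop the first calendar day, which has no previous close to compare with.
log_ret = log_ret[log_ret.index.normalize() > idx[0].normalize()]

signs = np.sign(log_ret)
same_sign = (signs == signs.shift(1)).astype(int)
daily_flag = same_sign.iloc[1::2]   # one value per day, read at the 16:00 row
print(daily_flag)
```

On 4 January both windows are up, so the flag is 1; on 5 January the stock opens down but rallies into the close, so the flag is 0.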

Now that we have this new DataFrame, we can just take the average of the values in a particular column to determine the proportion of times that we have indeed seen a stock end the trading session in the same vein as it started it, with either both 30 minute periods ending in negative territory or both ending in positive territory.

If it never occurred, the average value would, of course, be zero, while if it happened 100% of the time, the average of the column would be 1. Let’s calculate the average value of each column and plot a histogram of the results.

print(final_df.mean().mean())
ax = final_df.mean().hist(color='r')
ax.set_title('Distribution of Proportion of Days "Intra-Day Momentum" Was Observed Per Stock')
ax.set_xlabel('% of Days Intra-Day Momentum Observed')
ax.set_ylabel('Number of Stocks')
plt.show()

The results seem to indicate that not only do we not tend to observe the “intra-day momentum” we were searching for, but actually the opposite seems to be taking place the majority of the time. If we assume that the cases where either 30-minute period results in EXACTLY 0% return (no price change at all) are relatively few and far between, then we can interpret the above results as signifying that this intra-day momentum effect happens, on average, less than half the time for practically every single stock. There seems to be one outlier up near the 80%+ mark, which may warrant closer inspection to make sure it isn’t being caused by any bugs or faults in logic.

The rest of the stocks look to experience, on average, moves of opposite directions between the first and last 30 minutes of the trading session. I wasn’t expecting to see a distribution like this, I must admit.
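Tracking down that outlier is just a matter of filtering the per-stock averages. The sketch below uses a made-up stand-in for “final_df” with hypothetical tickers, one of which is forced to 1 on every day so there is something to find.

```python
import numpy as np
import pandas as pd

# Made-up per-stock 0/1 flags standing in for final_df (tickers are hypothetical).
rng = np.random.default_rng(0)
toy_final_df = pd.DataFrame(rng.integers(0, 2, size=(50, 4)),
                            columns=["AAA", "BBB", "CCC", "DDD"])
toy_final_df["DDD"] = 1   # force one stock to look like an outlier

proportions = toy_final_df.mean()
# 0.8 echoes the "80%+" mark seen on the histogram.
outliers = proportions[proportions > 0.8]
print(outliers)
```

Once the column is identified, the underlying bid/offer data for that stock can be inspected directly for gaps or stale quotes.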

If we calculate the overall average value of the main DataFrame with our 1s and 0s, we end up with 0.3076. That is the same as saying these stocks in question, over the time period tested, displayed returns of opposing sign between the two daily time periods in question on average (1 – 0.3076) = 69.24% of the time.

This is interesting information but it doesn’t afford us any insight into the “strength” of the relationship, for want of a better term. We can see that the direction of return plays a part, but does the magnitude of the percent return in the first 30 minute period have any significant influence on the magnitude of the return in the last 30 minutes?

Let’s move back to concentrate on our DataFrame containing the actual percentage changes for each stock in each time period for each day – this is the DataFrame we named “pct_change_df” and the first few rows look as follows:

The eagle-eyed might notice a few “0.000000” values in there and wonder if something hasn’t been correctly calculated. I ran through a couple of manual checks and indeed for the ones I looked at, the price hadn’t moved – most often due to the stock just being relatively illiquid and not so heavily traded. I also calculated the instances of these zero values as occurring less than 2% of the time across the entire DataFrame so hopefully, these won’t affect our results too much (not that they would really cause too many problems anyway quite frankly).
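For anyone wanting to reproduce that kind of check, the share of exact zeros can be counted as below. The toy DataFrame and its tickers are made up; on the real data you would swap in “pct_change_df”.

```python
import numpy as np
import pandas as pd

# Made-up log returns for two hypothetical stocks; NaN marks missing data.
toy_returns = pd.DataFrame({"AAA": [0.01, 0.0, -0.02, np.nan],
                            "BBB": [0.0, 0.03, 0.01, -0.01]})

n_zero = int((toy_returns == 0).sum().sum())     # cells that are exactly zero
n_valid = int(toy_returns.notna().sum().sum())   # cells that are not missing
zero_share = n_zero / n_valid
print(n_zero, n_valid, zero_share)
```

Dividing by the non-missing count rather than the total cell count stops stocks with lots of gaps from flattering the figure.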

Our next piece of code converts the DataFrame into a format that is ready to be passed into a linear regression model – the DataFrame is first indexed accordingly into two separate datasets, with one containing the odd-numbered rows (the morning session return %s) and the other containing the even-numbered rows (the days’ last 30 minutes return %s).

X = pct_change_df.iloc[0::2].values.flatten()
y = pct_change_df.iloc[1::2].values.flatten()

The use of “.values” allows us to extract the values as 2-dimensional numpy arrays (with half as many rows as the original DataFrame and the same number of columns), which we then “flatten” into a 1-dimensional array so that all our values are held in a single vector.

Doing this keeps our data correctly ordered, so that the first value in vector X (the first 30 minutes return % for day 1 for stock 1) is lined up against the first value in vector y (the last 30 minutes return for day 1 for stock 1).

The next values lined up would be:

(the first 30 minutes return % for day 1 for stock 2) -> (the last 30 minutes return for day 1 for stock 2)
(the first 30 minutes return % for day 1 for stock 3) -> (the last 30 minutes return for day 1 for stock 3)

until all stocks have been displayed once, and then it would be:

(the first 30 minutes return % for day 2 for stock 1) -> (the last 30 minutes return for day 2 for stock 1)
(the first 30 minutes return % for day 2 for stock 2) -> (the last 30 minutes return for day 2 for stock 2)

and so on…
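A toy example makes the ordering easy to verify. The values below are made up so that each number encodes its day and stock; the names are hypothetical.

```python
import pandas as pd

# Rows alternate morning/afternoon; values encode day and stock for readability.
toy = pd.DataFrame(
    {"stock1": [11, 110, 21, 210], "stock2": [12, 120, 22, 220]},
    index=["d1_am", "d1_pm", "d2_am", "d2_pm"],
)

X_toy = toy.iloc[0::2].values.flatten()   # morning rows, flattened row by row
y_toy = toy.iloc[1::2].values.flatten()   # afternoon rows, in the same order
pairs = [(int(a), int(b)) for a, b in zip(X_toy, y_toy)]
print(pairs)
```

Each morning value sits opposite the afternoon value for the same day and stock, because “flatten” walks along each row before moving down to the next one.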

What this in essence means, is that we can now easily feed these vectors into a quick scatter plot to get an idea of what the relationship looks like between the variables.

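# pair up the two vectors and drop any rows with missing values before plotting
reg_df = pd.DataFrame({'X': X, 'y': y}).dropna()
plt.scatter(reg_df['X'].values,reg_df['y'].values,c='r')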
plt.xlabel('First 30 mins Log Return')
plt.ylabel('Last 30 mins Log Return')
plt.show()

We can see from the plot there seems to be some sort of negative, linear relationship between a substantial number of the data points plotted. There are of course many other data points which don’t seem to show any significant, identifiable relationship which we need to account for too. But that “line” of points moving down from the top left to the bottom right of our chart figure could suggest there is something interesting to be had from these results.

Let us run a simple linear regression on the data and see what we get:

reg_df = pd.DataFrame({'X': X, 'y': y}).dropna()
X = reg_df['X'].values
y = reg_df['y'].values
X = sm.add_constant(X)
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())

Again, we see evidence of that strong negative relationship between the returns a stock experiences in the first 30 minutes of the trading session vs those in the last 30 minutes.
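For a feel of what the regression is estimating, the slope of a one-variable OLS fit with a constant is simply the covariance of the two series divided by the variance of X. The sketch below shows this with plain NumPy on synthetic data that has a known negative slope; none of the numbers come from the real dataset.

```python
import numpy as np

# Synthetic data with a known negative slope (illustrative only).
rng = np.random.default_rng(42)
x_demo = rng.normal(0.0, 0.01, size=500)
y_demo = -0.5 * x_demo + rng.normal(0.0, 0.001, size=500)

# OLS with an intercept: slope = cov(x, y) / var(x), intercept from the means.
slope = np.cov(x_demo, y_demo, ddof=1)[0, 1] / np.var(x_demo, ddof=1)
intercept = y_demo.mean() - slope * x_demo.mean()

# Cross-check against NumPy's degree-1 least-squares polynomial fit.
fit_slope, fit_intercept = np.polyfit(x_demo, y_demo, 1)
print(slope, intercept)
```

A negative slope here means that a larger up-move in the first window goes with a larger down-move in the last window, which is the reversal pattern rather than momentum.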

I think that might be a good place to leave this for now; I feel it probably deserves a “2-parter” set of articles rather than trying to finish it off and squeeze everything into this post. Next time we could start by looking more closely at this potential relationship we have identified, and perhaps incorporate an element of searching for some kind of “optimal” parameter with regards to the length of the morning and afternoon windows across which we are recording the stock returns – to see if these results are in any way robust to changes in that value.

Until next time…

3 comments

Jacky Chen November 6, 2019 - 3:49 pm

I ran their paper’s setup with just SPY for the past 10 years using 30 min OHLC data from IB. I think their results are just BS.

s666 November 7, 2019 - 6:31 am

Interesting… May I ask what your overall findings were? Just a general conclusion of no relationship between the start and end of day periods?

S666 November 22, 2019 - 11:35 am

Is that site you linked just a website that offers to write students’ dissertations and thesis for them? Must admit I’m not a great supporter of those kinds of services to be honest. May I ask how and why it is relevant in this case? I’m struggling to see the connection.
