Hi all, and welcome back to the site – I appreciate it has been an unexpectedly long time since I last posted…in fact my last post was around this time last year. Hopefully I can get back on the “treadmill” and churn out some articles at a somewhat faster rate than 1 a year over the next couple of months! Well that’s my aim anyway.
Ok so this post will cover how to build and structure what is often referred to as a “data pipeline” – essentially the part of the overall project workflow concerned with gathering and storing data, along with any wrangling/munging/pre-processing/transforming of that data for later use.
Ideally this endeavour should aim to produce a code module which is robust, efficient, reusable, scalable and maintainable. It should also aim to produce a well-structured, easily accessible store of “high-quality”, ready-to-use financial data sets and series.
I thought I’d start by sketching out a rough mind-map style representation of the concepts and ideas I want to cover and the basic points I want to make; this is shown below.

I’m hoping this post (call it Part 1) will allow me to cover the first 5 subsections:
1) Categories of Data
2) Sources of Data
3) Downloading/Collecting Data
4) Data Cleaning/Wrangling
5) Data Storage System
The remaining subsections will be covered in a subsequent post, or potentially “posts” depending on how detailed and lengthy it all ends up being. On to subsection 1:
Categories of Data:
Of course data can come in all shapes, sizes and formats; however, when dealing with financial topics and the finance domain in general, we tend to concentrate heavily on quantitative data and certain forms of qualitative data, depending on the issue at hand.
Quantitative data basically includes anything that can be measured and represented in numerical format, i.e. any quantifiable information that can be used for mathematical calculations and statistical analysis.
Qualitative data (also referred to as categorical data) on the other hand is information that is non-numerical in nature (with a slight caveat which I will come to in a second), usually text based or word related. For example, if a company has a market capitalisation of $100m that is clearly considered quantitative data, whereas the location of its headquarters, New York for example, would be considered qualitative data. This may seem a bit basic and obvious to many, but I mention it in order to clarify the point that I shall be concentrating heavily on gathering and storing quantitative data sets, with the exception of a small amount of “meta” style data and some company level industry sector categorisations and such.
The caveat mentioned above regarding what is considered quantitative data was alluding to the fact that it is not simply whether a value is numeric or not which determines whether it is to be classed as “quantitative”, but rather whether it can be used for mathematical calculations and analysis etc. For example a phone number is represented numerically, but is considered to be qualitative data as there is no logical way to perform any calculations or carry out any statistical analysis on a list of phone numbers.
Within quantitative data we have three major categories, namely “time series data”, “cross-sectional data” and “panel data”.
1) Time series refers to observations on a single entity over a number of time periods, the classic example being a financial asset’s price history.
2) Cross-sectional data refers to data that is collected from multiple entities at a single point in time, for example the P/E ratios of the Dow 30 constituents on a certain date.
3) Panel data is somewhere in between and refers to data that is collected from multiple entities over a number of time periods. An example may be the annual GDPs of the UK, US, Russia and China from 2010 to 2020 (a short pandas sketch of all three shapes follows below).
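To make the distinction concrete, here is a minimal sketch of how the three shapes typically look in pandas (all the tickers and numbers below are made up purely for illustration):

import pandas as pd

# time series: one entity (a single stock) observed over many dates
time_series = pd.Series([100.0, 101.5, 99.8],
                        index=pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06']),
                        name='close')

# cross-sectional: many entities observed at a single point in time
cross_section = pd.Series({'AAPL': 25.1, 'MSFT': 30.4, 'JNJ': 18.7}, name='pe_ratio')

# panel: many entities observed over many time periods (entity/period MultiIndex)
panel = pd.DataFrame({'gdp': [2.9, 21.4, 2.7, 20.9]},
                     index=pd.MultiIndex.from_tuples(
                         [('UK', 2019), ('US', 2019), ('UK', 2020), ('US', 2020)],
                         names=['country', 'year']))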
Sources of Data:
This section could fill a whole book to be honest; there are so many factors to consider when it comes to the question of where to source your data from. Should you use freely available or paid-for data providers? Do you need intra-day granularity, or are you satisfied with daily granularity and above? Do you need a data set which accounts for survivorship bias? Do you intend to use the data to develop a trading algo/strategy that you plan to deploy “live” with real capital at stake? And so on.
I don’t want to get too bogged down with those sorts of questions in this article, rather I want to concentrate on the actual process of building the tools necessary to gather and store data from whichever source one ends up deciding to use. For illustration purposes I shall focus exclusively on freely available data sources to allow everyone to follow along and build their pipeline for free.
Below are links to two curated repos of financial data sources – each one has about 25-30 sources listed (although there is most likely a decent bit of overlap between the two lists).
https://github.com/wilsonfreitas/awesome-quant#data-sources
https://awesomeopensource.com/projects/financial-data
I will choose pandas-datareader and investpy to build my pipeline – these are just used as examples of course, and if one would prefer to use different packages and sources that is absolutely fine; however, it may limit the ability to follow along with the rest of my code, as portions will follow package-specific syntax.
Information on each of the two packages can be found at the following links (a quick usage sketch follows them):
https://github.com/pydata/pandas-datareader
https://github.com/alvarobartt/investpy
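As a quick sanity check that both packages are installed and importable, something along the lines of the snippet below should pull a small window of daily Apple prices from each source. Treat this as a rough sketch – the ticker and dates are arbitrary examples, and it is worth double checking the investpy get_stock_historical_data arguments against its own docs:

import pandas_datareader.data as web
import investpy

# pandas-datareader: daily prices via the Yahoo Finance source
aapl_pdr = web.DataReader('AAPL', data_source='yahoo',
                          start='2020-01-01', end='2020-06-30')

# investpy: daily prices scraped from Investing.com (dates are dd/mm/yyyy strings)
aapl_inv = investpy.get_stock_historical_data(stock='AAPL',
                                              country='united states',
                                              from_date='01/01/2020',
                                              to_date='30/06/2020')

print(aapl_pdr.head())
print(aapl_inv.head())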
Downloading/Collecting Data, Data Cleaning/Wrangling & Data Storage System
(The above headings have been bunched together as the content below covers all 3 together)
To outline which specific data sets I plan to capture: I will begin with the stocks in the S&P500 index, looking to gather the daily price history for each stock, along with a collection of “meta-data” for each company and a comprehensive collection of company fundamental data in a cross-sectional format (i.e. I will look to scrape the latest fundamental data values rather than any historic values).
I shall then gather historic price series for a range of spot currency pairs followed by some historic fund prices and related fund “meta-data” style information.
Before we begin writing the code necessary to carry out the actual retrieval and downloading of the planned data, we must first design our database structure. The final structure needs to allow quick and efficient retrieval of data in the future, and also allow future data set updates to be automated – once your data store gets large and complex enough, it would be extremely time consuming to have to update everything manually.
There are a number of options when it comes to choosing a data storage format:
1) Flat-File Storage – this is simply a collection of individual files, usually csv files or similar, with each file containing data in a two-dimensional matrix or “table” structure with rows and columns.
2) NoSQL Database – This is mainly used when storing less “structured” data that doesn’t fit nicely into the table paradigm and may not follow the “relational” logic of more common SQL databases.
3) Relational Database – This is a database that adheres to the “relational model” and most commonly uses the SQL data definition and query language. This is by far the most common choice for storing financial data as we can usually structure our data sets as various individual items which we separate into tables and define certain relationships between those tables.
It makes sense at this point to identify what we would like to define as our “individual items” as it were, i.e. the data sets that we will map across into the various tables in our database.
Let’s start simple and only consider the structure for the equity data that we are gathering for now – in our case the constituents of the S&P500 index. We can add further tables, and possibly whole databases, to the mix as and when we need to, i.e. when we start to deal with the spot currency data and fund data that we plan to include.
I plan to segregate my equity data into 4 separate tables, using unique ids and foreign keys to create the relevant relationships and links between them.
The tables will be as follows:
1) Exchange – the stock exchange on which the stocks trade
2) Company – the company represented by each symbol/ticker
3) Security – the actual share/asset belonging to each company
4) Security Price – the security’s historic price and volume data
And the keys will be set up as follows:
The “company” table will include a foreign key “security_id”
The “security” table will include two foreign keys, “company_id” and “exchange_id”
The “security_price” table will include a foreign key “security_id”

Ok, let us begin coding! 😉
We start by creating our tables with the following code:
import os
import sqlite3
from sqlite3 import Error


def create_connection(db_file):
    """
    create a database connection to the SQLite database specified by db_file
    :param db_file: database file
    :return: Connection object or None
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        conn.execute("PRAGMA foreign_keys = 1")
        return conn
    except Error as e:
        print(e)
    return conn


def create_table(conn, create_table_sql):
    """
    create a table from the create_table_sql statement
    :param conn: Connection object
    :param create_table_sql: a CREATE TABLE statement
    :return:
    """
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)


db_name = 'db_pff7.db'


def main():
    database = os.path.join('..', 'data', db_name)

    sql_create_exchange_table = """CREATE TABLE IF NOT EXISTS exchange (
                                       id integer,
                                       name text NOT NULL,
                                       currency text,
                                       code text NOT NULL UNIQUE,
                                       PRIMARY KEY(id)
                                   );"""

    sql_create_company_table = """CREATE TABLE IF NOT EXISTS company (
                                       id integer,
                                       name text NOT NULL,
                                       industry text,
                                       sector text,
                                       hq_location text,
                                       security_id integer,
                                       PRIMARY KEY(id),
                                       FOREIGN KEY (security_id) REFERENCES security (id)
                                   );"""

    sql_create_security_table = """CREATE TABLE IF NOT EXISTS security (
                                       id integer,
                                       ticker text NOT NULL UNIQUE,
                                       name text NOT NULL,
                                       company_id integer,
                                       exchange_id integer,
                                       PRIMARY KEY(id),
                                       FOREIGN KEY (company_id) REFERENCES company (id),
                                       FOREIGN KEY (exchange_id) REFERENCES exchange (id)
                                   );"""

    sql_create_security_price_table = """CREATE TABLE IF NOT EXISTS security_price (
                                       id integer,
                                       date text NOT NULL,
                                       open decimal NOT NULL,
                                       high decimal NOT NULL,
                                       low decimal NOT NULL,
                                       close decimal NOT NULL,
                                       volume integer,
                                       adj_close decimal NOT NULL,
                                       security_id integer,
                                       PRIMARY KEY(id),
                                       FOREIGN KEY (security_id) REFERENCES security (id)
                                   );"""

    # create a database connection
    conn = create_connection(database)

    # create tables
    if conn is not None:
        # create exchange table
        create_table(conn, sql_create_exchange_table)
        # create company table
        create_table(conn, sql_create_company_table)
        # create security table
        create_table(conn, sql_create_security_table)
        # create security_price table
        create_table(conn, sql_create_security_price_table)
    else:
        print("Error! cannot create the database connection.")


main()
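If you want to quickly confirm the tables were actually created, a simple query against sqlite_master will list them – a small sanity-check snippet, assuming the code above has been run and the ../data folder exists:

conn = create_connection(os.path.join('..', 'data', db_name))
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)  # should list exchange, company, security and security_price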
Now we move on to creating our table containing data regarding the various stock exchanges that we may find of interest or may want to include in our universe at some point. Luckily there is a readily available source file containing all the data we need and more; various download formats can be found at https://www.iso20022.org/market-identifier-codes. I have used the download in csv format as I find that to be the best format for our needs in this case. We are able to pass the relevant url directly into the pandas read_csv function as shown in the code below.
First we import the necessary modules, then read in the data from the relevant csv file url. Next we extract the columns of interest and subsequently rename them so they are more suitable to be used as column names in an SQL database (usually the format used is all lower case and any spaces are replaced with underscores).
Lastly we create a new column called ‘id’ – this will be the unique exchange id that will remain static and allow us to reference the exchanges in the future when running queries and extractions from the database. I have simply used the values in the index as they are unique and work nicely as our exchange ids at this point.
import pandas as pd
import requests
import io
import numpy as np

exchange_data = pd.read_csv(
    "https://www.iso20022.org/sites/default/files/ISO10383_MIC/ISO10383_MIC.csv",
    encoding='iso-8859-1'
)
exchange_data = exchange_data[['ISO COUNTRY CODE (ISO 3166)', 'MIC',
                               'NAME-INSTITUTION DESCRIPTION', 'ACRONYM']]
exchange_data.rename(columns={'ISO COUNTRY CODE (ISO 3166)': 'country_code',
                              'MIC': 'code',
                              'NAME-INSTITUTION DESCRIPTION': 'name',
                              'ACRONYM': 'acronym'}, inplace=True)
exchange_data['id'] = exchange_data.index
Now we quickly create a partial mapper to use to populate our currency data – I am using the country_code column and have just filled out the mapping details for USD, GBP and Germany (EUR). If so inclined, feel free to fill out the rest of the data and complete the mapping.
mapper = {'US': 'USD', 'GB': 'GBP', 'DE': 'EUR'}
exchange_data['currency'] = exchange_data['country_code'].map(mapper)
exchange_data[['id', 'name', 'currency', 'code']]
which gives us a DataFrame that looks as follows:

Ok, let’s now insert this data into our database table – the result should look as in the image below (if everything has worked!).
conn = sqlite3.connect(os.path.join('..', 'data', db_name))
exchange_data[['id', 'name', 'currency', 'code']].to_sql("exchange", conn,
                                                         if_exists="append", index=False)

Next we move on to scraping the tickers for the constituents of the S&P500 and the Dow Jones Industrial Average. We will use these to first download some company data for each stock such as the name of the company, the industry and sector within which it operates and lastly the location of the company HQ.
We can retrieve the necessary information from Wikipedia using pandas and the read_html function – this will convert the relevant HTML table on the Wikipedia web page into a DataFrame which we can then use as we wish. I have renamed some of the columns in the code below so as to fit the naming convention of our pre-made database tables.
# scrape wiki table with symbols and details of Dow Jones constituents
dj_constituents = pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average#Components')[1]
dj_constituents.rename(columns={'Company': 'name', 'Industry': 'industry', 'Symbol': 'ticker'},
                       inplace=True)
this gets us the following:

We have to clean up some of the tickers by splitting out and removing the “NYSE” part and the colon. The first line of code below does that; it’s a little bit of a hack but it works fine. The “.apply” uses a lambda function to first reverse the ticker string, then applies the “.partition” method, which allows us to deal with the fact that some of the tickers don’t have the ‘:’ separator in them. This returns a tuple of 3 values, of which we select the first, and finally we reverse the string back around the right way. The second line of code just strips any whitespace from around the ticker. (This is no longer strictly necessary, as the ticker format on Wikipedia has changed and no longer contains the “NYSE:” element for any of the scraped companies.)
dj_constituents['ticker'] = dj_constituents['ticker'].apply(lambda x: x[::-1].partition(':')[0][::-1])
dj_constituents['ticker'] = dj_constituents['ticker'].str.strip()
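To see exactly what that reverse/partition trick is doing, here is a quick illustration on a couple of example ticker strings (the values are purely illustrative):

for raw in ['NYSE: MMM', 'AAPL']:
    # reverse, split on the last ':', keep the ticker part, reverse back, strip whitespace
    cleaned = raw[::-1].partition(':')[0][::-1].strip()
    print(raw, '->', cleaned)
# NYSE: MMM -> MMM
# AAPL -> AAPL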
# scrape wiki table with symbols and details of S&P500 constituents
sp_constituents = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
sp_constituents.rename(columns={'Symbol': 'ticker', 'Security': 'name',
                                'Headquarters Location': 'hq_location',
                                'GICS Sector': 'sector',
                                'GICS Sub-Industry': 'industry'}, inplace=True)
and this gets us the table below:

Next I concatenate the two tables together into one, selecting only the relevant columns from each. Make sure to drop any duplicates in the table, as some stocks appear in both of our chosen indices!
company_table = pd.concat([sp_constituents[['name', 'industry', 'sector', 'hq_location']],
                           dj_constituents[['name', 'industry']]])
company_table.drop_duplicates('name', inplace=True)
company_table.sort_values('name', inplace=True)
company_table.reset_index(inplace=True, drop=True)
company_table['id'] = company_table.index
Below is the code to create the security table – similar in logic to the company table creation with the two constituent tables being concatenated together to form the final table.
# security_table
sp_security_table = sp_constituents[['ticker', 'name']].copy()
dj_security_table = dj_constituents[['ticker', 'name']].copy()

security_table = pd.concat([sp_security_table, dj_security_table])
security_table.drop_duplicates(subset='ticker', inplace=True)
security_table.sort_values('ticker', inplace=True)
security_table.reset_index(inplace=True, drop=True)
security_table['id'] = security_table.index
Now we need to create the columns holding the correct company_id in the security table and, vice versa, the correct security_id in the company table. In this instance, because the two tables are populated with data for exactly the same companies, in the same order, this step may seem a little redundant as the two id columns will be identical. It is a vital step however, as we need to link the two tables together with their foreign keys so we can extract the correct data corresponding to the correct company whenever needed. You may wonder why we don’t just reference them using their ticker symbols; firstly, ticker symbols are not unique across exchanges – different exchanges assign the same letters to different companies – and secondly, it is not that uncommon to see a company change its ticker symbol, whether due to a merger or takeover etc. So we need our own immutable ids which we know will remain constant no matter what.
The way I have done it is just to create a dictionary with key, value pairs being the name of the company and the corresponding id within its own table, which will be our foreign key in the second table.
Just to illustrate how it would work in a real-world scenario, when everything isn’t all perfectly lined up, I have sorted the tables in a different order.
I start by populating the security table with the company_id:
company_id_mapper = pd.Series(company_table.id.values, index=company_table.name).to_dict()
security_table['company_id'] = security_table['name'].map(company_id_mapper)
and then I populate the company table with the security_id:
security_id_mapper = pd.Series(security_table.id.values, index=security_table.name).to_dict()
company_table['security_id'] = company_table['name'].map(security_id_mapper)
The next step is to populate the exchange_id column. For this I shall use information that can be found here: https://old.nasdaq.com/screening/company-list.aspx. I downloaded two csv files covering all the NYSE and NASDAQ traded stocks – I shall use this info to assign the correct exchange_id for each stock in my security table.
The downloaded csv files are read in locally below (as nyse.csv and nasdaq.csv) in case the url isn’t stable:
nyse = pd.read_csv('nyse.csv')
nasdaq = pd.read_csv('nasdaq.csv')

security_table['exchange_id'] = None
security_table['exchange_id'] = np.where(security_table['ticker'].isin(nyse['Symbol'].values),
                                         1300, '')
security_table['exchange_id'] = np.where(security_table['ticker'].isin(nasdaq['Symbol'].values),
                                         367, security_table['exchange_id'])
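As an aside, the hard-coded 1300 and 367 above are simply the ids that the NYSE and NASDAQ rows happened to receive in my download of the MIC file; if you would rather not rely on the row order of that csv staying constant, you could look the ids up by MIC code instead – something along these lines (a sketch assuming ‘XNYS’ and ‘XNAS’ are the codes you are after):

nyse_id = exchange_data.loc[exchange_data['code'] == 'XNYS', 'id'].iloc[0]
nasdaq_id = exchange_data.loc[exchange_data['code'] == 'XNAS', 'id'].iloc[0]

security_table['exchange_id'] = np.where(security_table['ticker'].isin(nyse['Symbol'].values),
                                         nyse_id, None)
security_table['exchange_id'] = np.where(security_table['ticker'].isin(nasdaq['Symbol'].values),
                                         nasdaq_id, security_table['exchange_id'])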
And then we upload the data into the database tables, which will look as shown in the two screenshots below:
security_table.to_sql("security", conn, if_exists="append", index=False)
company_table.to_sql("company", conn, if_exists="append", index=False)


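With both tables uploaded, the foreign keys can already be put to work; a quick join between security and company is an easy way to confirm the links were populated correctly (just an illustrative query using the existing connection):

query = """
    SELECT s.ticker, c.name, c.sector, c.hq_location
    FROM security s
    JOIN company c ON s.company_id = c.id
    LIMIT 5;
"""
print(pd.read_sql(query, conn))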
Now it’s time to download and store the price histories for each of our companies and tickers in the tables we have just created. The code below iterates through each of the tickers in the security table and downloads daily data starting on 01/01/2010, or as far back as the price data goes for that stock if it started trading after that date. For this we will use pandas-datareader:
from tqdm import tqdm_notebook as tqdm
import pandas_datareader.data as web

stock_pricing_dfs = []

for stock_id in tqdm(security_table['id']):
    try:
        stock_pricing_df = web.DataReader(security_table.iloc[stock_id]['ticker'],
                                          start='2010-1-1',
                                          end='2020-08-31',
                                          data_source='yahoo')
        stock_pricing_df['security_id'] = stock_id
        stock_pricing_dfs.append(stock_pricing_df)
    except Exception:
        pass

security_price_table = pd.concat(stock_pricing_dfs)
Now we rename the columns to adhere to the database column names and create an ‘id’ column, which at this stage can just be a direct copy of the index values. Once we get to the stage of updating the data, adding more history and possibly more stocks, we will need to be more careful and make sure that the ‘id’ values are all unique and correctly ordered etc.
Then we just insert the data into our database table and there you go! We have finished the process of adding our various stock data (for now); the final table should look similar to that shown in the image below:
security_price_table.columns = ['high', 'low', 'open', 'close', 'volume', 'adj_close', 'security_id']
security_price_table.reset_index(inplace=True)
security_price_table['id'] = security_price_table.index
security_price_table.to_sql("security_price", conn, if_exists="append", index=False)

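As a final check that everything is linked up as intended, we can pull back a single stock’s price history by joining security_price to security on its foreign key (again purely an example query – swap in whichever ticker you like):

query = """
    SELECT sp.date, sp.close, sp.adj_close, sp.volume
    FROM security_price sp
    JOIN security s ON sp.security_id = s.id
    WHERE s.ticker = 'AAPL'
    ORDER BY sp.date;
"""
aapl_history = pd.read_sql(query, conn, parse_dates=['date'])
print(aapl_history.tail())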
I’ll leave it here for now as I think this is a natural place to take a break in the post; when we return we will be looking to augment the data with some spot FX currency data, along with perhaps some economic data series, and we will add to the equity-based data by scraping and storing some fundamental company data; balance sheets, cash flows and the like.
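As a small preview of the FX leg, investpy also provides currency-cross helpers that I expect to lean on; the sketch below shows roughly what I have in mind (treat the exact function name and arguments as an assumption to verify against the investpy documentation):

import investpy

# illustrative example: daily EUR/USD spot history scraped from Investing.com
eurusd = investpy.get_currency_cross_historical_data(currency_cross='EUR/USD',
                                                     from_date='01/01/2015',
                                                     to_date='31/08/2020')
print(eurusd.head())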
I might also at some point go about creating a “survivorship bias free” (mini) data set using the S&P constituents – i.e. I shall work back historically and store an S&P500 constituent list for each time period, adapted and correctly accounting for all the additions and removals from the list of stocks making up the 500 at any point in time. This allows one to test strategies over data that doesn’t suffer from the survivorship bias introduced by just using the current day’s constituent list across the entire historic period.
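Conceptually, that survivorship-bias-free set boils down to a point-in-time membership table recording when each ticker entered and left the index; the sketch below (with entirely hypothetical tickers and dates) shows the kind of filter I have in mind:

# hypothetical membership table: one row per constituent spell in the index
membership = pd.DataFrame({
    'ticker': ['ABC', 'XYZ'],
    'added': pd.to_datetime(['2005-03-01', '2012-06-15']),
    'removed': pd.to_datetime(['2015-09-30', None]),  # None/NaT = still a member
})

def constituents_on(date):
    """Return the tickers that were index members on a given date."""
    date = pd.Timestamp(date)
    mask = (membership['added'] <= date) & (
        membership['removed'].isna() | (membership['removed'] > date))
    return membership.loc[mask, 'ticker'].tolist()

print(constituents_on('2010-01-04'))  # ['ABC']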
Ok thanks all for visiting, I shall follow up soon with part 2!
20 comments
Well, I lost interest when you stated that company hq location was categorical data not numerical data. Clearly location can be coded as numerical data (latitude, longitude) and used numerically to calculate many items of interest.
I take your point that cities exist in some location, and that a location can indeed be represented in a format which allows calculations etc, and therefore that particular representation of location would indeed be classed as numerical data, yes. But the fact remains that a city name is a label and labels fall firmly in the categorical data category – so you’re not quite correct on this one I’m afraid 😉
Thank you so much! As a fellow CFA looking to develop my python/programming skills for my career, this is all amazing. Looking forward to part 2.
Wow, your educational background is impressive! I mean really impressive. The article was amazing. Is there a particular book on the subject you can recommend? Is there a particular database you favor for financial data? I am just starting out but do not want to waste my time with defunct databases.
(While I doubt anyone else is confused by “NOBODY CARES” comments, let me say the author is 100% accurate. “NOBODY CARES” is referring to GPS coordinates, NOT city names and has confused the two. The author clearly knows this but wanted to keep the focus on the article and not some other topic).
Hello, thanks for your awesome blog, please don’t let this jewel die!
Thanks so much for the article! I’d love to see your approach to creating a database free from survivorship bias.
I’ve just finished something similar (produced a dataset free from survivorship bias at 30-minute intervals going back to 2010 for the SP500), but I only stored price data, and I did it in .csv files rather than a database. I’d like to move to a database approach in the future.
Thanks s666 for your continued contribution to python finance world!!
so far best blog for those who interested in learning python for finance.
Hi s666, could you advise how to set up below please?
I have put your first section of create-database code into Python, but the main() function responds with “unable to open database file”.
Thanks in advance
The code should actually create the database if one does not exist in that location already, which leads me to believe it is the folder/directory structure which is not in place. The easiest solution for immediate use would be to just change the following:
def main():
    database = os.path.join('..', 'data', db_name)
and remove the ‘..’,’data’ part in the parentheses before db_name.
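In other words the line would simply become:

database = db_name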
That should then just create the database file in the current working directory from which you are running the main .py file
Are you able to advise on the definition of eod in the last section, about the eod_table, please?
eod_table.to_sql('eod_table', conn, if_exists="replace")
Thanks
My apologies – there should be no reference to eod_table – that was leftover code from when I was composing the article. I have removed the line you are referring to and also altered the code box below that one to correctly use “security_price_table” rather than “eod_table”.
Thanks for the spot!!
Cool, I was thinking that eod sounded like end of day – anyway, great work!
Furthermore, when the last section was executed, the number of columns did not match – it may be missing the ‘Date’ column.
ValueError: Length mismatch: Expected axis has 8 elements, new values have 7 elements
Cheers
Thanks a lot for the materials and help. Your blog is very impressive.
Why would you put security prices into a relational database?
The data is rarely ever retrieved for one date; any sort of quantitative analysis will require at least 30 calendar days and often a lot more than that. It’s trivial to trim the window by date using pandas, and generally a quantitative analysis will need several different windows (a short-term and a long-term window at minimum).
Storing price data in a SQL database is unnecessarily clogging up your SQL server… and to make individual records by date!!
This just stinks of academic pedantry, while ignoring practical considerations.
In a real world app, you would need a lot more speed and more efficient (compressed) data storage.
Hi Greg, in your opinion, how should it be constructed to fulfil these practical considerations?
Hi Dear,
We are waiting for the second part. Overall you score 10 out of 10 man. Great article.
Hi Greg, I accept that time-series DBs have certain benefits over SQL DBs when it comes to large amounts of financial time-series data. However, I am not aiming this blog at those who would have their own SQL servers and who are looking to build an enterprise-grade financial DB here. This whole blog is aimed at individuals, many of whom are in the learning phase of their journey…so I stand by my choice of an SQL database for this article. SQLite comes pre-packaged with Python, it exists in a single .db file and is accessible without faffing around with setting up SQL servers and the like. For the amounts of data they are likely to be scraping (end-of-day in my example), an SQL DB is more than sufficient.
I am not quite clear where your feelings of this being “academic pedantry” stem from; I can assure you at many smaller funds and CTAs you will often find their data held in SQL DBs. It all comes down to use case….
Hi Hassan, thanks for the comment. I am planning on getting around to writing part 2 soon!
Hi S666,
I absolutely agree with Hassan here, hands up for your work, as this is the most comprehensive (and probably the only) guide to building a database with Python from scratch – I have searched the internet for some time and have not found anything as complete. So if you can find the time to do it, it would be greatly appreciated!
P.S. (newbie question here): is it possible to call the data from yfinance using multi-threading instead of doing it in series? Would it make a difference in extracting data faster?
This is amazing! Thanks so much for sharing. Just discovered your amazing blog. By the way, is there a part 2! The automatic updating would be really useful.