Scraping ETF data from the web and sticking it in an SQLite3 database


As promised, here is the first part of my “ETF mean reversion strategy backtest” series. This article deals with scraping ETF data from the web and sticking it in an SQLite3 database.

I’m using SQLite as it’s very easy to set up and requires nothing complex to get up and running. First, visit the SQLite download page and grab the pre-compiled binaries from the Windows section.

You will need to:

1) Download the sqlite-shell-win32-*.zip and sqlite-dll-win32-*.zip zipped files.

2) Create a folder C:\sqlite and unzip the two files into it, which will give you the sqlite3.def, sqlite3.dll and sqlite3.exe files.

3) Add C:\sqlite to your PATH environment variable, then go to the command prompt and issue the sqlite3 command, which should display something like the following.

SQLite version 3.7.15.2 2013-01-09 11:53:05
Enter ".help" for instructions
Enter SQL statements terminated with a ";"

The next thing we need to do is actually create the database we will be working with. Find and run sqlitebrowser.exe; this opens the GUI we will use to create the database. It’s pretty self-explanatory, and the “create database” option can be found in the top-left corner of the GUI.
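If you’d rather skip the GUI, a minimal alternative is to create the empty database file straight from Python: connecting to a path that doesn’t exist yet creates the file. The filename below is just an example; use whatever path you like.

```python
import sqlite3

# Connecting to a file that doesn't exist yet creates it.
# "etfs.db" is just an example name - substitute your own path.
cnx = sqlite3.connect("etfs.db")
cnx.close()
```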

Once SQLite is installed and we have created a database, we can get on to the task of writing the python script to scrape the relevant data and upload that to our database.

I have decided to use the following site to scrape data from:

It holds all the data we need and each relevant page can be scraped by copying and pasting the necessary URL into our python script. I’ve done the hard work for you, so feel free to copy and paste the code below:

import pandas as pd
import sqlite3 as db
import requests

# create an empty list to store the DataFrames scraped from the web
frames = []

# create a string of webpages which we will feed into the html parser
# (paste the relevant URLs here, one per line, between the triple quotes)
s = '''
'''

# split the urls in the string and feed them into the pandas html parser
# to create DataFrames of the scraped info
for i in s.split():
    print("Scraping data from {}.".format(i))
    df = pd.read_html(requests.get(i, headers={'User-agent': 'Mozilla/5.0'}).text)
    # df holds multiple DataFrames - index [5] is the DataFrame we
    # are interested in. The scraped data has slightly different
    # headings across different webpages, so we reset the
    # column headings to be identical for each DataFrame.
    # These names also match the columns we
    # will set up in our SQLite database
    df[5].columns = ['Fund Name', 'Ticker', 'Asset Class', 'Region', 'Geography', 'Category', 'Focus',
                     'Niche', 'Inverse', 'Leveraged', 'ETN', 'Underlying Index', 'Selection Criteria',
                     'Weighting Scheme', 'Active per SEC']
    frames.append(df[5])

# create a "master" DataFrame which concatenates all the relevant
# (index 5) DataFrames together
masterFrame = pd.concat(frames)

# create a connection to our previously created SQLite database
# use the path and name which corresponds to your local database
cnx = db.connect(r'C:\Users\Stuart\sqlite_databases\etfs.db')
cur = cnx.cursor()

# remove the table if it already exists, along with any data it contains
cur.execute('DROP TABLE IF EXISTS etftable;')

# create the table within the database
sql = '''CREATE TABLE etftable ('Fund Name' TEXT, 'Ticker' TEXT, 'Asset Class' TEXT,
                                'Region' TEXT, 'Geography' TEXT, 'Category' TEXT, 'Focus' TEXT,
                                'Niche' TEXT, 'Inverse' TEXT, 'Leveraged' TEXT, 'ETN' TEXT,
                                'Underlying Index' TEXT, 'Selection Criteria' TEXT, 'Weighting Scheme' TEXT,
                                'Active per SEC' TEXT)'''
cur.execute(sql)

# append the scraped data to the table
masterFrame.to_sql(name='etftable', con=cnx, if_exists='append', index=False)
cnx.close()

If you have set up the database correctly, and set the name and path correctly in the script above, you should be able to just copy and paste this code, run it, and end up with a nice SQLite database full of hundreds and hundreds of ETF tickers, along with a dozen or so columns of supporting information including “Underlying Index”, “Region” and ETF “Focus”, among other things.

Here’s what it should look like:


So hopefully this will prove very useful: it will allow us to use a bit of SQL magic in our main Python backtest script to pull in ETF tickers based upon criteria such as underlying asset class or geographic region. That should help us pull down ETF tickers that are more likely to show co-integrating behaviour due to a logical, underlying economic rationale, rather than just brute-force testing each and every ETF against each other.
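As a taste of that SQL magic, here is a rough sketch of what such a request might look like from Python. The table and column names match the script above, but the in-memory database and the example rows are purely illustrative stand-ins for your real etfs.db file:

```python
import sqlite3
import pandas as pd

# build a tiny in-memory stand-in for the real etfs.db database
cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE etftable ('Fund Name' TEXT, 'Ticker' TEXT, 'Region' TEXT)")
cnx.executemany("INSERT INTO etftable VALUES (?, ?, ?)",
                [("Example Europe Fund", "AAA", "Developed Europe"),
                 ("Example US Fund", "BBB", "North America"),
                 ("Example Germany Fund", "CCC", "Developed Europe")])

# pull back only the tickers matching a chosen region
region_df = pd.read_sql_query(
    'SELECT Ticker, "Fund Name" FROM etftable WHERE Region = ?',
    cnx, params=("Developed Europe",))
print(region_df["Ticker"].tolist())  # ['AAA', 'CCC']
cnx.close()
```

The same pattern applies to any of the other columns, e.g. filtering on “Asset Class” or “Underlying Index”.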

OK – so!! In the next post, we shall begin the real task of writing the actual backtest itself. Oh the joy! Until then…


5 thoughts on “Scraping ETF data from the web and sticking it in an SQLite3 database”

  1. Great addition here.

    I was trying to run your exact code and got a 403 Error: Forbidden. Can you show your pd.show_versions()?

    Really helpful!

      1. Hi Dan… OK, I have figured out a fix. Basically the ETF site was blocking the “User-Agent” used by the Pandas “read_html()” function. All we need to do is import the “requests” module and alter the user-agent using that.

        I have updated the code and it should all be working again now.

        Let me know if it isn’t!
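To illustrate the idea in isolation (a minimal sketch; the HTML string below is a made-up stand-in for the page, since the real site needs the custom header): fetch the page with “requests” using a browser-like User-Agent, then hand the response text to “read_html()”, which happily parses any HTML it is given.

```python
from io import StringIO
import pandas as pd

# stand-in for the HTML you would get back from
# requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text
html = """
<table>
  <tr><th>Ticker</th><th>Region</th></tr>
  <tr><td>AAA</td><td>Europe</td></tr>
  <tr><td>BBB</td><td>North America</td></tr>
</table>
"""

# read_html parses every table it finds and returns a list of DataFrames
tables = pd.read_html(StringIO(html))
print(tables[0]["Ticker"].tolist())  # ['AAA', 'BBB']
```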

    1. Hi Dan, thanks for your comment. Ok so I have taken a quick look and it seems there is nothing wrong with your versions etc, rather the actual ETF website I was using to scrape the data from no longer allows requests from Pandas “read_html()” function.

      I haven’t had time to come up with an alternative, which is a shame – I will try to update this when I get some free time.

      To get you started on your own path perhaps, you could look into the use of the “Requests” module.

      I quickly tried:

      import requests
      from bs4 import BeautifulSoup
      url = ""
      soup = BeautifulSoup(requests.get(url).text, 'lxml')
      print(soup) # prints the HTML you are expecting

      for example, and it successfully scraped the site…so that could be a way around it. Obviously the rest of the code would have to be adapted to deal with the new format of the scraped data.

      At least it’s a start for you and I will try to update when I have some free time as mentioned previously!
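Building on that start, here is a rough sketch of how the scraped HTML could then be picked apart with BeautifulSoup instead of “read_html()”. The HTML string is a made-up stand-in for the real page content:

```python
from bs4 import BeautifulSoup

# stand-in for requests.get(url).text
html = """
<table>
  <tr><th>Ticker</th><th>Fund Name</th></tr>
  <tr><td>AAA</td><td>Example Fund A</td></tr>
  <tr><td>BBB</td><td>Example Fund B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# pull each table row out as a list of its cell texts
rows = []
for tr in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)  # [['Ticker', 'Fund Name'], ['AAA', 'Example Fund A'], ['BBB', 'Example Fund B']]
```

From there the list of rows can be fed into a DataFrame and on into SQLite just as in the original script.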

  2. S666 – Many thanks, I’ll take a look. I ended up using Selenium.Webdriver to open up Chrome and hit the relevant x_paths for the data. I realize I still have to clean the code up but there’s some good stuff here.

    import pandas as pd
    from yahoo_finance import Share
    import requests
    from bs4 import BeautifulSoup
    import re
    from json import loads
    import locale
    from selenium import webdriver
    import sqlite3 as db

    locale.setlocale(locale.LC_ALL, '')
    chromedriver_path = "CHROME SPECIFIC PATH HERE.EXE"
    driver = webdriver.Chrome(chromedriver_path)

    def xpath_to_list(xpath, n):
        """Return list of lists for given xpath and number of desired elements."""
        try:
            xpath_list = driver.find_element_by_xpath(xpath)
            xpath_list = xpath_list.text.split("\n")

            def chunker(alist, n):
                """Yield successive n-sized chunks of 'alist'."""
                for i in range(0, len(alist), n):
                    yield alist[i:i+n]

            return list(chunker(xpath_list, n))
        except Exception:
            # element not found on this page - return an empty list
            return []

    def to_df(adict):
        """Create MultiIndex DataFrame from dictionary of k=characteristics &
        v=values, data scraped from the site."""
        temp_df = pd.DataFrame()
        for item in adict:
            if bool(adict[item]):
                item_df = pd.DataFrame(adict[item])
                item_df.index = pd.MultiIndex.from_product([[item], list(item_df[0])])
                del item_df[0]
                item_df.rename(columns={1: etf_symbol}, inplace=True)
            else:
                item_df = pd.DataFrame()
            temp_df = pd.concat([temp_df, item_df])
        return temp_df

    def numeric_coerce(col_list, df):
        """Coerce list of columns in a given df to numbers using pd.to_numeric."""
        for col in col_list:
            df[col] = pd.to_numeric(df[col], errors="coerce")
        return df

    nasdaq_conn = ""
    symbol_df = pd.read_csv(nasdaq_conn)
    symbol_srs = pd.Series(symbol_df["Symbol"])
    etf_df = pd.DataFrame()
    for etf_symbol in symbol_srs[:10]:
        etf_conn = "" + etf_symbol
        print("Scraping " + etf_conn)
        driver.get(etf_conn)  # navigate to the ETF's page (this line appears lost in the original formatting)
        countries = xpath_to_list('//*[@id="lblchartTop10Countries"]', 2)
        sectors = xpath_to_list('//*[@id="lblchartTop10Sectors"]', 2)
        holders = xpath_to_list('//*[@id="lblchartTop10Holdings"]', 2)
        summary = xpath_to_list('//*[@id="form-reports-overview"]/div[2]/div[1]/div', 2)
        summary = summary[:-1]
        port_data = xpath_to_list('//*[@id="form-reports-overview"]/div[2]/div[3]/div', 2)
        segment = xpath_to_list('//*[@id="form-reports-title"]/div/div[3]/div[1]/div/a', 1)

        if len(segment) != 0:
            segment[0].append("ETF Segment")
            segment = [segment[0][::-1]]
        dict_of_attrs = {"countries": countries, "sectors": sectors, "holdings": holders,
                         "summary": summary, "PortData": port_data, "segment": segment}
        temp_df = to_df(dict_of_attrs)
        if not temp_df.empty:
            if etf_df.empty:
                etf_df = temp_df
            else:
                etf_df = etf_df.join(temp_df, how="outer")

    cnx = db.connect("DB SPECIFIC PATH HERE")
    cur = cnx.cursor()
    cur.execute("DROP TABLE IF EXISTS etftable;")

    sql = """CREATE TABLE etftable
             ("Fund Name" TEXT, "Distribution Yield" TEXT, "Duration" TEXT,
              "Income Yield TTM" TEXT, "Number of Commodities" TEXT,
              "Number of Holdings" TEXT, "Price to Book Ratio" TEXT,
              "Price to Earnings Ratio" TEXT, "Rolling Strategy" TEXT, "Strategy" TEXT,
              "Weighted Average Market Cap" TEXT, "Weighted Avg Maturity" TEXT,
              "Yield to Maturity" TEXT)"""
    cur.execute(sql)

    sql_etf = etf_df.loc["PortData"].T
    sql_etf.drop(["Access", "Rebalancing Frequency", "Exposure Type",
                  "Next Distribution Date", "Next Ex-Dividend Date",
                  "Yield to Maturity"],
                 axis=1, inplace=True)

    sql_etf.rename(columns={"index": "Fund Name",
                            "Income Yield (TTM)": "Income Yield TTM",
                            "Price / Book Ratio": "Price to Book Ratio",
                            "Price / Earnings Ratio": "Price to Earnings Ratio",
                            "Weighted Avg. Maturity": "Weighted Avg Maturity"},
                   inplace=True)

    exempt_list = ["Weighted Average Market Cap", "Fund Name", "Strategy",
                   "Rolling Strategy", "Weighted Avg Maturity"]

    sql_col_list = [col for col in list(sql_etf.columns) if col not in exempt_list]
    sql_etf = numeric_coerce(sql_col_list, sql_etf)

    sql_etf.to_sql(name="etftable", con=cnx, if_exists="append", index=False)
