Scraping ETF data from the web and sticking it in an SQLite3 database


As promised, here is the first part of my “ETF mean reversion strategy backtest” series. This article deals with scraping ETF data from the web and sticking it in an SQLite3 database.

I’m using SQLite as it’s very easy to set up, and requires nothing complex to get up and running. First, visit the SQLite download page and download the precompiled binaries from the Windows section.

You will need to:

1) Download the sqlite-shell-win32-*.zip and sqlite-dll-win32-*.zip zipped files.

2) Create a folder C:\sqlite and unzip the above two zipped files into this folder, which will give you the sqlite3.def, sqlite3.dll and sqlite3.exe files.

3) Add C:\sqlite to your PATH environment variable, then open a command prompt and issue the sqlite3 command, which should display something like the following.

SQLite version 2013-01-09 11:53:05
Enter ".help" for instructions
Enter SQL statements terminated with a ";"

The next thing we need to do is actually create the database we will be working with. Find and navigate to sqlitebrowser.exe and run that file. This will open the GUI we will use to create the database. It’s pretty self-explanatory, and the “create database” option can be found in the top left corner of the GUI.
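If you’d rather skip the GUI, the database file can also be created straight from Python: the standard library’s sqlite3 module creates the file on first connect. A minimal sketch (the filename here is just an example — use whatever path you like):

```python
import sqlite3

# connecting to a database file creates it if it doesn't exist yet
# (the filename is just an example - point it wherever you like)
cnx = sqlite3.connect('etfs.db')

# confirm the connection works by asking SQLite for its version
version = cnx.execute('SELECT sqlite_version();').fetchone()[0]
print('SQLite version: {}'.format(version))
cnx.close()
```

Either route leaves you with an empty .db file ready for the script below to fill.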

Once SQLite is installed and we have created a database, we can get on with writing the Python script to scrape the relevant data and upload it to our database.

I have decided to use the following site to scrape data from:

It holds all the data we need, and each relevant page can be scraped by copying and pasting the necessary URL into our Python script. I’ve done the hard work for you, so feel free to copy and paste the code below:

import pandas as pd
import sqlite3 as db
import requests
#create an empty list to store the DataFrames scraped from the web
frames = []
#create a string of webpages which we will feed into the html parser
#(paste the screener URLs between the triple quotes, one per line)
s = '''
'''
#split urls in string and feed into pandas html parser to create dataframes of scraped info
for i in s.split():
    print("Scraping data from {}.".format(i))
    df = pd.read_html(requests.get(i, headers={'User-agent': 'Mozilla/5.0'}).text)
    #pd.read_html returns a list of DataFrames - index [5] is the one
    #we are interested in; the scraped data has slightly different
    #headings across different webpages, so we reset the
    #column headings to be identical for each DataFrame -
    #these names also match the columns we
    #will set up in our SQLite database
    df[5].columns = ['Fund Name','Ticker','Asset Class','Region','Geography','Category','Focus',
                     'Niche','Inverse','Leveraged','ETN','Underlying Index','Selection Criteria',
                     'Weighting Scheme','Active per SEC']
    #store each cleaned DataFrame for concatenation later
    frames.append(df[5])
#create a "master" dataframe which concatenates all the relevant
#(index 5) DataFrames together
masterFrame = pd.concat(frames)
#create connection to our previously created SQLite database
#use the path and name which corresponds to your local database
#(a raw string stops the backslashes being read as escape sequences)
cnx = db.connect(r'C:\Users\Stuart\sqlite_databases\etfs.db')
cur = cnx.cursor()
#remove table if it already exists, along with any data it contains
cur.execute('DROP TABLE IF EXISTS etftable;')
#create the table within the database
sql = '''CREATE TABLE etftable ('Fund Name' TEXT, 'Ticker' TEXT, 'Asset Class' TEXT,
                                'Region' TEXT, 'Geography' TEXT, 'Category' TEXT, 'Focus' TEXT,
                                'Niche' TEXT, 'Inverse' TEXT, 'Leveraged' TEXT, 'ETN' TEXT,
                                'Underlying Index' TEXT, 'Selection Criteria' TEXT, 'Weighting Scheme' TEXT,
                                'Active per SEC' TEXT)'''
cur.execute(sql)
#append the data
masterFrame.to_sql(name='etftable', con=cnx, if_exists='append', index=False)
cnx.commit()
cnx.close()

If you have set up the database correctly, and set the name and path correctly in the script above, you should be able to simply copy and paste this code, run it, and end up with a nice SQLite database full of hundreds and hundreds of ETF tickers along with around a dozen columns of supporting information, including “Underlying Index”, “Region” and ETF “Focus” among other things.
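A quick way to sanity-check the result is to ask SQLite to describe the table with PRAGMA table_info — it should list the fifteen TEXT columns we defined. Here’s a self-contained sketch using an in-memory database with the same schema; to check the real thing, point connect() at your own etfs.db file instead:

```python
import sqlite3

#build the same schema in an in-memory database purely for illustration -
#replace ':memory:' with the path to your etfs.db to inspect the real table
cnx = sqlite3.connect(':memory:')
cnx.execute('''CREATE TABLE etftable ('Fund Name' TEXT, 'Ticker' TEXT, 'Asset Class' TEXT,
                                      'Region' TEXT, 'Geography' TEXT, 'Category' TEXT,
                                      'Focus' TEXT, 'Niche' TEXT, 'Inverse' TEXT,
                                      'Leveraged' TEXT, 'ETN' TEXT, 'Underlying Index' TEXT,
                                      'Selection Criteria' TEXT, 'Weighting Scheme' TEXT,
                                      'Active per SEC' TEXT)''')

#PRAGMA table_info returns one row per column: (cid, name, type, ...)
columns = [row[1] for row in cnx.execute('PRAGMA table_info(etftable);')]
print(columns)
cnx.close()
```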

Here’s what it should look like:


So hopefully this will prove very useful, allowing us to use a bit of SQL magic in our main Python backtest script to pull in ETF tickers based upon criteria such as underlying asset class or geographic region. This will help us pull down ETF tickers that are more likely to show co-integrating behaviour due to a logical, underlying economic rationale, rather than just brute-force testing each and every ETF against each other.
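That kind of criteria-based pull is a one-line SQL query once the data is in the database. A hedged sketch of the idea — the tickers and values below are invented purely for illustration, and the real table holds the full set of scraped columns:

```python
import sqlite3

#in-memory stand-in for the real database, with a cut-down schema
cnx = sqlite3.connect(':memory:')
cnx.execute("CREATE TABLE etftable ('Ticker' TEXT, 'Asset Class' TEXT, 'Region' TEXT)")

#a few invented rows purely for illustration
rows = [('AAA', 'Equity', 'Europe'),
        ('BBB', 'Equity', 'North America'),
        ('CCC', 'Fixed Income', 'Europe')]
cnx.executemany('INSERT INTO etftable VALUES (?, ?, ?)', rows)

#pull only the equity ETFs for a single region - column names containing
#spaces need double quotes in SQL
tickers = [t for (t,) in cnx.execute(
    "SELECT Ticker FROM etftable WHERE \"Asset Class\" = 'Equity' AND Region = 'Europe'")]
print(tickers)  # ['AAA']
cnx.close()
```

Swap the WHERE clause for whatever combination of region, focus or asset class suits the pairs you want to test.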

OK – so in the next post, we shall begin the real task of writing the actual backtest itself. Oh the joy! Until then…

Written by s666