Scraping ETF data from the web and sticking it in an SQLite3 database


As promised, here is the first part of my “ETF mean reversion strategy backtest” series. This article deals with scraping ETF data from the web and sticking it in an SQLite3 database.

I’m using SQLite as it’s very easy to set up, and requires nothing complex to get up and running. Firstly you need to visit the SQLite download page and download the pre-compiled binaries from the Windows section.

You will need to:

1) Download the sqlite-shell-win32-*.zip and sqlite-dll-win32-*.zip files.

2) Create a folder C:\sqlite and unzip the above two files into it, which will give you the sqlite3.def, sqlite3.dll and sqlite3.exe files.

3) Add C:\sqlite to your PATH environment variable, then open a command prompt and issue the sqlite3 command, which should drop you into the SQLite shell.

The next thing we need to do is actually create the database we will be working with. Find and navigate to sqlitebrowser.exe and run that file. This will open the GUI we will use to create the database. It’s pretty self-explanatory, and the “create database” option can be found in the top left corner of the GUI.
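If you’d rather skip the GUI, Python’s built-in sqlite3 module will create the database file for you the first time you connect to it. The filename below is just an example — use whatever name and path you prefer:

```python
import sqlite3

# Connecting creates the database file if it doesn't already exist,
# so this one-liner replaces the GUI's "create database" step.
# "etf_data.db" is an example name, not one mandated by the post.
conn = sqlite3.connect("etf_data.db")
conn.close()
```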

Once SQLite is installed and we have created a database, we can get on to the task of writing the Python script to scrape the relevant data and upload it to our database.

I have decided to use the following site to scrape data from:

It holds all the data we need, and each relevant page can be scraped by copying and pasting the necessary URL into our Python script. I’ve done the hard work for you, so feel free to copy and paste the code below:
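To give a feel for what the script does, here is a minimal, self-contained sketch of the approach: parse an HTML table of ETF data and write its rows into an SQLite table. The inline HTML sample, the `etftable` name and the column set are illustrative assumptions, not the real site’s markup; for a live page you would feed the parser the fetched HTML (e.g. via `urllib.request`) instead of the sample string.

```python
import sqlite3
from html.parser import HTMLParser

# Illustrative stand-in for the HTML an ETF listing page might serve.
SAMPLE_HTML = """
<table>
  <tr><th>Symbol</th><th>Name</th><th>Underlying Index</th><th>Region</th></tr>
  <tr><td>SPY</td><td>SPDR S&amp;P 500</td><td>S&amp;P 500</td><td>North America</td></tr>
  <tr><td>EWJ</td><td>iShares MSCI Japan</td><td>MSCI Japan</td><td>Asia</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects every <tr> in the page as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell, self._in_cell = [], [], "", False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell, self._cell = True, ""
        elif tag == "tr":
            self._row = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append(self._cell.strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

def scrape_table(html):
    parser = TableParser()
    parser.feed(html)
    return parser.rows  # first row is the header

def store_etfs(rows, db_path="etf_data.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS etftable "
                 "(symbol TEXT, name TEXT, underlying_index TEXT, region TEXT)")
    conn.executemany("INSERT INTO etftable VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

if __name__ == "__main__":
    rows = scrape_table(SAMPLE_HTML)
    store_etfs(rows[1:])  # skip the header row
```

Parameterised `executemany` inserts keep the scraped strings from being interpolated into the SQL itself, which matters when the cell contents come from a third-party page.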

If you have set up the database correctly, and set the name and path correctly in the above script, you should be able to just copy and paste the code, run it, and end up with a nice SQLite database full of hundreds and hundreds of ETF tickers, along with about 12 columns of supporting information including “underlying index”, “region” and ETF “focus”, among other things.

Here’s what it should look like:

[Screenshot: the populated ETF table viewed in the database browser]

So hopefully this will prove very useful, allowing us to use a bit of SQL magic in our main Python backtest script to pull in ETF tickers based upon criteria such as underlying asset or geographic region…helping us pull down ETF tickers that are more likely to show co-integrating behaviour due to a logical, underlying economic rationale, rather than just brute-force testing each and every ETF against each other.
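As a small taste of that SQL magic, here is a toy example of the kind of filtered pull the backtest script will make. The `etftable` schema, column names and sample rows are assumptions standing in for the real scraped table:

```python
import sqlite3

# Build a tiny stand-in for the scraped ETF table (columns assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etftable (symbol TEXT, region TEXT, focus TEXT)")
conn.executemany("INSERT INTO etftable VALUES (?, ?, ?)",
                 [("SPY", "North America", "Large Cap"),
                  ("EWJ", "Asia", "Total Market"),
                  ("EWG", "Europe", "Total Market")])

# Pull back only the tickers for a given region -- the kind of filter
# that narrows the universe to economically related candidate pairs.
tickers = [row[0] for row in conn.execute(
    "SELECT symbol FROM etftable WHERE region = ?", ("Europe",))]
print(tickers)  # -> ['EWG']
```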

OK – so!! In the next post, we shall begin the real task of writing the actual backtest itself. Oh the joy! Until then…

Written by s666