
Scraping ETF data from the web and sticking it in an SQLite3 database

by s666


As promised, here is the first part of my “ETF mean reversion strategy backtest” series. This article deals with scraping ETF data from the web and sticking it in an SQLite3 database.

I’m using SQLite as it’s very easy to set up and requires nothing complex to get up and running. First, you need to visit the SQLite download page and download the precompiled binaries from the Windows section.

You will need to:

1) Download the sqlite-shell-win32-*.zip and sqlite-dll-win32-*.zip files.

2) Create a folder C:\sqlite and unzip the two files above into it, which will give you the sqlite3.def, sqlite3.dll and sqlite3.exe files.

3) Add C:\sqlite to your PATH environment variable, then open a command prompt and issue the sqlite3 command, which should display something like the output below.

C:\>sqlite3
SQLite version 3.7.15.2 2013-01-09 11:53:05
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

The next thing we need to do is actually create the database we will be working with. Find and run sqlitebrowser.exe. This will open the GUI we will use to create the database. It’s pretty self-explanatory, and the “create database” option can be found in the top left corner of the GUI.

Once SQLite is installed and we have created a database, we can get on to the task of writing the Python script to scrape the relevant data and upload it to our database.

I have decided to use the following site to scrape data from:

www.etf.com

It holds all the data we need and each relevant page can be scraped by copying and pasting the necessary URL into our python script. I’ve done the hard work for you, so feel free to copy and paste the code below:

import pandas as pd
import sqlite3 as db
import requests


#create empty list to store our data scraped from web
frames = []

#create a string of webpages which we will feed into the html parser
s = '''
http://www.etf.com/channels/bond-etfs
http://www.etf.com/channels/mlp-etfs
http://www.etf.com/channels/silver-etfs
http://www.etf.com/channels/china-etfs
http://www.etf.com/channels/muni-etfs
http://www.etf.com/channels/us-broad-market-bond-etfs
http://www.etf.com/channels/dividend-etfs
http://www.etf.com/channels/natural-gas-etfs
http://www.etf.com/channels/global-bond-etfs
http://www.etf.com/channels/oil-etfs
http://www.etf.com/channels/treasury-etfs
http://www.etf.com/channels/gold-etfs
http://www.etf.com/channels/reit-etfs
http://www.etf.com/channels/high-dividend-yield-etfs
http://www.etf.com/channels/japan-etfs
http://www.etf.com/channels/smart-beta-etfs
http://www.etf.com/etf-lists/alternatives-etfs
http://www.etf.com/etf-lists/asset-allocation-etfs
http://www.etf.com/etf-lists/currency-etfs
http://www.etf.com/etf-lists/fixed-income-etfs
http://www.etf.com/channels/alpha-seeking-etfs
http://www.etf.com/channels/basic-materials-etfs
http://www.etf.com/channels/consumer-cyclicals-etfs
http://www.etf.com/channels/consumer-non-cyclicals-etfs
http://www.etf.com/channels/energy-etfs
http://www.etf.com/channels/extended-market-etfs
http://www.etf.com/channels/financials-etfs
http://www.etf.com/channels/health-care-etfs
http://www.etf.com/channels/high-dividend-yield-etfs
http://www.etf.com/channels/industrials-etfs
http://www.etf.com/channels/real-estate-etfs
http://www.etf.com/channels/small-cap-etfs
http://www.etf.com/channels/technology-etfs
http://www.etf.com/channels/telecommunications-etfs
http://www.etf.com/channels/theme-etfs
http://www.etf.com/channels/total-market-etfs
http://www.etf.com/channels/utilities-etfs
http://www.etf.com/channels/asia-pacific-etfs
http://www.etf.com/channels/developed-markets-etfs
http://www.etf.com/channels/emerging-markets-etfs
http://www.etf.com/channels/europe-etfs
http://www.etf.com/channels/global-etfs
http://www.etf.com/channels/global-ex-us-etfs
http://www.etf.com/channels/latin-america-etfs
http://www.etf.com/channels/middle-east-and-africa-etfs
'''


#split urls in string and feed into pandas html parser to create dataframes of scraped info
for i in s.split():
    print "Scraping data from {}.".format(i)
    df = pd.read_html(requests.get(i,headers={'User-agent': 'Mozilla/5.0'}).text)
    #df holds multiple DataFrames - index [5] is the DataFrame we
    #are interested in. The scraped data has slightly different
    #headings across different webpages, so we reset the
    #column headings to be identical for each DataFrame;
    #these names also match the columns we
    #will set up in our SQLite database
    df[5].columns = ['Fund Name','Ticker','Asset Class','Region','Geography','Category','Focus',
                      'Niche', 'Inverse','Leveraged','ETN','Underlying Index','Selection Criteria',
                      'Weighting Scheme','Active per SEC']
    frames.append(df[5])

#create a "master" dataframe which concatenates all the relevant 
#(index 5) DataFrames together
masterFrame = pd.concat(frames)

#create connection to our previously created SQLite database
#use the path and name which corresponds to your local database
cnx = db.connect(r'C:\Users\Stuart\sqlite_databases\etfs.db')
cur = cnx.cursor()

#remove table if already exists and any data it contains
cur.execute('DROP TABLE IF EXISTS etftable;')

#create the table within the database
sql = '''CREATE TABLE etftable ('Fund Name' TEXT, 'Ticker' TEXT, 'Asset Class' TEXT, 
                                'Region' TEXT, 'Geography' TEXT, 'Category' TEXT, 'Focus' TEXT,
                                'Niche' TEXT, 'Inverse' TEXT, 'Leveraged' TEXT, 'ETN' TEXT, 
                                'Underlying Index' TEXT, 'Selection Criteria' TEXT, 'Weighting Scheme' TEXT, 
                                'Active per SEC' TEXT)'''

cur.execute(sql)

#append the data
masterFrame.to_sql(name='etftable', con=cnx, if_exists = 'append', index=False)

cnx.close()

If you have set up the database correctly, and set the name and path correctly in the above script, you should be able to just copy and paste this code, run it, and end up with a nice SQLite database full of hundreds and hundreds of ETF tickers, along with about a dozen columns of supporting information including “underlying index”, “region” and ETF “focus”, among other things.

Here’s what it should look like:

[Screenshot: the populated etftable viewed in the DB Browser for SQLite GUI]

So hopefully this will prove very useful, allowing us to use a bit of SQL magic in our main Python backtest script to pull in ETF tickers based upon criteria such as “underlying asset class” or geographic region. That should help us select ETF tickers that are more likely to show co-integrating behaviour due to a logical, underlying economic rationale, rather than just brute-force testing each and every ETF against every other.

OK – so!! In the next post, we shall begin the real task of writing the actual backtest itself. Oh the joy! Until then…

Comments

Dan January 12, 2017 - 8:24 pm

Great addition here.

I was trying to run your exact code and got a 403 Error: Forbidden. Can you show your pd.show_versions()?

Really helpful!

Dan January 12, 2017 - 8:25 pm

To be clear, this was in pd.read_html().

s666 January 25, 2017 - 2:22 am

Hi Dan…OK, I have figured out a fix. Basically, the ETF site was blocking the default “User-Agent” sent by the pandas “read_html()” function. All we need to do is import the “requests” module and alter the user-agent using that – something along the lines of the snippet below.

I have updated the code and it should all be working again now.

Let me know if it isn’t!!

s666 January 19, 2017 - 9:08 am

Hi Dan, thanks for your comment. OK, so I have taken a quick look and it seems there is nothing wrong with your versions etc.; rather, the ETF website I was using to scrape the data from no longer allows requests from the pandas “read_html()” function.

I haven’t had time to come up with an alternative, which is a shame – I will try to update this when I get some free time.

To get you started on your own path perhaps, you could look into the use of the “Requests” module.

I quickly tried:

import requests
from bs4 import BeautifulSoup

url = "http://www.etf.com/channels/middle-east-and-africa-etfs"

soup = BeautifulSoup(requests.get(url).text, 'lxml')

print soup # prints the HTML you are expecting

for example, and it successfully scraped the site…so that could be a way around it. Obviously the rest of the code would have to be adapted to deal with the new format of the scraped data.

At least it’s a start for you and I will try to update when I have some free time as mentioned previously!

Dan January 25, 2017 - 2:32 am

S666 – Many thanks, I’ll take a look. I ended up using Selenium.Webdriver to open up Chrome and hit the relevant x_paths for the data. I realize I still have to clean the code up but there’s some good stuff here.

import pandas as pd
from yahoo_finance import Share
import requests
from bs4 import BeautifulSoup
import re
from json import loads
import locale
from selenium import webdriver
import sqlite3 as db

locale.setlocale(locale.LC_ALL, '') 
chromedriver_path = "CHROME SPECIFIC PATH HERE.EXE"  #replace with your local chromedriver path
driver = webdriver.Chrome(chromedriver_path)

def xpath_to_list(xpath, n): 
    """Returns list of list for given xpath and number of desired elements."""
    try:
        xpath_list = driver.find_element_by_xpath(xpath)
        xpath_list = xpath_list.text.split("\n")
   
        def chunker(alist, n): 
            """Yield successive n-sized chunks of 'alist'."""
            for i in range(0, len(alist), n): 
                yield alist[i:i+n]
    
        temp_list = list(chunker(xpath_list, n))
        return list(temp_list)
    except: 
        temp_list = []
        return temp_list   

def to_df(adict):
    """Create MultiIndex Dataframe from dictionary of k=characteristics &
    v=values, data scraped from etf.com."""
    temp_df = pd.DataFrame()
    for item in adict: 
        if bool(adict[item]) == True: 
            item_df = pd.DataFrame(adict[item])
            item_df.index = pd.MultiIndex.from_product([[item],list(item_df[0])])
            del item_df[0]
            item_df.rename(columns={1:etf_symbol}, inplace=True)
        else:
            item_df = pd.DataFrame()
        temp_df = pd.concat([temp_df, item_df])
    return temp_df

def numeric_coerce(col_list, df):
    """Coerces list of columns in a given df to numbers using pd.Numeric"""
    for col in col_list: 
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

nasdaq_conn = "http://www.nasdaq.com/investing/etfs/etf-finder-results.aspx?download=Yes"
symbol_df = pd.read_csv(nasdaq_conn)
symbol_srs = pd.Series(symbol_df["Symbol"])
etf_df = pd.DataFrame()
for etf_symbol in symbol_srs[:10]:
    etf_conn = "http://www.etf.com/"+ etf_symbol
    print("Scraping " + etf_conn)
    driver.get(etf_conn)
    countries = xpath_to_list('//*[@id="lblchartTop10Countries"]', 2)
    sectors = xpath_to_list('//*[@id="lblchartTop10Sectors"]', 2)
    holders = xpath_to_list('//*[@id="lblchartTop10Holdings"]', 2)
    summary = xpath_to_list('//*[@id="form-reports-overview"]/div[2]/div[1]/div', 2)
    summary = summary[:-1]
    port_data = xpath_to_list('//*[@id="form-reports-overview"]/div[2]/div[3]/div', 2)
    segment = xpath_to_list('//*[@id="form-reports-title"]/div/div[3]/div[1]/div/a',1)
    
    if len(segment) != 0:
        segment[0].append("ETF Segment")
        segment = [segment[0][::-1]]
    else: 
        pass
    dict_of_attrs = {"countries":countries, "sectors":sectors, "holdings":holders, 
                     "summary": summary, "PortData":port_data, "segment":segment}
    temp_df = to_df(dict_of_attrs)
    if temp_df.empty: 
        pass
    else: 
        if etf_df.empty:
            etf_df = temp_df
        else:
            etf_df = etf_df.join(temp_df, how="outer")


cnx = db.connect("DB SPECIFIC PATH HERE")
cur = cnx.cursor()
cur.execute("DROP TABLE IF EXISTS etftable;")

sql = """CREATE TABLE etftable 
        ("Fund Name", "Distribution Yield" TEXT, "Duration" TEXT, 
        "Income Yield TTM" TEXT, "Number of Commodities" TEXT, 
        "Number of Holdings" TEXT, "Price to Book Ratio" TEXT, 
        "Price to Earnings Ratio" TEXT, "Rolling Strategy" TEXT, "Strategy" TEXT,
        "Weighted Average Market Cap" TEXT, "Weighted Avg Maturity" TEXT, 
        "Yield to Maturity" TEXT)"""
        
sql = cur.execute(sql)

sql_etf = etf_df.loc["PortData"].T
sql_etf.reset_index(inplace=True)
sql_etf.drop(sql_etf[["Access", "Rebalancing Frequency", "Exposure Type",
                      "Next Distribution Date", "Next Ex-Dividend Date",
                      "Yield to Maturity"]], 
             axis=1, inplace=True)

sql_etf.rename(columns={"index":"Fund Name", 
                        "Income Yield (TTM)":"Income Yield TTM",
                        "Price / Book Ratio":"Price to Book Ratio",
                        "Price / Earnings Ratio":"Price to Earnings Ratio",
                        "Weighted Avg. Maturity":"Weighted Avg Maturity"},
                        inplace=True) 

exempt_list = ["Weighted Average Market Cap", "Fund Name", "Strategy", 
               "Rolling Strategy","Weighted Avg Maturity"]

sql_col_list = [col for col in list(sql_etf.columns) if col not in exempt_list]
sql_etf = numeric_coerce(sql_col_list, sql_etf)

sql_etf.to_sql(name="etftable", con=cnx, if_exists="append", index=False)
Luke April 9, 2017 - 11:44 pm

Hi,
I tried to run your code but an error came up:
ValueError: Length mismatch: Expected axis has 17 elements, new values have 15 elements

Could you please tell me what went wrong?

Thx!

s666 April 15, 2017 - 9:09 pm

Hi Luke…apologies for the apparent delay in replying; I actually replied the same day as your post but for some reason it never showed up, and I just realised.

If you were able to post the complete error message and traceback, that would allow me to get an idea as to what line of code is causing the problem and try to assist.

Cheers

s666 April 15, 2017 - 9:14 pm

Actually, I believe I can guess which line of code is causing the error…it’s the line which attempts to assign new column headers to the DataFrame. I assume the structure of the website table has changed since I wrote this article, and it now contains 17 columns rather than the 15 new column headers we are trying to assign.

Perhaps you could debug by printing one of the individual DataFrames to see how many columns it contains and what the column names are. Then you can just adjust the list of new column headers in the assignment so that it matches – something like the snippet below.

If any of that doesn’t make sense, let me know and I’ll explain further.

I’ll also fix the code and repost it when I get a free moment anyway.

Janusz May 21, 2017 - 5:20 pm
### Corrected python code ### 
#########################
import pandas as pd
import sqlite3 as db
import requests
 
 
#create empty list to store our data scraped from web
frames = []
 
#create a string of webpages which we will feed into the html parser
s = '''
http://www.etf.com/channels/bond-etfs
http://www.etf.com/channels/mlp-etfs
http://www.etf.com/channels/silver-etfs
http://www.etf.com/channels/china-etfs
http://www.etf.com/channels/muni-etfs
http://www.etf.com/channels/us-broad-market-bond-etfs
http://www.etf.com/channels/dividend-etfs
http://www.etf.com/channels/natural-gas-etfs
http://www.etf.com/channels/global-bond-etfs
http://www.etf.com/channels/oil-etfs
http://www.etf.com/channels/treasury-etfs
http://www.etf.com/channels/gold-etfs
http://www.etf.com/channels/reit-etfs
http://www.etf.com/channels/high-dividend-yield-etfs
http://www.etf.com/channels/japan-etfs
http://www.etf.com/channels/smart-beta-etfs
http://www.etf.com/etf-lists/alternatives-etfs
http://www.etf.com/etf-lists/asset-allocation-etfs
http://www.etf.com/etf-lists/currency-etfs
http://www.etf.com/etf-lists/fixed-income-etfs
http://www.etf.com/channels/alpha-seeking-etfs
http://www.etf.com/channels/basic-materials-etfs
http://www.etf.com/channels/consumer-cyclicals-etfs
http://www.etf.com/channels/consumer-non-cyclicals-etfs
http://www.etf.com/channels/energy-etfs
http://www.etf.com/channels/extended-market-etfs
http://www.etf.com/channels/financials-etfs
http://www.etf.com/channels/health-care-etfs
http://www.etf.com/channels/high-dividend-yield-etfs
http://www.etf.com/channels/industrials-etfs
http://www.etf.com/channels/real-estate-etfs
http://www.etf.com/channels/small-cap-etfs
http://www.etf.com/channels/technology-etfs
http://www.etf.com/channels/telecommunications-etfs
http://www.etf.com/channels/theme-etfs
http://www.etf.com/channels/total-market-etfs
http://www.etf.com/channels/utilities-etfs
http://www.etf.com/channels/asia-pacific-etfs
http://www.etf.com/channels/developed-markets-etfs
http://www.etf.com/channels/emerging-markets-etfs
http://www.etf.com/channels/europe-etfs
http://www.etf.com/channels/global-etfs
http://www.etf.com/channels/global-ex-us-etfs
http://www.etf.com/channels/latin-america-etfs
http://www.etf.com/channels/middle-east-and-africa-etfs
'''
 
 
#split urls in string and feed into pandas html parser to create dataframes of scraped info
for i in s.split():
    print "Scraping data from {}.".format(i)
    df = pd.read_html(requests.get(i,headers={'User-agent': 'Mozilla/5.0'}).text)
    #df holds multiple DataFrames - index [5] is the DataFrame we
    #are interested in. The scraped data has slightly different
    #headings across different webpages, so we reset the
    #column headings to be identical for each DataFrame;
    #these names also match the columns we
    #will set up in our SQLite database
    df[5].columns = ['Fund Name','Ticker','Asset Class','Region','Geography','Category','Focus',
                      'Niche', 'Inverse','Leveraged','ETN','Underlying Index','Selection Criteria',
                      'Weighting Scheme','Active per SEC','Index Provider','Strategy']
    frames.append(df[5])
 
#create a "master" dataframe which concatenates all the relevant 
#(index 5) DataFrames together
masterFrame = pd.concat(frames)
 
#create connection to our previously created SQLite database
#use the path and name which corresponds to your local database
cnx = db.connect('C:\Program Files\DB Browser for SQLite\etfs.db')
cur = cnx.cursor()
 
#remove table if already exists and any data it contains
cur.execute('DROP TABLE IF EXISTS etftable;')
 
#create the table within the database
sql = '''CREATE TABLE etftable ('Fund Name' TEXT, 'Ticker' TEXT, 'Asset Class' TEXT, 
                                'Region' TEXT, 'Geography' TEXT, 'Category' TEXT, 'Focus' TEXT,
                                'Niche' TEXT, 'Inverse' TEXT, 'Leveraged' TEXT, 'ETN' TEXT, 
                                'Underlying Index' TEXT, 'Selection Criteria' TEXT,
                                'Weighting Scheme' TEXT, 'Active per SEC' TEXT,
                                'Index Provider' TEXT, 'Strategy' TEXT)'''
                                
 
cur.execute(sql)
 
#append the data
masterFrame.to_sql(name='etftable', con=cnx, if_exists = 'append', index=False)
 
cnx.close()

Jacob January 4, 2018 - 4:47 pm

Just came across this, thanks for the code!

I don’t know if things have changed since you posted this, but as of now, when I tried the code it did not scrape all the data – far from it. Usually, more data is available per ETF category, but only the first 20 results appear in the table (if you go to the website, you have to click through to page 2 to get the next 20 results, etc.). Unfortunately, the code scrapes only the first 20 entries in each category.

Carlos March 30, 2018 - 5:14 am

I also came across this post, googling for data web scraping and DB storing.

The code does not work for me. I included some auditing code and found that today not a single DataFrame has 15 columns, and DataFrame 5 definitely does not seem to be the one you would want to be scraping for.

My code…

import pandas as pd
import sqlite3 as db
import requests

#create empty list to store our data scraped from web
frames = []

#create a string of webpages which we will feed into the html parser
s = '''
http://www.etf.com/channels/bond-etfs
http://www.etf.com/channels/mlp-etfs
http://www.etf.com/channels/silver-etfs
http://www.etf.com/channels/china-etfs
http://www.etf.com/channels/muni-etfs
http://www.etf.com/channels/us-broad-market-bond-etfs
http://www.etf.com/channels/dividend-etfs
http://www.etf.com/channels/natural-gas-etfs
http://www.etf.com/channels/global-bond-etfs
http://www.etf.com/channels/oil-etfs
http://www.etf.com/channels/treasury-etfs
http://www.etf.com/channels/gold-etfs
http://www.etf.com/channels/reit-etfs
http://www.etf.com/channels/high-dividend-yield-etfs
http://www.etf.com/channels/japan-etfs
http://www.etf.com/channels/smart-beta-etfs
http://www.etf.com/etf-lists/alternatives-etfs
http://www.etf.com/etf-lists/asset-allocation-etfs
http://www.etf.com/etf-lists/currency-etfs
http://www.etf.com/etf-lists/fixed-income-etfs
http://www.etf.com/channels/alpha-seeking-etfs
http://www.etf.com/channels/basic-materials-etfs
http://www.etf.com/channels/consumer-cyclicals-etfs
http://www.etf.com/channels/consumer-non-cyclicals-etfs
http://www.etf.com/channels/energy-etfs
http://www.etf.com/channels/extended-market-etfs
http://www.etf.com/channels/financials-etfs
http://www.etf.com/channels/health-care-etfs
http://www.etf.com/channels/high-dividend-yield-etfs
http://www.etf.com/channels/industrials-etfs
http://www.etf.com/channels/real-estate-etfs
http://www.etf.com/channels/small-cap-etfs
http://www.etf.com/channels/technology-etfs
http://www.etf.com/channels/telecommunications-etfs
http://www.etf.com/channels/theme-etfs
http://www.etf.com/channels/total-market-etfs
http://www.etf.com/channels/utilities-etfs
http://www.etf.com/channels/asia-pacific-etfs
http://www.etf.com/channels/developed-markets-etfs
http://www.etf.com/channels/emerging-markets-etfs
http://www.etf.com/channels/europe-etfs
http://www.etf.com/channels/global-etfs
http://www.etf.com/channels/global-ex-us-etfs
http://www.etf.com/channels/latin-america-etfs
http://www.etf.com/channels/middle-east-and-africa-etfs
'''

#split urls in string and feed into pandas html parser to create dataframes of scraped info
for i in s.split():
    print "Scraping data from {}.".format(i)
    df = pd.read_html(requests.get(i,headers={'User-agent': 'Mozilla/5.0'}).text)
    print type(df)
    print len(df)
    for j in xrange(1, len(df)):
        print "Dataframe", j, ",", len(df[j].columns), "columns"
        print df[j].columns
        print
    #df holds multiple DataFrames - index [5] is the DataFrame we
    #are interested in. The scraped data has slightly different
    #headings across different webpages, so we reset the
    #column headings to be identical for each DataFrame;
    #these names also match the columns we
    #will set up in our SQLite database
    df[5].columns = ['Fund Name','Ticker','Asset Class','Region','Geography','Category','Focus',
                     'Niche', 'Inverse','Leveraged','ETN','Underlying Index','Selection Criteria',
                     'Weighting Scheme','Active per SEC']
    frames.append(df[5])

#create a "master" dataframe which concatenates all the relevant
#(index 5) DataFrames together
masterFrame = pd.concat(frames)

#create connection to our previously created SQLite database
#use the path and name which corresponds to your local database
cnx = db.connect('C:\Users\Stuart\sqlite_databases\etfs.db')
cur = cnx.cursor()

#remove table if already exists and any data it contains
cur.execute('DROP TABLE IF EXISTS etftable;')

#create the table within the database
sql = '''CREATE TABLE etftable ('Fund Name' TEXT, 'Ticker' TEXT, 'Asset Class' TEXT,
                                'Region' TEXT, 'Geography' TEXT, 'Category' TEXT, 'Focus' TEXT,
                                'Niche' TEXT, 'Inverse' TEXT, 'Leveraged' TEXT, 'ETN' TEXT,
                                'Underlying Index' TEXT, 'Selection Criteria' TEXT, 'Weighting Scheme' TEXT,
                                'Active per SEC' TEXT)'''

cur.execute(sql)

#append the data
masterFrame.to_sql(name='etftable', con=cnx, if_exists='append', index=False)

cnx.close()

The result…

========== RESTART: /Users/carlosgorricho/Documents/import_etf.py ===========
Scraping data from http://www.etf.com/channels/bond-etfs.
<type 'list'>
15
Dataframe 1 , 3 columns
Int64Index([0, 1, 2], dtype='int64')

Dataframe 2 , 3 columns
Int64Index([0, 1, 2], dtype='int64')

Dataframe 3 , 3 columns
Int64Index([0, 1, 2], dtype='int64')

Dataframe 4 , 3 columns
Int64Index([0, 1, 2], dtype='int64')

Dataframe 5 , 3 columns
Int64Index([0, 1, 2], dtype='int64')

Dataframe 6 , 7 columns
Index([u'Ticker', u'Fund Name', u'Issuer', u'Expense Ratio', u'AUM',
       u'Spread %', u'Segment'],
      dtype='object')

Dataframe 7 , 8 columns
Index([u'Ticker', u'Fund Name', u'1 Month', u'3 Month', u'1 Year', u'5 Year',
       u'10 Year', u'As Of'],
      dtype='object')

Dataframe 8 , 9 columns
Index([u'Ticker', u'Fund Name', u'Issuer', u'Segment', u'Analyst Pick',
       u'Grade', u'E', u'T', u'F'],
      dtype='object')

Dataframe 9 , 10 columns
Index([u'Ticker', u'Fund Name', u'DividendYield', u'P/E', u'P/B', u' ',
       u'Duration', u'CreditQuality', u'YTM', u' .1'],
      dtype='object')

Dataframe 10 , 2 columns
Index([u'Ticker', u'Fund Name'], dtype='object')

Dataframe 11 , 17 columns
Index([u'Fund Name', u'Ticker', u'Asset Class', u'Strategy', u'Region',
       u'Geography', u'Category', u'Focus', u'Niche', u'Inverse', u'Leveraged',
       u'ETN', u'Underlying Index', u'Index Provider', u'Selection Criteria',
       u'Weighting Scheme', u'Active per SEC'],
      dtype='object')

Dataframe 12 , 6 columns
Index([u'Ticker', u'Fund Name', u'LegalStructure', u'MAX LT CapitalGains Rate',
       u'Max ST CapitalGains Rate', u'TaxReporting'],
      dtype='object')

Dataframe 13 , 8 columns
Index([u'Ticker', u'Fund Name', u'MSCI ESGQuality Score',
       u'ESG ScorePeer Rank', u'ESG ScoreGlobal Rank',
       u'Carbon IntensityTons CO2e/$M Sales', u'SustainableImpact Exposure',
       u'SRI ScreeningCriteria Exposure'],
      dtype='object')

Dataframe 14 , 1 columns
Int64Index([0], dtype='int64')

Traceback (most recent call last):
  File "/Users/carlosgorricho/Documents/import_etf.py", line 77, in <module>
    'Weighting Scheme','Active per SEC']
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 3627, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 559, in _set_axis
    self._data.set_axis(axis, labels)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 3074, in set_axis
    (old_len, new_len))
ValueError: Length mismatch: Expected axis has 3 elements, new values have 15 elements
>>>

Peter Koppelman April 3, 2018 - 1:29 pm

I’m having problems at the line of code df[5].columns. When I run the code I get the error message “ValueError: Length mismatch: Expected axis has 3 elements, new values have 15 elements”. When I print df[5] I get the column headings Ticker, Fund Name, Pct Change. I think that assigning to .columns only allows you to change the names of the current columns; it won’t allow you to add new columns. Based on some research that I’ve done there appear to be several ways to do this (df.loc might do it), but why are you doing this? If you just have three columns of information for each ETF, can’t you just upload them to the database table and leave the other columns blank? What is the logic for adding the other 12 columns of data at this point in the code?

Peter Koppelman April 4, 2018 - 2:11 am

I started to look at the underlying data to try and figure out what was going wrong. I think that I found something…

I commented out the line of code df[5].columns =…. and put in a print(df[5]) statement inside the loop. That allowed me to see the data that was returned for each scraping. I found that most of the web sites returned three columns: ticker, fund name and percent change. The following web sites did not return the three columns. They returned a lot more information:
1) http://www.etf.com/channels/silver-etfs
2) http://www.etf.com/channels/natural-gas-etfs
3) http://www.etf.com/channels/extended-market-etfs
4) http://www.etf.com/channels/telecommunications-etfs
5) http://www.etf.com/channels/middle-east-and-africa-etfs

When I went to http://www.etf.com and looked at the individual ETF Channels I found that each of the ETFs that returned the three columns (ticker, fund name and percentage change) had six lists at the top of the page. The six lists were:
1) Top ETF Performers
2) Bottom ETF Performers
3) Top Creations
4) Top Redemptions
5) AUM Winners
6) AUM Losers

The code is pulling the top five AUM Losers (df[5] picks that table out of the list of tables that read_html returns).

The five ETFs that did not return the three columns did not have the six lists at the top of the web page. The code went to the “master list” of ETFs at the bottom of the screen and pulled the first five ETFs from the Classification list.

Peter Koppelman April 10, 2018 - 10:46 pm

I’ve been working on the scraping and I’ve found a few things…
1) If you look at the main menu on etf.com you’ll see an option called etf tools and data. Under this is an option etf screener and database. This is the master database for this website. You can see etf results for over 2,100 different etfs. I believe that you were scraping data from the classification tab. If you want to try and get data, this is the place to go. You only have to go to one website. It’s easier to manage.
2) There is no way to scrape anything more than the information on the 20 etfs on page 1. I believe that when you scraped data using your code you got data from about 50 different screens which gave you a lot of data, but not the entire set of 2,100+ etfs. I’ve run your code, changed the code, posted a question on stackoverflow.com and finally contacted people at etf.com. They said that there is no way to scrape the full data set as the data belongs to third parties and there are licensing restrictions.

I’m assuming that you are going to delete this comment, but you might want to put a note above your code stating that the website has been changed and there is no longer a way to scrape all of the data off of it. That way people won’t get frustrated and spin their wheels for no reason trying to get the code to work.

bob July 2, 2018 - 4:05 am

Was able to get the complete list of current ETFs by using this code. The list of ETFs in the post seems outdated.

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.etf.com/channels")
data = r.text

soup = BeautifulSoup(data, "lxml")

s = []
for link in soup.find_all('a'):
    if '/channels/' in str(link.get('href')):
        s.append('http://www.etf.com' + str(link.get('href')))

s

Thomas A. Anderson February 12, 2019 - 2:40 pm

Here is a working example:

Get data from website

res = requests.get("https://www.etf.com/channels/bond-etfs")

Find ETF results

soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('div', id='etf-finder-results')

Find base64Data

quoted = re.compile('"[^"]*"')
data = quoted.findall(str(table[0]))[8]
decoded = base64ToString(data)[1:-1]

Find all json-like entries in the data

items = re.findall(r'{(.*?)}', decoded)

Convert to json entries

json_data = {}
c = 0
for item in items:
    json_data.update({c: json.loads("{" + item + "}")})
    c += 1

s666 February 12, 2019 - 4:47 pm

Many thanks to everyone who is helping with posts containing up to date working code!!! This has been an absolute nightmare to try to keep on top of and try to keep a working version of the code updated…

I shall probably put a big NOTICE at the start of the post stating that guests should check the comments section rather than struggle with the code as posted in the article.

Once again, Thomas…and all who posted previously – many thanks for your efforts and helping keep this as a “working example”!!

Your efforts are much appreciated!!!

