Scraping stock prices and econometric data for machine learning projects

Introduction

Time series prediction is less flashy than natural language processing or computer vision, but it is no less important a target for machine learning. In this blog post, I document how to collect stock and econometric data for a data science project and how to clean it for later learning.

Data acquisition

An easy source for historical financial data is Yahoo! Finance, which can be accessed through the yfinance and pandas-datareader packages.

We can install the packages using conda or pip:

conda install yfinance pandas-datareader -c conda-forge
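
Alternatively, with pip:

pip install yfinance pandas-datareader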

I use the pandas-datareader interface to access the financial data. Besides Yahoo! Finance, it also offers a variety of other sources for econometric data, such as GDP and Eurostat statistics. As of February 2023, the built-in Yahoo! Finance module of pandas-datareader is broken and has to be patched through yfinance after the imports:

from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()

Stock and commodity data

Historical data for a single stock such as BAS.DE over a certain time interval can be acquired with

quote = pdr.DataReader("BAS.DE", '2012-01-01', '2012-01-04')

Date        Open   High   Low    Close  Adj Close  Volume
2012-01-02  53.62  55.30  53.62  55.14  31.7988    3446260
2012-01-03  55.79  56.79  55.58  56.35  32.4966    4003760
2012-01-04  55.93  56.57  55.81  56.16  32.3870    2712960

The return value of the data reader is a DataFrame that is indexed by the date (a pandas DatetimeIndex) and contains the following data for each day:

  • Open: the price at the opening of trading that day
  • High: the highest price during that day
  • Low: the lowest price during that day
  • Close: the price at the end of that day
  • Adj Close: the price at the end of that day, corrected for dividends that have been paid out and stock splits that change the stock price but do not change market capitalization
  • Volume: number of shares traded that day

Based on this information, the relative change over the course of a trading day might be a relevant indicator; let us call it movement:

$ \mathrm{Movement} = 2 \frac{\mathrm{Close}-\mathrm{Open}}{\mathrm{Open}+\mathrm{Close}}$

The range of prices during the day, normalized in the same way, can serve together with the volume as a rough estimate for the volatility:

$ \mathrm{Volatility} = 2 \frac{\mathrm{High}-\mathrm{Low}}{\mathrm{Open}+\mathrm{Close}}$
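
In pandas, both quantities can be computed column-wise on the quote DataFrame from above (a small sketch; the same expressions reappear in the collection script below):

quote['Movement'] = 2 * (quote['Close'] - quote['Open']) / (quote['Open'] + quote['Close'])
quote['Volatility'] = 2 * (quote['High'] - quote['Low']) / (quote['Open'] + quote['Close'])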

Looking at the data, it becomes apparent that the first of January was not a trading day and therefore has no price data. We will have to do some cleanup in a later section.

To get a good estimate for the change of a stock price, one needs a set of estimates for external factors that affect a company's performance. I chose general stock indices, exchange rates, and the prices of commodities that are required for the value chain:

Symbol    Name                        Description
BAS.DE    BASF SE stock price         Target variable
SPY       S&P 500 stock index         Stock index focused on the US economy
^FTSE     FTSE stock index            Stock index focused on the European economy, including large British multinationals
^STOXX    STOXX 600 stock index       Stock index focused on the European economy
EURUSD=X  EUR/USD exchange rate       Exchange rate affects the effective price of commodities and business environments
CL=F      Crude oil future            Essential commodity
NG=F      Natural gas future, US      Essential commodity for US sites
TTF=F     Natural gas future, Europe  Essential commodity for European sites

For these, the adjusted closing price and volume seem to be the most reasonable choices for training an ML model.
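
Before wiring everything up, it is worth checking that all symbols resolve; here is a quick sketch using yf.download directly (the actual collection below goes through the pandas-datareader interface):

import yfinance as yf

# download a short window for all symbols at once and list the tickers that came back
symbols = ['BAS.DE', 'SPY', '^FTSE', '^STOXX', 'EURUSD=X', 'CL=F', 'NG=F', 'TTF=F']
check = yf.download(symbols, start='2019-01-01', end='2019-02-01', group_by='ticker')
print(check.columns.get_level_values(0).unique())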

Econometric data sourcing

The Federal Reserve Bank of St. Louis offers a wealth of datasets with labor market data, interest rate and government bond rate time series, among many others. As such, it provides a good foundation for data-driven decision making. For my project, I chose the following time series:

Symbol  Name                                       Description
DFF     Federal Funds Effective Rate               Major factor on the world economy, the USD interest rate
DBAA    Moody’s Seasoned Baa Corporate Bond Yield  Proxy for risk in the US economy

The data can be retrieved using pdr.FredReader, in analogy to the general pdr.DataReader.
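
For example, the two series above can be pulled like this (a small sketch; FredReader returns a reader object whose read() method yields the DataFrame):

fred = pdr.FredReader(['DFF', 'DBAA'], start='2011-01-01', end='2019-12-31').read()
print(fred.head())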

Collecting the data

I collected the data for the period from the start of 2011 to the end of 2019 into a single DataFrame:

import numpy as np

start_time = '2011-01-01'
end_time = '2019-12-31'
fields_to_copy = ['Adj Close', 'Volume', 'Volatility', 'Movement']
# data from Yahoo! Finance
symbols_to_correlate = [
  'BAS.DE',
  'SPY', '^FTSE', '^STOXX',
  'EURUSD=X',
  'CL=F', 'NG=F', #'TTF=F' TTF=F is only available from 2014 onwards
]

# data from the Federal Reserve Bank of St. Louis (FRED)
fred_symbols = [
  'DFF', 'DBAA',
]

def add_movement(data):
    """Relative intraday change, normalized by the mean of Open and Close."""
    return (data['Close'] - data['Open']) / (data['Open'] + data['Close']) * 2

def add_volatility(data):
    """Intraday price range, normalized by the mean of Open and Close."""
    return np.abs(data['High'] - data['Low']) / (data['Open'] + data['Close']) * 2

def scrape_data(reader, symbol, start_time, end_time, fields_to_copy=()):
    """Helper function to retrieve data and rename fields for a given time span."""

    # get the data
    data = reader(symbol, start_time, end_time)

    if reader is pdr.DataReader:
        # rename the columns of interest to include the symbol
        renamedict = {
            fieldname: fieldname + ' ' + symbol
            for fieldname in fields_to_copy
        }

        data['Movement'] = add_movement(data)
        data['Volatility'] = add_volatility(data)

        data.rename(columns=renamedict, inplace=True)

        # only return the selected fields
        return data[list(renamedict.values())]

    # FRED readers return a reader object; read() yields the DataFrame
    return data.read()

# Execute scraping for yfinance
scraped_data = [
    scrape_data(pdr.DataReader, symbol, start_time, end_time, fields_to_copy)
    for symbol in symbols_to_correlate
]

# Acquire data from FRED and append it to the collection
fred_data = scrape_data(pdr.FredReader, fred_symbols, start_time, end_time)
for key in list(fred_data.columns):
  fred_data['Diff ' + key] = fred_data[key].diff()  # day-to-day change as an extra feature
scraped_data.append(fred_data)

# Join data on the date index
resx = scraped_data.pop()
while len(scraped_data) > 0:
    resx = resx.join(scraped_data.pop())
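
Before moving on, a quick sanity check on the merged frame (a small sketch) shows its shape, the covered date range, and the columns with the most missing values, which hints at how much cleanup is needed:

print(resx.shape)
print(resx.index.min(), resx.index.max())
print(resx.isna().sum().sort_values(ascending=False).head())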

Data preparation and feature engineering

Now that the columns in resx contain the aforementioned data, let’s prepare the data for training.

Dealing with incomplete data: interpolation

Considering the abundance of holidays (and the non-overlapping trading calendars of the different exchanges), many rows are incomplete. We can simply interpolate and drop the leading and trailing rows where interpolation is not possible:

resx = resx.sort_index().interpolate().dropna()

This, of course, introduces errors and should ideally be avoided.
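
If the look-ahead that interpolation introduces is a concern, a forward fill is a common alternative (a sketch, not what the rest of this post uses):

# carry the last observed value forward instead of interpolating between neighbors
resx_ffill = resx.sort_index().ffill().dropna()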

Capturing seasonality

The seasonality of business plays a big role. It can be captured by adding features that encode the periodicity of the year, as well as the day of the week and the month.

import numpy as np

def seasonalize_date(x, periodic_func=np.sin):
    # map the date to a phase within the year: month plus an approximate fraction of the month
    return periodic_func(((x.month - 1) + (x.day - 1) / 29.5) / 12 * np.pi * 2)

resx['season-sin'] = [
  seasonalize_date(x, periodic_func=np.sin) for x in resx.index]
resx['season-cos'] = [
  seasonalize_date(x, periodic_func=np.cos) for x in resx.index]
resx['weekday'] = [x.weekday() for x in resx.index]
resx['month'] = [x.month for x in resx.index]

Normalizing the data (optional)

Depending on where your project may take you, normalizing the data may be required. Scikit-learn includes a simple way to do this using the StandardScaler, which

  • subtracts the mean $\mu$ and
  • divides by the standard deviation $\sigma$ in a column-wise manner:

from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
resx_scaled = pd.DataFrame(
    scaler.fit_transform(resx), columns=resx.columns, index=resx.index
)

The mean is saved as scaler.mean_ and the variance as scaler.var_. Further data points can be scaled using scaler.transform, and the scaling can be undone with scaler.inverse_transform.
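
As a small usage sketch, the fitted scaler can be applied to further rows and the scaling undone again, which is useful later for converting model outputs back to the original units:

new_scaled = scaler.transform(resx.iloc[-5:])     # scale a few (here: the last five) rows
recovered = scaler.inverse_transform(new_scaled)  # back in the original units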

Train–validation–test splitting

For training any machine-learning model, performing a train–validation–test split is necessary. Randomly shuffled splits such as those produced by scikit-learn's train_test_split are not appropriate for time series data, because they leak future information into the training set. For simplicity, we do a chronological 70%-20%-10% split:

n = len(resx_scaled)
train_df = resx_scaled.iloc[:int(0.7*n)]
val_df = resx_scaled.iloc[int(0.7*n):int(0.9*n)]
test_df = resx_scaled.iloc[int(0.9*n):]

Note that in the case of time series, this split already introduces a bias towards worse validation and test scores because the most recent trends are not captured during training. After validating a model structure using this split, it is highly advisable to retrain it using the most up-to-date data available.

Data handling

Repeatedly scraping large amounts of data is not looked upon favorably by free services such as Yahoo! Finance. Thus, I recommend saving the data after scraping and re-reading it from disk rather than scraping again.

Data saving

Saving the data to disk is rather straightforward. I recommend the HDF format because it allows multiple sets of structured data to be stored efficiently in a single file (note that pandas' HDF support requires the PyTables package).

output_filename = 'data.hdf'
resx.to_hdf(output_filename, key='unscaled')
resx_scaled.to_hdf(output_filename, key='scaled')
train_df.to_hdf(output_filename, key='training')
val_df.to_hdf(output_filename, key='validation')
test_df.to_hdf(output_filename, key='test')

Data loading

output_filename = 'data.hdf'
resx = pd.read_hdf(output_filename, key='unscaled')
resx_scaled = pd.read_hdf(output_filename, key='scaled')
train_df = pd.read_hdf(output_filename, key='training')
val_df = pd.read_hdf(output_filename, key='validation')
test_df = pd.read_hdf(output_filename, key='test')