Scraping stock prices and econometric data for machine learning projects

Introduction

Time series prediction is less flashy than natural language processing or computer vision, but it is no less important a target for machine learning. In this blog post, I document how to collect stock and econometric data for a data science project and how to clean it for later learning.

Data acquisition

An easy source for historical financial data is Yahoo! Finance, which can be accessed through the yfinance and pandas-datareader packages.

We can install the packages using conda or pip:

conda install yfinance pandas-datareader -c conda-forge
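
Alternatively, with pip:

pip install yfinance pandas-datareader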

I use the pandas-datareader interface to access the financial data. Besides Yahoo! Finance, it also offers a variety of other sources for econometric data, such as GDP and Eurostat statistics. As of February 2023, the built-in Yahoo! Finance module of pandas-datareader is broken and has to be patched through yfinance after the imports:

from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()

Stock and commodity data

Historical data for a single stock such as BAS.DE over a certain time interval can be acquired with

quote = pdr.DataReader("BAS.DE", '2012-01-01', '2012-01-04')

Date        Open   High   Low    Close  Adj Close  Volume
2012-01-02  53.62  55.30  53.62  55.14  31.7988    3446260
2012-01-03  55.79  56.79  55.58  56.35  32.4966    4003760
2012-01-04  55.93  56.57  55.81  56.16  32.3870    2712960

The return value of the data reader is a DataFrame that is indexed by the date (a pandas DatetimeIndex) and contains the following data for each day:

  • Open: the price at the opening of trading that day
  • High: the highest price during that day
  • Low: the lowest price during that day
  • Close: the price at the end of that day
  • Adj Close: the price at the end of that day, corrected for dividends that have been paid out and stock splits that change the stock price but do not change market capitalization
  • Volume: number of shares traded that day

Based on this information, the relative change over the course of a trading day might be a relevant indicator; let us call it movement:

$ \mathrm{Movement} = 2 \frac{\mathrm{Close}-\mathrm{Open}}{\mathrm{Open}+\mathrm{Close}}$

The range of prices during the day, normalized in the same way, can serve together with the volume as a rough estimate for the volatility:

$ \mathrm{Volatility} = 2 \frac{\mathrm{High}-\mathrm{Low}}{\mathrm{Open}+\mathrm{Close}}$
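
In pandas, both quantities can be computed column-wise on the quote DataFrame from above (a small sketch; the same expressions reappear in the collection script below):

quote['Movement'] = 2 * (quote['Close'] - quote['Open']) / (quote['Open'] + quote['Close'])
quote['Volatility'] = 2 * (quote['High'] - quote['Low']) / (quote['Open'] + quote['Close'])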

Looking at the data, it becomes apparent that the first of January was not a trading day and therefore has no price data. We will have to do some cleanup in a later section.

To get a good estimate for the change of a stock price, one needs a set of estimates for external factors that affect a company's performance. I chose general stock indices, exchange rates, and the prices of commodities that are required for the value chain:

Symbol    Name                        Description
BAS.DE    BASF SE stock price         Target variable
SPY       S&P 500 stock index         Stock index focused on the US economy
^FTSE     FTSE stock index            Stock index focused on the European economy, including large British multinationals
^STOXX    STOXX 600 stock index       Stock index focused on the European economy
EURUSD=X  EUR/USD exchange rate       Exchange rate affects the effective price of commodities and business environments
CL=F      Crude oil future            Essential commodity
NG=F      Natural gas future, US      Essential commodity for US sites
TTF=F     Natural gas future, Europe  Essential commodity for European sites

For these, the adjusted closing price and volume seem to be the most reasonable choices for training an ML model.
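
Before wiring everything up, it is worth checking that all symbols resolve; here is a quick sketch using yf.download directly (the actual collection below goes through the pandas-datareader interface):

import yfinance as yf

# download a short window for all symbols at once and list the tickers that came back
symbols = ['BAS.DE', 'SPY', '^FTSE', '^STOXX', 'EURUSD=X', 'CL=F', 'NG=F', 'TTF=F']
check = yf.download(symbols, start='2019-01-01', end='2019-02-01', group_by='ticker')
print(check.columns.get_level_values(0).unique())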

Econometric data sourcing

The Federal Reserve Bank of St. Louis offers a wealth of datasets with labor market data, interest rate and government bond rate time series, among many others. As such, it provides a good foundation for data-driven decision making. For my project, I chose the following time series:

Symbol  Name                                       Description
DFF     Federal Funds Effective Rate               Major factor on the world economy, the USD interest rate
DBAA    Moody’s Seasoned Baa Corporate Bond Yield  Proxy for risk in the US economy

The data can be retrieved using pdr.FredReader, in analogy to the general pdr.DataReader.
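
For example, the two series above can be pulled like this (a small sketch; FredReader returns a reader object whose read() method yields the DataFrame):

fred = pdr.FredReader(['DFF', 'DBAA'], start='2011-01-01', end='2019-12-31').read()
print(fred.head())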

Collecting the data

I collected the data for the period from the start of 2011 to the end of 2019 into a single DataFrame:

import numpy as np

start_time = '2011-01-01'
end_time = '2019-12-31'
fields_to_copy = ['Adj Close', 'Volume', 'Volatility', 'Movement']
# data from Yahoo! Finance
symbols_to_correlate = [
  'BAS.DE',
  'SPY', '^FTSE', '^STOXX',
  'EURUSD=X',
  'CL=F', 'NG=F', #'TTF=F' TTF=F is only available from 2014 onwards
]

# data from the Federal Reserve Bank of St. Louis (FRED)
fred_symbols = [
  'DFF', 'DBAA',
]

def add_movement(data):
    """Relative intraday change, normalized by the mean of Open and Close."""
    return (data['Close'] - data['Open']) / (data['Open'] + data['Close']) * 2

def add_volatility(data):
    """Intraday price range, normalized by the mean of Open and Close."""
    return np.abs(data['High'] - data['Low']) / (data['Open'] + data['Close']) * 2

def scrape_data(reader, symbol, start_time, end_time, fields_to_copy=()):
    """Helper function to retrieve data and rename fields for a given time span."""

    # get the data
    data = reader(symbol, start_time, end_time)

    if reader is pdr.DataReader:
        # rename the columns of interest to include the symbol
        renamedict = {
            fieldname: fieldname + ' ' + symbol
            for fieldname in fields_to_copy
        }

        data['Movement'] = add_movement(data)
        data['Volatility'] = add_volatility(data)

        data.rename(columns=renamedict, inplace=True)

        # only return the selected fields
        return data[list(renamedict.values())]

    # FRED readers return a reader object; read() yields the DataFrame
    return data.read()

# Execute scraping for yfinance
scraped_data = [
    scrape_data(pdr.DataReader, symbol, start_time, end_time, fields_to_copy)
    for symbol in symbols_to_correlate
]

# Acquire data from FRED and append it to the collection
fred_data = scrape_data(pdr.FredReader, fred_symbols, start_time, end_time)
for key in list(fred_data.columns):
  fred_data['Diff ' + key] = fred_data[key].diff()  # day-to-day change as an extra feature
scraped_data.append(fred_data)

# Join data on the date index
resx = scraped_data.pop()
while len(scraped_data) > 0:
    resx = resx.join(scraped_data.pop())
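
Before moving on, a quick sanity check on the merged frame (a small sketch) shows its shape, the covered date range, and the columns with the most missing values, which hints at how much cleanup is needed:

print(resx.shape)
print(resx.index.min(), resx.index.max())
print(resx.isna().sum().sort_values(ascending=False).head())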

Data preparation and feature engineering

Now that the columns in resx contain the aforementioned data, let’s prepare the data for training.

Dealing with incomplete data: interpolation

Considering the abundance of holidays (and the non-overlapping trading calendars of the different exchanges), many rows are incomplete. We can simply interpolate and drop the leading and trailing rows where interpolation is not possible:

resx = resx.sort_index().interpolate().dropna()

This, of course, introduces errors and should ideally be avoided.
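
If the look-ahead that interpolation introduces is a concern, a forward fill is a common alternative (a sketch, not what the rest of this post uses):

# carry the last observed value forward instead of interpolating between neighbors
resx_ffill = resx.sort_index().ffill().dropna()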

Capturing seasonality

The seasonality of business plays a big role. It can be captured by adding features that encode the periodicity of the year, as well as the day of the week and the month.

import numpy as np

def seasonalize_date(x, periodic_func=np.sin):
    # map the date to a phase within the year: month plus an approximate fraction of the month
    return periodic_func(((x.month - 1) + (x.day - 1) / 29.5) / 12 * np.pi * 2)

resx['season-sin'] = [
  seasonalize_date(x, periodic_func=np.sin) for x in resx.index]
resx['season-cos'] = [
  seasonalize_date(x, periodic_func=np.cos) for x in resx.index]
resx['weekday'] = [x.weekday() for x in resx.index]
resx['month'] = [x.month for x in resx.index]

Normalizing the data (optional)

Depending on where your project may take you, normalizing the data may be required. Scikit-learn includes a simple way to do this using the StandardScaler, which

  • subtracts the mean $\mu$ and
  • divides by the standard deviation $\sigma$ in a column-wise manner:

from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
resx_scaled = pd.DataFrame(
    scaler.fit_transform(resx), columns=resx.columns, index=resx.index
)

The mean is saved as scaler.mean_ and the variance as scaler.var_. Further data points can be scaled using scaler.transform, and the scaling can be undone with scaler.inverse_transform.
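
As a small usage sketch, the fitted scaler can be applied to further rows and the scaling undone again, which is useful later for converting model outputs back to the original units:

new_scaled = scaler.transform(resx.iloc[-5:])     # scale a few (here: the last five) rows
recovered = scaler.inverse_transform(new_scaled)  # back in the original units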

Train–validation–test splitting

For training any machine-learning model, performing a train–validation–test split is necessary. Randomly shuffled splits such as those produced by scikit-learn's train_test_split are not appropriate for time series data, because they leak future information into the training set. For simplicity, we do a chronological 70%-20%-10% split:

n = len(resx_scaled)
train_df = resx_scaled.iloc[:int(0.7*n)]
val_df = resx_scaled.iloc[int(0.7*n):int(0.9*n)]
test_df = resx_scaled.iloc[int(0.9*n):]

Note that in the case of time series, this split already introduces a bias towards worse validation and test scores because the most recent trends are not captured during training. After validating a model structure using this split, it is highly advisable to retrain it using the most up-to-date data available.

Data handling

Repeatedly scraping large amounts of data is not looked upon favorably by free services such as Yahoo! Finance. Thus, I recommend saving the data after scraping and re-reading it from disk rather than scraping again.

Data saving

Saving the data to disk is rather straightforward. I recommend the HDF format because it allows multiple sets of structured data to be stored efficiently in a single file (note that pandas' HDF support requires the PyTables package).

output_filename = 'data.hdf'
resx.to_hdf(output_filename, key='unscaled')
resx_scaled.to_hdf(output_filename, key='scaled')
train_df.to_hdf(output_filename, key='training')
val_df.to_hdf(output_filename, key='validation')
test_df.to_hdf(output_filename, key='test')

Data loading

output_filename = 'data.hdf'
resx = pd.read_hdf(output_filename, key='unscaled')
resx_scaled = pd.read_hdf(output_filename, key='scaled')
train_df = pd.read_hdf(output_filename, key='training')
val_df = pd.read_hdf(output_filename, key='validation')
test_df = pd.read_hdf(output_filename, key='test')