Scraping stock prices and econometric data for machine learning projects
Introduction
Time series prediction is less flashy than natural language processing or computer vision, but it is no less important a target for machine learning. In this blog post, I document how to collect stock and econometric data for a data science project and how to clean it for later training.
Data acquisition
An easy source for historical financial data is Yahoo! Finance, which can be accessed through the yfinance and pandas-datareader packages.
We can install the packages using conda or pip:
conda install yfinance pandas-datareader -c conda-forge
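or, equivalently, with pip:
pip install yfinance pandas-datareader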
I use the pandas-datareader interface to access the financial data. Besides Yahoo! Finance, it also offers a variety of sources for econometric data, such as GDP and EuroStat statistics. As of February 2023, the built-in Yahoo! Finance module of pandas-datareader is broken; the workaround is to let yfinance patch it after importing both packages:
from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()
Stock and commodity data
Historical data for a single stock such as BAS.DE can be acquired for a given time interval by
quote = pdr.get_data_yahoo("BAS.DE", '2012-01-01', '2012-01-04')
Date | Open | High | Low | Close | Adj Close | Volume
---|---|---|---|---|---|---
2012-01-02 | 53.62 | 55.30 | 53.62 | 55.14 | 31.7988 | 3,446,260
2012-01-03 | 55.79 | 56.79 | 55.58 | 56.35 | 32.4966 | 4,003,760
2012-01-04 | 55.93 | 56.57 | 55.81 | 56.16 | 32.387 | 2,712,960
The return value is a DataFrame that is indexed by the date (a pandas DatetimeIndex) and contains the following data for each day:
- Open: the price at the opening of trading that day
- High: the highest price during that day
- Low: the lowest price during that day
- Close: the price at the end of that day
- Adj Close: the price at the end of that day, corrected for dividends that have been paid out and stock splits that change the stock price but do not change market capitalization
- Volume: number of shares traded that day
Based on this information, the relative price change over the trading day might be a relevant indicator; let us call it movement:
$ \mathrm{Movement} = 2 \frac{\mathrm{Close}-\mathrm{Open}}{\mathrm{Open}+\mathrm{Close}}$
The relative range of prices during the day can serve as an estimate for the volatility, which, together with the volume, indicates how actively the price moved:
$ \mathrm{Volatility} = 2 \frac{\mathrm{High}-\mathrm{Low}}{\mathrm{Open}+\mathrm{Close}}$
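For the first trading day in the table above, this gives $\mathrm{Movement} = 2\,(55.14-53.62)/(53.62+55.14) \approx 0.028$ and $\mathrm{Volatility} = 2\,(55.30-53.62)/(53.62+55.14) \approx 0.031$, i.e. the stock gained about 2.8% intraday while moving within a range of about 3.1% of its price.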
Looking at the data, it becomes apparent that the first of January was not a trading day and therefore does not have any price data. We will surely have to do some cleanup in the next section.
To get a good estimate for the change of a stock price, one also needs indicators for external factors that affect the company's performance. I chose general stock indices, exchange rates, and the prices of commodities that are required for the value chain:
Symbol | Name | Description |
---|---|---|
BAS.DE | BASF SE stock price | Target variable |
SPY | S&P 500 stock index | Stock index focused on US economy |
^FTSE | FTSE stock index | Stock index focused on European economy, including large British multinationals |
^STOXX | STOXX 600 stock index | Stock index focused on European economy |
EURUSD=X | EUR/USD exchange rate | Exchange rate affects the effective price of commodities and business environments |
CL=F | Crude oil future | Essential commodity |
NG=F | Natural gas future, US | Essential commodity for US sites |
TTF=F | Natural gas future, Europe | Essential commodity for European sites |
For these, the adjusted closing price and volume seem to be the most reasonable choices for training an ML model.
Econometric data sourcing
The Federal Reserve Bank of St. Louis offers a wealth of datasets through its FRED database: labor market data, interest rates, and government bond yields, among many others. As such, it provides a good foundation for data-driven decision making. For my project, I chose the following time series:
Symbol | Name | Description |
---|---|---|
DFF | Federal Funds Effective Rate | Major factor on world economy, USD interest rate |
DBAA | Moody’s Seasoned Baa Corporate Bond Yield | Proxy for risk in the US economy |
The data can be retrieved using pdr.FredReader, in analogy to the general pdr.DataReader.
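As a minimal sketch (using the imports from above), a FRED series is requested by constructing a FredReader and calling its read() method:
# Retrieve two FRED series for January 2012 as a DataFrame indexed by date
fred = pdr.FredReader(['DFF', 'DBAA'], start='2012-01-01', end='2012-01-31').read()
print(fred.head())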
Collecting the data
I collected data for the years 2011 through 2019 in a single DataFrame:
start_time = '2011-01-01'
end_time = '2019-12-31'
fields_to_copy = ['Adj Close', 'Volume', 'Volatility', 'Movement']
# data from Yahoo! Finance
symbols_to_correlate = [
'BAS.DE',
'SPY', '^FTSE', '^STOXX',
'EURUSD=X',
'CL=F', 'NG=F', #'TTF=F' TTF=F is only available from 2014 onwards
]
# data from FRED (Federal Reserve Bank of St. Louis)
fred_symbols = [
'DFF', 'DBAA',
]
import numpy as np

def add_movement(data):
    return (data['Close'] - data['Open']) / (data['Open'] + data['Close']) * 2

def add_volatility(data):
    return np.abs(data['High'] - data['Low']) / (data['Open'] + data['Close']) * 2
def scrape_data(reader, symbol, start_time, end_time, fields_to_copy=()):
    """Helper function to retrieve data and rename fields for a given time span."""
    # get the data
    data = reader(symbol, start_time, end_time)
    if reader == pdr.get_data_yahoo:
        data['Movement'] = add_movement(data)
        data['Volatility'] = add_volatility(data)
        # rename the columns of interest to include the symbol
        renamedict = {
            fieldname: fieldname + ' ' + symbol
            for fieldname in fields_to_copy
        }
        data.rename(columns=renamedict, inplace=True)
        # only return the selected fields
        return data[list(renamedict.values())]
    # FRED readers return a reader object whose read() yields the DataFrame
    return data.read()
# Execute scraping for the Yahoo! Finance symbols
scraped_data = [
    scrape_data(pdr.get_data_yahoo, symbol, start_time, end_time, fields_to_copy)
    for symbol in symbols_to_correlate
]
# Acquire data from FRED and append it to the collection
fred_data = scrape_data(pdr.FredReader, fred_symbols, start_time, end_time)
# add the day-to-day change of each rate as an additional feature
for key in list(fred_data.columns):
    fred_data['Diff ' + key] = fred_data[key].diff()
scraped_data.append(fred_data)
# Join data on the date index
resx = scraped_data.pop()
while len(scraped_data) > 0:
    resx = resx.join(scraped_data.pop())
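Equivalently, the list of DataFrames could be combined in a single call with pd.concat, which aligns all frames on their shared date index. A small sketch as an alternative to the loop above (note that concat performs an outer join, so the resulting set of dates can differ slightly):
import pandas as pd
# align all scraped DataFrames on their date index in one call
resx = pd.concat(scraped_data, axis=1)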
Data preparation and feature engineering
Now that the columns in resx contain the aforementioned data, let's prepare the data for training.
Dealing with incomplete data: interpolation
Holidays and other non-trading days leave gaps in the joined data. We can simply interpolate and drop the first and last rows where interpolation is not possible:
resx = resx.sort_index().interpolate().dropna()
This, of course, introduces errors and should ideally be avoided.
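If the look-ahead introduced by interpolation is a concern, a forward fill that only propagates the last known value is one possible alternative (a sketch; the leading rows still need to be dropped):
# forward-fill only uses past values, avoiding look-ahead from interpolation
resx = resx.sort_index().ffill().dropna()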
Capturing seasonality
The seasonality of business plays a big role. It can be captured by adding features that encode the position within the year, as well as the day of the week and the month.
import numpy as np

def seasonalize_date(x, periodic_func=np.sin):
    # encode the approximate position within the year as a point on a sine/cosine wave
    return periodic_func(((x.month - 1) + (x.day - 1) / 29.5) / 12 * np.pi * 2)

resx['season-sin'] = [seasonalize_date(x, periodic_func=np.sin) for x in resx.index]
resx['season-cos'] = [seasonalize_date(x, periodic_func=np.cos) for x in resx.index]
resx['weekday'] = [x.weekday() for x in resx.index]
resx['month'] = [x.month for x in resx.index]
Normalizing the data (optional)
Depending on where your project may take you, normalizing the data may be required.
Scikit-learn includes a simple way to do this with the StandardScaler, which
- subtracts the mean $\mu$ and
- divides by the standard deviation $\sigma$ in a columnwise manner:
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
resx_scaled = pd.DataFrame(
    scaler.fit_transform(resx), columns=resx.columns, index=resx.index  # keep the date index
)
The mean is saved as scaler.mean_ and the variance as scaler.var_. Further data points can be scaled using scaler.transform, and the scaling can be undone using scaler.inverse_transform.
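As a small illustration (using the objects defined above), the scaling can be applied and undone like this:
import numpy as np
# scale the data with the fitted parameters and invert the scaling again
rescaled = scaler.transform(resx)
restored = scaler.inverse_transform(rescaled)
assert np.allclose(restored, resx.to_numpy())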
Train–validation–test splitting
For training any machine-learning model, a train–validation–test split is necessary. Randomly shuffled splits such as scikit-learn's train_test_split are not appropriate for time series because they leak future information into the training set; the split has to be chronological. For simplicity, we do a 70%-20%-10% split:
n = len(resx_scaled)
train_df = resx_scaled.iloc[:int(0.7*n)]
val_df = resx_scaled.iloc[int(0.7*n):int(0.9*n)]
test_df = resx_scaled.iloc[int(0.9*n):]
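A quick sanity check that the splits are chronological and non-overlapping (assuming the date index was preserved when resx_scaled was built):
# print the date range and size of each split
for name, df in [('train', train_df), ('validation', val_df), ('test', test_df)]:
    print(f"{name}: {df.index.min()} to {df.index.max()} ({len(df)} rows)")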
Note that in the case of time series, this already introduces a bias towards worse validation and test scores because recent trends are not captured. After validating a model structure using this split, it is highly advisable to retrain it using the most up-to-date data available.
Data handling
Repeatedly scraping large volumes of data is not looked upon favorably by free services such as Yahoo! Finance. I therefore recommend saving the data once after scraping and re-reading it from disk rather than scraping again.
Data saving
Saving the data to disk is rather straightforward. I recommend the HDF5 format because it can store multiple sets of structured data efficiently in a single file (pandas' HDF support requires the PyTables package, available as pytables on conda-forge and tables on PyPI).
output_filename = 'data.hdf'
resx.to_hdf(output_filename, key='unscaled')
resx_scaled.to_hdf(output_filename, key='scaled')
train_df.to_hdf(output_filename, key='training')
val_df.to_hdf(output_filename, key='validation')
test_df.to_hdf(output_filename, key='test')
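To verify what ended up in the file, the stored keys can be listed; a small sketch using pandas' HDFStore:
# list all datasets stored in the HDF5 file
with pd.HDFStore(output_filename, mode='r') as store:
    print(store.keys())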
Data loading
output_filename = 'data.hdf'
resx = pd.read_hdf(output_filename, key='unscaled')
resx_scaled = pd.read_hdf(output_filename, key='scaled')
train_df = pd.read_hdf(output_filename, key='training')
val_df = pd.read_hdf(output_filename, key='validation')
test_df = pd.read_hdf(output_filename, key='test')