CMSC320 Final Project
Name: Ryan Downing
Due: May 18, 2020
In the past few decades, the world of finance has been transformed by data. Now more than ever, companies are utilizing big data in an attempt to predict the future. More specifically, with regard to the stock market, investors want to predict whether a stock's share price will appreciate or depreciate in the future.
This tutorial will walk through the data science process using data from: https://www.quandl.com/tables/SF1/SHARADAR-SF1
We will explore how we can use visualizations to help investors understand market conditions, and then apply more advanced machine learning techniques to predict future performance. In particular, this tutorial will compare the prediction performance of a multivariate logistic regression against that of a random forest classifier.
import datetime
import requests
import psycopg2
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
SEED = 1234
np.random.seed(SEED)
CONNECTION_PARAMS = {
'database': ## REDACTED ##
'user': ##################
'password': ##############
'host': ##################
'port': #### REDACTED ####
}
conn = psycopg2.connect(**CONNECTION_PARAMS)
cur = conn.cursor()
The data used in this tutorial comes from a SQL database as defined above. To concentrate the scope of this analysis, we will be looking at the companies that are currently in the S&P 500 (taken from Wikipedia). In addition, we will be looking at data between January 1st, 2005 and December 31st, 2019. This gives us 15 years of data to work with, of which 12 years (2005-2016) will be used for training the model, and 3 years (2017-2019) will be reserved to test the model.
Each entry in the database represents an SEC 10-Q filing for a company, identified by its ticker. The attributes represent the different data elements of the SEC filing. For this analysis, we will focus on those given by the income statement, statement of cash flows, and balance sheet. We will also look at many of the common metrics used by investors to assess companies. A more thorough documentation of these attributes can be found here.
The code below constructs a SQL query to pull the necessary data from the database. To do this we must first pull the table of S&P 500 companies from the Wikipedia page, and then select the Symbol column, which gives us the ticker of each company. We must then create a list of all the attributes we want to pull from the database, features.
Since we are interested in the change of share price for a given ticker over time, I also add datekey, ticker, and price to the attributes list. We can then build a string query for the SQL database by joining the ticker conditions with OR and the date conditions with AND, and then finally combine both of these with AND for the WHERE clause of the query.
The database also needs to know the list of parameters these conditions refer to, so we can simply build a list by combining the tickers with the start and end dates. Now we can create the full query using the SELECT, FROM, WHERE template.
Note: dimension = 'ARQ' refers to "as-reported quarterly", which avoids future bias; more information can be found here.
table = 'fundamental_raw' # table name in database
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
sp_tickers = pd.read_html(url)[0]['Symbol'].unique() # Gets table from wikipedia and selects unique tickers
start_date = datetime.datetime(year=2005, month=1, day=1)
end_date = datetime.datetime(year=2019, month=12, day=31)
# Income statement
IS = ['revenue', 'cor', 'sgna', 'rnd', 'opex',
'intexp', 'taxexp', 'ebit', 'netinc', 'eps']
# Statement of Cash flows
CF = ['capex', 'ncfbus', 'ncfinv', 'ncff',
'ncfdebt', 'ncfcommon', 'ncfdiv', 'ncfi',
'ncfo', 'ncfx', 'ncf', 'sbcomp', 'depamor']
# Balance Sheet
BS = ['assets', 'cashneq', 'investments', 'investmentsc',
'investmentsnc', 'deferredrev', 'deposits','ppnenet',
'inventory', 'taxassets', 'receivables', 'payables',
'intangibles', 'liabilities', 'equity', 'retearn',
'assetsc', 'assetsnc', 'liabilitiesc', 'liabilitiesnc',
'taxliabilities', 'debt', 'debtc', 'debtnc']
# Common Metrics
MT = ['marketcap', 'ev', 'fcf', 'gp', 'opinc', 'grossmargin',
'netmargin', 'ebitdamargin', 'payoutratio', 'evebitda',
'evebit', 'pe', 'sps', 'ps', 'pb', 'de', 'divyield',
'currentratio', 'workingcapital', 'fcfps', 'bvps']
features = IS + CF + BS + MT # create a single list of all the features we are interested in
attributes = ['datekey', 'ticker', 'price'] + features # Create list of all attributes
ticker_conds = ' OR '.join(['ticker = %s' for ticker in sp_tickers]) # query for each ticker
date_conds = 'datekey >= %s AND datekey <= %s' # date query
query_conds = f'({ticker_conds}) AND ({date_conds}) AND (dimension = \'ARQ\')' # combine to create full "WHERE" query
attrs_query = ', '.join(attributes)
params = list(sp_tickers) + [start_date, end_date] # create list of parameters for SQL query
fundamental_query = f'SELECT {attrs_query} FROM {table} WHERE {query_conds}' # create full query
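For concreteness, the assembled query has the general shape sketched in the comments below; only a couple of ticker placeholders are shown, while the real query contains one ticker = %s condition per S&P 500 constituent.
# Illustrative only: peek at the start of the generated query string.
# The full query has the form:
#   SELECT datekey, ticker, price, revenue, ... FROM fundamental_raw
#   WHERE (ticker = %s OR ticker = %s OR ...) AND (datekey >= %s AND datekey <= %s) AND (dimension = 'ARQ')
print(fundamental_query[:100] + ' ...')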
Now that we have the full query constructed, we can execute it against the database using cur.execute. The response gives us the full dataset, which we can simply put into a pandas DataFrame, using the column names received from cur.description.
# Execute query and store into dataframe
cur.execute(fundamental_query, params)
response = cur.fetchall()
col_names = [desc.name for desc in cur.description]
fund = pd.DataFrame(data=response, columns=col_names)
fund.head()
To get this data into a usable format for analysis, we must do some cleaning to prepare it. Fortunately, this database is already in a fairly clean format. However, we must convert the datekey column to the datetime datatype, and then rename it date. Now that we have this, we can easily extract the year using dt.year to create a year column.
Since companies have the ability to post corrections to their SEC filings, there is a chance that a company has multiple reports for a single date. To correct for this, we should drop duplicate rows with respect to ticker and date.
fund['datekey'] = pd.to_datetime(fund['datekey'])
fund.rename(columns={'datekey': 'date'}, inplace=True)
fund['year'] = fund['date'].dt.year
fund.drop_duplicates(subset=['date', 'ticker'], keep='last', inplace=True)
fund.head()
Since we are not concerned with the share price itself, but rather the change in share price, we must calculate the percent change in price between report periods for each ticker. To do this we must first ensure the data is sorted by date so the percent change is calculated between consecutive periods. Then we can group by ticker and use pandas' pct_change() method to calculate the price change for each ticker.
Calculating the change results in a NaN in the first row for each ticker, so we should drop these rows. Then we transform the price_change into a binary up or down signal using np.where.
fund.sort_values('date', inplace=True)  # ensure chronological order so changes are between consecutive periods
fund['price_change'] = fund.groupby('ticker')['price'].pct_change()  # period-over-period change per ticker
fund.dropna(subset=['date', 'ticker', 'price', 'price_change'], inplace=True)  # drops each ticker's first row (NaN change)
fund['movement'] = np.where(fund['price_change'] >= 0, 'up', 'down')  # binary up/down label
fund.head()
Since we are interested in creating a model to predict future price movement, let's first take a look at the data we are working with. To get an understanding of how price_change is distributed, we can create distplots faceted on year to see the varying distributions across years. We can immediately see that price_change tends to be normally distributed with a mean just above 0. There are exceptions to this, as we see in 2008 and 2009 (during the Great Financial Crisis), when the stock market experienced a strong decline as well as increased volatility, as seen by the increased spread.
g = sns.FacetGrid(fund, col='year', col_wrap=5) # create facet grid
g.map(sns.distplot, 'price_change') # plot yearly histograms
plt.xlim(-1, 1)
plt.show()
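Note that distplot is deprecated in seaborn 0.11 and later; a minimal equivalent sketch using the figure-level displot function (assuming the same fund dataframe) is:
# Equivalent faceted histograms with a KDE overlay for seaborn >= 0.11
g = sns.displot(fund, x='price_change', col='year', col_wrap=5, kde=True)
g.set(xlim=(-1, 1))
plt.show()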
We can also take a look at the price trends for individual stocks, or a group of stocks, as shown below. To do this we must group by ticker once again and take a cumulative sum of the price changes for each ticker. Now that we have the cumulative returns stored in a new column, we can easily plot them across dates. I encourage you to change ticker_subset to look at different companies' performance.
ticker_subset = ['FB', 'AMZN', 'NFLX', 'GOOGL', 'AAPL']
df = fund[['date', 'ticker', 'price_change']].copy() # make a copy for plotting
df = df[df['ticker'].isin(ticker_subset)] # select only interested tickers
df['returns'] = df.groupby('ticker')['price_change'].cumsum() # calculate cumulative returns
plt.rcParams['figure.figsize'] = [12, 6] # adjust plot size
sns.lineplot(data=df, x='date', y='returns', hue='ticker')
plt.show()
Because our model is only concerned with whether the price moved up or down and not by how much, we should take a look at how our movement variable changes over time. To do this we have to group by year and then calculate the proportion of up movements in each year. Since the total number of movements is given by x.shape[0], we can just sum all the occurrences of up and then divide.
In the plot we can see that between 2005 and 2019, the proportion of up movements fluctuates between 50 and 75 percent, with the exception of 2008, which only experienced about 30% up movements.
yearly_movement = fund.groupby('year')['movement'].apply(lambda x: (x == 'up').sum() / x.shape[0])
yearly_movement.plot()
We can take a more detailed look at this by plotting the individual price changes across time. Since this results in a lot of data points, I take a 10% sample from the original dataset to use for plotting. Furthermore, we can create a cut function to label the market capitalization (marketcap) of each data point into one of three equal groups: small cap, medium cap, and large cap.
The plot supports what we concluded previously. The distribution appears to be normal, centered slightly above 0. We can also see the drawdown from 2008-2010 in more detail, as many companies experienced negative price changes.
def cut(x, bins, labels=None):
"""Splits x into [bins] equal groups"""
percentiles = np.arange(0, 100, 100 / bins)
cutoffs = np.nanpercentile(x, percentiles) # value of each percentile in x
out = np.digitize(x, cutoffs) # value of each group the respective x value falls in
if labels: # convert bins to label names if labels are given
lab_dict = dict(zip(range(1, bins+1), labels))
return pd.Series(out).replace(lab_dict).values
else:
return out
fund['mktcap_bin'] = cut(fund['marketcap'], 3, ['small cap', 'medium cap', 'large cap'])
plot_data = fund.sample(fund.shape[0] // 10, random_state=SEED) # select 10% of data points
sns.scatterplot(x='date', y='price_change', hue="mktcap_bin", alpha=.5,
data=plot_data, hue_order=['small cap', 'medium cap', 'large cap'])
del plot_data # no longer needed, clear from memory
plt.xlim(start_date, end_date)
plt.ylim(-1, 1)
plt.show()
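As an aside, pandas' built-in qcut performs a similar quantile-based binning in a single call. The sketch below is an alternative to the custom cut function above, not part of the original pipeline, and its handling of edge cases (such as NaN marketcaps) differs slightly.
# Alternative: quantile-based binning with pandas (rows with NaN marketcap receive NaN labels)
fund['mktcap_bin_alt'] = pd.qcut(fund['marketcap'], q=3,
                                 labels=['small cap', 'medium cap', 'large cap'])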
Now that we have assigned a bin for the marketcap of each stock, let's take a closer look at how the price_change distribution varies between groups. We can do this by creating violin plots for each marketcap bin where the y-axis is the price_change. This graph shows us that while the mean price_change is about the same for each group, the spread decreases as marketcap increases. This makes sense because we expect larger companies to have lower volatility, which can be attributed to larger trading volumes, clearer news coverage of large companies, or many other factors.
# sort by mktcap_bin so the x-axis is in order: small, medium, large
sns.violinplot(data=fund.sort_values('mktcap_bin', ascending=False), inner='quartile', x='mktcap_bin', y='price_change')
plt.ylim(-1, 1)
plt.show()
Lastly, we can gain some idea of the relationships between features by constructing a correlation matrix. For this example I will just look at the balance sheet attributes, in addition to the price. Still focusing on change, we calculate the pct_change of each attribute, grouping first by ticker and then selecting the relevant attributes, attrs. Then we call pandas' corr() method to create a correlation matrix and use plt.matshow to plot it.
This matrix is not very telling, but begins to give us an idea of how the different features are related. We are most concerned with the correlations with price, as seen in the last row and column. There are not any strong correlations, so perhaps combining some of these features with machine learning will be more useful. We do see stronger correlations where we would expect them, between the different asset and liability classes, which should trend together.
attrs = BS + ['price']
correlations = fund.groupby('ticker')[attrs].pct_change().corr()
plt.rcParams['figure.figsize'] = [15, 15]
plt.matshow(correlations)
plt.xticks(range(correlations.shape[1]), correlations.columns, rotation=90, fontsize=12)
plt.yticks(range(correlations.shape[1]), correlations.columns, fontsize=12)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.show()
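The same matrix can also be rendered with seaborn's heatmap, which adds a diverging color scale centered at zero; this is just an alternative sketch of the plot above.
# Alternative rendering of the same correlation matrix
plt.figure(figsize=(15, 15))
sns.heatmap(correlations, cmap='coolwarm', center=0, square=True)
plt.show()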
From our previous exploratory data analysis (EDA), we found that there may be a relationship between the standard deviation of a company's price changes and that company's marketcap. We can test this hypothesis using a linear fit against the relevant data. First we summarize each ticker, taking the mean marketcap and the standard deviation of price changes. We also take the log of the marketcap to even out the distribution, since there are many small companies and only a few very large ones.
ticker_group = fund.groupby('ticker')
ticker_data = pd.DataFrame(index=fund['ticker'].unique()) # create new dataframe to hold data for each ticker
ticker_data['ticker'] = ticker_data.index
ticker_data['std'] = ticker_group['price_change'].std()
ticker_data['marketcap'] = ticker_group['marketcap'].mean()
ticker_data['log_mktcap'] = np.log(ticker_data['marketcap'].values)
ticker_data.dropna(inplace=True)
We can plot this relationship using seaborn's lmplot, which shows that there is a slight negative relationship between these factors.
sns.lmplot(data=ticker_data, x='log_mktcap', y='std', aspect=2, scatter_kws={'alpha': .25})
plt.ylim(0, .5)
plt.show()
We can also use statsmodels to see the statistics of this fit. While the R-squared value is poor, accounting for only 1.8% of the variance, the $p$-value $= .003 \leq \alpha = .05$, which means there is a statistically significant relationship between the (log) marketcap of a company and its share price's volatility.
import statsmodels.formula.api as smf
model = smf.ols('std~log_mktcap', data=ticker_data)
results = model.fit()
results.summary()
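The specific numbers cited above can also be pulled directly from the fitted results object, as in the short sketch below; the same values appear in the summary table.
# Pull the R-squared and the slope's p-value directly from the fit
print(f"R-squared: {results.rsquared:.3f}")
print(f"p-value (log_mktcap): {results.pvalues['log_mktcap']:.3f}")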
Now that we have done some basic hypothesis testing, we can move into more advanced machine learning techniques. To do this we must first convert the movement label to a binary [1, 0] score so it can be used in our models. As previously mentioned, we split the data at 2017/01/01 into our training set and our out-of-sample (oos) testing set.
Since we want to predict future price movement, we shift the movement column up by 1 so it refers to the next change in share price.
data = fund.copy()
data['movement'] = data.groupby('ticker')['movement'].shift(-1) # movement now associated with future movement
data['mvmt_binary'] = np.where(data['movement'] == 'up', 1, 0)
# Split training and testing set at January 1st, 2017
predictor_df = data[data['date'] < datetime.datetime(year=2017, month=1, day=1)]
oos_df = data[data['date'] >= datetime.datetime(year=2017, month=1, day=1)]
# Split into X and Y and convert to numpy matrices
X = predictor_df[features].values
y = predictor_df['mvmt_binary'].values
X_oos = oos_df[features].values
y_oos = oos_df['mvmt_binary'].values
We will be comparing the prediction performance of a logistic regression and a random forest, so first we import all the necessary modules from sklearn.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import auc
For the logistic regression we need to adjust the max iterations to be the length of the dataset to ensure the model converges. For the random forest I will be using 250 decision trees with a max depth of 10 and also require at least 10 samples to split at a decision node.
Since the data is not perfectly clean, we will use sklearn's SimpleImputer to fill NaN values with the mean of that feature. We fit the imputer on the training set with fit_transform and apply the same fitted imputation to the out-of-sample set with transform, so that no information from the test data leaks into the preprocessing. Once the models are fitted on the cleaned training data, we use predict_proba to score our out-of-sample data against each model.
# Create classification models
log_reg = LogisticRegression(max_iter=X.shape[0])
rf = RandomForestClassifier(n_estimators=250, max_depth=10,
                            min_samples_split=10, max_features="sqrt")  # consider sqrt(n_features) candidates per split
imp = SimpleImputer(missing_values=np.nan, strategy='mean') # object for cleaning data
# Clean training and testing sets
train_clean = imp.fit_transform(X)
test_clean = imp.transform(X_oos)  # reuse the imputation fitted on the training set (avoids test-set leakage)
# Fit models
log_reg.fit(train_clean, y)
rf.fit(train_clean, y)
# Get scores of models
log_scores = log_reg.predict_proba(test_clean)[:,1]
rf_scores = rf.predict_proba(test_clean)[:,1]
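To get a quick sense of how the two sets of scores compare, one option is the ROC AUC on the out-of-sample set. The sketch below uses roc_curve together with the auc function imported earlier; it is only a rough first check, not a full evaluation.
from sklearn.metrics import roc_curve
# Quick comparison: ROC AUC of each model's out-of-sample scores
fpr_log, tpr_log, _ = roc_curve(y_oos, log_scores)
fpr_rf, tpr_rf, _ = roc_curve(y_oos, rf_scores)
print(f"Logistic regression AUC: {auc(fpr_log, tpr_log):.3f}")
print(f"Random forest AUC: {auc(fpr_rf, tpr_rf):.3f}")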
We can look at the coefficients from the logistic regression and the feature importances from the random forest to understand how the different models are weighting the various features.
def pretty_dict(val, per_row=5):
"""Prints dict items to stdout where per_row refers
to the number of items printed on each line"""
out = ""
for i, (k, v) in enumerate(val.items(), 1):
out += f"'{k}': {v}, "
if i % per_row == 0:
out += '\n'
return out
# Create dict of features' coefficients, sort, and format for printing
log_coefs = dict(zip(features, log_reg.coef_[0]))
log_coefs = sorted(log_coefs.items(), key=lambda item: item[1], reverse=True)
log_coefs = {k: '{:.3e}'.format(v) for k, v in log_coefs}
# Create dict of features' importances, sort, and format for printing
rf_importances = dict(zip(features, rf.feature_importances_))
rf_importances = sorted(rf_importances.items(), key=lambda item: item[1], reverse=True)
rf_importances = {k: round(v, 3) for k, v in rf_importances}
print(f"Logistic Regression Coefficients:\n{pretty_dict(log_coefs, 4)}")
print(f"Random Forest Feature Importances:\n{pretty_dict(rf_importances, 5)}")
We can also export one of the decision trees from the random forest to an image to see its actual decision paths.
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydot
estimator = rf.estimators_[0]
# Export decision tree as dot file
export_graphviz(estimator, out_file='tree.dot',
feature_names = features,
class_names = None,
rounded = True, proportion = False,
precision = 2, filled = True,
max_depth=3)
# Use pydot to convert dot file to png
(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
Image(filename='tree.png') # display image
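If graphviz and pydot are not available, sklearn's built-in plot_tree (available since sklearn 0.21) can draw the same estimator directly with matplotlib; a minimal sketch:
from sklearn.tree import plot_tree
# Draw the first few levels of the same estimator without the graphviz dependency
plt.figure(figsize=(20, 10))
plot_tree(estimator, feature_names=features, max_depth=3,
          filled=True, rounded=True, precision=2)
plt.show()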