Summary of findings

One issue with using aggregated regular season statistics in an NCAA tournament model is that they don't account for strength of schedule. We can use embeddings to extend pairwise-comparison methods, like the Bradley-Terry model, to richer metrics than just wins and losses. These methods assume static team skill, but qualitatively they appear more robust than aggregated statistics.

Update: See this post for a simple comparison of tournament models built with the embeddings and aggregated point data. The embeddings provide an uplift over the data they were trained on. More testing is needed to see whether embeddings prove useful when extended to advanced statistics.

Note: This work was inspired by this Kaggle notebook, which is the only Kaggle notebook I've seen applying embeddings to the NCAA tournament. Here is a second example of embeddings used for baseball (a great place to start if you are not familiar with embeddings), and the concept could easily be applied to other sports.

Introduction

I expected to compete in my first Kaggle competitions this March - both the NCAA Men's and Women's basketball tournaments. Unfortunately, both tournaments were canceled because of the COVID-19 pandemic. Regardless, working through my different ideas was a great way to learn some of the basics and nuances of training an ML model effectively.

The typical successful model on Kaggle is an XGBoost ensemble aimed at classifying wins and losses of tournament games, and the biggest differences between entries generally come down to feature selection and engineering. A key advantage of neural networks is the ability of a large network to learn non-linear relationships, which reduces the need for complex feature engineering. However, training and test data are limited in this case: the most detailed data set extends back only to 2003, roughly 1,140 tournament games, and many of those games have upset results that even the best models would not predict. As expected, a neural network without feature engineering performs much more poorly on this data set, while a network with feature engineering performs similarly to an XGBoost model. In future posts I will explore more complex deep learning architectures, such as RNNs, for tournament prediction, but here I'll focus on feature engineering.

The easiest way to generate features for tournament prediction is to average a team's regular season statistics. The Kaggle data set contains various statistics, and users also commonly derive advanced statistics before aggregation. The aggregated features are useful, but do not compensate for opponent strength: a team with an easy schedule may post artificially inflated statistics. Some methods, like the Bradley-Terry model (implemented in Python here), solve for single-value team strengths using pairwise comparisons of results across all teams. That implementation solves for team strengths based only on wins and losses, so it cannot distinguish between aspects of a team's strength (e.g. differentiating defensive skill from offensive skill).
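
To make the pairwise idea concrete, here is a minimal Bradley-Terry sketch (my own illustration, not the linked implementation): each team $i$ gets a single strength $p_i$, with $P(i \text{ beats } j) = p_i/(p_i+p_j)$, fit with the standard minorization-maximization updates.

#collapse_show
# Minimal Bradley-Terry sketch (illustrative only)
import numpy as np

def bradley_terry(games, n_teams, n_iter=200):
    """games: list of (winner_idx, loser_idx); returns log-strengths."""
    wins = np.zeros(n_teams)
    for w, _ in games:
        wins[w] += 1
    p = np.ones(n_teams)  # strengths p_i, with P(i beats j) = p_i/(p_i+p_j)
    for _ in range(n_iter):  # standard MM updates
        denom = np.zeros(n_teams)
        for w, l in games:
            denom[w] += 1.0/(p[w] + p[l])
            denom[l] += 1.0/(p[w] + p[l])
        p = wins/denom
        p /= p.sum()  # strengths are only defined up to scale
    return np.log(p)

# toy schedule: team 0 beats team 1 twice, 1 beats 2, 2 beats 0
print(bradley_terry([(0, 1), (0, 1), (1, 2), (2, 0)], 3))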

But what if we generalize this concept of pairwise comparisons using embeddings? Embeddings can be any length and can be trained to represent complex relationships. This notebook will generate team-level embeddings, representative of regular season data (wins/losses, point differential, court location), that could be used to train a tournament model. A simple exploratory analysis suggests that the trained embeddings are a richer representation of the original feature data set that could be implemented in a tournament model. Future testing and expansion of this method to advanced statistics will be needed to confirm that!

What you will see in this notebook:

  • Brief data prep - we are only using wins/losses, points, home/away, and team IDs as inputs to the model. Later, I will expand this model to advanced statistics, but training the model on this subset of data allows us to test the concept all the way back to 1985!
  • Model build - This model is being built with the sole purpose of generating useful embeddings. To achieve that we are training the model to be predictive of features that we would ordinarily use as feature inputs to a real tournament model (in this case, regular season wins and losses).
  • Training and validation - the model is trained using only regular season data from all years and is validated on a secondary set of tournament data (NIT). There is a slight mismatch between the training and validation data, but the validation data is broadly similar and likely more representative of the real NCAA tournament. That is okay; the end goal of this model is trained embeddings, not win prediction.
  • Sense check and exploratory analysis - First thing is to check that predictions from the model are sensible, but what we really care about is the embeddings. Do they carry more useful information than simple aggregations of the data they represent? In short, Yes!

#collapse_hide
from pathlib import Path
import numpy as np
import pandas as pd
import tensorflow as tf
import keras as k
from keras.models import Model
from keras.layers import Dense, Input, Dropout, Multiply, Lambda, Concatenate, Add, Subtract
from keras.layers.embeddings import Embedding
from keras.initializers import glorot_uniform, glorot_normal, Constant
from keras.optimizers import Adam
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.manifold import TSNE
import altair as alt
np.random.seed(13)

Input data

Kaggle provides two regular season data sets: one with less detailed statistics dating back to 1985, and one with more detailed statistics dating back to 2003. For this proof of concept, I'll simplify the model by working with the older data set. The model will be trained on less powerful statistics, but we will have more historical context during data exploration. Let's preview the first few rows of the regular season data:

#collapse_hide
dataLoc=Path('./data/2020-05-04-NCAA-Embeddings/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/MDataFiles_Stage2/')

df_teams = pd.read_csv(dataLoc/'MTeams.csv')
teams_dict = df_teams[['TeamID','TeamName']].set_index('TeamID').to_dict()['TeamName']

df_regSeason_data = pd.read_csv(dataLoc/'MRegularSeasonCompactResults.csv')
df_regSeason_data.head() # cols = Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT
0 1985 20 1228 81 1328 64 N 0
1 1985 25 1106 77 1354 70 H 0
2 1985 25 1112 63 1223 56 H 0
3 1985 25 1165 70 1432 54 H 0
4 1985 25 1192 86 1447 74 H 0

Each game is listed only once, with separate columns for winner and loser statistics. The data will be reformatted to leave us with the following inputs and outputs to the model:

  • Team IDs
  • Team locations (home or away)
  • Odds of 'Team A' winning (classification/logistic regression)
  • Expected point differential (team A - team B, regression)

Each game is presented to the model twice, once as 'Team A vs. Team B', and then vice versa. Team IDs also need to be reformatted to differentiate a single school's team by year (e.g. Duke 2019 $\neq$ Duke 2020). This may be an oversimplification, but is clearly a better assumption than treating the entire history of a school the same.

Finally, we want to use as much of the regular season data as possible to train the embeddings. Because the model's raw predictive accuracy is not the end goal, secondary tournament data (NIT) is used for training validation. This gives a sense that the model is training correctly without sacrificing valuable regular season or NCAA tournament data in the process. This data set has the same format as the regular season data and is processed accordingly.

#collapse_hide
# load validation data
df_otherTourney_data = pd.read_csv(dataLoc/'MSecondaryTourneyCompactResults.csv').drop(columns='SecondaryTourney')

# Create team encoding that differentiates teams by year and school
def newTeamID(df):
    # df = df.sample(frac=1).reset_index(drop=True)
    df['Wnewid'] = df['Season'].astype(str) + df['WTeamID'].astype(str)
    df['Lnewid'] = df['Season'].astype(str) + df['LTeamID'].astype(str)
    return df

df_regSeason_data = newTeamID(df_regSeason_data)
df_otherTourney_data = newTeamID(df_otherTourney_data)

def idDicts(df):
    newid_W = list(df['Wnewid'].unique())
    newid_L = list(df['Lnewid'].unique())
    ids = list(set().union(newid_W,newid_L))
    ids.sort()
    oh_to_id = {}
    id_to_oh = {}
    for i in range(len(ids)):
        id_to_oh[ids[i]] = i 
        oh_to_id[i] = ids[i]

    return oh_to_id, id_to_oh

oh_to_id, id_to_oh = idDicts(df_regSeason_data)    

# add training data in swapped format so network sees both wins and losses
def swapConcat_data(df):

    df['Wnewid'] = df['Wnewid'].apply(lambda x: id_to_oh[x])
    df['Lnewid'] = df['Lnewid'].apply(lambda x: id_to_oh[x])

    loc_dict = {'A':-1,'N':0,'H':1}
    df['WLoc'] = df['WLoc'].apply(lambda x: loc_dict[x])

    swap_cols = ['Season', 'DayNum', 'LTeamID', 'LScore', 'WTeamID', 'WScore', 'WLoc', 'NumOT', 'Lnewid', 'Wnewid']

    df_swap = df[swap_cols].copy()

    df_swap['WLoc'] = df_swap['WLoc']*-1

    df.columns = [x.replace('WLoc','T1_Court')
                   .replace('W','T1_')
                   .replace('L','T2_') for x in list(df.columns)]

    df_swap.columns = df.columns

    df = pd.concat([df,df_swap])

    df['Win'] = (df['T1_Score']>df['T2_Score']).astype(int)
    df['Close_Game'] = abs(df['T1_Score']-df['T2_Score']) < 3  # flag games decided by 1-2 points
    df['Score_diff'] = df['T1_Score'] - df['T2_Score']
    # shrink each margin one point toward zero (the minimum 1-point margin maps to 0)
    df['Score_diff'] = df['Score_diff'] - (df['Score_diff']/df['Score_diff'].abs())
    df['T2_Court'] = df['T1_Court']*-1
    df[['T1_Court','T2_Court']] = df[['T1_Court','T2_Court']] + 1  # shift court flags from -1/0/1 to 0/1/2 for embedding lookup

    df = df.sort_index()
    df.reset_index(drop=True,inplace=True)


    return df

df_regSeason_full = swapConcat_data(df_regSeason_data.copy().sort_values(by='DayNum'))
df_otherTourney_full = swapConcat_data(df_otherTourney_data.copy())

# Convert to numpy arrays in correct format
def prep_inputs(df,id_to_oh, col_outputs):
    Xteams = np.stack([df['T1_newid'].values,df['T2_newid'].values]).T
    Xloc = np.stack([df['T1_Court'].values,df['T2_Court'].values]).T

    if len(col_outputs) <2:
        Y_outputs = df[col_outputs].values
        Y_outputs = Y_outputs.reshape(len(Y_outputs),1)
    else:
        Y_outputs = np.stack([df[x].values for x in col_outputs])

    return [Xteams, Xloc], Y_outputs

X_train, Y_train = prep_inputs(df_regSeason_full, id_to_oh, ['Win','Score_diff'])
X_test, Y_test = prep_inputs(df_otherTourney_full, id_to_oh, ['Win','Score_diff'])

# Normalize point outputs - Win/loss unchanged (the win row is restored below)
def normalize_outputs(Y_outputs, stats_cache=None):
    if stats_cache is None:
        stats_cache = {}
        stats_cache['mean'] = np.mean(Y_outputs,axis=1)
        stats_cache['var'] = np.var(Y_outputs,axis=1)

    numOut = Y_outputs.shape[0]
    # note: this scales by the variance rather than the standard deviation;
    # transform_y further below inverts the same scaling, so it is self-consistent
    Y_normout = (Y_outputs-stats_cache['mean'].reshape((numOut,1)))/stats_cache['var'].reshape((numOut,1))

    return Y_normout, stats_cache

Y_norm_train, stats_cache_train = normalize_outputs(Y_train,None)
Y_norm_test, _ = normalize_outputs(Y_test,stats_cache_train)
Y_norm_train[0,:] = Y_train[0,:]
Y_norm_test[0,:] = Y_test[0,:]

Building the model

This model is built with two input types - home/away flags and team IDs. Each input is repeated for each team and is fed through a location embedding layer and a team embedding layer, respectively. The location embedding is 1-dimensional and multiplied element-wise with each team's embedding vector. The two team activations are then fed separately through the same two dense layers before being subtracted. This subtracted layer finally connects to two output layers - one sigmoid for win/loss prediction and one dense layer with no activation for point prediction.

A lot of the choices I made here are subjective and would need to be tested against other options. Beyond obvious hyperparameters like the number of layers or units per layer, there are architectural choices that could have a significant effect and haven't been fully explored. For example, should the location embedding activations be multiplied, added, or simply concatenated with the team embedding layer? What should the dimensions of the team embedding be? Some of these choices, like the length of the embeddings, can be mitigated with regularization (dropout, weight decay), but something like how the two embedding layers combine could have a large effect. Regardless, let's move forward with the model as is.

#collapse_hide
# build model

tf.keras.backend.clear_session()

def NCAA_Embeddings_Joint(nteams,teamEmb_size):
    team_input = Input(shape=[2,],dtype='int32', name='team_input')
    X_team = Embedding(input_dim=nteams, output_dim=teamEmb_size, input_length=2, embeddings_initializer=glorot_uniform(), name='team_encoding')(team_input)

    loc_input = Input(shape=[2,],dtype='int32', name='loc_input')
    X_loc = Embedding(input_dim=3, output_dim=1, input_length=2, embeddings_initializer=Constant(.5), name='loc_encoding')(loc_input)
    X_loc = Lambda(lambda z: k.backend.repeat_elements(z, rep=teamEmb_size, axis=-1))(X_loc)
    
    X = Multiply()([X_team,X_loc])
    X = Dropout(rate=.5)(X)
    T1 = Lambda(lambda z: z[:,0,:])(X)
    T2 = Lambda(lambda z: z[:,1,:])(X)

    D1 = Dense(units = 20, use_bias=True, activation='tanh')
    DO1 = Dropout(rate=.5)

    D2 = Dense(units = 10, use_bias=True, activation='tanh')
    DO2 = Dropout(rate=.5)

    X1 = D1(T1)
    X1 = DO1(X1)

    X1 = D2(X1)
    X1 = DO2(X1)

    X2 = D1(T2)
    X2 = DO1(X2)

    X2 = D2(X2)
    X2 = DO2(X2)

    X_sub = Subtract()([X1,X2])

    output_p= Dense(units = 1, use_bias=False, activation=None, name='point_output')(X_sub)
    output_w= Dense(units = 1, use_bias=False, activation='sigmoid', name='win_output')(X_sub)


    model = Model(inputs=[team_input, loc_input],outputs=[output_w,output_p],name='ncaa_embeddings_joint')

    return model

mymodel = NCAA_Embeddings_Joint(len(id_to_oh),8)
mymodel.summary()
Model: "ncaa_embeddings_joint"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
loc_input (InputLayer)          (None, 2)            0                                            
__________________________________________________________________________________________________
team_input (InputLayer)         (None, 2)            0                                            
__________________________________________________________________________________________________
loc_encoding (Embedding)        (None, 2, 1)         3           loc_input[0][0]                  
__________________________________________________________________________________________________
team_encoding (Embedding)       (None, 2, 8)         92752       team_input[0][0]                 
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 2, 8)         0           loc_encoding[0][0]               
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 2, 8)         0           team_encoding[0][0]              
                                                                 lambda_1[0][0]                   
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 2, 8)         0           multiply_1[0][0]                 
__________________________________________________________________________________________________
lambda_2 (Lambda)               (None, 8)            0           dropout_1[0][0]                  
__________________________________________________________________________________________________
lambda_3 (Lambda)               (None, 8)            0           dropout_1[0][0]                  
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 20)           180         lambda_2[0][0]                   
                                                                 lambda_3[0][0]                   
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 20)           0           dense_1[0][0]                    
                                                                 dense_1[1][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 10)           210         dropout_2[0][0]                  
                                                                 dropout_2[1][0]                  
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 10)           0           dense_2[0][0]                    
                                                                 dense_2[1][0]                    
__________________________________________________________________________________________________
subtract_1 (Subtract)           (None, 10)           0           dropout_3[0][0]                  
                                                                 dropout_3[1][0]                  
__________________________________________________________________________________________________
win_output (Dense)              (None, 1)            10          subtract_1[0][0]                 
__________________________________________________________________________________________________
point_output (Dense)            (None, 1)            10          subtract_1[0][0]                 
==================================================================================================
Total params: 93,165
Trainable params: 93,165
Non-trainable params: 0
__________________________________________________________________________________________________

Training the model

The model is trained using regular season data and validated using secondary tournament data (not the 'Big Dance'). The weights of the two losses are adjusted so that they back-propagate a similar amount of error: because the point differential data has been normalized, its loss is multiple orders of magnitude smaller than the log loss for wins/losses. Given that our goal is not really to predict wins and losses here, we could also train on only the point differential. We will see that the two outputs end up being perfectly correlated anyway, but using both gives us a framework for how this could be implemented with advanced statistics in a future example.

#collapse_show
# Joint model
optimizer = Adam(learning_rate=.01, beta_1=0.9, beta_2=0.999, amsgrad=False)
mymodel.compile(loss=['binary_crossentropy','logcosh'],
                loss_weights=[0.5,400],
                optimizer=optimizer,
                metrics = ['accuracy'])
batchSize = round(X_train[0].shape[0]/50)  # large batches: ~50 batches per epoch
results = mymodel.fit(X_train, [*Y_norm_train], validation_data=(X_test, [*Y_norm_test]), epochs = 30, batch_size = batchSize, shuffle=True, verbose=True)
Train on 333760 samples, validate on 3248 samples
Epoch 1/30
333760/333760 [==============================] - 6s 17us/step - loss: 1.2979 - win_output_loss: 0.6906 - point_output_loss: 0.0024 - win_output_accuracy: 0.5408 - point_output_accuracy: 0.0483 - val_loss: 0.8859 - val_win_output_loss: 0.6908 - val_point_output_loss: 0.0014 - val_win_output_accuracy: 0.5382 - val_point_output_accuracy: 0.0616
Epoch 2/30
333760/333760 [==============================] - 5s 15us/step - loss: 1.0580 - win_output_loss: 0.6422 - point_output_loss: 0.0019 - win_output_accuracy: 0.6641 - point_output_accuracy: 0.0483 - val_loss: 0.8394 - val_win_output_loss: 0.6707 - val_point_output_loss: 0.0013 - val_win_output_accuracy: 0.5844 - val_point_output_accuracy: 0.0616
Epoch 3/30
333760/333760 [==============================] - 3s 10us/step - loss: 0.9983 - win_output_loss: 0.5979 - point_output_loss: 0.0017 - win_output_accuracy: 0.6844 - point_output_accuracy: 0.0483 - val_loss: 0.8217 - val_win_output_loss: 0.6588 - val_point_output_loss: 0.0012 - val_win_output_accuracy: 0.6195 - val_point_output_accuracy: 0.0616
Epoch 4/30
333760/333760 [==============================] - 3s 10us/step - loss: 0.9603 - win_output_loss: 0.5859 - point_output_loss: 0.0017 - win_output_accuracy: 0.6989 - point_output_accuracy: 0.0483 - val_loss: 0.8071 - val_win_output_loss: 0.6510 - val_point_output_loss: 0.0012 - val_win_output_accuracy: 0.6139 - val_point_output_accuracy: 0.0616
Epoch 5/30
333760/333760 [==============================] - 4s 12us/step - loss: 0.9462 - win_output_loss: 0.5780 - point_output_loss: 0.0016 - win_output_accuracy: 0.7039 - point_output_accuracy: 0.0483 - val_loss: 0.7933 - val_win_output_loss: 0.6423 - val_point_output_loss: 0.0012 - val_win_output_accuracy: 0.6268 - val_point_output_accuracy: 0.0616
Epoch 6/30
333760/333760 [==============================] - 3s 9us/step - loss: 0.9328 - win_output_loss: 0.5739 - point_output_loss: 0.0016 - win_output_accuracy: 0.7085 - point_output_accuracy: 0.0483 - val_loss: 0.8160 - val_win_output_loss: 0.6489 - val_point_output_loss: 0.0012 - val_win_output_accuracy: 0.6250 - val_point_output_accuracy: 0.0616
Epoch 7/30
333760/333760 [==============================] - 4s 12us/step - loss: 0.9497 - win_output_loss: 0.5711 - point_output_loss: 0.0016 - win_output_accuracy: 0.7085 - point_output_accuracy: 0.0483 - val_loss: 0.7755 - val_win_output_loss: 0.6329 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6355 - val_point_output_accuracy: 0.0616
Epoch 8/30
333760/333760 [==============================] - 5s 15us/step - loss: 0.9141 - win_output_loss: 0.5642 - point_output_loss: 0.0016 - win_output_accuracy: 0.7124 - point_output_accuracy: 0.0483 - val_loss: 0.7665 - val_win_output_loss: 0.6253 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6583 - val_point_output_accuracy: 0.0616
Epoch 9/30
333760/333760 [==============================] - 5s 14us/step - loss: 0.9182 - win_output_loss: 0.5619 - point_output_loss: 0.0016 - win_output_accuracy: 0.7138 - point_output_accuracy: 0.0483 - val_loss: 0.7503 - val_win_output_loss: 0.6164 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6792 - val_point_output_accuracy: 0.0616
Epoch 10/30
333760/333760 [==============================] - 4s 13us/step - loss: 0.8924 - win_output_loss: 0.5534 - point_output_loss: 0.0015 - win_output_accuracy: 0.7176 - point_output_accuracy: 0.0483 - val_loss: 0.7732 - val_win_output_loss: 0.6088 - val_point_output_loss: 0.0012 - val_win_output_accuracy: 0.6810 - val_point_output_accuracy: 0.0616
Epoch 11/30
333760/333760 [==============================] - 5s 16us/step - loss: 0.8952 - win_output_loss: 0.5520 - point_output_loss: 0.0016 - win_output_accuracy: 0.7209 - point_output_accuracy: 0.0483 - val_loss: 0.7371 - val_win_output_loss: 0.6057 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6804 - val_point_output_accuracy: 0.0616
Epoch 12/30
333760/333760 [==============================] - 7s 21us/step - loss: 0.8900 - win_output_loss: 0.5540 - point_output_loss: 0.0015 - win_output_accuracy: 0.7215 - point_output_accuracy: 0.0483 - val_loss: 0.7528 - val_win_output_loss: 0.6078 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6773 - val_point_output_accuracy: 0.0616
Epoch 13/30
333760/333760 [==============================] - 8s 23us/step - loss: 0.8793 - win_output_loss: 0.5477 - point_output_loss: 0.0015 - win_output_accuracy: 0.7234 - point_output_accuracy: 0.0483 - val_loss: 0.7266 - val_win_output_loss: 0.6026 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6835 - val_point_output_accuracy: 0.0616
Epoch 14/30
333760/333760 [==============================] - 5s 14us/step - loss: 0.8666 - win_output_loss: 0.5444 - point_output_loss: 0.0015 - win_output_accuracy: 0.7248 - point_output_accuracy: 0.0483 - val_loss: 0.7266 - val_win_output_loss: 0.6017 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6872 - val_point_output_accuracy: 0.0616
Epoch 15/30
333760/333760 [==============================] - 6s 18us/step - loss: 0.8641 - win_output_loss: 0.5392 - point_output_loss: 0.0015 - win_output_accuracy: 0.7251 - point_output_accuracy: 0.0483 - val_loss: 0.7290 - val_win_output_loss: 0.6013 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6866 - val_point_output_accuracy: 0.0616
Epoch 16/30
333760/333760 [==============================] - 8s 23us/step - loss: 0.8680 - win_output_loss: 0.5444 - point_output_loss: 0.0015 - win_output_accuracy: 0.7243 - point_output_accuracy: 0.0483 - val_loss: 0.8162 - val_win_output_loss: 0.6027 - val_point_output_loss: 0.0013 - val_win_output_accuracy: 0.6909 - val_point_output_accuracy: 0.0616
Epoch 17/30
333760/333760 [==============================] - 2s 7us/step - loss: 0.8827 - win_output_loss: 0.5459 - point_output_loss: 0.0015 - win_output_accuracy: 0.7251 - point_output_accuracy: 0.0483 - val_loss: 0.7484 - val_win_output_loss: 0.6063 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6817 - val_point_output_accuracy: 0.0616
Epoch 18/30
333760/333760 [==============================] - 2s 5us/step - loss: 0.8706 - win_output_loss: 0.5418 - point_output_loss: 0.0015 - win_output_accuracy: 0.7269 - point_output_accuracy: 0.0483 - val_loss: 0.7290 - val_win_output_loss: 0.6012 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6835 - val_point_output_accuracy: 0.0616
Epoch 19/30
333760/333760 [==============================] - 2s 6us/step - loss: 0.8557 - win_output_loss: 0.5376 - point_output_loss: 0.0015 - win_output_accuracy: 0.7283 - point_output_accuracy: 0.0483 - val_loss: 0.7286 - val_win_output_loss: 0.5987 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6884 - val_point_output_accuracy: 0.0616
Epoch 20/30
333760/333760 [==============================] - 2s 6us/step - loss: 0.8516 - win_output_loss: 0.5347 - point_output_loss: 0.0014 - win_output_accuracy: 0.7272 - point_output_accuracy: 0.0483 - val_loss: 0.7281 - val_win_output_loss: 0.6007 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6884 - val_point_output_accuracy: 0.0616
Epoch 21/30
333760/333760 [==============================] - 1s 4us/step - loss: 0.8491 - win_output_loss: 0.5341 - point_output_loss: 0.0014 - win_output_accuracy: 0.7274 - point_output_accuracy: 0.0483 - val_loss: 0.7514 - val_win_output_loss: 0.6003 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6921 - val_point_output_accuracy: 0.0616
Epoch 22/30
333760/333760 [==============================] - 1s 4us/step - loss: 0.8625 - win_output_loss: 0.5446 - point_output_loss: 0.0015 - win_output_accuracy: 0.7293 - point_output_accuracy: 0.0483 - val_loss: 0.7429 - val_win_output_loss: 0.5990 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6903 - val_point_output_accuracy: 0.0616
Epoch 23/30
333760/333760 [==============================] - 1s 4us/step - loss: 0.8728 - win_output_loss: 0.5396 - point_output_loss: 0.0015 - win_output_accuracy: 0.7266 - point_output_accuracy: 0.0483 - val_loss: 0.7210 - val_win_output_loss: 0.5980 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6983 - val_point_output_accuracy: 0.0616
Epoch 24/30
333760/333760 [==============================] - 1s 4us/step - loss: 0.8614 - win_output_loss: 0.5420 - point_output_loss: 0.0015 - win_output_accuracy: 0.7254 - point_output_accuracy: 0.0483 - val_loss: 0.7340 - val_win_output_loss: 0.6033 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6786 - val_point_output_accuracy: 0.0616
Epoch 25/30
333760/333760 [==============================] - 1s 4us/step - loss: 0.8534 - win_output_loss: 0.5405 - point_output_loss: 0.0014 - win_output_accuracy: 0.7278 - point_output_accuracy: 0.0483 - val_loss: 0.7405 - val_win_output_loss: 0.6014 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6810 - val_point_output_accuracy: 0.0616
Epoch 26/30
333760/333760 [==============================] - 1s 3us/step - loss: 0.8520 - win_output_loss: 0.5373 - point_output_loss: 0.0015 - win_output_accuracy: 0.7289 - point_output_accuracy: 0.0483 - val_loss: 0.7255 - val_win_output_loss: 0.6016 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6810 - val_point_output_accuracy: 0.0616
Epoch 27/30
333760/333760 [==============================] - 2s 5us/step - loss: 0.8471 - win_output_loss: 0.5368 - point_output_loss: 0.0014 - win_output_accuracy: 0.7302 - point_output_accuracy: 0.0483 - val_loss: 0.7738 - val_win_output_loss: 0.6038 - val_point_output_loss: 0.0012 - val_win_output_accuracy: 0.6897 - val_point_output_accuracy: 0.0616
Epoch 28/30
333760/333760 [==============================] - 2s 6us/step - loss: 0.8516 - win_output_loss: 0.5344 - point_output_loss: 0.0015 - win_output_accuracy: 0.7290 - point_output_accuracy: 0.0483 - val_loss: 0.7666 - val_win_output_loss: 0.6041 - val_point_output_loss: 0.0012 - val_win_output_accuracy: 0.6847 - val_point_output_accuracy: 0.0616
Epoch 29/30
333760/333760 [==============================] - 2s 7us/step - loss: 0.8660 - win_output_loss: 0.5403 - point_output_loss: 0.0015 - win_output_accuracy: 0.7301 - point_output_accuracy: 0.0483 - val_loss: 0.7302 - val_win_output_loss: 0.6017 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.6860 - val_point_output_accuracy: 0.0616
Epoch 30/30
333760/333760 [==============================] - 2s 6us/step - loss: 0.8442 - win_output_loss: 0.5338 - point_output_loss: 0.0014 - win_output_accuracy: 0.7304 - point_output_accuracy: 0.0483 - val_loss: 0.7245 - val_win_output_loss: 0.5991 - val_point_output_loss: 0.0011 - val_win_output_accuracy: 0.7001 - val_point_output_accuracy: 0.0616

#collapse_hide
accuracy = results.history['win_output_accuracy']
val_accuracy = results.history['val_win_output_accuracy']
loss = results.history['win_output_loss']
val_loss = results.history['val_win_output_loss']
# summarize history for accuracy
plt.plot(accuracy)
plt.plot(val_accuracy)
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(loss)
plt.plot(val_loss)
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Model predictions

The cross plots for point differential and win/loss are generally well-behaved. The loss and accuracy of the model fall well short of the results on the Kaggle leaderboard for this competition. For reference, anything below a log loss of 0.5 would be considered fantastic, and flipping a coin gives a loss of about 0.69 (−ln 0.5 ≈ 0.693). We see a similar effect in the point spread prediction, with a rather loose correlation of 0.47.

However, this model is optimized to train embeddings, not to predict tournament games. In fact, a model trained to predict winners and losers of regular season games would perform poorly on tournament data anyway: only the best teams play in the tournament, so different features matter and making predictions is even harder. So the model predictions aren't very useful, but the question remains: are the embeddings meaningful? A comparison of the embeddings to the aggregated data used in training shows that the embeddings are a richer representation of the regular season data.

#collapse_hide
def transform_y(preds,stats_cache):
    preds = stats_cache['var'][1] * preds + stats_cache['mean'][1]
    return preds

preds = mymodel.predict(X_test)

x = transform_y(preds[1],stats_cache_train).reshape(-1)
y = transform_y(Y_norm_test[1],stats_cache_train).reshape(-1)


print('Pearson coefficient: ', round(stats.pearsonr(x, y)[0]*100)/100)
plt.scatter(x, y, alpha=0.08)
plt.xlabel('Predicted point difference')
plt.ylabel('Actual point difference')
plt.show()
Pearson coefficient:  0.47

#collapse_hide
x = transform_y(preds[1],stats_cache_train).reshape(-1)
y = preds[0].reshape(-1)


print('Pearson coefficient: ', round(stats.pearsonr(x, y)[0]*100)/100)
plt.scatter(x, y, alpha=0.08)
plt.xlabel('Predicted point difference')
plt.ylabel('Predicted Win Probability')
plt.show()
Pearson coefficient:  1.0

#collapse_hide
ywin = []
yloss = []
for ii in range(len(Y_test[0])):
    if Y_test[0,ii]==1:
        ywin += [y[ii]]
    else:
        yloss += [y[ii]]

plt.hist(yloss,bins=100, alpha=.5, label='True Losses')
plt.hist(ywin,bins=100, alpha=.5, label='True Wins')
plt.xlabel('Predicted Win Probability')
plt.ylabel('Count')
plt.legend(loc="upper center")
plt.show()

As a brief aside: one benefit of designing the network around pairwise comparisons is that its predictions are symmetric when the team order is swapped. Other ML models, such as XGBoost, treat the feature inputs of Team A and Team B differently, which can result in varying predictions when the teams are input in reverse order. This can be an issue even when training sets contain both matchups and swapped matchups, as documented in this discussion.
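
As a quick sanity check of that symmetry (a sketch using the `mymodel` and `X_test` objects defined above), swapping the team and location columns should flip the win probability and negate the predicted point differential:

#collapse_show
# swap Team A and Team B (and their court flags) and compare predictions
Xteams_swap = X_test[0][:, ::-1].copy()
Xloc_swap = X_test[1][:, ::-1].copy()
preds_fwd = mymodel.predict(X_test)
preds_rev = mymodel.predict([Xteams_swap, Xloc_swap])

# with shared dense weights, a Subtract() layer, and no output biases,
# win probabilities sum to ~1 and point differentials negate
print(np.allclose(preds_fwd[0] + preds_rev[0], 1.0, atol=1e-5))
print(np.allclose(preds_fwd[1], -preds_rev[1], atol=1e-5))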

Exploratory analysis

We trained two embeddings in this model - a location embedding of length 1 for each entry and a team embedding of length 8. The team embedding is what we want to look at, but we will need to use t-SNE to map the embeddings into length-2 vectors that we can more easily visualize. First, let's take a peek at the location embedding:

#collapse_hide

df_locEmb = pd.DataFrame(mymodel.layers[2].get_weights()[0])
df_locEmb.index = ['Away','Neutral','Home']
df_locEmb.head()
0
Away 0.183678
Neutral 0.226365
Home 0.428420

Remember that these elements are multiplied into each team's embedding for each game. For a game at a neutral site, each team gets an equivalent boost, so it is a wash. However, the model gives a significant extra multiplier to the home team. On reflection, it may not make much sense to apply a multiplier to embedding elements that can be positive or negative - do we really expect a poor team to be even worse on their home court? It seems more logical that the team and location embeddings should be added, but a quick test showed very similar final results. Another option is to concatenate the two embeddings and let the network decide, as in the sketch below.
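
For reference, here is a hedged sketch of that concatenation variant (this would replace the repeat-and-`Multiply` lines inside `NCAA_Embeddings_Joint` above; untested here):

#collapse_show
# concatenate instead of multiply: append the raw 1-d location embedding
# to each team embedding and let the dense layers learn the interaction
X_loc = Embedding(input_dim=3, output_dim=1, input_length=2,
                  embeddings_initializer=Constant(.5), name='loc_encoding')(loc_input)
X = Concatenate(axis=-1)([X_team, X_loc])  # shape: (None, 2, teamEmb_size + 1)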

Now, let's look at the team embeddings.

embeddings = mymodel.layers[3].get_weights()[0]
pd.DataFrame(embeddings).head()
0 1 2 3 4 5 6 7
0 0.049389 0.077394 0.028948 -0.007543 0.043675 -0.005593 0.010935 0.196465
1 -0.087588 0.179515 0.029412 -0.004801 0.036039 0.075006 0.060224 -0.069750
2 0.376848 0.348337 -0.383493 -0.403572 -0.345107 -0.379201 -0.497723 -0.320207
3 0.186613 0.035614 -0.095917 0.044093 -0.092695 -0.026047 0.002132 -0.051819
4 0.221811 0.179064 -0.249094 -0.374636 -0.372454 -0.259162 -0.381504 -0.215773

Each row in the table above is a team. These values have been trained as a result of each team's regular season match-ups and could now be frozen and used as features in a tournament model. The numbers above are meaningless to us without context, so to visualize them we need to condense them into a form we can plot more easily. t-SNE is a non-linear mapping algorithm that can represent the vectors in 2D space: it tends to keep vectors that are similar in the embedding space clustered together in 2D space. In our case, t-SNE projects the embeddings into a warped space with the strongest teams tending toward one side and the weakest toward the other.

Let's take a look at some comparisons between our t-SNE-projected embeddings and the aggregated game statistics of each team. All of these plots (excluding the final plot for 2020) include only the NCAA tournament teams of a given year. The plots below also highlight a few different historical factors:

  • Tournament seed number
  • Number of NCAA Tournament games won that season
  • Tournament winners
  • Some of the biggest upsets (from these two articles -> 1, 2)
  • The tournament field by season
  • All data for all teams in the 2020 regular season

#collapse_hide
pd.DataFrame(embeddings).to_csv('./data/2020-05-04-NCAA-Embeddings/embeddings.csv', index=False)
t = TSNE(n_components=2,random_state=13)
embed_tsne = t.fit_transform(embeddings)

df_regSeason_full['T1_TeamName'] = df_regSeason_full['T1_TeamID'].apply(lambda x: teams_dict[x]) + '-' + df_regSeason_full['Season'].astype(str)
df_agg=df_regSeason_full.groupby('T1_TeamName').mean()
df_agg.reset_index(inplace=True,drop=False)

df_agg.drop(columns='Season',inplace=True)  # keep aggregated Win and Score_diff keyed by T1_TeamName

df_tourney_data = pd.read_csv(dataLoc/'MNCAATourneyCompactResults.csv')
df_tourney_data['WTeamName'] = df_tourney_data['WTeamID'].apply(lambda x: teams_dict[x]) + '-' + df_tourney_data['Season'].astype(str)
df_tourney_data['Wins'] = 0
df_wins = df_tourney_data[['WTeamName','Wins']].groupby('WTeamName').count()
tourneyWinners = [df_tourney_data.loc[df_tourney_data['Season']==s,'WTeamName'].values[-1] for s in df_tourney_data['Season'].unique()]

df_seeds = pd.read_csv(dataLoc/'MNCAATourneySeeds.csv')
df_seeds['TeamName'] = df_seeds['TeamID'].apply(lambda x: teams_dict[x]) + '-' + df_seeds['Season'].astype(str)
df_seeds['Seed'] = df_seeds['Seed'].str.extract(r'(\d+)')
df_seeds['WonTourney'] = df_seeds['TeamName'].apply(lambda x: True if x in tourneyWinners else False)
df_seeds = df_seeds[['TeamName','Seed','WonTourney']]

df_upsets = pd.read_csv('./data/2020-05-04-NCAA-Embeddings/Upsets.csv')
df_upsets['David']=df_upsets['David']+'-'+df_upsets['Season'].astype(str)
df_upsets['Goliath']=df_upsets['Goliath']+'-'+df_upsets['Season'].astype(str)
upsets = {}
for ii in df_upsets['David'].unique():
    upsets[ii] = 'Surprise'
for ii in df_upsets['Goliath'].unique():
    upsets[ii] = 'Bust'
df_seeds = pd.merge(left=df_seeds, right=df_wins, how='left', left_on='TeamName',right_index=True)
df_seeds['Wins'].fillna(0,inplace=True)

# look up each team's upset role ('Surprise' or 'Bust'); None if the team wasn't in a listed upset
df_seeds['Upset'] = df_seeds['TeamName'].apply(lambda x: upsets.get(x))

df = pd.DataFrame(embed_tsne,columns=['factor1','factor2'])
df['TeamName'] = [str(teams_dict[int(oh_to_id[x][-4:])]) + '-' + oh_to_id[x][:4] for x in df.index]
df['Season'] = [int(oh_to_id[x][:4])for x in df.index]

df = pd.merge(left=df, right=df_seeds, how='left', on='TeamName')
df = pd.merge(left=df, right=df_agg, how='left', left_on='TeamName',right_on='T1_TeamName')

df = df[['TeamName','Season','factor1','factor2','Win','Score_diff','Seed','Wins','Upset','WonTourney']]
df.columns = ['TeamName','Season','factor1','factor2','RegWins','RegPoint_diff','Seed','TourneyWins','Upset','WonTourney']

df2020 = df[df['Season']==2020].copy()

df.dropna(inplace=True,subset=['Seed'])

df['TourneyWinsScaled'] = df['TourneyWins']/df['TourneyWins'].max()
df['SeedScaled'] = df['Seed'].astype(int)/df['Seed'].astype(int).max()

How correlated are the embeddings to tournament seeding?

We see a high correlation between the assigned seed and our embeddings. The embeddings appear to be a better representation of the seeding than the aggregated statistics, which makes sense: our method uses pairwise comparisons and so effectively accounts for opponent strength, while aggregated statistics do not.

#collapse_hide

axis_ranges = [[-80,100],
               [-80,80],
               [-10,30],
               [.2,1.2]]

def plot_comparison(df, colorBy, orderBy, axis_ranges):

    selector = alt.selection_single(empty='all', fields=['TeamName'])

    base = alt.Chart(df).mark_point(filled=True,size=50).encode(
        color=alt.condition(selector,
                            colorBy,
                            alt.value('lightgray') ),
        order=orderBy,
        tooltip=['TeamName','Seed']
    ).properties(
        width=250,
        height=250
    ).add_selection(selector).interactive()

    chart1 = [alt.X('factor1:Q',
                    scale=alt.Scale(domain=axis_ranges[0]),
                    axis=alt.Axis(title='t-SNE factor 1')),
            alt.Y('factor2:Q',
                    scale=alt.Scale(domain=axis_ranges[1]),
                    axis=alt.Axis(title='t-SNE factor 2'))]

    chart2 = [alt.X('RegPoint_diff:Q',
                    scale=alt.Scale(domain=axis_ranges[2]),
                    axis=alt.Axis(title='Average regular season point differential')),
            alt.Y('RegWins:Q',
                    scale=alt.Scale(domain=axis_ranges[3]),
                    axis=alt.Axis(format='%', title='Regular season win percentage'))]

    return base, chart1, chart2

colorBy = alt.Color('Seed:Q', scale=alt.Scale(scheme='viridis',reverse=True))
orderBy = alt.Order('Seed:Q', sort='descending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)
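
As a rough quantitative companion to the plots (my addition; the t-SNE axes are not guaranteed to align with skill, so treat this as a sanity check rather than a proper comparison), we can compute the rank correlation of each feature with tournament seed:

#collapse_show
# rank correlation of each feature with tournament seed
seed = df['Seed'].astype(int)
for col in ['factor1', 'factor2', 'RegWins', 'RegPoint_diff']:
    rho = stats.spearmanr(df[col], seed)[0]
    print(f'Seed vs {col}: Spearman rho = {rho:.2f}')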

Are features associated with tournament wins?

The embeddings appear to be far less correlated with the number of tournament games won than they were with seeds. This is logical: tournament seeds are fundamentally generated from the same distribution of data we used in training, while predicting the number of tournament wins requires true predictive power. There does appear to be a significant reduction in overlap in the t-SNE representation, especially on the low end of the color scale.

#collapse_hide
colorBy = alt.Color('TourneyWins:Q', scale=alt.Scale(scheme='viridis',reverse=False))
orderBy = alt.Order('TourneyWins:Q', sort='ascending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

Are tournament winners differentiated?

Generally, the tournament winners are clustered together in both of the plots below. They appear somewhat more tightly clustered in the embedding plot, but it is difficult to tell given the overall difference in point distribution between the two plots. Distinguishing the impact here will likely require comparing two tournament models using these features as inputs.

#collapse_hide
colorBy = alt.Color('WonTourney:N', scale=alt.Scale(scheme='tableau10'))
orderBy = alt.Order('WonTourney:N', sort='ascending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

Can we predict "the upsets"?

In short, no. These upsets are compiled from here and here - underdogs shown in red. Generally, the model agrees with the experts: these were upsets, and they wouldn't have been predicted by this method. If anything, this method would likely have predicted no upset with even greater conviction than a model trained on just aggregated points and wins. The only exception is the 1986 "upset" of Cleveland State over the Indiana Hoosiers. Both the embeddings model and the aggregated statistics indicate that Cleveland State may have been the better team; perhaps name recognition led this to be called an upset.

#collapse_hide
colorBy = alt.Color('Upset:N', scale=alt.Scale(scheme='tableau10'))
orderBy = alt.Order('Upset:N', sort='ascending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

How do these features change from year to year?

The plot below is colored by tournament games won - the tournament winner appears in yellow. The spread of teams is quite variable year to year. Notably, the 1985 tournament, which Villanova won as a heavy underdog, shows less spread across the field than other years. Interestingly, this model suggests that the 1988 tournament had even less spread, and relative to the rest of the field the 1988 Kansas team may have been an even bigger underdog.

#collapse_hide
## create slider
select_year = alt.selection_single(
    name='select', fields=['Season'], init={'Season': 1985},
    bind=alt.binding_range(min=1985, max=2019, step=1))

colorBy = alt.Color('TourneyWins:Q', scale=alt.Scale(scheme='viridis',reverse=False))
orderBy = alt.Order('TourneyWins:Q', sort='ascending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)
base = base.add_selection(select_year).transform_filter(select_year)

base.encode(*chart1)  | base.encode(*chart2)

What would the 2020 tournament have looked like?

The plot below suggests that the four strongest teams by these metrics are Gonzaga, Kansas, Dayton, and Duke. However, we've already established that the embeddings alone are not enough to predict winners - we need a more detailed model to do that. Nonetheless, it is interesting to scan which teams this model has in its top 50 or so. This year had some surprising teams, and some of the names in the chart might surprise you! I will try to include the 2020 teams in future analyses, since we missed out on what could have been a very exciting tournament this year.

#collapse_hide
colorBy = alt.Color('TeamName:N')
axis_ranges = [[-80,90],[-80,70],[-30,30],[0,1]]
orderBy = alt.Order('Upset:N', sort='ascending')
base, chart1, chart2 = plot_comparison( df2020, colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

Conclusions

The embeddings appear to have learned which teams are better and which are worse, and they seem to be a better representation of team skill than simple aggregations of the statistics used in model training (wins and point differentials). The post linked in the update above shows a quick example of the uplift that using these embeddings as features can provide to a simple tournament model. The method could easily be generalized to more advanced statistics, but it remains to be seen whether embeddings can outperform more complex engineered features. I'll test this in a future post, and I would be surprised if this concept doesn't find its way into my 2021 NCAA Tournament models.