In this exercise we use historical rugby match data to predict future wins.

PRELIMINARIES

Libraries needed for data processing and plotting:

library("dplyr")
library("magrittr")
library("ggplot2")

Source an external script with handy function definitions:

source("./scripts/my_defs_u2.R")

The content of the above external file is taken from https://github.com/pedrosan/TheAnalyticsEdge/tree/master/scripts

PART 1: LOADING THE DATA

The dataset has been created as described here

A final copy of the dataset is available on github: https://github.com/hicksonanalytics/main/tree/master/data

Read the dataset NBA_rugby.csv

baseball <- read.csv("data/NBA_rugby.csv")
str(baseball)
## 'data.frame':    108 obs. of  34 variables:
##  $ Team        : Factor w/ 15 levels "BATH","BTL","EXET",..: 13 3 12 8 1 5 10 9 4 11 ...
##  $ Team_Desc   : Factor w/ 15 levels "Bath Rugby","Bristol Rugby",..: 13 3 12 6 1 5 10 9 4 11 ...
##  $ League      : Factor w/ 1 level "AP": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year        : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ PF          : int  693 667 579 567 486 532 476 430 533 471 ...
##  $ PA          : int  502 452 345 445 440 526 490 581 537 595 ...
##  $ W           : int  17 15 16 14 12 11 10 10 7 7 ...
##  $ TF          : int  89 86 66 58 52 57 52 51 61 55 ...
##  $ KI          : int  604 581 513 509 434 475 424 379 472 416 ...
##  $ BA          : logi  NA NA NA NA NA NA ...
##  $ Playoffs    : int  1 1 1 1 0 0 0 0 0 0 ...
##  $ RankSeason  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ RankPlayoffs: logi  NA NA NA NA NA NA ...
##  $ G           : int  22 22 22 22 22 22 22 22 22 22 ...
##  $ TA          : int  305 275 140 240 245 295 250 365 320 350 ...
##  $ KA          : int  197 177 205 205 195 231 240 216 217 245 ...
##  $ TRIES       : int  87 80 65 56 49 57 51 51 60 54 ...
##  $ CONVERSIONS : int  67 72 42 44 41 41 39 35 51 38 ...
##  $ PENALTIES   : int  38 31 54 59 44 55 46 35 42 40 ...
##  $ KICK_PERC   : num  0.79 0.8 0.7 0.8 0.77 0.79 0.77 0.75 0.77 0.73 ...
##  $ KICK        : int  544 515 622 524 603 522 587 522 536 536 ...
##  $ PASS        : int  3403 3730 2627 3147 3189 3362 2917 3690 2871 3339 ...
##  $ RUN         : int  2524 3209 2378 2550 2304 2620 2392 2674 2416 2598 ...
##  $ POSS1H      : num  0.58 0.59 0.48 0.56 0.46 0.51 0.47 0.49 0.44 0.45 ...
##  $ POSS2H      : num  0.5 0.59 0.47 0.56 0.45 0.52 0.51 0.43 0.44 0.56 ...
##  $ TERR1H      : num  0.28 0.31 0.49 0.33 0.34 0.29 0.3 0.2 0.23 0.2 ...
##  $ TERR2H      : num  0.26 0.28 0.49 0.36 0.33 0.3 0.31 0.21 0.2 0.23 ...
##  $ BREAKS      : int  264 165 167 162 182 189 139 177 189 130 ...
##  $ DEF_BEAT    : int  539 452 342 391 412 448 315 460 352 342 ...
##  $ OFFLOADS    : int  220 223 192 223 197 245 230 241 186 266 ...
##  $ RUCKW       : num  0.95 0.96 0.95 0.95 0.95 0.91 0.95 0.96 0.95 0.91 ...
##  $ MAULW_PERC  : num  0.55 0.44 0.89 0.54 0.67 0.45 0.48 0.35 0.38 0.33 ...
##  $ TURN_CONC   : int  300 307 315 301 306 286 293 302 307 274 ...
##  $ TACKLES     : num  0.88 0.84 0.86 0.86 0.87 0.84 0.85 0.86 0.84 0.82 ...

PART 2: MAKING THE PLAYOFFS

We decided to keep all of the data, so the subset condition is simply Year < 2020. The first graph looks at the number of wins required to make the playoffs:

moneyball_1996_2001 <- subset(baseball, Year < 2020)
ggplot(data = moneyball_1996_2001, aes(x = W, y = Team)) + theme_bw() + 
    scale_color_manual(values = c("grey", "red3")) + 
      geom_vline(xintercept = c(13.0, 15.0), col = "green2", linetype = "longdash") +
      geom_point(aes(color = factor(Playoffs)), pch = 16, size = 3.0)

How does a team win games?

  • They score more points than their opponents.
  • But how many more?
  • We calculated that a team needs to score 147 more points than it concedes during the regular season to expect to win 15 games.
  • Let’s see if we can verify this using linear regression.

Compute the points difference and define the new variable RD, added to the moneyball data frame:

moneyball <- subset(baseball, Year < 2020)
moneyball$RD <- moneyball$PF - moneyball$PA
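Since dplyr and magrittr are already loaded, the same two steps can also be written as a single pipe. A minimal sketch on a toy data frame (the column names mirror the real dataset, the values are made up for illustration):

```r
library(dplyr)

# Toy stand-in for the real 'baseball' data frame (same column names)
baseball_toy <- data.frame(Year = c(2015, 2017, 2021),
                           PF   = c(693, 667, 579),
                           PA   = c(502, 452, 345))

# Equivalent of subset() plus the manual column assignment, as a dplyr pipe
moneyball_toy <- baseball_toy %>%
    filter(Year < 2020) %>%
    mutate(RD = PF - PA)

moneyball_toy$RD   # 191 215
```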

Scatterplot to check for linear relationship between RD and wins (W):

ggplot(data = moneyball, aes(x = RD, y = W)) + theme_bw() + 
     scale_color_manual(values = c("grey", "red3")) + 
     geom_hline(yintercept = c(13.0, 15.0), col = "green2", linetype ="longdash") +
     geom_point(aes(color = factor(Playoffs)), alpha = 0.5, pch = 16, size = 3.0) 

Regression model to predict wins

Given the clear correlation between the points difference (RD) and W, it makes sense to first try a linear regression with RD as the only predictor.

Wins_Reg <- lm(W ~ RD, data = moneyball)
printStats(Wins_Reg)
## MODEL        : W ~ RD
## SUMMARY STATS: R^2 = 0.8141  (adj. = 0.8123)
##                F-stats: 464.147 on 1 and 106 DF,  p-value: 1.606699e-40
## 
##               Estimate     Std. Error   Pr(>|t|)     Signif
## (Intercept)   1.0639e+01   1.7708e-01   1.0595e-83   ***   
## RD            2.3069e-02   1.0708e-03   1.6067e-40   ***

\[ W = 10.64 + 0.023 * R_D \]

\[W > 14 \]

\[ \therefore 10.64 + 0.023 * R_D > 14 \]

\[ R_D > \frac {14-10.64} {0.023} \]

\[ R_D > 146 \]

There are n = 22 matches in each season, so the average points difference per match can be computed as: \[ \frac {R_D} {n} > \frac {146} {22} \approx 6.6 \] This means a team needs to maintain an average points difference of roughly +7 per match over the course of any given season.
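As a sanity check, the threshold can be computed directly from the fitted coefficients. A minimal sketch using the rounded estimates printed above (with the real model object one would use coef(Wins_Reg) instead of hard-coded values):

```r
# Coefficients taken from the printed regression output (rounded)
intercept <- 10.639
slope     <- 0.023069
n_matches <- 22

# Points difference needed for W > 14, i.e. at least 15 wins
rd_threshold <- (14 - intercept) / slope
round(rd_threshold)                   # 146

# Average points difference needed per match
round(rd_threshold / n_matches, 1)    # 6.6
```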

PART 3: PREDICTING POINTS DIFFERENCE

In rugby a team can score tries, worth 5 points, and penalties, worth 3 points; a try can also be converted for an additional 2 points. Therefore we need to look at the optimal combination of when to play for tries and when to play for penalties.

4 areas that factor into points difference:

  • Tries Scored
  • Penalties Scored
  • Tries Conceded
  • Penalties Conceded
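To make the scoring arithmetic concrete, here is a small helper (a hypothetical function, not part of the dataset or the sourced script) that turns tries, conversions and penalties into a points total:

```r
# Rugby union scoring: try = 5 points, conversion = 2, penalty = 3
match_points <- function(tries, conversions, penalties) {
    5 * tries + 2 * conversions + 3 * penalties
}

# Two tries (one converted) plus three penalties
match_points(tries = 2, conversions = 1, penalties = 3)   # 21
```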

Predicting Tries Scored

For our training data, we take the subset of seasons before 2016.

moneyball <- subset(baseball, Year < 2016)

Regression model with 10 Predictors

RS_reg_1 <- lm(TF ~ KICK_PERC + KICK + PASS + RUN + POSS1H + POSS2H + BREAKS + DEF_BEAT + OFFLOADS + RUCKW, data = moneyball)
printStats(RS_reg_1)
## MODEL        : TF ~ KICK_PERC + KICK + PASS + RUN + POSS1H + POSS2H + BREAKS + DEF_BEAT + OFFLOADS + RUCKW
## SUMMARY STATS: R^2 = 0.6684  (adj. = 0.6229)
##                F-stats: 14.711 on 10 and 73 DF,  p-value: 6.475133e-14
## 
##               Estimate      Std. Error    Pr(>|t|)      Signif
## (Intercept)    9.1819e+01    8.2380e+01    2.6868e-01         
## KICK_PERC      3.8004e+00    1.3820e+01    7.8411e-01         
## KICK           1.5443e-02    1.9184e-02    4.2344e-01         
## PASS           1.2782e-02    5.3141e-03    1.8700e-02   *     
## RUN           -9.9535e-03    6.6264e-03    1.3738e-01         
## POSS1H         3.5157e+01    1.9204e+01    7.1223e-02   .     
## POSS2H        -1.7904e+01    2.0771e+01    3.9152e-01         
## BREAKS         3.8872e-01    5.0408e-02    4.9011e-11   ***   
## DEF_BEAT      -3.7176e-02    2.7755e-02    1.8458e-01         
## OFFLOADS      -3.9620e-02    2.8328e-02    1.6616e-01         
## RUCKW         -1.0810e+02    8.9216e+01    2.2957e-01

If we take a look at the regression output, many variables are insignificant; still, starting with many predictors is a good way to quickly rule those variables out. BREAKS, which counts line breaks, is highly significant. This makes sense, as the attacking team must create line breaks to get through the defensive line and score tries.

Regression model with 2 Predictors

One option is to combine line breaks and offloads into a model:

RS_reg_2 <- lm(TF ~ BREAKS + OFFLOADS, data = moneyball)
printStats(RS_reg_2)
## MODEL        : TF ~ BREAKS + OFFLOADS
## SUMMARY STATS: R^2 = 0.6012  (adj. = 0.5914)
##                F-stats: 61.057 on 2 and 81 DF,  p-value: 6.763876e-17
## 
##               Estimate      Std. Error    Pr(>|t|)      Signif
## (Intercept)    5.1537e+00    5.1016e+00    3.1540e-01         
## BREAKS         4.0303e-01    3.7636e-02    3.5098e-17   ***   
## OFFLOADS      -4.1628e-02    2.5752e-02    1.0988e-01

We get the linear regression model: \[ T_F = 5.15 + 0.4 * BREAKS - 0.04 * OFFLOADS \] This model is simpler, with only two independent variables, and its R^2 (0.6012) is only slightly lower than that of the 10-predictor model (0.6684). Overall it is a better model.

We can see that BREAKS has a much larger coefficient than OFFLOADS, and that there is some collinearity between BREAKS and OFFLOADS.
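Plugging example values into the fitted equation shows how the two predictors trade off. A sketch using the rounded coefficients from the printed output (with the real model object one would call predict(RS_reg_2, newdata = ...) instead):

```r
# Coefficients taken from the printed RS_reg_2 output
b0 <-  5.1537     # intercept
b1 <-  0.40303    # BREAKS
b2 <- -0.041628   # OFFLOADS

# Predicted tries for a season with 264 line breaks and 220 offloads
tf_hat <- b0 + b1 * 264 + b2 * 220
round(tf_hat, 1)   # 102.4
```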

pairs(moneyball[, c("TF", "BREAKS", "OFFLOADS")], gap=0.5,  las=1, pch=21, bg=rgb(0,0,1,0.25), panel=mypanel, lower.panel=function(...) panel.cor(..., color.bg=TRUE), main="")
mtext(side=3, "pairs plot with correlation values", outer=TRUE, line=-1.2, font=2)
mtext(side=3, "Dashed lines are 'lm(y~x)' fits.\nCorrelation and scatterplot frames are color-coded on the strength of the correlation", outer=TRUE, line=-1.6, padj=1, cex=0.8, font=1)

PART 4: PREDICTING THE 2017 SEASON

In this last part we try to make predictions for the 2017 season, using the data from 2015 and earlier as training data.

Read-in test set

moneyball_test <- subset(baseball, Year == 2017)

Predicting Tries Scored

Let’s use our model RS_reg_2 to predict how many tries each team will score in the 2017 season. We use the predict() function, giving it the model we just determined to be the best one and the new data, moneyball_test.

TriesPredictions <- predict(RS_reg_2, newdata = moneyball_test)

We can compute the out-of-sample R^2, a measure of how well the model predicts on test data. The R^2 value we had before, 0.6012, is an in-sample R^2: it measures how well the model fits the training data. To measure the goodness of fit of the predictions, we need the out-of-sample R^2.

Out-Of-Sample R^2 and RMSE

First we need to compute the sum of squared errors (SSE), i.e. the sum of the squared differences between the predicted and the actual number of tries:

SSE <- sum((TriesPredictions - moneyball_test$TF)^2)

We also need the total sum of squares (SST): the sum of the squared differences between the actual number of tries in the test set and the average number of tries in the training set.

SST <- sum((mean(moneyball$TF) - moneyball_test$TF)^2)

The R^2 is then calculated as usual: 1 minus the sum of squared errors divided by the total sum of squares.

R2 <- 1 - SSE/SST

The out-of-sample R^2 is 0.5636.

RMSE <- sqrt(SSE/nrow(moneyball_test))

At 13.59, the RMSE is a little higher than the training-set residual standard error of 8.928, shown in the model summary below.

summary(RS_reg_2)
## 
## Call:
## lm(formula = TF ~ BREAKS + OFFLOADS, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.516  -6.063  -1.720   5.958  22.568 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.15374    5.10163   1.010    0.315    
## BREAKS       0.40303    0.03764  10.709   <2e-16 ***
## OFFLOADS    -0.04163    0.02575  -1.616    0.110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.928 on 81 degrees of freedom
## Multiple R-squared:  0.6012, Adjusted R-squared:  0.5914 
## F-statistic: 61.06 on 2 and 81 DF,  p-value: < 2.2e-16
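The SSE/SST/RMSE steps above can be wrapped into a small reusable helper (a hypothetical function, demonstrated here on toy numbers rather than the real predictions):

```r
# Out-of-sample R^2 and RMSE, computed as in the steps above
oos_metrics <- function(pred, actual, train_mean) {
    sse <- sum((pred - actual)^2)          # sum of squared errors
    sst <- sum((train_mean - actual)^2)    # total sum of squares vs training mean
    list(R2 = 1 - sse / sst, RMSE = sqrt(sse / length(actual)))
}

# Toy example: two predictions against two observed values,
# with a training-set mean of 10
oos_metrics(pred = c(10, 12), actual = c(11, 13), train_mean = 10)
# R2 = 0.8, RMSE = 1
```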

PART 5: FURTHER WORK AND CURRENT ISSUES

  • Fix 2016 data issues
  • Fix data gaps for territory and possession
  • Convert tackles, mauls and rucks into two pairs of absolute counts instead of percentages
  • Remove old template baseball and NBA references in the code

Build models for the following:

  • Predicting Penalties Scored
  • Predicting Tries Conceded
  • Predicting Penalties Conceded