In this exercise we are using historical match data to predict future wins.
Libraries needed for data processing and plotting:
library("dplyr")
library("magrittr")
library("ggplot2")
Source an external script with handy function definitions:
source("./scripts/my_defs_u2.R")
The content of the above external file is taken from https://github.com/pedrosan/TheAnalyticsEdge/tree/master/scripts
The dataset was created as described here.
A final copy of the dataset is available on github: https://github.com/hicksonanalytics/main/tree/master/data
Read the dataset NBA_rugby.csv
# variable name kept from the original MLB "Moneyball" analysis this exercise adapts
baseball <- read.csv("data/NBA_rugby.csv")
str(baseball)
## 'data.frame': 108 obs. of 34 variables:
## $ Team : Factor w/ 15 levels "BATH","BTL","EXET",..: 13 3 12 8 1 5 10 9 4 11 ...
## $ Team_Desc : Factor w/ 15 levels "Bath Rugby","Bristol Rugby",..: 13 3 12 6 1 5 10 9 4 11 ...
## $ League : Factor w/ 1 level "AP": 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ PF : int 693 667 579 567 486 532 476 430 533 471 ...
## $ PA : int 502 452 345 445 440 526 490 581 537 595 ...
## $ W : int 17 15 16 14 12 11 10 10 7 7 ...
## $ TF : int 89 86 66 58 52 57 52 51 61 55 ...
## $ KI : int 604 581 513 509 434 475 424 379 472 416 ...
## $ BA : logi NA NA NA NA NA NA ...
## $ Playoffs : int 1 1 1 1 0 0 0 0 0 0 ...
## $ RankSeason : int 1 2 3 4 5 6 7 8 9 10 ...
## $ RankPlayoffs: logi NA NA NA NA NA NA ...
## $ G : int 22 22 22 22 22 22 22 22 22 22 ...
## $ TA : int 305 275 140 240 245 295 250 365 320 350 ...
## $ KA : int 197 177 205 205 195 231 240 216 217 245 ...
## $ TRIES : int 87 80 65 56 49 57 51 51 60 54 ...
## $ CONVERSIONS : int 67 72 42 44 41 41 39 35 51 38 ...
## $ PENALTIES : int 38 31 54 59 44 55 46 35 42 40 ...
## $ KICK_PERC : num 0.79 0.8 0.7 0.8 0.77 0.79 0.77 0.75 0.77 0.73 ...
## $ KICK : int 544 515 622 524 603 522 587 522 536 536 ...
## $ PASS : int 3403 3730 2627 3147 3189 3362 2917 3690 2871 3339 ...
## $ RUN : int 2524 3209 2378 2550 2304 2620 2392 2674 2416 2598 ...
## $ POSS1H : num 0.58 0.59 0.48 0.56 0.46 0.51 0.47 0.49 0.44 0.45 ...
## $ POSS2H : num 0.5 0.59 0.47 0.56 0.45 0.52 0.51 0.43 0.44 0.56 ...
## $ TERR1H : num 0.28 0.31 0.49 0.33 0.34 0.29 0.3 0.2 0.23 0.2 ...
## $ TERR2H : num 0.26 0.28 0.49 0.36 0.33 0.3 0.31 0.21 0.2 0.23 ...
## $ BREAKS : int 264 165 167 162 182 189 139 177 189 130 ...
## $ DEF_BEAT : int 539 452 342 391 412 448 315 460 352 342 ...
## $ OFFLOADS : int 220 223 192 223 197 245 230 241 186 266 ...
## $ RUCKW : num 0.95 0.96 0.95 0.95 0.95 0.91 0.95 0.96 0.95 0.91 ...
## $ MAULW_PERC : num 0.55 0.44 0.89 0.54 0.67 0.45 0.48 0.35 0.38 0.33 ...
## $ TURN_CONC : int 300 307 315 301 306 286 293 302 307 274 ...
## $ TACKLES : num 0.88 0.84 0.86 0.86 0.87 0.84 0.85 0.86 0.84 0.82 ...
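The `str()` output shows two columns (`BA` and `RankPlayoffs`) that contain only NA values. Before modelling, it can help to drop such columns; a minimal sketch in base R:

```r
# Identify and drop columns that are entirely NA (here: BA and RankPlayoffs)
all_na <- sapply(baseball, function(x) all(is.na(x)))
names(baseball)[all_na]
baseball <- baseball[, !all_na]
```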
We decided to use all the data, so the subset keeps everything with Year less than 2020 (the variable name below is inherited from the original Moneyball analysis). The first graph looks at the wins required to make the playoffs:
moneyball_1996_2001 <- subset(baseball, Year < 2020)
ggplot(data = moneyball_1996_2001, aes(x = W, y = Team)) + theme_bw() +
scale_color_manual(values = c("grey", "red3")) +
geom_vline(xintercept = c(13.0, 15.0), col = "green2", linetype = "longdash") +
geom_point(aes(color = factor(Playoffs)), pch = 16, size = 3.0)
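To read the playoff win threshold off the data rather than the plot, we can summarise wins by playoff status; a sketch using dplyr (already loaded above):

```r
# Minimum and maximum wins for non-playoff (0) and playoff (1) teams
moneyball_1996_2001 %>%
  group_by(Playoffs) %>%
  summarise(min_W = min(W), max_W = max(W), teams = n())
```

The gap between the maximum wins of non-playoff teams and the minimum wins of playoff teams brackets the threshold marked by the dashed lines in the plot.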
Compute the points difference (points for minus points against) and add it to the moneyball data frame as the new variable RD (the name is kept from the original run-difference analysis):
moneyball <- subset(baseball, Year < 2020)
moneyball$RD <- moneyball$PF - moneyball$PA
Scatterplot to check for linear relationship between RD and wins (W):
ggplot(data = moneyball, aes(x = RD, y = W)) + theme_bw() +
scale_color_manual(values = c("grey", "red3")) +
geom_hline(yintercept = c(13.0, 15.0), col = "green2", linetype ="longdash") +
geom_point(aes(color = factor(Playoffs)), alpha = 0.5, pch = 16, size = 3.0)
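Before fitting a model, the strength of the linear relationship suggested by the scatterplot can be quantified directly:

```r
# Pearson correlation between points difference and wins
cor(moneyball$RD, moneyball$W)
```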
Given the clear correlation between points difference and W, it makes sense to first try a linear regression with points difference (RD) as the only predictor.
Wins_Reg <- lm(W ~ RD, data = moneyball)
printStats(Wins_Reg)
## MODEL : W ~ RD
## SUMMARY STATS: R^2 = 0.8141 (adj. = 0.8123)
## F-stats: 464.147 on 1 and 106 DF, p-value: 1.606699e-40
##
## Estimate Std. Error Pr(>|t|) Signif
## (Intercept) 1.0639e+01 1.7708e-01 1.0595e-83 ***
## RD 2.3069e-02 1.0708e-03 1.6067e-40 ***
\[ W = 10.64 + 0.023 * R_D \]
\[W > 14 \]
\[ \therefore 10.64 + 0.023 * R_D > 14 \]
\[ R_D > \frac {14-10.64} {0.023} \]
\[ R_D > 146 \]
There are n = 22 matches in each season, so the required average points difference per match is \[ \frac {R_D} {n} > \frac {146} {22} \approx 6.6 \] This means a team needs to maintain an average points difference of roughly +7 per match over the course of any given season.
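The same threshold can be derived from the fitted coefficients directly, instead of from the rounded values; a sketch assuming the Wins_Reg model fitted above:

```r
# Points difference needed for W > 14, from the fitted coefficients
b <- coef(Wins_Reg)
rd_needed <- unname((14 - b["(Intercept)"]) / b["RD"])
rd_needed       # roughly 146 points over the season
rd_needed / 22  # per-match margin, roughly +6.6
```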
In rugby, a team can score a try for 5 points or kick a penalty for 3 points; a try can then be converted for an additional 2 points. Therefore we need to look at the optimal balance between playing for tries and playing for penalties.
For the training data, we take the subset of seasons before 2016.
moneyball <- subset(baseball, Year < 2016)
RS_reg_1 <- lm(TF ~ KICK_PERC + KICK + PASS + RUN + POSS1H + POSS2H + BREAKS + DEF_BEAT + OFFLOADS + RUCKW, data = moneyball)
printStats(RS_reg_1)
## MODEL : TF ~ KICK_PERC + KICK + PASS + RUN + POSS1H + POSS2H + BREAKS + DEF_BEAT + OFFLOADS + RUCKW
## SUMMARY STATS: R^2 = 0.6684 (adj. = 0.6229)
## F-stats: 14.711 on 10 and 73 DF, p-value: 6.475133e-14
##
## Estimate Std. Error Pr(>|t|) Signif
## (Intercept) 9.1819e+01 8.2380e+01 2.6868e-01
## KICK_PERC 3.8004e+00 1.3820e+01 7.8411e-01
## KICK 1.5443e-02 1.9184e-02 4.2344e-01
## PASS 1.2782e-02 5.3141e-03 1.8700e-02 *
## RUN -9.9535e-03 6.6264e-03 1.3738e-01
## POSS1H 3.5157e+01 1.9204e+01 7.1223e-02 .
## POSS2H -1.7904e+01 2.0771e+01 3.9152e-01
## BREAKS 3.8872e-01 5.0408e-02 4.9011e-11 ***
## DEF_BEAT -3.7176e-02 2.7755e-02 1.8458e-01
## OFFLOADS -3.9620e-02 2.8328e-02 1.6616e-01
## RUCKW -1.0810e+02 8.9216e+01 2.2957e-01
Looking at the regression output, many variables are insignificant; still, starting with many predictors is a useful way to rule them out quickly. BREAKS, which counts line breaks, is highly significant. This makes sense: the attacking team must break through the defensive line to score tries.
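Rather than dropping insignificant predictors by hand, backward elimination with step() gives a quick sanity check; a sketch on the full model (AIC-based, so it may retain marginally significant variables):

```r
# Backward elimination from the full model, guided by AIC
RS_reg_step <- step(RS_reg_1, direction = "backward", trace = 0)
summary(RS_reg_step)$coefficients
```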
One option is to combine line breaks and offloads into a model:
RS_reg_2 <- lm(TF ~ BREAKS + OFFLOADS, data = moneyball)
printStats(RS_reg_2)
## MODEL : TF ~ BREAKS + OFFLOADS
## SUMMARY STATS: R^2 = 0.6012 (adj. = 0.5914)
## F-stats: 61.057 on 2 and 81 DF, p-value: 6.763876e-17
##
## Estimate Std. Error Pr(>|t|) Signif
## (Intercept) 5.1537e+00 5.1016e+00 3.1540e-01
## BREAKS 4.0303e-01 3.7636e-02 3.5098e-17 ***
## OFFLOADS -4.1628e-02 2.5752e-02 1.0988e-01
We get the linear regression model: \[ T_F = 5.15 + 0.40 * BREAKS - 0.04 * OFFLOADS \] This model is simpler, with only two independent variables, and its R^2 (0.6012) is only slightly lower than the full model's (0.6684). Overall a better model for its simplicity.
We can see that BREAKS has a much larger (and significant) coefficient than OFFLOADS, and that the collinearity between BREAKS and OFFLOADS is only weak.
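The collinearity claim can be checked numerically before consulting the pairs plot:

```r
# Correlation between the two predictors; a modest value
# indicates only mild collinearity between BREAKS and OFFLOADS
cor(moneyball$BREAKS, moneyball$OFFLOADS)
```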
# mypanel() and panel.cor() are defined in the sourced my_defs_u2.R
pairs(moneyball[, c("TF", "BREAKS", "OFFLOADS")],
      gap = 0.5, las = 1, pch = 21, bg = rgb(0, 0, 1, 0.25),
      panel = mypanel,
      lower.panel = function(...) panel.cor(..., color.bg = TRUE),
      main = "")
mtext(side=3, "pairs plot with correlation values", outer=TRUE, line=-1.2, font=2)
mtext(side=3, "Dashed lines are 'lm(y~x)' fits.\nCorrelation and scatterplot frames are color-coded on the strength of the correlation", outer=TRUE, line=-1.6, padj=1, cex=0.8, font=1)
In this last part we will try to make predictions for the 2017 season. The training data (moneyball) covers seasons up to and including 2015.
moneyball_test <- subset(baseball, Year == 2017)  # equivalent to Year >= 2017 & Year < 2018
Let's try to predict, using our model RS_reg_2, how many tries each team will score in the 2017 season. We use the predict() function here, giving it the model we just determined to be the best one and the new data, moneyball_test.
TriesPredictions <- predict(RS_reg_2, newdata = moneyball_test)
We can compute the out-of-sample R^2, a measure of how well the model predicts on test data. The R^2 of 0.6012 reported above is an in-sample R^2: it measures how well the model fits the training data. To assess the goodness of fit of the predictions, we need the out-of-sample R^2.
First we need to compute the sum of squared errors (SSE), i.e. the sum of the squared differences between the predicted and the actual number of tries:
SSE <- sum((TriesPredictions - moneyball_test$TF)^2)
We also need the total sum of squares (SST): the sum of the squared differences between the actual test-set values and the mean number of tries in the training set.
SST <- sum((mean(moneyball$TF) - moneyball_test$TF)^2)
The R^2 is then calculated as usual: one minus the sum of squared errors divided by the total sum of squares.
R2 <- 1 - SSE/SST
The out-of-sample R^2 is 0.5636.
RMSE <- sqrt(SSE/nrow(moneyball_test))
At 13.59, the test RMSE is somewhat higher than the residual standard error of 8.928 on the training data, shown in the summary below.
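The out-of-sample steps above can be wrapped in a small helper for reuse with other models; a minimal sketch (the function name is ours, not from the sourced script):

```r
# Out-of-sample R^2 and RMSE for a fitted lm on new data
oos_metrics <- function(model, newdata, response, train_mean) {
  pred   <- predict(model, newdata = newdata)
  actual <- newdata[[response]]
  sse <- sum((pred - actual)^2)         # sum of squared errors
  sst <- sum((train_mean - actual)^2)   # total sum of squares vs training mean
  c(R2 = 1 - sse / sst, RMSE = sqrt(sse / length(actual)))
}

oos_metrics(RS_reg_2, moneyball_test, "TF", mean(moneyball$TF))
```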
summary(RS_reg_2)
##
## Call:
## lm(formula = TF ~ BREAKS + OFFLOADS, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.516 -6.063 -1.720 5.958 22.568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.15374 5.10163 1.010 0.315
## BREAKS 0.40303 0.03764 10.709 <2e-16 ***
## OFFLOADS -0.04163 0.02575 -1.616 0.110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.928 on 81 degrees of freedom
## Multiple R-squared: 0.6012, Adjusted R-squared: 0.5914
## F-statistic: 61.06 on 2 and 81 DF, p-value: < 2.2e-16