This post talks you through how to build a model that predicts individual tennis matches. I’ll build a simple toy model that you can extend to make your own predictions.

My aims are:

- To give some insight into how my tennis models work
- To share how my reusable code package, pmpackage, helps me build these models

A later post will talk about how to string these predictions together in a Monte Carlo tournament simulator, to let you start predicting the results of whole tournaments.

The stages I’ll go through are:

- Loading the libraries
- Fetching the data
- Computing new characteristics
- Checking the data quality
- Splitting into development and validation samples
- Building the model
- Checking how the model performs

```r
library(tidyverse)
```

I use the same functions and routines across multiple projects. These are saved on GitHub in pmpackage.

```r
# Run this once to install the library
# devtools::install_github('quietsnooze/pmpackage')
library(pmpackage)
```

I get a lot of data from tennis-data.co.uk, a great site that pre-aggregates tennis results. You can download and clean the Excel files by hand, or use the automated script below:

```r
wta_data <- pmpackage::pm_tennis_eloify_dataset(pmpackage::pm_tennis_fetchAllDatasets())
```

Each row of the resulting dataframe is a contest between a *player* and an *opponent*. These are grouped by:

- match_date
- tournament
- round, and
- surface

The result of the match is saved in *actualResult* with *1* being a win for the player, and *0* a loss.

```r
wta_data %>%
  select(player_name, opponent_name, tournament, round, match_date, surface, actualResult) %>%
  sample_n(5) %>%
  kable()
```

player_name | opponent_name | tournament | round | match_date | surface | actualResult |
---|---|---|---|---|---|---|
Hantuchova D. | Voegele S. | Barclays Dubai Tennis Championships | 1st Round | 2009-02-16 00:00:15 | Hard | 1 |
Radwanska A. | Garbin T. | Sony Ericsson Open | 4th Round | 2007-03-26 00:02:05 | Hard | 0 |
Jankovic J. | Gavrilova D. | Prudential Hong Kong Tennis Open | Quarterfinals | 2015-10-16 00:04:10 | Hard | 1 |
Hlavackova A. | Jankovic J. | Topshelf Open | 2nd Round | 2015-06-11 00:00:31 | Grass | 0 |
Mauresmo A. | Kucova K. | Wimbledon | 2nd Round | 2009-06-25 00:00:31 | Grass | 1 |

## Comparison variables

All tennis matches involve two players, a *player* and an *opponent*. To predict the result of the match I need to know which of the two is stronger. Absolute measures of a player’s strength, such as *rank*, can be misleading. The player ranked number five in the world might be stronger than most players, but when they play someone in the top four they are the underdog.

Instead of a player’s *rank*, it is therefore more useful to consider the *difference* or the *ratio* between their rank and their opponent’s.

```r
my_analysis_tbl <- pmpackage::pm_modelling_compare_player_opponent(wta_data, 'rank')
```

This function creates a ratio and a difference based on the player’s rank and their opponent’s rank. In this case I’m interested in *rank_difference*.
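The helper lives in pmpackage, but the core of what I assume it computes is easy to sketch in plain dplyr. The difference matches the rank_difference values in the table; the direction of the ratio is my guess:

```r
library(dplyr)

# A rough sketch of what I take pm_modelling_compare_player_opponent('rank')
# to do: build a difference and a ratio from the player and opponent columns.
toy <- tibble::tibble(player_rank = c(54, 183), opponent_rank = c(599, 103))

toy %>%
  mutate(rank_difference = player_rank - opponent_rank,  # e.g. 54 - 599 = -545
         rank_ratio = player_rank / opponent_rank)       # direction is an assumption
```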

```r
my_analysis_tbl %>%
  select(player_name, opponent_name, player_rank, opponent_rank, rank_difference) %>%
  head() %>%
  kable()
```

player_name | opponent_name | player_rank | opponent_rank | rank_difference |
---|---|---|---|---|
Castano C. | Golds S. | 54 | 599 | -545 |
Bremond S. | Benesova I. | 40 | 58 | -18 |
Ivanova E. | Sequera M. | 183 | 103 | 80 |
Bammer S. | Hingis M. | 49 | 7 | 42 |
Camerin M.E. | Vesnina E. | 50 | 47 | 3 |
Sugiyama A. | Ferguson S. | 26 | 184 | -158 |

Negative numbers indicate a player who is stronger than their opponent; positive numbers indicate a player who is weaker.

## Lifetime performance

The second variable is going to be lifetime performance. Again, I’ll use a helper function to save on coding.

```r
my_analysis_tbl <- pmpackage::pm_modelling_lifetime_performance(df = my_analysis_tbl)
my_analysis_tbl %>%
  select(player_name, player_lifetime_win_pct, opponent_name, opponent_lifetime_win_pct) %>%
  sample_n(5) %>%
  kable()
```

player_name | player_lifetime_win_pct | opponent_name | opponent_lifetime_win_pct |
---|---|---|---|
Cohen Aloro S. | 0.4000000 | Meusburger Y. | 0.6000000 |
Giorgi C. | 0.5094340 | Vesnina E. | 0.5177515 |
Mauresmo A. | 0.6417910 | Harkleroad A. | 0.4358974 |
Svitolina E. | 0.1818182 | Lepchenko V. | 0.4152542 |
Parmentier P. | 0.4457831 | Daniilidou E. | 0.4457831 |
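The exact definition of *pm_modelling_lifetime_performance* lives in pmpackage; my assumption is that it computes each player’s win rate over their *previous* matches, so the current result doesn’t leak into the feature. A minimal sketch of that idea:

```r
library(dplyr)

# Toy history for one player: win percentage over matches played so far,
# lagged so the current match's result isn't included (assumed behaviour).
toy <- tibble::tibble(
  player_name  = c("A", "A", "A"),
  match_date   = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  actualResult = c(1, 0, 1)
)

toy %>%
  arrange(player_name, match_date) %>%
  group_by(player_name) %>%
  mutate(player_lifetime_win_pct = dplyr::lag(cummean(actualResult))) %>%  # NA before a first match
  ungroup()
```

Passing a `group_variables` argument would then presumably just add those columns to the `group_by`.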

The function can also compare their strength on the different surfaces:

```r
my_analysis_tbl <- pmpackage::pm_modelling_lifetime_performance(df = my_analysis_tbl,
                                                                group_variables = 'surface')
my_analysis_tbl %>%
  select(surface, player_name, player_surface_lifetime_win_pct, player_lifetime_win_pct) %>%
  sample_n(5) %>%
  kable()
```

surface | player_name | player_surface_lifetime_win_pct | player_lifetime_win_pct |
---|---|---|---|
Hard | Lisicki S. | 0.5769231 | 0.6028037 |
Grass | Halep S. | 0.2000000 | 0.4954955 |
Hard | Hampton J. | 0.4838710 | 0.4358974 |
Hard | Kerber A. | 0.6503268 | 0.6481069 |
Hard | Kvitova P. | 0.5217391 | 0.5000000 |

## Compare lifetime surface strength between player and opponent

As with *rank*, rather than a measure of each player’s absolute strength I need to compare their strength relative to their opponent:

```r
my_analysis_tbl <- my_analysis_tbl %>%
  pmpackage::pm_modelling_compare_player_opponent('surface_lifetime_win_pct')
my_analysis_tbl %>%
  select(player_name, opponent_name, player_surface_lifetime_win_pct,
         opponent_surface_lifetime_win_pct, surface_lifetime_win_pct_difference) %>%
  sample_n(5) %>%
  kable()
```

player_name | opponent_name | player_surface_lifetime_win_pct | opponent_surface_lifetime_win_pct | surface_lifetime_win_pct_difference |
---|---|---|---|---|
Cornet A. | Parmentier P. | 0.4756098 | 0.4107143 | 0.0648955 |
Vekic D. | Al Nabhani F. | 0.4615385 | 0.0000000 | 0.4615385 |
Cornet A. | Pennetta F. | 0.6428571 | 0.6470588 | -0.0042017 |
Petkovic A. | Kucova K. | 0.5801105 | 0.4444444 | 0.1356661 |
Santangelo M. | Yan Z. | 0.5333333 | 0.3750000 | 0.1583333 |

## Data quality

Next I check for data quality issues and make some fixes. pmpackage has a function, *pm_dq_check_all_columns*, that does this for me.

```r
my_analysis_tbl %>%
  sample_frac(size = 0.1) %>%
  select(actualResult, surface_lifetime_win_pct_difference, rank_difference) %>%
  pmpackage::pm_dq_check_all_columns() %>%
  kable()
```

col | class | num | numMissing | numInfinite | avgVal | minVal | maxVal |
---|---|---|---|---|---|---|---|
actualResult | numeric | 2551 | 0 | 0 | 0.490 | 0 | 1 |
surface_lifetime_win_pct_difference | numeric | 2551 | 178 | 0 | -0.006 | -1 | 1 |
rank_difference | integer | 2551 | 10 | 0 | 4.733 | -1014 | 1085 |

The *actualResult* column looks good, with no missing values.

Both *surface_lifetime_win_pct_difference* and *rank_difference* need some fixes due to missing values.

## Data fixes

For today’s purposes I’m going to set the difference to zero where it is missing.

```r
fixVarToX <- function(v, x) {
  ifelse(is.na(v) | is.nan(v) | is.infinite(v), x, v)
}
my_analysis_tbl$surface_lifetime_win_pct_difference <-
  fixVarToX(my_analysis_tbl$surface_lifetime_win_pct_difference, 0)
my_analysis_tbl$rank_difference <- fixVarToX(my_analysis_tbl$rank_difference, 0)
```
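As a quick sanity check of the helper (redefined here so the snippet stands alone): any NA, NaN or infinite entry becomes the fallback value, and everything else passes through unchanged.

```r
# Same helper as above, repeated so this snippet is self-contained
fixVarToX <- function(v, x) {
  ifelse(is.na(v) | is.nan(v) | is.infinite(v), x, v)
}

fixVarToX(c(1.5, NA, NaN, Inf, -2), 0)  # 1.5 0.0 0.0 0.0 -2.0
```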

## Plot key variables

To help me check that the two variables I’ve built are predictive and intuitive I need to look at how they relate to the target variable, *actualResult*.

Once more, pmpackage has a time-saving function: *pm_modelling_plot_variable*. This plots the distribution, the Gini and the relationship between the variable and match results.
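For reference, the Gini it reports is (I assume) the usual rescaling of AUC, Gini = 2 × AUC − 1, where AUC is the probability that a randomly chosen winner is scored above a randomly chosen loser. A toy version:

```r
# Gini from raw scores via the Mann-Whitney formulation of AUC
# (my assumption about how pmpackage defines it; 0 = no skill, 1 = perfect)
gini_from_scores <- function(scores, outcomes) {
  wins   <- scores[outcomes == 1]
  losses <- scores[outcomes == 0]
  # each winner-loser pair scores 1 if the winner is ranked higher, 0.5 on ties
  auc <- mean(outer(wins, losses, `>`) + 0.5 * outer(wins, losses, `==`))
  2 * auc - 1
}

gini_from_scores(c(0.9, 0.8, 0.3, 0.2), c(1, 0, 1, 0))  # 0.5
```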

### rank_difference

```r
pmpackage::pm_modelling_plot_variable(df = my_analysis_tbl,
                                      dependent_var = actualResult,
                                      independent_var = 'rank_difference')
```

The rank_difference distribution is clustered around zero (chart in the top right), and the variable provides a fair degree of discrimination (the Gini metric in the bottom left). The bottom-right graph shows a slight downward trend: a bigger rank_difference (i.e. one where the player has a higher, and therefore worse, ranking than their opponent) means a lower chance of winning.

This all seems nice and plausible.

### surface_lifetime_win_pct_difference

```r
pmpackage::pm_modelling_plot_variable(df = my_analysis_tbl,
                                      dependent_var = actualResult,
                                      independent_var = 'surface_lifetime_win_pct_difference')
```

Lifetime wins on the match surface is a highly predictive characteristic. A positive difference means that the player has a better record on the surface than their opponent, so it makes sense that bigger differences are associated with better win probabilities (bottom-right chart).

## Development and validation samples

For today I’ll simply split the dataset into two: a development dataset (for building the model) and a validation dataset (for checking it). This helps prevent overfitting to the development data.

```r
dev_sample_flag <- rbinom(n = nrow(my_analysis_tbl), size = 1, prob = 0.7)
dev_sample <- my_analysis_tbl %>% filter(dev_sample_flag == 1)
oos_sample <- my_analysis_tbl %>% filter(dev_sample_flag == 0)
```

## Building the model

Now it’s time to build the simple model using the two new variables: *surface_lifetime_win_pct_difference* and *rank_difference*.

A simple logistic regression:

```r
my_simple_model <- glm(data = dev_sample,
                       actualResult ~ surface_lifetime_win_pct_difference + rank_difference,
                       family = 'binomial')
```

```r
summary(my_simple_model)
```

```
## Call:
## glm(formula = actualResult ~ surface_lifetime_win_pct_difference +
##     rank_difference, family = "binomial", data = dev_sample)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.7599  -1.0889   0.0178   1.0886   3.3916
##
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          0.0029780  0.0158110   0.188    0.851
## surface_lifetime_win_pct_difference  1.7047905  0.0826138  20.636   <2e-16 ***
## rank_difference                     -0.0059118  0.0002183 -27.083   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 24737  on 17843  degrees of freedom
## Residual deviance: 22736  on 17841  degrees of freedom
## AIC: 22742
##
## Number of Fisher Scoring iterations: 4
```
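To make those coefficients concrete, here’s a hypothetical match scored by hand: a player ranked 50 places better than their opponent (rank_difference = −50) with a surface win-pct edge of +0.1. The coefficient values are copied from the summary above; `plogis` is base R’s logistic function.

```r
# Coefficients copied from the model summary
intercept <- 0.0029780
b_surface <- 1.7047905
b_rank    <- -0.0059118

# Hypothetical match (illustration only): better-ranked player with a
# modest surface-record edge
log_odds <- intercept + b_surface * 0.1 + b_rank * (-50)
plogis(log_odds)  # about 0.62: a modest favourite
```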

## Checking how the model performs

The moment of truth. I want to see how the model performs on unseen data (i.e. data that wasn’t used to train it).

First I score up the unseen validation data using the simple model:

```r
oos_sample$predictedValues <- predict(my_simple_model, newdata = oos_sample,
                                      type = 'response', se.fit = FALSE)
oos_sample$predictedWin <- ifelse(oos_sample$predictedValues > 0.5, 1, 0)
```

And I compute a Gini, comparing the simple model to the naive betting strategy of always backing the better-ranked player:

```r
oos_sample <- oos_sample %>%
  mutate(random_var_for_sorting = rnorm(n = nrow(oos_sample))) %>%
  arrange(random_var_for_sorting) %>%  # don't want to accidentally benefit from pre-sorting
  arrange(-predictedValues) %>%
  mutate(simple_model_cumsum_wins = cumsum(actualResult == 1),
         simple_model_cumsum_losses = cumsum(actualResult == 0),
         simple_model_pct_wins = simple_model_cumsum_wins / sum(actualResult == 1),
         simple_model_pct_losses = simple_model_cumsum_losses / sum(actualResult == 0),
         simple_model_auc = simple_model_pct_wins *
           (simple_model_pct_losses - dplyr::lag(simple_model_pct_losses, default = 0)),
         naive_prediction = if_else(player_rank < opponent_rank, 1, 0)) %>%
  arrange(random_var_for_sorting) %>%  # don't want to accidentally benefit from pre-sorting
  arrange(-naive_prediction) %>%
  mutate(naive_model_cumsum_wins = cumsum(actualResult == 1),
         naive_model_cumsum_losses = cumsum(actualResult == 0),
         naive_model_pct_wins = naive_model_cumsum_wins / sum(actualResult == 1),
         naive_model_pct_losses = naive_model_cumsum_losses / sum(actualResult == 0),
         naive_model_auc = naive_model_pct_wins *
           (naive_model_pct_losses - dplyr::lag(naive_model_pct_losses, default = 0)))

# compute the Ginis
simple_model_auc <- sum(oos_sample$simple_model_auc)
simple_model_gini <- 1 - 2 * (1 - simple_model_auc)
naive_model_auc <- sum(oos_sample$naive_model_auc)
naive_model_gini <- 1 - 2 * (1 - naive_model_auc)

# plot everything
ggplot(data = oos_sample) +
  geom_line(aes(x = simple_model_pct_losses, y = simple_model_pct_wins), colour = 'tomato1') +
  geom_line(aes(x = naive_model_pct_losses, y = naive_model_pct_wins), colour = 'green') +
  geom_abline(slope = 1, intercept = 0) +
  pmpackage::pm_ggplot_theme() +
  ylab('Percentage of winners found') +
  xlab('Percentage of losers found') +
  ggtitle('Performance of the new model',
          subtitle = paste0('New model Gini = ', round(simple_model_gini, 2),
                            '. Naive model Gini = ', round(naive_model_gini, 2)))
```

The model works! And it significantly outperforms the naive strategy of betting on the better ranked player.

This is good news: unless I can beat the naive strategy there’s not much hope of successfully predicting the winners of the tournaments.

While the models I use to predict tennis tournaments are more advanced than this, they’re not *a lot* more advanced. Most of the work goes into repeatable functions that clean and analyse the data. And these you are welcome to download and use yourself.