Building a simple tennis prediction model using pmpackage | Peter McIntyre (2023)

This post talks you through how to build a model that predicts the results of individual tennis matches. I’ll build a simple toy model that you can extend to make your own predictions.

My aims are:

  • To give some insight into how my tennis models work
  • To share how my reusable code package, pmpackage, helps me build these models

A later post will talk about how to string these predictions together in a Monte Carlo tournament simulator, to let you start predicting the results of whole tournaments.

The stages I’ll go through are:

  • Loading the libraries
  • Fetching the data
  • Computing new characteristics
  • Checking the data quality
  • Splitting into development and validation samples
  • Building the model
  • Checking how the model performs

library(tidyverse)

I use the same functions and routines for multiple projects. These are saved on GitHub in pmpackage.

# Run this once to install the library
# devtools::install_github('quietsnooze/pmpackage')
library(pmpackage)

I get a lot of data from tennis-data.co.uk, a great site that pre-aggregates tennis results. You can download and clean the Excel files by hand, or use the automated script below:

wta_data <- pmpackage::pm_tennis_eloify_dataset(pmpackage::pm_tennis_fetchAllDatasets())

Each row of the resulting dataframe is a contest between a player and an opponent. These are grouped by:

  • match_date
  • tournament
  • round, and
  • surface

The result of the match is saved in actualResult with 1 being a win for the player, and 0 a loss.


wta_data %>% select(player_name, opponent_name, tournament, round, match_date, surface, actualResult) %>% sample_n(5) %>% kable()
| player_name | opponent_name | tournament | round | match_date | surface | actualResult |
|---|---|---|---|---|---|---|
| Hantuchova D. | Voegele S. | Barclays Dubai Tennis Championships | 1st Round | 2009-02-16 00:00:15 | Hard | 1 |
| Radwanska A. | Garbin T. | Sony Ericsson Open | 4th Round | 2007-03-26 00:02:05 | Hard | 0 |
| Jankovic J. | Gavrilova D. | Prudential Hong Kong Tennis Open | Quarterfinals | 2015-10-16 00:04:10 | Hard | 1 |
| Hlavackova A. | Jankovic J. | Topshelf Open | 2nd Round | 2015-06-11 00:00:31 | Grass | 0 |
| Mauresmo A. | Kucova K. | Wimbledon | 2nd Round | 2009-06-25 00:00:31 | Grass | 1 |

Comparison variables

All tennis matches involve two players: in the data, a player and an opponent. To predict the result of the match I need to know which of the two is stronger. Absolute measures of a player’s strength, such as rank, can be misleading. The player ranked number five in the world might be stronger than most players, but when they play someone in the top four they are an underdog.

Instead of a player’s rank, it is therefore more useful to consider the difference or the ratio between their rank and their opponent’s.

my_analysis_tbl <- pmpackage::pm_modelling_compare_player_opponent(wta_data,'rank')

This function creates a ratio and a difference based on the player’s rank and their opponent’s rank. In this case I’m interested in rank_difference.
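As a rough illustration (my own sketch of what I assume the helper computes, not the package’s actual code), the new columns are along these lines. The table below confirms the direction of the difference; the direction of the ratio is a guess:

# Assumed sketch of the comparison columns (not pmpackage's source)
my_analysis_tbl %>%
  mutate(rank_difference_check = player_rank - opponent_rank,  # matches the kable output below
         rank_ratio_check      = player_rank / opponent_rank)  # ratio direction assumed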

my_analysis_tbl %>% select(player_name, opponent_name, player_rank, opponent_rank, rank_difference) %>% head() %>% kable()
| player_name | opponent_name | player_rank | opponent_rank | rank_difference |
|---|---|---|---|---|
| Castano C. | Golds S. | 54 | 599 | -545 |
| Bremond S. | Benesova I. | 40 | 58 | -18 |
| Ivanova E. | Sequera M. | 183 | 103 | 80 |
| Bammer S. | Hingis M. | 49 | 7 | 42 |
| Camerin M.E. | Vesnina E. | 50 | 47 | 3 |
| Sugiyama A. | Ferguson S. | 26 | 184 | -158 |

Negative numbers indicate a player who is stronger (better ranked) than their opponent; positive numbers indicate a player who is weaker than their opponent.

Lifetime performance

The second variable is going to be lifetime performance. Again, I’ll use a helper function to save on coding.

my_analysis_tbl <- pmpackage::pm_modelling_lifetime_performance(df = my_analysis_tbl)

my_analysis_tbl %>%
  select(player_name, player_lifetime_win_pct, opponent_name, opponent_lifetime_win_pct) %>%
  sample_n(5) %>%
  kable()
| player_name | player_lifetime_win_pct | opponent_name | opponent_lifetime_win_pct |
|---|---|---|---|
| Cohen Aloro S. | 0.4000000 | Meusburger Y. | 0.6000000 |
| Giorgi C. | 0.5094340 | Vesnina E. | 0.5177515 |
| Mauresmo A. | 0.6417910 | Harkleroad A. | 0.4358974 |
| Svitolina E. | 0.1818182 | Lepchenko V. | 0.4152542 |
| Parmentier P. | 0.4457831 | Daniilidou E. | 0.4457831 |
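For intuition, the lifetime win percentage is roughly each player’s share of wins across their matches. A minimal sketch of that idea, assuming the whole dataset is used (the real pmpackage helper may also count matches where the player appears as the opponent, or restrict to matches before the current one):

# Assumed sketch: lifetime win percentage as the share of wins over all of a
# player's rows (the pmpackage helper may compute this differently)
lifetime_check <- wta_data %>%
  group_by(player_name) %>%
  summarise(player_lifetime_win_pct = mean(actualResult == 1))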

The function can also compare their strength on the different surfaces:

my_analysis_tbl <- pmpackage::pm_modelling_lifetime_performance(df = my_analysis_tbl, group_variables = 'surface')

my_analysis_tbl %>%
  select(surface, player_name, player_surface_lifetime_win_pct, player_lifetime_win_pct) %>%
  sample_n(5) %>%
  kable()
| surface | player_name | player_surface_lifetime_win_pct | player_lifetime_win_pct |
|---|---|---|---|
| Hard | Lisicki S. | 0.5769231 | 0.6028037 |
| Grass | Halep S. | 0.2000000 | 0.4954955 |
| Hard | Hampton J. | 0.4838710 | 0.4358974 |
| Hard | Kerber A. | 0.6503268 | 0.6481069 |
| Hard | Kvitova P. | 0.5217391 | 0.5000000 |

Compare lifetime surface strength between player and opponent

As with rank, rather than a measure of each player’s absolute strength I need to compare their strength relative to their opponent:

my_analysis_tbl <- my_analysis_tbl %>%
  pmpackage::pm_modelling_compare_player_opponent('surface_lifetime_win_pct')

my_analysis_tbl %>%
  select(player_name, opponent_name, player_surface_lifetime_win_pct, opponent_surface_lifetime_win_pct, surface_lifetime_win_pct_difference) %>%
  sample_n(5) %>%
  kable()
| player_name | opponent_name | player_surface_lifetime_win_pct | opponent_surface_lifetime_win_pct | surface_lifetime_win_pct_difference |
|---|---|---|---|---|
| Cornet A. | Parmentier P. | 0.4756098 | 0.4107143 | 0.0648955 |
| Vekic D. | Al Nabhani F. | 0.4615385 | 0.0000000 | 0.4615385 |
| Cornet A. | Pennetta F. | 0.6428571 | 0.6470588 | -0.0042017 |
| Petkovic A. | Kucova K. | 0.5801105 | 0.4444444 | 0.1356661 |
| Santangelo M. | Yan Z. | 0.5333333 | 0.3750000 | 0.1583333 |

Here I check for data quality issues and make some fixes.

pmpackage has a function, pm_dq_check_all_columns, that does this for me.
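In spirit it produces a per-column summary along these lines (a sketch of my own, assuming numeric columns; the real function may handle more cases):

# Rough sketch of a per-column data-quality summary (not pmpackage's code)
dq_summary_check <- function(df) {
  purrr::imap_dfr(df, function(v, nm) {
    tibble(col         = nm,
           class       = class(v)[1],
           num         = length(v),
           numMissing  = sum(is.na(v)),
           numInfinite = sum(is.infinite(v)),
           avgVal      = mean(v, na.rm = TRUE),
           minVal      = min(v, na.rm = TRUE),
           maxVal      = max(v, na.rm = TRUE))
  })
}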

my_analysis_tbl %>% sample_frac(size = 0.1) %>% select(actualResult, surface_lifetime_win_pct_difference, rank_difference) %>% pmpackage::pm_dq_check_all_columns() %>% kable()
| col | class | num | numMissing | numInfinite | avgVal | minVal | maxVal |
|---|---|---|---|---|---|---|---|
| actualResult | numeric | 2551 | 0 | 0 | 0.49 | 0 | 1 |
| surface_lifetime_win_pct_difference | numeric | 2551 | 178 | 0 | -0.006 | -1 | 1 |
| rank_difference | integer | 2551 | 10 | 0 | 4.733 | -1014 | 1085 |

The actualResult column looks good, with no missing values.

Both surface_lifetime_win_pct_difference and rank_difference need some fixes due to missing values.

Data fixes

For today’s purposes I’m going to set the difference to zero where it is missing.

fixVarToX <- function(v, x) {
  v <- ifelse(is.na(v) | is.nan(v) | is.infinite(v), x, v)
  v
}

my_analysis_tbl$surface_lifetime_win_pct_difference <- fixVarToX(my_analysis_tbl$surface_lifetime_win_pct_difference, 0)
my_analysis_tbl$rank_difference <- fixVarToX(my_analysis_tbl$rank_difference, 0)

Plot key variables

To help me check that the two variables I’ve built are predictive and intuitive I need to look at how they relate to the target variable, actualResult.


Once more, pmpackage has a time-saving function: pm_modelling_plot_variable. This plots the distribution, the Gini and the relationship between the variable and match results.

rank_difference

pmpackage::pm_modelling_plot_variable(df = my_analysis_tbl, dependent_var = actualResult, independent_var='rank_difference')

(Figure 1: diagnostic plots for rank_difference – distribution, Gini and relationship with actualResult)

rank_difference is clustered around zero (chart in the top right), and provides a fair degree of discrimination (the Gini metric in the bottom left). The bottom-right graph shows a slight downward trend, such that a bigger rank_difference (i.e. one where the player has a higher, and therefore worse, ranking than their opponent) means a lower chance of winning.

This all seems nice and plausible.

surface_lifetime_win_pct_difference

pmpackage::pm_modelling_plot_variable(df = my_analysis_tbl, dependent_var = actualResult, independent_var = 'surface_lifetime_win_pct_difference')

(Figure 2: diagnostic plots for surface_lifetime_win_pct_difference – distribution, Gini and relationship with actualResult)

Lifetime wins on the match surface is a highly predictive characteristic. A positive difference means that the player has a better record on the surface than their opponent, so it makes sense that bigger differences are associated with better win probabilities (bottom right chart).

For today I’ll simply split the dataset into two: a development dataset (for building the model) and a validation dataset (for checking it). This will help prevent overfitting to the development dataset.


dev_sample_flag <- rbinom(n = nrow(my_analysis_tbl), size = 1, prob = 0.7)
dev_sample <- my_analysis_tbl %>% filter(dev_sample_flag == 1)
oos_sample <- my_analysis_tbl %>% filter(dev_sample_flag == 0)

Now it’s time to build the simple model using the two new variables: surface_lifetime_win_pct_difference and rank_difference.

A simple logistic regression:

my_simple_model <- glm(data=dev_sample, actualResult ~ surface_lifetime_win_pct_difference + rank_difference, family='binomial')
summary(my_simple_model)
## 
## Call:
## glm(formula = actualResult ~ surface_lifetime_win_pct_difference + 
##     rank_difference, family = "binomial", data = dev_sample)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7599  -1.0889   0.0178   1.0886   3.3916  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          0.0029780  0.0158110   0.188    0.851    
## surface_lifetime_win_pct_difference  1.7047905  0.0826138  20.636   <2e-16 ***
## rank_difference                     -0.0059118  0.0002183 -27.083   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24737  on 17843  degrees of freedom
## Residual deviance: 22736  on 17841  degrees of freedom
## AIC: 22742
## 
## Number of Fisher Scoring iterations: 4
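To get a feel for what those coefficients mean, here’s a quick prediction for a made-up match (example_match and its values are invented for illustration): a player ranked 20 places better than their opponent, with a surface win percentage 10 points higher.

# Hypothetical example match (values invented for illustration)
example_match <- tibble(rank_difference = -20,
                        surface_lifetime_win_pct_difference = 0.10)
predict(my_simple_model, newdata = example_match, type = 'response')

A negative rank_difference and a positive surface_lifetime_win_pct_difference both push the predicted win probability above 0.5, which is what the coefficient signs suggest.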

The moment of truth. I want to see how the model performs on unseen data (i.e. data that wasn’t used to train it).

First I score up the unseen validation data using the simple model:

oos_sample$predictedValues <- predict(my_simple_model, newdata = oos_sample, type = 'response', se.fit = FALSE)
oos_sample$predictedWin <- ifelse(oos_sample$predictedValues > 0.5, 1, 0)

And I compute a Gini, comparing the simple model to the naive betting strategy of betting on the better ranked player:

oos_sample <- oos_sample %>%
  mutate(random_var_for_sorting = rnorm(n = nrow(oos_sample))) %>%   # don't want to accidentally benefit from pre-sorting
  arrange(random_var_for_sorting) %>%
  arrange(-predictedValues) %>%
  mutate(simple_model_cumsum_wins = cumsum(actualResult == 1),
         simple_model_cumsum_losses = cumsum(actualResult == 0),
         simple_model_pct_wins = simple_model_cumsum_wins / sum(actualResult == 1),
         simple_model_pct_losses = simple_model_cumsum_losses / sum(actualResult == 0),
         simple_model_auc = (simple_model_pct_wins) * (simple_model_pct_losses - dplyr::lag(simple_model_pct_losses, default = 0)),
         naive_prediction = if_else(player_rank < opponent_rank, 1, 0)) %>%
  arrange(random_var_for_sorting) %>%   # don't want to accidentally benefit from pre-sorting
  arrange(-naive_prediction) %>%
  mutate(naive_model_cumsum_wins = cumsum(actualResult == 1),
         naive_model_cumsum_losses = cumsum(actualResult == 0),
         naive_model_pct_wins = naive_model_cumsum_wins / sum(actualResult == 1),
         naive_model_pct_losses = naive_model_cumsum_losses / sum(actualResult == 0),
         naive_model_auc = (naive_model_pct_wins) * (naive_model_pct_losses - dplyr::lag(naive_model_pct_losses, default = 0)))

# compute the Gini
simple_model_auc <- sum(oos_sample$simple_model_auc)
simple_model_gini <- 1 - 2*(1 - simple_model_auc)
naive_model_auc <- sum(oos_sample$naive_model_auc)
naive_model_gini <- 1 - 2*(1 - naive_model_auc)

# plot everything
ggplot(data = oos_sample) +
  geom_line(aes(x = simple_model_pct_losses, y = simple_model_pct_wins), colour = 'tomato1') +
  geom_line(aes(x = naive_model_pct_losses, y = naive_model_pct_wins), colour = 'green') +
  geom_abline(slope = 1, intercept = 0) +
  pmpackage::pm_ggplot_theme() +
  ylab('Percentage of winners found') +
  xlab('Percentage of losers found') +
  ggtitle('Performance of the new model',
          subtitle = paste0('New model Gini = ', round(simple_model_gini, 2),
                            '. Naive model Gini = ', round(naive_model_gini, 2)))

(Figure 3: Performance of the new model – percentage of winners found vs percentage of losers found, new model vs naive strategy)

The model works! And it significantly outperforms the naive strategy of betting on the better ranked player.

This is good news: unless I can beat the naive strategy there’s not much hope of successfully predicting the winners of the tournaments.
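If you want to sanity-check the cumulative-sum calculation above, an essentially identical Gini can be recovered from the rank-sum (Wilcoxon) form of the AUC. This is a quick sketch of that standard identity; auc_check and gini_check are names introduced here for illustration only.

# Optional cross-check: AUC via the rank-sum (Wilcoxon) formula, then Gini = 2*AUC - 1
wins <- oos_sample$actualResult == 1
r <- rank(oos_sample$predictedValues)
auc_check <- (sum(r[wins]) - sum(wins) * (sum(wins) + 1) / 2) / (sum(wins) * sum(!wins))
gini_check <- 2 * auc_check - 1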


While the models I use to predict tennis tournaments are more advanced than this, they’re not a lot more advanced. Most of the work goes into repeatable functions that clean and analyse the data, and you are welcome to download and use these yourself.
