The skills I demonstrate here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
For my capstone project, I built various machine learning models to use data to predict the winners of matches on the professional Women's Tennis Association (WTA) Tour. I also constructed and deployed an elegant, simple-to-use Dash app that allows users to predict the winners of future WTA matches, display betting odds, and compare statistics between players. The question at the heart of this project, and at the heart of all sports predictions, is: What features must be considered to produce a consistently reliable model?
In the case of tennis, a simple, unscientific, and reasonably accurate method is to predict that the higher-ranked player will win every match--in other words, that there will be no upsets. In the data used for this project, this leads to a correct prediction 66.21% of the time. I considered achieving a predictive accuracy that exceeded this baseline of 66.21% a crucial measure of success, as a sophisticated machine-learning model should be able to outperform such a basic predictive method.
Due to the nature of the baseline, a successful model would need to be adept at the difficult task of predicting upsets--that is, at correctly predicting when a lower-ranked player will win. Given the unpredictability inherent to all professional sports, I aimed for an improvement of 4% to 5% on the baseline.
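The ranking baseline described above can be sketched in a few lines. This is a minimal illustration that assumes each match is reduced to a pair of rankings; the function name and the sample rankings are made up for the example and are not taken from the project data.

```python
def baseline_accuracy(matches):
    """Share of matches won by the better-ranked player.

    Each match is a (winner_rank, loser_rank) pair; a smaller number
    means a better ranking (1 is the top-ranked player).
    """
    correct = sum(1 for winner_rank, loser_rank in matches
                  if winner_rank < loser_rank)
    return correct / len(matches)


# Made-up rankings for illustration only, not taken from the real data.
sample_matches = [(3, 15), (40, 8), (1, 100), (22, 57), (60, 12)]
print(f"Baseline accuracy: {baseline_accuracy(sample_matches):.2%}")
```

Any model that simply learns "pick the better-ranked player" would land at this figure, which is why beating it requires getting at least some upsets right.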
Data and Feature Engineering
The datasets used for this project are made available by GitHub user JeffSackmann. These datasets contain detailed information for each match played on the WTA tour dating back to the 1970s, including the winner and loser of the match, the players' rankings, the court surface, the tournament level, date, and location, the score of the match, and serve statistics for the players in the match. I used data from only the past twenty years in my predictive models, as differences in racket technology and court surfaces prior to 2000 would likely bias the results.
The primary challenge that the data presented was the need to engineer features that were cumulative and chronological. I first arranged all matches in the dataset chronologically, and then engineered several new features that summarized aspects of a player's performance prior to their current match. The specific features I engineered were:
- Surface win percent: A player's win percent on the current match's court surface prior to the match.
- Level win percent: A player's win percent at the current match's tournament level prior to the match.
- Head-to-head: The number of matches won by the player against her current opponent prior to the match.
- Recent form: A player's overall win percent prior to the current match, plus a "penalty" of log10(1-(overall win %)+(last 6 months win %)).
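To illustrate what "cumulative and chronological" means in practice, here is a rough sketch of how a pre-match surface win percent could be computed. The dictionary keys, the function name, and the choice to report 0.0 for a player with no prior matches on a surface are assumptions made for the example; the original pipeline may well have used data frame operations instead.

```python
from collections import defaultdict


def add_surface_win_pct(matches):
    """Attach each player's pre-match win percent on the match surface.

    `matches` is a list of dicts with hypothetical keys "date", "surface",
    "winner", and "loser". Matches are processed in date order, and the
    running tallies are updated only after the features are recorded, so
    no feature can see the result of its own match.
    """
    wins = defaultdict(int)    # (player, surface) -> wins so far
    played = defaultdict(int)  # (player, surface) -> matches so far

    def pct(player, surface):
        n = played[(player, surface)]
        return wins[(player, surface)] / n if n else 0.0

    rows = []
    for match in sorted(matches, key=lambda m: m["date"]):
        surface = match["surface"]
        winner, loser = match["winner"], match["loser"]
        rows.append({
            **match,
            "winner_surface_win_pct": pct(winner, surface),
            "loser_surface_win_pct": pct(loser, surface),
        })
        # Update the tallies after the feature values are stored.
        wins[(winner, surface)] += 1
        played[(winner, surface)] += 1
        played[(loser, surface)] += 1
    return rows


# Made-up matches for illustration.
history = [
    {"date": "2019-01-07", "surface": "hard", "winner": "A", "loser": "B"},
    {"date": "2019-01-14", "surface": "hard", "winner": "A", "loser": "C"},
    {"date": "2019-02-04", "surface": "hard", "winner": "B", "loser": "A"},
]
for row in add_surface_win_pct(history):
    print(row["date"], row["winner_surface_win_pct"], row["loser_surface_win_pct"])
```

The same running-tally pattern would apply to level win percent (keyed by tournament level) and head-to-head (keyed by the pair of players).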
Background Information of Our Analysis
"Surface" refers to the material of which the court is made, and can take on four values: hard, grass, clay, and carpet. Different court surfaces favor different styles of play--for instance, defensive players often perform well on clay, while powerful, aggressive players excel on grass--and so court surface is a crucial consideration when predicting the winner of a match. "Tournament level" refers to any of eight tiers of professional tournament: Grand Slam, Premier, Premier Mandatory, WTA Finals, Olympics, Fed Cup, International, and Challenger.
Certain players, Serena Williams being a prime example, are known for performing exceptionally well at more prestigious events such as Grand Slams, while performing uncharacteristically poorly at smaller events like Premiers and Internationals. Thus, tournament level is also important to consider when making a prediction. "Recent form" quantifies whether a player is on a hot streak, in a slump, or playing close to her usual level. The added "penalty" increases a player's career win percent if she is playing better than average in recent months, and decreases it if she is playing worse than average in recent months.
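The recent form calculation can be written out directly from the formula given in the feature list. The sketch below assumes both win percents are expressed as fractions between 0 and 1; the function name and the sample values are illustrative.

```python
import math


def recent_form(overall_win_pct, last_6_months_win_pct):
    """Overall win percent plus log10(1 - overall + recent).

    Both inputs are fractions between 0 and 1.
    """
    return overall_win_pct + math.log10(
        1 - overall_win_pct + last_6_months_win_pct
    )


print(recent_form(0.60, 0.60))  # recent play matches career level
print(recent_form(0.60, 0.75))  # hot streak: value rises above 0.60
print(recent_form(0.60, 0.45))  # slump: value falls below 0.60
```

When the two percents are equal, the logarithm's argument is 1 and the adjustment vanishes. Note that the argument reaches 0 in the extreme case of a perfect career record combined with no wins in the last six months, where the logarithm is undefined; how that case should be handled is not covered here.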
Inherent to these new features was a flaw that needed to be addressed: These features contained a large number of zeros, which could potentially bias the results of any predictive model. Zeros were particularly rampant in rows featuring players who had not played many matches in their careers, so, to combat this issue, I removed all observations containing a player who had played fewer than 100 matches up to that point. I also removed all matches that ended in a "retirement"--that is, a player being forced to forfeit due to injury or illness.
Unfortunately, I found it necessary to remove all matches from the year 2020 from the data set, as player rankings have effectively been frozen due to COVID and may not reflect a player's "true" standing in the sport. For instance, Ashleigh Barty is currently ranked #1 in the world despite having not played a professional match for over 11 months due to safety concerns and stringent travel restrictions in her native Australia. This paring down of the data still left me with ample observations (over 18,000 matches) on which to base my predictive models.
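The three filters just described (a minimum of 100 prior matches for both players, no retirements, and no matches from 2020) could be combined along these lines. The dictionary keys are hypothetical, and treating "RET" in the score string as the retirement marker is an assumption about how the data encodes retirements.

```python
def filter_matches(rows, min_prior_matches=100, excluded_year=2020):
    """Keep rows that pass the three filters described above.

    Each row is a dict with hypothetical keys: "year", "score", and
    "winner_prior_matches" / "loser_prior_matches" (career matches
    played before this one). Treating "RET" in the score as the
    retirement marker is an assumption about the data's encoding.
    """
    kept = []
    for row in rows:
        experienced = (row["winner_prior_matches"] >= min_prior_matches
                       and row["loser_prior_matches"] >= min_prior_matches)
        retired = "RET" in row["score"]
        if experienced and not retired and row["year"] != excluded_year:
            kept.append(row)
    return kept


# Made-up rows for illustration.
sample = [
    {"year": 2019, "score": "6-4 6-3",
     "winner_prior_matches": 250, "loser_prior_matches": 180},
    {"year": 2019, "score": "6-2 3-0 RET",
     "winner_prior_matches": 250, "loser_prior_matches": 180},
    {"year": 2019, "score": "7-5 6-4",
     "winner_prior_matches": 300, "loser_prior_matches": 42},
    {"year": 2020, "score": "6-1 6-1",
     "winner_prior_matches": 300, "loser_prior_matches": 200},
]
kept = filter_matches(sample)
print(len(kept), "of", len(sample), "rows kept")
```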
It was also important to obscure the winner and the loser of each match (as this is what I aim to predict!). To this end, I changed all occurrences of the strings "winner" and "loser" among the feature names to "player_1" and "player_2," with player_1 referring to the player in each row whose name comes first alphabetically.
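A rough sketch of this renaming step follows. The `winner_`/`loser_` key prefixes and the `player_1_won` target column are assumptions made for illustration (the text does not say how the target was encoded), and the names in the example are invented.

```python
def anonymize_match(match):
    """Re-key a winner/loser record as player_1/player_2.

    player_1 is whichever name sorts first alphabetically. The added
    "player_1_won" target (1 or 0) is a hypothetical encoding.
    """
    winner_first = match["winner_name"] < match["loser_name"]
    first, second = ("winner", "loser") if winner_first else ("loser", "winner")
    out = {}
    for key, value in match.items():
        if key.startswith(first + "_"):
            out["player_1_" + key[len(first) + 1:]] = value
        elif key.startswith(second + "_"):
            out["player_2_" + key[len(second) + 1:]] = value
        else:
            out[key] = value  # match-level fields such as surface
    out["player_1_won"] = int(winner_first)
    return out


# Invented players for illustration.
row = anonymize_match({
    "surface": "clay",
    "winner_name": "Zoe Example", "winner_rank": 3,
    "loser_name": "Amy Sample", "loser_rank": 17,
})
print(row)
```

Because the alphabetical ordering is unrelated to who won, the resulting target is not trivially recoverable from the column layout.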
Finally, the data contained several missing values, particularly in the "player ranking" features, and this required some creativity to impute. I was able to impute many of these missing values by merging with a separate WTA Rankings data frame. However, many of these values were not missing at all--rather, the "missing" values were indications that the player was unranked at the time of the match, which can occur when a player has not played a match in over 12 months or has come out of retirement.
Using the WTA Rankings data frame, I located these players' rankings at the time of their last match before their extended layoffs, added 1 for each month of their layoff, and imputed the missing ranking with the result. The WTA Rankings data frame had some missing values of its own, and in this case, I imputed instead with players' rankings from the week before, as it is quite unusual for a player's ranking to vary greatly in the span of one week.
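The two imputation rules can be expressed compactly. This is only a sketch of the logic as described: the function names are invented, and a missing weekly ranking is represented by `None` for simplicity.

```python
def impute_layoff_rank(last_rank, layoff_months):
    """Rank before the layoff plus one place per month away."""
    return last_rank + layoff_months


def fill_from_previous_week(weekly_ranks):
    """Replace each missing weekly rank (None) with the prior week's value."""
    filled = []
    previous = None
    for rank in weekly_ranks:
        if rank is None:
            rank = previous  # stays None if there is no earlier week
        filled.append(rank)
        previous = rank
    return filled


# Illustrative values only.
print(impute_layoff_rank(last_rank=25, layoff_months=14))
print(fill_from_previous_week([None, 12, None, None, 10]))
```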
Data Analysis and Results
I tested predictive accuracy for a variety of machine learning algorithms: gradient boosting, random forest, logistic regression, Gaussian naive Bayes, and linear discriminant analysis. Before running my predictive models, I divided my data into an 80/20 train-test split. The train and test accuracy for each model is presented below.
Each machine learning model exceeded the benchmark of 66.21%, and the tree-based models in particular exceeded it by the margin of 4% to 5% that I hoped to achieve at the outset of this project. I wanted to examine the influence that my engineered features had on the predictions, so for the tree-based models, I calculated feature importances.
For both the random forest and gradient boosting classifiers, level and surface win percent, as well as recent form, factored heavily into the predictions, while my remaining engineered feature, head-to-head, did not appear to hold much sway in predicting winners.
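The workflow described in this section (an 80/20 split, five classifiers, and feature importances for the tree-based models) might look roughly like the following with scikit-learn. The data here is synthetic, generated with `make_classification` as a stand-in for the match features, and all hyperparameters are library defaults rather than the settings used in the project, so the printed accuracies say nothing about the WTA results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the match features; not the WTA data.
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           random_state=0)

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "Gaussian naive Bayes": GaussianNB(),
    "linear discriminant analysis": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }
    print(f"{name}: train={results[name]['train']:.3f}, "
          f"test={results[name]['test']:.3f}")

# Tree-based models expose one importance score per feature.
importances = models["gradient boosting"].feature_importances_
print(importances)
```

Comparing train and test accuracy side by side, as the loop does, is also a quick check for overfitting in the tree-based models.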
Gradient Boosting Classifier
Returning to the sentiment expressed in the introduction--that a successful model must be able to predict upsets--I wanted to look more in-depth into the performance of the most accurate model, the gradient boosting classifier (GBC), in an effort to explain its strengths and its drawbacks. The following charts display the GBC's predictive accuracy for matches in which the higher-ranked player won (non-upsets) and for matches in which the lower-ranked player won (upsets) across each surface and tournament level.
There is little variation in the model's success at predicting non-upsets across all surfaces and tournament levels. Highly noteworthy, however, is the model's ability to predict upsets at four particular tournament levels: Challenger, Olympics, WTA Finals, and Fed Cup. In fact, the overall predictive accuracy at these events was 77.42%, much higher than the 66.21% baseline.
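A breakdown like the one in these charts can be produced by grouping predictions by tournament level and by whether the match was an upset. The sketch below uses hypothetical record keys and made-up predictions; only the level codes F and C follow the notation used in this post.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Accuracy split by tournament level and by upset / non-upset.

    Each record is a dict with hypothetical keys "level", "is_upset"
    (True when the lower-ranked player actually won), and "correct"
    (True when the model picked the actual winner).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        key = (rec["level"], "upset" if rec["is_upset"] else "non-upset")
        totals[key] += 1
        hits[key] += int(rec["correct"])
    return {key: hits[key] / totals[key] for key in totals}


# Made-up predictions for illustration.
records = [
    {"level": "F", "is_upset": False, "correct": True},
    {"level": "F", "is_upset": False, "correct": True},
    {"level": "F", "is_upset": True, "correct": False},
    {"level": "C", "is_upset": True, "correct": True},
    {"level": "C", "is_upset": True, "correct": False},
]
for key, acc in accuracy_by_group(records).items():
    print(key, f"{acc:.2f}")
```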
Tournament Level Win Percents
I was curious as to what sets these particular tournaments apart, so I examined more closely the distributions of two particular variables: the difference between players' tournament level win percents and the difference between the logarithm of players' rankings.
For the four tournament levels in question, D, F, O, and C (corresponding to Fed Cup, WTA Finals, Olympics, and Challenger, respectively), the distributions are slightly right-skewed, while the others are more symmetrical. This indicates that tournament level win percent is a stronger predictor at these particular levels than it is on average. When looking at the difference between players' rankings at various tournament levels, we find a much smaller range at levels F, O, and C, in particular.
Log of Players
With less variation in player ranking at these levels, ranking becomes a less important feature, and other features, particularly tournament level win percent, "pick up the slack" in the predictive model. In short, the GBC performed best at tournaments where ranking mattered least. This led me to consider re-running my predictive models without including player ranking as a feature, but, strangely, upon attempting this, the train and test accuracies actually decreased for all models.
Finally, I selected eight predictions made by the GBC about particular noteworthy matches from the past 20 years of professional women's tennis. Four of these matches, correctly predicted by the model, were won by the much lower-ranked player, and illustrate the ability of the model to weigh other important features, particularly recent form, surface win percent, and level win percent, against player ranking.
The four other matches illustrate a potential drawback of the model: all of these matches were won by the higher-ranked player, while the model incorrectly predicted an upset. In these cases, the player's past performance on the given court surface and tournament level perhaps weighed too heavily in the prediction.
Given the success of the predictive models, I was excited to translate my work into a Dash app that allows users to make predictions about future WTA matches.
Such a platform should display information that is meaningful for the purpose of sports betting or fantasy sports, so I included betting odds as part of the prediction, as well as a detailed visual comparison between player statistics. The logistic regression classifier was ideal for this purpose, as it is simple to extract the probabilities associated with each individual classification. While this model is not as successful as the tree-based models, its predictive accuracy exceeds the 66.21% baseline by over 3%, and thus it still provides meaningful insights to users of the app. Below is a picture of the user interface.
The app allows the user to select two players, the court surface, and the tournament level. It then displays the predicted winner of a hypothetical next match between these players, as well as her odds of winning. The user can choose to compare visually one of five different statistics: recent form, surface win percent, tournament level win percent, the head-to-head between the two selected players, and the rankings of the selected players.
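One way to turn a predicted win probability into displayed odds is shown below. The post does not say which odds format the app uses, so both the decimal and American (moneyline) conversions here are assumptions, and they give "fair" odds with no bookmaker margin. With a scikit-learn logistic regression, the probability itself would typically come from `predict_proba`.

```python
def decimal_odds(probability):
    """Fair decimal odds: total payout per unit staked."""
    return 1 / probability


def american_odds(probability):
    """Fair American (moneyline) odds, rounded to the nearest integer.

    Expects a probability strictly between 0 and 1.
    """
    if probability > 0.5:
        # Favorite: amount to stake in order to win 100.
        return -round(100 * probability / (1 - probability))
    # Underdog (or even money): amount won on a stake of 100.
    return round(100 * (1 - probability) / probability)


# Illustrative probabilities, not model output.
for p in (0.80, 0.25):
    print(p, decimal_odds(p), american_odds(p))
```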
As previously mentioned, the data on which the logistic regression model is based contains only matches in which both players have played at least 100 professional matches, so users are only able to select such players in the app.