Machine learning for tennis forecasting
Mathematical modelling of tennis is rapidly gaining popularity.
Every year new analytical models and services appear, competing with each other on how accurately they forecast the outcomes of tennis matches.
The driver is the desire to profit from the fast-growing online sports betting market: the total amount bet on a single professional tennis match often reaches millions of dollars.
In this review I will consider the basic mathematical methods of tennis forecasting: hierarchical Markov models and machine learning algorithms. I will also examine the cases of IBM, Microsoft and one Russian service that use machine learning to predict the results of tennis matches.
Introduction to the problem of tennis prediction
Professional tennis is both a great spectacle and big money. The Association of Tennis Professionals (ATP) holds more than 60 professional tournaments in 30 countries each year. The TV broadcast of Andy Murray versus Milos Raonic in the 2016 Wimbledon final was watched by over 13.3 million people in the UK alone. Betting on tennis is catching up with football. On Betfair, the world's largest online betting exchange, total bets on the Murray-Djokovic match in the 2013 Wimbledon final amounted to $63 million. The potential profits and scientific interest have led to a flurry of research into tennis match prediction algorithms.
The scoring system in tennis has a hierarchical structure: a match consists of sets, which are made up of games, which are made up of individual points. Most modern tennis forecasting approaches use this structure, together with Markov chains, to derive hierarchical expressions for a player's probability of winning a match. Assuming that points in tennis are independent and identically distributed (IID), the only inputs needed to obtain these expressions are each player's probability of winning a point on serve. From these basic statistics, easily obtained from historical data on the Internet, one can calculate the probability of each player winning a game, then a set, and finally a match.
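As an illustration of the lowest level of this hierarchy, here is a minimal sketch of the game-level calculation under the IID assumption. The function name `game_win_probability` is my own and does not come from any of the services discussed here. The formula sums the ways of winning a game before deuce and then adds the deuce branch, which is a geometric series.

```python
def game_win_probability(p: float) -> float:
    """Probability that the server wins a game, assuming every point
    is won by the server independently with the same probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    q = 1.0 - p

    # Server wins to love, to 15, or to 30: the last point is always
    # won by the server, and the opponent's 0, 1 or 2 points can be
    # placed among the earlier points in 1, 4 or 10 ways.
    win_before_deuce = p**4 * (1 + 4 * q + 10 * q**2)

    # The game reaches deuce (3 points each) in C(6, 3) = 20 ways.
    reach_deuce = 20 * p**3 * q**3

    # From deuce the server either wins two points in a row (p^2),
    # or the players split two points (2pq) and return to deuce.
    win_from_deuce = p**2 / (1 - 2 * p * q)

    return win_before_deuce + reach_deuce * win_from_deuce


if __name__ == "__main__":
    for p in (0.50, 0.55, 0.60, 0.65, 0.70):
        print(f"point on serve {p:.2f} -> game {game_win_probability(p):.4f}")
```

Even a small edge on individual points is amplified at the game level, and the same kind of recursion can be applied again to go from games to sets and from sets to the match.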
As elegant as this approach is, it cannot be considered ideal. By representing each player with a single metric (the probability of winning a point on serve), it fails to account for the finer factors that also influence the outcome of a match. For example, a player's adherence to a certain strategy, time since injury, and general fatigue from previous matches can only indirectly influence the prediction produced by the hierarchical model. Furthermore, the characteristics of the match itself - court surface, location, weather - are not taken into account at all in such forecasts.
Given the vast amount of historical data on tennis, we can propose an alternative approach to tennis match prediction - machine learning. Player and match parameters, together with the match result, form a training sample. A supervised machine learning algorithm can use this sample to construct a function that predicts the results of new matches.
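To make the supervised setup concrete, here is a minimal sketch that fits a logistic regression by gradient descent using only the Python standard library. Everything in it is an assumption made for illustration: the two features (`serve_diff`, `fatigue_diff`), the synthetic data generator and the coefficients 1.5 and -0.8 are invented, and no real match data is involved.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Predicted probability that player 1 wins, given feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights w and bias b by batch gradient descent on log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic training sample (hypothetical features, not real data):
# each row is [serve_diff, fatigue_diff] for player 1 minus player 2,
# and the label is 1 if player 1 won the match.
random.seed(0)
X, y = [], []
for _ in range(400):
    serve_diff = random.gauss(0, 1)
    fatigue_diff = random.gauss(0, 1)
    true_p = sigmoid(1.5 * serve_diff - 0.8 * fatigue_diff)
    X.append([serve_diff, fatigue_diff])
    y.append(1 if random.random() < true_p else 0)

w, b = train_logistic(X, y)

if __name__ == "__main__":
    print("learned weights:", w, "bias:", b)
    print("P(player 1 wins | better serve, less fatigue):",
          predict_proba(w, b, [2.0, -1.0]))
```

A real system would replace the synthetic rows with historical player and match statistics and would evaluate the model on matches it has not seen during training.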
ML for betting
Although machine learning is an obvious candidate for solving the tennis prediction problem, until recently it has received considerably less attention from researchers than stochastic hierarchical methods. Most studies applying machine learning to tennis use logistic regression and neural networks. The most accurate model described in the scientific literature achieves a return on investment (ROI) of 4.35%, which its author claims is 75% better than current stochastic models.
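For readers unfamiliar with the metric, ROI here can be read as net profit divided by the total amount staked. The sketch below assumes decimal odds and uses a hypothetical list of bets; it is not the evaluation procedure of the study mentioned above.

```python
def roi(bets):
    """Return on investment for a sequence of settled bets.

    Each bet is a tuple (stake, decimal_odds, won). With decimal odds,
    a winning bet returns stake * odds and a losing bet returns nothing.
    """
    total_staked = 0.0
    total_returned = 0.0
    for stake, odds, won in bets:
        total_staked += stake
        if won:
            total_returned += stake * odds
    if total_staked == 0:
        raise ValueError("no money was staked")
    return (total_returned - total_staked) / total_staked


if __name__ == "__main__":
    # Hypothetical bets, for illustration only.
    example_bets = [(10, 2.0, True), (10, 1.5, True), (10, 3.0, False)]
    print(f"ROI: {roi(example_bets):.2%}")
```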
Most online tennis forecasting services (we do not consider human tipsters) use stochastic models and offer users each player's probability of winning, along with statistics that users must analyse themselves. I will consider more interesting cases, where machine learning algorithms analyse not only the probability of winning a point on serve, but also historical statistics on players and match parameters. I will look at how giants such as IBM and Microsoft predict tennis results using machine learning algorithms.