As the UK soccer Premier League restarts, football fans can expect a highly familiar cause of frustration: their new, highly paid star striker is poised in front of the goalmouth, surely about to score… yet somehow misses what seems to be an easy goal.
However, data specialists at Spanish football league LaLiga can now reveal to fans, using machine learning, just how statistically difficult it is to complete the successful shots they expect.
Rafael Zambrano López, the Madrid-based Head of Data Science at the League’s tech organization, says:
Fans say that the probability of this kind of shot being successful should be 99 or 100 percent. But it’s actually very difficult to score in football, and most times the probability is more like only 72 percent.
López and his team claim to have discovered the reality of top tier football via intense number-crunching, analyzing a detailed recording of every move in two years of games. The results are now being shared with broadcasters and fans, including a new ‘Goal Probability’ metric.
The objective is that for every single game, the system knows the position of every player in relation to the ball.
This is done by gathering a large amount of game data and then analyzing it, with specialist partners collecting what LaLiga calls every ‘event’ for every game: every pass, every goal, every refereeing decision, and so on.
At the moment, this is done by human observers watching the game, who type in each event in real time. LaLiga acknowledges that manual tagging can introduce errors, so in parallel, 16 cameras are now in place at every first and second division stadium in the League.
These cameras capture the position of every player and the ball 25 times per second, delivering 300,000 frames per match. A second partner company then uses AI to transform the images into coordinates showing where every player is in relation to the ball on a moment-by-moment basis.
Lots of data - and lots of variables
The processes result in two vast data files that LaLiga works on to produce 20 performance analyses that fans can use to better understand the complexities of the game. The camera analysis of all those coordinates alone, for example, results in a plain text file with more than 3.6 million data points per game.
These files are then transformed into tables in a large XML file to create a simple relational model, which is loaded onto LaLiga’s cloud-based data platform. López says:
This is not a simple task, because these two sources don’t speak the same language. For example, in the event data we have players’ IDs, and we have timestamps for the events and frame numbers for the tracking. We need to combine the sources time- and space-wise to ensure that we really do know the position of every player and the ball at every second of the 90 minutes.
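To illustrate the kind of alignment López describes, here is a minimal sketch of matching a timestamped event to a tracking frame. It assumes event timestamps are seconds from kick-off and that frames are numbered from zero at kick-off; the `Event` fields and the `tracking` dictionary layout are hypothetical, not LaLiga's actual schema. Only the 25-per-second capture rate comes from the League.

```python
from dataclasses import dataclass

FRAME_RATE_HZ = 25  # from the article: positions are captured 25 times per second


@dataclass
class Event:
    """One manually tagged event (hypothetical field names)."""
    player_id: int
    kind: str
    seconds_from_kickoff: float


def event_to_frame(event: Event, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Map an event timestamp to the nearest tracking frame number."""
    return round(event.seconds_from_kickoff * frame_rate_hz)


def positions_at_event(event: Event, tracking: dict):
    """Look up tracked positions for the frame closest to the event.

    `tracking` is a hypothetical mapping: frame number -> {object id: (x, y)}.
    Returns None if that frame is missing from the tracking data.
    """
    return tracking.get(event_to_frame(event))
```

A real pipeline would also need to handle half-time restarts, stoppages and clock drift between the two sources, which this sketch ignores.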
Manual intervention to deal with possible tagging errors is also performed. He adds:
In one second in football, everything can change, so we do things like look very carefully at the frames both immediately before and after every shot, measuring the distance between the player and the ball at the exact moment the shot took place, correcting the target shot for the actual shot, and this allows us to get more precise metrics.
Historical statistics about individual players’ past performance are also added at this stage, since a shot taken by a high-quality attacker like Benzema or Messi has a higher probability of success than one taken by a midfielder or defender. López adds:
The variables here are complex. You must factor in the distance to the goal, the number of opponents and how close they are, the angle of the cone of visibility the player has at that moment. We also include historical data about that individual player’s previous track record, as well as the goalie’s.
Trigonometric analysis of the path of the ball through the air or along the ground is also added as a variable. Other calculations are performed too, such as the distance to the nearest opponent, since a close defender makes it more difficult to score. This, again, is expressed in mathematical terms.
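Some of these geometric variables are simple to express in code. The sketch below computes distance to goal, the angle the goal opens up to the shooter, and distance to the nearest opponent. It assumes pitch coordinates in metres on a 105 x 68 pitch with the goal centred at (105, 34), and it treats the angle between the two posts as a stand-in for the "cone of visibility"; both are illustrative assumptions, not LaLiga's definitions.

```python
import math

GOAL_WIDTH_M = 7.32                   # standard goal width in metres
GOAL_X, GOAL_CENTRE_Y = 105.0, 34.0   # assumed coordinate system, not LaLiga's


def distance_to_goal(x: float, y: float) -> float:
    """Straight-line distance from (x, y) to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def goal_angle(x: float, y: float) -> float:
    """Angle in radians between the two posts as seen from (x, y).

    Valid for positions in front of the goal line (x < GOAL_X).
    """
    half = GOAL_WIDTH_M / 2
    to_far_post = math.atan2(GOAL_CENTRE_Y + half - y, GOAL_X - x)
    to_near_post = math.atan2(GOAL_CENTRE_Y - half - y, GOAL_X - x)
    return abs(to_far_post - to_near_post)


def nearest_opponent_distance(x: float, y: float, opponents) -> float:
    """Distance from (x, y) to the closest of the given (x, y) opponents."""
    return min(math.hypot(ox - x, oy - y) for ox, oy in opponents)
```

Under these assumptions, a shot from the penalty spot at (94, 34) is 11 metres out, and the goal angle shrinks as the shooter moves further away or out towards the touchline.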
After all these millions of rows and tables have been worked over, a machine learning model is then applied that performs the detailed analysis. To train the model, LaLiga used records of more than 20,000 games so that it could learn how the many variables affect the probability of scoring.
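LaLiga has not said which type of model it uses. Purely as an illustration of how shot variables can be turned into a probability, the sketch below fits a logistic regression by gradient descent. The six toy shots and their outcomes are made up for the example, and the two features (distance in tens of metres, goal angle in radians) are assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features) -> float:
    """Estimated probability that a shot with these features is a goal."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))


def train(samples, labels, epochs: int = 2000, lr: float = 0.1):
    """Fit logistic regression with plain batch gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(samples, labels):
            error = predict(weights, bias, features) - label
            grad_b += error
            for i, f in enumerate(features):
                grad_w[i] += error * f
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Made-up toy data: (distance to goal in tens of metres, goal angle in radians)
TOY_SHOTS = [(0.5, 1.2), (0.8, 0.8), (1.1, 0.6), (1.6, 0.4), (2.2, 0.3), (3.0, 0.2)]
TOY_GOALS = [1, 1, 0, 1, 0, 0]  # 1 = goal, 0 = miss (invented outcomes)
```

Calling `train(TOY_SHOTS, TOY_GOALS)` and then `predict` on a new shot returns a value between 0 and 1. With data like this, a close, wide-angle chance should score higher than a long-range one, but real figures such as the 72 percent López quotes would need far more data and many more variables.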
Results are then shared, in real-time, with the TV companies showing the game. And it’s this combination of AI and analytics that has produced the Goal Probability metric, which López claims is the world’s first predictive football metric.
Applying analytics to the game at this level, says López, also shows that misses happen a lot in football and that even the very best players miss shots, because those chances are not as much of a ‘tap in’ as observers believe.
Detailed pre- or post-match analysis
To get this level of analysis, LaLiga needed to build a complex data engine at its back end.
This is made up of a mix of Microsoft Azure and data lakehouse technology from supplier Databricks, which allows LaLiga and its partners to access all this data in one place. Partners here include not just broadcasters, but the clubs themselves, who can use LaLiga-supplied data analysis tools to obtain in-game tactical insights, perform detailed pre- or post-match analysis or even predict player injuries.
So successful has the project been for LaLiga that the League’s internal data team has now been spun off into a standalone, 180-strong sports IT data services company, LaLiga Tech.
Next steps for applying machine learning to sport, López concludes, will be seeing whether the approach can be extended to other football leagues, starting with South America, and to other sports, such as Spain’s highly popular ‘padel.’