I’ve been doing football¹ analytics for over a decade. In that time, we have seen incredible progress across the field and in the sport itself. Those very first “expected goals” systems have been surpassed by new ones built on newly collected information, and increasingly we see properly big data from player tracking being incorporated into all kinds of metrics. Scholarly journals publish cutting-edge soccer analytics articles. Within the sport, nearly every big club has an analytics department, and most famously Liverpool’s owners and executives have spoken publicly about how their recent trophy run was driven by an analytics approach. In the media there are stats everywhere. xG has gone from a handful of tiny blogs to Match of the Day in just a few years.
At the same time, our ability to inform people about the sport of football has not progressed nearly as far as I would have imagined a decade ago. Major and straightforward questions about the game remain unanswered, at least in the public sphere. What have we really learned in the last decade about game state, home field advantage, substitution patterns and their effects, or the league and team contexts of player statistics? What have we learned about which statistics project future success for a player or team over what samples?
These are the sorts of questions that had reasonably good answers in baseball analytics by the end of the 1990s. Back in 2014 it seemed nearly certain that they would be at least mapped out for football within the decade. But even today, our explanations of existing statistics often risk veering into hand-waving:
“oh their numbers are biased by game state since they scored so many early goals”
“she’s transferring to a better team that has more possession so scale her numbers up a bit”
“but perhaps that’s not enough minutes this season to believe his uptick in take-on success is real”
and so on. This hand-waving places a limit on how well we can describe the game using the statistical record. In particular, I think what has happened is that we have progressed along one track of analytics while failing to make the same progress on two others.
When I started putting together my thoughts on what a “soccer analytics” newsletter would be, I tried to define analytics as a field of study. What is it that makes a discussion of sporting records, something that people have done for as long as there have been sports, into “analytics” properly? There are three aspects to this.
The first, obviously, is valuation. What actions taken in a sporting competition led to wins? How much did each action impact the likelihood of winning? If someone is trying to evaluate the “on-ball value” of an action, or a team’s likelihood of winning based on the chances created, that is valuation analytics. “Expected Goals” in this sense is primarily a valuation statistic, as it was created to provide an estimate of how well a team played and how likely they were to win the match. The holy grail of valuation is some sort of “Wins Above Replacement” (maybe “Points Above Replacement”) statistic that provides a single number to quantify what a player contributed to team success.
The second is context. How was the accumulation of these statistics affected by the setting of the game, where and when it is played, the tactical and positional roles taken up by the players, the quality of players or the opposition, and the competitive context whether in terms of the game state of the individual match or the larger season competition? If someone is working on “league adjustments” or “team adjustments” to account for a player’s competition or teammates, they are doing context. The holy grail of context is universal comparability. Can you take the statistical record of a team in Serie B and extrapolate what an equivalent record would be in Ligue 1? If a player has averaged 28 touches per match and 1.7 passes into the penalty area for a team with 42 percent possession and 12 penalty area touches per match, what would their equivalent statistics be on a team with 55 percent possession and 24 penalty area touches per match? Comparability is fundamentally a question about statistics themselves, not necessarily about their value. You don’t need to know whether a touch is valuable to make these comparisons and account for context.
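To make the possession example above concrete, here is a deliberately naive sketch of what a first-pass team adjustment might look like. It assumes a player’s per-match numbers scale in direct proportion to a team-level rate (touches with possession share, passes into the penalty area with the team’s penalty area touches). That proportionality is an assumption for illustration only, not an established result, and the function name `scale_per_match` is hypothetical.

```python
def scale_per_match(value, team_rate_from, team_rate_to):
    """Scale a per-match stat by the ratio of two team-level rates.

    Assumes simple proportionality between the player's stat and the
    team-level rate. This is an illustrative assumption, not a tested model.
    """
    if team_rate_from <= 0:
        raise ValueError("team_rate_from must be positive")
    return value * team_rate_to / team_rate_from


# Figures from the example in the text.
touches = scale_per_match(28, team_rate_from=42, team_rate_to=55)
box_passes = scale_per_match(1.7, team_rate_from=12, team_rate_to=24)

print(round(touches, 1))     # 36.7 touches per match
print(round(box_passes, 1))  # 3.4 passes into the penalty area per match
```

A real context adjustment would need to test whether the relationship is actually linear, and whether it differs by position and role. The sketch only shows the shape of the question.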
Finally there is randomness. Sports analytics does not merely attribute outcomes to bad luck² but seeks to quantify that variation to understand how, when and to what degree it impacts sporting outcomes and achievements. When someone says that a player who is not finishing their chances right now likely will not remain cold, this is a statement about randomness. The sample size over which the statistical record has been compiled is too small to draw conclusions about the real underlying skill of the player or team that compiled those statistics. The holy grail of the study of randomness is projectability. Can you say based on a player’s record what their likely statistical production will be over the next several weeks, months or years? What parts of the statistical record, over what sample sizes, provide useful signal to identify those underlying real tendencies and capabilities?
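One common way to express the sample-size idea above is to regress an observed rate toward a league-wide rate, with the pull weakening as the sample grows. The sketch below is a minimal version of that idea. Every number in it is hypothetical, and `prior_strength` (how many league-average attempts the observed record is blended with) is an assumed parameter that would have to be estimated from data.

```python
def shrunk_rate(successes, attempts, league_rate, prior_strength):
    """Blend an observed success rate with a league-wide rate.

    Equivalent to adding `prior_strength` league-average attempts to the
    player's record. All inputs used below are hypothetical illustrations.
    """
    if attempts < 0 or prior_strength <= 0:
        raise ValueError("attempts must be >= 0 and prior_strength > 0")
    return (successes + prior_strength * league_rate) / (attempts + prior_strength)


# Hypothetical cold finisher: same 7.5% conversion over two sample sizes.
small_sample = shrunk_rate(3, 40, league_rate=0.10, prior_strength=200)
large_sample = shrunk_rate(30, 400, league_rate=0.10, prior_strength=200)

print(round(small_sample, 3))  # 0.096, still close to the league rate
print(round(large_sample, 3))  # 0.083, closer to the observed 0.075
```

With the small sample the estimate barely moves off the league rate, which is the quantified version of “likely will not remain cold.” With ten times the shots, the same conversion rate starts to look like a real tendency.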
As I have watched soccer analytics develop, I have seen incredible progress in understanding valuation. People are increasingly putting solid numbers to the incredibly difficult problem of identifying which actions help win matches. But on questions of context and randomness we have fallen short of the expectations of a decade ago.
I wonder if there’s a reason for this. Valuation is where the money is. Coaches want to know what actions on the pitch lead to success. Owners want to know which player can be signed for a cheap contract but produce more marginal points. Valuation points outward to the most important questions for people inside the game. Context and randomness, by contrast, are very much questions about statistics. They do not necessarily produce actionable results for coaches. At first, what they do is help us understand the statistical record that we have.
I am not going to make the case that this is a more valuable thing to do than anything else in the field. We can all decide what to do with our one wild and precious life. But I know that I want to understand football statistics better than I do, and I enjoy helping other people understand football statistics. And I see a real gap in analytics right now in capturing the contexts and the randomness of the game. There are dozens of topics within this rubric that I think need further exploration, and this newsletter will be my opportunity to explore. Hopefully it will be fun for you to explore with me.
Another more idiosyncratic reason that Expecting Goals came to be is that I just like this kind of writing. When I first started posting about soccer analytics at Cartilage Free Captain, I wrote studies. As I took on freelance jobs, the material I could get paid to produce shifted to more specific and of-the-moment questions. As I have hosted a weekly podcast, football has become an increasingly year-round sport with fewer and fewer opportunities to take a break from the weekly grind of matches and crises and transfers and business skullduggery. This newsletter offers me a chance to step away from this week’s Premier League drama and focus more broadly on the game of football. I hope that you will all enjoy this opportunity to take stock together, as well.
Because football is such a wonderfully complex and dynamic game, every question of context, variation or valuation inevitably overlaps with several others. So these studies will not be separate and discrete pieces of work, like statues placed successively in a garden. At the same time, because these studies address questions without settled answers, they cannot be built on top of one another like pouring a foundation and then creating precision-cut pieces to complete a particular structure. Rather, I see this as the creation of building blocks, which can be rearranged over time into many different forms. Further, I hope these building blocks will be useful not only to my future studies but to fans and analysts who can use them to help understand the game better themselves. I do not know where these studies will take me, but I already plan to double back many times based on the work I have completed. Perhaps this will involve continually re-cutting the blocks as we learn better ways to fit them together.
In this sense of exploration and play, I have a plan for how the newsletter will be scheduled. Every month I will produce a new study on a topic in soccer analytics. The first study here in January 2024 will be on substitute effects. Following the Study will be the Sandbox. Every study produces table after table with new statistics for players and teams. I will look through the material produced for the study to find interesting stories, and I will take your requests to look at players, teams, and topics you want to know more about.
¹ One important editing note before we begin. I have written for British publications where I called the sport “football” and for American publications where I wrote “soccer.” They both had strict rules that I could only use one or the other. But on my podcast I use the terms interchangeably, depending on my whim and the rhythms of the sentence. You can expect that to continue in the newsletter. Soccer and football are just two names for the same sport.
² Unless it happens to your favorite team. Then it’s definitely bad luck.