Building Marcel, Part 1: The Monkey

If that doesn't make any sense, I'll explain it. Anyway, this is about a projection system.

Sep 06, 2024


In my first newsletter, I discussed how one of the big goals of this project was to better understand the statistical record of soccer. We now have 15 years of on-ball data, spanning an increasing number of leagues. A typical match has over 2,000 events. In the big five European leagues alone, which play roughly 1,800 matches a season between them, that means there have been something like 55 million on-ball events logged since 2009-10. It is my contention that we should know a lot more about what these data points mean than we do.

There are many different ways to start to make sense of this data — this is what the newsletter is about! — but as someone who came into sports analysis through baseball, I have always wanted a projection system. If a player attempted three shots per match one season after averaging under two per match the season before, what should we expect them to do next season? What about a player who has a big season in assists or take-ons or interceptions? If a player has won a high percentage of their aerial duels in 1,500 minutes, does that indicate they will likely continue to win those duels at a high rate?

The question of projection sidesteps the question of value. I don’t need to know whether a player’s pass completion rate increased or decreased the likelihood of a team scoring goals to estimate their most likely pass completion rate next season. There are ways into the value question from projection, but they don’t need to be followed in order to learn more about the game. At the same time, statistical projection raises many other complicated questions.

League translation, for one. If a player completed six progressive passes per match in the Italian Serie A, would we expect them to complete more or fewer in the Premier League? And of course this question dovetails with team and tactical context considerations. If the Serie A team had 44 percent possession and played a low-tempo defensive style, and the new Premier League team plays a high-pressing and high-tempo style with more possession, how should that affect our projection of their passing statistics? Even season to season in the same league, on the same club, a player’s team could change managers or tactical styles and the governing context of their statistics would be different.

League, team and tactical contexts are huge questions in soccer analytics, and this series will get to them in time. The problem is that you cannot isolate the league or tactical context in a statistic if you don’t first establish a baseline projection. Take that player who completed six progressive passes per match in Serie A, and say they completed five per match on their new Premier League team. How much of that change could be attributed to contextual factors, and how much is just the typical regression to the mean that would happen in any player’s numbers? To understand league and team contexts, we need to have a handle on how to do projections in the first place.

So this is where I’m starting. In baseball analytics, we had something called the Marcel projection. That’s a Friends reference, which apparently I do not need to explain because Friends is the most popular TV show among Zoomers worldwide (or something). But I do need to explain why it has anything to do with sports analytics. The simplest form of statistical projection is a weighted average of past performance. Weight more recent production more heavily and production further in the past less heavily, add them all together, and include some regression to the mean component as well. That simple result, merely a regressed and weighted average, is the sort of statistical projection system that a monkey could build.

If you then add just a few bells and whistles, some context adjustments and age adjustments, it’s smarter than what a typical monkey could make. Marcel from Friends was smarter than the average monkey, so that would have been Marcel’s projection system.

In baseball, “some context adjustments” mostly means park effects, which are relatively trivial to calculate. In soccer, context is everything. Context is team and tactical effects, league effects, the aging curve, and more. Extracting a statistical signal of real player tendencies from all of these contexts is one of the holy grails of soccer analytics. Getting all the way to Marcel will be a journey. Just making the monkey smarter would be a major advance in soccer analytics, at least in the public sphere.

So we will begin with the monkey. This involves a few distinct questions. What is the proper weighting of past season performance? How is the regression to the mean component calculated and what should its weight be? Should different weights and approaches be used for different statistics?
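To make those questions concrete, here is a minimal sketch of the monkey as described above: a weighted average of past seasons, pulled toward the league average. Everything specific in it is an assumption for illustration. The names `Season` and `monkey_projection` are hypothetical, the 5/4/3 season weights are borrowed from the baseball version, and the 1,500 "phantom" minutes of league-average play are a placeholder. Whether any of those numbers suit soccer is exactly what still has to be worked out.

```python
from dataclasses import dataclass


@dataclass
class Season:
    minutes: float  # minutes played in that season
    count: float    # raw total of the stat (shots, take-ons, ...) in that season


def monkey_projection(seasons, league_rate_per90,
                      weights=(5, 4, 3), regression_minutes=1500):
    """Project a per-90 rate from past seasons, listed most recent first.

    weights and regression_minutes are illustrative placeholders,
    not tuned values.
    """
    weighted_count = 0.0
    weighted_minutes = 0.0
    # Recent seasons count for more; seasons beyond len(weights) are ignored.
    for season, weight in zip(seasons, weights):
        weighted_count += weight * season.count
        weighted_minutes += weight * season.minutes

    # Regression to the mean: blend in phantom minutes of league-average play.
    phantom_count = league_rate_per90 * regression_minutes / 90
    total_minutes = weighted_minutes + regression_minutes
    if total_minutes == 0:
        return league_rate_per90
    return 90 * (weighted_count + phantom_count) / total_minutes


# A player who took 3.0 shots per 90 last season and 1.8 the season before,
# in a (made-up) league where the average is 1.5 shots per 90.
history = [Season(minutes=2700, count=90), Season(minutes=2700, count=54)]
projected = monkey_projection(history, league_rate_per90=1.5)
print(round(projected, 2))
```

The projection lands between the two seasons, closer to the recent one, and is nudged down toward the league average. A player with fewer minutes would be pulled toward the league average more strongly, because the phantom minutes make up a larger share of the total. Each of the three questions above corresponds to a knob here: the `weights` tuple, the size of `regression_minutes`, and whether either should change from one statistic to the next.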
