London bus analytics conferences
We're talkin' StatsBomb Evolve and Stats Perform Pro Forum and what they both *mean* for us all
Honestly. You wait all year for a football analytics conference/event and then two come along in the space of a week. On 17 March we had ‘StatsBomb Evolve’ and on 24 March we had the Stats Perform Pro Forum.1 In all the weeks in all the months in the world, they just happened to walk into almost the same one.
Both brought announcements, both brought fine analytics minds, and during both I suffered internet issues (on my end) while watching. So I won’t do a play-by-play for either. However, the post-conference dusk makes for a good time to think about what they indicate for where analytics ‘is’ and where it’s heading.
This newsletter will have a brief summary of each, linking out to the company’s material on the event. Then I’ll talk about what it all seems to mean.
In chronological order
StatsBomb Evolve was split into four speakers and five basic parts (after a brief ‘Previously, on StatsBomb’ intro):
StatsBomb 360 (the new product, ‘where lots of players are at the moment of each event’)
StatsBomb data collection
StatsBomb data quality control [imo these two sections were far more interesting than they sound; this is the sole bit of editorialising I’ll do in this section of the newsletter]
StatsBomb data science (mainly discussing their OBV, or ‘on-ball value’, model)
There’s a StatsBomb summary of each section, along with select slides, here.
The Stats Perform (née Opta) Pro Forum followed the same structure as previous years: presentations from members of the analytics community (two of which followed a research question proposed by a pro analyst working in the game), then a guest speaker. The presentations were:
Ola Lidmark Eriksson – Volatility and calculation of risk-adjusted return in football scouting
Aditya Kothari – A physics based measurement of defensive contributions
Caterina De Bacco – Identifying and evaluating the efficiency of each player during the pressing phase against an opponent’s controlled build-up play
Stats Perform’s very own Paul Power, Thomas Seidl, and Michael Stöckl doing lots of things with tracking data (presentation titled ‘Making Offensive Play Predictable’)
Debangan Dey, Rahul Ghosal and Atanu Mitra – Enriching event data: A semi-supervised augmentation approach using location information
Vignesh Jayanth – Identifying and evaluating strategies for successfully penetrating a high opposition press from short goal kicks, played inside the box, to move the ball into the opposition half
Laurynas Raudonius won the inaugural Dr. Garry Gelade award, named in honour of the late analytics thought leader and Forum stalwart, and presented his poster ‘Recognizing and evaluating opportunities in counterattacks using tracking data’. Delivering the guest talk was Mo Bobat, Performance Director for the England and Wales Cricket Board.
A more detailed description of each of the main presentations by public analysts is here.
The analytical themes
On the analytics front, I think there were two main themes from these two events.
One of them was “data that is somewhere between the two traditional camps of ‘event’ and ‘tracking’ data”. To explain what those two types are briefly: if event data is someone smooshing a paintbrush against a canvas to mark every pass someone makes, tracking data is twenty-two sets of hands2 keeping their pens to the paper for the whole match.
Having all of that tracking data is, in theory, great. You know everything! But just like having the internet’s boundless knowledge on hand 24/7 doesn’t make you a genius, tracking data doesn’t necessarily solve all your problems either.
Approaching the problem from one end we have something like Dey, Ghosal, and Mitra’s Pro Forum presentation, ‘Enriching event data’. They had a multi-step process to take insights from tracking data and find them in event data. The potential uses for this could be for scouting purposes if you have tracking data for your own matches/league but not others: you use the tracking data that you do have to work out how to use the event data in fancy and helpful ways.3
Approaching it from the other direction is StatsBomb and their ‘360’ data. Instead of using tracking data to enrich event data, their new thing is using their event data collection to identify moments to snapshot tracking data. For every event in their dataset, they’ll essentially capture a single frame of tracking data.4
Instead of you having to work with lots of frames to work out, say, how close the nearest defender was when someone receives a pass, you have that single snapshot connected to each event in the data.
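To make that concrete, here is a minimal sketch of the ‘nearest defender’ lookup when each event carries its own snapshot. The event layout and field names (`location`, `freeze_frame`, `teammate`) are my own assumptions for illustration, not StatsBomb's actual schema, and the coordinates are made up.

```python
import math

# Hypothetical event with a single freeze frame attached.
# Field names and coordinates are illustrative assumptions, not a real data feed.
event = {
    "type": "Ball Receipt",
    "location": (60.0, 40.0),
    "freeze_frame": [
        {"teammate": True, "location": (55.0, 30.0)},
        {"teammate": False, "location": (63.0, 44.0)},
        {"teammate": False, "location": (70.0, 40.0)},
    ],
}


def nearest_defender_distance(event):
    """Distance from the event location to the closest visible opponent.

    Returns None when no opponents are visible in the frame, which can
    happen if the snapshot comes from footage that misses some players.
    """
    ex, ey = event["location"]
    distances = [
        math.hypot(player["location"][0] - ex, player["location"][1] - ey)
        for player in event["freeze_frame"]
        if not player["teammate"]
    ]
    return min(distances) if distances else None


print(nearest_defender_distance(event))
```

The point is only that one lookup per event replaces a search through many frames of tracking data; a real feed would need its own handling of coordinate systems and of players who are out of shot.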
Both approaches, and the whole theme of enriching event data, make a lot of sense. It allows you to count more stuff (and more useful stuff at that) while not adding too much extra technical requirement in skills or computational power.
The other main theme was defending.5 Two of the Pro Forum’s presentations were, by the title, about that side of the game, and a further two were at least pretty directly applicable.
StatsBomb also made a point of saying how their ‘360’ freezeframe data will allow for more defensive analysis. At first this’ll be on a team level (maybe holes in a team’s defensive structure, where they’re more likely to appear), and then more player-level stuff will be possible when StatsBomb assign player IDs to their 360 freezeframes.6
The reason for this second theme is, I think, a subliminal bonus theme: the maturity of the industry.
I feel that some of the focus on defending is coming now not necessarily because of new ideas, but because of technical or computational improvements that allow people to investigate much more easily. People have long wanted to investigate defending from the point of view of ‘space management’; it’s just been quite difficult to do.
However, nowadays best practices of storing and dealing with tracking data are being worked out and passed around; Friends of Tracking’s YouTube and Github repo have given everyone the opportunity to toy around with a pitch control model and tracking data; Stats Perform had a very fancy looking animation of their own data and models which has clearly been worked on a lot; StatsBomb have been developing the data collection processes and computer vision tech to enable them to collect the 360 data.
If this isn’t too grand a comparison, it’s a little like Europe in the early modern period: without the printing press, a lot of stuff would’ve struggled to get off the ground.7 We now seem to have the printing press.
The business themes
But yes, these events weren’t merely for the advancement of science. I’m beginning to suspect that it wasn’t even a coincidence they were just a week apart.8
It shouldn’t take me to tell you what the business rationale behind StatsBomb offering more data is. I won’t do their sales pitch; I’m neither charismatic enough nor on their payroll enough. What I will say is that it’s an interesting shift in the market.
Assuming the pricing, execution, and delivery9 are right, it’ll mean that StatsBomb are straddling the event and tracking data spaces in a fairly unique way. Of course, they’re not the only company to straddle this divide — tracking data companies generally seem to produce ‘counting’ type metrics that are essentially event data, and the other company this newsletter rests on, Stats Perform, is a merger of a tracking data and an event data company after all. But StatsBomb’s approach is far more of a ‘pure hybrid’ than others’.
Another worthwhile note from the ‘Evolve’ event is that StatsBomb are going to be making their OBV (on-ball value) model a free add-on to the rest of their data. One might wish to say that that sounds like a way of undercutting competitors who are currently developing their own model and/or planning to charge for it. One might.
However, here we should make a brief digression back into analytics to talk about what OBV actually is and therefore why we should care about it being free.
The OBV model is part of a growing family of ‘possession value-type’ models. I’m talking non-shot expected goals, expected threat, on-ball value, expected possession value, goals added, possession value added. They all do similar things. Hell, I’ll just crib StatsBomb’s one-line explanation from their blog of the event (link again here) because it’s good and works for pretty much all of them:
…valuing every event that happens on the pitch based on how it changes a team’s likelihood of scoring or conceding.
Each model tackles things in slightly different ways and not all of them value events based on likelihood of scoring or conceding like OBV does (some more basic/specific ones just do the scoring half).10
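As a rough illustration of that one-liner (and very much a sketch, not how OBV or any other named model is actually built), the value of an event can be written as the change in ‘chance of scoring minus chance of conceding’ from before the event to after it. The function names and the probabilities below are invented for the example; in a real model those probabilities would come from something trained on lots of data.

```python
def event_value(p_score_before, p_concede_before, p_score_after, p_concede_after):
    """Change in net scoring likelihood caused by one event.

    Assumes the four probabilities come from some trained model;
    here they are simply passed in by hand.
    """
    net_before = p_score_before - p_concede_before
    net_after = p_score_after - p_concede_after
    return net_after - net_before


def scoring_only_value(p_score_before, p_score_after):
    """The more basic variant that only looks at the scoring half."""
    return p_score_after - p_score_before


# Made-up numbers: a pass that moves the ball into a more dangerous area
# while slightly raising the risk of conceding if possession is lost.
print(event_value(0.010, 0.004, 0.030, 0.005))
print(scoring_only_value(0.010, 0.030))
```

In this toy case the two variants disagree slightly because the first one charges the pass for the extra risk it takes on, which is the difference between the ‘scoring or conceding’ models and the scoring-only ones.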
Where OBV is concerned, an interesting point was made by Dinesh Vatvani, StatsBomb’s Head of Data Science, in his part of ‘Evolve’ about team strength effects. Some possession value-type models include information about ‘the possession so far’ as a feature (e.g. what happened in the previous three passes or how long the string of possession has already lasted). This is done to act as a proxy for opposition defensive structure. However, Vatvani demonstrated that by including this information you may end up polluting the model with the stylistic tendencies of the strongest teams. (For clarity’s sake, this is my summation, not his words.) OBV doesn’t include this ‘historic possession’ information, then; some others do.
StatsBomb also briefly talked about other models they have, such as role classification, pass clustering, and an expected passing model. OBV was the main attraction though.
Back to business.
[Edit: this paragraph originally said 'the OBV model is still a bit of a WIP' - this has been amended below]
It’s worth noting here that the OBV model that will be made available for free is something of a stepping stone. Ted Knutson is clear that OBV is a worthwhile standalone model, even without the information from the StatsBomb 360 freezeframes, which it doesn't incorporate. There will be an OBV model that incorporates that data eventually, and will 'live alongside 360 data subscriptions'. That information (where a bunch of other players are on the pitch, remember) will surely be very handy for determining the value of events, and gets rid of the need to proxy the opposition’s defensive structure.
So, free OBV won't include 360 data; the version that will won't be free. (It seems to me that StatsBomb’s pressure data, collected since their launch a few years ago, may already act as a bit of a proxy of opposition defensive structure anyway).
StatsBomb’s 360 data is new, even to them, and they were open during the event about the fact that they would be finding new things to do with it as time went on and as they gathered more of that type of data.11
On Stats Perform’s front, the presentation during the Forum by their own employees showcased a number of things:
Their tracking data, which is captured in-stadium (and therefore always has all players on the pitch)
An ‘xReceiver’ model, based on how likely a player was to be the intended recipient of a pass at any given moment
An ‘xPass’ model based on how likely a pass from one player in the tracking data to another is to be completed
An ‘xThreat’ model, one in that family of possession value-type models I mentioned previously
When players are pressuring the ball, or passing options
‘Active runs’, which are when a player makes a run that increases the likelihood of them becoming an xReceiver above a certain threshold
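For a sense of how the ‘active runs’ idea could be expressed, here is a small sketch. It is my own guess at the logic, not Stats Perform's definition: I'm assuming a run counts as active if the player's receiver probability starts below a threshold and reaches it at some point during the run. The function name, the per-frame list of probabilities, and the 0.2 threshold are all hypothetical.

```python
def is_active_run(receiver_probs, threshold=0.2):
    """Flag a run as 'active' under one assumed reading of the definition.

    receiver_probs: per-frame probabilities (from some xReceiver-style model)
    that the running player is the intended recipient, in time order.
    threshold: an arbitrary illustrative cut-off, not a published value.
    """
    if not receiver_probs:
        return False
    started_below = receiver_probs[0] < threshold
    reached_threshold = max(receiver_probs) >= threshold
    return started_below and reached_threshold


print(is_active_run([0.05, 0.10, 0.25]))  # starts low, crosses the threshold
print(is_active_run([0.30, 0.35]))        # already a likely receiver before the run
print(is_active_run([0.05, 0.10]))        # never becomes a likely option
```

A different reading, where the size of the increase itself has to exceed the threshold, would be just as easy to write; the presentation didn't give enough detail for me to say which is closer to the real thing.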
And also, something slightly different, they’re launching a website!12 By the time you read this the site might be online, but for now I can only link to a Twitter page. Like StatsBomb, I’m sure they will produce good content with all of their various product- I mean, toys - on show. If it’s good enough for Twenty3 Sport (the people I work for) it can be good enough for them.13
The community themes
But while there is pure analytics and pure business involved in these events, the heart of them — and I genuinely mean this — is the community of people in the public and private analytics world.
The Pro Forum started life as a way for Opta (as it was then) to give some recognition (and some data) to the hobbyists, some of whom went on to write, and then work for, StatsBomb. During StatsBomb’s event, I thought it was a nice touch to give so much (deserved!) space to the folks at Arqam, the company StatsBomb worked with and then acquired to run their data collection operation.
On the community theme, it’s clear that people will give you time and help, and that barriers to being involved in analytics are falling. Partly aided by the virtual nature of the event, the presenters at the Pro Forum were from a varied field of backgrounds, and on more than one occasion they took the time to thank a string of people who’d helped them with their presentation.
Outside of that, there are collections of resources like Friends of Tracking, Devin Pleuler’s Soccer Analytics Handbook, and this amazing bibliography of expected goals literature. For group support, Lydia Vandenbergh Jackson and Arielle Dror have started a Slack for people from underrepresented genders interested in women’s soccer (and other sports) and data analysis. Original tweet on that group and how to access is here.
Another lesson that I think can be taken away from these two ‘conferences’ is that event data ain’t going anywhere. Companies like StatsBomb and Stats Perform will probably increasingly add event-type stats to their data feeds which are derived from tracking data, so that you don’t have to. That means that, while it might be interesting and advantageous to know how to deal with tracking data, you don’t need that knowledge to work in the analytics space.
Finally, if you want some inspiration of work you could do, there’s an interesting specificity to some of this year’s Pro Forum presentations. Ola Lidmark Eriksson’s was about determining how good and consistent a player’s metrics are (very useful); the club-led proposals, presented by Caterina De Bacco and Vignesh Jayanth, were about very specific in-game situations. (Here’s the link to the list of presentation titles and descriptions again). You don’t need to solve football to be useful; often, just improving a small part of it by a little can help a great deal.
On where to get data: the Friends of Tracking repository has some free Metrica data; StatsBomb have made some data available; there are some free samples of Wyscout data (albeit in .txt form) here; Football Reference is a tremendous starting point as it lets you download tables easily as .csv files. [NB: I’m likely to update this paragraph with other data sources I might have missed]
I said in my review of 2020/predictions for 2021 newsletter that:
In general, then, there’ll just be more. More use of data, more people working with data, more understanding of data, more little innovations.
I didn’t even know how right I was.
Due to concerns about Substack's approach to dealing with transphobia on their platform (outlined well here) I'll be using it to send out my free newsletter but pointing Twitter followers and anyone not on the mailing list to the free posts on the free Patreon here. In the future maybe there'll be an actual paid option there - let's be positive and leave ourselves the option - but for now everything there will be free. It isn’t designed to let you purely offer things for free though, so I had to add an unrealistic tier to hack into using it. I may return fully to Substack at some point, but that seems up in the air for the foreseeable.
The name of the latter event doesn’t need quotation marks because I don’t feel embarrassed writing it.
This isn’t always 22. Some companies collect the tracking data in the stadiums, but some do it based on video footage they gather from other sources (often TV broadcasts from somewhere or other). In these situations they may not — in fact, for TV footage, are likely to not — have all 22 players all the time.
(Or, alternatively, if someone, somewhere has tracking data and shares the features of the model around, might you not need any tracking data at all?)
It’s worth noting here that StatsBomb’s messaging has said that these 360 frames will have ‘every player on the pitch’, a statement which seems more like an aim than a reality. The examples in the presentation were clearly taken from broadcast footage, but on more than one occasion it was mentioned that the company are working to get more wide angle footage.
At the moment, the freezeframe data they’ve had around shots (which they’ve had since launch in 2018) has player ID for all players in frame, but 360 is launching without it. As Ted Knutson noted in the event, there are a hell of a lot more non-shot events in football matches than shots. I don’t believe they gave a timescale for when this information would be added.
Other historical analogies will no doubt also apply
Who can forget the original launch of StatsBomb data when Opta also dropped news of some data upgrades on the same day?
StatsBomb will be offering the 360 data in a feed, as well as deriving their own metrics and visualisations to put into their StatsBomb IQ software product, so that’s two different avenues for ‘delivering’ this.
I want to note two things here. One is that I find this variety in pretty similar models very interesting and am very curious about how it’ll play out — I believe that things will evolve and eventually there’ll probably just be a couple of different approaches called different things under a ‘possession value’ mini umbrella. The second is that I spoke to Vatvani on Twitter about the OBV model as I wanted to clarify some things I hadn’t fully followed during the presentation in case I wrote about more details of the model here. He was very helpful and I feel bad now that I haven’t ended up writing about the model apart from its current lack of 360 data. Another part of his presentation that I found interesting was the use of OBV for players shooting, working out the difference between OBV and post-shot xG to gauge how much a shooter had made of their opportunity.
I find this exciting. Like the Pro Forum, one of my favourite things about analytics events or announcements is the feeling of opportunity and exploration of how things can be used and what stories or insights they can bring.
I’ve just realised an amusing symmetry. StatsBomb are moving more firmly into the event/tracking data hybrid space that Stats Perform have been operating in; Stats Perform are moving more firmly into the content marketing space that StatsBomb was kinda born out of given its origins as a blog.
It’s nothing to do with us at Twenty3, but I do enjoy the coincidence that since we started producing public content to showcase our product: StatsBomb have hired a content marketer guy; Analytics FC have started content marketing across various platforms; Stats Perform are opening a website. [Edit: This kind of comes full circle. Content marketing isn't new, and this joining of random dots will miss a lot, but OptaJoe is long-established content marketing. Content marketing is very old, but it seems to be ramping up recently.]