Sport is a rewarding field for data journalists: there is a lot of data, but surprisingly few data-driven stories about it. At Swiss radio and television (SRF), my colleagues have been reporting intensively on Swiss tennis star Roger Federer for 20 years. We asked ourselves: who, if not us, should produce the best and most comprehensive data analysis of Roger Federer? So we gave it a try. For those who haven't seen the result yet, you can find it here.
In tennis, every millimeter of the game is measured, so it can't be that hard to find all of the “Maestro’s” (one of his many nicknames) moves. Or can it?
No official data source
On the ATP website, you can find the most important metrics for each match: service games won, number of aces, etc. But a download button? Forget it. The ATP, which organizes most tennis tournaments and manages the world rankings, does not have an official data API where one could download all their stats. So we had to get the data elsewhere.
But this turned out to be more difficult than expected. The first hurdle: Federer started playing tennis at a professional level in 1998. Since we had the ambition to map his entire career, it was clear that we needed a data source that covers the period from 1998 to the present day, and one that is constantly updated.
And this was not easy to find on Google. If you search for "Tennis ATP Data API", the first thing you see is a scraper that downloads data from the ATP website – but given the large amount of data we wanted to avoid that.
Very quickly, you come across the GitHub repositories of Jeff Sackmann, a tennis nerd who collects quite a lot of tennis data, some of it scraped and some of it gathered by hand together with volunteers. He makes it all available as CSVs. For anyone who wants to start working with tennis data, this is a great place to begin. At the time of our research, however, it was unclear whether and how regularly the data would be updated.
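To give an idea of what working with such match-level CSVs can look like, here is a minimal Python sketch that counts one player's wins and losses per season. The column names (`tourney_date`, `winner_name`, `loser_name`) are an assumption about the file layout, and the sample rows are hypothetical placeholders, not real results. It is not the code we used for the story.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed column layout.
# The rows are placeholders, not real match results.
SAMPLE_CSV = """tourney_date,winner_name,loser_name
19980706,Player B,Player A
19980928,Player A,Player C
19990201,Player A,Player B
19990208,Player D,Player A
19990215,Player A,Player D
"""


def wins_and_losses_by_year(csv_text, player):
    """Count wins and losses per season for one player."""
    record = defaultdict(lambda: {"wins": 0, "losses": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        # tourney_date is assumed to be a YYYYMMDD string.
        year = int(row["tourney_date"][:4])
        if row["winner_name"] == player:
            record[year]["wins"] += 1
        elif row["loser_name"] == player:
            record[year]["losses"] += 1
    return dict(record)


if __name__ == "__main__":
    tallies = wins_and_losses_by_year(SAMPLE_CSV, "Player A")
    for year, tally in sorted(tallies.items()):
        print(year, tally)
```

With a real file, the text passed to the function would simply be read from disk instead of coming from the inline sample.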
Our process took over three months, from the idea to publication. We wanted to publish the story to coincide with a major event in Federer's career. But what event would that be? In autumn 2017, when we started working, it wasn't clear at all: would he win another big tournament? Or might he get injured and retire? We had to prepare for all eventualities. (For those who are interested: here is our internal summary in German, which we write for each of our research projects, where we discussed the possible dates.)
A little help from Serbia
What rescued us was the website ultimatetennisstatistics.com by Mileta Cekovic. The code of the whole site is open source. If you download the repository, you can recreate the whole site on your (Windows) computer, including a Postgres database in which all data is stored in a structured way.
So we had the data, but what were we actually looking for? I knew the rules of tennis, but had no idea what might be interesting in the data. We clearly needed help! We got it from Bernhard Schär, a Federer confidant from the very beginning. Together with him we formulated a number of hypotheses and checked whether they held up:
Is it true that Federer was older than others when he got to the top?
Who is his toughest opponent? Why?
Is Federer really the GOAT? The greatest of all time?
What is the competition doing?
At the same time, we were looking for examples to learn from. In a Google Doc, all team members collected screenshots of interesting data journalism projects in the field of sports. With the help of this collection, we were able to formulate further exciting hypotheses and see which forms of presentation might be suitable for which data set.
There was one question that we struggled with for every visualization: How deep can we go? It was clear that more than half of the readers would read the story on a smartphone; in the end it was over two thirds. Nevertheless, we didn't want to do without very detailed graphics.
One way to deal with this dilemma is to hide less important data on mobile devices. In the following graphic, for example, we've given the most important players small portraits on the desktop, and omitted them on mobile devices. In addition, on mobile devices we have drastically reduced the number of data points shown in order to make the differences between the important players more visible.
Looking back, we probably should have done the same with other graphics as well. For example, the first and the last graphics are very detailed. On mobile devices you can look at the graphics, but the legibility would have been better if we had removed some data points. This would have made the lines simpler and easier to read.
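Reducing the number of data points for small screens can be done in many ways. The following Python sketch shows one simple option, evenly spaced subsampling that always keeps the first and the last point. The function name `thin_points` and the numbers are hypothetical, and this is an illustration of the idea rather than the method we actually used.

```python
def thin_points(points, max_points):
    """Keep at most max_points items, evenly spaced, including first and last."""
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    n = len(points)
    if n <= max_points:
        # Nothing to thin: the series is already short enough.
        return list(points)
    # Spread max_points indices evenly over the range 0 .. n-1.
    step = (n - 1) / (max_points - 1)
    indices = [round(i * step) for i in range(max_points)]
    return [points[i] for i in indices]


if __name__ == "__main__":
    series = list(range(10))        # placeholder values, not real data
    print(thin_points(series, 4))   # → [0, 3, 6, 9]
```

A caveat of this approach is that evenly spaced sampling can drop peaks and turning points, so for a career curve it may be worth pinning important moments by hand before thinning the rest.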
But we also always had to ask ourselves: Are we able to select data in such a way that even Federer fans can learn something new without excluding readers who don't know tennis at all?
In search of an answer, we tried to talk to as many people as possible: We showed prototypes of our graphics to sports editors and friends outside the company at an early stage and asked them: Do you understand what we have visualized here? Are you surprised? Are you interested?
It was very helpful that we were able to create a lot of graphics very quickly using R and ggplot. These graphics can also be found in our method description (a thing we always publish at the same time as our articles).
What was missing?
When the Australian Open took place in January and Federer won round after round, it slowly became clear to us: OK, a possible publication date is getting closer and closer. We knew that the final would take place on 28 January. Suddenly everything had to go pretty fast. That was also the reason why we couldn't take a closer look at certain things.
One thing I personally find a pity: not a single tennis court is visualized in the whole story. I would have loved to work with Hawk-Eye data to analyse Federer's game more closely and compare it with his competitors.
We also didn't analyze much other data regarding Federer's playing style: With which stroke does he win his points? With the forehand or the backhand? An analysis of the data from Jeff Sackmann's Match Charting Project – a project in which volunteers log every shot of a match – didn't immediately provide the desired answers.
Update: More difficult than expected
At SRF Data we highly value reproducibility. It was important to us that we could easily update the visualizations if needed, so that we could publish the story again whenever we wanted. Shortly after the first publication, Federer once again reclaimed the world number one ranking, as the oldest tennis player of all time to do so. Further important events in his career will probably follow, but an update would still require a lot of effort. Not only the graphics, but also the text would have to be rewritten. And not only in one language, but in eight. Swissinfo translated the piece into Japanese, Russian and Chinese, among others.
So we have to admit to ourselves that we will probably let the project rest after all and devote ourselves to more classic data journalistic topics again.