On 6 September 2015, Italy's most popular sports newspaper, “La Gazzetta dello Sport” (http://www.gazzetta.it/), published the net annual salaries of every Serie A player for the current season, 2015/16.
For all football lovers and, in particular, for fans of “calcio”, these are interesting numbers that let us look inside the Italian top division and its clubs' wallets.
A few weeks ago, on “Datascience+”, we found an interesting article on data visualization with R (https://www.r-project.org/) and the “ggplot2” package, one of the main tools in data science. In short, the author plotted the base salaries earned by MLS players in 2015, highlighting which clubs pay more and which pay less, and how much each player's salary weighs on his club's finances.
Considering our common passion for football (...and my particular interest in the Italian league!), we took the opportunity to do something similar with Serie A players.
In this first part of the article we are going to show you how we have built and consolidated our data sets.
Because we were unable to find a file containing all the data we needed, we decided once again to use data science tools and, in particular, R. First of all, we found all players' salaries in an article published by the Italian website “Sport Live”. Since the clubs' salary caps published there do not match the ones published by “La Gazzetta dello Sport” (very likely because of a calculation mistake), we collected the salary caps from a second Italian website, “Calcio News 24”. As these articles show, players' salaries were released net, whereas clubs' salary caps were released gross: an important point to keep in mind in the next steps.
We start by collecting data from the first website, “Sport Live”.
Thanks to the “getURL()” function (from the “RCurl” package), we load the two web pages of this article into our workspace.
What we need to do now is keep only the parts of these pages that contain the information we need for the analysis: players' names, clubs and players' salaries.
If we take a look at the article, we see that what we need from the first page starts at the line “Atalanta” and ends just before the line “Juventus e le altre squadre a PAG 2”. To extract the data we need, we are going to use the “gsub()” function (for more insights, take a look at our Data Science article “Working on strings in R”).
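As a minimal sketch of this download step, assuming the “RCurl” package is installed: the helper name “fetch_page” and the variable “url1” are ours, the first URL is the one given in the reference list, and the address of the second page is not stated in the text, so it is left as a placeholder in a comment.

```r
# Sketch only: a thin wrapper around RCurl::getURL().
# "fetch_page" is a hypothetical helper name, not part of the original code.
fetch_page <- function(url) {
  RCurl::getURL(url, .encoding = "UTF-8")
}

# First page of the "Sport Live" article (URL taken from the reference list).
url1 <- "http://www.sportlive.it/calcio/stipendi-serie-a-ingaggi-calciatori-squadre-2015-2016.html"

# Not run here, because it needs network access:
# webpage1 <- fetch_page(url1)
# webpage2 <- fetch_page(url2)  # url2: address of the second page, not given in the text
```

Each call would return the whole page as a single character string, which is the form the following steps work on.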
In the first line of the R code above, the “gsub()” function replaces, in the string “webpage1”, everything from the beginning of the page (“^”) up to the line “Atalanta” with the word “Atalanta” itself. In the second command, “gsub()” removes (from the same string “webpage1”) everything from “Juventus e le altre squadre a PAG 2” to the end of the page. We work on the second web page in exactly the same way, creating the string “webpage2”.
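The two trimming commands can be sketched as follows. This is an illustration under assumptions, not the original code: “webpage1” is a short made-up stand-in for the downloaded page, the player names and values are placeholders, and we use a lazy match with “perl = TRUE” so that the cut stops at the first “Atalanta”.

```r
# Toy stand-in for the downloaded page; the real HTML is much longer.
webpage1 <- paste0(
  "<html><body><p>Intro text</p>",
  "<b>Atalanta</b> (cap) PlayerA (1.0), PlayerB (0.5)<br>",
  "Juventus e le altre squadre a PAG 2</body></html>"
)

# Replace everything from the start of the string up to the first "Atalanta"
# with the word "Atalanta" itself. "(?s)" lets "." also match line breaks.
webpage1 <- gsub("(?s)^.*?Atalanta", "Atalanta", webpage1, perl = TRUE)

# Remove everything from the closing marker line to the end of the string.
webpage1 <- gsub("(?s)Juventus e le altre squadre a PAG 2.*$", "", webpage1,
                 perl = TRUE)
```

After these two calls only the block between the two markers is left in “webpage1”.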
Therefore, with the first four lines of code, we have collected all the information in just two strings:
As we can see, there is still something to remove from them: the HTML tags for bold text (“<b>”) and for line breaks (“<br>”). Again thanks to “gsub()”, we remove all of them, and with “paste()” we combine “webpage1” and “webpage2” into one single string, “webpage”.
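A minimal sketch of this clean-up, using short made-up strings in place of the real pages. The helper name “strip_tags” is hypothetical, and the pattern also covers the closing “</b>” and the “<br/>” variants, which is our assumption about what the pages may contain.

```r
# Toy stand-ins for the two trimmed pages (placeholder players and values).
webpage1 <- "<b>Atalanta</b> (cap) PlayerA (1.0), PlayerB (0.5)<br>"
webpage2 <- "<b>Juventus</b> (cap) PlayerC (2.0), PlayerD (1.5)<br>"

# Remove bold tags (opening and closing) and line-break tags.
strip_tags <- function(x) gsub("</?b>|<br ?/?>", "", x)

# Combine the two cleaned pages into one string, separated by a space.
webpage <- paste(strip_tags(webpage1), strip_tags(webpage2))
```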
Now that we have selected only the parts of the article necessary for our analysis, we are going to separate the information of each specific team from the others, in order to have 20 different strings (20 teams): we are going to include them in the “playerList” vector.
However, before doing this, we must define the teams' names: this will help us identify all the information concerning a specific team and separate it from the information concerning other teams. So we simply create a string vector, “teams”, made up of the teams' names as they are written in the articles.
Then, we can build the “playerList” vector:
This vector will look something like this:
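A minimal sketch of how such a vector could be built, assuming that each team's block starts with the team name followed by the club's cap in parentheses and then the players, and that “teams” lists the clubs in the order in which they appear on the page. Only three clubs are used here, and the players and values are placeholders.

```r
teams <- c("Atalanta", "Bologna", "Juventus")  # the article uses all 20 clubs

# Toy stand-in for the combined, tag-free string.
webpage <- paste(
  "Atalanta (cap) PlayerA (1.0), PlayerB (0.5)",
  "Bologna (cap) PlayerC (0.8)",
  "Juventus (cap) PlayerD (2.0), PlayerE (1.5)"
)

# Position at which each team name first appears in the string.
starts <- sapply(teams, function(tm) regexpr(tm, webpage, fixed = TRUE)[1])

# Each block ends right before the next team name (or at the end of the string).
ends <- c(starts[-1] - 1, nchar(webpage))

# One string per team, without leading or trailing spaces.
playerList <- trimws(substring(webpage, starts, ends))
names(playerList) <- teams

print(playerList)
```

This approach relies on no team name appearing earlier in the text than its own block; a real page may need a stricter pattern.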
We are almost there. The next step is to separate the players' names from the players' salaries. From the “playerList” vector we are going to build two more character vectors: “allplayers”, which will contain the players' names, and “allvalues”, which will hold all the players' salaries. To do this, we will again make use of the “gsub()” function.
In the code above we worked on players' names. With “gsub("^.*?\\)", "", playerList[i])” we remove all characters (from each string element of the “playerList” vector) before the first player's name appears (basically, we remove the team's name and the club's salary cap, parentheses included). In particular, the “?” makes “.*” lazy, so the match stops at the first “)” instead of running on to the last one.
After having removed all the salaries (i.e. the values between parentheses placed just after each player's name), replaced the HTML entities for accented characters with the letters themselves (“à”, “ò”, “é” and “è”) and replaced each double comma with a single one, we use the function “strsplit()” to split the players' names into separate strings, using the comma as separator (take a look at our article “Working on strings in R” if you want to know more about this function). In this way, our vector containing all players' names is ready.
In the code we have just run, we can see another vector we have not explained yet: the “playersNumber”. This lets us know how many players each team has (at least in this article). To make it, we have basically counted how many players' names were contained in each string of the “playerList” vector, thanks to the loop we have just seen.
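The loop described above can be sketched as follows, under the same assumed layout (team name, cap in parentheses, then “name (salary)” pairs separated by commas). The input is a made-up two-team “playerList”, and the replacement of HTML entities is left out for brevity.

```r
# Toy input with placeholder players and values.
playerList <- c(
  "Atalanta (cap) PlayerA (1.0), PlayerB (0.5)",
  "Bologna (cap) PlayerC (0.8)"
)

allplayers <- character(0)
playersNumber <- integer(length(playerList))

for (i in seq_along(playerList)) {
  # Lazy match: drop everything up to and including the first ")".
  x <- gsub("^.*?\\)", "", playerList[i], perl = TRUE)
  # Drop every salary, i.e. each "(...)" group.
  x <- gsub("\\([^)]*\\)", "", x)
  # Split on commas, trim spaces and discard empty pieces.
  players <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  players <- players[players != ""]
  # Record how many players this team has, then append their names.
  playersNumber[i] <- length(players)
  allplayers <- c(allplayers, players)
}
```

Discarding empty pieces after the split plays the same role as collapsing double commas before it.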
Having done this, we can build the second vector we were talking about: the “allvalues”.
Here we introduce two other functions: “gregexpr()” and “regmatches()”. The first one finds the matches, that is, the positions in the string “text” where the players' salaries (i.e. the values between parentheses, the “pattern”, parentheses included) are located. The “regmatches()” function then extracts from “x” the elements at the positions stored in “m” (the result of the “gregexpr()” call).
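A minimal sketch of this extraction on the same made-up input. It assumes that salaries are written with a decimal point; if the page used a decimal comma, the comma would have to be replaced before calling “as.numeric()”.

```r
# Toy input with placeholder players and values.
playerList <- c(
  "Atalanta (cap) PlayerA (1.0), PlayerB (0.5)",
  "Bologna (cap) PlayerC (0.8)"
)

allvalues <- numeric(0)

for (i in seq_along(playerList)) {
  # Drop the team name and the club's cap (everything up to the first ")").
  x <- gsub("^.*?\\)", "", playerList[i], perl = TRUE)
  # Positions of every "(...)" group in the remaining string.
  m <- gregexpr("\\([^)]*\\)", x)
  # The matched substrings themselves, parentheses included.
  found <- regmatches(x, m)[[1]]
  # Strip the parentheses and convert to numbers.
  values <- as.numeric(gsub("[()]", "", found))
  allvalues <- c(allvalues, values)
}
```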
And it's a done deal! We can finally build our first data set, “salary”: the data set containing players' names, the clubs where they play, and their net annual salaries (in millions of euros).
This data set will then look like this:
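As an illustration of how the pieces fit together, the sketch below assembles a small “salary” data frame from made-up vectors. The column names are our assumption, and the players and values are placeholders rather than real figures.

```r
# Toy vectors in the shape produced by the previous steps.
teams <- c("Atalanta", "Bologna")
playersNumber <- c(2L, 1L)
allplayers <- c("PlayerA", "PlayerB", "PlayerC")
allvalues <- c(1.0, 0.5, 0.8)

# Repeat each team name once per player so the columns line up.
salary <- data.frame(
  player = allplayers,
  team = rep(teams, times = playersNumber),
  salary = allvalues,
  stringsAsFactors = FALSE
)

print(salary)
```

This is where “playersNumber” is needed: it tells “rep()” how many rows belong to each club.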
The second (and last) data set we are going to build is the one containing all teams' annual salary caps (in millions of euros). As we said, we are going to collect these from the second website, “Calcio News 24”, using the same techniques we have seen so far.
At this point, we will have finally collected and consolidated our data.
Before moving to the next part, as we have already mentioned, it is very important to keep in mind that all players' salaries collected are net (data set “salary”), whereas clubs' salary caps are gross (data set “salaryCaps”): therefore, we will not work on them together.
In the next part of this article we will carry out some interesting exploratory analyses and, in particular, draw some nice graphs!
Calcio News 24. “Confronto classifica/monte ingaggi: boom toscane, flop Juve e Bologna” (2015) - http://www.calcionews24.com/serie-a-confronto-classifica-monte-ingaggi-boom-toscane-flop-juve-e-bologna-467487.html
Kodali, Teja. “Visualizing MLS Player Salaries with ggplot2” (2015) - http://datascienceplus.com/visualizing-mls-player-salaries/
La Gazzetta dello Sport. “Tutti gli ingaggi della Serie A: Juve leader, Milan oltre i 100, Inter +24” (2015) - http://www.gazzetta.it/Calciomercato/storie/07-09-2015/tutti-ingaggi-serie-a-juve-leader-milan-oltre-100-inter-24/
Sportlive.it. “Stipendi Serie A: ingaggi calciatori 2015/2016” (2015) - http://www.sportlive.it/calcio/stipendi-serie-a-ingaggi-calciatori-squadre-2015-2016.html