An Exploratory Analysis with R
In the first part of the article we showed how to create the data sets “salary” and “salaryCaps” from the web sites that contain the data we need. To summarize, the data set “salary” contains, for each player, his name, the club he plays for, and his net annual salary. The data set “salaryCaps”, by contrast, holds the annual salary cap of each club, i.e. the limit on the amount of money that a team can spend on player salaries in a year. As mentioned in the first part, the data were published on 6 September 2015 by “La Gazzetta dello Sport”, the most popular Italian sports newspaper (http://www.gazzetta.it/). Again, it is important to remember that players’ salaries were published net, whereas clubs’ salary caps were published gross: because of this, we cannot combine the two in a single analysis.
In the next step we will start with some basic statistics to get a better understanding of our data sets. Then we will plot the data to extend our exploratory analysis. We will also borrow the graphical idea adopted by Teja Kodali in his article on “Datascience+”, where he displays the base salaries earned by MLS players in 2015, highlighting which clubs pay more and which pay less, and what weight each player’s salary has in the finances of his club.
Basic Statistics: the data set “salary”
Let’s take another look at our first data set, “salary”:
As we have seen, the data set contains each player’s name, his club, and his net annual salary (in millions of euros). With “summary()” (see our article “Data analysis: rules to follow in processing and cleaning data” to learn more about it), R immediately gives us some interesting statistics about the net salaries.
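As a minimal sketch of what such a call looks like, the snippet below runs “summary()” on a small toy data frame. The column names “PLAYER” and “NET” and all the values are assumptions made for illustration (only “CLUBS” is named in this article), so the numbers are not the real Serie A figures.

```r
# Toy stand-in for the "salary" data set: every name and value here is invented.
salary <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C", "Player D"),
  CLUBS  = c("AAA", "AAA", "BBB", "BBB"),
  NET    = c(0.5, 2.0, 0.1, 1.4),  # net annual salary, millions of euros
  stringsAsFactors = FALSE
)

# summary() on a numeric column returns minimum, quartiles, median, mean and maximum
net_stats <- summary(salary$NET)
print(net_stats)
```

The individual statistics can then be read by name, for example `net_stats[["Mean"]]`.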
We can see, for example, that the minimum net salary in the Italian Serie A (season 2015-16) is 10k euros, whereas the maximum reaches 6.5 million euros: quite an impressive gap! As for the mean, the average Serie A salary is around 700k euros: not bad, but still very far from the highest one. Now, let’s see which players are behind these numbers.
Eight players earn exactly the minimum salary seen above (10k euros): 2 of them play for Carpi (CAR) and 6 for Frosinone (FRO), two teams that are both playing their first ever Serie A season.
Daniele De Rossi, Roma (ROM) vice-captain and 2006 World Cup winner with the “Azzurri”, retains his spot as the highest-paid Serie A player, pocketing an impressive 6.5 million euros.
Finally, 26 players earn the net annual salary closest to the Serie A average (704.1k euros). In particular, 10 of them play for the two clubs from Genoa: Genoa (GEN) and Sampdoria (SAM).
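The three look-ups described above (lowest earners, top earner, players closest to the mean) can be sketched in base R as follows. This is only one possible way to do it, not necessarily the code used for the article; the data frame is again a toy example with invented values and hypothetical column names “PLAYER” and “NET”.

```r
# Toy data: all names and values are invented for illustration.
salary <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  CLUBS  = c("AAA", "BBB", "BBB", "CCC", "CCC"),
  NET    = c(0.1, 0.1, 0.6, 0.7, 2.5),  # millions of euros
  stringsAsFactors = FALSE
)

# Rows that share the minimum and the maximum net salary
lowest  <- salary[salary$NET == min(salary$NET), ]
highest <- salary[salary$NET == max(salary$NET), ]

# Rows whose salary is closest to the mean (ties are all kept)
gap     <- abs(salary$NET - mean(salary$NET))
closest <- salary[gap == min(gap), ]

# How many of the lowest earners play for each club
lowest_by_club <- table(lowest$CLUBS)
print(lowest_by_club)
```

Comparing against `min(gap)` rather than calling `which.min()` keeps every tied player, which matters when several players earn the same amount.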
Basic Statistics: the data set “salaryCaps”
Let’s now take a look at the second data set, “salaryCaps”:
Again, we can start with a “summary()”:
The minimum annual salary cap (gross) in Serie A this season is 8 million euros, whereas the maximum is 124 million euros: almost 16 times higher! Considering all 20 clubs, an Italian Serie A club spends, on average, 44.10 million euros on player salaries. Now, let’s check which teams correspond to these numbers.
As we might have expected from the statistics on the previous data set, the club with the lowest salary cap in Serie A is Frosinone, a newly promoted club taking part in the Italian top flight for the first time. At the other end, Juventus (JUV), Serie A champions for the last four seasons, is the club with the highest salary cap in the league: 124 million euros. Fiorentina (FIO), instead, is the club whose salary cap is closest to the Serie A average: 46 million euros.
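The figures quoted above can be checked with a few lines of R. The sketch below uses only the three salary caps and the league average stated in the text; the column name “CAP” is a hypothetical choice, and the full 20-club data set is not reproduced here.

```r
# Only the three clubs whose caps are quoted in the text (millions of euros, gross).
caps <- data.frame(
  CLUBS = c("FRO", "FIO", "JUV"),
  CAP   = c(8, 46, 124),
  stringsAsFactors = FALSE
)
league_mean <- 44.10  # league-wide average as stated in the text

# Ratio between the highest and the lowest cap
ratio <- max(caps$CAP) / min(caps$CAP)

# Club whose cap is nearest to the league average
closest_club <- caps$CLUBS[which.min(abs(caps$CAP - league_mean))]

print(ratio)
print(closest_club)
```

The ratio works out to 15.5, which is the “almost 16 times” mentioned above.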
Plotting the data: the data set “salary”
For plotting the players’ net annual salaries we will use the R library “ggplot2”, one of the main tools in Data Science. There are several books and guides which talk about this library. Here we will refer to one in particular, “ggplot2: Elegant Graphics for Data Analysis (Use R!)”.
As Teja Kodali does in his article, we choose to display the players’ names inside the bar sections that correspond to their salaries. If we placed them at the top of each section of the bar instead, we would probably clutter the graph. To do this, we need to calculate the midpoint of each section of the bars and display the name at that midpoint. This can be done as follows:
In the code above we have introduced the function “transform()”. Applied to each group of a data frame in turn, this function lets us perform group-wise transformations with very little work. This is particularly useful if you want to add new variables that are calculated on a per-group level, such as a per-group standardisation. As explained in the article “Visualizing MLS Player Salaries with ggplot2”, in our example the data frame “salary” is split by the “CLUBS” variable; then, for each section of a bar, the cumulative sum of net salaries up to and including that section, minus half the net salary of the section itself, gives its midpoint.
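The midpoint rule described above (cumulative sum within a club minus half of the current salary) can be sketched in base R as shown below. This version uses “transform()” together with “ave()” to get the per-club cumulative sum; it is an alternative written for illustration, not necessarily the code used in the article, and the data, the column names “PLAYER” and “NET”, and the new column “pos” are assumptions.

```r
# Toy data: all names and values are invented for illustration.
salary <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  CLUBS  = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  NET    = c(2, 1, 0.5, 3, 1),  # millions of euros
  stringsAsFactors = FALSE
)

# ave() applies cumsum() separately within each club and returns the
# results in the original row order; subtracting half of the row's own
# salary moves from the top of the bar section to its middle.
salary <- transform(salary, pos = ave(NET, CLUBS, FUN = cumsum) - 0.5 * NET)

print(salary)
```

For club “AAA” the sections end at 2, 3 and 3.5, so their midpoints are 1, 2.5 and 3.25.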
We can finally plot the data:
In particular, we used the “geom_text()” function to specify the text to be included in the chart. Since some sections of the chart are too small to fit a player’s name, we decided to display only the names of players whose net annual salary is more than 1.5 million euros. This is the graph we obtain:
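For readers who want to reproduce a chart of this kind, here is a hedged sketch of a stacked bar chart with selective labels. It assumes the “ggplot2” package is installed, uses invented toy data with hypothetical columns “PLAYER”, “NET” and “pos”, and assumes that bar sections are stacked in row order within each club, so it should be read as an illustration rather than as the exact code behind the figure.

```r
library(ggplot2)

# Toy data: all names and values are invented for illustration.
salary <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  CLUBS  = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  NET    = c(2, 1, 0.5, 3, 1),  # millions of euros
  stringsAsFactors = FALSE
)

# Midpoint of each bar section: per-club cumulative sum minus half the salary
salary <- transform(salary, pos = ave(NET, CLUBS, FUN = cumsum) - 0.5 * NET)

# Label only the players above the 1.5 million threshold
labelled <- subset(salary, NET > 1.5)

p <- ggplot(salary, aes(x = CLUBS, y = NET)) +
  geom_bar(stat = "identity", colour = "white", fill = "steelblue") +
  geom_text(data = labelled, aes(y = pos, label = PLAYER), size = 3) +
  labs(x = "Club", y = "Net annual salary (millions of euros)")

# print(p) draws the chart on the active graphics device
```

Passing a filtered data frame to “geom_text()” keeps the bars complete while dropping the labels that would not fit.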
Plotting the data: the data set “salaryCaps”
As a last step, we can plot the data on Serie A salary caps.
We can start with a simple bar plot in order to visualize the distribution of the salary caps in the Italian Serie A.
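A bar plot of this kind could be sketched as follows. The snippet assumes “ggplot2” is installed and, to avoid inventing figures, plots only the three clubs whose caps are quoted in this article; the column name “CAP” is a hypothetical choice.

```r
library(ggplot2)

# Only the three caps quoted in the text (millions of euros, gross).
caps <- data.frame(
  CLUBS = c("FRO", "FIO", "JUV"),
  CAP   = c(8, 46, 124),
  stringsAsFactors = FALSE
)

# Order the clubs from the highest to the lowest cap so the bars are sorted
caps$CLUBS <- factor(caps$CLUBS, levels = caps$CLUBS[order(-caps$CAP)])

p <- ggplot(caps, aes(x = CLUBS, y = CAP)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Club", y = "Gross salary cap (millions of euros)")

# print(p) draws the chart on the active graphics device
```

Turning “CLUBS” into a factor with explicit levels is what controls the left-to-right order of the bars.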
We finish our analysis of Serie A players’ salaries with a pie chart showing each team’s salary cap as a percentage of the total amount of money that all Serie A teams spend on player salaries.
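The percentages behind such a pie chart can be approximated from the figures quoted in this article. The sketch below derives the league total from the stated average (20 clubs at 44.10 million euros each, so roughly 882 million) and groups the 17 clubs not quoted individually into a single “Other 17 clubs” slice; that grouping, the column names “CAP” and “SHARE”, and the use of “ggplot2” are assumptions made for illustration.

```r
library(ggplot2)

n_clubs  <- 20
mean_cap <- 44.10              # average cap stated in the text
total    <- n_clubs * mean_cap # approximate league total, about 882

# The three caps quoted in the text, plus everything else as one slice
caps <- data.frame(
  CLUBS = c("FRO", "FIO", "JUV"),
  CAP   = c(8, 46, 124),
  stringsAsFactors = FALSE
)
caps <- rbind(
  caps,
  data.frame(CLUBS = "Other 17 clubs", CAP = total - sum(caps$CAP),
             stringsAsFactors = FALSE)
)

# Share of the league total, in percent, rounded to one decimal place
caps$SHARE <- round(100 * caps$CAP / total, 1)
print(caps)

# A pie chart is a stacked bar drawn in polar coordinates
p <- ggplot(caps, aes(x = "", y = CAP, fill = CLUBS)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, fill = "Club")
```

On these numbers Juventus alone accounts for about 14.1% of the league total, against about 0.9% for Frosinone.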
- Calcio News 24. “Confronto classifica/monte ingaggi: boom toscane, flop Juve e Bologna” (2015) – http://www.calcionews24.com/serie-a-confronto-classifica-monte-ingaggi-boom-toscane-flop-juve-e-bologna-467487.html
- Kodali, Teja. “Visualizing MLS Player Salaries with ggplot2” (2015) – http://datascienceplus.com/visualizing-mls-player-salaries/
- La Gazzetta dello Sport. “Tutti gli ingaggi della Serie A: Juve leader, Milan oltre i 100, Inter +24” (2015) – http://www.gazzetta.it/Calciomercato/storie/07-09-2015/tutti-ingaggi-serie-a-juve-leader-milan-oltre-100-inter-24/
- Sportlive.it. “Stipendi Serie A: ingaggi calciatori 2015/2016” (2015) – http://www.sportlive.it/calcio/stipendi-serie-a-ingaggi-calciatori-squadre-2015-2016.html
- Wickham, Hadley. “ggplot2: Elegant Graphics for Data Analysis (Use R!)” (2009) – Springer