
Data analysis: rules to follow in processing and cleaning data

What we must do before analysing our data

One of the biggest problems for a data analyst (probably the biggest one) is making sure of the reliability of the data they are going to work with. In a Data Science context we often face situations such as missing values, data reported in different formats or, simply, data wrongly collected.
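
The example table from the original article is not reproduced here; as a rough reconstruction, the following R snippet builds a data set with the same kinds of problems (apart from user “3458” and the “YE” typo, which are mentioned in the text, all values are illustrative):

# Rough reconstruction of the example data set (values are illustrative)
users <- data.frame(
  User_ID         = c("1204", "2311", "3458", "4092", "5177"),
  Legal_age       = c("YES", "YE", "YES", "", "NO"),          # one "YE" typo, one missing value
  Last_login_date = c("2015-11-02", "28/10/2015", "2015-11-15",
                      "01/11/2015", "2015-10-19"),            # two different date formats
  Last_login_time = c("14:32:07", "09:15:43", "18:05",        # user "3458" lacks the seconds
                      "22:47:12", "07:58:21"),
  stringsAsFactors = TRUE
)
users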

In fact, a data set collected like the one in the example above can be quite frustrating for a data analyst. Apart from a missing value and a “YES” value wrongly reported (i.e. as “YE”) in the column “Legal_age”, variables collected in different formats can be quite troublesome if we do not spot them before starting our analysis. That is exactly the case for the variables “Last_login_date” and “Last_login_time”. In the first, dates are recorded in two different formats, “YYYY-MM-DD” and “DD/MM/YYYY”, whereas in the second most times are reported as “HH:MM:SS” except in one case (user “3458”) where the seconds are missing.

Having seen this, we should not hesitate to verify our data before starting the analysis. That is why the first things to do, straight after we have assembled our data set, are data processing and, potentially, data cleaning.

What we are going to see now is how these two stages work and which steps we should always remember to take.

 
Data processing: Step 1

In Data Science, and especially in the Big Data world, we often find ourselves in a situation where we need to deal with different data sets, usually files in “.xls” or “.csv” format. This mostly happens because data have been collected per cluster, such as periods of years or departments of a country, for example. Great Data Science tools like R (https://www.r-project.org/) help us organize data well, for instance by placing them into the rows and columns of one single table.
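
As a minimal sketch of this consolidation step (the file names below are hypothetical):

# Hypothetical example: data collected per cluster (here, per year)
# and stored in separate ".csv" files, combined into one single table
files <- c("clients_2013.csv", "clients_2014.csv", "clients_2015.csv")
parts <- lapply(files, read.csv)     # read each file into a data frame
full_data <- do.call(rbind, parts)   # stack them: same columns required
str(full_data)                       # one table, organized in rows and columns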

 
Data processing: Step 2

After we have consolidated our data set, we should always remember to check the meaning of each variable: we obviously need to understand perfectly all the variables we are going to deal with. It is also helpful to make a list of them and note their respective meanings, units and value ranges. Remember: this is not just for us but, potentially, for other people who will work with us on these data (our developers and our designers, for example). While doing this, we may also find it useful to rename variables with more pertinent and self-describing labels, if this has not already been done: smart labelling is always precious, especially for our future analysis.
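
In R, a quick way to do this renaming might look like the following (the old label is hypothetical):

# Hypothetical example: replace a cryptic label with a self-describing one
names(full_data)                                              # inspect the current labels
names(full_data)[names(full_data) == "lgl_age"] <- "Legal_age"  # rename one column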

 
Data processing: Step 3

This question ties into the notion of hierarchy. Many managers have to deal not only with their own KPIs but also with those of their management. It is therefore essential to take into account the need to share the dashboard with management, even when it is meant to be made available to several groups of people.

 
Data cleaning: Step 1

In Data Science the risk of having data wrongly collected or recorded in different formats is very real; this risk obviously increases when we combine two or more data sets. That is why we must check the nature of our variables. In R, functions like “typeof()”, “levels()”, “table()” and “summary()” can be very useful for doing this (see the sketch below, based on the “state” data sets that already come with R).
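
Since the original screenshot is not reproduced, here is a sketch of these checks on the “state” data sets bundled with R:

# "state.region" is a factor, "state.x77" a numeric matrix of US state statistics
typeof(state.region)                 # "integer": factors are stored as integer codes
levels(state.region)                 # "Northeast" "South" "North Central" "West"
table(state.region)                  # how many states fall into each region
summary(state.x77[, "Population"])   # min, quartiles, median, mean, max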

As we can see, the function “typeof()” determines the type of object we are working on: very useful, for example, to know whether we are dealing with a numeric vector or a character vector.

The function “levels()”, instead, provides access to the levels attribute of a variable. This can help us, for example, to identify wrongly collected data. Let’s take a look at our first example, in particular at the variable “Legal_age” (i.e. has the user reached the legal age?):
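
Reusing the illustrative “users” data set sketched above:

levels(users$Legal_age)
# ""  "NO"  "YE"  "YES"   <- the empty level and "YE" reveal the problems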

This variable is a dichotomous variable which can basically take the values “YES” or “NO”. Thanks to “levels()” we can immediately see that, in our data set, “Legal_age” has at least one missing value and at least one “YE”, which is obviously a wrongly collected “YES”.

If we need to know for how many statistical units (in our case, User_IDs) the “Legal_age” variable assumes the “YE” value or has no value at all, we should take a look at the “table()” function:
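
On the article’s full data set this would produce something like the output below (the 18 blanks and 35 “YE” values come from the text; the “NO” and “YES” counts are not given, so they are left elided):

table(users$Legal_age)
#          NO    YE   YES
#    18   ...    35   ...    <- 18 blanks and 35 "YE" values to correct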

So, in our example, we have 18 missing values (represented by a blank space in the “table()” output) and 35 “YE” values, which obviously need to be corrected.

 
Data cleaning: Step 2

The last command we introduced in the previous step was “summary()”. This function provides a range of descriptive statistics at once: mean, median, first and third quartiles, min and max. It is pretty useful for giving us a better view of the values taken by a variable and, potentially, for checking whether anything among them is anomalous (a min that is too low, a max that is too high, a mean value completely different from our expectations, etc.).

An outlier, for example, can sometimes be such an anomalous value. But what do we mean when we talk about outliers? An outlier is an observation that is numerically distant from the rest of the data, and it usually affects most statistical parameters, such as means, standard deviations and correlations, and every statistic based on these. Identifying outliers is an important task we should perform in order to verify our data set. However, removing them from a data set is not good practice, since they can be legitimate observations and are sometimes the most interesting ones. That is why it is important to investigate their nature before deciding what to do with them. Nevertheless, in cases where it is obvious that the outlier is due to incorrectly entered or measured data, we can drop it without any problem.

In Data Science, graphical tools like boxplots are pretty straightforward for identifying this kind of data. In a boxplot, an outlier is defined as a data point located outside the fences (“whiskers”) of the boxplot. For example, running the R code below will plot a boxplot of a hundred observations sampled from a normal distribution, and will then let us see the outlier point:
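
The original snippet is not shown, so here is a minimal reconstruction; since a plain draw of a hundred values will not always contain an outlier, one extreme value is injected to make sure the boxplot flags a point:

set.seed(1)            # for reproducibility (the seed is our choice)
x <- rnorm(100)        # a hundred observations from a normal distribution
x[100] <- 4            # inject one extreme value so an outlier is guaranteed
boxplot(x)             # the point beyond the whiskers is the outlier
boxplot.stats(x)$out   # prints the outlying value(s)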

 
Data cleaning: Step 3

Looking for duplicates is another important step in the data cleaning stage. Duplicates are often due to errors in data collection. Let’s take a look at this quick example:

Here we have a sample of the Paris Saint-Germain 2015/16 squad, where the variables indicate respectively the player’s name and surname, seasonal appearances and goals scored (up to 30th November 2015). As we can see, Ibrahimovic has been recorded twice: once with both his name and surname, and a second time with just his surname. In this way we end up with two statistical units for Ibrahimovic, which will wrongly lead us to count one more player than the real Paris Saint-Germain 2015/16 squad. Basic text string functions (take a look at our article “Working on strings in R” if you want to know more about them) or simply the “levels()” or “table()” functions can help us identify and solve these problems pretty quickly.
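
A possible sketch of this check in R (the appearance and goal figures are illustrative):

squad <- data.frame(
  Player = c("Zlatan Ibrahimovic", "Edinson Cavani", "Ibrahimovic"),
  Apps   = c(13, 15, 13),     # illustrative figures
  Goals  = c(10, 6, 10),
  stringsAsFactors = FALSE
)
surname <- sub(".* ", "", squad$Player)   # keep only the last word of each name
table(surname)                            # "Ibrahimovic" appears twice
squad[duplicated(surname) | duplicated(surname, fromLast = TRUE), ]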

 
Data cleaning: Step 4

Before running our data analysis, it is always safer to perform simple coherence tests as well, whenever possible. Let’s see, for example, this sample of clients of a wine seller:

For each client we have their ID number, the quantity of 75 cl bottles they bought, the total amount they paid for these bottles (in euros), the quantity of 150 cl bottles they bought and, again, the total amount they paid for these (still in euros). Let’s say we know that a single 75 cl bottle sold by this wine seller costs €15, whereas a 150 cl bottle costs €30. Given that, we can easily check whether there has been any mistake in reporting the total amount each client paid: the variables “Tot_75cl” and “Tot_150cl” depend respectively on “Bottles_75cl” and “Bottles_150cl” (knowing the unit prices).

We find out, in fact, that for client “984” there has been an error in reporting the total amount paid for 75 cl bottles. Dividing the total amount paid (i.e. 350) by the number of 75 cl bottles bought (i.e. 24), we see that client “984” apparently paid €14.583 per 75 cl bottle instead of €15: so it looks very likely that there has been a mistake in recording the total amount paid.
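
A sketch of this coherence test in R (only client “984”, with 24 bottles and a recorded total of 350, comes from the text; the other values are illustrative):

clients <- data.frame(
  Client_ID     = c("321", "984"),
  Bottles_75cl  = c(10, 24),
  Tot_75cl      = c(150, 350),    # client "984": 24 * 15 = 360, not 350
  Bottles_150cl = c(2, 5),
  Tot_150cl     = c(60, 150)
)
# flag rows where a recorded total does not match quantity * unit price
clients[clients$Tot_75cl  != clients$Bottles_75cl  * 15 |
        clients$Tot_150cl != clients$Bottles_150cl * 30, ]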

 
References:
  • R Documentation, Box Plots – https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html
  • R Documentation, Cross Tabulation and Table Creation – https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
  • R Documentation, Levels Attributes – https://stat.ethz.ch/R-manual/R-devel/library/base/html/levels.html
  • R Documentation, The Type of an Object – https://stat.ethz.ch/R-manual/R-devel/library/base/html/typeof.html
  • R Documentation, US State Facts and Figures – https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/state.html