Data Science: Working on strings in R

In Data Science, we rarely deal with numbers alone when we analyse data; we often find ourselves working with data in string form. As defined by Per Christensson (2006), a string is a data type used in programming, like an integer or a floating-point number, but it represents text rather than numbers. It consists of a sequence of characters that can include spaces, digits and many of the symbols used in everyday work (hyphens, currency symbols, etc.).

Therefore, if we want to conduct effective data analyses, we must know how to handle string data. To explore this, we will use R (https://www.r-project.org/).

R is one of the leading tools in Data Science. It is more than a statistical package: it is a programming language in which we can create our own objects, functions, and packages.

Although some of us may argue that it is less intuitive than languages such as Java or Python, R can still take us very far if we know how to use it.

Last but not least, it has strengths that are always worth remembering. First, we can use it anywhere: it is platform-independent, so it runs on all major operating systems. It is free: we can use it at any employer without having to persuade our boss to purchase a license. Not only is R free, it is also open-source: anyone can examine the source code and potentially fix bugs or add features.

In R, text strings are mainly handled as character variables, that is, vectors of type character.

Here we describe some of the most useful functions for handling and processing character strings. As a practical example, we will use the “state” data sets that already come with R (i.e. data sets related to the 50 states of the United States of America).

is.character() and as.character()

Before working on character variables, it is good practice to check that our data really are of type character. We can do this with the “is.character()” function.

If, for example, we need to convert “state.area” (a numeric vector of the states’ areas, in square miles) into a character variable, we can simply apply the “as.character()” function.
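A minimal sketch of both checks, using only the built-in “state” data sets. The variable name “stateAreaAsText” is hypothetical, chosen just for this illustration.

```r
# state.name and state.area ship with base R (package "datasets")
is.character(state.name)   # TRUE: a character vector of state names
is.character(state.area)   # FALSE: a numeric vector of areas

# Convert the numeric areas into character strings
stateAreaAsText <- as.character(state.area)   # hypothetical name
is.character(stateAreaAsText)                 # TRUE
head(stateAreaAsText)
```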

tolower() and toupper()

Like most programming languages, R is case-sensitive: it distinguishes between uppercase and lowercase letters. The “tolower()” and “toupper()” functions help us deal with this by converting strings to a single case.
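As a short illustration, the sketch below converts the first three state names to each case and shows why the conversion matters for comparisons. The variable names are hypothetical.

```r
firstStates <- head(state.name, 3)   # "Alabama" "Alaska" "Arizona"

lowerStates <- tolower(firstStates)  # "alabama" "alaska" "arizona"
upperStates <- toupper(firstStates)  # "ALABAMA" "ALASKA" "ARIZONA"

# String comparison is case-sensitive
"alabama" == "Alabama"               # FALSE
tolower("Alabama") == "alabama"      # TRUE
```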

nchar()

Sometimes we need to know the length of our strings. The “nchar()” function counts the number of characters of any kind (letters, digits, spaces, symbols) in a string.

For example, we could be interested in getting only the states whose names have exactly 6 characters:
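One possible way to do this is to compute the length of every name and keep those equal to 6. The name “stateNameLengths” is an assumption made for this sketch; “statesWith6characters” is the vector name used later in the text.

```r
nchar("New York")   # 8: the space counts as a character

# Length of every state name, then keep the names of length 6
stateNameLengths <- nchar(state.name)
statesWith6characters <- state.name[stateNameLengths == 6]
statesWith6characters
# "Alaska" "Hawaii" "Kansas" "Nevada" "Oregon"
```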

str_count()

Along the same lines as “nchar()”, another function gives us the number of occurrences of a specific character in one or more strings: “str_count()”, from the stringr package. Here is a practical example considering the character “k” (lowercase only):
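A minimal sketch, assuming the stringr package is installed (it is not part of base R). The name “kCounts” is hypothetical.

```r
library(stringr)   # assumption: install.packages("stringr") was run beforehand

# Number of lowercase "k" in each state name
kCounts <- str_count(state.name, "k")
names(kCounts) <- state.name

# Keep only the states with at least one lowercase "k"
kCounts[kCounts > 0]

# "Kansas" contains only an uppercase "K", so its count is 0
kCounts[["Kansas"]]
```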

grep()

Let’s now turn the problem around. Suppose this time we need to select the states whose names contain the letter “w”, whether uppercase or lowercase (as we have already seen, R distinguishes between the two). The “grep()” function works perfectly in this scenario:
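A short sketch of the two calls, using a regular-expression character class so that either case matches. The variable names are hypothetical.

```r
# Positions in state.name of the names containing "w" or "W"
wPositions <- grep(pattern = "[wW]", x = state.name, value = FALSE)
wPositions

# The matching names themselves
wStates <- grep(pattern = "[wW]", x = state.name, value = TRUE)
wStates

# Both results describe the same elements
identical(state.name[wPositions], wStates)   # TRUE
```

An alternative with the same effect here would be “grep("w", state.name, ignore.case = TRUE)”.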

Since we are looking for states containing either “w” or “W”, we set the argument “pattern” to a character class with both characters, “[wW]” (the plain pattern “wW” would only match the two letters in sequence). Notice the difference between setting the argument “value” to “FALSE” or to “TRUE”: in the first case we get the positions, in the vector “state.name”, of the strings that match our search; in the second we directly get their values.

paste()

While working on string data, we may need to combine several character variables into one string. The “paste()” function takes one or more R objects, converts them to character, and then concatenates them to form a character string.
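As an illustration, the sketch below combines 5 separate character objects. Which objects the original example used is not known, so the five state names here are an assumption.

```r
# Five separate strings joined with a comma and a space
combined <- paste("Alaska", "Hawaii", "Kansas", "Nevada", "Oregon", sep = ", ")
combined
# "Alaska, Hawaii, Kansas, Nevada, Oregon"
```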

In the example above, we combined 5 character objects and separated them with a comma and a space. As we can see, the argument “sep” is simply a character string used as a separator. On the other hand, if we want to combine the strings that belong to a single character vector, we use the argument “collapse”:
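A minimal sketch with the vector of 6-character state names; the separator “, ” is an assumption made for this illustration.

```r
statesWith6characters <- state.name[nchar(state.name) == 6]

# One character vector collapsed into a single string
collapsed <- paste(statesWith6characters, collapse = ", ")
collapsed
# "Alaska, Hawaii, Kansas, Nevada, Oregon"
```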

Now let’s note the difference between the following two examples:
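The two calls could look like this. The value given to “collapse” in the second call is an assumption (a single space); the names “withSep” and “withCollapse” are hypothetical.

```r
statesWith6characters <- state.name[nchar(state.name) == 6]

# First example: "sep" only, the result is a vector of 5 strings
withSep <- paste(statesWith6characters, "is a state", sep = " ")
withSep
length(withSep)        # 5

# Second example: "collapse" added, the result is a single string
withCollapse <- paste(statesWith6characters, "is a state", sep = " ", collapse = " ")
withCollapse
length(withCollapse)   # 1
```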

In the first case, thanks to “sep”, we combine each element of the vector “statesWith6characters” with the single string “is a state”, separated by a space. As a result, we get a new character vector, still with 5 elements, each of which combines one element of “statesWith6characters” with “is a state”. In the second example, we still combine each element of “statesWith6characters” with “is a state” (again separated by a space), but this time “collapse” joins the results into a single string rather than a vector of 5 elements.

To get a clearer output, we can use both arguments “sep” and “collapse” together in this way:
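One possible combination, assuming a comma and a space as the value of “collapse” so that the individual phrases stay readable:

```r
statesWith6characters <- state.name[nchar(state.name) == 6]

# "sep" joins each name with the phrase, "collapse" joins the 5 results
clearer <- paste(statesWith6characters, "is a state", sep = " ", collapse = ", ")
clearer
# "Alaska is a state, Hawaii is a state, Kansas is a state, Nevada is a state, Oregon is a state"
```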

strsplit()

The opposite of “paste()” is “strsplit()”, which allows us to split one or more strings into several shorter strings.
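A short sketch with two calls; the input strings are assumptions chosen from “state.name”. Note that “strsplit()” returns a list with one element per input string.

```r
# First case: split on a single space
splitOnSpace <- strsplit(c("New Jersey", "New York"), split = " ")
splitOnSpace
# [[1]] "New" "Jersey"   [[2]] "New" "York"

# Second case: split on the letter "w" (the delimiter itself is dropped)
splitOnW <- strsplit("Delaware", split = "w")
splitOnW
# [[1]] "Dela" "are"
```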

Notice the use of the argument “split” as the delimiter for splitting the strings: a single space in the first case, the letter “w” in the second.

sub() and gsub()

We may also need to replace a particular character in a string. “sub()” and “gsub()” let us do that.
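A minimal sketch, assuming we replace the lowercase “a” with an uppercase “A” in “Alabama” (the exact call used in the original example is not known):

```r
# sub() replaces only the first match
firstOnly <- sub(pattern = "a", replacement = "A", x = "Alabama")
firstOnly   # "AlAbama"

# gsub() replaces every match
allMatches <- gsub(pattern = "a", replacement = "A", x = "Alabama")
allMatches  # "AlAbAmA"
```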

We can easily see the difference between the two functions: while “sub()” replaces only the first occurrence of a pattern in a string (the first “a” in our example), “gsub()” replaces all occurrences (so, every “A” in the string “AlAbAmA”).

substr()

Suppose, this time, that we do not need to replace particular characters but just to extract a specific part of a string. With the “substr()” function we can, for example, extract the first 4 characters of every state name. Here is how it works:
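A short sketch; the variable name “firstFour” is hypothetical.

```r
# Characters 1 to 4 of every state name
firstFour <- substr(state.name, start = 1, stop = 4)
firstFour

# The names starting with "New " keep the space as their 4th character
firstFour[firstFour == "New "]
```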

As we can see from the example above, there are 4 particular cases: “New ”, “New ”, “New ”, “New ”. These elements correspond to “New Hampshire”, “New Jersey”, “New Mexico” and “New York”: in all these strings the 4th character is a space which, in this context, is counted as a character like any other. We can highlight this even better if we consider the first 5 characters:
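The same idea with 5 characters, restricted here to the four “New” states for readability. The variable names are hypothetical.

```r
# Characters 1 to 5 of every state name
firstFive <- substr(state.name, start = 1, stop = 5)

# The space now sits in the middle of the extracted part
newStatesFive <- firstFive[substr(state.name, 1, 4) == "New "]
newStatesFive
# "New H" "New J" "New M" "New Y"
```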

 
References:
  • Christensson, Per. “String Definition.” TechTerms. (2006). Accessed Nov 23, 2015. http://techterms.com/definition/string.
  • R Documentation, US State Facts and Figures – https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/state.html
  • Sanchez, G. (2013) Handling and Processing Strings in R – Trowchez Editions. Berkeley, 2013 – http://www.gastonsanchez.com/Handling and Processing Strings in R.pdf
  • Ulrich, Joshua. “Why Use R?” (2010) – http://www.r-bloggers.com/why-use-r/