
The method behind Karibu technology

A profiling methodology based on the interests of anonymous users of the leading social network, inferred from their activity on that network's public pages.


In 2018, the leading social network had about 150 million daily active users and more than 50 million public pages. Cross-referencing these two types of information makes it possible to infer, for example, the interests of the network's users from the public pages on which they are active. For a company with a public page, the value of this type of profiling is essentially marketing: it helps the company better understand its customers' tastes and steer its marketing operations accordingly.

The data

We were able to work with a dataset of 743 public pages from a wide variety of fields (TV channels, sports, actors, agri-food, etc.). For each page, the data consists of the anonymized users who have clicked (liked, commented on or shared) at least once on a publication on that page.

To develop a profiling methodology based on this data, we imagined that the fast-food company Y France was our client.

We are therefore interested in the population engaged on the Fast-food Y France page; in other words, the brand's loyal customers. Their tastes are identified through their activity on the publications of the other public pages collected.

The objective is to develop a methodology that identifies similarities between the tastes of the users engaged on the Fast-food Y France page, in order to define profiles. For this purpose, we constructed the following data table:

Reading the table: user 1 (ID 1) clicked 17 times on publications on the Skyrock page, twice on publications on the TF1 page and 5 times on publications on the Cristiano Ronaldo page.

This table contains all users who clicked at least once on the publications of the Fast-food Y France page and on the publications of at least one of the other pages among the 743.
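As an illustration of how such a table could be assembled, here is a minimal sketch. The `events` log format, the helper `build_click_table` and all click counts other than those quoted in the reading example are assumptions made for the example; the original pipeline is not described at this level of detail.

```python
from collections import defaultdict

# Hypothetical interaction log: (user_id, page, number_of_clicks).
# The first three rows mirror the reading example; the rest is invented.
events = [
    (1, "Skyrock", 17),
    (1, "TF1", 2),
    (1, "Cristiano Ronaldo", 5),
    (1, "Fast-food Y France", 4),
    (2, "Skyrock", 3),
    (2, "Fast-food Y France", 6),
    (3, "TF1", 1),  # never clicked on the client page
]

CLIENT_PAGE = "Fast-food Y France"


def build_click_table(events, client_page):
    """Return (pages, table); table[user] lists click counts in page order."""
    counts = defaultdict(lambda: defaultdict(int))
    for user, page, clicks in events:
        counts[user][page] += clicks
    pages = sorted({page for _, page, _ in events})
    table = {}
    for user, per_page in counts.items():
        other_pages = [p for p in per_page if p != client_page]
        # Keep users active on the client page AND on at least one other page.
        if per_page.get(client_page, 0) >= 1 and other_pages:
            table[user] = [per_page.get(p, 0) for p in pages]
    return pages, table


pages, table = build_click_table(events, CLIENT_PAGE)
for user, row in table.items():
    print(user, dict(zip(pages, row)))
```

Pages a user never clicked on are stored as 0, which is what makes the resulting table sparse.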

The methodology

The dataset is large. It is therefore essential to preprocess it before clustering in order to obtain the best possible results.

1. Preprocessing

Users of the site

For each user, we calculated the average number of clicks over the pages they clicked on. We noticed the presence of outliers. They correspond to “robot” users, i.e. users who clicked, on average, significantly more often than the majority of users. Outliers generally distort the results of clustering algorithms, so we removed them.
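The rule used to flag outliers is not specified here. The sketch below assumes a common one, the upper Tukey fence (Q3 + 1.5 × IQR) applied to each user's average; the function names and the sample table are hypothetical.

```python
from statistics import mean, quantiles


def mean_clicks(row):
    """Average number of clicks over the pages the user actually clicked on."""
    clicked = [c for c in row if c > 0]
    return mean(clicked) if clicked else 0.0


def remove_outliers(table, k=1.5):
    """Drop users whose average exceeds Q3 + k * IQR (an assumed rule)."""
    averages = {user: mean_clicks(row) for user, row in table.items()}
    q1, _, q3 = quantiles(averages.values(), n=4)
    upper_fence = q3 + k * (q3 - q1)
    return {u: row for u, row in table.items() if averages[u] <= upper_fence}


# Hypothetical click table: user 6 behaves like a "robot".
table = {
    1: [17, 2, 5],
    2: [3, 0, 1],
    3: [2, 2, 0],
    4: [1, 4, 0],
    5: [0, 3, 3],
    6: [900, 750, 0],
    7: [1, 2, 0],
    8: [4, 0, 4],
}
cleaned = remove_outliers(table)
print(sorted(cleaned))
```

Note that user 1, who is simply more active than the others, stays in the data: only the extreme average is removed.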

We want to know the tastes of the users engaged on the Fast-food Y France page. A user who has clicked only once on the publications of a page cannot be considered engaged on that page, but there is no precise way to determine the threshold above which a user is engaged. We assume that this threshold corresponds to the average number of clicks on the Fast-food Y France page, i.e. 2.33. We therefore removed all users who clicked fewer than 3 times (the first integer above the average) on the Fast-food Y France page.
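The threshold rule can be sketched as follows. The click counts are invented so that the average comes out at 2.33, and taking the first integer strictly above the average is our reading of “integer above average”.

```python
import math

# Hypothetical clicks on the client page, per user (average = 14 / 6 = 2.33).
client_clicks = {1: 1, 2: 1, 3: 2, 4: 1, 5: 5, 6: 4}


def engagement_threshold(clicks_by_user):
    """First integer strictly above the average number of clicks."""
    average = sum(clicks_by_user.values()) / len(clicks_by_user)
    return math.floor(average) + 1


def keep_engaged(clicks_by_user, threshold):
    """Keep users whose click count reaches the threshold."""
    return {u: c for u, c in clicks_by_user.items() if c >= threshold}


threshold = engagement_threshold(client_clicks)
engaged = keep_engaged(client_clicks, threshold)
print(threshold, sorted(engaged))
```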

Among the users engaged on the Fast-food Y France page, very few clicked on the other pages. For example, in the data table presented above, user 2 (ID 2) only clicked on the Skyrock page (and the Fast-food Y France page). This case is recurrent, so the data table contains many 0s: it is said to be “sparse”. Sparse data generally call for specific algorithms that require significant computing resources. To simplify our computation, we therefore want to remove the users with the highest proportion of 0s, i.e. those who clicked on the fewest pages. Below is the distribution of the proportion of 0s per user:

According to this histogram, 100,134 users (51,670 + 48,464) have more than 45% of 0s, i.e. clicked on fewer than 55% of the pages. We decided to remove these users.
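A minimal sketch of this filter, assuming each user is a list of click counts with one entry per page. The 45% cut-off comes from the text; the function names and the three sample users are hypothetical.

```python
def zero_share(row):
    """Proportion of pages on which the user never clicked."""
    return sum(1 for clicks in row if clicks == 0) / len(row)


def drop_sparse_users(table, max_zero_share=0.45):
    """Remove users whose share of 0s is above the cut-off."""
    return {u: row for u, row in table.items() if zero_share(row) <= max_zero_share}


# Hypothetical users over 20 pages.
table = {
    "a": [1] * 12 + [0] * 8,   # 40% of 0s: kept
    "b": [1] * 10 + [0] * 10,  # 50% of 0s: removed
    "c": [1] + [0] * 19,       # 95% of 0s: removed
}
dense = drop_sparse_users(table)
print(sorted(dense))
```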



To identify the tastes of Fast-food Y France customers, we want to rely on popular public pages. The popularity of a page can be measured using PageRank, a famous algorithm published in 1998[1]. It is used by the Google search engine to rate the importance of a web page. The score is between 0 and 10 (0 being the lowest score and 10 the highest). We decided to keep the 48 pages with the highest PageRank.
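For readers unfamiliar with PageRank, here is a minimal power-iteration sketch followed by a top-k selection. The link graph is entirely hypothetical, since the way page scores were obtained for this study is not described. The raw scores sum to 1; a 0 to 10 scale is a rescaled presentation of such scores.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration; `links` maps each page to the pages it points to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


def top_pages(rank, k):
    """Return the k pages with the highest score."""
    return sorted(rank, key=rank.get, reverse=True)[:k]


# Hypothetical graph between four pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
print(top_pages(rank, 2))
```

In the study, the same kind of selection would be done with k = 48.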

2. The clustering methods applied

In the “Profiling Engine” data sheet presented on the 10:11 am website, we proposed 3 clustering methods: K-means, K-modes and K-prototypes. Many others exist. We first chose to apply the K-means method to our dataset. Given the structure of the dataset, we also found it worthwhile to implement another method called Cluster Correspondence Analysis (CCA). Both methods group similar observations together.

Method 1: K-means

We applied a Z-score normalization to our dataset to reduce scale differences, then applied the K-means method to the standardized data. This method assigns each observation to the group whose centre of gravity (centroid) is closest in terms of Euclidean distance.
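The two steps can be sketched in plain Python as follows. This is Lloyd's iteration with a deterministic farthest-first initialization, chosen here for reproducibility; it is not necessarily the K-means variant or initialization used in the study, and the click values are invented.

```python
import math


def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # A constant column has a standard deviation of 0; divide by 1 instead.
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def farthest_first(rows, k):
    """Deterministic initialization: start at rows[0], then pick far points."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(math.dist(r, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(rows, k, iterations=50):
    """Lloyd's algorithm with Euclidean distance; returns (labels, centroids)."""
    centroids = farthest_first(rows, k)
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid for each observation.
        labels = [min(range(k), key=lambda j: math.dist(row, centroids[j]))
                  for row in rows]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical clicks on three pages for six users.
clicks = [[17, 2, 5], [15, 1, 6], [16, 3, 4], [0, 9, 0], [1, 8, 1], [0, 10, 0]]
standardized = zscore_columns(clicks)
labels, centroids = kmeans(standardized, k=2)
print(labels)
```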

Method 2: Cluster Correspondence Analysis (CCA)

This method was proposed in 2017 by Michel van de Velden and co-authors[3]. The iterative CCA algorithm begins by randomly assigning observations to groups. Then a dimension reduction method called Multiple Correspondence Analysis (MCA) is applied, using the group structure obtained in the previous step. Finally, the K-means algorithm is applied to the MCA results. This method requires a qualitative (categorical) dataset. To be able to apply it, we converted our quantitative data into qualitative data.

Our first idea was to create a grid of engagement levels. For example, we could have defined the following levels:

– A: not engaged or barely engaged

– B: engaged

– C: highly engaged

Because click values differ widely between pages and between users, we were unable to validate a statistically sound method for setting the thresholds between levels A, B and C. We therefore had to set this idea aside.

To simplify, we finally decided to replace the click values with 0s and 1s. If the value is greater than or equal to 1 click, it is replaced by 1. If the value is 0, it remains 0. We then applied the CCA method to this new dataset of 0s and 1s. Since this method is based on an MCA, the dataset is automatically transformed into a disjunctive table. Each variable (page) has two modes: 1 for clicked pages and 0 otherwise. The disjunctive table has the following form:

Reading the table: user 1 (ID 1) did not click on the Skyrock page (1 in mode 0) but clicked at least once on the publications of the TF1 page (1 in mode 1).
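The binarization and the disjunctive coding can be sketched as follows. The column naming (`page=mode`) and the click counts are assumptions made for the example; the row for user 1 mirrors the reading above.

```python
def binarize(row):
    """Replace each click count with 1 (at least one click) or 0."""
    return [1 if clicks >= 1 else 0 for clicks in row]


def disjunctive_table(binary_rows, pages):
    """Encode each page as two columns: '<page>=0' and '<page>=1'."""
    header = [f"{page}={mode}" for page in pages for mode in (0, 1)]
    rows = []
    for row in binary_rows:
        encoded = []
        for value in row:
            encoded.extend([1 - value, value])
        rows.append(encoded)
    return header, rows


pages = ["Skyrock", "TF1"]
clicks = {1: [0, 2], 2: [3, 0]}  # hypothetical click counts per user

binary = {user: binarize(row) for user, row in clicks.items()}
header, rows = disjunctive_table(list(binary.values()), pages)
print(header)
for user, row in zip(binary, rows):
    print(user, row)
```

Each pair of columns sums to 1 for every user, which is the defining property of a disjunctive table.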

The results

Method 1: K-means

At first glance, there is no clear separation between the identified groups: groups 1, 2 and 4 overlap. Nevertheless, a separation can be observed between group 3, represented in green, and the other three groups. Using a Principal Component Analysis (PCA), we can determine which pages contribute most to the assignment of users to a group. Here, the results are quite interesting. Group 3 users have one thing in common: they enjoy football. The 5 pages that contribute the most to the construction of this group are those of Leo Messi, Cristiano Ronaldo, Neymar Jr, Real Madrid and FC Barcelona.

Method 2: Cluster Correspondence Analysis (CCA)

In the CCA method, the origin represents the average profile. Here, we see that all observations are grouped around the origin. This is because there are many 1s in the mode 0 columns (not clicked) and, therefore, many 0s in the mode 1 columns (clicked). The average profile is therefore dominated by non-clicked pages.
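A small numerical illustration of this effect, using a hypothetical disjunctive table for five users and two pages: the average profile is computed as the column sums divided by the grand total, as in correspondence analysis.

```python
def average_profile(rows):
    """Column sums divided by the grand total of the disjunctive table."""
    total = sum(sum(row) for row in rows)
    return [sum(col) / total for col in zip(*rows)]


# Columns: page1=0, page1=1, page2=0, page2=1 (hypothetical data).
rows = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
]
profile = average_profile(rows)
print(profile)
```

With one click per page out of five users, the mode 0 columns carry four times the weight of the mode 1 columns, so most users sit very close to the average profile.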

To validate these results, a clear separation between the constructed groups must be observed. In this case, a boundary can be drawn between the three groups, but the observations in different groups are not far enough apart.

Despite the practical interest of this method, the structure of our data does not allow us to demonstrate its effectiveness, mainly because the method adapts poorly to this type of data.

What can we learn from this work?

Cross-referencing the data of the leading social network confirmed once again the value of this type of network, and challenged us to apply and adapt both well-known and lesser-known algorithmic techniques.

By analysing the behaviour and tastes of consumers (in our case, those of the fast-food brand Y France), we observed strong heterogeneity across the 743 public pages considered. This resulted in a behaviour table (clicks on pages) containing a high number of zeros, in other words a sparse matrix. Preprocessing and page popularity analysis (PageRank) allowed us to deal with this particularity.

To challenge more traditional methods such as K-means, we implemented a more recent clustering technique, CCA, which has already been used successfully on social network data.

The structure of the dataset we built and, above all, its sparsity require us to review the way we processed the data. For example, rather than removing the 0s, we could treat them differently and use them to improve the results. The grid of engagement levels considered for the CCA method could also be validated and used. We could also test other clustering methods among all those that exist (for example, “Non-negative Matrix Factorization”).

Processing and exploiting social network data is becoming a must for all companies, and it raises both technical and ethical questions. The use of such data requires measures to ensure that the privacy of the persons concerned is respected. The data used in this study are completely anonymous and do not under any circumstances make it possible to identify a specific person, whether indirectly or by cross-checking information.


[1] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.

[2] John A. Hartigan and Manchek A. Wong. Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100-108, 1979.

[3] Michel van de Velden, Alfonso Iodice D'Enza and Francesco Palumbo. Cluster correspondence analysis. Psychometrika, 82(1):158-185, 2017.
