The following demonstrates how to perform customer segmentation using the RFM model (recency, frequency, monetary value) on a medium-sized data set. The analysis was conducted with the statistical programming language R.
The data set is available at the UCI Machine Learning Repository. It consists of online transaction at a British retailer.
Some pre-processing is necessary to use the raw data set. To carry out a RFM segmentation we need
Records without a customer ID have been deleted from the data set. The date of the invoice will be used here as there is no separate date for the order in the data set. The data set includes the unit price of an article, which will be used to calculate the monetary value of each order position (as quantity * unit price). The relevant columns of the data set look like the following:
Invoice | CustomerID | InvoiceDate | Quantity | UnitPrice | MonetaryValue |
---|---|---|---|---|---|
489434 | 13085 | 2009-12-01 | 12 | 6.95 | 83.4 |
489434 | 13085 | 2009-12-01 | 12 | 6.75 | 81.0 |
489434 | 13085 | 2009-12-01 | 12 | 6.75 | 81.0 |
489434 | 13085 | 2009-12-01 | 48 | 2.10 | 100.8 |
489434 | 13085 | 2009-12-01 | 24 | 1.25 | 30.0 |
489434 | 13085 | 2009-12-01 | 24 | 1.65 | 39.6 |
The unit price in the data for the most part lies in the range between 1.25 and 3.75. Some values, are however much higher as the following boxplot shows.
Similarly, 95% of order positions have a quantity of less then 36. The maximum value is 19152. Since we cannot rule out data entry errors, records with very high values on unit price (larger than 8.5) or quantity (larger than 36) will be deleted. There are 40287 records that fulfill these criteria.
In addition, some entries for quantity are negative, presumably because these are cancellations. For the sake of simplicity, these entries will be deleted as well.
The original data is at the level of order items. We are interested in the frequency of orders, not the number of different items, thus we have to aggregate the individual order positions to the level of sales orders.
Invoice | CustomerID | InvoiceDate | MonetaryValue |
---|---|---|---|
489434 | 13085 | 2009-12-01 | 505.30 |
489435 | 13085 | 2009-12-01 | 145.80 |
489436 | 13078 | 2009-12-01 | 630.33 |
489437 | 15362 | 2009-12-01 | 310.75 |
489438 | 18102 | 2009-12-01 | 2286.24 |
489439 | 12682 | 2009-12-01 | 426.30 |
Based on the datset with the aggregated data, we now create a table with recency, frequency and monetary value for each customer. The data set covers orders billed between 2009-12-01 and 2010-12-09. For the analysis we use the 12 months period ending on 2010-12-01.
CustomerID | recency | frequency | monetary_value |
---|---|---|---|
12346 | 156 | 11 | 372.86 |
12347 | 31 | 1 | 611.53 |
12348 | 65 | 1 | 222.16 |
12349 | 34 | 3 | 2671.14 |
12351 | 2 | 1 | 300.93 |
12352 | 2 | 2 | 343.80 |
After cleaning the data set, we now have data for 19195 sales orders from 0 customers.
Most customers made their last order within the last month. Half of them had their last order within the last 48 days. The most frequent 25% of customers have made a purchase within the last 16 days. The least frequent 25% of customers had their last transaction more than 135 days before the beginning of the period of analysis. The following graph shows the distribution of recency with these quartiles (red dotted line). The distribution is heavily right-skewed.
The number of orders (frequency) has a mean value of 4.32 and a median of 2. The 25% of customers with the lowest number of sales made 1 purchase within the period of analysis. The quantile with the highest number of sales had more than 5 transactions. The maximum values lies at 197. The following plot shows a part of the distribution for the number (frequency) of orders (red dotted lines are the quantiles). The distribution is heavily right-skewed.
The monetary value of order per customers has a mean value of 1973.46 and a median of 687.39. The maximum values lies at 302804.93, thus far away from the center of the distribution. If we had access to more information, it would be worthwhile to further investigate this extreme outlier. The 25% of customers with the lowest monetary value generated less than 307.32. The quantile with the highest monetary value brought in more than 1675.075. The following plot shows a part of the distribution for monetary value (red dotted lines are the quantiles). The distribution is heavily right-skewed.
The monetary value increases with the number of orders (frequency). In the range between 25 and 50 orders, there are a number of customers with very high spending totals. Customers who order more frequently also seem to be spending more per order, although this impression might be driven by a few outliers (not plotted).
The top customers (in terms of monetary value) all had relatively recent purchases, but overall there is only a weak (if any) relationship between recency and monetary value.
Based on the values for recency, frequency and monetary value we can now calculate scores. The scores are calculated as the quantiles for each variable, giving values between 1 and 4. The higher the value, the better. Thus, a high score for recency indicates a more recent purchase and a high score for monetary value a larger amount.
There are two approaches to generating an overall assessment of a customer:
CustomerID | recency | frequency | monetary_value | recency_score | frequency_score | monetary_value_score | rfm_segment | rfm_score |
---|---|---|---|---|---|---|---|---|
12346 | 156 | 11 | 372.86 | 1 | 4 | 2 | 142 | 7 |
12347 | 31 | 1 | 611.53 | 3 | 1 | 2 | 312 | 6 |
12348 | 65 | 1 | 222.16 | 2 | 1 | 1 | 211 | 4 |
12349 | 34 | 3 | 2671.14 | 3 | 3 | 4 | 334 | 10 |
12351 | 2 | 1 | 300.93 | 4 | 1 | 1 | 411 | 6 |
12352 | 2 | 2 | 343.80 | 4 | 2 | 2 | 422 | 8 |
Most customers belong to the top segment (444), having the highest score for all three factors. The second largest group, however, has the lowest RFM scores. The two customer segments are by far the largest groups. Thus, there is one the one hand a lot of potential to develop customers and on the other a loyal customer base that has to be cared for. The third largest group has the highest scores for frequency and monetary value, but does not belong to the most recent customers. It would be worthwhile to investigate this further: what is the difference between customers in the most and the second most recent group? Is there a seasonal pattern? Was one group part of a promotion campaign?
## Selecting by n