Introduction

The following demonstrates how to perform customer segmentation using the RFM model (recency, frequency, monetary value) on a medium-sized data set. The analysis was conducted with the statistical programming language R.

The data set is available at the UCI Machine Learning Repository. It consists of online transaction at a British retailer.

Data Preparation

Pre-processing

Some pre-processing is necessary to use the raw data set. To carry out a RFM segmentation we need

  • a customer ID,
  • the date of the transaction (or invoice)
  • and the value of the transaction.

Records without a customer ID have been deleted from the data set. The date of the invoice will be used here as there is no separate date for the order in the data set. The data set includes the unit price of an article, which will be used to calculate the monetary value of each order position (as quantity * unit price). The relevant columns of the data set look like the following:

Pre-processed Data Set (excerpt)
Invoice CustomerID InvoiceDate Quantity UnitPrice MonetaryValue
489434 13085 2009-12-01 12 6.95 83.4
489434 13085 2009-12-01 12 6.75 81.0
489434 13085 2009-12-01 12 6.75 81.0
489434 13085 2009-12-01 48 2.10 100.8
489434 13085 2009-12-01 24 1.25 30.0
489434 13085 2009-12-01 24 1.65 39.6

Data Cleaning

The unit price in the data for the most part lies in the range between 1.25 and 3.75. Some values, are however much higher as the following boxplot shows.

Similarly, 95% of order positions have a quantity of less then 36. The maximum value is 19152. Since we cannot rule out data entry errors, records with very high values on unit price (larger than 8.5) or quantity (larger than 36) will be deleted. There are 40287 records that fulfill these criteria.

In addition, some entries for quantity are negative, presumably because these are cancellations. For the sake of simplicity, these entries will be deleted as well.

Aggregation to Sales Order

The original data is at the level of order items. We are interested in the frequency of orders, not the number of different items, thus we have to aggregate the individual order positions to the level of sales orders.

Aggregated Data Set (excerpt)
Invoice CustomerID InvoiceDate MonetaryValue
489434 13085 2009-12-01 505.30
489435 13085 2009-12-01 145.80
489436 13078 2009-12-01 630.33
489437 15362 2009-12-01 310.75
489438 18102 2009-12-01 2286.24
489439 12682 2009-12-01 426.30

Generating RFM Segments

Based on the datset with the aggregated data, we now create a table with recency, frequency and monetary value for each customer. The data set covers orders billed between 2009-12-01 and 2010-12-09. For the analysis we use the 12 months period ending on 2010-12-01.

RFM Table (excerpt)
CustomerID recency frequency monetary_value
12346 156 11 372.86
12347 31 1 611.53
12348 65 1 222.16
12349 34 3 2671.14
12351 2 1 300.93
12352 2 2 343.80

Exploratory Data Analysis

After cleaning the data set, we now have data for 19195 sales orders from 0 customers.

Most customers made their last order within the last month. Half of them had their last order within the last 48 days. The most frequent 25% of customers have made a purchase within the last 16 days. The least frequent 25% of customers had their last transaction more than 135 days before the beginning of the period of analysis. The following graph shows the distribution of recency with these quartiles (red dotted line). The distribution is heavily right-skewed.

The number of orders (frequency) has a mean value of 4.32 and a median of 2. The 25% of customers with the lowest number of sales made 1 purchase within the period of analysis. The quantile with the highest number of sales had more than 5 transactions. The maximum values lies at 197. The following plot shows a part of the distribution for the number (frequency) of orders (red dotted lines are the quantiles). The distribution is heavily right-skewed.

The monetary value of order per customers has a mean value of 1973.46 and a median of 687.39. The maximum values lies at 302804.93, thus far away from the center of the distribution. If we had access to more information, it would be worthwhile to further investigate this extreme outlier. The 25% of customers with the lowest monetary value generated less than 307.32. The quantile with the highest monetary value brought in more than 1675.075. The following plot shows a part of the distribution for monetary value (red dotted lines are the quantiles). The distribution is heavily right-skewed.

The monetary value increases with the number of orders (frequency). In the range between 25 and 50 orders, there are a number of customers with very high spending totals. Customers who order more frequently also seem to be spending more per order, although this impression might be driven by a few outliers (not plotted).

The top customers (in terms of monetary value) all had relatively recent purchases, but overall there is only a weak (if any) relationship between recency and monetary value.

RFM Segments and RFM Scores

Based on the values for recency, frequency and monetary value we can now calculate scores. The scores are calculated as the quantiles for each variable, giving values between 1 and 4. The higher the value, the better. Thus, a high score for recency indicates a more recent purchase and a high score for monetary value a larger amount.

There are two approaches to generating an overall assessment of a customer:

RFM Segment and Score (excerpt)
CustomerID recency frequency monetary_value recency_score frequency_score monetary_value_score rfm_segment rfm_score
12346 156 11 372.86 1 4 2 142 7
12347 31 1 611.53 3 1 2 312 6
12348 65 1 222.16 2 1 1 211 4
12349 34 3 2671.14 3 3 4 334 10
12351 2 1 300.93 4 1 1 411 6
12352 2 2 343.80 4 2 2 422 8

Most customers belong to the top segment (444), having the highest score for all three factors. The second largest group, however, has the lowest RFM scores. The two customer segments are by far the largest groups. Thus, there is one the one hand a lot of potential to develop customers and on the other a loyal customer base that has to be cared for. The third largest group has the highest scores for frequency and monetary value, but does not belong to the most recent customers. It would be worthwhile to investigate this further: what is the difference between customers in the most and the second most recent group? Is there a seasonal pattern? Was one group part of a promotion campaign?

## Selecting by n