Easy data science for increased insight (& open rates)
It’s difficult to read more than a couple of marketing articles these days without being told that we need to be more data-driven in what we do. The skills needed to start being more data-driven actually aren’t that difficult. The difficult bit is often creating a culture that understands the importance of creating the data in the first place, and having an infrastructure to store, manage and access it.
However, it often doesn’t take much data to start to apply data-informed practices to your activities. In this article, we’ll take a look at an ecommerce dataset, publicly available from the UCI Machine Learning Repository . We will use it to illustrate some basic concepts in segmentation and clustering.
In this first article using these data, let’s play the role of the UK Marketing Manager for this company. We’re tasked with increasing engagement with our email marketing campaign with the goal of increasing open rates and email-driven revenue to UK-based customers.
The company has made available a database of transaction data that links to a particular customer ID. We’ll assume for the purposes of this exercise that we can look this ID up in a separate customer database that provides their email address (including GDPR-compliant opt-in for email marketing!).
Given these data, how can we use the information to improve our email marketing efforts? Let’s turn to segmentation of our customers. Would they all want to receive the same messages, about the same products, on the same day of the week, at the same time?
Probably not, so let’s get to work on our data and see what it can give us back.
Exploring, filtering and making
Before we do anything else, we’ll load in our dataset and take a quick look to get a feel for what we’re working with:
# Load package to read in Excel spreadsheet library(readxl)
# Read in Excel file raw_data <- read_excel("Online Retail.xlsx")
# Overview of raw data library(dplyr) glimpse(raw_data)
Observations: 541,909 Variables: 8 $ InvoiceNo <chr> "536365", "536365", "536365", "536365" $ StockCode <chr> "85123A", "71053", "84406B", "84029G" $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER" $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32 $ InvoiceDate <dttm> 2010-12-01 08:26:00 $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65 $ CustomerID <dbl> 17850, 17850, 17850, 17850, 17850 $ Country <chr> "United Kingdom", "United Kingdom"
Looking at our summary, we have a dataset of just over 540k rows, but with only eight features, not all of which have been imported as the type we want them to be, but we’ll attend to that shortly.
First, let’s filter to look at just the customers based in the United Kingdom:
# Filter to include UK rows only uk_data <- raw_data %>% filter(Country == "United Kingdom")
 495478 8
Okay, so that’s most of them then. Fine. More for us to work with…
Back to those variable types. We probably want to convert some of those to factors:
# Change some features to factors uk_data <- uk_data %>% mutate(InvoiceNo = factor(InvoiceNo), StockCode = factor(StockCode), CustomerID = factor(CustomerID), Country = factor(Country))
Observations: 495,478 Variables: 8 $ InvoiceNo <fct> 536365, 536365, 536365, 536365 $ StockCode <fct> 85123A, 71053, 84406B, 84029G $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER" $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6 $ InvoiceDate <dttm> 2010-12-01 08:26:00 $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65 $ CustomerID <fct> 17850, 17850, 17850, 17850, 17850 $ Country <fct> United Kingdom, United Kingdom
Those dealt with, let’s create some additional features from our date-time variable that will make things easier for us later:
# Create additional date and time features
uk_data <- uk_data %>% mutate(date = floor_date(InvoiceDate, "day"), hour = hour(InvoiceDate), day = wday(InvoiceDate, label = TRUE))
 "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"
weekend <- c("Sat", "Sun")
uk_data <- uk_data %>% mutate(is_weekend = day %in% weekend)
Observations: 495,478 Variables: 12 $ InvoiceNo <fct> 536365, 536365, 536365, 536365 $ StockCode <fct> 85123A, 71053, 84406B, 84029G $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER" $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00 $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65 $ CustomerID <fct> 17850, 17850, 17850, 17850, 17850 $ Country <fct> United Kingdom, United Kingdom $ date <dttm> 2010-12-01, 2010-12-01, 2010-12-01 $ hour <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8 $ day <ord> Wed, Wed, Wed, Wed, Wed, Wed, Wed $ is_weekend <lgl> FALSE, FALSE, FALSE, FALSE, FALSE
Okay, we’ve got a bit more to work with now. So let’s crack on and empower our email strategy…
Slicing and dicing
In a couple of planned future posts, I’ll use this same dataset (and the same segmentation premise) to look at how we can segment these data using a couple of cluster analysis methods, but for this article, we’ll ask some specific questions, and the data will hopefully give us an answer.
Using the main verbs of the
dplyr package, we can manipulate our dataset quickly and easily in R in a manner similar to pivot tables in Excel. This approach will help us go from one observation per invoice, to one row per customer.
Using just four dplyr functions —
mutate — along with the handy pipe
%>% operator, we can build ‘pivot table-like’ pipelines to look at the data from whatever angle we want.
We’ll begin by creating a basic summary of our dataset that makes the customer the focus. We’ll start by grouping by customer number, then summarising their orders. The number of orders each customer has placed is given by
n() , then we’ll sum the item and revenue quantities to obtain the total number of items purchased and total revenue. We can find out the date of their last order with
max(date) , then we’ll create a couple of basic averages for their average order value and the average value of each item they have bought.
# Create customer-centric dataset
customer_summary <- uk_data %>% group_by(CustomerID) %>% summarise(orders = n(), quantity = sum(Quantity), revenue = sum(Quantity * UnitPrice), last_order = max(date)) %>% mutate(avg_order_val = revenue / orders, avg_item_val = revenue / quantity)
# A tibble: 6 x 7 CustomerID orders quantity revenue last_order <fct> <int> <dbl> <dbl> <dttm> 1 12346 2 0 0 2011-01-18 00:00:00 2 12747 103 1275 4196. 2011-12-07 00:00:00 3 12748 4642 24210 29072. 2011-12-09 00:00:00 4 12749 231 1422 3868. 2011-12-06 00:00:00 5 12820 59 722 942. 2011-12-06 00:00:00 6 12821 6 70 92.7 2011-05-09 00:00:00 # ... with 2 more variables: avg_order_val <dbl>, avg_item_val <dbl>
Hmmm. Something looks a bit strange with our first customer. We have two orders, but total quantities and revenue of 0. That doesn’t seem right. Let’s take a look:
uk_data %>% filter(CustomerID == 12346)
# A tibble: 2 x 12 InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice <fct> <fct> <chr> <dbl> <dttm> <dbl> 1 541431 23166 MEDIUM CER… 74215 2011-01-18 10:01:00 1.04 2 C541433 23166 MEDIUM CER… -74215 2011-01-18 10:17:00 1.04 # ... with 6 more variables: CustomerID <fct>, Country <fct>, date <dttm>, # hour <int>, day <ord>, is_weekend <lgl>
Ahh, okay: a refund. A refund of a purchase of 74,215 medium ceramic top storage jars at that. Bummer.
Let’s remove customers with revenue of £0 from our dataset:
# Remove customers with net spend of £0
customer_summary_spend <- customer_summary %>% filter(revenue > 0)
 3951 7
 3903 7
We’re starting to get somewhere. Now it’s time for graphs. As Captain Renault would say: “Round up the usual suspects”:
For our first planned email, we want to come up with something that looks popular. Something we can put in the email schedule every couple of weeks. To do that, let’s get a feeling for how many customers order items of a given amount. To do that, we can plot a histogram of our average item value:
ggplot(customer_summary_spend, aes(x = avg_item_val)) + geom_histogram(binwidth = 0.05) + scale_x_log10() + coord_cartesian(xlim = c(0.1, 100)) + labs(title = "Histogram of Customer Average Item Value", x = "Average item value (£)", y = "Count of customers") + theme_minimal()
Our first bit of insight. How about we plan a fortnightly email that focuses on products that are priced around this peak figure? If we use the
quantile function, we could pull out the range of prices that cover 10% of our customers centred around this peak:
# Find price range that covers 10% of our customers
quantile(customer_summary_spend$avg_item_val, c(0.45, 0.55))
45% 55% 1.704107 1.870029
Armed with this information, we can now create a segment of customerIDs with an average item cost of more than £1.70 and less than £1.87:
# Create segment of customerIDs with an average item cost of # more than £1.70 and less than £1.87
customer_item_10pc <- customer_summary_spend %>% filter(avg_item_val > 1.7, avg_item_val < 1.87)
 404 7
Our first email segment is ready. Well, almost, we’ve got a list of customerIDs, so we’ll need to use that to query our contact database. Depending on your system, SQL might be your friend there, or you could do it in R with a
dplyr join, or even a VLOOKUP in Excel. The choice, as they say, is yours.
The please come back offer
Our first segment built, let’s turn to getting some previous customers back. If we look for the most recent order in our dataset:
# Look for most recent purchase
customer_summary_spend %>% arrange(desc(last_order)) %>% head()
# A tibble: 6 x 7 CustomerID orders quantity revenue last_order <fct> <int> <dbl> <dbl> <dttm> 1 12748 4642 24210 29072. 2011-12-09 00:00:00 2 12985 78 1413 1216. 2011-12-09 00:00:00 3 13069 470 5466 3718. 2011-12-09 00:00:00 4 13113 280 2594 10510 2011-12-09 00:00:00 5 13426 159 2224 3551. 2011-12-09 00:00:00 6 13777 219 12804 25748. 2011-12-09 00:00:00 # ... with 2 more variables: avg_order_val <dbl>, avg_item_val <dbl>
we can see that these data do go back a bit… In the interests of thinking that Back to the Future is almost a flawless film (despite glaring holes in the plot), let’s travel back in time and imagine it’s the first day back in the office after new year 2012.
We want to see if we can entice back our customers that haven’t placed an order in the last six months.
# Haven't ordered in last six months
six_months <- customer_summary_spend %>% filter(last_order < "2011-07-01") %>% arrange(desc(last_order))
 857 7
There we go, a nice set of 857 customers to go back to with a “Please come back, here’s a 10% off coupon for your next order. Go on. Please” message.
Darling leave a light on
Of course, we might want to target groups of customers on the basis of what they’ve bought before. In the real world, it’s likely that we’d have access to another table of the database that would allow us to assign categories to our StockCodes, but we don’t have that, so let’s query the text of the item description.
We’ve just had a great shipment of deluxe LED strings and bargain unicorn nightlights arrive, so let’s create a customer segment of people who’ve bought some sort of light or light accessory before so we can share the good news with them:
# Customers who ordered lights
light_on <- uk_data %>% mutate(light = str_detect(uk_data$Description, "LIGHT")) %>% select(CustomerID, light) %>% unique()
FALSE TRUE 3937 2555
There we go, 2555 customers who have previously bought some sort of light. Ready for our solid gold lighting e-blast. Filter on:
# Create customer segment
lighting_lovers <- light_on %>% filter(light == TRUE)
and your next email list is ready, but when should we hit send? We’ll see when our lighting lovers spend the most with a quick plot:
# When do our lighting lovers shop?
uk_data %>% mutate(light = str_detect(uk_data$Description, "LIGHT")) %>% filter(light == TRUE) %>% group_by(CustomerID, day) %>% summarise(revenue = sum(Quantity * UnitPrice)) %>% ggplot(aes(day, revenue)) + geom_col() + labs(title = "Revenue From Customers Who Have \nBought Lights by Day", x = "Day of week", y = "Revenue (£)") + theme_minimal()
Looks like a weekday is the time to go for; certainly not a Sunday, perhaps Tuesday or Thursday? We can test for statistically significant differences in the amount of revenue between the days using an analysis of variance test followed by a Tukey’s post-hoc test:
# Analysis of variance on day of week revenue
day_revenue <- uk_data %>% mutate(light = str_detect(uk_data$Description, "LIGHT")) %>% filter(light == TRUE) %>% group_by(CustomerID, day) %>% summarise(revenue = sum(Quantity * UnitPrice)) %>% ungroup()
day_aov <- aov(day_revenue$revenue ~ day_revenue$day)
Call: aov(formula = day_revenue$revenue ~ day_revenue$day)
Terms: day_revenue$day Residuals Sum of Squares 1638522 2163494060 Deg. of Freedom 5 4978
Residual standard error: 659.2504 Estimated effects may be unbalanced
Df Sum Sq Mean Sq F value Pr(>F) day_revenue$day 5 1.639e+06 327704 0.754 0.583 Residuals 4978 2.163e+09 434611
Tukey multiple comparisons of means 95% family-wise confidence level
Fit: aov(formula = day_revenue$revenue ~ day_revenue$day)
$`day_revenue$day` diff lwr upr p adj Mon-Sun 43.0435315 -55.56340 141.65046 0.8148642 Tue-Sun 65.6262418 -31.76344 163.01592 0.3892285 Wed-Sun 39.9659986 -56.75344 136.68543 0.8474967 Thu-Sun 37.1521713 -56.36168 130.66602 0.8678936 Fri-Sun 43.2290358 -57.96363 144.42171 0.8283032 Tue-Mon 22.5827103 -69.46098 114.62640 0.9820466 Wed-Mon -3.0775328 -94.41176 88.25669 0.9999989 Thu-Mon -5.8913602 -93.82390 82.04118 0.9999651 Fri-Mon 0.1855044 -95.87306 96.24407 1.0000000 Wed-Tue -25.6602431 -115.67893 64.35844 0.9653110 Thu-Tue -28.4740705 -115.03938 58.09124 0.9367539 Fri-Tue -22.3972059 -117.20580 72.41139 0.9848541 Thu-Wed -2.8138274 -88.62439 82.99674 0.9999990 Fri-Wed 3.2630372 -90.85694 97.38302 0.9999987 Fri-Thu 6.0768645 -84.74582 96.89955 0.9999653
Looking at our output, it looks like there aren’t any significant differences between the days (at a significance level of 5%). Our best p-value is between Sunday and Tuesday ( p = 0.389), with the differences between Sunday and the other days giving us p values from around 0.81 to 0.87, well above the 0.05 we’d use as a standard. The values for comparisons between weekdays all sit around 1 so, before sending on a Tuesday, maybe you should look at your web analytics and see what else you can find out about your customer behaviour.
Summarising our summarising
While this was a bit of a contrived and superficial example, hopefully it’s illustrated how straightforward breaking down your database can be, and how you can ask questions of your data to generate segments that can feed directly into actionable marketing activities.
This has just been a quick overview to give a taste of how quickly and easily these types of questions can be asked and answered using R, but the concepts are translatable to whatever platform you use.
On the flip side, if you’re starting to use R, but are more interested in filtering a dataset of gene expression, functions and interactions rather than customer purchases, it’s just as easy.
In the examples above, I’ve skipped a few steps in the interests of brevity that you’d probably want to do if you were building a real segment at work. Simply calculating the mean average of our item costs without looking at the distribution is quite a corner to cut for a start…
That about wraps up part one of this series, and we’ll look at segmenting using algorithms rather than our own questions in the next couple of articles. If you’re interested, all of the code for this post is on GitHub . Thanks for reading, and see you for part two…