January 8, 2012

# Generating Artificial Sales Data

Our statistics lecturers would often end each session with a demonstration of the power of the statistical model under discussion. This would usually mean generating some artificial data and showing how good the tool was at recovering the parameters or correctly classifying the observations. It was highly artificial but had a very useful feature: you knew the true mechanism behind the data so you could see how good your model was at getting at the truth.

We work with marketing data, building models to understand the effect of marketing activity on sales. Of course here, as in any real world situation, we don’t know which mechanism generated the data (that’s what we are trying to find out). But we can get an idea of how good our tools are by testing them out on artificial data in the way we described above. If they don’t work here in these highly idealised situations then we ought to be concerned.

In this series I’m going to take some very simple simulated data sets and look at how well some of the best known marketing mix modelling techniques do at getting back to the true values. I will start by looking at LSDV (Least Squares Dummy Variables) models and then move on to mixed effects and Bayesian modelling.

There’s one other thing worth mentioning before we get started. With our simulated data sets we are able to turn the usual situation on its head and vary the data set rather than the modelling approach. This means we can ask questions like: under what conditions do our models work best?
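As a minimal sketch of that idea (a plain straight line, not the marketing mix simulation itself), the R snippet below generates data with a known slope at several noise levels and checks how close ordinary least squares gets each time. The function name `recovery_error`, the true values and the noise levels are all illustrative assumptions.

```r
# Sketch: how does noise in the data affect parameter recovery?
# All names and numbers here are illustrative assumptions.
set.seed(42)

# Simulate y = alpha + beta*x + noise, fit a linear model and
# return the absolute error in the estimated slope
recovery_error <- function(noise_sd, n = 260, alpha = 1000, beta = 0.8) {
  x <- 1:n
  y <- alpha + beta * x + rnorm(n, mean = 0, sd = noise_sd)
  fit <- lm(y ~ x)
  abs(unname(coef(fit)["x"]) - beta)
}

# Vary the data set, keep the model fixed
noise_levels <- c(0, 5, 50, 500)
errors <- sapply(noise_levels, recovery_error)
print(data.frame(noise_sd = noise_levels, slope_error = errors))
```

The same pattern extends to any property of the data we can control: the length of the series, the size of the effect, or the correlation between explanatory variables.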

## Building an artificial data set

Our world will be very simple. Weekly sales will follow an overall linear trend to which we will add an annual seasonal cycle which we imagine to be a function of temperature (simulated using a sine wave). On top of that we need some marketing activity which we will add as TV adstock. Finally we will add some noise by simulating from a normal distribution. The final data generating equation looks like this:

$sales_t = \alpha + \theta_1 week_t + \theta_2 temp_t + \theta_3 adstock_t + \epsilon_t$

where $\epsilon_t \sim N(0, \sigma^2)$

and adstock is defined recursively as

$adstock_t = 1 - e^{-\frac{GRPs_t}{\phi}} + \lambda \, adstock_{t-1}$
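To make the recursion concrete, here is a small R sketch that applies the formula above to a short, made-up run of GRPs. The function name `adstock_sketch` and the input values are assumptions for illustration; `phi` and `lambda` play the roles of $\phi$ and $\lambda$ in the equation, and adstock before the first week is assumed to be zero.

```r
# Sketch of the recursive adstock formula
# adstock[t] = 1 - exp(-GRPs[t]/phi) + lambda * adstock[t-1]
# Assumption: adstock is zero before the first week
adstock_sketch <- function(grps, phi, lambda) {
  adstock <- numeric(length(grps))
  carry_over <- 0
  for (t in seq_along(grps)) {
    adstock[t] <- 1 - exp(-grps[t] / phi) + lambda * carry_over
    carry_over <- adstock[t]
  }
  adstock
}

# A three-week burst followed by three weeks off air (made-up GRPs)
grps <- c(200, 100, 50, 0, 0, 0)
print(round(adstock_sketch(grps, phi = 100, lambda = 0.3), 3))
```

Two features are worth noticing. The first term is bounded by 1 however many GRPs are bought in a week, which gives diminishing returns, and once the campaign stops the adstock decays geometrically by a factor of `lambda` each week. Under constant weekly GRPs $g$ (and $|\lambda| < 1$) the series settles at $(1 - e^{-g/\phi})/(1 - \lambda)$.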

I have generated this data set in R (we will use R throughout – if you are unfamiliar with this language please see the R homepage).

It would also be nice if we could vary the parameters to generate different sets of data so I have created the whole thing as an R function with the parameters as arguments.

```
# *--------------------------------------------------------------------
# | FUNCTION: create_test_sets
# | Creates simple artificial marketing mix data for testing code and
# | techniques
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |29/11/2011|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  base_p         Number of base sales
# |          trend_p        Increase in sales for every unit increase
# |                         in time
# |          season_p       The seasonality effect will be
# |                         season_p*temp where -10<temp<10
# |          dim            The dim parameter in adstock (see below)
# |          dec            The dec parameter in adstock (see below)
# |          adstock_form   If 1 then the form is:
# |                         If 2 then the form is:
# |                         Default is 1.
# |          error_std      Standard deviation of the noise
# |          ad_p           Sales per unit of adstock. ASSUMPTION: this
# |                         argument is only mentioned in a comment
# |                         below, so its default of 30 is illustrative
# *--------------------------------------------------------------------
# | OUTPUTS: dataframe      Consists of sales, temp, tv_grps, week,
# |
# *--------------------------------------------------------------------
# | USAGE:   create_test_sets(base_p,
# |                           trend_p,
# |                           season_p,
# |                           dim,
# |                           dec,
# |                           error_std)
# |
# *--------------------------------------------------------------------
# | DEPENDS: None
# |
# *--------------------------------------------------------------------
# | NOTES:   Usually the test will consist of trying to predict sales
# |          using temp, tv_grps, week and recover the parameters.
# |
# *--------------------------------------------------------------------

#Adstock, reconstructed from the equation in the text:
#adstock[t] = 1 - exp(-media_var[t]/dim) + dec*adstock[t-1]
#ASSUMPTION: the helper name adstock_calc and its signature are ours
adstock_calc<-function(media_var, dec, dim){
  n<-length(media_var)
  adstock<-rep(0,n)
  #No carry over into the first week
  adstock[1]<-1-exp(-media_var[1]/dim)
  if(n>1){
    for(i in 2:n){
      adstock[i]<-1-exp(-media_var[i]/dim)+dec*adstock[i-1]
    }
  }
  adstock
}

#Function for creating test sets
create_test_sets<-function(base_p, trend_p, season_p, dim, dec,
                           error_std, ad_p=30, adstock_form=1){
  #Only the adstock form given in the text is implemented
  if(adstock_form!=1){
    stop("only adstock_form = 1 is implemented here")
  }
  #National level model
  #Five years of weekly data
  week<-1:(5*52)
  #Base sales of base_p units
  base<-rep(base_p,5*52)
  #Trend of trend_p extra units per week
  trend<-trend_p*week
  #Winter is season_p*10 units below, summer is season_p*10 units above
  #(a sine wave with a 52 week cycle)
  temp<-10*sin(week*pi/26)
  seasonality<-season_p*temp
  #7 TV campaigns. Carry over is dec (lambda in the text), dim is phi
  #in the text, sales per unit of adstock is ad_p (theta_3 in the text)
  tv_grps<-rep(0,5*52)
  tv_grps[20:25]<-c(390,250,100,80,120,60)
  tv_grps[60:65]<-c(250,220,100,100,120,120)
  tv_grps[100:103]<-c(100,80,60,100)
  tv_grps[150:155]<-c(500,200,200,100,120,120)
  tv_grps[200:205]<-c(250,120,200,100,120,120)
  tv_grps[220:223]<-c(100,100,80,60)
  tv_grps[240:245]<-c(350,290,100,100,120,120)
  adstock<-adstock_calc(tv_grps, dec, dim)
  TV<-ad_p*adstock
  #Error has a std of error_std
  error<-rnorm(5*52, mean=0, sd=error_std)
  #Full series
  sales<-base+trend+seasonality+TV+error
  #Plot
  #plot(sales, type='l', ylim=c(0,1200))
  output<-data.frame(sales, temp, tv_grps, week)
  output
}
```

Here is a line graph showing a simulated sales series generated with the following parameters:

```
#Example
test<-create_test_sets(base_p=1000,
                       trend_p=0.8,
                       season_p=4,
                       dim=100,
                       dec=0.3,
                       error_std=5)
library(ggplot2)
#Plot the simulated sales
#(ggtitle replaces opts(title=...), which was removed from ggplot2)
ggplot(data=test, aes(x=week, y=sales))+geom_line(size=1)+ggtitle("Simulated Sales Data")
```

I’ve found these simulated data sets useful not only for experiments but also for debugging code (since we know exactly what to expect from them) and as toy examples to give to trainee analysts as templates for future models.
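One way to use a simulated set as a debugging check, sketched here under simplifying assumptions, is to generate a series with the noise switched off, fit a regression, and confirm the coefficients come back as they went in. The compact simulator `simulate_sales`, its argument names, the campaign GRPs and the coefficient values are hypothetical stand-ins rather than the function above, and the adstock series is treated as known rather than estimated.

```r
# Sketch: noise-free simulated data as a regression test
# simulate_sales and all its values are illustrative assumptions
simulate_sales <- function(alpha, theta_week, theta_temp, theta_ad,
                           error_std, n_weeks = 104) {
  week <- 1:n_weeks
  temp <- 10 * sin(2 * pi * week / 52)
  # Two made-up campaigns
  grps <- rep(0, n_weeks)
  grps[10:13] <- c(300, 200, 100, 100)
  grps[60:63] <- c(250, 150, 100, 50)
  # Recursive adstock with phi = 100 and lambda = 0.3
  # (week 1 has no GRPs, so adstock[1] stays at zero)
  adstock <- numeric(n_weeks)
  for (t in 2:n_weeks) {
    adstock[t] <- 1 - exp(-grps[t] / 100) + 0.3 * adstock[t - 1]
  }
  sales <- alpha + theta_week * week + theta_temp * temp +
    theta_ad * adstock + rnorm(n_weeks, mean = 0, sd = error_std)
  data.frame(sales, week, temp, adstock)
}

truth <- c(alpha = 1000, week = 0.8, temp = 4, adstock = 30)
d <- simulate_sales(1000, 0.8, 4, 30, error_std = 0)
fit <- lm(sales ~ week + temp + adstock, data = d)
estimates <- unname(coef(fit))

# With no noise the fit should reproduce the inputs to rounding error
stopifnot(all(abs(estimates - truth) < 1e-6))
print(round(estimates, 4))
```

If the check fails on noise-free data, the fault lies in the code rather than in the statistics, which is exactly what makes these sets handy for debugging.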

With marketing mix models we often work with hierarchical data (e.g. sales in stores in regions). In the next post I will provide some code to build regional data sets. Following that we will get to work on the modelling.

Simon Raper: I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia, an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, Spark, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts, a blog on computational statistics, machine learning, data visualisation, R, Python and cloud computing. It has had over 310K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

1. Jeff

Fantastic stuff, can’t wait to see more! I have no experience in market mix modeling but do in CRM/database marketing, and I am looking to expand my skill set from modeling that sort of environment to this. Happy I stumbled upon your blog!

• Thanks Jeff. That’s very nice to hear. I’ve posted a couple more items on this subject (there’s one on ridge regression and one on visualising multi-collinearity) and I hope to add some more soon. Good luck with expanding your skills into market mix modelling.

Cheers

Simon

2. Jeff

Hey Simon, I am wondering if you can recommend any good texts to learn market mix modeling? Also, can you recommend a technique to use: do you typically use ARIMA with regressors or GLS for this? Any recommendations for learning are appreciated. Thanks!

• Jeff

Thanks!

3. Hung Ta

Hi, thanks for the nice article. Have you created any R package to implement MMM?