Using D3 to show cost, revenue and ROI


Return on investment is often measured as the gain from an investment divided by the cost of the investment, sometimes expressed as a percentage. For example, if a marketing campaign cost £10K but brought in £20K of revenue on top of the usual sales, then the ROI is 200%.

(Note ROI is arguably more properly defined as (gain – cost)/cost but I’ve found that most of the people and industries that I’ve worked with slip naturally into the first definition: gain/cost. In any case both definitions capture the same idea. Thanks to Eduardo Salazar for pointing this out.)

Now if you are just given the ROI you’ll find you are missing any idea of scale. The same ROI could be achieved with a revenue gain of £200 and with one of £200 million. So it would be nice to see cost, revenue and ROI visualised all in one go. There are a few ways to do this but after playing around I came up with the following representation, which personally I like the best. It’s a simple scatterplot of cost against revenue, but since all points on straight lines radiating from the origin have the same ROI it’s easy to overlay that information. If r is the ROI then the angle of the corresponding spoke is arctan(r).
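As a quick illustration (my own, not part of the original post), here is how the spoke angles could be computed in R for a few ROI values:

# Every point on the line revenue = r * cost has ROI r, assuming revenue is on
# the y-axis and cost on the x-axis, so the spoke for ROI r sits at atan(r).
spoke_angle <- function(r) atan(r)  # angle from the cost axis, in radians

rois <- c(0.5, 1, 2, 5)
data.frame(roi = rois, degrees = spoke_angle(rois) * 180 / pi)
#   roi  degrees
#   0.5 26.56505
#   1.0 45.00000
#   2.0 63.43495
#   5.0 78.69007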

Note you can drag the labels about. That’s my preferred solution for messy scatterplot labelling.


Hopefully it’s obvious that the idea is to read off the ROI from the position of the point relative to the spokes. The further out toward the circumference the greater the scale of the success or the disaster, depending on the ROI.

To modify the graph for your own purposes just take the code from here and substitute in your data where var dataset is defined. You can change which ROIs are represented by altering the values in the roi array. If you save the code as HTML and open it in a browser you should see the graph. Because d3 is amazing, the graph should adapt to fit your data.

You can also find the code here as a JSFiddle.

Thanks to Paul McAvoy for posing the problem and for all the other interesting things he’s shown me!

Four weeks to launch!

Only four weeks to go until our official launch date of 28th October. It feels like it’s been a long build up but we believe it will be worth the wait! In the meantime here’s a bit more information about the kind of things we do, why we are different and what motivates us. If you’re interested please do get in touch.

What do we do?

There’s a huge interest in data science at the moment. Businesses understandably want to be a part of it. Very often they assemble the ingredients (the software, the hardware, the team) but then find that progress is slow.

Coppelia is a catalyst in these situations. Rather than endless planning we get things going straight away with the build, learn and build again approach of agile design. Agile and analytics are a natural fit!

Projects might be anything from using machine learning to spot valuable patterns in purchase behaviour to building decision making tools loaded with artificial intelligence.

The point is that good solutions tend to be bespoke solutions.

While we build we make sure that in-house teams are heavily involved – trained on the job. We get them excited about the incredible tools that are out there and new ways of doing things. This solves the problem of finding people with the data science skill set. It’s easier to grow your technologists in-house.

The tools are also important. We give our clients a full view of what’s out there, focusing on open source and cloud based solutions. If a client wishes to move from SAS to R we train their analysts not just in R but in the fundamentals of software design so that they build solid, reliable models and tools.

We teach the shared conventions that link technologies together so that soon their team will be coding in Python and building models on parallelised platforms. It’s an investment for the long term.

Finally we know how important it is for the rest of the business to understand and get involved with these projects. Visualisation is a powerful tool for this and we emphasise two aspects that are often forgotten: interactivity (even if it’s just the eye exploring detail) and aesthetics. A single beautiful chart telling a compelling story can be more influential than a hundred stakeholder meetings.

Why are we different?

One thing is that we prioritise skills over tools. There are a lot of people out there building tools but they tend to be about either preprocessing data or prediction and pattern detection for a handful of well-defined cases. We love the tools but they don’t address the most difficult problem: how you turn the data into information that can be used in decision making. For that you need skilled analysts wielding the tools. Creating the skills is a much harder problem.

Coppelia offers a wide range of courses, workshops and hackathons to kickstart your data science team. See our solutions section for a full description of what we offer.

Another difference is that we are statisticians who have been inspired by software design. We apply agile methods and modular design not just to the tools we build ourselves but also to traditional analytical tasks like building models.

Collaboration using tools like Git and Trello has revolutionised the way we work. Analysis is no longer a solitary task; it’s a group thing, and that means we can take on bigger and more ambitious projects.

But what is most exciting for us is our zero overhead operating model and what it enables us to do. Ten years ago if we’d wanted to run big projects using the latest technology we’d have had to work for a large organisation. Now we can run entirely on open source.

For statistical analysis we have R; to source and wrangle data we have Python; and we can rent hardware by the hour from AWS and use it to parallelise jobs using Hadoop.

Even non-technical tasks benefit in this way: marketing using social media, admin through Google Drive, training on MOOCs, design using Inkscape and Pixlr, accounting on QuickFile.

Without these extra costs hanging over us we are free to experiment, innovate, cross disciplines and work on topics that interest us, causes we like. Above all it gives us time to give back to the sources which have allowed us to work in this way: publishing our code, sharing insights through blogging, helping friends and running local projects.

What are we excited about?

Anything where disciplines are crossed. We like to look at how statistics and machine learning can be combined with AI, music, graphic design, economics, physics and philosophy. We are currently looking at how the problem solving frameworks in AI might be applied to decision making in marketing.

Bayesian statistics and simulation for problem solving always seem to be rich sources of ideas. We’re also interested in how browser technology allows greater scope for communication. We blog in a potent mixture of text, HTML, markdown and JavaScript.

What technology are we into?

It’s a long list but most prominent are R, Python, distributed machine learning (currently looking at Spark), and d3.js. Some current projects include a package to convert R output into d3 and AI-enhanced statistical modelling.

Quick start regex for analysts: Part II


In my previous post (Part I) I went over the basic metacharacters and special signs in regex. In this second part I will be showing you how to simplify regular expressions.

Let’s get started

Repetition metacharacters

So far we have looked at expressions that always match the pattern at a single position in the text. Repetition metacharacters make the expression much more flexible by expanding the pattern to a specified number of characters.

* (preceding item zero or more times)
+ (preceding item one or more times)
? (preceding item zero or one time)

For example (click on the code to see how it works):

apples*

will match the word with no “s” as well as with one or many. But in:

apples+

the ‘s’ must be there so it will match words with at least one “s”.

Whereas:

apples?

Here the ‘s’ doesn’t have to be there, but it can’t be repeated: “apple” and “apples” match but “appless” doesn’t.
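If you want to experiment outside regexpal, here is a quick sketch in R (my own example, not from the original post) showing what each pattern matches:

words <- c("apple", "apples", "appless")
grepl("apples*", words)    # TRUE  TRUE  TRUE   - zero or more "s"
grepl("apples+", words)    # FALSE TRUE  TRUE   - at least one "s"
grepl("^apples?$", words)  # TRUE  TRUE  FALSE  - zero or one "s" (anchored to the
                           # whole string so the extra "s" can't be ignored)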

Quantified repetition

This works similarly to the repetition metacharacters. The difference is that we can specify the exact number of repetitions of the preceding item.

{ (start of quantified repetition)
} (end of quantified repetition)

Syntax:

{min,max} (min and max are positive numbers)

Tip

min must always be there, even if it is 0. max and the comma are optional.

In our previous post we had an example where we wanted to match only the year. We can now do the following:

\d{4}

This represents a digit repeated exactly 4 times and can be used instead of typing \d\d\d\d.

Similarly we find:

\w{5,10} - a word of minimum 5 and maximum 10 letters
\w{5,} - a word of at least 5 letters
\w{5} - a word of exactly 5 letters

Let’s say we are trying to pick out IP addresses from the text. An IP address is a sequence of 4 sets of 1 – 3 digits separated by dots. It can be easily expressed by:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The . (dot) has to be escaped to be interpreted literally, and each set of digits is repeated a minimum of one and a maximum of three times. Shortly we will learn how to simplify this even further.
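As an illustration of my own (not from the original post), here is the pattern at work in R, where backslashes must be doubled inside string literals:

text <- "Servers 192.168.0.1 and 10.0.0.254 are down."
ips <- regmatches(text, gregexpr("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}", text))[[1]]
ips  # "192.168.0.1" "10.0.0.254"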

Tip

There are usually many ways to write a regex that matches your needs and no one perfect way! As long as it does the job and you are sure it matches exactly what you want, don’t worry too much about what it looks like!

Grouping metacharacters

Putting () around a group of characters enables repetition of that group.

Tip

Don’t group inside character sets (within []), as () there has its literal meaning.

(what)+

will match “what” one or more times.

And to match words with or without the prefix we use:

(in)?dependent

Coming back to the IP address example, we can group each set of 1-3 digits together with an optional dot, then repeat the group 4 times.

(\d{1,3}\.?){4}

This will match any IP address (along with a few strings that aren’t valid addresses, but it’s usually good enough).

Alternation metacharacters

A common way of catching incorrect spellings is to use the OR character and group the two (or more) alternatives. This way we don’t need to repeat the [] sets.

| - (previous OR next expression)

Take for example the commonly misspelled word:

w(ei|ie)rd
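Grouping and alternation are easy to check in R too (again a toy example of my own):

grepl("(in)?dependent", c("dependent", "independent"))  # TRUE TRUE
grepl("w(ei|ie)rd", c("weird", "wierd", "word"))        # TRUE TRUE FALSE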

Anchors

Anchors signify the position of the pattern in the text. Note that this is a second meaning of the caret (it also acts as a negation, check it out here).

^ (start of string/line)
$ (end of string or line)
\A (start of string, never end of line)
\Z (end of string, never end of line)

For example:

^apple

will match only ‘apple’ at the beginning of the line.
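A small R check (my own example) makes the difference between the two anchors clear:

fruit <- c("apple pie", "crab apple")
grepl("^apple", fruit)  # TRUE FALSE - "apple" at the start
grepl("apple$", fruit)  # FALSE TRUE - "apple" at the end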

Lookaround assertions

Bear in mind that these expressions differ significantly in the different variants of regex.

?= (positive lookahead: assert what ought to be ahead)
?! (negative lookahead)
?<= (positive lookbehind: assert what ought to be behind)
?<! (negative lookbehind)

For example:

(?=seashore)sea

will match “sea” only if it is followed by “shore”.
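In R, lookarounds need Perl-compatible mode switched on, as in this sketch of mine:

grepl("(?=seashore)sea", c("seashore", "seaside"), perl = TRUE)  # TRUE FALSE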

Tip

Lookbehind can’t be used with repetitions or optional expressions. It also doesn’t work in JavaScript (hence it can’t be tested in regexpal).

Tip

It tends not to work very well in text editors.

Differences between programming languages

Here is a quick summary of the major differences between regex in different programming languages. It is not exhaustive, but it will give you an idea of the scope of the differences. Again, it is always good to let Google know what language you are working in when searching for regex solutions.

Feature                           Ruby           Java   Perl   Python/R   Unix   JavaScript   PHP   .NET
Character classes (e.g. \d, \w)   Yes            No     Yes    Yes        No     Yes          Yes   Yes
POSIX bracket expressions         Yes            No     Yes    No         Yes    No           Yes   No
Quantifier: *                     Yes            Yes    Yes    Yes        Yes    Yes          Yes   Yes
Quantifiers: + and ?              Yes            Yes    Yes    Yes        No     Yes          Yes   Yes
Anchors: \A and \Z                Yes            Yes    Yes    Yes        No     No           Yes   Yes
Line break: /m                    Yes            No     Yes    No         Yes    Yes          Yes   No
Special command for line break    No             Yes    No     Yes        No     No           No    Yes
Lookaround assertions             1.9 and above  Yes    Yes    Yes        No     No           Yes   Yes

My next post is all about using regex in a real life example!

The local neighbourhood of C Major


Here’s a chart I drew for myself to understand the relationships between chords in music theory. It doesn’t seem to have much to do with machine learning and statistics, but in a way it does: I found it a lot easier to picture the chords as existing in a sort of network space, linked by similarity. Similarity here is defined as the removal or addition of a note, or the sliding of a note one semitone up or down. What’s wrong with me!

The neighbourhood of C-Major

Distribution for the difference between two binomially distributed random variables


I was doing some simulation and I needed a distribution for the difference between two proportions. It’s not quite as straightforward as the difference between two normally distributed variables and since there wasn’t much online on the subject I thought it might be useful to share.

So we start with

X \sim Bin(n_1, p_1)

Y \sim Bin(n_2, p_2)

We are looking for the probability mass function of Z=X-Y.

First note that the min and max of the support of Z must be -n_2 and n_1 respectively, since these cover the most extreme cases (X=0 and Y=n_2 ) and (X=n_1 and Y=0 ).

Then we need a modification of the binomial pmf so that it can cope with values outside of its support.

m(k, n, p) = \binom {n} {k} p^k (1-p)^{n-k} when 0 \leq k \leq n and 0 otherwise.

Then we need to define two cases

1. Z \geq 0
2. Z < 0

In the first case

p(z) = \sum_{i=0}^{n_1} m(i+z, n_1, p_1) m(i, n_2, p_2)

since this covers all the ways in which X-Y could equal z. For example when z=1 this is reached when X=1 and Y=0, X=2 and Y=1, X=3 and Y=2 and so on. It also deals with cases that could not happen because of the values of n_1 and n_2 . For example if n_2 = 4 then we cannot get Z=1 as a combination of X=6 and Y=5. In this case, thanks to our modified binomial pmf, the probability is zero.

For the second case we just reverse the roles. For example if z=-1 then this is reached when X=0 and Y=1, X=1 and Y=2 etc.

p(z) = \sum_{i=0}^{n_2} m(i, n_1, p_1) m(i-z, n_2, p_2)

Put them together and that’s your pmf.

p(z) = \begin{cases} \sum_{i=0}^{n_1} m(i+z, n_1, p_1) \, m(i, n_2, p_2) & z \geq 0 \\ \sum_{i=0}^{n_2} m(i, n_1, p_1) \, m(i-z, n_2, p_2) & z < 0 \end{cases}

Here’s the function in R and a simulation to check that it’s right (and it does work).
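A minimal sketch of such a function and simulation check, following the pmf above (my own reconstruction of the idea, not necessarily the original code):

# Modified binomial pmf: zero outside 0 <= k <= n
m <- function(k, n, p) ifelse(k >= 0 & k <= n, dbinom(k, n, p), 0)

# pmf of Z = X - Y where X ~ Bin(n1, p1) and Y ~ Bin(n2, p2)
diff.pmf <- function(z, n1, p1, n2, p2) {
  if (z >= 0) sum(m(0:n1 + z, n1, p1) * m(0:n1, n2, p2))
  else        sum(m(0:n2, n1, p1) * m(0:n2 - z, n2, p2))
}

# Simulation check: compare against a large sample
n1 <- 10; p1 <- 0.3; n2 <- 4; p2 <- 0.6
z <- rbinom(100000, n1, p1) - rbinom(100000, n2, p2)
empirical <- table(factor(z, levels = -n2:n1)) / 100000
theoretical <- sapply(-n2:n1, diff.pmf, n1, p1, n2, p2)
round(cbind(empirical, theoretical), 4)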

Excuses and Opportunities


Regular readers will have noticed that I haven’t been a regular contributor to the blog over the last year. There are some good reasons/excuses for that, predominantly around buying and renovating our new home, getting married and starting a new job at Facebook. Having an Arsenal season ticket also doesn’t help.

Now that most of the renovations are done, I’m five months into married life and 13 months into the new job, I’m hoping there will be time for some more writing, so watch this space.

In the meantime, there is an awesome role for a senior researcher available in my team, working to understand and quantify advertising effectiveness. The full description is available here and it’s based in either New York or California.

Needless to say that Facebook is an incredible place to work, with more data than you could ever hope for and lots of interesting, yet unanswered, questions to ask of the data.

Within the team, you’ll be working with other researchers who have a strong background in data science and research more widely, including Eurry Kim and Dan Chapsky who both recently joined the team.

If you’re interested in the role and think you’d be a good fit, then please submit your CV/Resume or alternatively, if you’ve got any questions, please reach out using the form below!


Thorstein Veblen and Hard Coding


It is still quite common to hear the career progression of an analyst described as one upwards from the hard graft of coding and “getting your hands dirty” towards the enviable heights of people management and strategic thinking. Whenever I hear this it reminds me of the book Conspicuous Consumption by the American economist Thorstein Veblen. It examines, in a very hypothetical way, the roots of economic behaviour in some of our basic social needs: to impress others, to dominate and to demonstrate status. It’s not a happy book.

His principal concept is the distinction between exploit and drudgery.


The institution of a leisure class is the outgrowth of an early discrimination between employments, according to which some employments are worthy and others unworthy. Under this ancient distinction the worthy employments are those which may be classed as exploit; unworthy are those necessary employments into which no appreciable element of exploit enters.

He sees this division arising early in history as honour and status are assigned to those who are successful in making others do what they want, and “at the same time, employment in industry becomes correspondingly odious, and, in the common-sense apprehension, the handling of the tools and implements of industry falls beneath the dignity of able-bodied men. Labour becomes irksome.”

We’re no longer hairy barbarians using the vanquished for foot-stools but, as Veblen points out, the distinction persists no matter how illogical and unprofitable. Coding is still somehow viewed as necessarily inferior to the boardroom meeting even if it is the genius piece of code that makes or breaks a business.

Attitudes are changing, but still so slowly that there is a desperate shortage of people with both the experience and the hands-on skills to get things done.

So if you are a good analyst, and you love what you do, then this is my advice to you: when they come to lure you from your lovely pristine scripts, resist. Stay where you are. The 21st century is going to need you!

Clegg vs Pleb: An XKCD-esque chart


I saw an interesting “challenge” on StackOverflow last night to create an XKCD-style chart in R. A couple of hours later, going in a very similar direction to a couple of the answers on SO, I got to something that looked pretty good, using the sin and cos curves for simple and reproducible replication.

Tonight, I thought I’d try to apply the theme and styling to some real-world and slightly XKCD-esque content: UK politics. Two of the biggest stories of the month in the UK have been Nick Clegg’s apology over reneging on the Liberal Democrats’ tuition fees pledge and Andrew Mitchell’s “incident” trying to cycle out of Downing Street, the so-called GateGate.

Using the newly redesigned Google Insights for Search, I looked at searches for clegg and pleb over the last 30 days. A quick manipulation into a csv and applying the XKCD theme and some geom_smoothing gives this:

Looks like Andrew Mitchell might be Nick Clegg’s new best friend in terms of deflecting some of the attention away from the sorry saga…

And here’s the code (note that you need to have imported the Humor Sans font using extrafont’s font_import()):

library(ggplot2)
library(extrafont)
### Already have read in fonts (see previous answer on how to do this)
loadfonts()
### Clegg and Pleb data
pleb.clegg <- read.csv("pleb and clegg.csv")
pleb.clegg$Date <- as.Date(pleb.clegg$Date, format="%d/%m/%Y")
pleb.clegg$xaxis <- -4
### XKCD theme
theme_xkcd <- theme(
	panel.background = element_rect(fill="white"),
	axis.ticks = element_line(colour=NA),
	panel.grid = element_line(colour="white"),
	axis.text.y = element_text(colour=NA),
	axis.text.x = element_text(colour="black"),
	text = element_text(size=16, family="Humor Sans")
	)
### Plot the chart
p <- ggplot(data=pleb.clegg, aes(x=Date, y=Pleb))+
	geom_smooth(aes(y=Clegg), colour="gold", size=1, position="jitter", fill=NA)+
	geom_smooth(colour="white", size=3, position="jitter", fill=NA)+
	geom_smooth(colour="dark blue", size=1, position="jitter", fill=NA)+
	geom_text(data=pleb.clegg[10, ], family="Humor Sans", aes(x=Date), colour="gold", y=20, label="Searches for clegg")+
	geom_text(data=pleb.clegg[22, ], family="Humor Sans", aes(x=Date), colour="dark blue", y=4, label="Searches for pleb")+
	geom_line(aes(y=xaxis), position = position_jitter(h = 0.1), colour="black")+
	coord_cartesian(ylim=c(-5, 40))+
	labs(x="", y="", title="Pleb vs Clegg: Google Keyword Volumes")+
	theme_xkcd
ggsave("xkcd_cleggpleb.jpg", plot=p, width=8, height=5)

Why are pirates called pirates?


In homage to International Talk Like a Pirate Day…

I recently stumbled across a series of blog posts from the folks at IDV that visualised the archive of recorded pirate attacks which has been collected by the US National Geospatial-Intelligence Agency. It’s a dataset of 6000+ pirate attacks which have been recorded over the last 30 or so years.

This first map shows where the attacks have been recorded, with four clear areas standing out when the data is aggregated into hexagon bins:

Map showing areas where pirate attacks have been recorded

Zooming in on the area around Yemen, there’s a clear ramp-up in the number of attacks since 2008, which saw a 570% increase compared to the previous year. As noted by the IDV analysis, most attacks take place on Wednesday and during the spring and autumn.

Number of attacks recorded in the Aden region by year

The reaction to the massive increase in attacks in 2008 seems to have been ships not travelling as close to the shore in 2009, leading to more attacks happening further out to sea. This can clearly be seen by looking only at the attacks in 2008 and 2009:

Number of Pirates attacks in the Aden area in 2008 and 2009 (Distance is in degrees)

Within the dataset, as well as information around the location of the attack, there are also descriptions of the attacks, which lends itself well to some text analysis to understand where there have been changes in the nature of the attacks above and beyond their distance.

Some analysis of the descriptions of the attacks reveals that the nature of the attacks also changed in 2010, with more featuring terms such as security and speedboats (full details of how these topic groups were created are below). The analysis was used to identify five different types of attacks.

From the chart below, Topics 4 and 5 came to prominence in 2008, with Topic 5 maintaining its share in 2010 before Topic 2 increased in number in 2011 and 2012. This is just scratching the surface of what can be done with topic analysis, and given that all the documents are related to pirate attacks, there’s not the variation you would see in, say, news articles about many subjects. There’s a good walk-through of using the topicmodels package here.

Attacks in the Yemen region classified into one of five topics based on description

And what are these topics? The table below shows the top 10 terms for each of the 5 topics. They’re not as clear cut as you’d hope (mainly because there are a fair few verbs and numbers in there at the moment), but they give an idea of some differences: skiffs versus speedboats, Topic 2 featuring “security”, the numbers involved and the months of the year all hint at different aspects of the attacks which have been picked out.

     Topic 1  Topic 2  Topic 3    Topic 4    Topic 5
1  attempted     were hijacked       boat      boats
2        six     fire      are      white       four
3       took security attacked       port        men
4  increased   skiffs    miles      about    persons
5      board     team  vessels      small      three
6     skiffs     when     this        sep        may
7     alarm,      jan  advised        apr speedboats
8        for    seven merchant    general       five
9   chemical      had  boarded reportedly    reports
10      guns    which exercise  speedboat       each

And the reason why pirates are called pirates? Because they Argghhhhhhhhhh.

NB. I haven’t had a chance to check on the copyright, etc. for hosting the pirate dataset, so please download it from here.

Reading the data into R.

library(maps)
library(sp)
library(maptools)
library(ggplot2)
library(spatstat)
gpclibPermit()
library(topicmodels)
library(tm)       # Corpus, DocumentTermMatrix (used below)
library(slam)     # row_sums, col_sums (used below)
library(plyr)     # ddply (used below)
library(quantmod) # Delt (used below)
pirates.data <- readShapePoints("C:/ASAM 05 SEP 12")  # forward slashes avoid invalid escape characters in the path
pirates.data.2 <- as.data.frame(pirates.data)

How far to the shore?

The next step is to turn the data into a planar point pattern to allow calculation of the nearest coastal point for each attack. The same technique is then used to create a similar object for the coastline. The nncross function finds the distance from each attack to the nearest point on the coast (and identifies which point that is).

bb <- c(40, 56, 7, 17)
pirates.ppp <- as.ppp(pirates.data.2[, 13:14], bb)
worldmap <- map_data("world")
land.ppp <- as.ppp(worldmap[, 1:2], bb)
land.df <- as.data.frame(cbind(land.ppp$x, land.ppp$y))
reg <- as.data.frame(map("world", xlim = c(40, 56), ylim = c(7, 17), plot = FALSE)[1:2])  # keep just the x/y coordinates
nearest.land <- nncross(pirates.ppp, land.ppp)
pirates.nearest.land <- as.data.frame(cbind(as.numeric(pirates.ppp$x), as.numeric(pirates.ppp$y), as.numeric(nearest.land$dist)))
pirates.data.aden <- merge(pirates.nearest.land, pirates.data.2, by.x=c("V1", "V2"), by.y = c("coords.x1", "coords.x2"))

Calculate various extra columns such as year of date and number of attacks by year

pirates.data.aden$year <- 1900 + as.POSIXlt(pirates.data.aden$DateOfOcc)$year
pirates.data.aden$month <- 1 + as.POSIXlt(pirates.data.aden$DateOfOcc)$mon  # POSIXlt months run 0-11
year.stats <- ddply(pirates.data.aden, .(year), summarise, attacks = length(year))
year.stats$Delt <- Delt(year.stats$attacks)  # year-on-year change

Plot the world map showing attacks binned into hexagons

ggplot()+
	stat_summary_hex(fun="length", data=pirates.data.2, aes(x=coords.x1, y=coords.x2, z=coords.x2)) +
	scale_fill_gradient(low="white", high="red", "Pirate Attacks recorded") +
	geom_path(aes(x=x, y=y), data = worldmap) +
	mb.theme +  # a custom ggplot theme defined elsewhere
	labs(x="", y="") +
	theme(panel.background = element_rect(fill="white"),
		axis.ticks = element_line(colour="white"),
		axis.text = element_text(colour="white"),
		axis.line = element_line(colour="white"),
		panel.grid = element_line(colour=NA)) +
	scale_x_continuous(breaks=NA)+
	scale_y_continuous(breaks=NA)

Number of attacks by year near Aden

ggplot(data=pirates.data.aden, aes(x=year))+
	geom_histogram(binwidth=1, colour="white", fill="dark blue")+
	mb.theme +
	labs(x="Year", y="Number of attacks recorded")

Attacks by distance from Shore as a histogram

ggplot(data=subset(pirates.data.aden, year %in% c(2008, 2009)), aes(x=V3))+
	geom_histogram(fill="dark blue", colour="white")+
	mb.theme+
	facet_wrap(~year, ncol=1) +
	labs(x="Distance from shore", y="Number of attacks")

Topic Models analysis of attack descriptions near Aden

corpus <- Corpus(VectorSource(pirates.data.aden$Desc1))
dtm <- DocumentTermMatrix(corpus)
term_tfidf <- tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm > 0))
dtm <- dtm[, term_tfidf >= 0.1]
dtm <- dtm[row_sums(dtm) > 0,]
k <- 5
SEED <- 2012
TM <- list(VEM = LDA(dtm, k = k, control = list(seed = SEED)),
		VEM_fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
		Gibbs = LDA(dtm, k = k, method = "Gibbs", control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
		CTM = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))))
pirates.data.aden$Topic <- topics(TM[["Gibbs"]], 1)
ggplot(data=pirates.data.aden, aes(x=year, fill=as.factor(Topic), group=as.factor(Topic)))+
		geom_histogram(binwidth=1, colour="white")+
		scale_fill_brewer(palette="Set3", "Topic Group") +
		mb.theme +
		labs(x="", y="")

R: Dealing with package updates


Here’s a very short post to highlight one of the “highlights” of my week that I thought was worth sharing with the wider community.

One of the things I find great about R is the rapidly evolving ecosystem where new packages are being constantly created and others are being updated.

Up until now, I’ve found this to be a very good thing, but I experienced the other side this week, when an upgrade to a package broke a pretty big script that I’d been working on.

The “quick” solution in my case was to use the CRAN archive to download the source for earlier versions of the package (and its dependencies) that was causing the issue, and then build and install them to overwrite the upgrade. Knowledge of how to build a package from source in Windows came in very handy, and you can read more about how to do it in a previous D&L post.

The longer term solution is a lot trickier, particularly where R is being used in a collaborative environment and reproducibility is important, either between machines or over time.

I suspect one solution is to have a common library that is centrally maintained, and then to change the default location where R looks for installed packages (which is also a handy way to avoid having to download all your packages again once you’ve upgraded base R).
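For example, something along these lines points R at a shared library (a sketch; the path is made up):

# Put a centrally maintained library first in R's search path
.libPaths(c("//server/shared/R/library", .libPaths()))

# Or make it permanent for all sessions by setting R_LIBS in .Renviron:
#   R_LIBS="//server/shared/R/library"

.libPaths()  # confirm the search order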

On top of this, I suspect there would then need to be some type of test suite which, once updates to packages were available, checked in a development environment that existing scripts and processes still worked. None of this is new to software and IT folk, but I suspect it’s a novel issue for the analyst community at the moment.
