## Quick start regex for analysts: Part I

…with much excitement and a tiny bit of nervousness, here I am with my first blog post on Coppelia! Starting with my beloved regex, this is a series of three posts. Thank you so much Simon Raper for having me here and I hope you guys will enjoy reading it at least as much as I enjoyed writing it!

Let’s do it…!

Learning and applying regex (short for regular expressions) can be a frustrating process. I have been there and spent many painful hours figuring it out. I’d now like to share what I’ve learnt in the hope that it will be useful to others.

###### What you’ll need
• A Java enabled browser
• An understanding of datasets in text format and different types of separators
• A text editor (I am using TextWrangler, but any is OK, more on that below)

## The local neighbourhood of C Major

Here’s a chart I drew for myself to understand the relationships between chords in music theory. Doesn’t seem to have much to do with machine learning and statistics but in a way it does since I found it a lot easier to picture the chords existing in a sort of network space linked by similarity. Similarity here is defined as the removal or addition of a note, or the sliding of a note one semitone up or down. What’s wrong with me!

The neighbourhood of C-Major
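
The similarity rule can be sketched in a few lines of Python. This is only an illustration: representing a chord as a set of pitch classes (0 = C, 1 = C#, ... 11 = B) and the function name `neighbours` are my assumptions, not something taken from the chart. The sketch also returns every note set one step away, whether or not it is a named chord.

```python
def neighbours(chord):
    """Return all pitch-class sets one step away from `chord`.

    A step is: removing a note, adding a note, or sliding one note
    a semitone up or down (wrapping around the octave).
    """
    chord = frozenset(chord)
    result = set()
    for note in chord:
        # Remove a note.
        result.add(chord - {note})
        # Slide a note one semitone down or up.
        for step in (-1, 1):
            moved = (note + step) % 12
            if moved not in chord:
                result.add((chord - {note}) | {moved})
    # Add a note that is not already in the chord.
    for note in range(12):
        if note not in chord:
            result.add(chord | {note})
    return result


c_major = frozenset({0, 4, 7})  # C, E, G
c_minor = frozenset({0, 3, 7})  # C, Eb, G
c_major_neighbours = neighbours(c_major)
```

Here `c_minor` appears in `c_major_neighbours` because it is reached by sliding E down one semitone.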

## Converting a dendrogram into a graph for a D3 force directed layout

I wrote this code for a project that didn’t work out but I thought I’d share. It takes the dendrogram produced by hclust in R and converts it into json to be used in a D3 force directed graph (slicing the dendrogram near the top to create a couple of clusters). The dendrogram in R looks like this

Dendrogram of clustering on US arrests

And the end result in D3 is this
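
The original code is in R and is not reproduced here. As a rough Python sketch of the core idea, the function below turns an hclust-style merge table into the `nodes` and `links` lists that a D3 force layout reads. It relies on the convention used by R's `hclust` merge matrix: a negative entry is an original observation (1-indexed) and a positive entry is the cluster formed at that earlier step. The function name `merge_to_graph`, the node fields and the toy merge table are all hypothetical, and the slicing near the top of the dendrogram is not attempted.

```python
import json


def merge_to_graph(merge, labels):
    """Convert an hclust-style merge table into D3 nodes and links."""
    n = len(labels)
    # Leaves come first, so observation j (1-indexed) is node j - 1.
    nodes = [{"name": label, "leaf": True} for label in labels]
    links = []
    for step, (a, b) in enumerate(merge):
        parent = n + step  # index of the internal node created at this step
        nodes.append({"name": f"merge{step + 1}", "leaf": False})
        for child in (a, b):
            # Negative: original observation. Positive: earlier merge step.
            target = -child - 1 if child < 0 else n + child - 1
            links.append({"source": parent, "target": target})
    return {"nodes": nodes, "links": links}


# Toy merge table for three observations (made up for illustration).
merge = [(-1, -2), (-3, 1)]
graph = merge_to_graph(merge, ["Alabama", "Alaska", "Arizona"])
graph_json = json.dumps(graph, indent=2)
```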

## Visualising cluster stability using Sankey diagrams

###### The problem

I wanted a way of understanding how a clustering solution will change as more data points are added to the dataset on which it is built.

To explain this a bit more, let’s say you’ve built a segmentation on customers, or products, or tweets (something that is likely to increase) using one or other clustering solution, say hierarchical clustering. Sooner or later you’ll want to rebuild this segmentation to incorporate the new data and it would be nice to know how much the segmentation will change as a result.

One way of assessing this would be to take the data you have now, roll it back to a previous point in time and then add new chunks of data sequentially each time rebuilding the clustering solution and comparing it to the one before.

###### Seeing what’s going on

Having recorded the different clusters that result from incrementally adding data, the next problem is to understand what is going on. I thought a good option would be a Sankey diagram. I’ve tested this out on the US crime data that comes with R. I built seven different clustering solutions using hclust, each time adding five new data points to the original 20 point data set. I used the google charts Sankey layout which itself is derived from the D3 layout. Here’s the result.
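
A Sankey diagram needs rows of the form (from, to, weight). One way to get them, sketched here in Python under my own assumptions, is to count how many points move from each cluster in one solution to each cluster in the next. The function name `sankey_flows`, the `stepN_cK` labels and the toy cluster assignments are hypothetical.

```python
from collections import Counter


def sankey_flows(before, after, step):
    """Count points moving from each cluster in `before` to each in `after`.

    `before` and `after` map a point's name to its cluster label.
    Points that only exist in `after` (the newly added data) have no
    origin cluster, so they are left out of the flows.
    """
    shared = before.keys() & after.keys()
    counts = Counter((before[p], after[p]) for p in shared)
    return sorted(
        (f"step{step}_c{src}", f"step{step + 1}_c{dst}", weight)
        for (src, dst), weight in counts.items()
    )


# Toy example: point "d" is new, and point "b" switches cluster.
before = {"a": 1, "b": 1, "c": 2}
after = {"a": 1, "b": 2, "c": 2, "d": 2}
flows = sankey_flows(before, after, step=1)
```

Repeating this for each pair of consecutive solutions and concatenating the rows gives the full table for the chart.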

## Picturing the output of a neural net

Some time ago during a training session a colleague asked me what a surface plot of a two input neural net would look like. That is, if you have two inputs x_1 and x_2 and plot the output y as a surface what do you get? We thought it might look like a set of waterfalls stacked on each other.

### Tip

For this post I’m going to use draw.io, Wolfram Alpha and some JavaScript. Check other tools in the toolbox.

Since neural nets are often considered a black box solution I thought it would be interesting to check this. If you can picture the output for a simple case it makes things less mysterious in the more general case.

Let’s look at a neural net with a single hidden layer of two nodes and a sigmoid activation function. In other words something that looks like this. (If you need to catch up on the theory please see this brilliant book by David MacKay that’s free to view online)

Neural net with a single hidden layer

Drawn using the lovely draw.io

###### Output for a single node

We can break the task down a little by doing a surface plot of a single node, say node 1. We are using the sigmoid activation function so what we are plotting is the application of the sigmoid function to what is essentially a function describing a plane.
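
To make the single-node case concrete, here is a minimal Python sketch. The weights and biases are made up for illustration, and passing the output node through a sigmoid as well is my assumption. `node_output` is the sigmoid applied to the plane $w_1 x_1 + w_2 x_2 + b$, and `net_output` combines two such nodes.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def node_output(x1, x2, w1, w2, b):
    """Sigmoid applied to the plane w1*x1 + w2*x2 + b."""
    return sigmoid(w1 * x1 + w2 * x2 + b)


def net_output(x1, x2, hidden, out_weights, out_bias):
    """Two-input net with one hidden layer and a sigmoid output node."""
    h = [node_output(x1, x2, w1, w2, b) for (w1, w2, b) in hidden]
    return sigmoid(sum(v * hi for v, hi in zip(out_weights, h)) + out_bias)


# Hypothetical parameters: two hidden nodes, each given as (w1, w2, b).
hidden = [(1.0, 1.0, 0.0), (-2.0, 1.0, 0.5)]
out_weights = [1.5, -1.0]
out_bias = 0.0

# Evaluate the surface on a coarse grid of (x1, x2) values.
grid = [x / 2.0 for x in range(-8, 9)]
surface = [[net_output(x1, x2, hidden, out_weights, out_bias) for x1 in grid]
           for x2 in grid]
```

Each row of `surface` holds the output along $x_1$ for a fixed $x_2$, which is the shape most surface-plotting tools expect.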

## Distribution for the difference between two binomially distributed random variables

I was doing some simulation and I needed a distribution for the difference between two proportions. It’s not quite as straightforward as the difference between two normally distributed variables and since there wasn’t much online on the subject I thought it might be useful to share.

So we start with

$X \sim Bin(n_1, p_1)$

$Y \sim Bin(n_2, p_2)$

We are looking for the probability mass function of $Z=X-Y$

First note that the min and max of the support of Z must be $(-n_2, n_1)$ since that covers the most extreme cases ($X=0$ and $Y=n_2$) and ($X=n_1$ and $Y=0$).

Then we need a modification of the binomial pmf so that it can cope with values outside of its support.

$m(k, n, p) = \binom {n} {k} p^k (1-p)^{n-k}$ when $0 \leq k \leq n$ and 0 otherwise.

Then we need to define two cases

1. $Z \geq 0$
2. $Z < 0$

In the first case $p(z) = \sum_{i=0}^{n_1} m(i+z, n_1, p_1) m(i, n_2, p_2)$ since this covers all the ways in which X-Y could equal z. For example when z=1 this is reached when X=1 and Y=0 and X=2 and Y=1 and X=3 and Y=2 and so on. It also deals with cases that could not happen because of the values of $n_1$ and $n_2$. For example if $n_2 = 4$ then we cannot get Z=1 as a combination of X=6 and Y=5. In this case thanks to our modified binomial pmf the probability is zero.

For the second case we just reverse the roles. For example if z=-1 then this is reached when X=0 and Y=1, X=1 and Y=2 etc.

$p(z) = \sum_{i=0}^{n_2} m(i, n_1, p_1) m(i-z, n_2, p_2)$

Here z is negative, so $i - z = i + |z|$.

Put them together and that’s your pmf.

Here’s the function in R and a simulation to check it’s right (and it does work.)
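
The R code is not reproduced here. As a stand-in, this is a minimal Python sketch of the same pmf, with an exact brute-force enumeration in place of a simulation. The function names are my own. For negative z it pairs X = i with Y = i - z, which matches the example of X=0 and Y=1 for z=-1.

```python
from math import comb


def m(k, n, p):
    """Binomial pmf that returns 0 outside the support 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def diff_pmf(z, n1, p1, n2, p2):
    """P(X - Y = z) for independent X ~ Bin(n1, p1) and Y ~ Bin(n2, p2)."""
    if z >= 0:
        # Y = i and X = i + z.
        return sum(m(i + z, n1, p1) * m(i, n2, p2) for i in range(n1 + 1))
    # z < 0: X = i and Y = i - z.
    return sum(m(i, n1, p1) * m(i - z, n2, p2) for i in range(n2 + 1))


def brute_force_pmf(z, n1, p1, n2, p2):
    """Same probability by enumerating every (x, y) pair with x - y = z."""
    return sum(
        m(x, n1, p1) * m(y, n2, p2)
        for x in range(n1 + 1)
        for y in range(n2 + 1)
        if x - y == z
    )


# Example parameters, chosen arbitrarily.
n1, p1, n2, p2 = 5, 0.3, 7, 0.6
pmf = {z: diff_pmf(z, n1, p1, n2, p2) for z in range(-n2, n1 + 1)}
```

The values in `pmf` should sum to 1 and agree with `brute_force_pmf` for every z in the support.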

## Are there any good five letter dot com domains?

Here’s a python script I wrote when tearing my hair out over domain names.

It creates all five letter word permutations, ranks them by how pronounceable they are (in a rough and ready kind of way), uses wiktionary to check they are a word in some language then finally checks whether the domain is free.

I didn’t like any of them.

## The analyst’s toolbox

There are hundreds, maybe thousands, of open source/free/online tools out there that form part of the analyst’s toolbox. Here’s what I have on my mac for day to day work. Click on the leaf node labels to be redirected to the relevant sites. Visualisation in D3.

I’ve also created a video tour that gives a quick demo of some of the tools on my desktop.

## Finding neighbours in a D3 force directed layout

I’ve adapted the code from this stackoverflow answer (from D3’s creator Mike Bostock) so that you can highlight neighbours in a force directed layout. Just double-click on a node to fade out any non-immediate neighbours and double-click again to bring them back.
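
The adapted code itself is JavaScript and is not shown here. The underlying lookup is simple enough to sketch in Python: record every linked pair in both directions, then keep a node visible only if it is the selected node or directly linked to it. The function names and the toy graph are hypothetical, and this is not the stackoverflow code.

```python
def build_adjacency(links):
    """Store each link in both directions so lookups ignore link direction."""
    linked = set()
    for source, target in links:
        linked.add((source, target))
        linked.add((target, source))
    return linked


def visible_nodes(selected, nodes, linked):
    """Nodes that stay fully visible when `selected` is double-clicked."""
    return [n for n in nodes if n == selected or (selected, n) in linked]


# Toy graph: a - b - c, with d attached to c.
nodes = ["a", "b", "c", "d"]
links = [("a", "b"), ("b", "c"), ("c", "d")]
linked = build_adjacency(links)
highlighted = visible_nodes("b", nodes, linked)
```

Everything outside `highlighted` would be the set of nodes to fade out.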