Searching for Christmas: How to customise your Gephi diagram with D3

It’s my first post for Coppelia and I was feeling festive, so decided to make a Christmas search visualisation.  You can use it to take a look at what people are Googling for this Christmas.

In this post I will show you how to customise a Gephi diagram using D3.

There have been quite a few posts on Coppelia by Simon, James and Andy on Gephi and D3 so I thought I’d bring the techniques together and create something new!

Dendrograms in R2D3

Hi, I’m Andrew and this is my first post for Coppelia! If you like the look of this feel free to visit my blog dinner with data (and see what happens when a data scientist hits the kitchen!)

I was excited by James’s last post on the new package R2D3, and I thought I would try to help develop it further. This is a great new package, built by James Thomson (in collaboration with myself and Simon Raper at Coppelia), that makes D3 visualisations available inside R. You can quickly create very striking visualisations with just a few lines of code. The package was introduced in a recent post, but since then a couple of updates have been made to increase its functionality.

In particular, the function D3Dendro, which creates dendrograms from an hclust object in R, has been updated. I had been working on a number of alternatives to the usual static dendrogram found in the package so far, so I thought I would add these in and describe them below.

I have created two new distinct functionalities:

• Collapsible nodes

You can clone the package from James’s github repository or run the following in R:

 install.packages("devtools")
 library(devtools)
 install_github("jamesthomson/R2D3")
 library(R2D3)

I will include the example in the original post, so you can easily compare the differences.

Original dendrogram:

 hc <- hclust(dist(USArrests), "ave")
 JSON <- jsonHC(hc)
 D3Dendro(JSON, file_out="USArrests_Dendo.html")

An A to Z of extra features for the D3 force layout

Since D3 can be a little inaccessible at times, I thought I’d make things easier by starting with a basic skeleton force-directed layout (Mike Bostock’s original example) and then giving you some blocks of code that can be plugged in to add various features that I have found useful.

The idea is that you can pick the features you want and slot in the code. In other words, I’ve tried to make things modular. I’ve taken the code from various places and adapted it, so thank you to everyone who has shared. I will try to provide the credits as far as I remember them!

Basic Skeleton

Here’s the basic skeleton, based on Mike Bostock’s original example, but with a bit of commentary to remind me exactly what is going on.

A is for arrows

If you are dealing with a directed graph and you wish to use arrows to indicate direction, you can just append this bit of code. Read more

Converting an R HClust object into a D3.js Dendrogram

Hi all I’m James. This is my first blog for Coppelia. Thanks to Simon for encouraging me to do this.

I’ve been doing a lot of hierarchical clustering in R and have started to find the standard dendrogram plot fairly unreadable once you have more than a couple of hundred records. I’ve recently been introduced to the D3.js gallery and I wondered if I could hack something better together. I found this dendrogram I liked and started to play. I soon realised that in order to get my data into it I needed a nested JSON. Read more

Converting a dendrogram into a graph for a D3 force directed layout

I wrote this code for a project that didn’t work out, but I thought I’d share. It takes the dendrogram produced by hclust in R and converts it into JSON to be used in a D3 force-directed graph (slicing the dendrogram near the top to create a couple of clusters). The dendrogram in R looks like this:

Dendrogram of clustering on US arrests

And the end result in D3 is this Read more
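For flavour, here is a minimal sketch of the slicing-and-converting step in base R. It uses cutree to slice the tree and links each point to a hub node per cluster, serialising the JSON by hand; the original code walked the hclust merge tree directly, so treat the exact structure here as illustrative rather than a copy of it:

```r
# Cluster the US arrests data and slice the tree into two clusters
hc <- hclust(dist(USArrests), "ave")
clusters <- cutree(hc, k = 2)

# Nodes: one per state, plus a hub node per cluster
states <- data.frame(name = rownames(USArrests), group = unname(clusters))
hubs   <- data.frame(name = paste0("cluster_", sort(unique(clusters))),
                     group = sort(unique(clusters)))
nodes  <- rbind(states, hubs)

# Links: each state to its cluster hub, 0-indexed as D3 expects
links <- data.frame(
  source = match(states$name, nodes$name) - 1,
  target = match(paste0("cluster_", clusters), nodes$name) - 1
)

# Serialise by hand into the {"nodes": [...], "links": [...]} shape
# that the D3 force layout examples consume
node_json <- paste0('{"name":"', nodes$name, '","group":', nodes$group, '}',
                    collapse = ",")
link_json <- paste0('{"source":', links$source, ',"target":', links$target,
                    ',"value":1}', collapse = ",")
json <- paste0('{"nodes":[', node_json, '],"links":[', link_json, ']}')
```

Writing `json` out to a file then lets the D3 page load it with `d3.json`.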

Visualising cluster stability using Sankey diagrams

The problem

I wanted a way of understanding how a clustering solution will change as more data points are added to the dataset on which it is built.

To explain this a bit more, let’s say you’ve built a segmentation on customers, or products, or tweets (something that is likely to increase) using one or other clustering solution, say hierarchical clustering. Sooner or later you’ll want to rebuild this segmentation to incorporate the new data and it would be nice to know how much the segmentation will change as a result.

One way of assessing this would be to take the data you have now, roll it back to a previous point in time and then add new chunks of data sequentially, each time rebuilding the clustering solution and comparing it to the one before.
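In R the roll-back-and-rebuild loop might look like the sketch below, using hclust on the USArrests data; the 20-point start and 5-point chunks match the experiment described in the next section, but cutting at k = 3 is an arbitrary choice for illustration:

```r
# Roll back to a 20-point subset, then add 5 points at a time,
# rebuilding the clustering and recording the labels each time
data(USArrests)
n_start <- 20
labels_over_time <- list()
for (n in seq(n_start, nrow(USArrests), by = 5)) {
  hc <- hclust(dist(USArrests[1:n, ]), method = "average")
  # record cluster membership for the original 20 points only,
  # so successive solutions are directly comparable
  labels_over_time[[as.character(n)]] <- cutree(hc, k = 3)[1:n_start]
}
# How did the original 20 points move between solutions 1 and 2?
table(first = labels_over_time[["20"]], second = labels_over_time[["25"]])
```

Cross-tabulating consecutive solutions like this is exactly the flow information a Sankey diagram can then display.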

Seeing what’s going on

Having recorded the different clusters that result from incrementally adding data, the next problem is to understand what is going on. I thought a good option would be a Sankey diagram. I’ve tested this out on the US crime data that comes with R. I built seven different clustering solutions using hclust, each time adding five new data points to the original 20-point data set. I used the Google Charts Sankey layout, which is itself derived from the D3 layout. Here’s the result. Read more

Finding neighbours in a D3 force directed layout

I’ve adapted the code from this stackoverflow answer (from D3’s creator Mike Bostock) so that you can highlight neighbours in a force-directed layout. Just double-click on a node to fade out any non-immediate neighbours, and double-click again to bring them back.

Freehand diagrams have two big virtues: they are quick and they are unconstrained.

I used to use a notebook (see What are degrees of freedom) but recently I got an iPad and then I found Adobe Ideas. It’s completely free and has just the right level of complexity for getting ideas down fast.

A diagram for teaching machine learning

It takes a bit of perseverance to get into a new habit but it pays off. Here are some examples and some tips on how to make it work as an alternative to paper.

First, don’t bother using your fingers. You won’t have nearly enough control. Get a stylus. I use this one, which is fairly cheap but works fine. I’ve noticed that it moves more freely when the iPad is in landscape orientation, so I always draw with it that way round. I don’t know why this is, but I guess it’s something to do with the glass.

The confusion matrix

Neighbourhood of C major

When you first open Adobe Ideas it’s tempting to see the page in front of you as something that needs to be filled, as you would a piece of paper. That’s the wrong way to approach it, as you’ll get overly distracted by the task of orientating your work on the page. Instead treat the page as an enormous workboard. Start by zooming in somewhere in the middle and work outwards.

A diagram for teaching machine learning

It’s vector-graphics based, so in principle infinitely deep. You’re only limited by how thin the pens get. Draw your diagram without any thought for the borders, then right at the end scale it up to fit the page.

The next thing that might strike you is that there’s no copy and paste. Ideas is part of Adobe’s creative suite along with Photoshop and Illustrator, and as such inherits their layer-based approach to image editing. If you want to be able to adjust the positioning of parts of your diagram (say the key with respect to the rest of it) then you would be wise to put these on separate layers. You’d think this would be deeply frustrating; however, it’s one of those constraints that somehow makes you work better, as you think more about your layout.

You are allowed 10 layers in total with one photo layer. I sometimes place an image of a grid on this layer if I need grid lines to work to.

If you are working quickly you don’t want to spend too much time fretting about the colour palette. Again Ideas forces you into simplicity by limiting your colours to five at a time. Even better, you can connect up to Kuler, provided its app is also on your iPad. This gives you access to a huge range of palettes. The Kuler tool, which sucks the palette out of an image, is also there by default in Ideas if you click on themes. This allows you to pull the colour scheme out of any image on the photo layer.

Some Diagram Elements

When it comes to using the pen tool I tend to stick to the marker and constantly zoom in and out as I draw. I zoom right in for writing text and out again for drawing shapes. You soon get used to flipping between using your fingers to zoom and pan and your stylus to draw. It’s worth noting down the width of the pens you are using, as it’s easy to lose track of where you are, and a diagram made up of many different line widths looks bad. I tend to use two widths: 6.5 and 3.

The toughest thing to pull off is anything that requires large-scale swipes across the surface of the iPad, for example long straight lines or wide circles. It is easier to do this on a small scale with a thin pen and then enlarge the image.

One thing I didn’t realise for a while is that holding the pen tool for a few seconds in any completely enclosed area floods that area with colour. This is very useful for filling shapes and creating colourful backgrounds.

I try to vary styles and colours, although it takes an effort to keep this up if you are taking notes or in a rush. I’ve listed in the panel on the left some of the diagram elements I’ve experimented with.

Adobe Ideas looks at first sight too limited to be particularly powerful but take a look at How to Use Adobe Ideas by Michael Startzman to see what is achievable by a professional!

A confused tangle

A confusion matrix is a confusing thing. There’s a surprising number of useful statistics that can be built out of just four numbers, and the links between them are not always obvious. The terminology doesn’t help (is a true negative an observation that is truly in the class but classified negative, or one that is negative and has truly been classified as such?) and neither does the fact that many of the statistics have more than one name (recall = sensitivity = power!).

To unravel it a little I’ve used the Tangle JS library to create an interactive document that shows how these values are related. The code can be found here.

The interactive example

I have trained my classifier to separate wolves from sheep. Let’s say sheep is a positive result and wolf is a negative result (sorry, wolves). I now need to test it on my test set. This consists of wolves and sheep. That’s test subjects altogether.

Say my classifier correctly identifies sheep as sheep (true positives) and wolves as wolves (true negatives).

This gives us the confusion matrix below:


Now some statistics that need to be untangled!

Precision (aka positive predictive value) is TP/(TP+FP).

Recall (aka sensitivity, power) is TP/(TP+FN).

Specificity is TN/(TN+FP).

Negative predictive value is TN/(TN+FN).

Accuracy is (TP+TN)/(TP+TN+FP+FN).

False discovery rate is FP/(FP+TP).

False positive rate (aka false alarm rate, fall-out) is FP/(FP+TN).
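The live numbers only work in the interactive version, so here is a static worked example in R. The counts are invented for illustration (100 sheep and 100 wolves in the test set), not taken from the interactive document:

```r
# Hypothetical test-set counts (invented for illustration):
TP <- 90   # sheep correctly classified as sheep
FN <- 10   # sheep wrongly classified as wolves
TN <- 80   # wolves correctly classified as wolves
FP <- 20   # wolves wrongly classified as sheep

stats <- c(
  precision   = TP / (TP + FP),                  # positive predictive value
  recall      = TP / (TP + FN),                  # sensitivity, power
  specificity = TN / (TN + FP),
  npv         = TN / (TN + FN),                  # negative predictive value
  accuracy    = (TP + TN) / (TP + TN + FP + FN),
  fdr         = FP / (FP + TP),                  # false discovery rate
  fpr         = FP / (FP + TN)                   # false alarm rate, fall-out
)
round(stats, 3)
```

Note how precision and the false discovery rate share a denominator and sum to one, which is one of the tangled links the interactive document is trying to expose.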

Visualising Shrinkage

A useful property of mixed effects and Bayesian hierarchical models is that lower level estimates are shrunk towards the more stable estimates further up the hierarchy.

To use a time-honoured example, you might be modelling the effect of a new teaching method on performance at the classroom level. Classes of 30 or so students are probably too small a sample to get useful results. In a hierarchical model the data are pooled, so that all classes in a school are modelled together as a hierarchy, and even all schools in a district.

At each level in this hierarchy an estimate for the efficiency of the teaching method is obtained. You will get an estimate for the school as a whole and for the district. You will even get estimates for the individual classes. These estimates will be weighted averages of the estimates for the class and the estimate for the school (which in turn is a weighted average of the estimate for the school and the district.) The clever part is that this weighting is itself determined by the data. Where a class is an outlier, and therefore the overall school average is less relevant, the estimate will be weighted towards the class. Where it is typical it will be weighted towards the school. This property is known as shrinkage.
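A toy version of that weighted average makes the idea concrete. Here precision weighting stands in for "the weighting is determined by the data"; all the numbers are invented:

```r
b_class  <- 2.0   # estimate from one class alone (noisy, small sample)
b_school <- 0.5   # estimate for the school as a whole
v_class  <- 4     # variance of the class estimate
v_school <- 1     # variance of the school estimate

# Precision weighting: the noisier the class estimate, the more
# weight shifts to the school estimate
w <- (1 / v_class) / (1 / v_class + 1 / v_school)   # here w = 0.2
s_class <- w * b_class + (1 - w) * b_school         # shrunk towards the school
```

With these numbers the class estimate of 2.0 is pulled most of the way towards the school estimate of 0.5, landing at 0.8, because the class estimate is the noisier of the two.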

I’m often interested in how much shrinkage is affecting my estimates and I want to see it. I’ve created this plot, which I find useful. It’s done in R using ggplot2 and is very simple to code.

The idea is that the non-shrunk estimates bi (i.e. the estimates that would be obtained by modelling classes individually) are plotted along the line x=y at the points (bi, bi). The estimates they are being shrunk towards, ai, are plotted at the points (bi, ai). Finally we plot the shrunk estimates si at (bi, si) and connect the points with an arrow to illustrate the direction of the shrinkage.

Here is an example. You can see the extent of the shrinkage from the distance covered by the arrow towards the higher-level estimate.

Note the arrows do sometimes point away from the higher-level estimate. This is because the data is for a single coefficient in a hierarchical regression model with multiple coefficients. Where other coefficients have been stabilised by shrinkage, this causes this particular coefficient to be revised.

The R code is as follows:

# *--------------------------------------------------------------------
# | FUNCTION: shrinkPlot
# | Function for visualising shrinkage in hierarchical models
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change
# |     01  |31/08/2013|Simon Raper |first version.
# *--------------------------------------------------------------------
# | INPUTS:  orig      Estimates obtained from individual level
# |                    modelling
# |          shrink    Estimates obtained from hierarchical modelling
# |          prior     Priors in Bayesian model or fixed effects in
# |                    mixed effects model (i.e. what it is shrinking
# |                    towards)
# |          window    Limits for the plot (as a vector)
# |
# *--------------------------------------------------------------------
# | OUTPUTS: A ggplot object
# *--------------------------------------------------------------------
# | DEPENDS: grid, ggplot2
# |
# *--------------------------------------------------------------------
library(ggplot2)
library(grid)
shrinkPlot <- function(orig, shrink, prior, window=NULL){
  # Colour the arrows by the (rounded) higher level estimate
  group <- factor(signif(prior, 3))
  data <- data.frame(orig, shrink, prior, group)
  # Arrows run from (orig, orig) on the line x=y to (orig, shrink)
  g <- ggplot(data=data, aes(x=orig, xend=orig, y=orig, yend=shrink, col=group))
  # Mark the higher level estimates at (orig, prior)
  g2 <- g + geom_segment(arrow = arrow(length = unit(0.3, "cm"))) +
    geom_point(aes(y=prior))
  g3 <- g2 + xlab("Estimate") + ylab("Shrinkage") + ggtitle("Shrinkage Plot")
  if (!is.null(window)){
    g3 <- g3 + ylim(window) + xlim(window)
  }
  print(g3)
}