R: Dealing with package updates


Here’s a very short post about one of the “highlights” of my week that I thought was worth sharing with the wider community.

One of the things I find great about R is the rapidly evolving ecosystem where new packages are being constantly created and others are being updated.

Up until now, I’ve found this to be a very good thing, but I experienced the other side this week, when an upgrade to a package broke a pretty big script that I’d been working on.

The “quick” solution in my case was to use the CRAN archive to download the source for earlier versions of the offending package (and its dependencies), then build and install them to overwrite the upgrade. Knowledge of how to build a package from source on Windows came in very handy, and you can read more about how to do it in a previous D&L post.
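As a rough sketch, older source versions follow a predictable URL pattern in the CRAN archive; the package name and version below are illustrative examples, not the package from my script:

```r
# Build the CRAN archive URL for an older source version of a package.
# The package name and version here are made-up examples.
archive_url <- function(pkg, ver) {
  sprintf("https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
          pkg, pkg, ver)
}

url <- archive_url("ggplot2", "0.9.3")
# install.packages(url, repos = NULL, type = "source")  # build and install
```

The `install.packages` call is commented out here; run it once you have the build tools (Rtools on Windows) in place.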

The longer term solution is a lot trickier, particularly where R is being used in a collaborative environment and reproducibility is important, either between machines or over time.

I suspect one solution is to have a common library that is centrally maintained, and then to change the default location where R looks for installed packages (which is also handy for avoiding having to download all your packages again once you’ve upgraded base R).
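Pointing R at a shared library is a one-liner with `.libPaths()`; here’s a sketch in which a throwaway temporary directory stands in for a real network share:

```r
# Prepend a shared library location so R installs to and loads from it first.
# A real setup would use a network path; tempdir() stands in for one here.
shared_lib <- file.path(tempdir(), "shared-R-library")
dir.create(shared_lib, showWarnings = FALSE, recursive = TRUE)
.libPaths(c(shared_lib, .libPaths()))
```

Setting the `R_LIBS` environment variable (for example in a `.Renviron` file) achieves the same thing permanently across sessions.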

On top of this, I suspect there would then need to be some kind of test suite which, once updates to packages were available, checked in a development environment that existing scripts and processes still worked. None of this is new to software and IT folk, but I suspect it’s a novel issue at the moment for the analyst community.
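Such a check could be as simple as sourcing each script in a clean environment and recording failures. A minimal sketch (the two demo scripts are throwaway files created just for illustration):

```r
# Re-run key scripts after package updates and report which ones still work.
check_scripts <- function(scripts) {
  vapply(scripts, function(path) {
    # Source each script in its own environment; any error counts as a failure.
    tryCatch({ source(path, local = new.env()); TRUE },
             error = function(e) FALSE)
  }, logical(1))
}

# Demo with two throwaway scripts: one fine, one deliberately broken.
good <- tempfile(fileext = ".R"); writeLines("x <- 1 + 1", good)
bad  <- tempfile(fileext = ".R"); writeLines("stop('broken')", bad)
status <- check_scripts(c(good, bad))
```

In practice you would also want to compare outputs against known-good results, not just check that the scripts run without error.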

R: Creating a shortcut to run a gWidgets GUI


I’ve been playing around with gWidgets on Windows over the last few weeks as a way of creating front ends for various functions and sets of functions that I’ve written, so that non-R users can have the benefit of them without writing a single line of code.

The likes of 4Dpiecharts offer some good introductions to various aspects of the tool, but one thing I struggled to find was an icon-based way of launching the GUI – i.e. click an icon on the desktop and have the GUI launch like an application.

At first, I assumed that this could be done via a batch file, but found that this didn’t seem to work when the script’s output is a GUI.

I then came across this very handy Stack Overflow post. Although slightly enigmatic, the method outlined by Greg Snow worked and gave me what I wanted.

1. Create a new folder at a convenient place on your drive.

2. Create a shortcut to RGui.exe or Rterm.exe – these will likely be located in a folder like C:\Program Files\R\R-version\bin\i386. If the shortcut is to RGui.exe then the full R GUI will be brought up, whilst Rterm.exe will bring up the console version (I went with this option so that the focus would stay with the GUI).

3. Create an R function called .First which includes the various functions, data, etc. to launch your code.

.First <- function() {
  # insert code here
}

4. Ensure that your R session is clean and contains only the functions (particularly .First) and data that you want to load when you double-click the shortcut. Save this workspace (which will be saved as “.RData”) to the folder that you created in Step 1.

5. Right click on the Shortcut that you created in Step 2 and change the Start In property (under the “Shortcut” tab) to the name of the folder that you created in Step 1.
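Putting Steps 3 and 4 together, a minimal `.First` might look like this. The window title, button label and handler are made-up placeholders, and I’m assuming the tcltk toolkit (RGtk2 works too):

```r
# A skeleton .First to launch a gWidgets GUI when the shortcut is clicked.
# Everything inside the window is an illustrative placeholder.
.First <- function() {
  library(gWidgets)
  options(guiToolkit = "tcltk")
  win <- gwindow("My analysis tool")
  gbutton("Run analysis", container = win,
          handler = function(h, ...) print("Analysis goes here"))
}
```

With this function defined, `save.image()` in the folder from Step 1 writes the `.RData` file that the shortcut will pick up.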

Graphing the history of philosophy


A close up of ancient and medieval philosophy ending at Descartes and Leibniz

If you are interested in this data set you might like my latest post where I use it to make book recommendations.

This one came about because I was searching for a data set on horror films (don’t ask) and ended up with one describing the links between philosophers.

To cut a long story very short, I’ve extracted the information in the “influenced by” section for every philosopher on Wikipedia and used it to construct a network, which I’ve then visualised using Gephi.

It’s an easy process to repeat. It could be done for any area within Wikipedia where the information forms a network. I chose philosophy because firstly the influences section is very well maintained and secondly I know a little bit about it. At the bottom of this post I’ve described how I got there.

First I’ll show why I think it’s worked as a visualisation. Here’s the whole graph.

Each philosopher is a node in the network and the lines between them (or edges, in the terminology of graph theory) represent lines of influence. The node and text are sized according to the number of connections (both in and out). The algorithm that visualises the graph also tends to put the better connected nodes in the centre of the diagram, so we see the most influential philosophers, in large text, clustered in the centre. It all seems about right, with the major figures in the western philosophical tradition taking centre stage. (I still need to add the direction of influence with an arrow head, something I’ve not got round to yet.) A shortcoming, however, is that this evaluation only takes into account direct lines of influence; indirect influence via another person in the network does not enter into it. This probably explains why Descartes is smaller than you’d think. It would also be better if the nodes were sized only by the number of outward connections, although I think overall the differences would be slight. I’ll get round to that.
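Sizing by outward connections only would be a small change: with the edge list in a data frame, out-degree is just a frequency count of the source column. A toy example (the philosophers and edges below are illustrative):

```r
# Out-degree from an edge list: count how often each node appears as a source.
# The edges here are a made-up toy example, not the real data set.
edges <- data.frame(
  from = c("Kant", "Kant", "Hegel"),
  to   = c("Hegel", "Schopenhauer", "Marx")
)
out_degree <- table(edges$from)
```

Here `out_degree` gives Kant a count of 2 and Hegel a count of 1, which is exactly the quantity you would map to node size.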

It gets more interesting when we use Gephi to identify communities (or modules) within the network. Roughly speaking it identifies groups of nodes which are more connected with each other than with nodes in other groups. Philosophy has many traditions and schools so a good test would be whether the algorithm picks them out.

It has been fairly successful. Below we can see the so-called continental tradition picked out in green, stemming from Hegel and Nietzsche, leading into Heidegger and Sartre and ending in the isms of the twentieth century. It’s interesting that there is a separate subgroup, in purple, influenced mainly by Schopenhauer (out of shot) and Freud.

The Continental Tradition

And this close up is of the analytical tradition emerging out of Frege, Russell and Wittgenstein. At the top and to the left you can see the British empirical school and the American pragmatists.

British Empiricism, American Pragmatism and the Analytical Tradition

It would be interesting to play with the number of groups picked out by the algorithm; it would hopefully identify subgroups within these overarching traditions.

The graph is probably most insightful when you zoom in close. Gephi produces a vector graphic output so if you’re interested you can download it here and explore it yourself.

Now for how you do it.

The first stop is DBpedia. This is a fantastic resource which stores structured information extracted from Wikipedia in a database that is accessible through the web. Among other things it stores all of the information you see in an infobox on a Wikipedia page. For example, I was after the influenced and influenced by fields that you find in the infobox on the page for Plato.

The next step is to extract this information. For this we need two things: a SPARQL endpoint (try snorql), which is an online interface for submitting queries, and a little knowledge of SPARQL, a specialist language for querying the semantic web. This is a big (and exciting) area that has to do with querying information structured in triples (subject-relationship-object). I assume it has its roots in predicate logic, so the analytical philosophers would have been pleased. However, the downside is that the language itself is a lot more difficult to learn than, say, SQL, and to complicate things still further you need to know the ontological structure of the resource you are querying. I probably wouldn’t have got anywhere at all were it not for a great blog post by Bob DuCharme, which is a simple guide to getting the information out of Wikipedia.

In the end the query I needed was very simple. You can test it by submitting it in snorql.

SELECT * WHERE {
  ?p a <http://dbpedia.org/ontology/Philosopher> .
  ?p <http://dbpedia.org/ontology/influenced> ?influenced .
}

It then needed a bit of cleaning, as the punctuation was encoded for URLs. For this I used an online URL decoder.
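Incidentally, base R’s `URLdecode` would also do this step; the name below is just an example string:

```r
# Percent-encoded names from the query output decode with base R's URLdecode.
decoded <- URLdecode("Jean-Paul%20Sartre")
# decoded is now "Jean-Paul Sartre"
```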

After a bit more simple manipulation in Excel I had a finished CSV file that consisted of three columns:

Philosopher A
Philosopher B
Weight

Each row in the data set represented a line of influence from philosopher A to philosopher B. The weight column contained a dummy entry of 1 because in our graph we do not want any one link to matter more than any other.
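In R terms, building that file looks something like this; the philosopher names and the output location are illustrative:

```r
# The edge list Gephi expects: source, target and a constant weight of 1.
# The rows here are a made-up example.
edges <- data.frame(
  Source = c("Socrates", "Plato"),
  Target = c("Plato", "Aristotle"),
  Weight = 1
)
out <- file.path(tempdir(), "philosopher_edges.csv")
write.csv(edges, out, row.names = FALSE)
```

The constant `Weight` of 1 keeps every edge on an equal footing when Gephi lays out the graph.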

Gephi is the tool I used to create the visualisation of the graph. It’s both fantastic and open source, and you can download it and set it up in minutes. For a quick tutorial see this link. There are many settings you can use to change the way your graph looks. I used a combination of the Force Atlas and Fruchterman-Reingold layout algorithms. I then scaled text and node size by node degree (number of connections) and suppressed all nodes with fewer than four connections (it was overwhelming otherwise). The partition tool is used to create the communities; full instructions are in the tutorial. I also found this blog entry very useful as a guide.

I hope that helps anyone who is trying to do something similar. If anyone does have a data set on horror films tagged with keywords, please let me know!


Update: Griff at Griff’s Graphs has used the instructions above to create a fantastic visualisation of the influence network of everyone on Wikipedia. It’s well worth a look.

Creative Commons Licence
Graphing the history of philosophy by Simon Raper is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Based on a work at drunksandlampposts.files.wordpress.com.
Permissions beyond the scope of this license may be available at http://drunks-and-lampposts.com/.
