Mahout for R Users

I have a few posts coming up on Apache Mahout, so I thought it might be useful to share some notes. I came to it primarily as an R coder, with some very rusty Java and C++ somewhere in the back of my head, so that will be my point of reference. I’ve also included at the bottom some notes for setting up Mahout on Ubuntu.

What is Mahout?

A machine learning library written in Java that is designed to be scalable, i.e. to run over very large data sets. It achieves this by ensuring that most of its algorithms are parallelizable (they fit the map-reduce paradigm and therefore can run on Hadoop). Using Mahout you can do clustering, recommendation, prediction etc. on huge datasets by increasing the number of CPUs it runs over. Any job that you can split up into little jobs that can be done at the same time is going to see vast improvements in performance when parallelized.

Like R it’s open source and free!

So why use it?

It should be obvious from the last point. The parallelization trick brings data and tasks that were once beyond the reach of machine learning suddenly into view. But there are other virtues. Java’s strictly object-orientated approach is a catalyst to clear thinking (once you get used to it!). And then there is a much shorter path to integration with web technologies. If you are thinking of a product, rather than just a one-off piece of analysis, then this is a good way to go.

How is it different from doing machine learning in R or SAS?

Unless you are highly proficient in Java, the coding itself is a big overhead. There’s no way around it: if you don’t know it already, you are going to need to learn Java, and it’s not a language that flows! For R users who are used to seeing their thoughts realised immediately, the endless declaration and initialisation of objects is going to seem like a drag. For that reason I would recommend sticking with R for any kind of data exploration or prototyping, and switching to Mahout as you get closer to production.
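
To see what I mean, here’s a toy sketch – taking the mean of a vector – with the one-line R equivalent in a comment (the class name is made up):

// MeanExample.java
public class MeanExample {

    // In R this is just: v <- c(1, 2, 3); mean(v)
    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 3.0};
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        System.out.println("Mean: " + sum / v.length);
    }
}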

What do you need to do to get started?

You’ll need to install the JDK (Java Development Kit) and some kind of Java IDE (I like netbeans). You’ll also need maven (see below) to organise your code and its dependencies. A book is always useful. The only one around, it seems, is Mahout in Action, but it’s good and all the code for the examples is available for download. If you plan to run it on hadoop (which is recommended) then of course you need that too. If you’re going to be using hadoop in earnest you’ll need an AWS account (assuming you don’t have your own grid). Finally you’ll need the mahout package itself. I found this all a lot easier on Linux, with its natural affinity with other open source projects. You are welcome to follow my notes below on how to get this all up and running on an AWS Ubuntu instance.

Object Orientated

R is a nice, gentle introduction to object-orientated programming. If you’ve declared your own classes and methods using S3 you’re on your way. Even more so if you’ve used S4 (must admit I haven’t). Even so, there’s a big jump to the OO world of Java. Here are a few tips:

  • To get something that executes, include a method inside your class that begins public static void main(String[] args). An IDE like netbeans will pick this up and allow you to run that file. See here for a Hello World example, or the sketch just after this list.
  • Remember every variable needs to be both declared and initialised, and for everything that is not a Java literal this means creating a new instance of an object (I keep forgetting to include new when initialising).
  • The easy R world of a script and a few functions is not an option. Everything should be an object or something pertaining to one. I find the easiest way to make this jump is to imagine I’m making bits of a machine and make an effort to keep this in my head. Everything is now like a little robot with data on the inside and predefined actions and responses.
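
To pull those tips together, here’s a minimal sketch of a runnable class (nothing Mahout-specific, and the names are made up):

// Robot.java – data on the inside, predefined actions and responses on the outside
public class Robot {

    private final String name;  // declared here...
    private int battery;        // ...and initialised in the constructor

    public Robot(String name) {
        this.name = name;
        this.battery = 100;
    }

    public void report() {
        System.out.println(name + " reporting, battery at " + battery + "%");
    }

    // The entry point an IDE like netbeans looks for
    public static void main(String[] args) {
        Robot r2 = new Robot("R2");  // don't forget new!
        r2.report();
    }
}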

Some useful terms

Maven – a piece of software used by Mahout for managing project builds. It is similar to the package-writing tools in R but more flexible.

JDK and JRE – the first is the Java Development Kit, the software needed to write code in Java; the second is the Java Runtime Environment, the software that executes Java code. The JRE will be on the machine of anyone who runs anything that uses Java (i.e. most people).

AWS – Amazon Web Services, a cloud computing platform. We’ve quite a few posts on this subject. Here it is significant because it’s what you’ll need to run hadoop if you’ve not got your own grid.

Hadoop and map-reduce – there are a million online tutorials on these, but very quickly: map-reduce is a powerful programming model for parallelizing a very large class of tasks, and Hadoop is an open source software framework that implements it. If you’ve used the parallel library in R then you’ve seen something similar on a much smaller scale (although I’m not sure whether it is formally map-reduce).
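
To make this concrete, here’s a minimal sketch of the map half of the canonical word-count job, written against the Hadoop 1.x mapreduce API (the full job also needs a reducer and a driver class, which I’ve left out):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in its input split. The framework runs many
// copies of this in parallel, then groups the pairs by word and hands each
// group to a reducer that sums the counts.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}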

netbeans – a good IDE for Java (there are many others). If you use R Studio for R it’s the same kind of thing but less stripped down; if you use Eclipse (which can also be used for Java) then you are already familiar with the set-up.

Some general tips

  • When Mahout installs it does a lot of checks. I found it kept failing certain ones and this prevented the whole thing from installing. I skipped the checks by running mvn with the -DskipTests option and so far I’ve had no issues.
  • I found it very useful when running the examples in Mahout in Action to explore the objects using the netbeans debugger. This allows you to inspect the objects, giving you a good sense of how it all hangs together.
  • Here’s a nice post explaining the map-reduce algorithm
  • Don’t forget to install the Maven plug-in in netbeans otherwise you’ll be struggling when executing the Mahout examples
  • Do a bit of Java programming to get your head into it (it might not be your thing but I downloaded and adapted this space invaders example)

My notes for setting up Mahout and running a simple example

This worked for me as of April 2013 on an AWS Ubuntu image (see earlier posts for setting this up). Obviously I’m referring to my particular directory set-up, so you’ll need to change it appropriately here and there, and in particular change the versions of hadoop, maven and mahout to the latest. Thanks to the following post for the example.

Apologies, it’s a bit raw but it gets you from the beginning to the end.

Install Java JDK 7

java -version [check whether java is installed]
sudo apt-get update
sudo apt-get install openjdk-7-jdk

Download and install hadoop

cd /home/ubuntu
wget http://mirror.rmg.io/apache/hadoop/common/hadoop-1.0.4/hadoop-1.0.4.tar.gz
sudo cp hadoop-1.0.4.tar.gz /usr/local [move the file to /usr/local]
cd /usr/local
sudo tar -zxvf  hadoop-1.0.4.tar.gz [unzip the package]
sudo rm hadoop-1.0.4.tar.gz [remove the package]

Set up environment variables

printenv
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop-1.0.4
export PATH=$PATH:$HADOOP_HOME/bin

Set up variable permanently

sudo vi /etc/environment

Add

    JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
    HADOOP_HOME=/usr/local/hadoop-1.0.4

Append “:/usr/local/hadoop-1.0.4/bin” to the PATH line (variables like $HADOOP_HOME aren’t expanded in /etc/environment, so use the full path)

Test hadoop is working

$HADOOP_HOME/bin/hadoop [displays help files]

Run the stand-alone example

cd /usr/local/hadoop-1.0.4
sudo mkdir input
sudo cp conf/*.xml input
sudo bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
sudo cat output/*

Install maven

sudo apt-get update
sudo apt-get install maven
mvn -version [to check it installed ok]

Install mahout

cd /home/ubuntu
wget http://apache.mirrors.timporter.net/mahout/0.7/mahout-distribution-0.7-src.tar.gz
sudo tar -zxvf mahout-distribution-0.7-src.tar.gz
sudo cp -r /home/ubuntu/mahout-distribution-0.7 /usr/local
cd /usr/local
sudo mv mahout-distribution-0.7 mahout
cd mahout/core
sudo mvn -DskipTests install
cd ../examples
sudo mvn install

Create a maven project

cd /usr/local/mahout
sudo mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.unresyst -DartifactId=mahoutrec
cd mahoutrec
sudo mvn compile
sudo mvn exec:java -Dexec.mainClass="com.unresyst.App" [to print hello world]
sudo vi pom.xml

Then insert into pom.xml

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>3.8.1</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.7</version>
</dependency>
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-math</artifactId>
  <version>0.7</version>
</dependency>

Also add

<relativePath>../pom.xml</relativePath>

to the parent clause of the mahoutrec pom.xml

Create recommender

Create a datasets directory in the mahoutrec folder

Add the csv file from https://code.google.com/p/unresyst/wiki/CreateMahoutRecommender

Create the java file listed in the post above in src/main/java/com/unresyst/
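
If you’d like a feel for what that class does before copying it across, here’s a minimal sketch of a Taste user-based recommender in Mahout 0.7. The file name, user ID and neighbourhood size are illustrative only, and the actual UnresystBoolRecommend class uses boolean preferences (for which you’d swap in the GenericBooleanPref variants):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleRecommender {

    public static void main(String[] args) throws Exception {
        // Illustrative file name – point this at the csv you saved to datasets/
        // (each line is userID,itemID,preference)
        DataModel model = new FileDataModel(new File("datasets/mydata.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}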

Go back to the project directory and run

sudo mvn compile
sudo mvn exec:java -Dexec.mainClass="com.unresyst.UnresystBoolRecommend"

Follow steps in D&L post to set up NX server

Set up netbeans

Download and install netbeans using the Ubuntu software centre

Tools >> Plugins >> Settings
Enable all centres
Install the Maven plug-in

Install Git

sudo apt-get install git

Download the repository for Analysing Data with Hadoop

cd /home/ubuntu
mkdir repositories
cd repositories
git clone https://github.com/tomwhite/hadoop-book.git

Download the repository for Mahout in Action

git clone https://github.com/tdunning/MiA.git

Running the hadoop MaxTemperature example

Set up a new directory and copy across the example files:

cp /home/ubuntu/repositories/hadoop-book/ch02/src/main/java/* /home/ubuntu/hadoopProjects/maxTemp

Make a build/classes directory within maxTemp, then from the maxTemp directory run:

javac -verbose -classpath /usr/local/hadoop-1.0.4/hadoop-core-1.0.4.jar MaxTemperature*.java -d build/classes
export HADOOP_CLASSPATH=/home/ubuntu/hadoopProjects/maxTemp/build/classes
hadoop MaxTemperature /home/ubuntu/repositories/hadoop-book/input/ncdc/sample.txt output

To run the mahout example through netbeans, just open the mahoutrec maven project in the IDE and run it from there.

Two Quick Recipes: Ubuntu and Hadoop

There are so many flavours of everything, and things are changing so quickly, that I find every task researched online ends up being a set of instructions stitched together from several blogs and forums. Here are a couple of recent ones.

Ubuntu on AWS (50 mins)

I was going to buy a new laptop, but it made more sense to set up a Linux instance on AWS and remote in (a quarter of the price and more interesting). Here’s my recipe:

    1. As in Mark’s earlier post, set yourself up with an AWS account and a key pair by following this tutorial
    2. Launch an Ubuntu instance using the EC2 management console, selecting memory and processing power to suit.
    3. Start up the instance, then connect to it using Mindterm (a very useful alternative to SSHing in with putty). To do this just select the instance in the EC2 console, select Actions and then Connect. (You’ll need to provide the path to your saved key.)
    4. Now you probably want to remote into your machine. Do this by setting up NoMachine NX following steps 2 to 4 in the following post
    5. However, when you execute the last line of step 2 you’ll find that nxsetup is not found. To fix this switch to this post and follow steps 6-7 (life’s so complicated)

    6. Change PasswordAuthentication to yes in /etc/ssh/sshd_config
    7. Add GNOME fallback:

sudo apt-get install gnome-session-fallback

  8. Restart the instance and log in

Just remember to keep an eye on the charges!

Single Cluster Hadoop on Ubuntu (20 mins)

Of course you can run Hadoop directly on Amazon’s EMR platform, but if you want to get more of a feel for how it works in a familiar environment you can set it up on a single instance.

  1. Follow the instructions in this post, substituting in the latest hadoop stable release
  2. Install the latest JDK: sudo apt-get install openjdk-7-jdk
  3. Set the JAVA_HOME path variable: export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64, substituting in the path to the JDK binaries
  4. From the Hadoop quick start guide follow the instructions in the “Prepare to start the Hadoop Cluster” and “Standalone Operation” sections. If this all works you should be ready to go.

EC2 Tutorials: rJava – annoying enough to have its own blog post

One of the most frustrating items that I’ve been trying to install on my EC2 instance is rJava. It’s an R package that lots of other packages have as a dependency, including glmulti and MongoDB.

I’ve spent a fair few hours trying to get this installed, constantly receiving the error message:

configure: error: Java Development Kit (JDK) is missing or not registered in R
Make sure R is configured with full Java support (including JDK). Run
R CMD javareconf
as root to add Java support to R.

I dutifully ran the recommended line of code (several times, as root, as ec2-user, etc. etc. etc.)

About half way through this journey I checked another EC2 instance with R installed and received the same error, which at least reassured me that I had not completely screwed up my R installation, so I kept persevering.

My main line of inquiry was checking that the JDK was installed properly – from my limited Linux experience it looked OK, but I also came across some forum posts.

Cutting a long story short, I finally came across this post and found this beautiful line of code, which hopefully will sort out your problems (or at least check off the problem that I had):

yum install java-1.6.0-openjdk-devel

EC2 Tutorial: NumPy and SciPy

Another quick note for getting set up on your EC2 instance. To install SciPy, you first need to install ATLAS and LAPACK. The following few lines of code, run as root (sudo bash), should sort you out:

yum -y install atlas-devel
yum install lapack
pip install scipy

EC2 Tutorials: Scheduling tasks on EC2 using Crontab

One of my main reasons for wanting an EC2 instance was to be able to automatically run scripts at certain times, normally to collect data and save it to a database. As my EC2 instance is always running, I can forget about it for a month and have a month’s worth of data ready and waiting.

Want to schedule scripts so that they automatically run at set times or at set intervals (e.g. every day at 1pm or every 10 minutes)? No problem: Crontab is the easy-to-use Linux scheduling program that allows such execution in a no-frills, straightforward way.

The syntax works along the lines of:

<minute> <hour> <day of month> <month> <weekday> <task>

For example:

0 2 * * * python /location/of/script.py

will run script.py at 2am every day. Simples. (* is essentially a placeholder to say “do it for all integer values”; a step value like */10 in the minute field means “every 10 minutes”.)

Lots more detail can be found here, here and here.

To create or access your crontab, type crontab -e on the EC2 command line.

To see what’s scheduled by crontab, type crontab -l on the EC2 command line.

EC2 Tutorials: Installing new software; yum, pip, easy_install, apt-get

For anyone familiar with python and easy_install, Amazon Linux uses “yum” as its easy installation system, and it is possible to install “pip” and “easy_install” to install new python packages.

As I’ve tried to install new software on my box, I’ve found lots and lots of references to apt-get as the standard way to install software where you’re using the Ubuntu flavour of Linux. As an FYI, Amazon Linux is based on Red Hat Linux (so apt-get doesn’t work with an Amazon Linux EC2 box).

Here’s a list of handy stuff to install (thanks to this post for a good starting point):

yum install unixODBC-devel
yum install gcc-gfortran
yum install gcc-c++
yum install readline-devel
yum install libpng-devel libX11-devel libXt-devel
yum install texinfo-tex
yum install tetex-dvips
yum install docbook-utils-pdf
yum install cairo-devel
yum install java-1.6.0-openjdk-devel
yum install libxml2-devel
yum install mysql 
yum install mysql-server 
yum install mysql-devel

For Python packages, there are a couple of options available: easy_install and pip. As pip notes on its PyPI page:

pip is a replacement for easy_install. It mostly uses the same techniques for finding packages, so packages that are easy_installable should be pip-installable as well. This means that you can use pip install SomePackage instead of easy_install SomePackage.

I’ve found that pip tends to be easier to use than easy_install, and more reliable (based on where it installs packages to), so I would recommend pip as the best way to install packages for Python. This link shows you how to install it. I installed it from source onto EC2.

EC2 Tutorials: Getting Started on Amazon Web Services

I’ve been interested in setting up an Amazon Web Services EC2 instance for a while – essentially a remote desktop in the cloud, which can be handy when you want an always-on machine (say, to run scripts at particular times, or to have easy access to a particular machine setup).

Over the next few weeks, I’m publishing a series of posts that aim to give a decent introduction to getting started on Amazon Web Services. There are a lot of great tutorials out there already (which I’ll link to and won’t reinvent the wheel), but my aim here is to pull them all together and also share solutions to a few stumbling blocks that I’ve come across.

Background

For a while now, Amazon have been offering a free Micro EC2 instance for the first year of being an Amazon customer. An instance is very similar to the computing bit of a PC, and in the case of a Micro instance it’s like a fairly low-powered machine – it isn’t designed for performance.

Having said that, now I’ve had my instance running for over 500 hours, I’ve discovered I’m able to do a lot more with it than I thought I’d be able to. Firstly, I’ve set up R Studio Server on the instance, so I can write code in the cloud very easily. Secondly, I can connect to it to query databases and the like. Thirdly, I can take an image of my instance and load that up on a much more powerful machine.

One thing to note is that the Free Tier Instance is based on Amazon Linux – not Windows – so there’s quite a steep learning curve if you haven’t operated in a Linux environment before (more on that later).

I’ve set my EC2 machine up to run a number of python scripts that go off to various websites and pull data from them into a set of MySQL and MongoDB databases. As the machine is always on, I don’t have to worry about someone turning off my machine or forgetting to run a script at a certain time.

One thing to note is that although Amazon offer the Free Tier for new AWS customers, there are lots of things that incur costs (for instance, choosing a Small instance rather than a Micro instance), so please keep an eye on the meter (which can be found in the top left corner – Account Activity under My Account / Console).

Getting Started

There are many good resources on getting set up with Amazon Web Services. There are some slight differences depending on what OS you are using.

I used this guide to get set up on my Mac and this one to get set up on Windows.

Location, Location, Location

One thing to note is that your EC2 instance can be hosted in a number of locations around the globe. The location that you choose shouldn’t make much difference in terms of performance, but you should bear in mind that any services which screen your IP address (like Betfair) may not work in certain locations. It’s possible to move an instance between locations after you’ve set it up, but from my experience it’s not the easiest thing to do.

In the next post, I’ll provide some links on how to get set up installing the likes of R and Python, along with a few introductory lines of Linux.

A few handy terms

Getting started with EC2 can be quite a steep learning curve, so as I go through this series, I’ll try and explain the new terms that you might come across.

Security Group: The security settings for the machine – predominantly which ports to open (to allow users to access the machine via, say, the web or a database program).

Instance: (or EC2 instance, to give it its fuller name). This is essentially the remote machine that you’re connecting to. These can vary in size from Micro (cheapest and smallest), through Small, to Large and more.

AMI: Amazon Machine Image – a pre-configured operating system image. E.g. it might be pre-configured with R, Python and MySQL, which makes it much easier to get started with Amazon Web Services.

Keypair: I’ve come to think of this like a supercharged password. Essentially, it’s a file that contains a long string which is used in place of a password. Much more secure.

Elastic Block Storage: Storage which is attached to the EC2 machine (so it’s very similar to the hard drive in your PC).
