One of the drawbacks of R has been its difficulty with big datasets: it stores everything in RAM, so once you have more than 100K records your PC really starts to slow down. However, since AWS lets you use a machine of any size, you can now consider using R for scoring your models on larger datasets. Just fire up a meaty EC2 instance with the RStudio Amazon Machine Image (AMI) and off you go.
With this in mind, I wondered how long it would take to score a Neural Net depending on how many variables were involved and how many records needed scoring. There was only one way to find out.
If you've not done it before, here are some simple instructions for getting an EC2 instance with R installed up and running, and then accessing it.
- Register with AWS
- Launch an EC2 instance with the RStudio AMI. On "Step 6: Configure Security Group", set the type to 'HTTP' and the port to 80, and I'd suggest setting the source to your IP to ensure only your IP can access it.
- Once the EC2 instance is running, access it by pasting its public IP into a web browser, then enter rstudio as both the username and password.
For my job I fired up a c3.8xlarge EC2 instance, which has 60GB of RAM and cost me around $3 an hour.
Here's the R script I ran. It fits a simple Neural Net model with 10 hidden nodes, built on x random normal variables, and then scores y records. I ran this across a range of variable counts and record counts, timing how long the scoring step took: 5 to 100 variables and 100 to 10,000,000 records.
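The exact script isn't reproduced here, but a minimal sketch of the approach looks like the following. The use of the nnet package, the variable names, and the training-set size of 1,000 rows are my assumptions; the key step is timing `predict()` over the scoring set.

```r
# Sketch of the benchmark: fit a Neural Net on random normal variables,
# then time how long scoring a large set of records takes.
# (nnet package and all names here are assumptions, not the original script.)
library(nnet)

n_vars  <- 10      # number of predictor variables (5 to 100 in the test)
n_score <- 100000  # number of records to score (100 to 10,000,000 in the test)

# Small training set of random normal variables with a random binary target
train <- as.data.frame(matrix(rnorm(1000 * n_vars), ncol = n_vars))
train$target <- rbinom(1000, 1, 0.5)

# Neural Net with 10 hidden nodes
model <- nnet(target ~ ., data = train, size = 10, trace = FALSE)

# Generate the records to score, then time just the scoring step
score_data <- as.data.frame(matrix(rnorm(n_score * n_vars), ncol = n_vars))
elapsed <- system.time(scores <- predict(model, score_data))["elapsed"]
print(elapsed)
```

Looping this over the grid of variable and record counts, and recording `elapsed` each time, gives the timing data plotted below.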
Here's the output, plotted using ggplot:
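A sketch of how such timings can be plotted with ggplot2 is below. The data frame layout (columns `variables`, `records`, `seconds`) and the values in it are illustrative assumptions, not the measured results; only the ~400-second figure for 10M records and 100 variables comes from the text.

```r
# Plot scoring time against record count, one line per variable count.
# The timings data frame here is illustrative, not the measured results.
library(ggplot2)

timings <- data.frame(
  variables = rep(c(5, 50, 100), each = 3),
  records   = rep(c(1e2, 1e5, 1e7), times = 3),
  seconds   = c(0.1, 2, 150, 0.2, 8, 300, 0.3, 15, 400)  # assumed values
)

p <- ggplot(timings, aes(x = records, y = seconds,
                         colour = factor(variables))) +
  geom_line() +
  scale_x_log10() +  # record counts span 5 orders of magnitude
  labs(x = "Records scored", y = "Scoring time (s)", colour = "Variables")
print(p)
```

The log scale on the x-axis keeps the 100-record and 10M-record runs readable on the same plot.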
As you can see, you can score 10M records with a 100-variable Neural Net in 6-7 minutes. Not too shabby.