I keep getting a warning message saying all text should be in a section when I compile my package

This is just due to the default placeholder text that is generated automatically in your package documentation. If you go into the man directory and open up yourpackage.Rd, you'll see bits of placeholder text surrounded by ~~. Delete these, including the ~~ themselves, and the warning should go away.

 

I am having trouble getting ggplot to work inside a function. How do I pass the column names as parameters

To do this you'll need to use aes_string() rather than aes(), since aes_string() accepts the column names as character strings. Here is an example:

library(ggplot2)
data(iris)
function_around_ggplot <- function(data_arg, x_arg, y_arg, facet_arg) {
  # aes_string() and facet_wrap() both accept the column names as strings
  ggplot(data = data_arg, aes_string(x = x_arg, y = y_arg)) +
    geom_point() +
    facet_wrap(facet_arg)
}
function_around_ggplot(iris, 'Sepal.Length', 'Sepal.Width', 'Species')

In most cases it's probably even better to use melt() from reshape2 (or the older reshape package) to get the data into long format inside the function before calling ggplot, so the plotting code only ever refers to fixed column names.
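
For instance, here is a minimal sketch of that approach (assuming the reshape2 package; the wrapper name plot_melted and the histogram-per-variable layout are illustrative, not from the course code):

library(ggplot2)
library(reshape2)
# Melt to long format inside the function so ggplot only ever sees
# the fixed column names 'variable' and 'value'
plot_melted <- function(data_arg, id_arg) {
  molten <- melt(data_arg, id.vars = id_arg)
  ggplot(molten, aes(x = value)) +
    geom_histogram() +
    facet_wrap(~ variable, scales = "free_x")
}
plot_melted(iris, "Species")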

 

I'm trying to scrape the movie info on a page from Rotten Tomatoes. It works fine except when the second half of the synopsis is hidden under a More button; in such cases I only get the second half of the synopsis.

You need to do a few things: (a) pull out all the strings inside the p element (otherwise it just defaults to the one in the span element, if it's there); (b) strip out the whitespace; (c) before you do either of those, strip out the JavaScript, otherwise it shows up in the strings. You need something like this:

    for link in soup.find_all('p', {'class': 'movie_synopsis'}):
        # strip out the embedded <script> tags first so the JavaScript
        # doesn't show up in the extracted strings
        [x.extract() for x in link.find_all('script')]
        # stripped_strings yields each text fragment with whitespace removed
        for string in link.stripped_strings:
            your code here
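
For example, a minimal sketch along those lines, joining the cleaned-up fragments back into a single synopsis string (assuming soup already holds the parsed film page):

    for link in soup.find_all('p', {'class': 'movie_synopsis'}):
        # Remove the embedded <script> tags so their contents are not extracted
        [x.extract() for x in link.find_all('script')]
        # Join the whitespace-stripped text fragments into one string
        synopsis = ' '.join(link.stripped_strings)
        print(synopsis)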

 

When I run my web scraping script it sometimes times out or I get an HTTP error code.

This is a very common problem. You need a little function that handles the HTTP error codes so that the program pauses for a while and then tries again. This function can be used for both the API work and the web scraping. Here, as an example, I have applied it to the Rotten Tomatoes script:

import time
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def catchURL(queryURL): # Nicked this from someone. Afraid I can't remember who. Sorry
    # Try to open the URL; if the server returns an HTTP error code,
    # pause for a while and then try again
    try:
        queryResponse = urlopen(queryURL)
    except HTTPError as E:
        if E.code in [400, 420]:
            print("400 or 420")
            time.sleep(600)
        elif E.code == 503:
            print("503")
            time.sleep(60)
        else:
            print("Wait 3 mins")
            time.sleep(180)
        # retry once after the pause
        queryResponse = urlopen(queryURL)
    return queryResponse
#Open up web page and put contents (html) into query_response
query_url = "http://www.rottentomatoes.com/top/bestofrt/top_100_science_fiction__fantasy_movies/?category=14"
query_response = catchURL(query_url)
#Turn it into soup!
soup_sf_films = BeautifulSoup(query_response, 'html.parser')
#Get all anchors with target='_top'
films = soup_sf_films.find_all('a', {'target': '_top'})
#Cycle through these extracting the film title and href and store them in a list called film_hrefs
film_hrefs = []
for link in films:
    film_hrefs.append({'href': link.get('href'), 'title': link.text})
#Now we can cycle through this list running our original code but just appending the hrefs to http://www.rottentomatoes.com
for f in film_hrefs:
    #Open up web page and put contents (html) into query_response
    query_url = "http://www.rottentomatoes.com" + f['href']
    query_response = catchURL(query_url)
    #Turn it into soup!
    soup = BeautifulSoup(query_response, 'html.parser')
    #Go through the soup and find all spans with itemprop = ratingValue
    #take out the text and put each in an array of strings called text_contents
    text_contents=[]
    for link in soup.find_all('span',{'itemprop': 'ratingValue'}):
        text_contents.append(link.string)
    #Pull out the two bits of text we want
    tomatometer = text_contents[0]
    aud_rating = text_contents[2]
    print(f['title'], tomatometer, aud_rating)

 
