Needlebase and playing with movie data
After hearing Marshall Kirkpatrick describe Needlebase at a recent talk as a simple way to scrape sites and create a database of info, I had to try it out. With Oscar season coming up, I figured movie data would be a good set to play with, and IMDb a great source for it.
There are a couple of ways to query IMDb data, an API and the raw data files, but both have their limitations. The API I found only accepts 20 requests an hour, and the raw data files are far too raw. It's a good example of where too much data is too much and a smaller, clean data set is better.
My thought was to chart box office gross against movie ratings. Using Needlebase and IMDb's advanced search, I was able to create a database of movies released in 2010 with their box office numbers and ratings. You can view and download my data set here.
I threw the data into R and got a nice chart out of it, but it didn’t really turn up many patterns.
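One quick way to check whether gross and rating move together at all is a correlation coefficient. Here is a minimal sketch in Python rather than R. The rows are made-up placeholders, not my actual data set, and the column names (`title`, `gross_usd_millions`, `rating`) are assumptions about how an export might look.

```python
import csv
import io
import math

# Placeholder rows for illustration only; NOT the real 2010 data set.
# The column names are assumptions about the export format.
SAMPLE_CSV = """title,gross_usd_millions,rating
Movie A,10,5.0
Movie B,20,6.0
Movie C,30,7.0
Movie D,40,6.0
"""


def load_movies(text):
    """Parse CSV text into a list of (title, gross, rating) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (row["title"], float(row["gross_usd_millions"]), float(row["rating"]))
        for row in reader
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


movies = load_movies(SAMPLE_CSV)
gross = [m[1] for m in movies]
ratings = [m[2] for m in movies]
r = pearson(gross, ratings)
print(f"Pearson r between gross and rating: {r:.2f}")
```

A value near 0 would match the "not many patterns" impression from the chart, while a value near 1 or -1 would suggest a strong linear relationship.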

I figured the chart would be a lot more interesting as an interactive tool, so you can see which movie is which. Using Highcharts (originally Protovis), a JavaScript graphing and visualization library, I created this view, which allows some basic interaction and is a bit better way to explore the data.
Note: Protovis does not work in IE browsers, so I switched to Highcharts. Feb 15, 2011
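To make each dot identifiable, every point needs to carry the movie title along with its coordinates. The sketch below shows one way to shape the rows into a Highcharts-style scatter configuration and dump it as JSON for the page to load. It is not the code behind my view. The function name `to_scatter_options` and the sample rows are hypothetical, and the option keys follow current Highcharts conventions (`chart.type`, `series[].data` with `x`, `y`, `name`), which may differ from the version I used in 2011.

```python
import json

# Placeholder rows for illustration only: (title, gross in $ millions, rating).
movies = [
    ("Movie A", 10.0, 5.0),
    ("Movie B", 20.0, 6.0),
    ("Movie C", 30.0, 7.0),
]


def to_scatter_options(rows):
    """Build a dict shaped like a Highcharts scatter chart configuration.

    Each point keeps the title in `name` so a tooltip can show
    which movie a dot represents.
    """
    points = [
        {"x": rating, "y": gross, "name": title}
        for title, gross, rating in rows
    ]
    return {
        "chart": {"type": "scatter"},
        "title": {"text": "Box office gross vs. rating, 2010"},
        "xAxis": {"title": {"text": "Rating"}},
        "yAxis": {"title": {"text": "Gross ($ millions)"}},
        "series": [{"name": "Movies", "data": points}],
    }


options = to_scatter_options(movies)
print(json.dumps(options, indent=2))
```

The printed JSON can be handed to the charting library on the page, and a tooltip formatter can then read each point's `name` to label it.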