2020-05-01

Create a mini IMDb database in R

Introduction

In a previous post we showed how to extract movie info in R using the imdbapi package. In this post we will create a mini IMDb database using the same package.

Solution

An API key is required and must be requested before running the code. With the free version, the maximum number of requests per day is 1,000.
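Because of the daily cap, it can help to check in advance how many days a list of IDs will need. The following is only a minimal sketch: the helper `split_into_daily_batches` and the environment variable name `IMDB_API_KEY` are assumptions made for illustration and are not part of the imdbapi package.

```r
# Hypothetical helper: split a vector of IDs into batches that respect a daily request cap.
split_into_daily_batches <- function(ids, daily_limit = 1000) {
  # Batch index for each ID: 1 for the first `daily_limit` IDs, 2 for the next ones, and so on.
  batch <- ceiling(seq_along(ids) / daily_limit)
  unname(split(ids, batch))
}

# Assumption: the key is kept in an environment variable instead of being hard-coded.
api_key <- Sys.getenv("IMDB_API_KEY", unset = NA)

# Placeholder IDs, generated only to count requests.
example_ids <- paste0("id_", 1:588)
batches <- split_into_daily_batches(example_ids)
length(batches)
```

With 588 IDs, as in the list used in this post, everything fits in a single daily batch; a longer list would be spread over several days.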

First we need a vector containing the movie titles or the IMDb IDs (e.g., for Vertigo the ID is the string tt0052357, the last section of the URL https://www.imdb.com/title/tt0052357/). In our example we will use the list containing the results from the Sight and Sound 2012 poll of 846 critics, restricted to the films that received at least 3 votes.
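If only the URLs are at hand, the IDs can be pulled out with a regular expression. This is a small sketch in base R; the function name `extract_imdb_id` is hypothetical, and it assumes the ID is the first run of "tt" followed by digits in the URL.

```r
# Extract the IMDb ID ("tt" followed by digits) from one or more title URLs.
extract_imdb_id <- function(url) {
  # regexpr() locates the first match in each URL; regmatches() returns the matched text.
  # URLs without a match are dropped from the result.
  regmatches(url, regexpr("tt[0-9]+", url))
}

extract_imdb_id("https://www.imdb.com/title/tt0052357/")
# [1] "tt0052357"
```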

library(imdbapi)
library(data.table)
library(tidyverse)
sight_sound <- read.csv("https://sites.google.com/site/nubededatosblogspotcom/Sight&Sound2012-CriticsPoll.txt", stringsAsFactors = FALSE)
glimpse(sight_sound)
Observations: 588
Variables: 17
$ const                          "tt0052357", "tt0033467", "tt004643...
$ position                       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ created                        "Thu Aug 16 07:42:05 2012", "Thu Au...
$ description                    NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ modified                       "Thu Aug 16 07:42:05 2012", "Thu Au...
$ Title                          "Vertigo", "Citizen Kane", "Tôkyô m...
$ Directors                      "Alfred Hitchcock", "Orson Welles",...
$ Title.type                     "Feature Film", "Feature Film", "Fe...
$ IMDb.Rating                    8.5, 8.5, 8.2, 8.0, 8.3, 8.3, 8.0, ...
$ PeacefulAnarchy.rated          10, 9, 10, 9, 9, 6, 6, 10, 8, 9, 6,...
$ Runtime..mins.                 128, 119, 136, 110, 94, 160, 119, 6...
$ Genres                         "mystery, romance, thriller", "dram...
$ Year                           1958, 1941, 1953, 1939, 1927, 1968,...
$ Num.Votes                      153502, 205699, 16219, 14872, 19188...
$ Release.Date..month.day.year.  "1958-05-09", "1941-05-01", "1953-1...
$ Id                             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ URL                            "http://www.imdb.com/title/tt005235...
We use the function lapply to extract the info for all IMDb IDs.

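# Query the API once per IMDb ID; a failed request returns NULL instead of stopping the loop
tt <-
  lapply(sight_sound$const, function(x) {
    return(tryCatch(
      find_by_id(
        x,
        type = NULL,
        year_of_release = NULL,
        plot = "full",
        include_tomatoes = TRUE,
        api_key = "12345678" # placeholder: replace with your own API key
      ),
      error = function(e)
        NULL
    ))
  })
# Stack the per-title results into one table, filling missing columns with NA
df_sight_sound <- rbindlist(tt, fill = TRUE)
# Convert the Ratings column to character
df_sight_sound$Ratings <- as.character(df_sight_sound$Ratings)
df_sight_sound <- as.data.frame(df_sight_sound)
# Count the distinct titles retrieved so far
df_sight_sound %>% distinct(imdbID) %>% summarise(n = n())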
    n
1 586
After running the code, some titles may be missing; in our example, two titles. We repeat the process until we obtain all of them.

# Checking missing titles
m <- subset(sight_sound, !(const %in% df_sight_sound$imdbID))$const 
m
[1] "tt0115751" "tt0032551"
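The post does not show how the process is repeated, so the following is only one possible sketch of a retry loop. The helper `fetch_until_complete` and the stub `flaky_fetch` are hypothetical; the stub simulates failed requests so that the example runs without an API key.

```r
# Hypothetical helper: call `fetch_one` for each ID and retry the failed IDs
# for up to `max_rounds` rounds.
fetch_until_complete <- function(ids, fetch_one, max_rounds = 5) {
  results <- list()
  pending <- ids
  for (round in seq_len(max_rounds)) {
    if (length(pending) == 0) break
    for (id in pending) {
      # A failed request yields NULL instead of stopping the loop.
      res <- tryCatch(fetch_one(id), error = function(e) NULL)
      if (!is.null(res)) results[[id]] <- res
    }
    # Keep only the IDs that still have no result.
    pending <- setdiff(pending, names(results))
  }
  list(results = results, missing = pending)
}

# Stub fetcher for illustration only: every ID fails on its first attempt.
attempts <- new.env()
flaky_fetch <- function(id) {
  n <- if (is.null(attempts[[id]])) 1 else attempts[[id]] + 1
  assign(id, n, envir = attempts)
  if (n == 1) stop("simulated request failure")
  data.frame(imdbID = id, stringsAsFactors = FALSE)
}

out <- fetch_until_complete(c("tt0115751", "tt0032551"), flaky_fetch)
length(out$missing)
```

In practice, `fetch_one` would be a function that wraps the same `find_by_id` call used earlier, and the results of the retry would be appended to `df_sight_sound`. Capping the number of rounds avoids spending the daily request quota on an ID that never resolves.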
Finally, we keep only the rows whose Ratings value contains "Internet" and remove duplicates, so that each title appears once.

df_sight_sound <- df_sight_sound %>% 
  filter(grepl("Internet",Ratings)) %>% 
  group_by(imdbID) %>% 
  distinct()
# A tibble: 588 x 26
# Groups:   imdbID [588]
   Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
 1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
 2 Citi~ 1941  PG    1941-09-05 119 min Dram~ Orson W~ Herma~ Josep~ "A g~
 3 Toky~ 1953  NOT ~ 1972-03-13 136 min Drama Yasujir~ Kôgo ~ Chish~ An e~
 4 The ~ 1939  NOT ~ 1950-04-08 110 min Come~ Jean Re~ Jean ~ Nora ~ Avia~
 5 Sunr~ 1927  NOT ~ 1927-11-04 94 min  Dram~ F.W. Mu~ Carl ~ Georg~ "In ~
 6 2001~ 1968  G     1968-05-12 149 min Adve~ Stanley~ Stanl~ Keir ~ "\"2~
 7 The ~ 1956  PASS~ 1956-05-26 119 min Adve~ John Fo~ Frank~ John ~ Etha~
 8 Man ~ 1929  NOT ~ 1929-05-12 68 min  Docu~ Dziga V~ Dziga~ Mikha~ This~
 9 The ~ 1928  NOT ~ 1928-10-25 114 min Biog~ Carl Th~ Josep~ Maria~ The ~
10 8½    1963  NOT ~ 1963-06-25 138 min Drama Federic~ Feder~ Marce~ Guid~
# ... with 578 more rows, and 16 more variables: Language,
#   Country, Awards, Poster, Ratings,
#   Metascore, imdbRating, imdbVotes, imdbID,
#   Type, DVD, BoxOffice, Production, Website,
#   Response, totalSeasons
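To see what the filtering and de-duplication step does, here is a toy illustration in base R rather than dplyr. The data frame and its values are made up for the example; only the pattern "Internet" comes from the code above.

```r
# Toy data: one title appears three times, once with a different Ratings value.
toy <- data.frame(
  imdbID  = c("id_1", "id_1", "id_1", "id_2"),
  Ratings = c("Internet (toy value)", "Internet (toy value)",
              "Other (toy value)", "Internet (toy value)"),
  stringsAsFactors = FALSE
)

# Keep the rows whose Ratings text contains "Internet".
filtered <- toy[grepl("Internet", toy$Ratings), ]

# Drop rows that are exact duplicates of an earlier row.
deduped <- filtered[!duplicated(filtered), ]
nrow(deduped)
```

After both steps each toy ID appears exactly once, which mirrors the one-row-per-title result shown in the tibble above.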
To export the final results as a CSV file:

write.csv(df_sight_sound, "df_sight_sound.csv", row.names = FALSE)
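A quick way to check an export is to read the file back and compare it with the original. The sketch below uses a made-up toy data frame and a temporary file; it also shows a list column being converted with `as.character` before writing, as was done for Ratings earlier, since `write.csv` typically does not accept list columns.

```r
# Toy data frame with a list column (values are placeholders).
toy <- data.frame(imdbID = c("id_1", "id_2"), stringsAsFactors = FALSE)
toy$Ratings <- list(c("Source A", "value 1"), c("Source A", "value 2"))

# Convert the list column to one character string per row before exporting.
toy$Ratings <- as.character(toy$Ratings)

# Write to a temporary file and read it back.
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

# The round trip should preserve the number of rows and the IDs.
nrow(back) == nrow(toy)
```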

References

Nube de datos