Introduction
In a previous post we showed how to extract movie info in R using the imdbapi package. In this post we will create a mini IMDb database using the same package.
Solution
With the free version of the API, the maximum number of requests per day is 1,000. We first need to request an API key.
First we need a vector containing the movie titles or the IMDb IDs (e.g., for Vertigo, the last section of the URL https://www.imdb.com/title/tt0052357/, that is, the string tt0052357). In our example we will use the list containing the results of the Sight and Sound 2012 poll of 846 critics; these are the films that received at least 3 votes.
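The example list used in this post stays under the daily limit, but a longer list would have to be spread over several days. A minimal sketch in base R of one way to do that, assuming a hypothetical character vector ids (made-up identifiers here) and the 1,000-requests-per-day limit mentioned above:

```r
# Hypothetical vector of IMDb IDs: 2,500 made-up identifiers for illustration.
ids <- sprintf("tt%07d", 1:2500)

# Daily request limit of the free tier, as stated in the text.
daily_limit <- 1000

# Give each ID a batch number (1, 1, ..., 2, 2, ...) and split on it.
batches <- split(ids, ceiling(seq_along(ids) / daily_limit))

# Number of days needed and the size of each daily batch.
length(batches)
lengths(batches)
```

Each element of batches could then be processed on a different day.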
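When only the URLs are at hand, the ID can be pulled out with a regular expression. The helper below, extract_imdb_id, is a hypothetical name introduced for illustration and not part of imdbapi; it assumes URLs of the form shown above.

```r
# Extract the IMDb ID ("tt" followed by digits) from an IMDb title URL.
# Returns NA for strings that do not contain a /title/tt... segment.
extract_imdb_id <- function(url) {
  id <- sub("^.*/title/(tt[0-9]+).*$", "\\1", url)
  id[!grepl("/title/tt[0-9]+", url)] <- NA_character_
  id
}

extract_imdb_id("https://www.imdb.com/title/tt0052357/")
```

The function is vectorised, so it could also be applied to a whole column of URLs.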
library(imdbapi)
library(data.table)
library(tidyverse)
sight_sound <- read.csv("https://sites.google.com/site/nubededatosblogspotcom/Sight&Sound2012-CriticsPoll.txt", stringsAsFactors = FALSE)
glimpse(sight_sound)
Observations: 588
Variables: 17
$ const "tt0052357", "tt0033467", "tt004643...
$ position 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ created "Thu Aug 16 07:42:05 2012", "Thu Au...
$ description NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ modified "Thu Aug 16 07:42:05 2012", "Thu Au...
$ Title "Vertigo", "Citizen Kane", "Tôkyô m...
$ Directors "Alfred Hitchcock", "Orson Welles",...
$ Title.type "Feature Film", "Feature Film", "Fe...
$ IMDb.Rating 8.5, 8.5, 8.2, 8.0, 8.3, 8.3, 8.0, ...
$ PeacefulAnarchy.rated 10, 9, 10, 9, 9, 6, 6, 10, 8, 9, 6,...
$ Runtime..mins. 128, 119, 136, 110, 94, 160, 119, 6...
$ Genres "mystery, romance, thriller", "dram...
$ Year 1958, 1941, 1953, 1939, 1927, 1968,...
$ Num.Votes 153502, 205699, 16219, 14872, 19188...
$ Release.Date..month.day.year. "1958-05-09", "1941-05-01", "1953-1...
$ Id 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ URL "http://www.imdb.com/title/tt005235...
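We use the function lapply to extract the info for all IMDb IDs; tryCatch returns NULL for any request that fails, so a single error does not stop the loop.
tt <- lapply(sight_sound$const, function(x) {
  # Return NULL when a request fails instead of stopping
  tryCatch(
    find_by_id(
      x,
      type = NULL,
      year_of_release = NULL,
      plot = "full",
      include_tomatoes = TRUE,
      api_key = "12345678"  # placeholder: replace with your own API key
    ),
    error = function(e) NULL
  )
})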
df_sight_sound <- rbindlist(tt, fill = TRUE)
df_sight_sound$Ratings <- as.character(df_sight_sound$Ratings)
df_sight_sound <- as.data.frame(df_sight_sound)
df_sight_sound %>% distinct(imdbID) %>% summarise(n= n())
n
1 586
After running the code, some titles may be missing; in our example, two titles are. We repeat the process until we obtain all of them.
# Checking missing titles
m <- subset(sight_sound, !(const %in% df_sight_sound$imdbID))$const
m
[1] "tt0115751" "tt0032551"
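Repeating the request for the missing titles can be automated. The following is a minimal sketch in base R, not the code used for this post: fetch_until_complete and flaky_fetch are hypothetical names, and the stub simulates a request that fails on the first attempt so that the example runs without an API key.

```r
# Retry a fetch function on the IDs that are still missing.
# fetch_one is assumed to take one ID and return a one-row data frame,
# or to raise an error when the request fails.
fetch_until_complete <- function(ids, fetch_one, max_rounds = 5) {
  results <- list()
  missing <- ids
  for (i in seq_len(max_rounds)) {
    if (length(missing) == 0) break
    fetched <- lapply(missing, function(x) {
      tryCatch(fetch_one(x), error = function(e) NULL)
    })
    ok <- !vapply(fetched, is.null, logical(1))
    results <- c(results, fetched[ok])   # keep the successful rows
    missing <- missing[!ok]              # retry only the failures
  }
  list(data = do.call(rbind, results), missing = missing)
}

# Stub standing in for the real API call: each ID fails on its first
# attempt and succeeds afterwards.
attempts <- new.env()
flaky_fetch <- function(id) {
  n <- if (is.null(attempts[[id]])) 0 else attempts[[id]]
  attempts[[id]] <- n + 1
  if (n == 0) stop("simulated request failure")
  data.frame(imdbID = id, stringsAsFactors = FALSE)
}

out <- fetch_until_complete(c("tt0115751", "tt0032551"), flaky_fetch)
out$missing     # IDs still not retrieved after all rounds
nrow(out$data)  # number of rows retrieved
```

In practice fetch_one would wrap the find_by_id call, and the pieces would be combined with rbindlist(fill = TRUE) rather than rbind, since the returned columns may differ between titles. The max_rounds cap keeps the loop from running forever and from using up the daily request quota.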
Finally, we keep only the rows whose Ratings field contains "Internet" and remove duplicates, so that each title appears once.
df_sight_sound <- df_sight_sound %>%
  filter(grepl("Internet", Ratings)) %>%
  group_by(imdbID) %>%
  distinct()
# A tibble: 588 x 26
# Groups: imdbID [588]
Title Year Rated Released Runtime Genre Director Writer Actors Plot
1 Vert~ 1958 PG 1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
2 Citi~ 1941 PG 1941-09-05 119 min Dram~ Orson W~ Herma~ Josep~ "A g~
3 Toky~ 1953 NOT ~ 1972-03-13 136 min Drama Yasujir~ Kôgo ~ Chish~ An e~
4 The ~ 1939 NOT ~ 1950-04-08 110 min Come~ Jean Re~ Jean ~ Nora ~ Avia~
5 Sunr~ 1927 NOT ~ 1927-11-04 94 min Dram~ F.W. Mu~ Carl ~ Georg~ "In ~
6 2001~ 1968 G 1968-05-12 149 min Adve~ Stanley~ Stanl~ Keir ~ "\"2~
7 The ~ 1956 PASS~ 1956-05-26 119 min Adve~ John Fo~ Frank~ John ~ Etha~
8 Man ~ 1929 NOT ~ 1929-05-12 68 min Docu~ Dziga V~ Dziga~ Mikha~ This~
9 The ~ 1928 NOT ~ 1928-10-25 114 min Biog~ Carl Th~ Josep~ Maria~ The ~
10 8½ 1963 NOT ~ 1963-06-25 138 min Drama Federic~ Feder~ Marce~ Guid~
# ... with 578 more rows, and 16 more variables: Language ,
# Country , Awards , Poster , Ratings ,
# Metascore , imdbRating , imdbVotes , imdbID ,
# Type , DVD , BoxOffice , Production , Website ,
# Response , totalSeasons
To export the final results as a csv:
write.csv(df_sight_sound, "df_sight_sound.csv", row.names = FALSE)