Nube de datos: November 2018

2018-11-29

Building a small IMDb database with R

Introduction

In an earlier post we used the OMDb API to extract information about movies and TV series with R. This time we want to build a small database with the same imdbapi package.

Solution

We use the imdbapi package, which lets us retrieve this information. With the free tier, we are limited to 1,000 requests per day.
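
Because of the daily limit, a vector with more than 1,000 IDs has to be spread over several days. A minimal sketch of one way to do that, assuming a hypothetical helper split_into_batches() (not part of imdbapi) and made-up IDs used only for illustration:

```r
# Hypothetical helper (not part of imdbapi): split a vector of IDs into
# batches that each fit within a daily request limit.
split_into_batches <- function(ids, daily_limit = 1000) {
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / daily_limit)))
}

# Illustrative IDs only: 2,500 made-up identifiers.
ids <- sprintf("tt%07d", 1:2500)
batches <- split_into_batches(ids)

# Number of IDs in each batch
lengths(batches)
```

Each element of batches could then be processed on a different day. The 588 IDs used in this post fit in a single batch.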

The first thing we need is a vector of movie titles or IMDb IDs. For Vertigo, for example, the ID is the string tt0052357 at the end of the address https://www.imdb.com/title/tt0052357/. In our example we use the 2012 Sight & Sound critics' poll, which includes the column const with those IMDb IDs.

library(imdbapi)
library(data.table)
library(tidyverse)
sight_sound <- read.csv("https://sites.google.com/site/nubededatosblogspotcom/Sight&Sound2012-CriticsPoll.txt", stringsAsFactors = FALSE)
glimpse(sight_sound)

Observations: 588
Variables: 17
$ const                          "tt0052357", "tt0033467", "tt004643...
$ position                       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ created                        "Thu Aug 16 07:42:05 2012", "Thu Au...
$ description                    NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ modified                       "Thu Aug 16 07:42:05 2012", "Thu Au...
$ Title                          "Vertigo", "Citizen Kane", "Tôkyô m...
$ Directors                      "Alfred Hitchcock", "Orson Welles",...
$ Title.type                     "Feature Film", "Feature Film", "Fe...
$ IMDb.Rating                    8.5, 8.5, 8.2, 8.0, 8.3, 8.3, 8.0, ...
$ PeacefulAnarchy.rated          10, 9, 10, 9, 9, 6, 6, 10, 8, 9, 6,...
$ Runtime..mins.                 128, 119, 136, 110, 94, 160, 119, 6...
$ Genres                         "mystery, romance, thriller", "dram...
$ Year                           1958, 1941, 1953, 1939, 1927, 1968,...
$ Num.Votes                      153502, 205699, 16219, 14872, 19188...
$ Release.Date..month.day.year.  "1958-05-09", "1941-05-01", "1953-1...
$ Id                             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ URL                            "http://www.imdb.com/title/tt005235...
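
If only IMDb addresses are available, like those in the URL column, the IDs can be pulled out with a regular expression. A minimal sketch in base R, assuming a hypothetical helper extract_imdb_id() and that an ID is always "tt" followed by digits:

```r
# Hypothetical helper: extract the IMDb ID ("tt" followed by digits)
# from IMDb title URLs; returns NA where no ID is found.
extract_imdb_id <- function(urls) {
  m <- regexpr("tt[0-9]+", urls)
  ids <- rep(NA_character_, length(urls))
  ids[m != -1] <- regmatches(urls, m)
  ids
}

urls <- c("https://www.imdb.com/title/tt0052357/",
          "http://www.imdb.com/title/tt0033467/",
          "not an IMDb URL")
found <- extract_imdb_id(urls)
found
```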

We use lapply to retrieve the information for every IMDb ID.

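# Query the OMDb API once per IMDb ID. A failed request returns NULL
# instead of stopping the loop. Replace "12345678" with your own API key.
tt <-
  lapply(sight_sound$const, function(x) {
    tryCatch(
      find_by_id(
        x,
        type = NULL,
        year_of_release = NULL,
        plot = "full",
        include_tomatoes = TRUE,
        api_key = "12345678"
      ),
      error = function(e) NULL
    )
  })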
df_sight_sound <- rbindlist(tt, fill = TRUE)
df_sight_sound$Ratings <- as.character(df_sight_sound$Ratings)
df_sight_sound <- as.data.frame(df_sight_sound)
df_sight_sound %>% distinct(imdbID) %>% summarise(n= n())

    n
1 586

A single pass usually misses a few titles. In this case two are missing (586 of the 588 IDs were retrieved). We would repeat the process until every title has been retrieved.

# Check which titles were not found
m <- subset(sight_sound, !(const %in% df_sight_sound$imdbID))$const 
m

[1] "tt0115751" "tt0032551"
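
Repeating the process until nothing is missing can be automated. The following is only a sketch under stated assumptions: fetch_until_complete() is a hypothetical helper, and fake_fetch() is a stand-in for find_by_id() so that the example runs without an API key. It uses rbind() instead of rbindlist() to stay in base R.

```r
# Hypothetical helper: query every ID, then re-query only the ones that
# failed, until nothing is missing or max_attempts is reached.
# fetch_fun is any function that returns a data frame for one ID.
fetch_until_complete <- function(ids, fetch_fun, max_attempts = 3) {
  results <- list()
  pending <- ids
  for (attempt in seq_len(max_attempts)) {
    if (length(pending) == 0) break
    fetched <- lapply(pending, function(x) {
      tryCatch(fetch_fun(x), error = function(e) NULL)
    })
    ok <- !vapply(fetched, is.null, logical(1))
    results <- c(results, fetched[ok])
    pending <- pending[!ok]
  }
  list(data = do.call(rbind, results), missing = pending)
}

# Stand-in for find_by_id(): fails the first time it sees "tt0032551".
seen <- character(0)
fake_fetch <- function(id) {
  first_time <- !(id %in% seen)
  seen <<- c(seen, id)
  if (id == "tt0032551" && first_time) stop("temporary failure")
  data.frame(imdbID = id, stringsAsFactors = FALSE)
}

out <- fetch_until_complete(c("tt0115751", "tt0032551"), fake_fetch)
out$missing     # IDs still missing after all attempts
nrow(out$data)  # rows retrieved
```

With the real API, fetch_fun could be something like function(x) find_by_id(x, api_key = "12345678"). Keep in mind that every retry counts against the daily request limit.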

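Finally, we process the data frame to remove duplicates: we keep only the rows whose Ratings entry comes from the Internet Movie Database and drop any repeated rows.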

df_sight_sound <- df_sight_sound %>% 
  filter(grepl("Internet",Ratings)) %>% 
  group_by(imdbID) %>% 
  distinct()

# A tibble: 588 x 26
# Groups:   imdbID [588]
   Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
 1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
 2 Citi~ 1941  PG    1941-09-05 119 min Dram~ Orson W~ Herma~ Josep~ "A g~
 3 Toky~ 1953  NOT ~ 1972-03-13 136 min Drama Yasujir~ Kôgo ~ Chish~ An e~
 4 The ~ 1939  NOT ~ 1950-04-08 110 min Come~ Jean Re~ Jean ~ Nora ~ Avia~
 5 Sunr~ 1927  NOT ~ 1927-11-04 94 min  Dram~ F.W. Mu~ Carl ~ Georg~ "In ~
 6 2001~ 1968  G     1968-05-12 149 min Adve~ Stanley~ Stanl~ Keir ~ "\"2~
 7 The ~ 1956  PASS~ 1956-05-26 119 min Adve~ John Fo~ Frank~ John ~ Etha~
 8 Man ~ 1929  NOT ~ 1929-05-12 68 min  Docu~ Dziga V~ Dziga~ Mikha~ This~
 9 The ~ 1928  NOT ~ 1928-10-25 114 min Biog~ Carl Th~ Josep~ Maria~ The ~
10 8½    1963  NOT ~ 1963-06-25 138 min Drama Federic~ Feder~ Marce~ Guid~
# ... with 578 more rows, and 16 more variables: Language,
#   Country, Awards, Poster, Ratings,
#   Metascore, imdbRating, imdbVotes, imdbID,
#   Type, DVD, BoxOffice, Production, Website,
#   Response, totalSeasons

Our small IMDb database is now ready. To export the results as a CSV file:

write.csv(df_sight_sound, "df_sight_sound.csv", row.names = FALSE)


2018-11-25

Hiding zeros in Excel

Problem

We want to hide the zero values in an Excel worksheet.

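Solution

Option 1 - Range

Select the desired range and press Ctrl + 1.
In the Format Cells dialog box, on the Number tab, select the Custom category and type in the Type box: 0;-0;;@ (the shorter code 0;; also hides negative numbers, because its negative section is empty).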

Option 2 - Worksheet

Click File > Options > Advanced.
Under Display options for this worksheet, select the desired worksheet and clear the Show a zero in cells that have zero value check box.

Option 3 - VBA: worksheets or workbooks

Open the Microsoft Visual Basic editor: Alt+F11
Copy the following subroutines into a module: one to hide the zeros and one to show them

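' Active worksheet only
Sub Ocultar_ceros()
    ActiveWindow.DisplayZeros = False
End Sub

Sub Mostrar_ceros()
    ActiveWindow.DisplayZeros = True
End Sub

' Variant that first selects every worksheet in the workbook.
' Renamed so that both versions can live in the same module.
Sub Ocultar_ceros_libro()
    Worksheets.Select
    ActiveWindow.DisplayZeros = False
End Sub

Sub Mostrar_ceros_libro()
    Worksheets.Select
    ActiveWindow.DisplayZeros = True
End Sub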


2018-11-10

Extracting information from the OMDb API with R

Problem

We want to use R to extract information about movies and TV series through the OMDb API.

Solution

We use the imdbapi package, which lets us retrieve this information. With the free tier, we are limited to 1,000 requests per day. The package lets us do the following:

Search by IMDb ID

library(imdbapi)
find_by_id("tt0107692", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678")

# A tibble: 2 x 25
  Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
1 Ninj~ 1993  NOT ~ 1993-06-05 94 min  Anim~ Yoshiak~ Yoshi~ Kôich~ A Jo~
2 Ninj~ 1993  NOT ~ 1993-06-05 94 min  Anim~ Yoshiak~ Yoshi~ Kôich~ A Jo~
# ... with 15 more variables: Language, Country, Awards,
#   Poster, Ratings, Metascore, imdbRating,
#   imdbVotes, imdbID, Type, DVD, BoxOffice,
#   Production, Website, Response

Search by title

find_by_title("vertigo", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678")

# A tibble: 3 x 25
  Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
2 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
3 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
# ... with 15 more variables: Language, Country, Awards,
#   Poster, Ratings, Metascore, imdbRating,
#   imdbVotes, imdbID, Type, DVD, BoxOffice,
#   Production, Website, Response

It also provides functions to extract specific pieces of information: actors, countries, directors, genres or writers.

get_actors(find_by_title("vertigo", api_key = "12345678"))

[1] "James Stewart"      "Kim Novak"          "Barbara Bel Geddes"
[4] "Tom Helmore"

get_countries(find_by_title("vertigo", api_key = "12345678"))

[1] "USA"

get_directors(find_by_title("vertigo", api_key = "12345678"))

[1] "Alfred Hitchcock"

get_genres(find_by_title("vertigo", api_key = "12345678"))

[1] "Mystery"  "Romance"  "Thriller"

get_writers(find_by_title("vertigo", api_key = "12345678"))

[1] "Alec Coppel (screenplay by)"                                  
[2] "Samuel A. Taylor (screenplay by)"                             
[3] "Pierre Boileau (based on the novel \"D'Entre Les Morts\" by)" 
[4] "Thomas Narcejac (based on the novel \"D'Entre Les Morts\" by)"

Loading the poster image

library(RCurl)
library(jpeg)  # provides readJPEG()
df <- find_by_title("Batman Ninja", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678")
plot(0:1,
     0:1,
     type = "n",
     ann = FALSE,
     axes = FALSE)
my_image <-  readJPEG(getURLContent(df$Poster[1]))
rasterImage(my_image, 0, 0, 1, 1)

Notes

Inspecting the functions above whose names start with get_ shows that they simply extract a subset of the data from the omdb object returned by the search by IMDb ID or by title. For example, get_actors():

function (omdb) 
{
  if (!inherits(omdb, "omdb")) {
    message("get_actors() expects an omdb object")
    return(NULL)
  }
  if ("Actors" %in% names(omdb)) {
    str_split(omdb$Actors, ",[ ]*")[[1]]
  }
}
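
The split itself can be tried without the package. A minimal sketch using base R's strsplit() in place of stringr's str_split(), with the same pattern and the Actors string from the Batman Ninja result:

```r
# Same pattern as in get_actors(): a comma followed by optional spaces.
actors <- "Kôichi Yamadera, Wataru Takagi, Ai Kakuma, Rie Kugimiya"
actor_names <- strsplit(actors, ",[ ]*")[[1]]
actor_names
```

The result is a character vector with one name per element.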

Each request with find_by_id or find_by_title returns one row per available rating source (Ratings). For example, Vertigo returns three rows, with the ratings from IMDb, Rotten Tomatoes and Metacritic, whereas Ninja Scroll and Batman Ninja return only two rows: IMDb and Rotten Tomatoes. The extracted variables are:

Classes ‘omdb’, ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  25 variables:
 $ Title     : chr  "Batman Ninja" "Batman Ninja"
 $ Year      : chr  "2018" "2018"
 $ Rated     : chr  "PG-13" "PG-13"
 $ Released  : Date, format: "2018-04-24" "2018-04-24"
 $ Runtime   : chr  "85 min" "85 min"
 $ Genre     : chr  "Animation, Action" "Animation, Action"
 $ Director  : chr  "Junpei Mizusaki" "Junpei Mizusaki"
 $ Writer    : chr  "Kazuki Nakashima (screenplay), Leo Chu (English screenplay), Eric Garcia (English screenplay), Bob Kane (charac"| __truncated__ "Kazuki Nakashima (screenplay), Leo Chu (English screenplay), Eric Garcia (English screenplay), Bob Kane (charac"| __truncated__
 $ Actors    : chr  "Kôichi Yamadera, Wataru Takagi, Ai Kakuma, Rie Kugimiya" "Kôichi Yamadera, Wataru Takagi, Ai Kakuma, Rie Kugimiya"
 $ Plot      : chr  "Batman, along with a number of his allies and adversaries, finds himself transplanted from modern Gotham City to feudal Japan." "Batman, along with a number of his allies and adversaries, finds himself transplanted from modern Gotham City to feudal Japan."
 $ Language  : chr  "Japanese, English" "Japanese, English"
 $ Country   : chr  "Japan, USA" "Japan, USA"
 $ Awards    : chr  "N/A" "N/A"
 $ Poster    : chr  "https://m.media-amazon.com/images/M/MV5BYmFhYzZhYzgtZjZiYS00NWEwLWFhYTUtN2UxM2FmYzdhNDUyXkEyXkFqcGdeQXVyNDk2Nzc"| __truncated__ "https://m.media-amazon.com/images/M/MV5BYmFhYzZhYzgtZjZiYS00NWEwLWFhYTUtN2UxM2FmYzdhNDUyXkEyXkFqcGdeQXVyNDk2Nzc"| __truncated__
 $ Ratings   :List of 2
  ..$ :List of 2
  .. ..$ Source: chr "Internet Movie Database"
  .. ..$ Value : chr "5.7/10"
  ..$ :List of 2
  .. ..$ Source: chr "Rotten Tomatoes"
  .. ..$ Value : chr "79%"
 $ Metascore : chr  "N/A" "N/A"
 $ imdbRating: num  5.7 5.7
 $ imdbVotes : num  9759 9759
 $ imdbID    : chr  "tt7451284" "tt7451284"
 $ Type      : chr  "movie" "movie"
 $ DVD       : Date, format: "2018-05-08" "2018-05-08"
 $ BoxOffice : chr  "N/A" "N/A"
 $ Production: chr  "DC Comics" "DC Comics"
 $ Website   : chr  "N/A" "N/A"
 $ Response  : chr  "True" "True"
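
Because Ratings is a list column holding one source per row, it can be flattened into ordinary columns. A minimal sketch, assuming a hypothetical helper ratings_long() (not part of imdbapi) and a hand-built data frame that mimics the structure printed above rather than a live API response:

```r
# Hand-built stand-in for the result of find_by_title("Batman Ninja", ...):
# two rows, each with one rating source in the Ratings list column.
df <- data.frame(imdbID = c("tt7451284", "tt7451284"),
                 stringsAsFactors = FALSE)
df$Ratings <- list(
  list(Source = "Internet Movie Database", Value = "5.7/10"),
  list(Source = "Rotten Tomatoes", Value = "79%")
)

# Hypothetical helper: one row per rating, with Source and Value as columns.
ratings_long <- function(x) {
  data.frame(
    imdbID = x$imdbID,
    Source = vapply(x$Ratings, function(r) r$Source, character(1)),
    Value  = vapply(x$Ratings, function(r) r$Value, character(1)),
    stringsAsFactors = FALSE
  )
}

long <- ratings_long(df)
long
```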