2020-04-22

Extract movie info using the OMDb API in R

Problem

We want to extract movie information using the OMDb API in R.

Solution

We use the imdbapi package. With the free tier, the maximum number of requests per day is 1,000. An API key is required and can be requested from the OMDb API website. The package allows us to:

  • Retrieve info by IMDb id
    library(imdbapi)
    find_by_id("tt0107692", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678") # Ninja Scroll imdb id: tt0107692
    
    # A tibble: 2 x 25
      Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
      <chr> <chr> <chr> <date>     <chr>   <chr> <chr>    <chr>  <chr>  <chr>
    1 Ninj~ 1993  NOT ~ 1993-06-05 94 min  Anim~ Yoshiak~ Yoshi~ Kôich~ A Jo~
    2 Ninj~ 1993  NOT ~ 1993-06-05 94 min  Anim~ Yoshiak~ Yoshi~ Kôich~ A Jo~
    # ... with 15 more variables: Language <chr>, Country <chr>, Awards <chr>,
    #   Poster <chr>, Ratings <list>, Metascore <chr>, imdbRating <dbl>,
    #   imdbVotes <dbl>, imdbID <chr>, Type <chr>, DVD <date>, BoxOffice <chr>,
    #   Production <chr>, Website <chr>, Response <chr>
    
  • Search by title
    find_by_title("vertigo", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678")
    
    # A tibble: 3 x 25
      Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
      <chr> <chr> <chr> <date>     <chr>   <chr> <chr>    <chr>  <chr>  <chr>
    1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
    2 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
    3 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
    # ... with 15 more variables: Language <chr>, Country <chr>, Awards <chr>,
    #   Poster <chr>, Ratings <list>, Metascore <chr>, imdbRating <dbl>,
    #   imdbVotes <dbl>, imdbID <chr>, Type <chr>, DVD <date>, BoxOffice <chr>,
    #   Production <chr>, Website <chr>, Response <chr>
    
  • Additionally, the package includes specific functions to extract info about actors, countries, directors, genres and writers.
    get_actors(find_by_title("vertigo", api_key = "12345678"))
    [1] "James Stewart"      "Kim Novak"          "Barbara Bel Geddes"
    [4] "Tom Helmore"  
    get_countries(find_by_title("vertigo", api_key = "12345678"))
    [1] "USA"
    get_directors(find_by_title("vertigo", api_key = "12345678"))
    [1] "Alfred Hitchcock"
    get_genres(find_by_title("vertigo", api_key = "12345678"))
    [1] "Mystery"  "Romance"  "Thriller"
    get_writers(find_by_title("vertigo", api_key = "12345678"))
    [1] "Alec Coppel (screenplay by)"                                  
    [2] "Samuel A. Taylor (screenplay by)"                             
    [3] "Pierre Boileau (based on the novel \"D'Entre Les Morts\" by)" 
    [4] "Thomas Narcejac (based on the novel \"D'Entre Les Morts\" by)"
    
  • Load poster image
    library(RCurl) # provides getURLContent()
    library(jpeg)  # provides readJPEG()
    df <- find_by_title("Batman Ninja", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678")
    plot(0:1,
         0:1,
         type = "n",
         ann = FALSE,
         axes = FALSE)
    my_image <-  readJPEG(getURLContent(df$Poster[1]))
    rasterImage(my_image, 0, 0, 1, 1)
    

    Notes

    By inspecting the functions whose names start with get_, we can see that they are wrappers that subset the info returned by find_by_title or find_by_id. For example, get_actors:

    function (omdb) 
    {
      if (!inherits(omdb, "omdb")) {
        message("get_actors() expects an omdb object")
        return(NULL)
      }
      if ("Actors" %in% names(omdb)) {
        str_split(omdb$Actors, ",[ ]*")[[1]]
      }
    }
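
    The splitting step inside get_actors can be tried without an API call. This is a minimal sketch, assuming a hypothetical one-row data frame (omdb_like) that stands in for the result of find_by_title, and using base strsplit in place of str_split:

```r
# Hypothetical stand-in for one row returned by find_by_title(); no API call is made
omdb_like <- data.frame(
  Actors = "James Stewart, Kim Novak, Barbara Bel Geddes, Tom Helmore",
  stringsAsFactors = FALSE
)

# Split the comma-separated field with the same pattern used by get_actors()
actors <- strsplit(omdb_like$Actors[1], ",[ ]*")[[1]]
print(actors)
```

    The pattern ",[ ]*" matches a comma followed by any number of spaces, so the names come back without leading blanks.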
    
    Every request made with find_by_id or find_by_title returns one row per available rating. For instance, Vertigo returns 3 rows, one for each rating source: IMDb, Rotten Tomatoes and Metacritic, whereas Ninja Scroll and Batman Ninja return only 2: IMDb and Rotten Tomatoes. The variables are:

    Classes ‘omdb’, ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  25 variables:
     $ Title     : chr  "Batman Ninja" "Batman Ninja"
     $ Year      : chr  "2018" "2018"
     $ Rated     : chr  "PG-13" "PG-13"
     $ Released  : Date, format: "2018-04-24" "2018-04-24"
     $ Runtime   : chr  "85 min" "85 min"
     $ Genre     : chr  "Animation, Action" "Animation, Action"
     $ Director  : chr  "Junpei Mizusaki" "Junpei Mizusaki"
     $ Writer    : chr  "Kazuki Nakashima (screenplay), Leo Chu (English screenplay), Eric Garcia (English screenplay), Bob Kane (charac"| __truncated__ "Kazuki Nakashima (screenplay), Leo Chu (English screenplay), Eric Garcia (English screenplay), Bob Kane (charac"| __truncated__
     $ Actors    : chr  "Kôichi Yamadera, Wataru Takagi, Ai Kakuma, Rie Kugimiya" "Kôichi Yamadera, Wataru Takagi, Ai Kakuma, Rie Kugimiya"
     $ Plot      : chr  "Batman, along with a number of his allies and adversaries, finds himself transplanted from modern Gotham City to feudal Japan." "Batman, along with a number of his allies and adversaries, finds himself transplanted from modern Gotham City to feudal Japan."
     $ Language  : chr  "Japanese, English" "Japanese, English"
     $ Country   : chr  "Japan, USA" "Japan, USA"
     $ Awards    : chr  "N/A" "N/A"
     $ Poster    : chr  "https://m.media-amazon.com/images/M/MV5BYmFhYzZhYzgtZjZiYS00NWEwLWFhYTUtN2UxM2FmYzdhNDUyXkEyXkFqcGdeQXVyNDk2Nzc"| __truncated__ "https://m.media-amazon.com/images/M/MV5BYmFhYzZhYzgtZjZiYS00NWEwLWFhYTUtN2UxM2FmYzdhNDUyXkEyXkFqcGdeQXVyNDk2Nzc"| __truncated__
     $ Ratings   :List of 2
      ..$ :List of 2
      .. ..$ Source: chr "Internet Movie Database"
      .. ..$ Value : chr "5.7/10"
      ..$ :List of 2
      .. ..$ Source: chr "Rotten Tomatoes"
      .. ..$ Value : chr "79%"
     $ Metascore : chr  "N/A" "N/A"
     $ imdbRating: num  5.7 5.7
     $ imdbVotes : num  9759 9759
     $ imdbID    : chr  "tt7451284" "tt7451284"
     $ Type      : chr  "movie" "movie"
     $ DVD       : Date, format: "2018-05-08" "2018-05-08"
     $ BoxOffice : chr  "N/A" "N/A"
     $ Production: chr  "DC Comics" "DC Comics"
     $ Website   : chr  "N/A" "N/A"
     $ Response  : chr  "True" "True"
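
    Because the ratings are stored as a nested list, it can be convenient to flatten them into a small data frame. The following is only a sketch: the ratings object is a hand-built copy of the Ratings column shown above, not the result of a real request.

```r
# Hand-built copy of the nested Ratings list shown in the str() output (assumption: same shape)
ratings <- list(
  list(Source = "Internet Movie Database", Value = "5.7/10"),
  list(Source = "Rotten Tomatoes", Value = "79%")
)

# Turn each list element into a one-row data frame and stack the rows
ratings_df <- do.call(rbind, lapply(ratings, function(r) {
  data.frame(Source = r$Source, Value = r$Value, stringsAsFactors = FALSE)
}))
print(ratings_df)
```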
    

    2020-04-18

    How to draw a stratified sample in R

    Problem

    We want to draw a stratified sample in R. Previously, we took a simple random sample from a data frame, with no control over the distribution of the subgroups. This time we control the size of each stratum so that the sample keeps the same overall distribution as the original data.

    Solution

    Using the function createDataPartition from the caret package.

    library(tidyverse)
    library(caret)
    set.seed(1)
    planes <- as.data.frame(nycflights13::planes)
    trainIndex <- createDataPartition(planes$engine,
                                      p = .5,
                                      list = FALSE,
                                      times = 1)
    
    planesTrain <- planes[trainIndex, ]
    planesTest  <- planes[-trainIndex, ]
    
    Using the function stratified from the splitstackshape package.

    library(splitstackshape)
    set.seed(1)
    planesTrain1 <- stratified(planes, "engine", 0.5)
    planesTest1 <- planes[!(planes$tailnum %in% planesTrain1$tailnum), ] # rows not drawn into planesTrain1
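
    The idea behind both functions can also be sketched in base R: split the row indices by stratum, sample the same fraction within each stratum, and recombine. This is an illustrative sketch on the built-in iris data set (an assumption made so the example runs without extra packages), not how caret or splitstackshape are implemented.

```r
set.seed(1)

# Row indices grouped by stratum (here: Species)
idx_by_stratum <- split(seq_len(nrow(iris)), iris$Species)

# Draw 50% of the rows within each stratum
train_idx <- unlist(lapply(idx_by_stratum, function(idx) {
  idx[sample.int(length(idx), size = round(length(idx) * 0.5))]
}), use.names = FALSE)

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Each species contributes the same share to the training set as to the original data
table(train$Species)
```

    sample.int is applied to the length of each index vector rather than to the vector itself, which avoids the surprising behaviour of sample() when a stratum contains a single row.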
    
    Checking that we have created balanced splits of the data.

    • Original data frame
      planes %>% 
        group_by(engine) %>% 
        summarise(n = n()) %>%
        mutate(cum = n / sum(n))
      
      # A tibble: 6 x 3
        engine            n      cum
        <chr>         <int>    <dbl>
      1 4 Cycle           2 0.000602
      2 Reciprocating    28 0.00843 
      3 Turbo-fan      2750 0.828   
      4 Turbo-jet       535 0.161   
      5 Turbo-prop        2 0.000602
      6 Turbo-shaft       5 0.00151  
      
    • Training set
      planesTrain %>%
        group_by(engine) %>%
        summarise(n = n()) %>%
        mutate(cum = n / sum(n))
      
      # A tibble: 6 x 3
        engine            n      cum
        <chr>         <int>    <dbl>
      1 4 Cycle           1 0.000602
      2 Reciprocating    14 0.00842 
      3 Turbo-fan      1375 0.827   
      4 Turbo-jet       268 0.161   
      5 Turbo-prop        1 0.000602
      6 Turbo-shaft       3 0.00181 
      
    • Test set
      planesTest %>%
        group_by(engine) %>%
        summarise(n = n()) %>%
        mutate(cum = n / sum(n))
      
      # A tibble: 6 x 3
        engine            n      cum
        <chr>         <int>    <dbl>
      1 4 Cycle           1 0.000602
      2 Reciprocating    14 0.00843 
      3 Turbo-fan      1375 0.828   
      4 Turbo-jet       267 0.161   
      5 Turbo-prop        1 0.000602
      6 Turbo-shaft       2 0.00120 
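
    The same check can be written more compactly with table and prop.table. A minimal sketch on made-up strata (the vector original and the helper stratum_share are hypothetical names used only for illustration):

```r
# Hypothetical helper: share of each stratum, rounded to 3 decimals
stratum_share <- function(x) round(prop.table(table(x)), 3)

# Made-up data with three strata of sizes 60, 30 and 10
original <- rep(c("a", "b", "c"), times = c(60, 30, 10))

# Deterministic stand-in for a 50% stratified sample: the first half of each stratum
half <- unlist(lapply(split(original, original), function(s) s[seq_len(length(s) / 2)]))

share_original <- stratum_share(original)
share_half <- stratum_share(half)

# One row per data set, one column per stratum
print(rbind(original = share_original, half = share_half))
```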
      

    2020-04-12

    Store the date and time when creating records in MS Access

    Problem

    We want to store the date and time when a new record is created in MS Access.

    Solution

    1. In the target table, we create the fields that will store the date. For demonstration purposes, our example has three: Fecha y hora (date and time), Fecha (date) and Hora (time).
    2. In the Format property (Formato), we specify the desired format: General Date (to show date and time), Short Date (to show only the date), and Long Time (to show only the time).
    3. In the Default Value property (Valor predeterminado), we enter =Ahora() (=Now() in English-language versions of Access), so that the current date and time are stored automatically whenever we add a new record.

    Notes

    We can record the date and time per record in a more sophisticated way from within a data entry form, for example every time we click on a field:

    1. We create the macro that records the date and time.
    2. In the form's Design View, we open the property sheet of the desired field, click the Event tab, and click the arrow of the box where we select the name of the macro created earlier.
    3. Clicking the field to which we added the event inserts the value set in the macro.

    2020-04-04

    How to select a random sample in R

    Problem

    We want to extract a random sample from a data frame in R.

    Solution

    • Base package
    library(dplyr) # only needed because it provides the starwars data set
    set.seed(1)
    starwars[sample(nrow(starwars), 10), ] # 10 rows
    # Showing the first 5 columns
    set.seed(1)
    starwars[sample(nrow(starwars), 10), 1:5]
    
    # A tibble: 10 x 5
       name            height  mass hair_color skin_color
       <chr>            <int> <dbl> <chr>      <chr>     
     1 Dexter Jettster    198 102   none       brown     
     2 Sebulba            112  40   none       grey, red 
     3 Luke Skywalker     172  77   blond      fair      
     4 Jar Jar Binks      196  66   none       orange    
     5 Bib Fortuna        180  NA   none       pale      
     6 Han Solo           180  80   brown      fair      
     7 Cliegg Lars        183  NA   brown      fair      
     8 Eeth Koth          171  NA   black      brown     
     9 Boba Fett          183  78.2 black      fair      
    10 Yarael Poof        264  NA   none       white
    
    • dplyr
    library(tidyverse)
    set.seed(1)
    starwars %>%
      sample_n(10) %>%
      select(1:5)
    
    • data.table
    library(data.table)
    set.seed(1)
    data.table(starwars)[sample(.N, 10), 1:5]
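
    All three variants call set.seed(1) before sampling so that the draw is reproducible. A short sketch of that behaviour, using the built-in mtcars data set (an assumption made so the example needs no packages), together with sampling a fraction of the rows instead of a fixed number:

```r
# Resetting the seed before each draw reproduces exactly the same rows
set.seed(1)
first_draw <- mtcars[sample(nrow(mtcars), 5), 1:3]
set.seed(1)
second_draw <- mtcars[sample(nrow(mtcars), 5), 1:3]
identical(first_draw, second_draw) # TRUE

# Sampling a fraction (25%) of the rows instead of a fixed count
quarter <- mtcars[sample(nrow(mtcars), size = floor(0.25 * nrow(mtcars))), ]
nrow(quarter) # 8, since mtcars has 32 rows
```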
    

    2020-04-03

    Creating coronavirus charts in R

    Introduction

    We want to show the evolution of coronavirus cases in R with static and interactive charts.

    Charts

  • Interactive (linear scale)
  • Interactive (log scale)

    Solution

    We use the data from the repository created by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). There are three time series: confirmed, deaths and recovered cases. First we prepare the data, then we create the chart with ggplot2 for the static version and with plotly to add interactivity. The series cover cases worldwide, but our example uses a subset for Germany, France, Italy, Spain and the United Kingdom.

    # Libraries
    library(magrittr)
    library(lubridate) 
    library(tidyverse)
    library(plotly)
    library(scales)
    
    # Data import
    confirmed <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
    deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv")
    recovered <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv")
    
    # Data preparation
    AppendMe <- function(dfNames) {
      do.call(rbind, lapply(dfNames, function(x) {
        cbind(get(x), source = x)
      }))
    }
    df <- AppendMe(c("confirmed", "deaths", "recovered"))
    data <- df %>%
      rename(province = `Province/State`, country = `Country/Region`) %>% 
      pivot_longer(
        -c(province, country, Lat, Long, source),
        names_to = "date",
        values_to = "count"
      ) %>% 
      mutate(date = mdy(date)) # column names such as "1/22/20" become Date values
    
    # Linear-scale chart
    p <- data %>%
      filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
      group_by(country, date, source) %>%
      summarise(n = sum(count)) %>%
      ggplot(aes(date, n, colour = country)) +
      geom_line(linetype = 2) +
      geom_point(size = 1) +
      facet_wrap( ~  source  , scales = "free", nrow = 3) +
      theme_bw()+
      labs(title = "Cumulative Covid-19 cases (linear scale)")+
      ylab("")+
      scale_x_date(date_labels = "%b %d")+
      scale_y_continuous(labels = comma)
    p # Static
    ggplotly(p) # Interactive
    
    # Log-scale chart
    p <- data %>%
      filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
      group_by(country, date, source) %>%
      summarise(n = sum(count)) %>%
      ggplot(aes(date, n, colour = country)) +
      geom_line(linetype = 2) +
      geom_point(size = 1) +
      facet_wrap( ~  source, scales = "free",  nrow = 3) +
      theme_bw()+
      labs(title = "Cumulative Covid-19 cases (log scale)")+
      ylab("")+
      scale_x_date(date_labels = "%b %d")+
      scale_y_log10(breaks = c(1, 10, 100, 10000))
    p 
    ggplotly(p) 
    
    To highlight a series when hovering over it, we use the highlight function from the plotly package.

    p <- data %>%
      filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
      group_by(country, date, source) %>%
      summarise(cases = sum(count)) %>%
      highlight_key(~ country ) %>% 
      ggplot(aes(date, cases, colour = country)) +
      geom_line(linetype = 2)+
      geom_point(size = 1) +
      facet_wrap(~  source  , scales = "free", nrow = 3)+
      theme_bw()+
      labs(title = "Cumulative Covid-19 cases (linear scale)")+
      ylab("")+
      scale_x_date(date_labels = "%b %d")+
      scale_y_continuous(labels = comma)
    ggplotly(p, tooltip = c("country", "date", "cases")) %>% 
    highlight(on = "plotly_hover")
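
    The wide-to-long reshape in the data preparation step (pivot_longer followed by mdy) can also be sketched in base R. The tiny data frame wide below is made up for illustration and only mimics the layout of the JHU CSSE files, with one column per date:

```r
# Made-up wide table: one row per country, one column per date (m/d/yy)
wide <- data.frame(
  country = c("Spain", "Italy"),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(1, 2),
  check.names = FALSE
)

date_cols <- setdiff(names(wide), "country")

# Stack the date columns: values are taken column by column,
# so countries repeat as a block and each date repeats once per country
long <- data.frame(
  country = rep(wide$country, times = length(date_cols)),
  date = rep(as.Date(date_cols, format = "%m/%d/%y"), each = nrow(wide)),
  count = unlist(wide[date_cols], use.names = FALSE)
)
print(long)
```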
    