Nube de datos: April 2020

2020-04-22

Extract movie info using the OMDb API in R

Problem

We want to extract movie information using the OMDb API in R.

Solution

We use the imdbapi package, which requires an API key that has to be requested from OMDb. With the free tier, the limit is 1,000 requests per day. The package allows us to:

Retrieve info by imdb id

library(imdbapi)
find_by_id("tt0107692", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678") # Ninja Scroll imdb id: tt0107692

# A tibble: 2 x 25
  Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
  <chr> <chr> <chr> <date>     <chr>   <chr> <chr>    <chr>  <chr>  <chr>
1 Ninj~ 1993  NOT ~ 1993-06-05 94 min  Anim~ Yoshiak~ Yoshi~ Kôich~ A Jo~
2 Ninj~ 1993  NOT ~ 1993-06-05 94 min  Anim~ Yoshiak~ Yoshi~ Kôich~ A Jo~
# ... with 15 more variables: Language <chr>, Country <chr>, Awards <chr>,
#   Poster <chr>, Ratings <list>, Metascore <chr>, imdbRating <dbl>,
#   imdbVotes <dbl>, imdbID <chr>, Type <chr>, DVD <date>, BoxOffice <chr>,
#   Production <chr>, Website <chr>, Response <chr>

Search by title

find_by_title("vertigo", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678")

# A tibble: 3 x 25
  Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
  <chr> <chr> <chr> <date>     <chr>   <chr> <chr>    <chr>  <chr>  <chr>
1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
2 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
3 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
# ... with 15 more variables: Language <chr>, Country <chr>, Awards <chr>,
#   Poster <chr>, Ratings <list>, Metascore <chr>, imdbRating <dbl>,
#   imdbVotes <dbl>, imdbID <chr>, Type <chr>, DVD <date>, BoxOffice <chr>,
#   Production <chr>, Website <chr>, Response <chr>
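
The calls above pass the API key as a literal string. A minimal sketch of keeping the key out of the script, assuming it is stored in an environment variable called OMDB_API_KEY (a hypothetical name chosen for this example, not something the imdbapi package requires):

```r
# Hypothetical helper: read the OMDb key from an environment variable.
# OMDB_API_KEY is an assumed name; it could be set in ~/.Renviron
# or with Sys.setenv() at the start of a session.
get_omdb_key <- function(var = "OMDB_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  key
}

# Placeholder value, for illustration only
Sys.setenv(OMDB_API_KEY = "12345678")
get_omdb_key()
```

The result can then be passed along, for example as api_key = get_omdb_key() in find_by_id or find_by_title.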

Additionally, the package includes specific functions to extract info about actors, countries, directors, genres and writers.

get_actors(find_by_title("vertigo", api_key = "12345678"))

[1] "James Stewart"      "Kim Novak"          "Barbara Bel Geddes"
[4] "Tom Helmore"

get_countries(find_by_title("vertigo", api_key = "12345678"))

[1] "USA"

get_directors(find_by_title("vertigo", api_key = "12345678"))

[1] "Alfred Hitchcock"

get_genres(find_by_title("vertigo", api_key = "12345678"))

[1] "Mystery"  "Romance"  "Thriller"

get_writers(find_by_title("vertigo", api_key = "12345678"))

[1] "Alec Coppel (screenplay by)"                                  
[2] "Samuel A. Taylor (screenplay by)"                             
[3] "Pierre Boileau (based on the novel \"D'Entre Les Morts\" by)" 
[4] "Thomas Narcejac (based on the novel \"D'Entre Les Morts\" by)"

Load poster image

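library(RCurl) # getURLContent()
library(jpeg)  # readJPEG()
df <- find_by_title("Batman Ninja", type = NULL, year_of_release = NULL, plot = "full", include_tomatoes = FALSE, api_key = "12345678")
# Empty plotting canvas spanning the unit square
plot(0:1,
     0:1,
     type = "n",
     ann = FALSE,
     axes = FALSE)
# Download the poster as raw bytes and decode the JPEG
my_image <- readJPEG(getURLContent(df$Poster[1]))
# Draw the image over the whole canvas
rasterImage(my_image, 0, 0, 1, 1)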

Notes

Inspecting the functions whose names start with get_ shows that they are wrappers that subset the info returned by find_by_title or find_by_id. For example, printing get_actors gives:

function (omdb) 
{
  if (!inherits(omdb, "omdb")) {
    message("get_actors() expects an omdb object")
    return(NULL)
  }
  if ("Actors" %in% names(omdb)) {
    str_split(omdb$Actors, ",[ ]*")[[1]]
  }
}
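
The core of the wrapper is a split on a comma followed by optional spaces. A minimal base R sketch of that step, using strsplit in place of str_split and a hypothetical helper name, split_field, that is not part of imdbapi:

```r
# Hypothetical helper mirroring the splitting step of the get_* wrappers
split_field <- function(x) {
  # Split on a comma followed by any number of spaces,
  # then keep the result for the first element of x only
  strsplit(x, ",[ ]*")[[1]]
}

split_field("Animation, Action")
split_field(c("Japan, USA", "Japan, USA"))
```

Because only the first element is kept, passing a column with repeated values does not produce repeated names.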

Every result returned by find_by_id or find_by_title contains one row per available rating. For instance, Vertigo returns three rows (IMDb, Rotten Tomatoes and Metacritic), whereas Ninja Scroll and Batman Ninja return only two (IMDb and Rotten Tomatoes). The structure of the result is:

Classes ‘omdb’, ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  25 variables:
 $ Title     : chr  "Batman Ninja" "Batman Ninja"
 $ Year      : chr  "2018" "2018"
 $ Rated     : chr  "PG-13" "PG-13"
 $ Released  : Date, format: "2018-04-24" "2018-04-24"
 $ Runtime   : chr  "85 min" "85 min"
 $ Genre     : chr  "Animation, Action" "Animation, Action"
 $ Director  : chr  "Junpei Mizusaki" "Junpei Mizusaki"
 $ Writer    : chr  "Kazuki Nakashima (screenplay), Leo Chu (English screenplay), Eric Garcia (English screenplay), Bob Kane (charac"| __truncated__ "Kazuki Nakashima (screenplay), Leo Chu (English screenplay), Eric Garcia (English screenplay), Bob Kane (charac"| __truncated__
 $ Actors    : chr  "Kôichi Yamadera, Wataru Takagi, Ai Kakuma, Rie Kugimiya" "Kôichi Yamadera, Wataru Takagi, Ai Kakuma, Rie Kugimiya"
 $ Plot      : chr  "Batman, along with a number of his allies and adversaries, finds himself transplanted from modern Gotham City to feudal Japan." "Batman, along with a number of his allies and adversaries, finds himself transplanted from modern Gotham City to feudal Japan."
 $ Language  : chr  "Japanese, English" "Japanese, English"
 $ Country   : chr  "Japan, USA" "Japan, USA"
 $ Awards    : chr  "N/A" "N/A"
 $ Poster    : chr  "https://m.media-amazon.com/images/M/MV5BYmFhYzZhYzgtZjZiYS00NWEwLWFhYTUtN2UxM2FmYzdhNDUyXkEyXkFqcGdeQXVyNDk2Nzc"| __truncated__ "https://m.media-amazon.com/images/M/MV5BYmFhYzZhYzgtZjZiYS00NWEwLWFhYTUtN2UxM2FmYzdhNDUyXkEyXkFqcGdeQXVyNDk2Nzc"| __truncated__
 $ Ratings   :List of 2
  ..$ :List of 2
  .. ..$ Source: chr "Internet Movie Database"
  .. ..$ Value : chr "5.7/10"
  ..$ :List of 2
  .. ..$ Source: chr "Rotten Tomatoes"
  .. ..$ Value : chr "79%"
 $ Metascore : chr  "N/A" "N/A"
 $ imdbRating: num  5.7 5.7
 $ imdbVotes : num  9759 9759
 $ imdbID    : chr  "tt7451284" "tt7451284"
 $ Type      : chr  "movie" "movie"
 $ DVD       : Date, format: "2018-05-08" "2018-05-08"
 $ BoxOffice : chr  "N/A" "N/A"
 $ Production: chr  "DC Comics" "DC Comics"
 $ Website   : chr  "N/A" "N/A"
 $ Response  : chr  "True" "True"
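
Since the rows differ only in the Ratings column, the ratings can be collected into a table of their own. A minimal sketch, using a hand-built data frame that mimics the structure above instead of a live API call (the object names movie, ratings and movie_info are illustrative):

```r
# Mock of the two-row result shown above (no API call involved)
movie <- data.frame(
  Title  = c("Batman Ninja", "Batman Ninja"),
  imdbID = c("tt7451284", "tt7451284"),
  stringsAsFactors = FALSE
)
# One Source/Value pair per row, as in the Ratings list column
movie$Ratings <- list(
  list(Source = "Internet Movie Database", Value = "5.7/10"),
  list(Source = "Rotten Tomatoes", Value = "79%")
)

# Flatten the list column into plain character columns
ratings <- data.frame(
  Title  = movie$Title,
  Source = vapply(movie$Ratings, function(r) r$Source, character(1)),
  Value  = vapply(movie$Ratings, function(r) r$Value, character(1)),
  stringsAsFactors = FALSE
)
ratings

# The remaining columns repeat across rows, so the first row is enough
movie_info <- movie[1, c("Title", "imdbID")]
```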

2020-04-18

How to draw a stratified sample in R

Problem

We want to draw a stratified sample in R. Previously, we took a simple random sample from a data frame, with no control over the distribution of the subgroups. This time we control the size of each stratum so that the sample keeps the same overall distribution as the original data.

Solution

Using the function createDataPartition from the caret package.

library(tidyverse)
library(caret)
set.seed(1)
planes <- as.data.frame(nycflights13::planes)
trainIndex <- createDataPartition(planes$engine,
                                  p = .5,
                                  list = FALSE,
                                  times = 1)

planesTrain <- planes[trainIndex, ]
planesTest  <- planes[-trainIndex, ]

Using the function stratified from the splitstackshape package.

library(splitstackshape)
set.seed(1)
planesTrain1 <- stratified(planes, "engine", 0.5)
planesTest1 <- planes[!(planes$tailnum %in% planesTrain1$tailnum),]
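
The idea behind a stratified split can be sketched in base R: group the row indices by stratum and sample the same fraction within each group. This is a minimal illustration on the built-in iris data (three species of 50 rows each), not the implementation used by caret or splitstackshape:

```r
set.seed(1)
p <- 0.5

# Row indices grouped by stratum
idx_by_stratum <- split(seq_len(nrow(iris)), iris$Species)

# Sample a fraction p of the indices inside each stratum.
# Sampling positions with sample.int() avoids the sample(x) pitfall
# when a stratum contains a single row.
train_idx <- unlist(lapply(idx_by_stratum, function(idx) {
  idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
}), use.names = FALSE)

irisTrain <- iris[train_idx, ]
irisTest  <- iris[-train_idx, ]

# Each species keeps the same share in both splits
table(irisTrain$Species)
table(irisTest$Species)
```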

Checking that we have created balanced splits of the data.

Original data frame

planes %>% 
  group_by(engine) %>% 
  summarise(n = n()) %>%
  mutate(cum = n / sum(n))

# A tibble: 6 x 3
  engine            n      cum
  <chr>         <int>    <dbl>
1 4 Cycle           2 0.000602
2 Reciprocating    28 0.00843 
3 Turbo-fan      2750 0.828   
4 Turbo-jet       535 0.161   
5 Turbo-prop        2 0.000602
6 Turbo-shaft       5 0.00151

Training set


planesTrain %>%
  group_by(engine) %>%
  summarise(n = n()) %>%
  mutate(cum = n / sum(n))

# A tibble: 6 x 3
  engine            n      cum
  <chr>         <int>    <dbl>
1 4 Cycle           1 0.000602
2 Reciprocating    14 0.00842 
3 Turbo-fan      1375 0.827   
4 Turbo-jet       268 0.161   
5 Turbo-prop        1 0.000602
6 Turbo-shaft       3 0.00181

Test set

planesTest %>%
  group_by(engine) %>%
  summarise(n = n()) %>%
  mutate(cum = n / sum(n))

# A tibble: 6 x 3
  engine            n      cum
  <chr>         <int>    <dbl>
1 4 Cycle           1 0.000602
2 Reciprocating    14 0.00843 
3 Turbo-fan      1375 0.828   
4 Turbo-jet       267 0.161   
5 Turbo-prop        1 0.000602
6 Turbo-shaft       2 0.00120

References

The caret package

2020-04-12

Storing the date and time when creating records in MS Access

Problem

We want to store the date and time when a new record is created in MS Access.

Solution

In the destination table, we create the fields that will store the timestamp. For demonstration purposes, our example has three: Fecha y hora (date and time), Fecha (date) and Hora (time).

In the Format property (Formato) we specify the desired format: General Date (Fecha general) to show the date and time, Short Date (Fecha corta) to show only the date, and Long Time (Hora larga) to show only the time.

In the Default Value property (Valor predeterminado) we enter =Ahora(), the Spanish-locale name of Now(). This way, whenever we add a new record, the current date and time are stored automatically.

Notes

We can also record the date and time per record in a more sophisticated way, inside a data entry form, for example every time we click on a field:

We create the corresponding macro to record the date and time.

In the form's Design view, we open the property sheet of the desired field, click the Event tab, and click the arrow of the box where we select the name of the macro created earlier.

Clicking the field to which we added the event inserts the value set in the macro.

2020-04-04

How to select a random sample in R

Problem

We want to extract a random sample from a data frame in R.

Solution

Base package

library(dplyr) # only needed here for the starwars dataset
set.seed(1)
starwars[sample(nrow(starwars), 10), ] # 10 rows
# Showing the first 5 columns
set.seed(1)
starwars[sample(nrow(starwars), 10), 1:5]

# A tibble: 10 x 5
   name            height  mass hair_color skin_color
   <chr>            <int> <dbl> <chr>      <chr>     
 1 Dexter Jettster    198 102   none       brown     
 2 Sebulba            112  40   none       grey, red 
 3 Luke Skywalker     172  77   blond      fair      
 4 Jar Jar Binks      196  66   none       orange    
 5 Bib Fortuna        180  NA   none       pale      
 6 Han Solo           180  80   brown      fair      
 7 Cliegg Lars        183  NA   brown      fair      
 8 Eeth Koth          171  NA   black      brown     
 9 Boba Fett          183  78.2 black      fair      
10 Yarael Poof        264  NA   none       white
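
Two details of the base approach are worth seeing in isolation: calling set.seed with the same value reproduces the same sample, and a fraction of the rows can be drawn by deriving the sample size from nrow. A minimal sketch on the built-in mtcars data (32 rows), chosen here only so that the example runs without extra packages:

```r
# Same seed, same sample
set.seed(1)
first <- sample(nrow(mtcars), 10)
set.seed(1)
second <- sample(nrow(mtcars), 10)
identical(first, second)

# Draw 25% of the rows, without replacement
set.seed(1)
n <- round(0.25 * nrow(mtcars))
mtcars_sample <- mtcars[sample(nrow(mtcars), n), ]
nrow(mtcars_sample)
```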

dplyr

library(tidyverse)
set.seed(1)
starwars %>%
  sample_n(10) %>%
  select(1:5)

data.table

library(data.table)
set.seed(1)
data.table(starwars)[sample(.N, 10), 1:5]

2020-04-03

Creating coronavirus charts in R

Introduction

We want to show the evolution of coronavirus cases in R with static and interactive charts.

Charts

Interactive (linear scale)

Interactive (logarithmic scale)

Solution

We use the data from the repository created by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). There are three time series: confirmed, deaths and recovered cases. We first prepare the data and then create the chart, using ggplot2 for the static version and plotly to add interactivity. The series cover cases worldwide, but our example uses a subset for Germany, France, Italy, Spain and the United Kingdom.

# Libraries
library(magrittr)
library(lubridate) 
library(tidyverse)
library(plotly)
library(scales)

# Data import
confirmed <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv")
recovered <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv")

# Data preparation
AppendMe <- function(dfNames) {
  do.call(rbind, lapply(dfNames, function(x) {
    cbind(get(x), source = x)
  }))
}
df <- AppendMe(c("confirmed", "deaths", "recovered"))
data <- df %>%
  rename(province = `Province/State`, country = `Country/Region`) %>% 
  pivot_longer(
    -c(province, country, Lat, Long, source),
    names_to = "date",
    values_to = "count"
  ) %>% 
  mutate(date = mdy(date))

# Linear-scale chart
p <- data %>%
  filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
  group_by(country, date, source) %>%
  summarise(n = sum(count)) %>%
  ggplot(aes(date, n, colour = country)) +
  geom_line(linetype = 2) +
  geom_point(size = 1) +
  facet_wrap( ~  source  , scales = "free", nrow = 3) +
  theme_bw()+
  labs(title = "Cumulative Covid-19 cases (linear scale)")+
  ylab("")+
  scale_x_date(date_labels = "%b %d")+
  scale_y_continuous(labels = comma)
p # Static
ggplotly(p) # Interactive

# Logarithmic-scale chart
p <- data %>%
  filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
  group_by(country, date, source) %>%
  summarise(n = sum(count)) %>%
  ggplot(aes(date, n, colour = country)) +
  geom_line(linetype = 2) +
  geom_point(size = 1) +
  facet_wrap( ~  source, scales = "free",  nrow = 3) +
  theme_bw()+
  labs(title = "Cumulative Covid-19 cases (log scale)")+
  ylab("")+
  scale_x_date(date_labels = "%b %d")+
  scale_y_log10(breaks = c(1, 10, 100, 1000, 10000))
p # Static
ggplotly(p) # Interactive
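
The key step in the data preparation is turning one column per date into one row per date. A minimal base R sketch of that reshaping, on a made-up two-country table with the same wide layout as the JHU files (the counts are invented for illustration); the code above does the same with pivot_longer and mdy:

```r
# Toy table in the wide layout: one column per date (m/d/yy)
wide <- data.frame(
  country   = c("Spain", "Italy"),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(1, 2),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

date_cols <- setdiff(names(wide), "country")

# Stack the date columns: one row per country and date
long <- data.frame(
  country = rep(wide$country, times = length(date_cols)),
  date    = as.Date(rep(date_cols, each = nrow(wide)), format = "%m/%d/%y"),
  count   = unlist(wide[date_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```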

To highlight a series when hovering over it, we use the highlight function from the plotly package.

p <- data %>%
  filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
  group_by(country, date, source) %>%
  summarise(cases = sum(count)) %>%
  highlight_key(~ country ) %>% 
  ggplot(aes(date, cases, colour = country)) +
  geom_line(linetype = 2)+
  geom_point(size = 1) +
  facet_wrap(~  source  , scales = "free", nrow = 3)+
  theme_bw()+
  labs(title = "Cumulative Covid-19 cases (linear scale)")+
  ylab("")+
  scale_x_date(date_labels = "%b %d")+
  scale_y_continuous(labels = comma)
ggplotly(p, tooltip = c("country", "date", "cases")) %>% 
  highlight(on = "plotly_hover")

Chart here. Screenshot below.
