Nube de datos

2020-10-07

Filtrar filas que contengan una cadena de texto en R

Title

Problema

Queremos filtrar las filas de un data frame que contengan una determinada cadena de texto. En nuestro ejemplo las filas que contengan la cadena 'foo'.

foo <- data.frame(Company = c("company1", "foo", "test", "food"), Metric = rnorm(4, 10))

   Company    Metric
1 company1  7.590178
2      foo  9.711493
3     test 10.862799
4     food  9.337434

Solución

dplyr

Usamos la función grepl dentro de filter.

library(dplyr)
filter(foo, grepl("foo", Company))

  Company   Metric
1     foo 9.711493
2    food 9.337434

data.table

Otra opción sería usar la función like de data.table, con una sintaxis similar a SQL.

library(data.table)
DT <- data.table(foo)
DT[Company %like% 'foo']

   Company   Metric
1:     foo 9.711493
2:    food 9.337434

Referencias

stackoverflow

2020-09-15

Equivalente a SUMADIVISION en Excel

Title

Problema

Queremos crear el equivalente a la función SUMADIVISION en Excel, de manera que devuelva la suma de la división de un rango o matriz, el inverso de SUMAPRODUCTO.

Solución

Con SUMAPRODUCTO usamos el inverso de la matriz por la que dividiremos: 1/matriz como denominador. En nuestro ejemplo dividimos la columna 1 por la 2.

=SUMAPRODUCTO(A2:A6,1/B2:B6)

Otra opción sería usar SUMA como una fórmula matricial.

{=SUMA(A2:A6/B2:B6)}

2020-08-01

Equivalent of SUMDIVIDE in Excel

Title

Problem

We want to create the equivalent of the function SUMDIVIDE in Excel, returning the sum of the division of ranges or arrays, the inverse of SUMPRODUCT.

Solution

With SUMPRODUCT we just use the inverse of the multiplication for the array acting as denominator: 1/array as denominator. In our example we return the sum of the division of column 1 by column 2.

=SUMPRODUCT(A2:A6,1/B2:B6)

Another option would be using SUM as an array formula.

{=SUM(A2:A6/B2:B6)}

2020-06-30

Borrar rápidamente el formato condicional de todas las hojas de Excel

Title

Problema

Queremos borrar el formato condicional de todas las hojas de Excel. En lugar de ir hoja a hoja: Inicio, clic en Formato condicional > Borrar reglas > Borrar reglas de toda la hoja.

Solución

Una vez ejecutado el código no podremos deshacer los cambios. Por lo que es recomendable guardar una copia antes y salir sin guardar si no queremos preservar los cambios,

Sub delete_cond_format()
Application.ScreenUpdating = False
    For Each sht In Worksheets
     sht.Cells.FormatConditions.Delete
    Next
Application.ScreenUpdating = True
End Sub

Notas

Usamos la grabadora de macros mientras borramos el formato condicional de un rango de celdas.

Sub Macro1()
'
' Macro1 Macro
'
'
    Range("C5:E8").Select
    Cells.FormatConditions.Delete
End Sub

Después usamos ese código dentro de un bucle. Activo y desactivo la actualización de la pantalla, aunque es opcional.

Clear conditional formatting from all sheets at once in Excel

Title

Problem

We want to delete all conditional formatting in all sheets at once. Instead of clearing the rules sheet by sheet: Home tab, click on Conditional Formatting > Clear Rules > Clear Rules from Entire Sheet.

Solution

You cannot undo the changes after running the code. Save a copy of the workbook and close without saving the changes if you don't want to preserve them.

Sub delete_cond_format()
Application.ScreenUpdating = False
    For Each sht In Worksheets
     sht.Cells.FormatConditions.Delete
    Next
Application.ScreenUpdating = True
End Sub

Notes

We just use the macro recorder while deleting the conditional formatting of a range of cells:

Sub Macro1()
'
' Macro1 Macro
'
'
    Range("C5:E8").Select
    Cells.FormatConditions.Delete
End Sub

Then we just include that code inside a loop to delete the format of all the sheets. I turn screen updating off and on but this is optional.

2020-06-24

Aplying a function by row in R

Title

Problem

We want to apply a function to each row of a data frame. In our example, we'd like to calculate the min and median for each row of a data frame.

df <- structure(list(V1 = c(5L, 4L, 7L), V2 = c(8L, 9L, 3L), V3 = c(12L, 
5L, 9L)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, 
-3L))

Solution

dplyr

library(dplyr)
# Using the pipe operator %>%
df %>% 
  rowwise() %>% 
  mutate(min = min(V1, V2, V3), median = median(c(V1, V2, V3)))
# Without %>%
mutate(rowwise(df), min = min(V1, V2, V3), median = median(c(V1, V2, V3)))

Source: local data frame [3 x 5]
Groups: 

     V1    V2    V3   min median
  (int) (int) (int) (int)  (int)
1     5     8    12     5      8
2     4     9     5     4      5
3     7     3     9     3      7

Base package

df$min <- apply(df, 1, min)
df$median <- apply(df[, 1:3], 1, median)

  V1 V2 V3 min median
1  5  8 12   5      8
2  4  9  5   4      5
3  7  3  9   3      7

References

stackoverflow

2020-05-01

Create a mini IMDb database in R

Introduction

In a previous post we showed how to extract movie info R info using the imdbapi package. In this post we will create a mini imdb database using the same package.

Solution

If we use the free version, the maximum number of requests per day is 1,000. We need to request an API key here.

First we need a vector containing the movie titles or the IMDbIDs (e.g.: for Vertigo the last section of the url https://www.imdb.com/title/tt0052357/, the string tt0052357. In our example we will use the list containing the results from the Sight and Sound 2012 poll of 846 critics, these are the films receiving at least 3 votes.

library(imdbapi)
library(data.table)
library(tidyverse)
sight_sound <- read.csv("https://sites.google.com/site/nubededatosblogspotcom/Sight&Sound2012-CriticsPoll.txt", stringsAsFactors = FALSE)
glimpse(sight_sound)

Observations: 588
Variables: 17
$ const                          "tt0052357", "tt0033467", "tt004643...
$ position                       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ created                        "Thu Aug 16 07:42:05 2012", "Thu Au...
$ description                    NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ modified                       "Thu Aug 16 07:42:05 2012", "Thu Au...
$ Title                          "Vertigo", "Citizen Kane", "Tôkyô m...
$ Directors                      "Alfred Hitchcock", "Orson Welles",...
$ Title.type                     "Feature Film", "Feature Film", "Fe...
$ IMDb.Rating                    8.5, 8.5, 8.2, 8.0, 8.3, 8.3, 8.0, ...
$ PeacefulAnarchy.rated          10, 9, 10, 9, 9, 6, 6, 10, 8, 9, 6,...
$ Runtime..mins.                 128, 119, 136, 110, 94, 160, 119, 6...
$ Genres                         "mystery, romance, thriller", "dram...
$ Year                           1958, 1941, 1953, 1939, 1927, 1968,...
$ Num.Votes                      153502, 205699, 16219, 14872, 19188...
$ Release.Date..month.day.year.  "1958-05-09", "1941-05-01", "1953-1...
$ Id                             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ URL                            "http://www.imdb.com/title/tt005235...

We use the function lapply to extract the info for all IMDbIDs.

tt <-
  lapply(sight_sound$const, function(x) {
    return(tryCatch(
      find_by_id(
        x,
        type = NULL,
        year_of_release = NULL,
        plot = "full",
        include_tomatoes = TRUE,
        api_key = "12345678"
      ),
      error = function(e)
        NULL
    ))
  })
df_sight_sound <- rbindlist(tt, fill = TRUE)
df_sight_sound$Ratings <- as.character(df_sight_sound$Ratings)
df_sight_sound <- as.data.frame(df_sight_sound)
df_sight_sound %>% distinct(imdbID) %>% summarise(n= n())

    n
1 586

After running the code some titles may be missing. In our examples, two title. We will repeat the process until obtain all of them.

# Checking missing titles
m <- subset(sight_sound, !(const %in% df_sight_sound$imdbID))$const 
m

[1] "tt0115751" "tt0032551"

Finally, we keep distinct titles removing duplicates.

df_sight_sound <- df_sight_sound %>% 
  filter(grepl("Internet",Ratings)) %>% 
  group_by(imdbID) %>% 
  distinct()

# A tibble: 588 x 26
# Groups:   imdbID [588]
   Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
                       
 1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
 2 Citi~ 1941  PG    1941-09-05 119 min Dram~ Orson W~ Herma~ Josep~ "A g~
 3 Toky~ 1953  NOT ~ 1972-03-13 136 min Drama Yasujir~ Kôgo ~ Chish~ An e~
 4 The ~ 1939  NOT ~ 1950-04-08 110 min Come~ Jean Re~ Jean ~ Nora ~ Avia~
 5 Sunr~ 1927  NOT ~ 1927-11-04 94 min  Dram~ F.W. Mu~ Carl ~ Georg~ "In ~
 6 2001~ 1968  G     1968-05-12 149 min Adve~ Stanley~ Stanl~ Keir ~ "\"2~
 7 The ~ 1956  PASS~ 1956-05-26 119 min Adve~ John Fo~ Frank~ John ~ Etha~
 8 Man ~ 1929  NOT ~ 1929-05-12 68 min  Docu~ Dziga V~ Dziga~ Mikha~ This~
 9 The ~ 1928  NOT ~ 1928-10-25 114 min Biog~ Carl Th~ Josep~ Maria~ The ~
10 8½    1963  NOT ~ 1963-06-25 138 min Drama Federic~ Feder~ Marce~ Guid~
# ... with 578 more rows, and 16 more variables: Language ,
#   Country , Awards , Poster , Ratings ,
#   Metascore , imdbRating , imdbVotes , imdbID ,
#   Type , DVD , BoxOffice , Production , Website ,
#   Response , totalSeasons

To export the final results as a csv:

write.csv(df_sight_sound, "df_sight_sound.csv", row.names = FALSE)

Extract movie info using the OMDb API in R

2020-10-07

Problema

Solución

Referencias

2020-09-15

Problema

Solución

2020-08-01

Problem

Solution

2020-06-30

Problema

Solución

Notas

Problem

Solution

Notes

2020-06-24

Problem

Solution

References

2020-05-01

Introduction

Solution

Related posts

References