2020-10-07

Filtrar filas que contengan una cadena de texto en R

Title

Problema

Queremos filtrar las filas de un data frame que contengan una determinada cadena de texto. En nuestro ejemplo las filas que contengan la cadena 'foo'.

foo <- data.frame(Company = c("company1", "foo", "test", "food"), Metric = rnorm(4, 10))

   Company    Metric
1 company1  7.590178
2      foo  9.711493
3     test 10.862799
4     food  9.337434

Solución

  • dplyr
  • Usamos la función grepl dentro de filter.

    library(dplyr)
    filter(foo, grepl("foo", Company))
    

      Company   Metric
    1     foo 9.711493
    2    food 9.337434
    
  • data.table
  • Otra opción sería usar la función like de data.table, con una sintaxis similar a SQL.

    library(data.table)
    DT <- data.table(foo)
    DT[Company %like% 'foo']
    
       Company   Metric
    1:     foo 9.711493
    2:    food 9.337434
    

    Referencias

    2020-09-15

    Equivalente a SUMADIVISION en Excel

    Title

    Problema

    Queremos crear el equivalente a la función SUMADIVISION en Excel, de manera que devuelva la suma de la división de un rango o matriz, el inverso de SUMAPRODUCTO.

    Solución

    Con SUMAPRODUCTO usamos el inverso de la matriz por la que dividiremos: 1/matriz como denominador. En nuestro ejemplo dividimos la columna 1 por la 2.

    =SUMAPRODUCTO(A2:A6,1/B2:B6)
    

    Otra opción sería usar SUMA como una fórmula matricial.

    {=SUMA(A2:A6/B2:B6)}
    

    2020-08-01

    Equivalent of SUMDIVIDE in Excel

    Title

    Problem

    We want to create the equivalent of the function SUMDIVIDE in Excel, returning the sum of the division of ranges or arrays, the inverse of SUMPRODUCT.

    Solution

    With SUMPRODUCT we just use the inverse of the multiplication for the array acting as denominator: 1/array as denominator. In our example we return the sum of the division of column 1 by column 2.

    =SUMPRODUCT(A2:A6,1/B2:B6)
    


    Another option would be using SUM as an array formula.

    {=SUM(A2:A6/B2:B6)}
    

    2020-06-30

    Borrar rápidamente el formato condicional de todas las hojas de Excel

    Title

    Problema

    Queremos borrar el formato condicional de todas las hojas de Excel. En lugar de ir hoja a hoja: Inicio, clic en Formato condicional > Borrar reglas > Borrar reglas de toda la hoja.

    Solución

    Una vez ejecutado el código no podremos deshacer los cambios. Por lo que es recomendable guardar una copia antes y salir sin guardar si no queremos preservar los cambios,

    Sub delete_cond_format()
    Application.ScreenUpdating = False
        For Each sht In Worksheets
         sht.Cells.FormatConditions.Delete
        Next
    Application.ScreenUpdating = True
    End Sub
    

    Notas

    Usamos la grabadora de macros mientras borramos el formato condicional de un rango de celdas.

    Sub Macro1()
    '
    ' Macro1 Macro
    '
    '
        Range("C5:E8").Select
        Cells.FormatConditions.Delete
    End Sub
    
    
    Después usamos ese código dentro de un bucle. Activo y desactivo la actualización de la pantalla, aunque es opcional.

    Clear conditional formatting from all sheets at once in Excel

    Title

    Problem

    We want to delete all conditional formatting in all sheets at once. Instead of clearing the rules sheet by sheet: Home tab, click on Conditional Formatting > Clear Rules > Clear Rules from Entire Sheet.

    Solution

    You cannot undo the changes after running the code. Save a copy of the workbook and close without saving the changes if you don't want to preserve them.

    Sub delete_cond_format()
    Application.ScreenUpdating = False
        For Each sht In Worksheets
         sht.Cells.FormatConditions.Delete
        Next
    Application.ScreenUpdating = True
    End Sub
    

    Notes

    We just use the macro recorder while deleting the conditional formatting of a range of cells:

    Sub Macro1()
    '
    ' Macro1 Macro
    '
    '
        Range("C5:E8").Select
        Cells.FormatConditions.Delete
    End Sub
    
    
    Then we just include that code inside a loop to delete the format of all the sheets. I turn screen updating off and on but this is optional.

    2020-06-24

    Aplying a function by row in R

    Title

    Problem

    We want to apply a function to each row of a data frame. In our example, we'd like to calculate the min and median for each row of a data frame.

    df <- structure(list(V1 = c(5L, 4L, 7L), V2 = c(8L, 9L, 3L), V3 = c(12L, 
    5L, 9L)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, 
    -3L))
    
     V1 V2 V3
    1  5  8 12
    2  4  9  5
    3  7  3  9
    

    Solution

  • dplyr
  • library(dplyr)
    # Using the pipe operator %>%
    df %>% 
      rowwise() %>% 
      mutate(min = min(V1, V2, V3), median = median(c(V1, V2, V3)))
    # Without %>%
    mutate(rowwise(df), min = min(V1, V2, V3), median = median(c(V1, V2, V3)))
    

    Source: local data frame [3 x 5]
    Groups: 
    
         V1    V2    V3   min median
      (int) (int) (int) (int)  (int)
    1     5     8    12     5      8
    2     4     9     5     4      5
    3     7     3     9     3      7
    
  • Base package
  • df$min <- apply(df, 1, min) df$median <- apply(df[, 1:3], 1, median)

      V1 V2 V3 min median
    1  5  8 12   5      8
    2  4  9  5   4      5
    3  7  3  9   3      7
    

    References

    2020-05-01

    Create a mini IMDb database in R

    Introduction

    In a previous post we showed how to extract movie info R info using the imdbapi package. In this post we will create a mini imdb database using the same package.

    Solution

    If we use the free version, the maximum number of requests per day is 1,000. We need to request an API key here.

    First we need a vector containing the movie titles or the IMDbIDs (e.g.: for Vertigo the last section of the url https://www.imdb.com/title/tt0052357/, the string tt0052357. In our example we will use the list containing the results from the Sight and Sound 2012 poll of 846 critics, these are the films receiving at least 3 votes.

    library(imdbapi)
    library(data.table)
    library(tidyverse)
    sight_sound <- read.csv("https://sites.google.com/site/nubededatosblogspotcom/Sight&Sound2012-CriticsPoll.txt", stringsAsFactors = FALSE)
    glimpse(sight_sound)
    
    Observations: 588
    Variables: 17
    $ const                          "tt0052357", "tt0033467", "tt004643...
    $ position                       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
    $ created                        "Thu Aug 16 07:42:05 2012", "Thu Au...
    $ description                    NA, NA, NA, NA, NA, NA, NA, NA, NA,...
    $ modified                       "Thu Aug 16 07:42:05 2012", "Thu Au...
    $ Title                          "Vertigo", "Citizen Kane", "Tôkyô m...
    $ Directors                      "Alfred Hitchcock", "Orson Welles",...
    $ Title.type                     "Feature Film", "Feature Film", "Fe...
    $ IMDb.Rating                    8.5, 8.5, 8.2, 8.0, 8.3, 8.3, 8.0, ...
    $ PeacefulAnarchy.rated          10, 9, 10, 9, 9, 6, 6, 10, 8, 9, 6,...
    $ Runtime..mins.                 128, 119, 136, 110, 94, 160, 119, 6...
    $ Genres                         "mystery, romance, thriller", "dram...
    $ Year                           1958, 1941, 1953, 1939, 1927, 1968,...
    $ Num.Votes                      153502, 205699, 16219, 14872, 19188...
    $ Release.Date..month.day.year.  "1958-05-09", "1941-05-01", "1953-1...
    $ Id                             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
    $ URL                            "http://www.imdb.com/title/tt005235...
    
    We use the function lapply to extract the info for all IMDbIDs.

    tt <-
      lapply(sight_sound$const, function(x) {
        return(tryCatch(
          find_by_id(
            x,
            type = NULL,
            year_of_release = NULL,
            plot = "full",
            include_tomatoes = TRUE,
            api_key = "12345678"
          ),
          error = function(e)
            NULL
        ))
      })
    df_sight_sound <- rbindlist(tt, fill = TRUE)
    df_sight_sound$Ratings <- as.character(df_sight_sound$Ratings)
    df_sight_sound <- as.data.frame(df_sight_sound)
    df_sight_sound %>% distinct(imdbID) %>% summarise(n= n())
        n
    1 586
    After running the code some titles may be missing. In our examples, two title. We will repeat the process until obtain all of them.

    # Checking missing titles
    m <- subset(sight_sound, !(const %in% df_sight_sound$imdbID))$const 
    m
    [1] "tt0115751" "tt0032551"
    Finally, we keep distinct titles removing duplicates.

    df_sight_sound <- df_sight_sound %>% 
      filter(grepl("Internet",Ratings)) %>% 
      group_by(imdbID) %>% 
      distinct()
    
    # A tibble: 588 x 26
    # Groups:   imdbID [588]
       Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
                           
     1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
     2 Citi~ 1941  PG    1941-09-05 119 min Dram~ Orson W~ Herma~ Josep~ "A g~
     3 Toky~ 1953  NOT ~ 1972-03-13 136 min Drama Yasujir~ Kôgo ~ Chish~ An e~
     4 The ~ 1939  NOT ~ 1950-04-08 110 min Come~ Jean Re~ Jean ~ Nora ~ Avia~
     5 Sunr~ 1927  NOT ~ 1927-11-04 94 min  Dram~ F.W. Mu~ Carl ~ Georg~ "In ~
     6 2001~ 1968  G     1968-05-12 149 min Adve~ Stanley~ Stanl~ Keir ~ "\"2~
     7 The ~ 1956  PASS~ 1956-05-26 119 min Adve~ John Fo~ Frank~ John ~ Etha~
     8 Man ~ 1929  NOT ~ 1929-05-12 68 min  Docu~ Dziga V~ Dziga~ Mikha~ This~
     9 The ~ 1928  NOT ~ 1928-10-25 114 min Biog~ Carl Th~ Josep~ Maria~ The ~
    10 8½    1963  NOT ~ 1963-06-25 138 min Drama Federic~ Feder~ Marce~ Guid~
    # ... with 578 more rows, and 16 more variables: Language ,
    #   Country , Awards , Poster , Ratings ,
    #   Metascore , imdbRating , imdbVotes , imdbID ,
    #   Type , DVD , BoxOffice , Production , Website ,
    #   Response , totalSeasons 
    
    To export the final results as a csv:

    write.csv(df_sight_sound, "df_sight_sound.csv", row.names = FALSE)
    

    Related posts

    References

    Nube de datos