Nube de datos: group

Mostrando entradas con la etiqueta group_by. Mostrar todas las entradas

2020-11-24

How to aggregate multiple variables with different functions in R

Title

Problem

We want to group the following data frame by Branch, using different functions to summarize the results.

  Branch Loan_Amount TAT
1      A         100 2.0
2      A         120 4.0
3      A         300 9.0
4      B         150 1.5
5      B         200 2.0

This is the expected outcome. We want to group by Branch, count the Number_of_loans, and sum Loan_Amount and TAT.

  Branch Number_of_loans Loan_Amount  TAT
1      A               3         520 15.0
2      B               2         350  3.5

Data

df <- structure(list(Branch = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), Loan_Amount = c(100L, 120L, 300L, 150L, 
200L), TAT = c(2, 4, 9, 1.5, 2)), .Names = c("Branch", "Loan_Amount", 
"TAT"), class = "data.frame", row.names = c(NA, -5L))

Options

Base package

df1 <- aggregate(.~ Branch, df, FUN = "sum")
df2 <- setNames(aggregate(Loan_Amount~Branch, df, length)[2], c("Number_of_loans"))
cbind(df1, df2)

  Branch Loan_Amount  TAT Number_of_loans
1      A         520 15.0               3
2      B         350  3.5               2

dplyr

library(dplyr)
df %>% 
  group_by(Branch) %>% 
  summarise(Number_of_loans = n(),
            Loan_Amount = sum(Loan_Amount),
            TAT = sum(TAT))

Source: local data frame [2 x 4]

  Branch Number_of_loans Loan_Amount   TAT
  (fctr)           (int)       (int) (dbl)
1      A               3         520  15.0
2      B               2         350   3.5

sqldf

library(sqldf)
sqldf("SELECT Branch, 
              COUNT(Loan_Amount) Number_of_loans, 
              SUM(Loan_Amount) Loan_Amount, 
              SUM(TAT) TAT 
      FROM df 
      GROUP BY Branch")

  Branch Number_of_loans Loan_Amount  TAT
1      A               3         520 15.0
2      B               2         350  3.5

References

stackoverflow

2020-05-01

Create a mini IMDb database in R

Introduction

In a previous post we showed how to extract movie info R info using the imdbapi package. In this post we will create a mini imdb database using the same package.

Solution

If we use the free version, the maximum number of requests per day is 1,000. We need to request an API key here.

First we need a vector containing the movie titles or the IMDbIDs (e.g.: for Vertigo the last section of the url https://www.imdb.com/title/tt0052357/, the string tt0052357. In our example we will use the list containing the results from the Sight and Sound 2012 poll of 846 critics, these are the films receiving at least 3 votes.

library(imdbapi)
library(data.table)
library(tidyverse)
sight_sound <- read.csv("https://sites.google.com/site/nubededatosblogspotcom/Sight&Sound2012-CriticsPoll.txt", stringsAsFactors = FALSE)
glimpse(sight_sound)

Observations: 588
Variables: 17
$ const                          "tt0052357", "tt0033467", "tt004643...
$ position                       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ created                        "Thu Aug 16 07:42:05 2012", "Thu Au...
$ description                    NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ modified                       "Thu Aug 16 07:42:05 2012", "Thu Au...
$ Title                          "Vertigo", "Citizen Kane", "Tôkyô m...
$ Directors                      "Alfred Hitchcock", "Orson Welles",...
$ Title.type                     "Feature Film", "Feature Film", "Fe...
$ IMDb.Rating                    8.5, 8.5, 8.2, 8.0, 8.3, 8.3, 8.0, ...
$ PeacefulAnarchy.rated          10, 9, 10, 9, 9, 6, 6, 10, 8, 9, 6,...
$ Runtime..mins.                 128, 119, 136, 110, 94, 160, 119, 6...
$ Genres                         "mystery, romance, thriller", "dram...
$ Year                           1958, 1941, 1953, 1939, 1927, 1968,...
$ Num.Votes                      153502, 205699, 16219, 14872, 19188...
$ Release.Date..month.day.year.  "1958-05-09", "1941-05-01", "1953-1...
$ Id                             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ URL                            "http://www.imdb.com/title/tt005235...

We use the function lapply to extract the info for all IMDbIDs.

tt <-
  lapply(sight_sound$const, function(x) {
    return(tryCatch(
      find_by_id(
        x,
        type = NULL,
        year_of_release = NULL,
        plot = "full",
        include_tomatoes = TRUE,
        api_key = "12345678"
      ),
      error = function(e)
        NULL
    ))
  })
df_sight_sound <- rbindlist(tt, fill = TRUE)
df_sight_sound$Ratings <- as.character(df_sight_sound$Ratings)
df_sight_sound <- as.data.frame(df_sight_sound)
df_sight_sound %>% distinct(imdbID) %>% summarise(n= n())

    n
1 586

After running the code some titles may be missing. In our examples, two title. We will repeat the process until obtain all of them.

# Checking missing titles
m <- subset(sight_sound, !(const %in% df_sight_sound$imdbID))$const 
m

[1] "tt0115751" "tt0032551"

Finally, we keep distinct titles removing duplicates.

df_sight_sound <- df_sight_sound %>% 
  filter(grepl("Internet",Ratings)) %>% 
  group_by(imdbID) %>% 
  distinct()

# A tibble: 588 x 26
# Groups:   imdbID [588]
   Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
                       
 1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
 2 Citi~ 1941  PG    1941-09-05 119 min Dram~ Orson W~ Herma~ Josep~ "A g~
 3 Toky~ 1953  NOT ~ 1972-03-13 136 min Drama Yasujir~ Kôgo ~ Chish~ An e~
 4 The ~ 1939  NOT ~ 1950-04-08 110 min Come~ Jean Re~ Jean ~ Nora ~ Avia~
 5 Sunr~ 1927  NOT ~ 1927-11-04 94 min  Dram~ F.W. Mu~ Carl ~ Georg~ "In ~
 6 2001~ 1968  G     1968-05-12 149 min Adve~ Stanley~ Stanl~ Keir ~ "\"2~
 7 The ~ 1956  PASS~ 1956-05-26 119 min Adve~ John Fo~ Frank~ John ~ Etha~
 8 Man ~ 1929  NOT ~ 1929-05-12 68 min  Docu~ Dziga V~ Dziga~ Mikha~ This~
 9 The ~ 1928  NOT ~ 1928-10-25 114 min Biog~ Carl Th~ Josep~ Maria~ The ~
10 8½    1963  NOT ~ 1963-06-25 138 min Drama Federic~ Feder~ Marce~ Guid~
# ... with 578 more rows, and 16 more variables: Language ,
#   Country , Awards , Poster , Ratings ,
#   Metascore , imdbRating , imdbVotes , imdbID ,
#   Type , DVD , BoxOffice , Production , Website ,
#   Response , totalSeasons

To export the final results as a csv:

write.csv(df_sight_sound, "df_sight_sound.csv", row.names = FALSE)

Extract movie info using the OMDb API in R

References

2018-11-29

Crear una pequeña base de datos de IMDb con R

Introducción

En una entrada anterior usamos la OMDb API para extraer con R información sobre películas o series de televisión. En esta ocasión queremos crear una pequeña base datos con el mismo paquete imdbapi.

Solución

Empleamos el paquete imdbapi que nos permite extraer dicha información. Si utilizamos la versión gratuita, tendremos una limitación de 1.000 peticiones al día.

Lo primero que necesitamos es un vector con títulos de películas o de IMDbIDs (por ejemplo: para Vértigo la parte final de la dirección https://www.imdb.com/title/tt0052357/, la cadena tt0052357. En nuestro ejemplo usamos la encuesta de los críticos Sight & Sound de 2012, que contiene la columna const con dichos IMDbIDs .

library(imdbapi)
library(data.table)
library(tidyverse)
sight_sound <- read.csv("https://sites.google.com/site/nubededatosblogspotcom/Sight&Sound2012-CriticsPoll.txt", stringsAsFactors = FALSE)
glimpse(sight_sound)

Observations: 588
Variables: 17
$ const                          "tt0052357", "tt0033467", "tt004643...
$ position                       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ created                        "Thu Aug 16 07:42:05 2012", "Thu Au...
$ description                    NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ modified                       "Thu Aug 16 07:42:05 2012", "Thu Au...
$ Title                          "Vertigo", "Citizen Kane", "Tôkyô m...
$ Directors                      "Alfred Hitchcock", "Orson Welles",...
$ Title.type                     "Feature Film", "Feature Film", "Fe...
$ IMDb.Rating                    8.5, 8.5, 8.2, 8.0, 8.3, 8.3, 8.0, ...
$ PeacefulAnarchy.rated          10, 9, 10, 9, 9, 6, 6, 10, 8, 9, 6,...
$ Runtime..mins.                 128, 119, 136, 110, 94, 160, 119, 6...
$ Genres                         "mystery, romance, thriller", "dram...
$ Year                           1958, 1941, 1953, 1939, 1927, 1968,...
$ Num.Votes                      153502, 205699, 16219, 14872, 19188...
$ Release.Date..month.day.year.  "1958-05-09", "1941-05-01", "1953-1...
$ Id                             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
$ URL                            "http://www.imdb.com/title/tt005235...

Empleamos la función lapply para extraer la información de todos los IMDbIDs.

tt <-
  lapply(sight_sound$const, function(x) {
    return(tryCatch(
      find_by_id(
        x,
        type = NULL,
        year_of_release = NULL,
        plot = "full",
        include_tomatoes = TRUE,
        api_key = "12345678"
      ),
      error = function(e)
        NULL
    ))
  })
df_sight_sound <- rbindlist(tt, fill = TRUE)
df_sight_sound$Ratings <- as.character(df_sight_sound$Ratings)
df_sight_sound <- as.data.frame(df_sight_sound)
df_sight_sound %>% distinct(imdbID) %>% summarise(n= n())

    n
1 586

En una sola pasada suelen faltar algunos. En este caso nos faltan dos títulos. Repetiríamos el proceso hasta obtener todos los títulos.

# Comprobamos los titulos no encontrados
m <- subset(sight_sound, !(const %in% df_sight_sound$imdbID))$const 
m

[1] "tt0115751" "tt0032551"

Finalmente procesamos el data frame para eliminar duplicados.

df_sight_sound <- df_sight_sound %>% 
  filter(grepl("Internet",Ratings)) %>% 
  group_by(imdbID) %>% 
  distinct()

# A tibble: 588 x 26
# Groups:   imdbID [588]
   Title Year  Rated Released   Runtime Genre Director Writer Actors Plot 
                       
 1 Vert~ 1958  PG    1958-07-21 128 min Myst~ Alfred ~ "Alec~ James~ "Joh~
 2 Citi~ 1941  PG    1941-09-05 119 min Dram~ Orson W~ Herma~ Josep~ "A g~
 3 Toky~ 1953  NOT ~ 1972-03-13 136 min Drama Yasujir~ Kôgo ~ Chish~ An e~
 4 The ~ 1939  NOT ~ 1950-04-08 110 min Come~ Jean Re~ Jean ~ Nora ~ Avia~
 5 Sunr~ 1927  NOT ~ 1927-11-04 94 min  Dram~ F.W. Mu~ Carl ~ Georg~ "In ~
 6 2001~ 1968  G     1968-05-12 149 min Adve~ Stanley~ Stanl~ Keir ~ "\"2~
 7 The ~ 1956  PASS~ 1956-05-26 119 min Adve~ John Fo~ Frank~ John ~ Etha~
 8 Man ~ 1929  NOT ~ 1929-05-12 68 min  Docu~ Dziga V~ Dziga~ Mikha~ This~
 9 The ~ 1928  NOT ~ 1928-10-25 114 min Biog~ Carl Th~ Josep~ Maria~ The ~
10 8½    1963  NOT ~ 1963-06-25 138 min Drama Federic~ Feder~ Marce~ Guid~
# ... with 578 more rows, and 16 more variables: Language ,
#   Country , Awards , Poster , Ratings ,
#   Metascore , imdbRating , imdbVotes , imdbID ,
#   Type , DVD , BoxOffice , Production , Website ,
#   Response , totalSeasons

Y tendremos lista nuestra pequeña base de datos de IMDb. Si queremos exportar los resultados como csv:

write.csv(df_sight_sound, "df_sight_sound.csv", row.names = FALSE)

Entradas relacionadas

Referencias

2018-03-06

Filtrar un data frame basado en intervalos de tiempo en R

Problema

Para cada una de las Species queremos quedarnos con la primera imagen (columna ID) de cada intervalo de una hora empezando por la fecha inicial: 2015-03-16 18:42:00. Es decir, para la Specie A, nos queremos quedar con P1, P3 y P4. P2 no la consideraríamos pues está dentro del intervalo de una hora a partir de P1, entre las 18:42 y las 19:41.

  ID Species            DateTime
1 P1       A 2015-03-16 18:42:00
2 P2       A 2015-03-16 19:34:00
3 P3       A 2015-03-16 19:58:00
4 P4       A 2015-03-16 21:02:00
5 P5       B 2015-03-16 21:18:00
6 P6       A 2015-03-16 21:19:00
7 P7       A 2015-03-16 21:33:00
8 P8       B 2015-03-16 21:35:00
9 P9       B 2015-03-16 23:43:00

Datos

df <- read.table(
  text = 'ID   Species       DateTime
  P1   A            "2015-03-16 18:42:00"
  P2   A             "2015-03-16 19:34:00"
  P3   A             "2015-03-16 19:58:00"
  P4   A             "2015-03-16 21:02:00"
  P5   B             "2015-03-16 21:18:00"
  P6   A             "2015-03-16 21:19:00"
  P7   A             "2015-03-16 21:33:00"
  P8   B             "2015-03-16 21:35:00"
  P9   B             "2015-03-16 23:43:00"',
  stringsAsFactors = FALSE,
  header = TRUE
)

Solución

Creamos una nueva columna con los intervalos cada 60 minutos y nos quedamos con la primera ocurrencia para cada una de las Species. Es importante señalar que dentro de la función cut tenemos que especificar 60 minutos y no una hora ("1 hour"), o de lo contrario el intervalo no tendría en cuenta los minutos sino solamente las horas. Es decir, comenzaría el primer intervalo a las 18:00 y no a las 18:42.

library(dplyr)
df$DateTime <- as.POSIXct(df$DateTime)
df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1) %>%
  ungroup() %>%
  select(-by60)

Resultados

# A tibble: 5 x 3
  ID    Species DateTime           
                   
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00

Referencias

stackoverflow

2016-05-30

Añadir una columna con el min y max por grupo en R

Title

Problema

Deseamos añadir dos columnas con el mínimo y máximo de la columna old.var para cada grupo de la columna id: 1, 2 y 3.

  id old.var
1  1       1
2  1       2
3  1       3
4  2       5
5  2       7
6  2       9
7  2      11
8  3       3
9  3       4

Datos originales

df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), old.var = c(1L, 
2L, 3L, 5L, 7L, 9L, 11L, 3L, 4L)), .Names = c("id", "old.var"
), class = "data.frame", row.names = c(NA, -9L))

Solución

Paquete base

df$min.var <- ave(df$old.var, df$id, FUN = min)
df$max.var <- ave(df$old.var, df$id, FUN = max)

Paquete dplyr

library(dplyr)
df %>% 
  group_by(id) %>%
  mutate(min.var = min(old.var), max.var = max(old.var))

  id old.var min.var max.var
1  1       1       1       3
2  1       2       1       3
3  1       3       1       3
4  2       5       5      11
5  2       7       5      11
6  2       9       5      11
7  2      11       5      11
8  3       3       3       4
9  3       4       3       4

Referencias

stackoveflow

2016-02-13

Añadir una columna con el número de ocurrencias sin agrupar en Access

Title

Problema

Queremos añadir una columna con el número de ocurrencias de otra columna sin agrupar el resultados. Tomamos como ejemplo la tabla Detalles de pedidos de la base de datos Neptuno.mdb. Queremos añadir una columna n, que cuente el número de filas —de productos— de cada pedido.

Solución

En Vista SQL, creamos una subconsulta para obtener el número de productos por pedido y la llamamos T2. Unimos la tabla Detalles de pedidos con esa subconsulta mediante el campo Id.Pedido, añadiendo el campo T2.n con el número de ocurrencias.

SELECT [Detalles de pedidos].IdPedido, 
    [Detalles de pedidos].IdProducto, 
    [Detalles de pedidos].PrecioUnidad, 
    [Detalles de pedidos].Cantidad, 
    [Detalles de pedidos].Descuento, 
    T2.n
FROM [Detalles de pedidos] 
INNER JOIN 
(SELECT [Detalles de pedidos].IdPedido, 
  Count([Detalles de pedidos].IdPedido) AS n
FROM [Detalles de pedidos]
GROUP BY [Detalles de pedidos].IdPedido)  AS T2 
ON [Detalles de pedidos].IdPedido = T2.IdPedido;

Resultado

En la columna n tenemos el número de productos de cada pedido. Sería el equivalente en Excel a usar la función CONTAR.SI con la columna IdPedido en el argumento rango y los valores de cada fila de IdPedido en el argumento criterio.

Entradas relacionadas

Ordenar un data frame por la frecuencia de una columna en R

2015-11-14

Contar el número de ocurrencias basado en dos columnas en R

Title

Problema

Deseamos añadir un contador con el número de ocurrencias basado en las columnas timey site. Con la particularidad de que la agrupación de time sea por día, si no, cada fila sería una ocurrencia distinta. Es decir, cada vez que la combinación de día y site diferente, reiniciará el contador.

                  time site val
1  2014-09-01 00:00:00 2001   1
2  2014-09-01 00:15:00 2001   0
3  2014-09-01 00:30:00 2001   2
4  2014-09-01 00:45:00 2001   0
5  2014-09-01 00:00:00 2002   1
6  2014-09-01 00:15:00 2002   0
7  2014-09-01 00:30:00 2002   2
8  2014-09-02 00:45:00 2001   0
9  2014-09-02 00:00:00 2001   1
10 2014-09-02 00:15:00 2001   0
11 2014-09-02 00:30:00 2001   2
12 2014-09-02 00:45:00 2001   0
13 2014-09-02 00:00:00 2002   1
14 2014-09-02 00:15:00 2002   0
15 2014-09-02 00:30:00 2002   2
16 2014-09-02 00:45:00 2001   0

df <- structure(list(time = structure(c(1409522400, 1409523300, 1409524200, 1409525100, 1409522400, 1409523300, 1409524200, 1409611500, 1409608800, 1409609700, 1409610600, 1409611500, 1409608800, 1409609700, 1409610600, 1409611500), class = c("POSIXct", "POSIXt"), tzone = ""), site = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("2001", "2002"), class = "factor"), val = c(1L, 0L, 2L, 0L, 1L, 0L, 2L, 0L, 1L, 0L, 2L, 0L, 1L, 0L, 2L, 0L)), .Names = c("time", "site", "val"), row.names = c(NA, -16L), class = "data.frame")

Solución

Utilizamos dplyr. Primero creamos una columna id, extrayendo el día de la fecha (columna time). Después agrupamos por site e id, y añadimos una nueva variable countercon el número de ocurrencias para esos dos grupos.

library(dplyr)
df$id <- as.factor(format(df$time,'%d'))
library(dplyr)
df %>% group_by(site, id) %>% mutate(counter = row_number())

Ahora tenemos un contador que empieza desde 1 cada vez que el día y el site cambian. Si se vuelven a repetir, como en la última fila, continua con la numeración.

Source: local data frame [16 x 5]
Groups: site, id

                  time site val id counter
1  2014-09-01 00:00:00 2001   1 01       1
2  2014-09-01 00:15:00 2001   0 01       2
3  2014-09-01 00:30:00 2001   2 01       3
4  2014-09-01 00:45:00 2001   0 01       4
5  2014-09-01 00:00:00 2002   1 01       1
6  2014-09-01 00:15:00 2002   0 01       2
7  2014-09-01 00:30:00 2002   2 01       3
8  2014-09-02 00:45:00 2001   0 02       1
9  2014-09-02 00:00:00 2001   1 02       2
10 2014-09-02 00:15:00 2001   0 02       3
11 2014-09-02 00:30:00 2001   2 02       4
12 2014-09-02 00:45:00 2001   0 02       5
13 2014-09-02 00:00:00 2002   1 02       1
14 2014-09-02 00:15:00 2002   0 02       2
15 2014-09-02 00:30:00 2002   2 02       3
16 2014-09-02 00:45:00 2001   0 02       6

Referencias

stackoverflow

2020-11-24

Problem

Options

References

2020-05-01

Introduction

Solution

Related posts

References

2018-11-29

Introducción

Solución

Entradas relacionadas

Referencias

2018-03-06

Problema

Solución

Resultados

Referencias

2016-05-30

Problema

Solución

Referencias

2016-02-13

Problema

Solución

Resultado

Entradas relacionadas

2015-11-14

Problema

Solución

Referencias