Nube de datos: Select

Showing posts with the label Select.

2020-11-24

How to select all columns in dplyr

Problem

We want to select all variables in a data frame using dplyr.

Solution

library(dplyr)
select(iris, everything())
# Using the pipe operator %>%
iris %>% select(everything())

Notes

The function select() subsets columns by name or by other properties such as type. The selection helper everything() matches all variables, so it can be used inside select() to return every column.
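
As a small illustration of the helper (a sketch, assuming dplyr is installed; moving Species is only an example), everything() can also follow explicitly named columns, in which case it matches whatever has not been selected yet:

```r
library(dplyr)

# everything() on its own keeps all columns in their original order
all_cols <- select(iris, everything())
names(all_cols)

# After a named column, everything() fills in the remaining columns,
# so Species moves to the front and nothing is dropped
species_first <- select(iris, Species, everything())
names(species_first)
```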

Results

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...

2020-11-21

How to reorder all other columns in a data frame in dplyr

Problem

When using select() in dplyr, we want to move a few columns to the beginning or end of a data frame without typing the names of all the other columns. In the following example, the data frame flights has 19 columns and we want to place 5 of them at the beginning or at the end.

library(nycflights13)
library(dplyr)
head(flights)

# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Solution

5 columns at the beginning of the data frame.

col <- c("carrier", "tailnum", "year", "month", "day")
select(flights, one_of(col), everything())

# A tibble: 336,776 × 19
   carrier tailnum  year month   day dep_time sched_dep_time dep_delay arr_time
     <chr>   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1       UA  N14228  2013     1     1      517            515         2      830
2       UA  N24211  2013     1     1      533            529         4      850
3       AA  N619AA  2013     1     1      542            540         2      923
4       B6  N804JB  2013     1     1      544            545        -1     1004
5       DL  N668DN  2013     1     1      554            600        -6      812
6       UA  N39463  2013     1     1      554            558        -4      740
7       B6  N516JB  2013     1     1      555            600        -5      913
8       EV  N829AS  2013     1     1      557            600        -3      709
9       B6  N593JB  2013     1     1      557            600        -3      838
10      AA  N3ALAA  2013     1     1      558            600        -2      753
# ... with 336,766 more rows, and 10 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, flight <int>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

5 columns at the end of the data frame.

select(flights, -one_of(col), one_of(col))

# A tibble: 336,776 × 19
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay flight
      <int>          <int>     <dbl>    <int>          <int>     <dbl>  <int>
1       517            515         2      830            819        11   1545
2       533            529         4      850            830        20   1714
3       542            540         2      923            850        33   1141
4       544            545        -1     1004           1022       -18    725
5       554            600        -6      812            837       -25    461
6       554            558        -4      740            728        12   1696
7       555            600        -5      913            854        19    507
8       557            600        -3      709            723       -14   5708
9       557            600        -3      838            846        -8     79
10      558            600        -2      753            745         8    301
# ... with 336,766 more rows, and 12 more variables: origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#   carrier <chr>, tailnum <chr>, year <int>, month <int>, day <int>

To add all columns

Here we instead bind the 5 selected columns to the full data frame, at the beginning or at the end, so those 5 columns appear twice.

# 5 columns at the beginning
bind_cols(select(flights, one_of(col)), flights)
# 5 columns at the end
bind_cols(flights, select(flights, one_of(col)))
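
In dplyr 1.0.0 and later, one_of() is superseded by all_of() and any_of(). The sketch below repeats both reorderings with all_of(). It uses the built-in mtcars data and an arbitrary pair of columns, both chosen here only so the example runs without nycflights13.

```r
library(dplyr)

# Illustrative choice of columns to move (not from the original example)
cols <- c("hp", "wt")

# Chosen columns first, then all remaining columns
front <- select(mtcars, all_of(cols), everything())

# All other columns first, then the chosen columns
back <- select(mtcars, !all_of(cols), all_of(cols))

names(front)
names(back)
```

Unlike one_of(), all_of() fails with an error if any name in cols is missing from the data frame, which makes typos easier to catch.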

References

stackoverflow

2018-03-06

How to filter a data frame based on time intervals in R

Problem

For each Species we want to keep the first image (column ID) of each one-hour interval, starting from the initial date-time 2015-03-16 18:42:00. For Species A, that means keeping P1, P3 and P4. P2 is dropped because it falls within one hour of P1, between 18:42 and 19:41.

  ID Species            DateTime
1 P1       A 2015-03-16 18:42:00
2 P2       A 2015-03-16 19:34:00
3 P3       A 2015-03-16 19:58:00
4 P4       A 2015-03-16 21:02:00
5 P5       B 2015-03-16 21:18:00
6 P6       A 2015-03-16 21:19:00
7 P7       A 2015-03-16 21:33:00
8 P8       B 2015-03-16 21:35:00
9 P9       B 2015-03-16 23:43:00

Data

df <- read.table(
  text = 'ID   Species       DateTime
  P1   A            "2015-03-16 18:42:00"
  P2   A             "2015-03-16 19:34:00"
  P3   A             "2015-03-16 19:58:00"
  P4   A             "2015-03-16 21:02:00"
  P5   B             "2015-03-16 21:18:00"
  P6   A             "2015-03-16 21:19:00"
  P7   A             "2015-03-16 21:33:00"
  P8   B             "2015-03-16 21:35:00"
  P9   B             "2015-03-16 23:43:00"',
  stringsAsFactors = FALSE,
  header = TRUE
)

Solution

We create a new column with 60-minute intervals and keep the first occurrence within each interval for each Species. Note that in cut() we must specify "60 min" rather than "1 hour"; otherwise the intervals are aligned to whole hours and ignore the minutes, so the first interval would start at 18:00 instead of 18:42.

library(dplyr)
df$DateTime <- as.POSIXct(df$DateTime)
df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1) %>%
  ungroup() %>%
  select(-by60)
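
The difference between the two break specifications can be checked on a few of the timestamps above. This is a minimal base R sketch; the explicit tz = "UTC" is an assumption added here so the result does not depend on the local time zone.

```r
# Three of the timestamps from the example, with an explicit time zone
x <- as.POSIXct(
  c("2015-03-16 18:42:00", "2015-03-16 19:34:00", "2015-03-16 19:58:00"),
  tz = "UTC"
)

# Interval index of each timestamp under the two break specifications
by_60min <- as.integer(cut(x, "60 min"))  # intervals start at 18:42
by_hour  <- as.integer(cut(x, "hour"))    # intervals start at 18:00

by_60min  # 1 1 2: 19:34 shares the first interval with 18:42
by_hour   # 1 2 2: 19:34 falls into the interval starting at 19:00
```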

Results

# A tibble: 5 x 3
  ID    Species DateTime           
  <chr> <chr>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00

References

stackoverflow

2017-05-02

How to split a data frame into equal parts and keep one of them

Problem

We want to split a data frame into 5 equal parts and keep one of them, in our example the third.

df <- data.frame(data = 1:100)

library(tibble)
as_tibble(df)

# A tibble: 100 × 1
    data
   <int>
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
# ... with 90 more rows

Solution

We use the function ntile() to create a column that assigns each row to one of the 5 parts of the data frame, then filter the data frame to keep the third part.

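library(dplyr)
# Base R indexing; drop = FALSE keeps the one-column result as a data frame
df[ntile(df$data, 5) == 3, , drop = FALSE]
# dplyr pipeline
df %>%
  mutate(n = ntile(data, 5)) %>%
  filter(n == 3) %>%
  select(data)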
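
To see what ntile() assigns, the following sketch (assuming dplyr is installed; the names bucket and third_range are introduced here for illustration) counts the rows per part and checks which values land in the third one:

```r
library(dplyr)

df <- data.frame(data = 1:100)

# ntile() assigns each row to one of 5 buckets by rank of `data`
bucket <- ntile(df$data, 5)

# Rows per bucket: 100 rows split evenly into 5 buckets of 20
table(bucket)

# Range of values in the third bucket
third_range <- range(df$data[bucket == 3])
third_range
```

When the number of rows is not a multiple of the number of parts, the parts differ in size by at most one row.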

Result

The third part contains the 20 rows with data values 41 to 60.

References

stackoverflow

2016-12-19

How to reorder the remaining columns of a data frame with dplyr

Problem

When selecting columns with dplyr, we want to place the remaining columns at the beginning or end of the data frame without typing their names by hand.

In the following example, the data frame flights has 19 columns and we want to place 5 of them at the beginning or at the end.

library(nycflights13)
library(dplyr)
head(flights)

# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Solution

The 5 columns at the beginning of the data frame

col <- c("carrier", "tailnum", "year", "month", "day")
select(flights, one_of(col), everything())

# A tibble: 336,776 × 19
   carrier tailnum  year month   day dep_time sched_dep_time dep_delay arr_time
     <chr>   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1       UA  N14228  2013     1     1      517            515         2      830
2       UA  N24211  2013     1     1      533            529         4      850
3       AA  N619AA  2013     1     1      542            540         2      923
4       B6  N804JB  2013     1     1      544            545        -1     1004
5       DL  N668DN  2013     1     1      554            600        -6      812
6       UA  N39463  2013     1     1      554            558        -4      740
7       B6  N516JB  2013     1     1      555            600        -5      913
8       EV  N829AS  2013     1     1      557            600        -3      709
9       B6  N593JB  2013     1     1      557            600        -3      838
10      AA  N3ALAA  2013     1     1      558            600        -2      753
# ... with 336,766 more rows, and 10 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, flight <int>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The 5 columns at the end of the data frame

select(flights, -one_of(col), one_of(col))

# A tibble: 336,776 × 19
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay flight
      <int>          <int>     <dbl>    <int>          <int>     <dbl>  <int>
1       517            515         2      830            819        11   1545
2       533            529         4      850            830        20   1714
3       542            540         2      923            850        33   1141
4       544            545        -1     1004           1022       -18    725
5       554            600        -6      812            837       -25    461
6       554            558        -4      740            728        12   1696
7       555            600        -5      913            854        19    507
8       557            600        -3      709            723       -14   5708
9       557            600        -3      838            846        -8     79
10      558            600        -2      753            745         8    301
# ... with 336,766 more rows, and 12 more variables: origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#   carrier <chr>, tailnum <chr>, year <int>, month <int>, day <int>

To add the columns to the whole data frame

To add the 5 columns to the complete data frame, duplicating them:

# 5 columns at the beginning
bind_cols(select(flights, one_of(col)), flights)

  # With dplyr::relocate(), which moves the columns instead of duplicating them
  flights %>%
    relocate(carrier, tailnum, year, month, day)

# 5 columns at the end
bind_cols(flights, select(flights, one_of(col)))

  # With dplyr::relocate(), which moves the columns instead of duplicating them
  flights %>%
    relocate(carrier, tailnum, year, month, day, .after = last_col())
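
The behaviour of relocate() can be tried without nycflights13. The sketch below assumes dplyr 1.0.0 or later and uses the built-in mtcars data with two arbitrarily chosen columns; the names to_front and to_back are introduced here only for illustration.

```r
library(dplyr)

# Illustrative columns, chosen arbitrarily from mtcars
to_front <- relocate(mtcars, hp, wt)                       # default: move to the front
to_back  <- relocate(mtcars, hp, wt, .after = last_col())  # move to the end

names(to_front)
names(to_back)
```

In both results the number of columns is unchanged, because relocate() only changes column positions.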

References

stackoverflow

2015-12-16

How to drop variables from a data frame in R based on a character vector using dplyr

Problem

Using dplyr, we want to remove from a data frame the variables whose names we supply in a character vector. For example, we want to remove the variable B from the following data frame.

set.seed(201512)
df <- data.frame(A = runif(10), B = runif(10))

         A           B
1  0.5235130 0.032229763
2  0.2625372 0.059565071
3  0.6049460 0.792932236
4  0.5136619 0.809743021
5  0.5668388 0.001533767
6  0.4876062 0.155949532
7  0.2354488 0.490415100
8  0.5688439 0.165787477
9  0.5964628 0.807970900
10 0.4615434 0.380846012

Solution

We use the function one_of(), a selection helper that only works inside selecting functions such as select(). It selects the variables named in a character vector (omitir); here, because it is preceded by a minus sign, those variables are dropped instead.

library(dplyr)
omitir <- c("B") # Character vector of variables to drop
df %>% select(-one_of(omitir))

           A
1  0.5235130
2  0.2625372
3  0.6049460
4  0.5136619
5  0.5668388
6  0.4876062
7  0.2354488
8  0.5688439
9  0.5964628
10 0.4615434
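
In current versions of dplyr, one_of() is superseded by any_of() and all_of(), which differ in how they treat names that are not in the data frame. The sketch below is an illustration under that assumption; the small data frame and the name "Z" are hypothetical and not part of the original example.

```r
library(dplyr)

df <- data.frame(A = 1:3, B = 4:6)
omitir <- c("B", "Z")  # "Z" is a hypothetical name that is not in df

# any_of() silently ignores names that do not exist
kept <- df %>% select(-any_of(omitir))
names(kept)

# all_of() raises an error when a name is missing
strict <- tryCatch(
  df %>% select(-all_of(omitir)),
  error = function(e) "error"
)
strict
```

any_of() is convenient when the vector of names may contain columns that have already been removed, while all_of() is the stricter choice when every name is expected to exist.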

References

stackoverflow

2015-11-05

How to drop variables from a data frame by name

Problem

We want to exclude variables from a data frame according to their names. In our example we use the data frame iris and want to drop the variables Sepal.Length and Petal.Width.

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Solutions

There are several options.

Base R

# Option 1
iris[, -which(names(iris) %in% c("Sepal.Length", "Petal.Width"))]
# Option 2
iris[, !names(iris) %in% c("Sepal.Length", "Petal.Width")]
# Option 3
subset(iris, select = -c(Sepal.Length, Petal.Width))

dplyr

library(dplyr)
iris %>% select(-c(Sepal.Length, Petal.Width))

data.table

library(data.table)
DT = as.data.table(iris)
DT[ , !names(DT) %in% c("Sepal.Length", "Petal.Width"), with = FALSE]
# Another option
subset(DT, select=-c(Sepal.Length, Petal.Width))

Result

The three remaining variables. Only the first 6 rows are shown.

  Sepal.Width Petal.Length Species
1         3.5          1.4  setosa
2         3.0          1.4  setosa
3         3.2          1.3  setosa
4         3.1          1.5  setosa
5         3.6          1.4  setosa
6         3.9          1.7  setosa

Notes

There are subtle differences between the first base R option, -which(), and the second, the ! operator. If none of the names given to -which() are found in the data frame, it returns an empty data frame with zero columns, whereas in the same situation ! returns the original data frame unchanged.

The dplyr syntax is quite simple. The data.table syntax is very similar to base R; however, the argument with = FALSE is required, or a logical vector is returned instead. With data.table we can also use the subset function.
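
The difference between -which() and ! described in the first note can be checked in base R. In this sketch, "not_a_column" is a hypothetical name chosen because it does not exist in iris.

```r
missing_names <- c("not_a_column")  # hypothetical name absent from iris

# which() returns integer(0), and negating it still selects no columns
with_which <- iris[, -which(names(iris) %in% missing_names)]

# The negated logical vector is all TRUE, so every column is kept
with_not <- iris[, !names(iris) %in% missing_names]

ncol(with_which)  # 0
ncol(with_not)    # 5
```

For this reason the ! form is usually the safer choice when the names to drop come from user input or another data source.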

References

stackoverflow