2021-03-01

How to draw a stratified sample in R

Problem

We want to draw a stratified sample from a data frame in R.

Solution

Let's look at two examples: stratifying by one grouping variable, and by two.

One group

We extract 3 records from each species: setosa, versicolor and virginica.

  • Base package
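    set.seed(1)  # make the draw reproducible
    # Split iris by species, then draw 3 random rows from each piece
    iris1 <- lapply(split(iris, iris$Species), function(x) x[sample(nrow(x), 3), ])
    # Stack the per-species samples back into one data frame
    do.call("rbind", iris1)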
    
                  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    setosa.14              4.3         3.0          1.1         0.1     setosa
    setosa.19              5.7         3.8          1.7         0.3     setosa
    setosa.28              5.2         3.5          1.5         0.2     setosa
    versicolor.96          5.7         3.0          4.2         1.2 versicolor
    versicolor.60          5.2         2.7          3.9         1.4 versicolor
    versicolor.94          5.0         2.3          3.3         1.0 versicolor
    virginica.148          6.5         3.0          5.2         2.0  virginica
    virginica.133          6.4         2.8          5.6         2.2  virginica
    virginica.131          7.4         2.8          6.1         1.9  virginica
    
  • dplyr
    library(dplyr)
    set.seed(1)  # make the draw reproducible
    iris %>%
      group_by(Species) %>%  # one stratum per species
      sample_n(3)            # draw 3 rows within each stratum
    
    Source: local data frame [9 x 5]
    Groups: Species
    
      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          4.3         3.0          1.1         0.1     setosa
    2          5.7         3.8          1.7         0.3     setosa
    3          5.2         3.5          1.5         0.2     setosa
    4          5.7         3.0          4.2         1.2 versicolor
    5          5.2         2.7          3.9         1.4 versicolor
    6          5.0         2.3          3.3         1.0 versicolor
    7          6.5         3.0          5.2         2.0  virginica
    8          6.4         2.8          5.6         2.2  virginica
    9          7.4         2.8          6.1         1.9  virginica
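
In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). The following is a minimal sketch of the same draw with the newer verb, assuming a recent dplyr is installed; the name iris_sample is only illustrative, and the rows selected may differ from the output shown above.

```r
library(dplyr)

set.seed(1)  # make the draw reproducible
iris_sample <- iris %>%
  group_by(Species) %>%      # one stratum per species
  slice_sample(n = 3) %>%    # draw 3 rows within each stratum
  ungroup()

# Confirm that every species contributes exactly 3 rows
count(iris_sample, Species)
```

The final count() is a quick way to check that the sample really is balanced across strata.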
    
Two groups

For each number of cylinders (4, 6 or 8), we extract 2 records with automatic transmission (am = 0) and 2 with manual transmission (am = 1).

  • Base package
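    set.seed(1)  # make the draw reproducible
    # Split by every am/cyl combination, then draw 2 random rows from each piece
    mtcars1 <- lapply(split(mtcars, list(mtcars$am, mtcars$cyl)), function(x) x[sample(nrow(x), 2), ])
    # Stack the per-stratum samples back into one data frame
    do.call("rbind", mtcars1)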
    
                           mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    0.4.Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    0.4.Toyota Corona     21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    1.4.Fiat X1-9         27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    1.4.Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    0.6.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    0.6.Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
    1.6.Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    1.6.Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    0.8.Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
    0.8.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    1.8.Ford Pantera L    15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    1.8.Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
    
  • dplyr
    set.seed(1)  # make the draw reproducible
    mtcars %>%
      group_by(cyl, am) %>%  # one stratum per cyl/am combination
      sample_n(2)            # draw 2 rows within each stratum
    
    Source: local data frame [12 x 11]
    Groups: cyl, am
    
        mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    1  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    2  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    3  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    4  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    5  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    6  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
    7  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    8  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    9  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
    10 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    11 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    12 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
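
Both base examples follow the same split, sample, and rbind pattern, so they can be wrapped in one small function. This is only a sketch: stratified_sample() and its arguments are hypothetical names, not part of any package. It assumes that a stratum with fewer than n rows should contribute all of its rows, and that empty combinations should be skipped.

```r
# Hypothetical helper: draw up to `n` rows from every stratum
# defined by the columns named in `strata`
stratified_sample <- function(df, strata, n) {
  # drop = TRUE skips combinations of strata that have no rows
  groups <- split(df, df[strata], drop = TRUE)
  picked <- lapply(groups, function(x) {
    # min() caps the draw at the stratum size (an assumption of this sketch)
    x[sample(nrow(x), min(n, nrow(x))), , drop = FALSE]
  })
  do.call("rbind", picked)
}

set.seed(1)
s1 <- stratified_sample(iris, "Species", 3)          # one grouping variable
s2 <- stratified_sample(mtcars, c("cyl", "am"), 2)   # two grouping variables

# Check the number of sampled rows per stratum
table(s1$Species)
table(s2$cyl, s2$am)
```

Capping the draw with min() avoids the error that sample() raises when asked for more rows than a stratum contains. Whether undersized strata should be kept, dropped, or sampled with replacement depends on the study design.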
    
