Problem
We want to draw a stratified sample from a data frame in R.
Solution
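Let's look at two examples: sampling within the levels of a single grouping variable, and within the combinations of two grouping variables. Each is shown first in base R and then with dplyr.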
One group
We extract 3 records from each of the three species: setosa, versicolor, and virginica.
set.seed(1)
iris1 <- lapply(split(iris, iris$Species), function(x) x[sample(nrow(x), 3), ])
do.call("rbind", iris1)
              Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
setosa.14              4.3         3.0          1.1         0.1     setosa
setosa.19              5.7         3.8          1.7         0.3     setosa
setosa.28              5.2         3.5          1.5         0.2     setosa
versicolor.96          5.7         3.0          4.2         1.2 versicolor
versicolor.60          5.2         2.7          3.9         1.4 versicolor
versicolor.94          5.0         2.3          3.3         1.0 versicolor
virginica.148          6.5         3.0          5.2         2.0  virginica
virginica.133          6.4         2.8          5.6         2.2  virginica
virginica.131          7.4         2.8          6.1         1.9  virginica
With dplyr:
library(dplyr)
set.seed(1)
iris %>%
  group_by(Species) %>%
  sample_n(3)
Source: local data frame [9 x 5]
Groups: Species
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          4.3         3.0          1.1         0.1     setosa
2          5.7         3.8          1.7         0.3     setosa
3          5.2         3.5          1.5         0.2     setosa
4          5.7         3.0          4.2         1.2 versicolor
5          5.2         2.7          3.9         1.4 versicolor
6          5.0         2.3          3.3         1.0 versicolor
7          6.5         3.0          5.2         2.0  virginica
8          6.4         2.8          5.6         2.2  virginica
9          7.4         2.8          6.1         1.9  virginica
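A quick way to confirm that the sample is balanced is to count the sampled rows per stratum. The following is a minimal base R sketch that repeats the split/lapply/rbind approach from this section; the object names iris_sample and counts are only illustrative.

```r
set.seed(1)

# Draw 3 rows from each species, as in the base R example
iris1 <- lapply(split(iris, iris$Species), function(x) x[sample(nrow(x), 3), ])
iris_sample <- do.call("rbind", iris1)

# Count sampled rows per species; each stratum should contribute exactly 3
counts <- table(iris_sample$Species)
print(counts)
```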
Two groups
For each number of cylinders (4, 6, or 8) we extract 2 records with automatic transmission (am = 0) and 2 with manual transmission (am = 1).
set.seed(1)
mtcars1 <- lapply(split(mtcars, list(mtcars$am, mtcars$cyl)), function(x) x[sample(nrow(x), 2), ])
do.call("rbind", mtcars1)
                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
0.4.Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
0.4.Toyota Corona     21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
1.4.Fiat X1-9         27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
1.4.Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
0.6.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
0.6.Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
1.6.Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
1.6.Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
0.8.Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
0.8.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
1.8.Ford Pantera L    15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
1.8.Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
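The base R call above works because every cyl/am combination in mtcars has at least 2 rows. The 8-cylinder manual stratum has exactly 2 cars, so asking for 3 per stratum would make sample() fail there. One possible guard, sketched below, caps the request at the stratum size; the variable names n, strata, capped and mtcars_capped are illustrative, and drop = TRUE is added as a precaution so that empty combinations are skipped.

```r
n <- 3  # requested rows per stratum (more than the smallest stratum holds)
set.seed(1)

# drop = TRUE discards am/cyl combinations that have no rows at all
strata <- split(mtcars, list(mtcars$am, mtcars$cyl), drop = TRUE)

capped <- lapply(strata, function(x) {
  k <- min(n, nrow(x))  # never ask for more rows than the stratum has
  x[sample.int(nrow(x), k), , drop = FALSE]
})
mtcars_capped <- do.call("rbind", capped)

# Rows per cyl (rows of the table) and am (columns of the table)
print(table(mtcars_capped$cyl, mtcars_capped$am))
```

With this guard, small strata are returned whole instead of raising an error, at the cost of an unbalanced sample.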
With dplyr:
set.seed(1)
mtcars %>%
  group_by(cyl, am) %>%
  sample_n(2)
Source: local data frame [12 x 11]
Groups: cyl, am
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
3  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
4  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
5  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
6  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
7  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
8  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
9  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
10 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
11 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
12 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
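Both base R examples follow the same pattern: split the data frame by the grouping columns, sample within each piece, and bind the pieces back together. As a rough sketch, the pattern can be wrapped in a small function. Note that stratified_sample is a hypothetical helper written for this post, not a function from base R or dplyr, and it assumes the grouping columns are given by name.

```r
# Hypothetical helper: draw up to n rows from each stratum of df,
# where strata are the combinations of the columns named in group_cols
stratified_sample <- function(df, group_cols, n) {
  # A data frame of grouping columns is accepted by split() as a list of factors
  strata <- split(df, df[group_cols], drop = TRUE)
  sampled <- lapply(strata, function(x) {
    x[sample.int(nrow(x), min(n, nrow(x))), , drop = FALSE]
  })
  do.call("rbind", sampled)
}

set.seed(1)
iris_s <- stratified_sample(iris, "Species", 3)
mtcars_s <- stratified_sample(mtcars, c("cyl", "am"), 2)
print(nrow(iris_s))
print(nrow(mtcars_s))
```

The rows drawn for a given seed may differ from the listings above, since the order in which strata are visited and the version of R both affect the random draws.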