Problem
We want to draw a stratified sample in R. Previously, we took a simple random sample from a data frame, which gave us no control over how the subgroups were represented. This time we sample within each stratum, so that every split keeps the same subgroup proportions as the original data.
Solution
Using the function createDataPartition from the caret package.
library(tidyverse)
library(caret)
set.seed(1)  # make the split reproducible
planes <- as.data.frame(nycflights13::planes)
# Row indices of a 50% sample drawn within each level of engine
trainIndex <- createDataPartition(planes$engine,
                                  p = .5,
                                  list = FALSE,
                                  times = 1)
planesTrain <- planes[trainIndex, ]
planesTest <- planes[-trainIndex, ]
Using the function stratified from the splitstackshape package.
library(splitstackshape)
set.seed(1)
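# Keep 50% of the rows within each level of engine
planesTrain1 <- stratified(planes, "engine", 0.5)
# The remaining rows, matched on tailnum against planesTrain1 (not planesTrain)
planesTest1 <- planes[!(planes$tailnum %in% planesTrain1$tailnum), ]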
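To make the idea behind both functions more concrete, here is a minimal base R sketch of proportional stratified sampling. It assumes a small made-up data frame (toy) and a hypothetical helper (stratified_idx); it illustrates the technique only and is not how caret or splitstackshape implement it.

```r
# Toy data (made up for illustration): three strata of very different sizes.
toy <- data.frame(
  id     = 1:100,
  engine = rep(c("Turbo-fan", "Turbo-jet", "Reciprocating"), times = c(80, 16, 4))
)

# Hypothetical helper: row indices holding a fraction p of every stratum.
stratified_idx <- function(strata, p) {
  # Row positions grouped by stratum.
  rows_by_stratum <- split(seq_along(strata), strata)
  picked <- lapply(rows_by_stratum, function(rows) {
    n_take <- ceiling(length(rows) * p)
    # sample.int() on the length avoids the sample(x) surprise that
    # occurs when a stratum has a single row.
    rows[sample.int(length(rows), size = n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
idx      <- stratified_idx(toy$engine, p = 0.5)
toyTrain <- toy[idx, ]
toyTest  <- toy[-idx, ]

# Number of rows each stratum contributes to the training set.
table(toyTrain$engine)
```

Rounding up with ceiling() is a design choice made here so that even the smallest stratum contributes at least one row; other rounding rules are equally defensible and will shift the counts for odd-sized strata.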
Checking that each split preserves the engine proportions of the original data. The cum column holds each engine type's share of the rows.
- Original data frame
planes %>%
  group_by(engine) %>%
  summarise(n = n()) %>%
  mutate(cum = n / sum(n))
# A tibble: 6 x 3
  engine            n cum
1 4 Cycle           2 0.000602
2 Reciprocating    28 0.00843
3 Turbo-fan      2750 0.828
4 Turbo-jet       535 0.161
5 Turbo-prop        2 0.000602
6 Turbo-shaft       5 0.00151
- Training set
planesTrain %>%
  group_by(engine) %>%
  summarise(n = n()) %>%
  mutate(cum = n / sum(n))
# A tibble: 6 x 3
  engine            n cum
1 4 Cycle           1 0.000602
2 Reciprocating    14 0.00842
3 Turbo-fan      1375 0.827
4 Turbo-jet       268 0.161
5 Turbo-prop        1 0.000602
6 Turbo-shaft       3 0.00181
- Test set
planesTest %>%
  group_by(engine) %>%
  summarise(n = n()) %>%
  mutate(cum = n / sum(n))
# A tibble: 6 x 3
  engine            n cum
1 4 Cycle           1 0.000602
2 Reciprocating    14 0.00843
3 Turbo-fan      1375 0.828
4 Turbo-jet       267 0.161
5 Turbo-prop        1 0.000602
6 Turbo-shaft       2 0.00120
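The per-stratum proportions can also be compared side by side in a single step. The sketch below uses base R on a made-up data frame and split; the names toy, toyTrain, toyTest, and strata_props are hypothetical. With the real data, planes, planesTrain, and planesTest would take their place.

```r
# Toy data (made up for illustration) and a split keeping half of each stratum.
toy <- data.frame(
  engine = rep(c("Turbo-fan", "Turbo-jet", "Reciprocating"), times = c(80, 16, 4))
)
train_rows <- c(1:40, 81:88, 97:98)   # first half of each stratum
toyTrain   <- toy[train_rows, , drop = FALSE]
toyTest    <- toy[-train_rows, , drop = FALSE]

# Hypothetical helper: share of each stratum, using a fixed set of levels so
# that a stratum missing from one split shows up as 0 instead of vanishing.
strata_props <- function(x, levels) {
  as.vector(prop.table(table(factor(x, levels = levels))))
}

lv <- sort(unique(toy$engine))
comparison <- cbind(
  original = strata_props(toy$engine, lv),
  train    = strata_props(toyTrain$engine, lv),
  test     = strata_props(toyTest$engine, lv)
)
rownames(comparison) <- lv
round(comparison, 3)

# Largest absolute deviation of either split from the original proportions.
max_dev <- max(abs(comparison[, c("train", "test")] - comparison[, "original"]))
max_dev
```

A small maximum deviation suggests the split is proportional; very small strata, such as the 2-row and 5-row engine types above, cannot be halved exactly, so some deviation is expected there.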