2020-04-18

How to draw a stratified sample in R

Problem

We want to draw a stratified sample in R. Previously, we took a simple random sample from a data frame, which gave us no control over how the subgroups were represented. This time we sample within each stratum, so that the sample keeps the same overall distribution of subgroups as the original data.
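
To make the idea concrete, here is a minimal base R sketch of proportionate stratified sampling. The toy data frame and the names toy, rows_by_group and toy_sample are made up for illustration and are not part of the original post; the sketch only assumes that we want half of the rows of every stratum.

```r
# Toy data frame (hypothetical): one large and one small stratum
set.seed(1)
toy <- data.frame(id = 1:100,
                  group = rep(c("a", "b"), times = c(80, 20)))

# Row indices of each stratum
rows_by_group <- split(seq_len(nrow(toy)), toy$group)

# Draw half of the rows within each stratum, without replacement
sampled_rows <- unlist(lapply(rows_by_group, function(rows) {
  rows[sample.int(length(rows), size = round(length(rows) * 0.5))]
}), use.names = FALSE)

toy_sample <- toy[sampled_rows, ]

# Share of each stratum in the original data and in the sample
prop.table(table(toy$group))
prop.table(table(toy_sample$group))
```

Because 40 of the 80 "a" rows and 10 of the 20 "b" rows are drawn, both tables show the same 80/20 split, which a simple random sample would not guarantee.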

Solution

Using the function createDataPartition from the caret package.

library(tidyverse)
library(caret)
set.seed(1)  # make the random split reproducible
planes <- as.data.frame(nycflights13::planes)
# Sample 50% of the rows within each level of engine
trainIndex <- createDataPartition(planes$engine,
                                  p = .5,
                                  list = FALSE,
                                  times = 1)

planesTrain <- planes[trainIndex, ]
planesTest  <- planes[-trainIndex, ]

Using the function stratified from the splitstackshape package.

library(splitstackshape)
set.seed(1)
planesTrain1 <- stratified(planes, "engine", 0.5)
planesTest1 <- planes[!(planes$tailnum %in% planesTrain1$tailnum),]

Checking that we have created balanced splits of the data.

  • Original data frame

    planes %>%
      group_by(engine) %>%
      summarise(n = n()) %>%
      mutate(cum = n / sum(n))
    
    # A tibble: 6 x 3
      engine            n      cum
      <chr>         <int>    <dbl>
    1 4 Cycle           2 0.000602
    2 Reciprocating    28 0.00843 
    3 Turbo-fan      2750 0.828   
    4 Turbo-jet       535 0.161   
    5 Turbo-prop        2 0.000602
    6 Turbo-shaft       5 0.00151  
    
  • Training set

    planesTrain %>%
      group_by(engine) %>%
      summarise(n = n()) %>%
      mutate(cum = n / sum(n))
    
    # A tibble: 6 x 3
      engine            n      cum
      <chr>         <int>    <dbl>
    1 4 Cycle           1 0.000602
    2 Reciprocating    14 0.00842 
    3 Turbo-fan      1375 0.827   
    4 Turbo-jet       268 0.161   
    5 Turbo-prop        1 0.000602
    6 Turbo-shaft       3 0.00181 
    
  • Test set

    planesTest %>%
      group_by(engine) %>%
      summarise(n = n()) %>%
      mutate(cum = n / sum(n))
    
    # A tibble: 6 x 3
      engine            n      cum
      <chr>         <int>    <dbl>
    1 4 Cycle           1 0.000602
    2 Reciprocating    14 0.00843 
    3 Turbo-fan      1375 0.828   
    4 Turbo-jet       267 0.161   
    5 Turbo-prop        1 0.000602
    6 Turbo-shaft       2 0.00120 
    
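
The checks above only cover the caret split. As a sketch, the same comparison can be run on a splitstackshape split. This example assumes the bothSets argument of stratified(), which returns the sampled rows and the remaining rows together as a list; the object names bothSplits, trainSet and testSet are ours, not the original post's.

```r
library(dplyr)
library(splitstackshape)

set.seed(1)
planes <- as.data.frame(nycflights13::planes)

# bothSets = TRUE returns a list: sampled rows first, remaining rows second
bothSplits <- stratified(planes, "engine", 0.5, bothSets = TRUE)
trainSet <- bothSplits[[1]]
testSet  <- bothSplits[[2]]

# The two sets should not share any plane (tailnum identifies a plane)
length(intersect(trainSet$tailnum, testSet$tailnum))

# Share of each engine type in the training set
trainSet %>%
  group_by(engine) %>%
  summarise(n = n()) %>%
  mutate(cum = n / sum(n))
```

Getting both sets from one call avoids rebuilding the test set by matching on tailnum, where it is easy to reference the wrong training object.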
