2019-12-28

How to convert a continuous variable to discrete in R?

Problem

We want to convert a continuous variable into a discrete one in R:

'Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%'.

library(ISLR)
library(tidyverse)
glimpse(College)
Observations: 777
Variables: 18
$ Private      Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye...
$ Apps         1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 17...
$ Accept       1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 142...
$ Enroll       721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484...
$ Top10perc    23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 2...
$ Top25perc    52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 4...
$ F.Undergrad  2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 183...
$ P.Undergrad  537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 63...
$ Outstate     7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 155...
$ Room.Board   3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 338...
$ Books        450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 40...
$ Personal     2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 6...
$ PhD          70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 3...
$ Terminal     78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, ...
$ S.F.Ratio    18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5,...
$ perc.alumni  12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26...
$ Expend       7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644...
$ Grad.Rate    60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 5...

Solution

  1. Option 1: from the ISLR book.

    Elite = rep("No", nrow(College))
    Elite[College$Top10perc > 50] = "Yes"
    Elite <- as.factor(Elite)
    college <- data.frame(College,  Elite)
    summary(college[, c("Top10perc", "Elite")])
    
    There are 78 elite universities.

      Top10perc     Elite    
     Min.   : 1.00   No :699  
     1st Qu.:15.00   Yes: 78  
     Median :23.00            
     Mean   :27.56            
     3rd Qu.:35.00            
     Max.   :96.00    
    
  2. Option 2: ifelse() from base R, on its own or inside dplyr's mutate().

    # base R
    College$Elite <- factor(ifelse(College$Top10perc > 50, "Yes", "No"))
    # dplyr: inside mutate(), columns are referenced by their bare names
    library(dplyr)
    College <-
      College %>%
      mutate(Elite = factor(ifelse(Top10perc > 50, "Yes", "No")))
    
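  3. Option 3: creating a logical vector. There are several ways to do this; here are two.

    # transform() returns a whole data frame, so assign it back to the data frame
    College <- transform(College, Elite = Top10perc > 50)
    # or assign the logical vector directly to a new column
    College$Elite <- College$Top10perc > 50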
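
As a quick sanity check, the three options can be compared on a tiny data frame. This is only a sketch: `toy` and its values are made up for illustration and are not taken from the College data. It also makes the boundary case explicit: a value of exactly 50 does not exceed 50, so it falls in the "No" group.

```r
# Toy data: made-up values, NOT the College data set
toy <- data.frame(Top10perc = c(10, 50, 51, 96))

# Option 1 style: start with "No", overwrite where the threshold is exceeded
elite_rep <- rep("No", nrow(toy))
elite_rep[toy$Top10perc > 50] <- "Yes"
elite_rep <- as.factor(elite_rep)

# Option 2 style: vectorised ifelse()
elite_ifelse <- factor(ifelse(toy$Top10perc > 50, "Yes", "No"))

# Option 3 style: logical vector
elite_lgl <- toy$Top10perc > 50

# All three encode the same grouping; stopifnot() errors if any check fails
stopifnot(identical(elite_rep, elite_ifelse))
stopifnot(identical(elite_lgl, elite_ifelse == "Yes"))

# Count of rows per group
table(elite_ifelse)
```

The factor versions are convenient for summaries and plots, while the logical version is handy for subsetting, e.g. `toy[elite_lgl, ]`.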
    

References

From 'An Introduction to Statistical Learning' (ISLR), page 54.
