Problem
We want to convert a continuous variable into a discrete one in R:
'Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%'.
library(ISLR)
library(tidyverse)
glimpse(College)
Observations: 777
Variables: 18
$ Private Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye...
$ Apps 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 17...
$ Accept 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 142...
$ Enroll 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484...
$ Top10perc 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 2...
$ Top25perc 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 4...
$ F.Undergrad 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 183...
$ P.Undergrad 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 63...
$ Outstate 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 155...
$ Room.Board 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 338...
$ Books 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 40...
$ Personal 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 6...
$ PhD 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 3...
$ Terminal 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, ...
$ S.F.Ratio 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5,...
$ perc.alumni 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26...
$ Expend 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644...
$ Grad.Rate 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 5...
Solution
- Option 1: from the ISLR book.
- Option 2: ifelse from base package and dplyr
- Option 3: creating a logical vector.
Elite = rep("No", nrow(College))
Elite[College$Top10perc > 50] = "Yes"
Elite <- as.factor(Elite)
college <- data.frame(College, Elite)
summary(college[, c("Top10perc", "Elite")])
There are 78 elite universities.
Top10perc Elite
Min. : 1.00 No :699
1st Qu.:15.00 Yes: 78
Median :23.00
Mean :27.56
3rd Qu.:35.00
Max. :96.00
# base
College$Elite <- factor(ifelse(College$Top10perc > 50, "Yes", "No"))
# dplyr
library(dplyr)
College <-
  College %>%
  # inside mutate(), refer to the column directly rather than via College$
  mutate(Elite = factor(ifelse(Top10perc > 50, "Yes", "No")))
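Since the exercise talks about binning, the same two-group split can also be sketched with base R's cut(). This is an alternative to the ifelse() call above, not the book's method. The vector top10perc below is a made-up stand-in for College$Top10perc, so the snippet runs without the ISLR package.

```r
# Hypothetical values standing in for College$Top10perc
top10perc <- c(1, 23, 50, 51, 96)

# Intervals are closed on the right by default: (-Inf, 50] and (50, Inf],
# so a value of exactly 50 falls in "No", matching the Top10perc > 50 rule
elite <- cut(top10perc, breaks = c(-Inf, 50, Inf), labels = c("No", "Yes"))

table(elite)
```

cut() becomes more useful than ifelse() when more than two bins are needed, because only the breaks and labels change.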
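For Option 3 there are several ways to store the result as a logical vector. I show two examples.
# transform() returns the whole data frame, so assign it back to College
College <- transform(College, Elite = Top10perc > 50)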
College$Elite <- College$Top10perc > 50
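A logical column is convenient for counting, but the exercise asks for a qualitative variable. A minimal sketch of turning the logical vector into a labelled factor follows. The values in top10perc are invented for illustration, not taken from the College data.

```r
# Hypothetical values standing in for College$Top10perc
top10perc <- c(23, 16, 60, 50, 96)

# Logical vector: TRUE where more than 50% of students come from the top 10%
elite_lgl <- top10perc > 50

# sum() counts the TRUE values, i.e. the number of elite universities
n_elite <- sum(elite_lgl)

# Map FALSE/TRUE to the "No"/"Yes" labels used in the other options
elite_fct <- factor(elite_lgl, levels = c(FALSE, TRUE), labels = c("No", "Yes"))

table(elite_fct)
```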
References
From 'An Introduction to Statistical Learning' (ISLR), page 54.