We want to create dummy variables from existing variables in R. In this example the dummy variables are based on the columns Sex and Embarked of the following data frame.
  PassengerId Survived Pclass    Sex Age SibSp Parch    Fare Embarked Age.NA
1           1        0      3   male  22     1     0  7.2500        S      0
2           2        1      1 female  38     1     0 71.2833        C      0
3           3        1      3 female  26     0     0  7.9250        S      0
4           4        1      1 female  35     1     0 53.1000        S      0
5           5        0      3   male  35     0     0  8.0500        S      0
6           6        0      3   male  NA     0     0  8.4583        Q      1
The data frame can be recreated with the following code.
df <- structure(list(PassengerId = 1:6, Survived = c(0L, 1L, 1L, 1L,
0L, 0L), Pclass = c(3L, 1L, 3L, 1L, 3L, 3L), Sex = structure(c(2L,
1L, 1L, 1L, 2L, 2L), .Label = c("female", "male"), class = "factor"),
Age = c(22L, 38L, 26L, 35L, 35L, NA), SibSp = c(1L, 1L, 0L,
1L, 0L, 0L), Parch = c(0L, 0L, 0L, 0L, 0L, 0L), Fare = c(7.25,
71.2833, 7.925, 53.1, 8.05, 8.4583), Embarked = structure(c(3L,
1L, 3L, 3L, 3L, 2L), .Label = c("C", "Q", "S"), class = "factor"),
Age.NA = c(0, 0, 0, 0, 0, 1)), .Names = c("PassengerId",
"Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare",
"Embarked", "Age.NA"), row.names = c("1", "2", "3", "4", "5",
"6"), class = "data.frame")
We use the function dummy.data.frame from the package dummies. By default it expands every column of class character or factor into dummy variables, one 0/1 column per distinct value.
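The call itself is short. The following is a minimal sketch, assuming the dummies package is installed. To keep the block self-contained it uses a trimmed copy of the example data; the names df.small and df.small.new are ours, not part of the original example.

```r
library(dummies)  # assumption: the dummies package is installed

# Trimmed copy of the example data (hypothetical name df.small):
# one ordinary column plus the two factor columns to expand.
df.small <- data.frame(
  PassengerId = 1:6,
  Sex = factor(c("male", "female", "female", "female", "male", "male")),
  Embarked = factor(c("S", "C", "S", "S", "S", "Q"))
)

# Expand all factor and character columns with the default settings.
df.small.new <- dummy.data.frame(df.small)
names(df.small.new)

# On the full data frame defined above the call would be:
# df.new <- dummy.data.frame(df)
```

Columns that are neither factor nor character, such as PassengerId, are passed through unchanged.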
The original columns Sex and Embarked are replaced by the dummy variable columns Sexfemale, Sexmale, EmbarkedC, EmbarkedQ and EmbarkedS.
  PassengerId Survived Pclass Sexfemale Sexmale Age SibSp Parch    Fare
1           1        0      3         0       1  22     1     0  7.2500
2           2        1      1         1       0  38     1     0 71.2833
3           3        1      3         1       0  26     0     0  7.9250
4           4        1      1         1       0  35     1     0 53.1000
5           5        0      3         0       1  35     0     0  8.0500
6           6        0      3         0       1  NA     0     0  8.4583
  EmbarkedC EmbarkedQ EmbarkedS Age.NA
1         0         0         1      0
2         1         0         0      0
3         0         0         1      0
4         0         0         1      0
5         0         0         1      0
6         0         1         0      1
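If the dummies package is not available, the same column layout can be approximated in base R. The sketch below is not how dummy.data.frame is implemented; it is a swapped-in technique that compares each value with each factor level via outer(). The helper name expand_dummies and the trimmed data frame df.small are hypothetical names introduced for illustration.

```r
# Trimmed copy of the example data (hypothetical name df.small).
df.small <- data.frame(
  PassengerId = 1:6,
  Sex = factor(c("male", "female", "female", "female", "male", "male")),
  Embarked = factor(c("S", "C", "S", "S", "S", "Q"))
)

# Hypothetical helper: replace every factor or character column
# by one 0/1 column per level, named <column><level>.
expand_dummies <- function(data) {
  pieces <- lapply(names(data), function(col) {
    x <- data[[col]]
    if (!is.factor(x) && !is.character(x)) {
      return(data[col])  # keep all other columns unchanged
    }
    x <- factor(x)
    # n x k logical matrix (row value equals column level), turned into 0/1
    ind <- outer(as.character(x), levels(x), "==") + 0L
    colnames(ind) <- paste0(col, levels(x))
    as.data.frame(ind)
  })
  do.call(cbind, pieces)
}

out <- expand_dummies(df.small)
names(out)
```

Missing values in a factor column yield NA in all of its indicator columns here, which may differ from how the package treats them.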