Three ways to dichotomize a variable

Reading time ~4 minutes

Dichotomizing is also called dummy coding. It means: Take a variable with multiple different values (>2), and transform it so that the output variable has 2 different values.

Note that this “thing” can be understood as consisting of two different aspects: Recoding and cutting. Recoding means that value “a” becomes values “b” etc. Cutting means that a “rope” of numbers is cut into several shorter “ropes” (that’s why it is called cutting).

Several ways of achieving this exist in R. Here we discuss three.

First, let’s load some data.

library(AER)
data(Affairs)

We will define a new variable, called “halodrie”. A “halodrie” is someone who likes having affairs (German parlance). The variable should have 2 values, ie., “yes” and “no”.

Using {car}

library(car)

Affairs$halodrie <- car::recode(Affairs$affairs, "0 = 'no'; 1:12 = 'yes'")

head(Affairs$halodrie)
## [1] "no" "no" "no" "no" "no" "no"
table(Affairs$halodrie)
## 
##  no yes 
## 451 150

A comfortable feature of this function is that it allows using the colon operator. Note that the whole recode-thing (all values to be recoded) is to be put into quotation marks.

Using {dplyr}

library(dplyr)

Affairs$halodrie <- 
  dplyr::recode(Affairs$affairs, `0` = "no", .default = "yes")

Affairs$halodrie %>% head
## [1] "no" "no" "no" "no" "no" "no"
table(Affairs$halodrie)
## 
##  no yes 
## 451 150

This function does not allow the colon operator. I assume the reason that the author (Hadley Wickham) argues that a given function should only be capable of one thing. Here, the function reassigns a value of a variable, not (much) more, not less.

Using base::cut

Affairs$halodrie <- cut(Affairs$affairs, breaks = c(-Inf, 0, +Inf), labels = c("no", "yes"))

table(Affairs$halodrie)
## 
##  no yes 
## 451 150

Of note, when a continuous variable is “cut”, one must specify the minimum and the maximum value (or arbitrarly small or large values) as cutting points. So cutting in two halfs, is not one cutting point for cut, but three (always add two cutting points: one being the smallest value in the sample [or smaller, even -Inf, the other one being the largest value in the sample [or even Inf]).

Summary

For beginners, I would recommend car::recode. It provides both recoding and cutting in one function, and hence may be easier to apply at start. It also offers quite some flexibility. Assume you are interested in marriages ranging from 5 to 15 years:

Affairs$years5_15 <- car::recode(Affairs$yearsmarried, "0:4 =  'no'; 5:15 = 'yes'; else = 'no'")

head(Affairs$years5_15)
## [1] "yes" "no"  "yes" "yes" "no"  "no"
table(Affairs$years5_15)
## 
##  no yes 
## 245 356

Crashkurs Datenanalyse mit R

# Willkommen zum R-CrashkursNicht jeder liebt Datenanalyse und Statistik... in gleichem Maße! Das ist zumindest meine Erfahrung aus dem U...… Continue reading

Different ways to count NAs over multiple columns

Published on September 08, 2017

Different ways to present summaries in ggplot2

Published on September 08, 2017