Cross-tabulate multiple variables

reading time: 15-20 min.

Recently, I analyzed some data of a study where the efficacy of online psychotherapy was investigated. The investigator had assessed whether or not a participant suffered from some comorbidities (such as depression, anxiety, eating disorder…).

I wanted to know whether each of these (10 or so) comorbidities was associated with the outcome (treatment success, yes vs. no).

Of course, an easy solution would be to “half-manually” check the association, eg. using table() in R. But I wanted a more reproducible, more automated solution (ok, I confess, I justed wanted it to be cooler, probably…).

As a starter, consider the old question of the Titanic disaster: Did class correlate with survival? Let’s see:

data(titanic, package = "COUNT")  # package need be installed
attach(titanic)  # save some typing
table(class, survived)# tabulate the two variables

##            survived
## class        no yes
##   1st class 122 203
##   2nd class 167 118
##   3rd class 528 178
##   crew        0   0

prop.table(table(class, survived))  # get the percentages/proportions

##            survived
## class               no        yes
##   1st class 0.09270517 0.15425532
##   2nd class 0.12689970 0.08966565
##   3rd class 0.40121581 0.13525836
##   crew      0.00000000 0.00000000

The tabulation results look like this:

Alas! Money can buy you life, it appears…

Note that we are talking about nominal variables with two levels, ie. dichotomous variables. In other words, we are looking at frequencies.

So my idea was:

take a column of a predictor (say, depression)
take the column of the outcome variable (treatment success)
Cross-tabulate the association.
Repeat those steps for each of the comorbidities
Make sure it looks (sufficiently) cool…

Have I already told that I like the R-package dplyr? Check it out. Let’s see how to do the job with dplyr. Let’s use the Wages dataset from package ISLR.

This dataset from ISLR shows the wage of some male professionals in the Atlantic US area; along with the wage there are also some “risk factors” for the wage such as health status, education, race. Let’s see if there is an association.

First, load the package and the data:

# don't forget to install the packages upfront.
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
data(Wage, package = "ISLR")

Now, here comes dplyr:

Wage %>%
  mutate(wage_f = ntile(wage, 2)) %>%  # bin it
  group_by(wage_f, health, race) %>%
  summarise(count = n()) %>%  # summarise by each group
  ggplot(aes(x = factor(wage_f), y = count, fill = race)) +
  geom_bar(stat = "identity") +
  facet_wrap(~health)

Let’s discuss the code a bit (not every bit), looking briefly the lines:

build a column where wage is binned in two groups (low vs. high wage)
group by wage group, health status and race (image 222 sub-spreadsheets)
count the row number in each group (ie., 8 mini-spreadsheets)
plot it, with some parameters

Cross-tabulate multiple variables

July 03, 2016

New bar stacking with ggplot 2.2.0

Crashkurs zur Erstellung von Barplots für Umfrage-Daten

Some thoughts (and simulation) on overfitting