# Multiple t-Tests with dplyr

## August 18, 2016

Reading time ~9 minutes

## t-Test on multiple columns

Suppose you have a data set where you want to perform a t-Test on multiple columns with some grouping variable. As an example, say you have a data frame where each column depicts the score on some test (1st, 2nd, 3rd assignment…). Each row is a different student. So you glance at the grading list (OMG!) of a teacher!

How to do that in R? Probably, the most “natural” solution would be some lapply() call.
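
For comparison, here is a minimal sketch of what such an apply-style call might look like, using the tips data from reshape2 and sapply() (a close cousin of lapply() that returns a named vector). The names cols and p_values are just illustrative choices:

```r
# Load the example data (assumes the reshape2 package is installed)
data(tips, package = "reshape2")

# Columns we want to test, each against the grouping variable `sex`
cols <- c("tip", "total_bill")

# For each column, build the formula `<col> ~ sex` and run a Welch t-test
p_values <- sapply(cols, function(col) {
  t.test(reformulate("sex", response = col), data = tips)$p.value
})

p_values
```

The result is a named numeric vector with one p-value per tested column.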

But I like dplyr (see intro here); so, is there some nice way to perform that with dplyr? The short answer is: no. dplyr basically wants to deliver back a data frame, and the t-test does not output a single value, so you cannot use the t-test (right away) for dplyr’s summarise. One way out is using list-columns… Let’s see.
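
To see why summarise() cannot take the t-test right away, it may help to look at what t.test() actually returns. A small sketch, again assuming the tips data from reshape2 (the name res is only for illustration):

```r
# Load the example data (assumes the reshape2 package is installed)
data(tips, package = "reshape2")

# Run a single t-test and keep the whole result object
res <- t.test(tip ~ sex, data = tips)

class(res)   # an "htest" object, i.e. a list with several components
names(res)   # component names, among them "statistic" and "p.value"

# Only after picking one component do we get a single number
res$p.value
```

So the test result is a whole list, and a single value such as the p-value has to be extracted explicitly before it fits into one cell of a summary data frame.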

Load dplyr, tidyr and some data:

library(dplyr)
library(tidyr)
data(tips, package = "reshape2")
glimpse(tips)

## Observations: 244
## Variables: 7
## $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26....
## $ tip        <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.9...
## $ sex        <fctr> Female, Male, Male, Male, Female, Male, Male, Male...
## $ smoker     <fctr> No, No, No, No, No, No, No, No, No, No, No, No, No...
## $ day        <fctr> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, ...
## $ time       <fctr> Dinner, Dinner, Dinner, Dinner, Dinner, Dinner, Di...
## $ size       <int> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, ...

Well, it looks crude, but that’s a way:

tips %>%
  select(tip, total_bill, sex) %>%                   # test columns plus grouping variable
  gather(key = variable, value = value, -sex) %>%    # wide to long
  group_by(sex, variable) %>%
  summarise(value = list(value)) %>%                 # one list-column cell per group
  spread(sex, value) %>%                             # one column per sex
  group_by(variable) %>%
  mutate(p_value = t.test(unlist(Female), unlist(Male))$p.value,
         t_value = t.test(unlist(Female), unlist(Male))$statistic)

## Source: local data frame [2 x 5]
## Groups: variable [2]
##
##     variable     Female        Male    p_value   t_value
##        <chr>     <list>      <list>      <dbl>     <dbl>
## 1        tip <dbl [87]> <dbl [157]> 0.13780684 -1.489536
## 2 total_bill <dbl [87]> <dbl [157]> 0.01857339 -2.373398

Let’s go through this code step by step. First, this bit

tips %>%
  select(tip, total_bill, sex) %>%
  head

##    tip total_bill    sex
## 1 1.01      16.99 Female
## 2 1.66      10.34   Male
## 3 3.50      21.01   Male
## 4 3.31      23.68   Male
## 5 3.61      24.59 Female
## 6 4.71      25.29   Male

is used to select the columns we want to perform the t-Test on (here: tip and total_bill) plus the grouping variable (sex). Then by the next bit

tips %>%
  select(tip, total_bill, sex) %>%
  gather(key = variable, value = value, -sex) %>%
  head

##      sex variable value
## 1 Female      tip  1.01
## 2   Male      tip  1.66
## 3   Male      tip  3.50
## 4   Male      tip  3.31
## 5 Female      tip  3.61
## 6   Male      tip  4.71

we “melt” the data frame down, so that all numeric variables are put in one column (underneath each other).

tips %>%
  select(tip, total_bill, sex) %>%
  gather(key = variable, value = value, -sex) %>%
  group_by(sex, variable) %>%
  summarise(value = list(value))

## Source: local data frame [4 x 3]
## Groups: sex [?]
##
##      sex   variable       value
##   <fctr>      <chr>      <list>
## 1 Female        tip  <dbl [87]>
## 2 Female total_bill  <dbl [87]>
## 3   Male        tip <dbl [157]>
## 4   Male total_bill <dbl [157]>

Now it gets interesting. We put all the values per group (e.g., male-tip or female-total_bill…) in one cell. Yes, that’s right. In each cell of column value there is now a list (a bunch) of values. That’s what is called a “list-column”. We will now use this list column for the t-test.
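
To get a feeling for what sits inside such a list-column, one can store the intermediate result and poke at individual cells. A small sketch under the same assumptions as above (dplyr, tidyr and the reshape2 tips data are available; the name nested is only for illustration):

```r
library(dplyr)
library(tidyr)
data(tips, package = "reshape2")

# Same steps as above, but stored so we can inspect the list-column
nested <- tips %>%
  select(tip, total_bill, sex) %>%
  gather(key = variable, value = value, -sex) %>%
  group_by(sex, variable) %>%
  summarise(value = list(value))

# Each cell of `value` is an ordinary numeric vector
class(nested$value[[1]])

# Number of values stored in each cell (one cell per sex/variable pair)
lengths(nested$value)

# First few tips of the first group
head(nested$value[[1]])
```

Note the double brackets: nested$value[[1]] pulls the numeric vector out of the first cell, which is what unlist() achieves inside the mutate() call.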

tips %>%
  select(tip, total_bill, sex) %>%
  gather(key = variable, value = value, -sex) %>%
  group_by(sex, variable) %>%
  summarise(value = list(value)) %>%
  spread(sex, value) %>%
  group_by(variable)

## Source: local data frame [2 x 3]
## Groups: variable [2]
##
##     variable     Female        Male
## *      <chr>     <list>      <list>
## 1        tip <dbl [87]> <dbl [157]>
## 2 total_bill <dbl [87]> <dbl [157]>

But before we do the t-Test, we “spread” the data frame. That is, we convert from “long” to “wide” format. Next, we group by variable. In practice, that means that the following t-test will be applied to each member of this group (i.e., each variable, here tip and total_bill).

And now the t-Test:

tips %>%
  select(tip, total_bill, sex) %>%
  gather(key = variable, value = value, -sex) %>%
  group_by(sex, variable) %>%
  summarise(value = list(value)) %>%
  spread(sex, value) %>%
  group_by(variable) %>%
  mutate(p_value = t.test(unlist(Female), unlist(Male))$p.value,
         t_value = t.test(unlist(Female), unlist(Male))$statistic)

## Source: local data frame [2 x 5]
## Groups: variable [2]
##
##     variable     Female        Male    p_value   t_value
##        <chr>     <list>      <list>      <dbl>     <dbl>
## 1        tip <dbl [87]> <dbl [157]> 0.13780684 -1.489536
## 2 total_bill <dbl [87]> <dbl [157]> 0.01857339 -2.373398

That’s it.

## You can have it very simple

That was quite cumbersome for something which could have been achieved very simply with

t.test(tip ~ sex, data = tips)$p.value

## [1] 0.1378068

t.test(total_bill ~ sex, data = tips)$p.value

## [1] 0.01857339

So our way (OK, my way) does not seem advisable. However, it may have been instructive for the “thinking” of dplyr.

By the way, another simpler approach with dplyr could have been:

tips %>%
  summarise_each(funs(t.test(.[sex == "Female"], .[sex == "Male"])$p.value), vars = total_bill:tip)

##        vars1     vars2
## 1 0.01857339 0.1378068
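
In more recent dplyr releases, summarise_each() and funs() are deprecated. Assuming dplyr 1.0 or later, a roughly equivalent sketch uses across() instead (the name p_values_across is only for illustration):

```r
library(dplyr)   # assumes dplyr >= 1.0, where across() is available
data(tips, package = "reshape2")

# One p-value per selected column, comparing Female vs. Male values
p_values_across <- tips %>%
  summarise(across(c(total_bill, tip),
                   ~ t.test(.x[sex == "Female"], .x[sex == "Male"])$p.value))

p_values_across
```

Here the result columns keep the names of the tested variables (total_bill and tip), which is a little easier to read than vars1 and vars2.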


## Acknowledgment

This post was inspired by this post on Stack Overflow, and also by this one.
