A second look at grouping with dplyr

One basic idea of dplyr is that each function should focus on one job. That’s why there are no functions such as compute_summaries_by_group_with_robust_variants(df). Rather, summarising and grouping are seen as different jobs which should be accomplished by different functions. And, in turn, that’s why group_by, the grouping function of dplyr, is of considerable importance: this one function does the grouping for any subsequent operation.

Let’s load all tidyverse libraries in one go:

library(tidyverse)

At first glance, this code

my_data %>% 
  group_by(grouping_var) %>% 
  summarise(n_per_group = n())  # n() takes no arguments

can more or less be read as plain English. So far, so simple. As a popular example, consider the dataset mtcars:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(n_per_group = n())
## # A tibble: 3 x 2
##     cyl n_per_group
##   <dbl>       <int>
## 1     4          11
## 2     6           7
## 3     8          14

n simply counts the rows of a (grouped) data frame. As all columns must, by definition, have the same number of rows, it does not matter which particular column we look at. Hence, no column is specified for n().
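
As a quick cross-check, the sketch below compares the n() counts with a plain base R table() of the cyl column and with count(), dplyr's shorthand for group_by() followed by summarise(n()). The object names (by_cyl, base_counts, short) are made up for illustration, and the name argument of count() assumes a reasonably recent dplyr version.

```r
library(dplyr)

# n() takes no arguments: it returns the number of rows in the current group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_per_group = n())

# Base R cross-check: tabulate the cyl column directly
base_counts <- table(mtcars$cyl)

# dplyr shorthand for group_by() + summarise(n())
short <- mtcars %>%
  count(cyl, name = "n_per_group")

# All three approaches should agree on the counts per cylinder group
all(by_cyl$n_per_group == as.vector(base_counts))
all(by_cyl$n_per_group == short$n_per_group)
```

Whichever column one might be tempted to pass to n(), the answer would be the same, which is why the function does not accept one.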

For a slightly more complicated case, say we want to count the number of cars grouped by (number of) cylinders and by (type of) transmission. In dplyr, we just add an additional grouping variable:

mtcars %>% 
  group_by(cyl, am) %>%  # 'am' added as second grouping variable
  summarise(n_per_group = n())
## # A tibble: 6 x 3
## # Groups:   cyl [?]
##     cyl    am n_per_group
##   <dbl> <dbl>       <int>
## 1     4     0           3
## 2     4     1           8
## 3     6     0           4
## 4     6     1           3
## 5     8     0          12
## 6     8     1           2

Note that dplyr now tells us about groups: # Groups: cyl [?]. That output tells us that cyl is the grouping variable. In the dplyr version used here, the question mark just means that the number of groups is unknown right now; it will only be computed when the next line is executed.

Note also that only one grouping variable is listed. But wait! We specified two grouping variables, right? What happened is that each call to summarise removes one grouping variable, namely the last one. So if we had two grouping variables before we ran summarise, we are left with one grouping variable after the call to summarise. I assume this behavior was implemented so that you can ‘roll up’ a data frame: get the counts for each cell of the two grouping variables, then sum over the levels of one variable, then sum again to double-check the grand total. See:

mtcars %>% 
  group_by(cyl, am) %>%  
  summarise(n_per_group = n()) %>% 
  summarise(n_per_group = n())
## # A tibble: 3 x 2
##     cyl n_per_group
##   <dbl>       <int>
## 1     4           2
## 2     6           2
## 3     8           2

This output tells us that there are 2 rows for each group of cyl (am == 1 and am == 0). Maybe more useful:

mtcars %>% 
  group_by(cyl, am) %>%  
  summarise(n_per_group = n()) %>% 
  summarise(n_per_group = sum(n_per_group))
## # A tibble: 3 x 2
##     cyl n_per_group
##   <dbl>       <int>
## 1     4          11
## 2     6           7
## 3     8          14
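
One way to watch the peeling happen is to ask dplyr for the current grouping variables with group_vars() after each step. This is only a sketch: the intermediate object names (grouped, counted, summed) are made up for illustration, and recent dplyr versions may additionally print a message about the grouping of the result.

```r
library(dplyr)

# Two grouping variables before summarise
grouped <- mtcars %>% group_by(cyl, am)
group_vars(grouped)      # "cyl" "am"

# One summarise later, the last grouping variable (am) is gone
counted <- grouped %>% summarise(n_per_group = n())
group_vars(counted)      # "cyl"

# A second summarise removes cyl as well, leaving no grouping
summed <- counted %>% summarise(n_per_group = sum(n_per_group))
group_vars(summed)       # character(0)
```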

What happens if we ‘peel off’ the last layer and sum up the remaining rows?

mtcars %>% 
  group_by(cyl, am) %>%  
  summarise(n_per_group = n()) %>% 
  summarise(n_per_group = sum(n_per_group)) %>% 
  summarise(n_per_group = sum(n_per_group))
## # A tibble: 1 x 1
##   n_per_group
##         <int>
## 1          32

We get the overall number of rows of the whole dataset.

Each summarise peels off one layer of grouping.
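
If the layer-by-layer roll-up is not needed, the grouping can also be dropped in one step. The sketch below uses ungroup() for that and then sums the cell counts directly; the column name total is chosen here for illustration only.

```r
library(dplyr)

total <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(n_per_group = n()) %>%
  ungroup() %>%                          # drop all remaining grouping at once
  summarise(total = sum(n_per_group))

# The grand total should equal the number of rows in the dataset
total$total == nrow(mtcars)
```

Compared with chaining several summarise calls, this makes the intent explicit and does not depend on how many grouping variables are still attached.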
