Programming with dplyr: Part 02, writing a function

Reading time ~8 minutes

Recently, since dplyr <= 0.6.0 a new way of dealing with NSE was introduced, called tidyeval. As with every topic that begs our attention, the question “why bother” is in place. Theone answer is “you’ll need this stuff if you want to lock dplyr verbs inside a function”. Once you like dplyr and friends, a natural second step is to use the ideas not only for interactive use, but for more “programming” type, ie., writing functions.

The topic of this post is to explain some of the ‘tidyeval’ thinking. That is, let’s write a function using tidyeval thinking.

Build a function using tidyeval ideas

As an example, assume we want to have a function that takes two numeric vectors (ie., two groups), performs all pairwise comparisons (>), and returns the proportion ‘favorable’ for the first group. Say, out of 20 comparisons, 16 showed that group1 > group2. Hence, the ‘proportion of favorable comparisons’ would be 16/20 or .8. This measure can be seen as an effect size for the Mann Whitney U Test, Wilcoxon Test or even some type of rank biserial correlation. See the nice paper on the “Simple Difference Formula” of Kerby 2014 for details and background.

First, load tidyverse.

library(tidyverse)

Let’s take the mtcars dataset, I know it’s boring, but instructive, I hope.

So, here comes the function prop_fav:

prop_fav <- function(df, value, group, g1, g2){
  value <- enquo(value)
  group <- enquo(group)


  df %>%
   filter((!!group) == g1) %>%
   select(!!value) %>%
   pull -> values_g1

  df %>%
    filter((!!group) == g2) %>%
    select(!!value) %>%
    pull -> values_g2

  values_g1 <- na.omit(values_g1)
  values_g2 <- na.omit(values_g2)

  comp_grid <- expand.grid(values_g1, values_g2)

  fav_vec <- comp_grid[["Var1"]] > comp_grid[["Var2"]]

  fav_sum <- sum(fav_vec)

  fav_prop <- fav_sum / nrow(comp_grid)

  return(fav_prop)

}

tidyeval steps explained

What did we do? Let’s look at the steps in turn:

value <- enquo(value)
group <- enquo(group)

With enquo we take the parameter of a function as expression. That means, we do not evaluate it, we just take the expression, leave it on hold, and use it later on.

Then we extract the values of group 1 and group2 out of the data frame (assumed tidy). In tidy data frames, each variable has its own column. So we need to do some work to get the values extracted.

df %>%
   filter((!!group) == g1) %>%
   select(!!value) %>%
   pull -> values_g1

  df %>%
    filter((!!group) == g2) %>%
    select(!!value) %>%
    pull -> values_g2

See the !!; they mean “hey R, remember the expression I stored recently? Now take it, and ‘unquote’ it, that is, just run it!”. The double exclamation mark is just syntactic sugar for that phrase.

pull, also recently added to dplyr, pulls out a vector (column) out of a data frame.

And in the last bit, we take the two vectors, span a Cartesian product and do a comparison of each element. More bluntly, take the first value of group 1 and compare it with value 1 of group 2. Decide which is greater. Go on comparing value 1 with the rest of values in group 2. And then go on with the rest of the values of group 1 in the same manner.

Then we simple compute the proportion where a member of group 1 had a greater value than a member of group 2. The proportion resulting is a measure of the effect size for the difference in distribution of the two groups.

 comp_grid <- expand.grid(values_g1, values_g2)

  fav_vec <- comp_grid[["Var1"]] > comp_grid[["Var2"]]

  fav_sum <- sum(fav_vec)

  fav_prop <- fav_sum / nrow(comp_grid)

Convenience sourcing

For convenience, this function can be sourced like this:

source("https://sebastiansauer.github.io/Rcode/prop_fav.R")

Example time

Now, some examples. Comparing cars with 4 and 6 cylinders, how many of the 4 cylinder group has more horse power? Or, more precisely, “Out of 100 comparisons of 4 and 6 cyl cars, what’s the proportion of 4 cyl cars having more hp (than the 6 cyl group)?”.

prop_fav(df = mtcars, value = hp, group = cyl, g1 = 4, g2 = 6)
## [1] 0.06493506

About 6%. Accordingly, in ~94% of the comparisons, a 6 cyl car had more horse power.

prop_fav(df = mtcars, value = hp, group = cyl, g1 = 6, g2 = 4)
## [1] 0.9350649

Let’s look at extraversion, coming from this dataset:

# devtools::install_github("sebastiansauer/prada")
data(extra, package = "prada")

Let’s compare women and men to see how is more extrovert.

prop_fav(df = extra, value = extra_mean, group = sex, g1 = "Frau", g2 = "Mann")
## [1] 0.4867873

Women are about as extroverted as men.

What about the number of Facebook friends?

prop_fav(extra, n_facebook_friends, sex, g1 = "Frau", g2 = "Mann")
## [1] NA

Men have slightly more Facebook friends (in this sample).

Concluding thoughts

Note that this measure can be applied for non-metric, ie., ordinal variables. It is quite straight forward, that makes it easy to comprehend.

Using tidyeval can give a headache at start, but it is worthwhile if you want to dig deeper, and write your own functions. However, there may be some similar alternatives such as wrapr::let, see here.

This was just a tiny little start. Hadley Wickham’s book ‘Advances R’ gives a much more thorough introduction.

Crashkurs Datenanalyse mit R

# Willkommen zum R-CrashkursNicht jeder liebt Datenanalyse und Statistik... in gleichem Maße! Das ist zumindest meine Erfahrung aus dem U...… Continue reading

Different ways to count NAs over multiple columns

Published on September 08, 2017

Different ways to present summaries in ggplot2

Published on September 08, 2017