# Programming with dplyr: Part 01, introduction

As for others, Hadley Wickham’s dplyr, and more generally the tidyverse approach, has considerably changed the way I do data analyses. Most notably, the pipe (coming from magrittr by Stefan Milton Bache) has crept into nearly every analysis I do.

That is, every analysis except for functions and other non-interactive stuff. In those programming contexts, the dplyr way does not work, due to its non-standard evaluation, or NSE for short.

So, this post is about programming with dplyr, or more precisely, using the recently introduced tidyeval approach (recently introduced into widely used R libraries, that is).

To understand the use case, consider the following example. Say we count the frequencies of some groups, and want to add the proportion of these counts.

library(tidyverse)

mtcars %>%
  group_by(cyl) %>%
  summarise(n = n())

## # A tibble: 3 x 2
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14


Or, shorter: mtcars %>% count(cyl).

Now, let’s add a column with the proportion of the count column (n).

mtcars %>%
  count(cyl) %>%
  mutate(prop = n / sum(n))

## # A tibble: 3 x 3
##     cyl     n    prop
##   <dbl> <int>   <dbl>
## 1     4    11 0.34375
## 2     6     7 0.21875
## 3     8    14 0.43750
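
As a cross-check, the same proportions can be obtained in base R. This is only a minimal sketch and not part of the workflow above; the names cyl_counts and cyl_props are made up for illustration.

```r
# Count the cylinder groups, then divide each count by the total
cyl_counts <- table(mtcars$cyl)
cyl_props <- prop.table(cyl_counts)
cyl_props
```

The values should match the prop column computed with mutate().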


For extending this game to more than one variable, see this post.

Now, let’s assume we would like to put the ‘add a proportion column to my dataframe’ in a function. How to proceed?

We might think this approach should work:

add_prop <- function(df, count_var, group_var){
  df %>%
    ungroup() %>%
    group_by(group_var) %>%
    mutate(prop =  count_var / sum(n()))
}

mtcars %>%
  count(cyl) %>%
  add_prop(count_var = n, group_var = cyl)


However, this does not work.

The dplyr ‘verbs’ expect quoted input variables. That means we must quote the parameters first, before we can hand them over to dplyr. quo basically means “hey R, don’t evaluate this expression yet. Just read it, understand the name of the expression, and wait until I tell you. And R, understand that you got an expression, not a simple text string”. The last sentence tells us (or R) that we do not want a text string, but the quotation (or ‘citation’ of the expression, if you like) of an expression to be evaluated in the future.

mtcars %>%
  count(cyl) %>%
  add_prop(count_var = quo(n), group_var = quo(cyl))


The important point is quo.

quo(mtcars$cyl)

## <quosure: global>
## ~mtcars$cyl
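
To see the ‘quote now, evaluate later’ idea in isolation, here is a small sketch using rlang directly (rlang is the package behind tidyeval; loading it explicitly is my assumption, the post itself only loads tidyverse). The names doubled_cyl and result are made up for illustration.

```r
library(rlang)

# Quote an expression; nothing is computed yet
doubled_cyl <- quo(cyl * 2)
is_quosure(doubled_cyl)

# Evaluate it later, looking up `cyl` in the mtcars data frame
result <- eval_tidy(doubled_cyl, data = mtcars)
head(result, 3)
```

The quosure only carries the expression and its environment; the numbers appear once eval_tidy() is told which data to use.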


But there’s a second step we need to take. Now that quo has defined a quoted expression, the dplyr verbs do not need to quote their input variables, because they are already quoted. That is, we need to tell R now: “Hey, do not quote, evaluate! We have taken care of the quoting before”. This is what the !! operator does.

add_prop <- function(df, count_var, group_var){
  df %>%
    mutate(prop =  (!!count_var) / sum(!!count_var))
}

mtcars %>%
  count(cyl) %>%
  add_prop(count_var = quo(n), group_var = quo(cyl))

## # A tibble: 3 x 3
##     cyl     n    prop
##   <dbl> <int>   <dbl>
## 1     4    11 0.34375
## 2     6     7 0.21875
## 3     8    14 0.43750
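
If the caller should not have to type quo() at all, the quoting can be moved inside the function with enquo(). The following is a sketch of that variant, assuming dplyr >= 0.7; add_prop2 is a hypothetical name, and the unused group_var argument is dropped for brevity.

```r
library(dplyr)

add_prop2 <- function(df, count_var) {
  # Capture the expression the caller typed, instead of evaluating it
  count_var <- enquo(count_var)
  df %>%
    mutate(prop = (!!count_var) / sum(!!count_var))
}

res <- mtcars %>%
  count(cyl) %>%
  add_prop2(count_var = n)
res
```

The call site now looks like ordinary dplyr code, because the bare name n is quoted by the function itself.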


Huzzah!

# Preparation of extraversion survey data

For teaching purposes and out of curiosity about some psychometric questions, I have run a survey on extraversion here. The dataset has been published at OSF (DOI 10.17605/OSF.IO/4KGZH). The survey is based on a Google form, which in turn saves the data in a Google spreadsheet. Before the data can be analyzed, some preparation and makeup is needed. This post shows some general makeup, typical for survey data.

Download the data from source (Google spreadsheets); the package gsheet provides an easy interface for that purpose.

library(gsheet)

## Warning: Missing column names filled in: 'X23' [23]

library(tidyverse)  # data judo
library(purrr)  # map
library(modeest)  # mlv


# Prepare variable names

First, save item names in a separate object for later retrieval and for documentation.

extra_names <- names(extra_raw)

## [1] "Zeitstempel"
## [2] "Bitte geben Sie Ihren dreistellen anonymen Code ein (1.: Anfangsbuchstabe des Vornamens Ihres Vaters; 2.: Anfangsbuchstabe des Mädchennamens Ihrer Mutter; 3: Anfangsbuchstabe Ihres Geburstsorts)"
## [3] "Ich bin gerne mit anderen Menschen zusammen."
## [4] "Ich bin ein Einzelgänger. (-)"
## [5] "Ich bin in vielen Vereinen aktiv."
## [6] "Ich bin ein gesprächiger und kommunikativer Mensch."


Next, replace the lengthy col names by ‘i’ followed by a number:

extra <- extra_raw
names(extra) <- c("timestamp", "code", paste0("i",1:26))


Then we give some of the variables more telling names.

extra <-
  extra %>%
  rename(n_facebook_friends = i11,
         n_hangover = i12,
         age = i13,
         sex = i14,
         extra_single_item = i15,
         time_conversation = i16,
         presentation = i17,
         n_party = i18,
         clients = i19,
         extra_vignette = i20,
         extra_vignette2 = i22,
         major = i23,
         smoker = i24,
         sleep_week = i25,
         sleep_wend = i26)


For sorting purposes, it is more helpful if variable names with numbers have the same format, i.e., the same number of leading zeros. So not “i1, i2”, but “i01, i02” (if the number of items is not greater than 99).

To get the same number of leading zeros we can use:

i <- 1:10
item_names <- paste0("i", formatC(i, width = 2, flag = "0"))

colnames(extra)[3:12] <- item_names
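
For comparison, sprintf() from base R produces the same zero-padded names. This is just an alternative sketch; item_names_alt is a made-up name.

```r
i <- 1:10

# formatC pads each number to width 2 with leading zeros
item_names <- paste0("i", formatC(i, width = 2, flag = "0"))

# sprintf does the same with the "%02d" format
item_names_alt <- sprintf("i%02d", i)

identical(item_names, item_names_alt)
```

Either way, the names run from i01 to i10 and all have the same length.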


# Parse numbers from chr columns

Some columns actually assess a number but the field in the survey form was liberally open to characters. So we have to convert the character to numbers, or, more precisely, suck out the numbers from the character variables.

extra$n_hangover <- parse_number(extra$n_hangover)

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)

## Warning: 2 parsing failures.
## # A tibble: 2 x 4
##     row   col expected actual
##   <int> <int>    <chr>  <chr>
## 1   132    NA a number Keinen
## 2   425    NA a number      .

extra$n_facebook_friends <- parse_number(extra$n_facebook_friends)
extra$time_conversation <- parse_number(extra$time_conversation)

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)

## Warning: 2 parsing failures.
## # A tibble: 2 x 4
##     row   col expected      actual
##   <int> <int>    <chr>       <chr>
## 1   153    NA a number       Opfer
## 2   633    NA a number Eine Minute

extra$n_party <- parse_number(extra$n_party)

## Warning: 1 parsing failure.
## # A tibble: 1 x 4
##     row   col expected actual
##   <int> <int>    <chr>  <chr>
## 1   270    NA a number      u
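
To make the behaviour of parse_number() more tangible, here is a toy sketch with made-up answers (the vector answers is not taken from the survey data, apart from the word “Keinen”, which appears in the warning above). Entries without any digit cannot be parsed and end up as NA.

```r
library(readr)

# Made-up free-text answers, similar in spirit to the survey fields
answers <- c("3 Mal", "10x", "Keinen")

# Parsing failures raise a warning; it is suppressed here to keep the output short
parsed <- suppressWarnings(parse_number(answers))

# Strip the attributes that readr attaches to the result
as.vector(parsed)
```

Checking the failed rows before discarding them, as done above via the warnings, is a sensible habit: an answer such as “Eine Minute” carries a number that parse_number() cannot see.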


The parsing left the dataframe with some rather ugly attributes, albeit with interesting information. After checking them, however, I feel inclined to delete them.

attributes(extra$n_hangover) <- NULL
attributes(extra$time_conversation) <- NULL
attributes(extra$n_party) <- NULL
attributes(extra$sleep_wend) <- NULL
attr(extra, "spec") <- NULL


# Recode items 1: Reverse order

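Some extraversion items (variables i02, i06) need to be recoded, i.e., reversed. Subtracting each answer from 5 flips the scale, which presumes that answers are coded from 1 to 4.

extra %>%
  mutate(i02 = 5-i02,
         i06 = 5-i06) %>%
  rename(i02r = i02,
         i06r = i06) -> extra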
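
The reversal rule generalizes to other scales: the reversed value is (minimum + maximum) minus the answer. Here is a small sketch; reverse_item, min_val and max_val are hypothetical names, and the 1 to 4 default is an assumption read off the 5 - x used above.

```r
# Reverse an item on a scale running from min_val to max_val
reverse_item <- function(x, min_val = 1, max_val = 4) {
  (min_val + max_val) - x
}

reversed <- reverse_item(c(1, 2, 3, 4, NA))
reversed
```

Missing answers stay missing, and applying the function twice gives back the original values.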


# Recode items 2: Convert labels to numbers

Typically, item answers are anchored with labels such as “do not agree” to “fully agree” or similar. However, sometimes it is convenient to have such labels in a number format. Let’s convert such item labels to numbers.

extra %>%
  mutate(clients_freq = case_when(
    clients == "im Schnitt 1 Mal pro Quartal (oder weniger)" ~ "1",
    clients == "im Schnitt 1 Mal pro Monat" ~ "2",
    clients == "im Schnitt 1 Mal pro Woche" ~ "3",
    clients == "im Schnitt 1 Mal pro Tag" ~ "4",
    clients == "im Schnitt mehrfach pro Tag" ~ "5",
    TRUE ~ "NA")) %>%
  mutate(clients_freq = parse_number(clients_freq)) -> extra
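
An alternative to case_when plus parsing is a named lookup vector in base R. The sketch below uses the same labels; freq_levels and the toy vector clients_toy are made-up names, and the toy answers are not taken from the data.

```r
# Map each label to its numeric code
freq_levels <- c(
  "im Schnitt 1 Mal pro Quartal (oder weniger)" = 1,
  "im Schnitt 1 Mal pro Monat" = 2,
  "im Schnitt 1 Mal pro Woche" = 3,
  "im Schnitt 1 Mal pro Tag" = 4,
  "im Schnitt mehrfach pro Tag" = 5
)

# Toy answers: a known label, an unknown label, and a missing value
clients_toy <- c("im Schnitt 1 Mal pro Woche", "etwas anderes", NA)

# Indexing by name returns NA for anything not in the lookup
clients_freq_toy <- unname(freq_levels[clients_toy])
clients_freq_toy
```

The result is numeric right away, so no detour via character strings is needed.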


# Compute summaries (extraversion score)

Let’s compute the mean, but also the median and mode, for each person (i.e., row) with regard to the 10 extraversion items.

For that, we’ll use a helper function to compute the mode (most frequent value).

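most <- function(x){
  # non-numeric input has no numeric mode
  if (!(is.numeric(x))) {
    out <- NA
    return(out)
  }
  x <- stats::na.omit(x)
  # frequency table of the values, and the highest frequency
  t <- base::table(x)
  m <- base::max(t)
  out <- base::as.numeric(base::names(t)[t==m])
  # in case of ties, take the first (smallest) value
  if (base::length(out) > 1) out <- out[1]
  if (base::length(out) == 0) out <- NA
  base::return(out)
}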
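
Here is a sketch of a slightly more defensive variant, together with a few toy calls. The name most_safe is hypothetical; the only intended change compared to most() is an early return for input that has no non-missing values, which avoids calling max() on an empty table.

```r
most_safe <- function(x) {
  if (!is.numeric(x)) return(NA_real_)
  x <- x[!is.na(x)]
  # Nothing left to count: return NA instead of calling max() on an empty table
  if (length(x) == 0) return(NA_real_)
  tab <- table(x)
  # which.max() picks the first maximum, so ties resolve to the smallest value
  as.numeric(names(tab)[which.max(tab)])
}

most_safe(c(1, 2, 2, 3))
most_safe(c(1, 1, 2, 2))
most_safe(c(NA_real_, NA_real_))
```

Rows in which a person skipped all ten items are the typical case where the early return matters.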

extra %>%
  rowwise %>%
  summarise(extra_mean = mean(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10), na.rm = TRUE),
            extra_md = median(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10), na.rm = TRUE),
            extra_aad = aad(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10), na.rm = TRUE),
            extra_mode = most(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10)),
            extra_iqr = IQR(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10))) -> extra_scores

## Warning in base::max(t): no non-missing arguments to max; returning -Inf



This approach works, but it’s a lot of duplicated typing. Instead, give summarise an unquoted expression:

First, we define an expression; that’s to say we want R to quote, rather than to evaluate, the expression. This can be achieved using quo:

extra_items <- quo(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10))


Then we hand over the unquoted expression (defined by quo) to mean. Unquoting now (dplyr >= 0.7) works by using the !! operator.

extra %>%
  rowwise %>%
  summarise(extra_mean = mean(!!extra_items),
            extra_md = median(!!extra_items),
            extra_mode = most(!!extra_items),
            extra_iqr = IQR(!!extra_items)) -> extra_scores

## Warning in base::max(t): no non-missing arguments to max; returning -Inf


extra_scores %>% head

## # A tibble: 6 x 5
##   extra_mean extra_md extra_aad extra_mode extra_iqr
##        <dbl>    <dbl>     <dbl>      <dbl>     <dbl>
## 1        2.9      3.0      0.56          3      0.00
## 2        2.1      2.0      0.54          2      0.75
## 3        2.6      3.0      1.08          1      2.50
## 4        2.9      3.0      0.36          3      0.00
## 5        3.2      3.5      0.80          4      1.00
## 6        2.8      3.0      0.68          3      0.75
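
The same pattern can be tried on a tiny made-up data frame, which makes it easy to check the row-wise results by hand. This is a sketch with dplyr loaded on its own; toy, toy_items and toy_scores are made-up names.

```r
library(dplyr)

# Two persons, three items
toy <- tibble(i01 = c(1, 4), i02r = c(2, 4), i03 = c(3, 1))

# Quote the item vector once ...
toy_items <- quo(c(i01, i02r, i03))

# ... and unquote it wherever it is needed
toy_scores <- toy %>%
  rowwise() %>%
  summarise(item_mean = mean(!!toy_items),
            item_md = median(!!toy_items))
toy_scores
```

For the first person the items are 1, 2, 3 (mean 2, median 2); for the second they are 4, 4, 1 (mean 3, median 4).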


Neat! Now let’s bind that to the main df.

extra %>%
  bind_cols(extra_scores) -> extra


Done! Enjoy the freshly juiced data frame.

# tl;dr

Use this convenience function to print a dataframe as a png-plot: tab2grob().

Source the function here: https://sebastiansauer.github.io/Rcode/tab2grob.R

Easiest way in R:

source("https://sebastiansauer.github.io/Rcode/tab2grob.R")


# Printing csv-dataframes as ggplot plots

Recently, I wanted to print dataframes not as normal tables, but as a png-plot.

Why? Well, basically as a convenience function for colleagues who are not into using Markdown & friends. As I am preparing some stats stuff (see my new open access course material here) using RMarkdown, I wanted to prepare the materials ready for use in Powerpoint.

I found out that tables are difficult to copy-paste into Powerpoint. That’s why I thought it may help to print the tables as plots.

So I came up with a function that does this job:

1. Scan a folder for all csv files
2. Parse each csv file
3. Print it as a plot
4. Save it as a png

# R code

tab2plot <- function(path = "") {

  # Print csv-dataframes as plots using ggplot2 - for easier handling in Powerpoint & friends
  # Arguments
  # path: are the csv-files in a subdirectory? Then specify here, including the trailing slash. Defaults to "" (working directory)
  # Value:
  # None. Saves each csv-file as png file of the table.

  library(tidyverse)
  library(stringr)
  library(gridExtra)
  library(grid)

  # list all csv files; an empty path means the working directory
  df <- tibble(
    file_name = list.files(path = if (path == "") "." else path, pattern = "\\w+\\.csv$"),
    title = str_extract(file_name, pattern = "\\w+")
  )

  # left-align the cell text of the table
  tt <- ttheme_default(core=list(fg_params=list(hjust=0, x=0.1)))

  for (i in seq_along(df$file_name)) {
    cat(paste0(df$file_name[i], "\n"))
    csv_i <- read.csv(paste0(path, df$file_name[i]))
    #csv_i <- read.csv("includes/Befehle_Cluster.csv")
    csv_i %>%
      rownames_to_column %>%
      dplyr::select(-rowname) -> csv_i
    p <- arrangeGrob(tableGrob(csv_i, theme = tt))
    ggsave(filename = paste0("images/Tabellen/","Tabelle_",df$title[i],".png"), p)
  }

}
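
The first step of the list above (scanning a folder for csv files and deriving a title from each file name) can be tried in isolation. The sketch below uses only base R and a temporary folder; demo_dir, csv_files and titles are made-up names, and tools::file_path_sans_ext is swapped in for the str_extract call used in the function.

```r
# Set up a temporary folder with one csv file and one unrelated file
demo_dir <- file.path(tempdir(), "tab2plot_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(head(mtcars), file.path(demo_dir, "cars.csv"), row.names = FALSE)
writeLines("not a table", file.path(demo_dir, "notes.txt"))

# Keep only files that end in ".csv"
csv_files <- list.files(path = demo_dir, pattern = "\\.csv$")

# Drop the extension to get a title for each table
titles <- tools::file_path_sans_ext(csv_files)

csv_files
titles
```

Escaping the dot and anchoring the pattern with $ keeps files such as notes.txt, or a file merely containing “csv” in its name, out of the list.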


# Review of "The 7 Deadly Sins of Psychology" by Chris Chambers

tl;dr: great book. Read.

The “Seven Sins” is concerned with the validity of psychological research. Can we at all, or to what degree, be certain about the conclusions reached in psychological research? More recently, replication efforts have cast doubt on our confidence in psychological research (1). In a similar vein, a recent paper states that in many research areas, researchers mostly report “successes”, in the sense that they report that their studies confirm their hypotheses, with Psychology leading in the proportion of supported hypotheses (2). Too good to be true? In the light of all this unbehagen, Chambers’ book addresses some of the (possible) roots of the problem of (un)reliability of psychological science. Specifically, Chambers mentions seven “sins” that the psychological research community appears to be guilty of: confirmation bias, data tuning (“hidden flexibility”), disregard of direct replications (and related problems), failure to share data (“data hoarding”), fraud, lack of open access publishing, and fixation on impact factors.

Chambers is not alone in speaking out about some dirty little (or not so little) secrets or tricks of the trade. The discomfort with the status quo is gaining momentum (3, 4, 5, 6); see also the work of psychologists such as J. Wicherts, F. Schönbrodt, D. Bishop, J. Simmons, S. Schwarzkopf, R. Morey, or B. Nosek, to name just a few. For example, the German psychological association (DGPs) recently opened up (more) towards open data (7). However, a substantial number of prominent psychologists oppose the more open approach towards higher validity and legitimacy (8). Thus, Chambers’ book hits a nerve with many psychologists. True, a lot is at stake (9, 10, 11), and a train wreck may have occurred. Chambers’ book knits together the most important aspects of replicability (or reproducibility); it is the first “umbrella book” on that topic, as far as I know. Personally, I feel that only one point would merit some more scrutiny: the unchallenged assumption that psychological constructs are metric (12, 13, 14). Measurement is the bedrock of any empirical science. Without precise measurement, it appears unlikely that any theory will advance. Still, psychologists turn a deaf ear to this issue, sadly. Just assuming that my sum score possesses a metric scale level is not enough (15).

The book is well written, pleasurable to read, suitable for a number of couch evenings (as in my case). Although methodologically sound, as far as I can say, no special statistical knowledge is needed to follow and benefit from the whole exposition.

The last chapter is devoted to solutions (“remedies”); arguably, this is the most important chapter in the book. Again, Chambers succeeds in pulling together the most important trends, concrete ideas and more general, far-reaching avenues. The most important measures to him are a) preregistration of studies, b) judging journals by their replication quota and strengthening the whole replication effort as such, c) open science in general (see Openness Initiative, and TOP guidelines) and d) novel ways of conceiving the job of journals. Well, maybe he is not focusing so much on the last part, but I find that last point quite sensible. One could argue that publishers such as Elsevier managed to suck way too much money out of the system, money that ultimately is paid by the taxpayers, and by the research community. Basically, scientific journals do two things: hosting manuscripts and steering peer review. Remember that journals do not do the peer review; it is provided for free by researchers. As hosting is very cheap nowadays, and peer review comes about without much input from the publishers, why not come up with new, more cost-efficient, and more reliable ways of publishing? One may think that money is not of primary concern for science, truth is. However, science, like most societal endeavors, is based entirely on the trust and confidence of the wider public. Wasting that trust means destroying the funding base. Hence, science cannot afford to waste money, not at all. Among the ideas for updating publishing and journal infrastructure is the idea to use open archives such as arXiv or osf.io as repositories for manuscripts. Peer review can be conducted on these non-paywalled manuscripts (some type of post-publication peer review), for instance organized by universities (5). “Overlay journals” may pick and choose papers from these repositories, organize peer review, and make sure that their peer review and the resulting paper are properly indexed (Google Scholar etc.).

To sum up, the book taps into what is perhaps the most pressing concern in psychological research right now. It succeeds in pulling together the threads that together make up the fabric of the unbehagen in the zeitgeist of contemporary academic psychology. I feel that a lot is at stake. If we as a community fail to secure the legitimacy of academic psychology, the discipline may end up in a way similar to phrenology: once hyped, but then seen by some as pseudoscience, a view that gained popularity and is now commonplace. Let’s work together for a reliable science. Chambers’ book helps to contribute in that regard.

1 Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. http://doi.org/10.1126/science.aac4716

2 Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904. http://doi.org/10.1007/s11192-011-0494-7

3 Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE. http://doi.org/10.1371/journal.pone.0005738

4 Nuzzo, R. (2015). How scientists fool themselves – and how they can stop. Nature, 526(7572), 182–185. http://doi.org/10.1038/526182a

5 Brembs, B., Button, K., & Munafò, M. (2013). Deep impact: unintended consequences of journal rank. Frontiers in Human Neuroscience, 7. http://doi.org/10.3389/fnhum.2013.00291

6 Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., … Zwaan, R. A. (2016). The Peer Reviewers’ Openness Initiative: incentivizing open research practices through peer review. Royal Society Open Science, 3(1), 150547. http://doi.org/10.1098/rsos.150547

7 Schönbrodt, F., Gollwitzer, M., & Abele-Brehm, A. (2017). Der Umgang mit Forschungsdaten im Fach Psychologie: Konkretisierung der DFG-Leitlinien. Psychologische Rundschau, 68(1), 20–25. http://doi.org/10.1026/0033-3042/a000341

8 Longo, D. L., & Drazen, J. M. (2016). Data Sharing. New England Journal of Medicine, 374(3), 276–277. http://doi.org/10.1056/NEJMe1516564

9 LeBel, E. P. (2017). Even With Nuance, Social Psychology Faces its Most Major Crisis in History. Retrieved from https://proveyourselfwrong.wordpress.com/2017/05/26/even-with-nuance-social-psychology-faces-its-most-major-crisis-in-history/.

10 Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., … Wong, K. M. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113(1), 34.

11 Ledgerwood, A. (n.d.). Everything is F*cking Nuanced: The Syllabus (Blog Post). Retrieved from http://incurablynuanced.blogspot.de/2017/04/everything-is-fcking-nuanced-syllabus.html

12 Michell, J. (2005). The logic of measurement: A realist overview. Measurement, 38(4), 285–294. http://doi.org/10.1016/j.measurement.2005.09.004

13 Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88(3), 355–383.

14 Heene, M. (2013). Additive conjoint measurement and the resistance toward falsifiability in psychology. Frontiers in Psychology, 4.

15 Sauer, S. (2016). Why metric scale level cannot be taken for granted (Blog Post). http://doi.org/10.5281/zenodo.571356

# tl;dr

Suppose you want to know which package(s) a given R function belongs to, say filter. Here comes find_funs to help you:

find_funs("filter")

## # A tibble: 4 x 3
##   package_name builtin_pckage loaded
##          <chr>          <lgl>  <lgl>
## 1         base           TRUE   TRUE
## 2        dplyr          FALSE   TRUE
## 3       plotly          FALSE  FALSE
## 4        stats           TRUE   TRUE


This function will search all installed packages for this function name. It will return all the package names that match the function name (i.e., packages which include a function by the respective name). In addition, the function raises a flag as to whether the package is a standard (built-in) package and whether the package is currently loaded/attached.

For convenience this function can be sourced like this:

source("https://sebastiansauer.github.io/Rcode/find_funs.R")


# Use case

Sometimes it is helpful to know in which R package a function ‘resides’. For example, ggplot comes from the package ggplot2, and select is a function that can be located in the package dplyr (among other packages). Especially if a function has a common name, name clashes are bound to occur. For example, I was bitten by filter a couple of times, not recognizing that the function filter that was applied did not come from dplyr as intended but from some other package.

Additionally, sometimes we have in mind ‘oh I should make use of this function filter here’, but cannot remember which package should be loaded for that function.

A number of ways exist to address this question. Our convenience function here takes as its input the name of the function whose residential package we are looking for (that’s the only parameter). The function will then return the one or more packages in which the function was found. In addition, it returns for each package found whether that package comes with standard R (is ‘built-in’). That information can be useful to know whether someone needs to install a package in order to use some function. The function also returns whether the package is currently loaded.
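
One of those other ways, for packages that are already attached, is base R’s find(). The sketch below is meant only as a quick comparison, not as a replacement for find_funs; attached_hits and base_pkgs are made-up names.

```r
# Which attached packages provide a function called "filter"?
attached_hits <- utils::find("filter", mode = "function")
attached_hits

# Which installed packages have priority "base", i.e. ship with R itself?
base_pkgs <- rownames(installed.packages(priority = "base"))
"stats" %in% base_pkgs
```

Unlike find_funs, this approach says nothing about packages that are installed but not attached, which is exactly the case where one cannot remember what to load.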

# Code

find_funs <- function(f) {
  # Returns dataframe with three columns:
  # package_name: package(s) which the function is part of (chr)
  # builtin_pckage: whether the package comes with standard R (a 'builtin' package)
  # loaded: whether the package is currently loaded/attached

  # Arguments:
  # f: name of function for which the package(s) are to be identified.

  if ("tidyverse" %in% rownames(installed.packages()) == FALSE) {
    stop("tidyverse is needed for this function. Please install.")
  }

  suppressMessages(library(tidyverse))

  # search for help in list of installed packages
  help_installed <- help.search(paste0("^",f,"$"), agrep = FALSE)

  # extract package name from help file
  pckg_hits <- help_installed$matches[,"Package"]

  if (length(pckg_hits) == 0) pckg_hits <- "No_results_found"

  # get list of built-in packages

  pckgs <- installed.packages() %>% as_tibble
  pckgs %>%
    dplyr::filter(Priority %in% c("base","recommended")) %>%
    dplyr::select(Package) %>%
    distinct -> builtin_pckgs_df

  # check for each element of 'pckg hit' whether it is built-in and loaded (via match). Then return results.

  results <- tibble(
    package_name = pckg_hits,
    builtin_pckage = match(pckg_hits, builtin_pckgs_df$Package, nomatch = 0) > 0,
    loaded = match(paste("package:",pckg_hits, sep = ""), search(), nomatch = 0) > 0
  )

  return(results)

}


# Example

find_funs("filter")

## # A tibble: 4 x 3
##   package_name builtin_pckage loaded
##          <chr>          <lgl>  <lgl>
## 1         base           TRUE   TRUE
## 2        dplyr          FALSE   TRUE
## 3       plotly          FALSE  FALSE
## 4        stats           TRUE   TRUE


# Convenience access

For convenience this function can be sourced like this:

source("https://sebastiansauer.github.io/Rcode/find_funs.R")


# Notes

tidyverse needs to be installed to run this code. tidyverse is loaded quietly. If no target package is found, the code shown above returns a single row with the package name “No_results_found”.

# Acknowledgements

This function was inspired by code from Ben Bolker’s post on SO.