The one basic idea of dplyr is that each function should focus on one job. That’s why there are no functions such as compute_summaries_by_group_with_robust_variants(df). Rather, summarising and grouping are seen as different jobs that should be accomplished by different functions. That, in turn, is why group_by, the grouping function of dplyr, is of considerable importance: this one function takes care of the grouping for whatever operation follows.

Let’s load all tidyverse libraries in one go:

library(tidyverse)

At first glance, this code

my_data %>% 
  group_by(grouping_var) %>% 
  summarise(n_per_group = n())

can more or less be read in plain English. Simple enough. As a popular example, consider the dataset mtcars:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(n_per_group = n())
## # A tibble: 3 x 2
##     cyl n_per_group
##   <dbl>       <int>
## 1     4          11
## 2     6           7
## 3     8          14

n() simply counts the rows of a (grouped) dataframe. As all columns must, by definition, have the same number of rows, it does not matter which particular column we are looking at. Hence, no column is specified for n().
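As an aside, dplyr also offers tally() as a shorthand for summarise(n = n()); a minimal sketch:

mtcars %>% 
  group_by(cyl) %>% 
  tally()  # same counts as above, with the count column named 'n'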

Slightly more complicated: say we want to count the number of cars grouped by (number of) cylinders and by (type of) transmission. In dplyr, we just add an additional grouping variable:

mtcars %>% 
  group_by(cyl, am) %>%  # 'am' added as second grouping variable
  summarise(n_per_group = n())
## # A tibble: 6 x 3
## # Groups:   cyl [?]
##     cyl    am n_per_group
##   <dbl> <dbl>       <int>
## 1     4     0           3
## 2     4     1           8
## 3     6     0           4
## 4     6     1           3
## 5     8     0          12
## 6     8     1           2

Note that dplyr now tells us about groups: # Groups: cyl [?]. That output tells us that cyl is (still) a grouping variable. The question mark just means that the number of groups is unknown right now; it will only be computed when the next operation is executed.

Note also that only one grouping variable is indicated. But wait! We had specified two grouping variables, right? What happened is that each run of summarise removes one grouping variable. So if we had two grouping variables before we ran summarise, we are left with one grouping variable after the call of summarise. I assume this behavior was implemented because it lets you ‘roll up’ a data frame: get the counts for each cell of the two grouping variables, then sum over the levels of one variable, then sum up again to double-check the total sum. See:

mtcars %>% 
  group_by(cyl, am) %>%  
  summarise(n_per_group = n()) %>% 
  summarise(n_per_group = n())
## # A tibble: 3 x 2
##     cyl n_per_group
##   <dbl>       <int>
## 1     4           2
## 2     6           2
## 3     8           2

This output tells us that there are two rows for each group of cyl (am == 0 and am == 1). Maybe more useful:

mtcars %>% 
  group_by(cyl, am) %>%  
  summarise(n_per_group = n()) %>% 
  summarise(n_per_group = sum(n_per_group))
## # A tibble: 3 x 2
##     cyl n_per_group
##   <dbl>       <int>
## 1     4          11
## 2     6           7
## 3     8          14

What happens if we ‘peel off’ the last layer and sum up the remaining rows?

mtcars %>% 
  group_by(cyl, am) %>%  
  summarise(n_per_group = n()) %>% 
  summarise(n_per_group = sum(n_per_group)) %>% 
  summarise(n_per_group = sum(n_per_group))
## # A tibble: 1 x 1
##   n_per_group
##         <int>
## 1          32

We get the overall number of rows of the whole dataset.

Each summarise peels off one layer of grouping.
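To see this peeling in action, we can inspect the grouping structure directly; a quick sketch using dplyr’s group_vars(), which reports the current grouping variables:

mtcars %>% 
  group_by(cyl, am) %>% 
  group_vars()
## [1] "cyl" "am"

mtcars %>% 
  group_by(cyl, am) %>% 
  summarise(n_per_group = n()) %>% 
  group_vars()
## [1] "cyl"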

Like for many others, Hadley Wickham’s dplyr, and more generally the tidyverse approach, has considerably changed the way I do data analyses. Most notably, the pipe (coming from magrittr by Stefan Milton Bache, see here) has crept into nearly every analysis I do.

That is, every analysis except for functions and other non-interactive code. In those programming contexts, the plain dplyr way does not work, due to its non-standard evaluation, or NSE for short.

So, this post is about programming with dplyr, or more precisely, about the tidyeval approach that has recently been introduced into widely used R libraries.

To understand the use case, consider the following example. Say we count the frequencies of some groups and want to add the proportions of these counts.

library(tidyverse)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(n = n())
## # A tibble: 3 x 2
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14

Or, shorter: mtcars %>% count(cyl).
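For the record, the shorthand yields the same table (the count column is named n by default):

mtcars %>% count(cyl)
## # A tibble: 3 x 2
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14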

Now, let’s add a column with the proportion of the count column (n).

mtcars %>% 
  count(cyl) %>% 
  mutate(prop = n / sum(n))
## # A tibble: 3 x 3
##     cyl     n    prop
##   <dbl> <int>   <dbl>
## 1     4    11 0.34375
## 2     6     7 0.21875
## 3     8    14 0.43750

For extending this game to more than one variable, see this post.

Now, let’s assume we would like to wrap this ‘add a proportion column to my dataframe’ step in a function. How to proceed?

We might think this approach should work:

add_prop <- function(df, count_var, group_var){
  df %>% 
    ungroup() %>% 
    group_by(group_var) %>% 
    mutate(prop =  count_var / sum(n()))
}

mtcars %>% 
  count(cyl) %>% 
  add_prop(count_var = n, group_var = cyl)

However, this does not work.

The dplyr ‘verbs’ quote their input variables themselves; to pass column names into them from our own function, we must quote the parameters first, before we can hand them over to dplyr. quo basically means: “hey R, don’t evaluate this expression yet. Just read it, understand the name of the expression, and wait until I tell you. And R, understand that you got an expression, not a simple text string”. The last sentence tells us (or R) that we do not want a text string, but the quotation (or ‘citation’ of the expression, if you like) of an expression to be evaluated in the future.

mtcars %>% 
  count(cyl) %>% 
  add_prop(count_var = quo(n), group_var = quo(cyl))

The important point is quo.

quo(mtcars$cyl)
## <quosure: global>
## ~mtcars$cyl

But there’s a second step we need to take. Now that quo has defined a quoted expression, the dplyr verbs do not need to quote their input variables, because they are already quoted. That is, we need to tell R now: “Hey, do not quote, evaluate! We have taken care of the quoting before”.

add_prop <- function(df, count_var, group_var){
  df %>% 
    mutate(prop =  (!!count_var) / sum(!!count_var))
}


mtcars %>% 
  count(cyl) %>% 
  add_prop(count_var = quo(n), group_var = quo(cyl)) 
## # A tibble: 3 x 3
##     cyl     n    prop
##   <dbl> <int>   <dbl>
## 1     4    11 0.34375
## 2     6     7 0.21875
## 3     8    14 0.43750

Huzzah!
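As a side note (a sketch, not part of the original workflow above): with dplyr >= 0.7 you can also move the quoting into the function itself via enquo(), so that callers can pass bare column names. The name add_prop2 is just a hypothetical variant:

add_prop2 <- function(df, count_var, group_var){
  count_var <- enquo(count_var)  # quote the bare column name supplied by the caller
  df %>% 
    mutate(prop = (!!count_var) / sum(!!count_var))
}

mtcars %>% 
  count(cyl) %>% 
  add_prop2(count_var = n, group_var = cyl)

This should give the same result as above, without the explicit quo() at the call site.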

For teaching purposes and out of curiosity about some psychometric questions, I have run a survey on extraversion here. The dataset has been published at OSF (DOI 10.17605/OSF.IO/4KGZH). The survey is based on a Google form, which in turn saves the data in a Google spreadsheet. Before the data can be analyzed, some preparation and makeup is in order. This post shows some general makeup, typical for survey data.

Download the data and load packages

Download the data from source (Google spreadsheets); the package gsheet provides an easy interface for that purpose.

library(gsheet)
extra_raw <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1Ln8_0xSJ5teHY2QkwGaYxDLxpcdjOsQ0gIAEZ5az5BY/edit#gid=305064170")
## Warning: Missing column names filled in: 'X23' [23]
library(tidyverse)  # data judo
library(purrr)  # map
library(lsr)  # aad
library(modeest)  # mlv
#devtools::install_github("sebastiansauer/prada")
library(prada)

Prepare variable names

First, save item names in a separate object for later retrieval and for documentation.

extra_names <- names(extra_raw) 
head(extra_names)
## [1] "Zeitstempel"                                                                                                                                                                                       
## [2] "Bitte geben Sie Ihren dreistellen anonymen Code ein (1.: Anfangsbuchstabe des Vornamens Ihres Vaters; 2.: Anfangsbuchstabe des Mädchennamens Ihrer Mutter; 3: Anfangsbuchstabe Ihres Geburstsorts)"
## [3] "Ich bin gerne mit anderen Menschen zusammen."                                                                                                                                                      
## [4] "Ich bin ein Einzelgänger. (-)"                                                                                                                                                                     
## [5] "Ich bin in vielen Vereinen aktiv."                                                                                                                                                                 
## [6] "Ich bin ein gesprächiger und kommunikativer Mensch."

Next, replace the lengthy col names by ‘i’ followed by a number:

extra <- extra_raw
names(extra) <- c("timestamp", "code", paste0("i",1:26))

Then we give some of the variables more meaningful names.

extra <-
  extra %>%
  rename(n_facebook_friends = i11,
         n_hangover = i12,
         age = i13,
         sex = i14,
         extra_single_item = i15,
         time_conversation = i16,
         presentation = i17,
         n_party = i18,
         clients = i19,
         extra_vignette = i20,
         extra_vignette2 = i22,
         major = i23,
         smoker = i24,
         sleep_week = i25,
         sleep_wend = i26)

Add leading zeros to item names

For sorting purposes, it is more helpful if variable names with numbers share the same format, i.e., the same number of leading zeros. So not “i1, i2”, but “i01, i02” (assuming the number of items is not greater than 99).

To get the same number of leading zeros we can use:

i <- 1:10
item_names <- paste0("i", formatC(i, width = 2, flag = "0"))

colnames(extra)[3:12] <- item_names
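A quick check of what formatC produced (a sketch; item_names was just defined above):

item_names
## [1] "i01" "i02" "i03" "i04" "i05" "i06" "i07" "i08" "i09" "i10"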

Parse numbers from chr columns

Some columns actually assess a number, but the respective field in the survey form was liberally open to any characters. So we have to convert the characters to numbers or, more precisely, extract the numbers from the character variables.

extra$n_hangover <- parse_number(extra$n_hangover)
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 2 parsing failures.
## row 132: expected a number, actual "Keinen"
## row 425: expected a number, actual "."
extra$n_facebook_friends <- parse_number(extra$n_facebook_friends)
extra$time_conversation <- parse_number(extra$time_conversation)
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 2 parsing failures.
## row 153: expected a number, actual "Opfer"
## row 633: expected a number, actual "Eine Minute"
extra$n_party <- parse_number(extra$n_party)
## Warning: 1 parsing failure.
## row 270: expected a number, actual "u"

The parsing left the dataframe with some rather ugly attributes, albeit with interesting information. After checking them, however, I feel inclined to delete them.

attributes(extra$n_hangover) <- NULL
attributes(extra$time_conversation) <- NULL
attributes(extra$n_party) <- NULL
attributes(extra$sleep_wend) <- NULL
attr(extra, "spec") <- NULL

Recode items 1: Reverse order

Some extraversion items (variables i02, i06) need to be recoded, i.e., reversed: on the 1-to-4 answer scale, 5 - x maps 1 to 4, 2 to 3, and so on.

extra %>% 
  mutate(i02 = 5 - i02,
         i06 = 5 - i06) %>% 
  rename(i02r = i02,
         i06r = i06) -> extra

Recode items 2: Convert labels to numbers

Typically, item answers are anchored with labels such as “do not agree” through “fully agree” or similar. However, sometimes it is convenient to have such labels in number format. Let’s convert these item labels to numbers.

extra %>% 
  mutate(clients_freq = case_when(
    clients ==  "im Schnitt 1 Mal pro Quartal (oder weniger)" ~ "1",
    clients == "im Schnitt 1 Mal pro Monat" ~ "2",
    clients == "im Schnitt 1 Mal pro Woche" ~ "3",
    clients == "im Schnitt 1 Mal pro Tag" ~ "4",
    clients == "im Schnitt mehrfach pro Tag" ~"5",
    TRUE ~ "NA")) %>% 
  mutate(clients_freq = parse_number(clients_freq)) -> extra

Compute summaries (extraversion score)

Let’s compute the mean, but also the median and the mode, for each person (i.e., row) with regard to the 10 extraversion items.

For that, we’ll use a helper function to compute the mode (most frequent value).

most <- function(x){
  if (!(is.numeric(x))) {
    out <- NA
    return(out)
  }
  x <- stats::na.omit(x)
  t <- base::table(x)
  m <- base::max(t)
  out <- base::as.numeric(base::names(t)[t==m])
  if (base::length(out) > 1) out <- out[1]
  if (base::length(out) == 0) out <- NA
  base::return(out)
}
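# A quick sanity check of the helper on a made-up vector (hypothetical values):
most(c(1, 2, 2, 3))
## [1] 2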
extra %>% 
  rowwise %>% 
  summarise(extra_mean = mean(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10), na.rm = TRUE),
            extra_md = median(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10), na.rm = TRUE),
            extra_aad = aad(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10), na.rm = TRUE),
            extra_mode = most(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10)),
            extra_iqr = IQR(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10))) -> extra_scores
## Warning in base::max(t): no non-missing arguments to max; returning -Inf

## Warning in base::max(t): no non-missing arguments to max; returning -Inf

## Warning in base::max(t): no non-missing arguments to max; returning -Inf

## Warning in base::max(t): no non-missing arguments to max; returning -Inf

This approach works, but it involves a lot of duplicated typing. Better to hand summarise a quoted expression and unquote it where needed:

First, we define an expression; that is to say, we want R to quote, rather than evaluate, the expression. This can be achieved using quo:

extra_items <- quo(c(i01, i02r, i03, i04, i05, i06r, i07, i08, i09, i10))

Then we hand over the quoted expression (defined by quo) to mean and friends, unquoting it on the way. Unquoting (as of dplyr >= 0.7) works with the !! operator.

extra %>% 
  rowwise %>% 
  summarise(extra_mean = mean(!!extra_items),
            extra_md = median(!!extra_items),
            extra_aad = lsr::aad(!!extra_items),
            extra_mode = most(!!extra_items),
            extra_iqr = IQR(!!extra_items)) -> extra_scores
## Warning in base::max(t): no non-missing arguments to max; returning -Inf

## Warning in base::max(t): no non-missing arguments to max; returning -Inf

## Warning in base::max(t): no non-missing arguments to max; returning -Inf

## Warning in base::max(t): no non-missing arguments to max; returning -Inf
extra_scores %>% head
## # A tibble: 6 x 5
##   extra_mean extra_md extra_aad extra_mode extra_iqr
##        <dbl>    <dbl>     <dbl>      <dbl>     <dbl>
## 1        2.9      3.0      0.56          3      0.00
## 2        2.1      2.0      0.54          2      0.75
## 3        2.6      3.0      1.08          1      2.50
## 4        2.9      3.0      0.36          3      0.00
## 5        3.2      3.5      0.80          4      1.00
## 6        2.8      3.0      0.68          3      0.75

Neat! Now let’s bind that to the main df.

extra %>% 
  bind_cols(extra_scores) -> extra

Done! Enjoy the freshly juiced data frame :sunglasses:

tl;dr

Use this convenience function to print a dataframe as a png-plot: tab2grob().

Source the function here: https://sebastiansauer.github.io/Rcode/tab2grob.R

Easiest way in R:

source("https://sebastiansauer.github.io/Rcode/tab2grob.R")

Printing csv-dataframes as ggplot plots

Recently, I wanted to print dataframes not as normal tables, but as png-plots.

Why? Well, basically as a convenience for colleagues who are not into using Markdown & friends. As I am preparing some stats material (see my new open access course material here) using RMarkdown, I wanted to get the materials ready for use in PowerPoint.

I found out that tables are difficult to copy and paste into PowerPoint. That’s why I thought it may help to print the tables as plots.

So I came up with a function that does this job:

  1. Scan a folder for all csv files
  2. Parse each csv file
  3. Print it as a plot
  4. Save it as a png

R code

tab2plot <- function(path = ".") {

  # Print csv-dataframes as plots using ggplot2 - for easier handling in Powerpoint & friends
  # Arguments:
  #   path: directory containing the csv files; defaults to "." (working directory)
  # Value:
  #   None. Saves each csv file as a png file of the table.
  #   Note: the output folder "images/Tabellen/" must already exist.

  library(tidyverse)
  library(stringr)
  library(gridExtra)
  library(grid)

  df <- data_frame(
    file_name = list.files(path = path, pattern = "\\w+\\.csv$"),
    title = str_extract(file_name, pattern = "\\w+")
  )

  # left-align the cell contents of the table grobs
  tt <- ttheme_default(core = list(fg_params = list(hjust = 0, x = 0.1)))

  for (i in seq_along(df$file_name)) {
    cat(paste0(df$file_name[i], "\n"))
    csv_i <- read.csv(file.path(path, df$file_name[i]))
    csv_i %>%
      rownames_to_column %>%
      dplyr::select(-rowname) -> csv_i  # drop row names so they do not show up in the table
    p <- arrangeGrob(tableGrob(csv_i, theme = tt))
    ggsave(file = paste0("images/Tabellen/", "Tabelle_", df$title[i], ".png"), p)
  }

}
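A hypothetical call, assuming the csv files live in a subfolder named “includes” and the output folder images/Tabellen/ already exists (as hard-coded above):

tab2plot(path = "includes")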

tl;dr: great book. Read it.

The “Seven Sins” is concerned with the validity of psychological research. Can we at all, or to what degree, be certain about the conclusions reached in psychological research? Recently, replication efforts have cast doubt on our confidence in psychological research (1). In a similar vein, a recent paper states that in many research areas researchers mostly report “successes”, in the sense that their studies confirm their hypotheses - with psychology leading in the proportion of supported hypotheses (2). Too good to be true? In the light of all this unbehagen, Chambers’ book addresses some of the (possible) roots of the problem of the (un)reliability of psychological science. More precisely, Chambers names seven “sins” that the psychological research community appears to be guilty of: confirmation bias, data tuning (“hidden flexibility”), disregard of direct replications (and related problems), failure to share data (“data hoarding”), fraud, lack of open access publishing, and fixation on impact factors.

Chambers is not alone in speaking out about some dirty little (or not so little) secrets and tricks of the trade. The discomfort with the status quo is gaining momentum (3, 4, 5, 6); see also the work of psychologists such as J. Wicherts, F. Schönbrodt, D. Bishop, J. Simmons, S. Schwarzkopf, R. Morey, or B. Nosek, to name just a few. For example, the German psychological association (DGPs) recently opened up (more) towards open data (7). However, a substantial number of prominent psychologists oppose the more open approach towards higher validity and legitimacy (8). Thus, Chambers’ book hits a nerve with many psychologists. True, a lot is at stake (9, 10, 11), and a train wreck may already have arrived. Chambers’ book knits together the most important aspects of the replicability (or reproducibility) debate; it is the first “umbrella book” on that topic, as far as I know. Personally, I feel that only one point would merit more scrutiny: the unchallenged assumption that psychological constructs are metric (12, 13, 14). Measurement is the very bedrock of any empirical science. Without precise measurement, it appears unlikely that any theory will advance. Still, psychologists sadly turn a deaf ear to this issue. Just assuming that my sum score possesses a metric level of measurement is not enough (15).

The book is well written, pleasurable to read, and suitable for a number of couch evenings (as in my case). Although methodologically sound, as far as I can tell, it requires no special statistical knowledge to follow and benefit from the whole exposition.

The last chapter is devoted to solutions (“remedies”); arguably, this is the most important chapter in the book. Again, Chambers pulls together the most important trends, concrete ideas, and more general, far-reaching avenues. To him, the most important measures are a) preregistration of studies, b) judging journals by their replication quota and strengthening the whole replication effort as such, c) open science in general (see the Peer Reviewers’ Openness Initiative and the TOP guidelines), and d) novel ways of conceiving the job of journals. Well, maybe he is not so much focusing on the last part, but I find that last point quite sensible. One could argue that publishers such as Elsevier have managed to suck way too much money out of the system, money that is ultimately paid by taxpayers and by the research community. Basically, scientific journals do two things: host manuscripts and steer peer review. Remember that journals do not do the peer review; it is provided for free by researchers. As hosting is very cheap nowadays, and peer review comes without much input from the publishers, why not come up with new, more cost-efficient, and more reliable ways of publishing? One may think that money is not of primary concern for science; truth is. However, science, like most societal endeavors, is based entirely on the trust and confidence of the wider public. Wasting that trust means destroying the funding base. Hence, science cannot afford to waste money, not at all. Among the ideas for updating publishing and journal infrastructure is the idea to use open archives such as arXiv or osf.io as repositories for manuscripts. Peer review can be conducted on these non-paywalled manuscripts (some type of post-publication peer review), for instance organized by universities (5). “Overlay journals” may pick and choose papers from these repositories, organize peer review, and make sure the resulting papers are properly indexed (Google Scholar etc.).

To sum up, the book taps into what is perhaps the most pressing concern in psychological research right now. It succeeds in pulling together the threads that make up the fabric of the unbehagen in the zeitgeist of contemporary academic psychology. I feel that a lot is at stake. If we as a community fail to secure the legitimacy of academic psychology, the discipline may end up in a way similar to phrenology: once hyped, but then seen by some as pseudoscience, a view that gained popularity and is now commonplace. Let’s work together for a reliable science. Chambers’ book is a contribution in that regard.

1 Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. http://doi.org/10.1126/science.aac4716

2 Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904. http://doi.org/10.1007/s11192-011-0494-7

3 Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE. http://doi.org/10.1371/journal.pone.0005738

4 Nuzzo, R. (2015). How scientists fool themselves – and how they can stop. Nature, 526(7572), 182–185. http://doi.org/10.1038/526182a

5 Brembs, B., Button, K., & Munafò, M. (2013). Deep impact: unintended consequences of journal rank. Frontiers in Human Neuroscience, 7. http://doi.org/10.3389/fnhum.2013.00291

6 Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., … Zwaan, R. A. (2016). The Peer Reviewers’ Openness Initiative: incentivizing open research practices through peer review. Royal Society Open Science, 3(1), 150547. http://doi.org/10.1098/rsos.150547

7 Schönbrodt, F., Gollwitzer, M., & Abele-Brehm, A. (2017). Der Umgang mit Forschungsdaten im Fach Psychologie: Konkretisierung der DFG-Leitlinien. Psychologische Rundschau, 68(1), 20–25. http://doi.org/10.1026/0033-3042/a000341

8 Longo, D. L., & Drazen, J. M. (2016). Data Sharing. New England Journal of Medicine, 374(3), 276–277. http://doi.org/10.1056/NEJMe1516564

9 LeBel, E. P. (2017). Even With Nuance, Social Psychology Faces its Most Major Crisis in History. Retrieved from https://proveyourselfwrong.wordpress.com/2017/05/26/even-with-nuance-social-psychology-faces-its-most-major-crisis-in-history/.

10 Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., … Wong, K. M. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113(1), 34.

11 Ledgerwood, A. (n.d.). Everything is F*cking Nuanced: The Syllabus (Blog Post). Retrieved from http://incurablynuanced.blogspot.de/2017/04/everything-is-fcking-nuanced-syllabus.html

12 Michell, J. (2005). The logic of measurement: A realist overview. Measurement, 38(4), 285–294. http://doi.org/10.1016/j.measurement.2005.09.004

13 Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88(3), 355–383.

14 Heene, M. (2013). Additive conjoint measurement and the resistance toward falsifiability in psychology. Frontiers in Psychology, 4.

15 Sauer, S. (2016). Why metric scale level cannot be taken for granted (Blog Post). http://doi.org/10.5281/zenodo.571356