tl;dr

Use the convenience function below, tab2plot(), to print a dataframe as a png plot.

Source the function here: https://sebastiansauer.github.io/Rcode/tab2grob.R

Easiest way in R:

source("https://sebastiansauer.github.io/Rcode/tab2grob.R")

Printing csv-dataframes as ggplot plots

Recently, I wanted to print dataframes not as normal tables, but as png plots.

Why? Well, basically as a convenience for colleagues who are not into using Markdown & friends. As I am preparing some stats material (see my new open access course material here) using RMarkdown, I wanted the materials to be ready for use in PowerPoint.

I found out that tables are difficult to copy-paste into PowerPoint. That’s why I thought it might help to print the tables as plots.

So I came up with a function that does this job:

  1. Scan a folder for all csv files
  2. Parse each csv file
  3. Print it as a plot
  4. Save it as a png file

R code

tab2plot <- function(path = "") {

  # Print csv-dataframes as plots using ggplot2/gridExtra - for easier handling in PowerPoint & friends
  # Arguments:
  #   path: are the csv files in a subdirectory? Then specify it here (with trailing slash).
  #         Defaults to "" (working directory).
  # Value:
  #   None. Saves a png file of the table for each csv file found.

  library(tidyverse)
  library(stringr)
  library(gridExtra)
  library(grid)

  # collect all csv files in `path`, plus a title derived from the file name
  df <- data_frame(
    file_name = list.files(path = path, pattern = "\\w+\\.csv"),
    title = str_extract(file_name, pattern = "\\w+")
  )

  # left-align the cell contents of the table
  tt <- ttheme_default(core = list(fg_params = list(hjust = 0, x = 0.1)))


  for (i in seq_along(df$file_name)) {
    cat(paste0(df$file_name[i], "\n"))
    csv_i <- read.csv(paste0(path, df$file_name[i]))
    # strip the original row names (they are added as a column and then dropped)
    csv_i %>%
      rownames_to_column %>%
      dplyr::select(-rowname) -> csv_i
    # render the dataframe as a table grob and save it as a png file
    p <- arrangeGrob(tableGrob(csv_i, theme = tt))
    ggsave(file = paste0("images/Tabellen/", "Tabelle_", df$title[i], ".png"),  p)
  }

}
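
A minimal usage sketch (the folder name is just an example; note that the function writes its png files into images/Tabellen/, so that folder needs to exist):

# Hypothetical example: the csv files live in a subfolder "includes/"
# (note the trailing slash, since the function pastes path and file name together).
# The output folder "images/Tabellen/" must already exist.
tab2plot(path = "includes/")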

tl;dr: great book. Read.

The “Seven Sins” is concerned with the validity of psychological research. Can we at all, or to what degree, be certain about the conclusions reached in psychological research? More recently, replication efforts have cast doubt on our confidence in psychological research (1). In a similar vein, a recent paper states that in many research areas researchers mostly report “successes”, in the sense that they report that their studies confirm their hypotheses - with psychology leading in the proportion of supported hypotheses (2). Too good to be true? In the light of all this unease, Chambers’ book addresses some of the (possible) roots of the problem of the (un)reliability of psychological science. Specifically, Chambers identifies seven “sins” that the psychological research community appears to be guilty of: confirmation bias, data tuning (“hidden flexibility”), disregard of direct replications (and related problems), failure to share data (“data hoarding”), fraud, lack of open access publishing, and fixation on impact factors.

Chambers is not alone in speaking out about some dirty little (or not so little) secrets and tricks of the trade. The discomfort with the status quo is gaining momentum (3, 4, 5, 6); see also the work of psychologists such as J. Wicherts, F. Schönbrodt, D. Bishop, J. Simmons, S. Schwarzkopf, R. Morey, or B. Nosek, to name just a few. For example, the German psychological association (DGPs) recently opened up (more) towards open data (7). However, a substantial number of prominent psychologists oppose the more open approach towards higher validity and legitimacy (8). Thus, Chambers’ book hits a nerve with many psychologists. True, a lot is at stake (9, 10, 11), and a train wreck may be looming. Chambers’ book knits together the most important aspects of the replicability (or reproducibility) debate; the first “umbrella book” on that topic, as far as I know. Personally, I feel that one point would merit more scrutiny: the unchallenged assumption that psychological constructs are metric (12, 13, 14). Measurement is the very bedrock of any empirical science. Without precise measurement, it appears unlikely that any theory will advance. Still, psychologists sadly turn a deaf ear to this issue. Just assuming that my sum score possesses a metric level of measurement is not enough (15).

The book is well written, pleasurable to read, suitable for a number of couch evenings (as in my case). Although methodologically sound, as far as I can say, no special statistical knowledge is needed to follow and benefit from the whole exposition.

The last chapter is devoted to solutions (“remedies”); arguably, this is the most important chapter in the book. Again, Chambers succeeds in pulling together the most important trends, concrete ideas and more general, far-reaching avenues. To him, the most important measures are a) preregistration of studies, b) judging journals by their replication quota and strengthening the replication effort as such, c) open science in general (see the Peer Reviewers' Openness Initiative and the TOP Guidelines) and d) novel ways of conceiving the job of journals. Well, maybe he does not focus so much on the last part, but I find that last point quite sensible. One could argue that publishers such as Elsevier have managed to suck way too much money out of the system, money that is ultimately paid by the tax payers and by the research community. Basically, scientific journals do two things: host manuscripts and steer peer review. Remember that journals do not do the peer review themselves; it is provided for free by researchers. As hosting is very cheap nowadays, and peer review is contributed without much input from the publishers, why not come up with new, more cost-efficient, and more reliable ways of publishing? One may think that money is not of primary concern for science, but it is. Science, like most societal endeavors, rests entirely on the trust and confidence of the wider public. Wasting that trust means destroying the funding base. Hence, science cannot afford to waste money, not at all. Among the ideas for updating publishing and journal infrastructure is the proposal to use open archives such as arXiv or osf.io as repositories for manuscripts. Peer review can then be conducted on these non-paywalled manuscripts (a type of post-publication peer review), for instance organized by universities (5). “Overlay journals” may pick and choose papers from these repositories, organize peer review, and make sure the peer review and the resulting paper are properly indexed (Google Scholar etc.).

To sum up, the book taps into what is perhaps the most pressing concern in psychological research right now. It succeeds in pulling together the threads that make up the fabric of unease in the zeitgeist of contemporary academic psychology. I feel that a lot is at stake. If we as a community fail to secure the legitimacy of academic psychology, the discipline may end up in a way similar to phrenology: once hyped, then seen by some as pseudoscience, a view that gained popularity and is now commonplace. Let’s work together for a reliable science. Chambers’ book is a valuable contribution in that regard.

1 Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. http://doi.org/10.1126/science.aac4716

2 Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904. http://doi.org/10.1007/s11192-011-0494-7

3 Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE. http://doi.org/10.1371/journal.pone.0005738

4 Nuzzo, R. (2015). How scientists fool themselves – and how they can stop. Nature, 526(7572), 182–185. http://doi.org/10.1038/526182a

5 Brembs, B., Button, K., & Munafò, M. (2013). Deep impact: unintended consequences of journal rank. Frontiers in Human Neuroscience, 7. http://doi.org/10.3389/fnhum.2013.00291

6 Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., … Zwaan, R. A. (2016). The Peer Reviewers’ Openness Initiative: incentivizing open research practices through peer review. Royal Society Open Science, 3(1), 150547. http://doi.org/10.1098/rsos.150547

7 Schönbrodt, F., Gollwitzer, M., & Abele-Brehm, A. (2017). Der Umgang mit Forschungsdaten im Fach Psychologie: Konkretisierung der DFG-Leitlinien. Psychologische Rundschau, 68(1), 20–25. http://doi.org/10.1026/0033-3042/a000341

8 Longo, D. L., & Drazen, J. M. (2016). Data Sharing. New England Journal of Medicine, 374(3), 276–277. http://doi.org/10.1056/NEJMe1516564

9 LeBel, E. P. (2017). Even With Nuance, Social Psychology Faces its Most Major Crisis in History. Retrieved from https://proveyourselfwrong.wordpress.com/2017/05/26/even-with-nuance-social-psychology-faces-its-most-major-crisis-in-history/.

10 Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., … Wong, K. M. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113(1), 34.

11 Ledgerwood, A. (n.d.). Everything is F*cking Nuanced: The Syllabus (Blog Post). Retrieved from http://incurablynuanced.blogspot.de/2017/04/everything-is-fcking-nuanced-syllabus.html

12 Michell, J. (2005). The logic of measurement: A realist overview. Measurement, 38(4), 285–294. http://doi.org/10.1016/j.measurement.2005.09.004

13 Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88(3), 355–383.

14 Heene, M. (2013). Additive conjoint measurement and the resistance toward falsifiability in psychology. Frontiers in Psychology, 4.

15 Sauer, S. (2016). Why metric scale level cannot be taken for granted (Blog Post). http://doi.org/10.5281/zenodo.571356

tl;dr

Suppose you want to know which package(s) a given R function belongs to, say, filter. Here comes find_funs() to help you:

find_funs("filter")
## # A tibble: 4 x 3
##   package_name builtin_pckage loaded
##          <chr>          <lgl>  <lgl>
## 1         base           TRUE   TRUE
## 2        dplyr          FALSE   TRUE
## 3       plotly          FALSE  FALSE
## 4        stats           TRUE   TRUE

This function will search all installed packages for this function name. It will return all the package names that match the function name (i.e., packages which include a function of that name). In addition, the function raises a flag as to whether the package is a standard (built-in) package and whether the package is currently loaded/attached.

For convenience this function can be sourced like this:

source("https://sebastiansauer.github.io/Rcode/find_funs.R")

Use case

Sometimes it is helpful to know in which R package a function ‘resides’. For example, ggplot comes from the package ggplot2, and select is a function that can be located in the package dplyr (among other packages). Especially if a function has a common name, name clashes are bound to happen. For example, I was bitten by filter a couple of times - not recognizing that the filter function that was applied did not come from dplyr, as intended, but from some other package.
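
As a quick illustration of such a clash (a sketch; stats is always attached, and dplyr masks stats::filter once it is loaded):

library(dplyr)

# stats::filter() is a time-series filter; dplyr::filter() subsets rows.
# After attaching dplyr, the bare name `filter` refers to dplyr's version:
filter(mtcars, cyl == 6)

# To be explicit about which function is meant, qualify the call:
dplyr::filter(mtcars, cyl == 6)
stats::filter(1:10, rep(1/3, 3))  # moving-average filter on a numeric vector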

Additionally, sometimes we have in mind ‘oh, I should make use of the function filter here’, but cannot remember which package needs to be loaded for that function.

A number of ways exist to address this question. Our convenience function takes the name of the function whose home package(s) we are looking for as its input (that’s the only parameter). The function then returns the one or more packages in which the function was found. In addition, it returns for each package found whether that package comes with standard R (is ‘built-in’). That information tells us whether a package needs to be installed before the function can be used. The function also returns whether each package is currently loaded/attached.

Code

find_funs <- function(f) {
  # Returns a dataframe with three columns:
    # `package_name`: package(s) which the function is part of (chr)
    # `builtin_pckage`: whether the package comes with standard R (a 'builtin' package) (lgl)
    # `loaded`: whether the package is currently loaded/attached (lgl)

  # Arguments:
    # f: name of the function for which the package(s) are to be identified.


  if (!("tidyverse" %in% rownames(installed.packages()))) {
    stop("tidyverse is needed for this function. Please install it.")
  }

  suppressMessages(library(tidyverse))


  # search for help on f in the list of installed packages (exact match only)
  help_installed <- help.search(paste0("^", f, "$"), agrep = FALSE)

  # extract the package names from the help results
  pckg_hits <- help_installed$matches[, "Package"]

  if (length(pckg_hits) == 0) pckg_hits <- "No_results_found"


  # get the list of built-in (base and recommended) packages

  pckgs <- installed.packages()  %>% as_tibble
  pckgs %>%
    dplyr::filter(Priority %in% c("base", "recommended")) %>%
    dplyr::select(Package) %>%
    distinct -> builtin_pckgs_df

  # check for each element of 'pckg_hits' whether it is built-in and loaded (via match), then return the results

  results <- data_frame(
    package_name = pckg_hits,
    builtin_pckage = match(pckg_hits, builtin_pckgs_df$Package, nomatch = 0) > 0,
    loaded = match(paste("package:", pckg_hits, sep = ""), search(), nomatch = 0) > 0
  )

  return(results)

}

Example

find_funs("filter")
## # A tibble: 4 x 3
##   package_name builtin_pckage loaded
##          <chr>          <lgl>  <lgl>
## 1         base           TRUE   TRUE
## 2        dplyr          FALSE   TRUE
## 3       plotly          FALSE  FALSE
## 4        stats           TRUE   TRUE

Convenience access

For convenience this function can be sourced like this:

source("https://sebastiansauer.github.io/Rcode/find_funs.R")

Notes

tidyverse needs to be installed to run this code. tidyverse is loaded quietly. If no package containing the target function is found, the function returns a one-row dataframe with the entry “No_results_found”.

Acknowledgements

This function was inspired by code from Ben Bolker’s post on SO.

Some time ago, I posted about how to plot frequencies using ggplot2. One point that remained untouched was how to sort the order of the bars. Let’s look at that issue here.

First, let’s load some data.

data(tips, package = "reshape2")

And the usual culprits.

library(tidyverse)
library(scales)  # for percentage scales

First, let’s plot a standard plot, with bars unsorted.

tips %>% 
  count(day) %>% 
  mutate(perc = n / nrow(tips)) -> tips2

ggplot(tips2, aes(x = day, y = perc)) + geom_bar(stat = "identity")

[Figure: bar chart of perc by day, bars unsorted]

Hang on, what could ‘unsorted’ possibly mean? There must be some rule by which ggplot2 determines the order.

And the rule is:

  • if the variable is a factor, the order of the factor levels is used
  • if it is a character vector, alphabetical order is used (see the quick check below)
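
A small sketch to see which case applies here (day in the tips data is stored as a factor, so the level order drives the bar order):

# Check whether `day` is a factor or a character vector, and inspect its level order
class(tips2$day)
levels(tips2$day)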

Sorting bars by factor ordering

Although it appears common not to like factors, this is a situation in which they are useful. Factors provide an easy way of sorting, see:

tips2$day <- factor(tips2$day,levels = c("Fri", "Sat", "Sun", "Thur"))

Now let’s plot again:

ggplot(tips2, aes(x = day, y = perc)) + geom_bar(stat = "identity")

[Figure: bar chart of perc by day, bars ordered by the factor levels]

Sorting bars by some numeric variable

Often, we do not want just some ordering; we want to order by frequency, with the most frequent bar coming first. This can be achieved in the following way.

ggplot(tips2, aes(x = reorder(day, -perc), y = perc)) + geom_bar(stat = "identity")

[Figure: bar chart of perc by day, bars ordered by frequency, descending]

Note the minus sign -; leaving it out will order the bars the other way around (from low to high).
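
Alternatively, here is a sketch using forcats (part of the tidyverse) that reorders the factor itself before plotting; fct_reorder() with .desc = TRUE gives the same descending order:

library(forcats)

tips2 %>% 
  mutate(day = fct_reorder(day, perc, .desc = TRUE)) %>%  # reorder levels by perc, largest first
  ggplot(aes(x = day, y = perc)) +
  geom_col()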

Happy plotting!

Edit: This post was updated, including two errors fixed - thanks to (private) comments from Norman Markgraf.

z-values, i.e., values coming from a z-transformation, are frequent creatures in statistics land. Among their properties are the following:

  1. Their mean is zero.
  2. Their variance is one (and hence their sd is one, too).

But why is that? How come these two properties hold? The goal of this post is to shed light on these two properties of z-values.

Mean value of z-values is zero

There are a number of ways to explain this fact.

One is that it is a feature of the mean that the sum of the differences from the mean is zero. A z-value tells nothing but the difference to the mean (in relation to the SD of the distribution). Hence, once you realize that the mean of z-values is nothing but the mean of such (scaled) differences, you will see that the mean of a distribution/sample in z-values equals zero.

Intuition

Look at the following code and diagram for an intuition:

library(tidyverse)

mtcars %>% 
  select(cyl) %>% 
  slice(1:10) %>% 
  mutate(cyl_mean = mean(cyl),
         diff = cyl - cyl_mean) %>% 
  summarise(sum(diff))
##      sum(diff)
## 1 1.776357e-15

So our sum of the diff values is (practically) zero.

mtcars %>% 
  rownames_to_column %>% 
  select(cyl, rowname) %>% 
  slice(1:10) %>% 
  mutate(cyl_mean = mean(cyl),
         diff = cyl - cyl_mean,
         sign_diff = sign(diff)) %>% 
  ggplot() +
  aes(x = rowname, y = diff) +
  geom_col(aes(fill = factor(sign_diff))) +
  coord_flip() +
  guides(fill = FALSE)

[Figure: bar chart of the differences from the mean for the first 10 cyl values, colored by sign]

The diagram above is meant to give an impression that the negative and positive differences to the mean “cancel out”, i.e., sum to zero. Put plainly: First, concatenate the red bars. Then, concatenate the magenta bars. You will find that the total red bar and the total magenta bar are of the same length.

One further visualization:

cyl_mean <- mean(mtcars[1:10, "cyl"])

mtcars %>% 
  rownames_to_column %>% 
  select(cyl, rowname) %>% 
  slice(1:10) %>% 
  mutate(cyl_mean = mean(cyl),
         diff = cyl - cyl_mean,
         sign_diff = sign(diff),
         ID = 1:10) %>% 
  ggplot() +
  aes(x = ID, y = cyl) +
  geom_col(aes(fill = factor(sign_diff))) +
  coord_flip() +
  guides(fill = FALSE) +
  geom_hline(yintercept = cyl_mean) +
  geom_rect(aes(xmin = ID-.45, xmax = ID+.45,
                ymin = cyl_mean,
                ymax = cyl_mean + diff
               ),
            alpha = .7) +
  scale_x_continuous(breaks = 1:10)

[Figure: bar chart of the first 10 cyl values with the differences from the mean drawn as rectangles starting at the vertical mean line]

Saying that the differences of the values from the mean sum to zero amounts to saying that the “negative bars” (starting at the vertical mean line at about 5.8) have the same total length, when concatenated, as the “positive bars” (again starting at the vertical line), concatenated.

More formally

More formally, let $X$ be some numeric variable with values $x_1, \dots, x_n$, and let $\bar{X}$ be the respective arithmetic mean; then $\sum_{i=1}^n (x_i - \bar{X}) = 0$.

But why does this equation hold? And how does this apply to z-values?

Let’s look at some algebra. The mean of the z-values is

$$\bar{Z} = \frac{1}{n}\sum_{i=1}^n z_i = \frac{1}{n}\sum_{i=1}^n \frac{x_i - \bar{X}}{sd(X)} = \frac{1}{n \cdot sd(X)} \sum_{i=1}^n (x_i - \bar{X}).$$

But the sum of the differences from the mean, $\sum_{i=1}^n (x_i - \bar{X})$, is zero. Hence, the whole term is zero. That’s why (that’s one way to explain why) the mean of z-values is zero.

But… why is the sum of the differences from the mean equal to zero? Because $\sum_{i=1}^n (x_i - \bar{X}) = \sum_{i=1}^n x_i - n\bar{X} = \sum_{i=1}^n x_i - n \cdot \frac{1}{n}\sum_{i=1}^n x_i = 0$.
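
A quick numerical check in R (scale() performs the z-transformation):

z <- as.numeric(scale(mtcars$cyl))  # z-transform: subtract the mean, divide by the sd
mean(z)                             # practically zero (up to floating point error)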

SD of z-values is one

Ok, maybe the mean value of z-values is zero. But why is the SD or the variance equal to one?

If the variance is one, we will agree that the sd is one, too, because the square root of 1 is 1.

Intuition

Well, suppose you take all the differences from the mean and divide them by $sd(X)$. Let’s call the new differences $d_i$. Not very surprisingly, the SD of the $d_i$ will also change accordingly: it gets multiplied by $\frac{1}{sd(X)}$, i.e., it is divided by the factor $sd(X)$. And that yields 1: $\frac{sd(X)}{sd(X)} = 1$.

More formally

Let $V(X)$ be the variance of some variable $X$; the remaining details are as above. For the z-values $z_i = \frac{x_i - \bar{X}}{sd(X)}$ we have

$$V(Z) = \frac{1}{n}\sum_{i=1}^n (z_i - \bar{Z})^2.$$

But $\bar{Z} = 0$, as discussed above. Thus, the equation is shortened to

$$V(Z) = \frac{1}{n}\sum_{i=1}^n z_i^2.$$

Now we replace $z_i$ by its definition:

$$V(Z) = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - \bar{X}}{sd(X)}\right)^2.$$

Rearranging gives

$$V(Z) = \frac{1}{sd(X)^2} \cdot \frac{1}{n}\sum_{i=1}^n (x_i - \bar{X})^2,$$

which can be simplified to

$$V(Z) = \frac{V(X)}{V(X)} = 1.$$

Thus, we see that the variance of z-values equals 1.
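
Again, a quick numerical check (note that scale() and var() both use the n - 1 denominator, so the identity holds exactly):

z <- as.numeric(scale(mtcars$cyl))  # scale() returns a matrix; coerce to a plain vector
var(z)  # 1
sd(z)   # 1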


Similarly, picking up the idea from the intuition above, note that

$$V(aX) = a^2\,V(X).$$

In other words, if we multiply our values (the $x_i$) by some factor $a$, the resulting variance will increase by the factor $a^2$. In case the mean value is zero (as it is for z-values), we may say: “if we multiply our differences $d_i$ by some factor $a$, the variance will increase by the factor $a^2$”. Taking the square root of the variance, we are left with the SD changed by the factor $a$. Here $a = \frac{1}{sd(X)}$, so the variance changes by the factor $\frac{1}{V(X)}$ and the SD by the factor $\frac{1}{sd(X)}$, which again yields a variance and an SD of 1 for the z-values.
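
A last sketch to check the scaling rule $V(aX) = a^2\,V(X)$ numerically:

x <- mtcars$cyl
a <- 3
var(a * x) / var(x)  # 9, i.e. a^2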