In data analysis, we often ask “Do these two groups differ in the outcome variable?” Behind this question there is often a tacit assumption that the grouping variable causes the difference in the outcome variable. For example, assume the two groups are “treatment group” and “control group”, and the outcome variable is “pain reduction”.

A typical approach would be to report the strength of the difference by means of Cohen’s d. Even better (probably, though this view is not undisputed) is to report confidence intervals for d.

Although this approach is widely used, I think it is not ideal. One reason is that we (may) have tacitly changed our hypothesis. We are no longer asking “Do individuals in the two groups differ?” but rather “Do the mean values differ?”, which is a different question. Although we are used to this approach, there are problems: Often, our theories do not state that, e.g., some pill will change the mean of a group. Rather, the theory will (quite rightly, in principle) suggest that some chemical compound in an organism will change some molecules, thereby some cell metabolism products, and finally bring about some biological and/or psychological change (for the better). This is a completely different theory. And this latter theory (potentially) makes sense from a biological and logical perspective. The former does not make any sense. How can some chemical affect group averages, some “average organism” which does not exist in reality? If we agree with that notion, we would be forced, I believe, not to compare means.

But what can we do instead? In fact, some new methods have been proposed; for example, James Grice developed what he has dubbed Observation Oriented Modeling. But maybe an easy first-step alternative could be the “common language effect size” (CLES). CLES simply states “given the observed effect, how likely is it that, if we draw two individuals, one from each of the two groups, the individual from the experimental group will have a higher value in the variable of interest?” (see here).

The value of CLES is observational, not a hypothetical parameter. It is quite straightforward man-on-the-street logic. For example: “Draw 100 pairs from the two groups; in 83 cases the experimental group person will have a higher value”. That makes sense from an observational or individual-organism point of view.
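To make the “draw pairs” logic concrete, here is a minimal base R sketch on simulated data. The group names, sample sizes, and means are assumptions chosen for illustration only; the sketch compares all possible pairs directly and sets that next to the value obtained from Cohen’s d under a normality assumption. It is not necessarily the exact computation used by any particular package.

```r
# Simulated example data (assumption: normally distributed outcomes)
set.seed(42)
treatment <- rnorm(200, mean = 1, sd = 1)
control   <- rnorm(200, mean = 0, sd = 1)

# Empirical CLES: share of all treatment/control pairs in which
# the treatment individual has the higher value
cles_empirical <- mean(outer(treatment, control, ">"))

# CLES derived from Cohen's d, assuming normality and equal variances
d <- (mean(treatment) - mean(control)) /
  sqrt((var(treatment) + var(control)) / 2)
cles_normal <- pnorm(d / sqrt(2))

round(c(empirical = cles_empirical, from_d = cles_normal), 2)
```

Both numbers should land close to each other here, because the simulated data meet the normality assumption; with skewed real data they may diverge.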

Practically, there are R packages out there which help us compute CLES (or at least one package). The details of the computation are beyond the scope of this article; see here for mathematical details.

Rather, we will build a nice plot for displaying a number of CLES differences between two groups.

First, let’s get some data, and load the needed packages.

library(tidyverse)
# library(readr)
extra_file <- "https://raw.github.com/sebastiansauer/Daten_Unterricht/master/extra.csv"

extra_df <- read_csv(extra_file)

This dataset, extra, comprises answers to a survey tapping into extraversion as a psychometric variable plus some related behavior (e.g., binge drinking). The real-world details are not of much interest here.

So, let’s pick all numeric variables plus one grouping variable (say, sex):

# library(dplyr)
extra_df %>% 
  na.omit() %>% 
  select_if(is.numeric) %>% 
  names -> extra_num_vars

Next, we compute a t-test for each numeric variable (using our grouping variable Geschlecht, i.e., sex):

# library(purrr)
extra_df %>% 
  dplyr::select(one_of(extra_num_vars)) %>% 
  # na.omit() %>% 
  map(~t.test(. ~ extra_df$Geschlecht)) %>% 
  map(~compute.es::tes(.$statistic,
                       n.1 = nrow(dplyr::filter(extra_df, Geschlecht == "Frau")),
                       n.2 = nrow(dplyr::filter(extra_df, Geschlecht == "Mann")))) %>% 
  map(~do.call(rbind, .)) %>% 
  as.data.frame %>% 
  t %>% 
  data.frame %>% 
  rownames_to_column %>% 
  rename(outcomes = rowname) -> 
  extra_effsize

Note: output hidden, tl;dr.


Phew, that’s a bit confusing. So, what did we do?

  1. We selected the numeric variables of the data frame; here one_of comes in handy. This function takes a vector of column names as input and allows us to select all of them in the select function.
  2. We mapped a t-test to each of those columns, where extra_df$Geschlecht was the grouping variable. Note that the dot . stands for the current column that map hands over to the function; it forms the left-hand side of the formula.
  3. Next, we computed the effect size using tes from package compute.es.
  4. For each variable submitted to a t-test, bind the results row-wise. Each call to tes gives back a list of results, and we want to bind the elements of that list together. As there are several list elements, we can use do.call to bind them all in one go.
  5. Convert to data frame.
  6. Transpose the result using t, so that each outcome variable gets its own row.
  7. As t gives back a matrix, we need to convert to data frame again.
  8. Row names should be their own column with an appropriate name (outcomes).
  9. Save the result in its own data frame.
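The steps above can be sketched in a more compact form on toy data. This is a base R sketch under stated assumptions, not the pipeline from this post: the data frame toy and the helper cles_from_t are made up for illustration, d is recovered from the t statistic with the usual conversion d = t * sqrt(1/n1 + 1/n2), and CLES is taken as pnorm(d / sqrt(2)) in percent.

```r
# Toy data (hypothetical): one grouping variable, two numeric outcomes
set.seed(1)
toy <- data.frame(
  sex = rep(c("Frau", "Mann"), each = 50),
  x1  = c(rnorm(50, mean = 0.5), rnorm(50, mean = 0)),
  x2  = c(rnorm(50, mean = 0),   rnorm(50, mean = 0))
)

# Names of all numeric columns
num_vars <- names(toy)[sapply(toy, is.numeric)]

# Hypothetical helper: t-test -> Cohen's d -> CLES in percent
cles_from_t <- function(x, group) {
  tt <- t.test(x ~ group)
  n  <- table(group)
  d  <- unname(tt$statistic) * sqrt(1 / n[[1]] + 1 / n[[2]])
  100 * pnorm(d / sqrt(2))
}

# One row per outcome variable
toy_effsize <- data.frame(
  outcomes  = num_vars,
  cl.d      = sapply(num_vars, function(v) cles_from_t(toy[[v]], toy$sex)),
  row.names = NULL
)
toy_effsize
```

A value above 50 means that a randomly drawn “Frau” tends to score higher than a randomly drawn “Mann” on that outcome; a value below 50 means the opposite.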

And compute the differences in CLES, and plot it:

# library(ggplot2)
extra_effsize %>% 
  dplyr::select(outcomes, cl.d) %>% 
  mutate(sign = ifelse(cl.d > 50, "+", "-")) %>% 
  ggplot(aes(x = reorder(outcomes, cl.d), y = cl.d, color = sign)) + 
  geom_hline(yintercept = 50, alpha = .4) +
  geom_point(aes(shape = sign)) + 
  coord_flip() +
  ylab("effect size (Common Language Effect Size)") +
  xlab("outcome variables") + 
  ggtitle("CLES plot") -> CLES_plot


CLES_plot

The code above in more detail:

  1. Select columns outcomes and cl.d; note that cl.d stands for CLES (here: Common Language Cohen’s d, as CLES is derived from Cohen’s d).
  2. Compute a variable that tells in which direction the difference points, i.e., whether it is greater or smaller than 50%, where 50% means that neither group is more likely to have the higher value.
  3. Now plot, but sort the outcomes by their CLES value; for the purpose of sorting we use reorder.
  4. Flip the axes, because the long variable names are easier to read that way.

Often, we want to check for missing values (NAs). There are of course many ways to do so. dplyr provides a quite nice one.

First, let’s load some data:

library(readr)
extra_file <- "https://raw.github.com/sebastiansauer/Daten_Unterricht/master/extra.csv"

extra_df <- read_csv(extra_file)

Note that extra is a data frame consisting of survey items regarding extraversion and related behavior.

If the data frame is large (many columns), it is helpful to have a quick way to do this. Here, we have 25 columns. That is not enormous, but ok, let’s stick with that for now.

library(dplyr)

extra_df %>% 
  select_if(function(x) any(is.na(x))) %>% 
  summarise_each(funs(sum(is.na(.)))) -> extra_NA

So, what have we done? The select_if part chooses every column that contains at least one NA, i.e., where any(is.na(x)) is TRUE. Then we take those columns and, for each of them, we sum up (summarise_each) the number of NAs. Note that each column is summarized to a single value; that’s why we use summarise. And finally, the resulting data frame (dplyr always aims at giving back a data frame) is stored in a new variable for further processing.

Now, let’s see:

# library(pander)  # for printing tables in markdown
library(knitr)

kable(extra_NA)
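| code | i6 | i9 | i12 | Facebook | Kater | Alter | Geschlecht | extro_one_item | Minuten | Messe | Party | Kunden | Beschreibung | Aussagen | i26 | extra_mw |
|------|----|----|-----|----------|-------|-------|------------|----------------|---------|-------|-------|--------|--------------|----------|-----|----------|
| 82 | 1 | 1 | 1 | 73 | 12 | 3 | 3 | 4 | 37 | 4 | 16 | 49 | 117 | 121 | 3 | 3 |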
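As a cross-check that does not depend on any package, the same count can be sketched in base R. The toy data frame below is an assumption for illustration; colSums(is.na(...)) counts the NAs per column, and the second step keeps only the columns that have at least one.

```r
# Toy data (hypothetical) with missing values in two of three columns
toy <- data.frame(
  age   = c(23, NA, 31),
  sex   = c("Frau", "Mann", "Frau"),
  party = c(NA, NA, 2)
)

# Count NAs per column, then keep only columns with at least one NA
na_counts <- colSums(is.na(toy))
na_counts <- na_counts[na_counts > 0]
na_counts
```

The result is a named numeric vector rather than a data frame, which is handy for a quick look but less convenient for further piping.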

Subsetting a data frame is an essential and frequently performed task. Here, some basic ideas are presented.

Get some data first.

str(mtcars)
## 'data.frame':	32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars <- head(mtcars)  # for shorter output
mtcars
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

One: Addressing dataframe as list (vector, 1-dim structure)

Data frames can be understood/addressed as lists, i.e., as a special type of vector. Vectors have one dimension. Thus, we can access/subset them with one index only (one dimension). For example, mtcars[1] selects the first element (i.e., column) of mtcars.

mtcars[1]
##                    mpg
## Mazda RX4         21.0
## Mazda RX4 Wag     21.0
## Datsun 710        22.8
## Hornet 4 Drive    21.4
## Hornet Sportabout 18.7
## Valiant           18.1
mtcars["mpg"]
##                    mpg
## Mazda RX4         21.0
## Mazda RX4 Wag     21.0
## Datsun 710        22.8
## Hornet 4 Drive    21.4
## Hornet Sportabout 18.7
## Valiant           18.1
mtcars[c("mpg", "cyl")]  
##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6
mtcars[1:3]
##                    mpg cyl disp
## Mazda RX4         21.0   6  160
## Mazda RX4 Wag     21.0   6  160
## Datsun 710        22.8   4  108
## Hornet 4 Drive    21.4   6  258
## Hornet Sportabout 18.7   8  360
## Valiant           18.1   6  225

Note that data frames are vectors (lists) technically, where the vectors are the columns and the columns possess names. Thus we can address the columns by their names. Of course, we can select (address/index/subset) more than one element (column) using the c() function.

Besides names, we can address the elements by their number: type a positive integer to subset the respective element. c can be used here too.

Note that the : (colon) operator is shorthand for a sequence of consecutive integers; for example, 1:3 is the same as c(1, 2, 3), i.e., “from the first to the third column”.

Similarly to addressing the names of the data frames using brackets [], we can use the Dollar $ operator:

mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1

Whatever comes after the $ is understood by R as the name (without quotation marks) of the column within that data frame. $ is a shorthand for [[]] (but not exactly the same; see here for an excellent overview).
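The difference between [, [[ and $ can be seen in a small sketch. The data frame df and the variable col are made up for illustration.

```r
# Small example data frame (hypothetical)
df <- data.frame(mpg = c(21, 22.8), cyl = c(6, 4))

# Single brackets keep the data frame structure
class(df["mpg"])

# Double brackets and $ extract the column itself (a plain vector)
class(df[["mpg"]])
identical(df$mpg, df[["mpg"]])

# [[ ]] accepts a name stored in a variable ...
col <- "cyl"
df[[col]]

# ... whereas $ takes the name literally: there is no column called "col"
df$col
```

So $ is convenient for interactive work, while [[ ]] is the safer choice when the column name is held in a variable.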

Two: Addressing dataframe as a matrix (2-dim-structure, 2-dim-matrix)

As data frames can also be addressed as rectangular, two-dimensional matrices, we may subset specific elements using an x-y coordinate scheme. In R matrices the row is addressed first and the column second; e.g., mtcars[1, 2] is the first row of the second column.

mtcars[1,1]
## [1] 21
mtcars[1, c(1,2)]
##           mpg cyl
## Mazda RX4  21   6
mtcars[1, 1:3]
##           mpg cyl disp
## Mazda RX4  21   6  160
mtcars[1, c(1:3)]
##           mpg cyl disp
## Mazda RX4  21   6  160
mtcars[, c(1:3)]
##                    mpg cyl disp
## Mazda RX4         21.0   6  160
## Mazda RX4 Wag     21.0   6  160
## Datsun 710        22.8   4  108
## Hornet 4 Drive    21.4   6  258
## Hornet Sportabout 18.7   8  360
## Valiant           18.1   6  225
mtcars[1, "mpg"]
## [1] 21
mtcars[1, c("mpg", "cyl")]  
##           mpg cyl
## Mazda RX4  21   6

Again, the c() function may be used to group several rows or columns. Columns may again be addressed by their names (addressing rows by name is possible, but less common). The : colon operator is allowed, too.
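One detail worth knowing with matrix-style subsetting: when only a single column is selected, R simplifies the result to a plain vector unless drop = FALSE is given. A short sketch, using a fresh copy of the first six rows of mtcars (the object name cars is an assumption for illustration):

```r
# Fresh copy of the first six rows
cars <- head(datasets::mtcars)

# Selecting one column in matrix style returns a vector ...
v <- cars[, 1]
class(v)

# ... unless we ask R not to drop the data frame structure
d <- cars[, 1, drop = FALSE]
class(d)
dim(d)
```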

Three: Logical subsetting in dataframes

mtcars[c(T, T, F, F, F, F, F, F, F, F, T)]
##                    mpg cyl carb
## Mazda RX4         21.0   6    4
## Mazda RX4 Wag     21.0   6    4
## Datsun 710        22.8   4    1
## Hornet 4 Drive    21.4   6    1
## Hornet Sportabout 18.7   8    2
## Valiant           18.1   6    1
mtcars[c(T, T, F)]
##                    mpg cyl  hp drat  qsec vs gear carb
## Mazda RX4         21.0   6 110 3.90 16.46  0    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 17.02  0    4    4
## Datsun 710        22.8   4  93 3.85 18.61  1    4    1
## Hornet 4 Drive    21.4   6 110 3.08 19.44  1    3    1
## Hornet Sportabout 18.7   8 175 3.15 17.02  0    3    2
## Valiant           18.1   6 105 2.76 20.22  1    3    1

In the first example above, columns #1, #2, and #11 are selected, because their positions are indexed as TRUE (or T).

Note that if you supply fewer elements than the length of the object (e.g., here 11 columns/elements), R will recycle your elements until the full length of the object is met (here: TTF-TTF-TTF-TT).
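The recycling can be made visible with rep_len, which repeats a vector until a given length is reached. This is only a sketch of what happens conceptually; the object names are assumptions for illustration.

```r
# Fresh copy of the first six rows (11 columns)
cars <- head(datasets::mtcars)

# The short logical pattern ...
pattern <- c(TRUE, TRUE, FALSE)

# ... recycled to the number of columns: TTF-TTF-TTF-TT
recycled <- rep_len(pattern, ncol(cars))
recycled

# Column names picked by the recycled pattern
selected <- names(cars)[recycled]
selected
```

The names in selected are the same columns that cars[pattern] returns.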

Again, the data frame can be addressed either as a list (1-dim), or as a 2-dim matrix. See here for an example using logical indexing and addressing the data frame as a 2-dim matrix:

mtcars[c(T, T, F), c(T, T, F)]
##                    mpg cyl  hp drat  qsec vs gear carb
## Mazda RX4         21.0   6 110 3.90 16.46  0    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 17.02  0    4    4
## Hornet 4 Drive    21.4   6 110 3.08 19.44  1    3    1
## Hornet Sportabout 18.7   8 175 3.15 17.02  0    3    2

Actually, logical subsetting is quite powerful. We can use a predicate, i.e., an expression that evaluates to a logical state (TRUE or FALSE) for each row, within the subsetting:

mtcars[mtcars$cyl == 6, c(1,2)]
##                 mpg cyl
## Mazda RX4      21.0   6
## Mazda RX4 Wag  21.0   6
## Hornet 4 Drive 21.4   6
## Valiant        18.1   6

Here, we declared that we only want rows for which the following condition is TRUE: mtcars$cyl == 6. (And only cols 1 and 2.)
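Conditions can be combined with & (and) or | (or). The following sketch, again on a fresh copy of the first six rows of mtcars, keeps six-cylinder cars with more than 20 mpg; the base R function subset expresses the same selection without repeating the data frame name. The object names are assumptions for illustration.

```r
# Fresh copy of the first six rows
cars <- head(datasets::mtcars)

# Two conditions combined with & inside the row index
six_cyl_over_20 <- cars[cars$cyl == 6 & cars$mpg > 20, c("mpg", "cyl")]
six_cyl_over_20

# The same selection using subset()
same_rows <- subset(cars, cyl == 6 & mpg > 20, select = c(mpg, cyl))
same_rows
```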

Final words

Subsetting in R is an essential task. It is also not so easy, as many slightly different variants exist. Here, only some ideas were presented. A much broader treatment is excellently presented by Hadley Wickham here.

While subsetting using base R should be well understood, it may be more comfortable to use functions such as select from dplyr.

Downloading a folder (repository) from Github as a whole

The most direct way to get data from Github onto your computer or into R is to download the repository. That is, click the big green button saying “Clone or download” and choose “Download ZIP”.

Of course, for those using Git and Github, it would be appropriate to clone the repository. And, although it appears more advanced, cloning has the definite advantage that you’ll enjoy the whole of the Github features. In fact, the whole purpose of Github is to provide a history of the file(s), so the purpose is not really served if one just downloads the most recent snapshot. But anyhow, that depends on your own preference.

Note that “repository” can be thought of as “folder” or “project”.

Once downloaded, you need to unzip the folder. Unzipping means to “extract” or “unpack” the file/folder. On many machines, this can be accomplished by right clicking the icon and choosing something like “extract here”.

Once extracted, just navigate to the folder and open whatever file you are inclined to.

Downloading individual files from Github

In case you do not want to download the whole repository, individual files can be downloaded and read into R quite easily:

library(readr)  # for read_csv
library(knitr)  # for kable
myfile <- "https://raw.github.com/sebastiansauer/Daten_Unterricht/master/Affairs.csv"

Affairs <- read_csv(myfile)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   affairs = col_integer(),
##   gender = col_character(),
##   age = col_double(),
##   yearsmarried = col_double(),
##   children = col_character(),
##   religiousness = col_integer(),
##   education = col_integer(),
##   occupation = col_integer(),
##   rating = col_integer()
## )
kable(head(Affairs))
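| X1 | affairs | gender | age | yearsmarried | children | religiousness | education | occupation | rating |
|----|---------|--------|-----|--------------|----------|---------------|-----------|------------|--------|
| 1 | 0 | male | 37 | 10.00 | no | 3 | 18 | 7 | 4 |
| 2 | 0 | female | 27 | 4.00 | no | 4 | 14 | 6 | 4 |
| 3 | 0 | female | 32 | 15.00 | yes | 1 | 12 | 1 | 4 |
| 4 | 0 | male | 57 | 15.00 | yes | 5 | 18 | 6 | 5 |
| 5 | 0 | male | 22 | 0.75 | no | 2 | 17 | 6 | 3 |
| 6 | 0 | female | 32 | 1.50 | no | 2 | 17 | 5 | 5 |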

Let’s quickly deconstruct the url above from Github. In general, we need to write:

https://raw.github.com/user/repository/branch/file.name.

In many cases, the branch will be “master”. You can easily check this on the page of the repo you want to download.
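The URL scheme can be wrapped in a tiny helper. The function name github_raw_url is made up for illustration; it merely pastes the four parts together in the order given above and does not check whether the file exists.

```r
# Hypothetical helper: build the raw-file URL from its four parts
github_raw_url <- function(user, repository, branch, file) {
  paste("https://raw.github.com", user, repository, branch, file, sep = "/")
}

# Rebuild the URL used above
affairs_url <- github_raw_url(
  user       = "sebastiansauer",
  repository = "Daten_Unterricht",
  branch     = "master",
  file       = "Affairs.csv"
)
affairs_url
```

The resulting string can then be handed to read_csv just like myfile above.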

I’ve noticed that unzipping a repository from Github (and downloading a zip file) can cause confusion, so it might be easier to provide a code bit as shown above.

BTW: read.csv should work equally well.

In my role as a teacher, I (have to) write a lot of marking feedback reports. My university provides a website to facilitate the process, which is great. I have also been writing my reports with Pages, Word, and the like. But somewhat cooler, more attractive, and more reproducible would be using a markup language such as Markdown. Basically, that’s easy, but it would help to have a template that produces a nicely formatted report.

Download this pdf file here. Here is the source file. Credit goes to the Pandoc team; I based my template on theirs.

So how to do it?

First and foremost, write your report using Markdown, and convert it to HTML or LaTeX-PDF using Pandoc. RStudio provides nice introductions, e.g., here or here.

Next, tell your Markdown document to use your individual stylesheet, i.e., template. Note that I focus here on PDF output.

---
subtitle: "A general theory ..."
title: "Feedback report to the assignment"
output:
  pdf_document:
    template: template_feedback.latex   
---

You have to put the bit above in the YAML header of your Markdown document (right at the top of your document); see the source file for details. And then, you just write your Markdown report in plain English (or whatever language…).

However, the real action is in the LaTeX template, which is used by the Markdown document (via the YAML header). The idea is that in the LaTeX file, we define some variables (such as “author” or “title”) which can then be used in the Markdown file. The YAML header of the Markdown document is able to address those variables defined in the LaTeX template. In this example, the variables defined include:

  • author
  • title
  • subtitle
  • “thanks to” (I use this field as some “freeride” variable)
  • date

The body (main part) of the one-page example above basically looks like this:


# Obedience to the teacher
- Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
- sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, 
- sed diam voluptua. 
...


# Statistical abuses
- Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
...


# Contribution to meaning of life
- Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
(...)

In addition, the style sheet, being based on Pandoc’s stylesheet, allows for quite a number of further format-related adjustments such as language, geometry of the paper, section numbering, etc. See the excellent Pandoc help for details.

Enjoy!