Some tricks on dplyr::filter

The R package dplyr has some attractive features; some say this package revolutionized their workflow. At any rate, I like it a lot and find it very helpful.

In this post, I would like to share some useful (I hope) ideas (“tricks”) on filter, one function of dplyr. The function does what its name suggests: it filters rows (i.e., observations such as persons). Rows that match the condition are kept; the rest are dropped. Note that a data frame (tibble) is always returned.

Starters example

library(tidyverse)  # get da whole shebang!

## Loading tidyverse: ggplot2

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

# don't forget to have the package(s) installed upfront.


An easy use case would be:

mtcars %>%
filter(cyl >= 8)

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 2  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 3  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 4  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 5  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 6  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 7  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 8  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 9  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 10 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 11 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 12 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 13 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 14 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8


We see there are 14 cars with 8 cylinders.

Typical comparison operators to filter rows include:

• == equality
• != inequality
• < or > smaller than / greater than
• <= or >= less than or equal / greater than or equal
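For instance, a quick sketch using the built-in mtcars data:

```r
library(dplyr)

# keep only the most economical cars
mtcars %>%
  filter(mpg >= 30)   # returns 4 rows
```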

Multiple logical comparisons can be combined. Just add ‘em up using commas; note that the comma amounts to a logical AND:

mtcars %>%
filter(cyl == 8, hp > 250)

##    mpg cyl disp  hp drat   wt qsec vs am gear carb
## 1 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
## 2 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8


An AND combination can also be achieved the standard way, using &:

mtcars %>%
filter(cyl == 6 & hp < 260)

##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 4 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 5 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 6 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 7 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6


Before we continue, let’s turn the row names (where the car names are stored) into a proper column, so that we can address the car names the usual way:

mtcars <- mtcars %>%
rownames_to_column


Pick value from list

One particularly helpful trick is to say “I want to keep any of the following items”. In R, the %in% operator comes to help. See:

mtcars %>%
filter(cyl %in% c(4, 6))

##           rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1       Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2   Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4  Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5         Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 6       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 7        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 8        Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 9       Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 10       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 11    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 12 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 13  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 14      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 15  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 16   Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 17   Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 18     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2


Partial matching

Suppose you would like to filter all Mercs; the Mercs include “Merc 240D”, “Merc 280C”, and others. So we cannot filter for “Merc” as an exact search string. We need to tell R: “hey, if ‘Merc’ is part of this string, then keep it, otherwise drop it”.

For more flexible string-operations, we can make use of the package stringr (again, by Hadley Wickham).

library(stringr)

mtcars %>%
filter(str_detect(rowname, "Merc"))

##       rowname  mpg cyl  disp  hp drat   wt qsec vs am gear carb
## 1   Merc 240D 24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
## 2    Merc 230 22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
## 3    Merc 280 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
## 4   Merc 280C 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
## 5  Merc 450SE 16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
## 6  Merc 450SL 17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
## 7 Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3


Of course, we can now go wild, making use of the whole string-manipulation magic called regular expressions (regex). This tool is powerful indeed, but it takes some time to get used to. For example, let’s filter all cars whose name begins with “L”:

mtcars %>%
filter(str_detect(rowname, "^L"))

##               rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 2        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2


The caret (^) means “the string starts with”; in this example, “the string starts with ‘L’”. To get values ending with, say, “L”, we use $ in regex:

mtcars %>%
filter(str_detect(rowname, "L$"))

##          rowname  mpg cyl  disp  hp drat   wt qsec vs am gear carb
## 1     Merc 450SL 17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
## 2 Ford Pantera L 15.8   8 351.0 264 4.22 3.17 14.5  0  1    5    4


Another use case could be that we want to pick rows where the names contain digits.

mtcars %>%
filter(str_detect(rowname, "\\d"))

##           rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1       Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2   Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4  Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5      Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 6       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 7        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 8        Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 9       Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 10     Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 11     Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 12    Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 13       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 14     Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 15      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 16  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 17     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2


In Regex, \d means “digit”. As the backslash needs to be escaped (by typing an extra backslash), we type two backslashes, and get what we want.

Similarly, if we want all values except those with digits, we could say:

mtcars %>%
filter(!str_detect(rowname, "\\d"))

##                rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 2              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 3   Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 4  Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 5    Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 6          Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 7       Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 8        Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 9     Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 10         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 11    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 12        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 13      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 14        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 15       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8


As ! is the logical not, we can invert the expression above (str_detect) this way.

Finally, say we want to filter all “Mercs” and all “Toyotas”. As there are different Mercs and different Toyotas in the data set, we need to tell R something like “take all Mercs you can find, take all Toyotas, and leave the rest”.

What does not work is this:

mtcars %>%
filter(rowname %in% c("Merc", "Toyota"))

##  [1] rowname mpg     cyl     disp    hp      drat    wt      qsec
##  [9] vs      am      gear    carb
## <0 rows> (or 0-length row.names)


The code above does not work because the %in% operator does not do partial matching; it expects complete matches.
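A quick illustration of this exact-matching behavior:

```r
# %in% only matches complete strings, not substrings
"Merc 230" %in% c("Merc 230", "Merc 280")  # TRUE
"Merc"     %in% c("Merc 230", "Merc 280")  # FALSE
```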

Again, Regex can help:

mtcars %>%
filter(str_detect(rowname, "Merc|Toy"))

##          rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1      Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 2       Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 3       Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 4      Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 5     Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 6     Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 7    Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 8 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 9  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1


Note that | (the vertical bar) means “or” (both in R and in regex).

The same result can be achieved in a more usual R way:

mtcars %>%
filter(str_detect(rowname, "Merc") | str_detect(rowname, "Toy"))

##          rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1      Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 2       Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 3       Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 4      Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 5     Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 6     Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 7    Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 8 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 9  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1


Filtering NAs

For the sake of illustration, let’s introduce some NAs to mtcars.

mtcars$mpg[sample(32, 3)] <- NA
mtcars$cyl[sample(32, 3)] <- NA
mtcars$hp[sample(32, 3)] <- NA
mtcars$wt[sample(32, 3)] <- NA


First, we filter all rows where there are no NAs in mpg:

mtcars %>%
filter(!is.na(mpg))

##                rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3           Datsun 710 22.8  NA 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7           Duster 360 14.3   8 360.0 245 3.21    NA 15.84  0  0    3    4
## 8            Merc 240D 24.4  NA 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9             Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 10           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 11          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 12          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 13  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 14 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 15   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 16            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 17      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 18       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 19    Dodge Challenger 15.5   8 318.0 150 2.76    NA 16.87  0  0    3    2
## 20         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 21          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 22    Pontiac Firebird 19.2  NA 400.0 175 3.08 3.845 17.05  0  0    3    2
## 23           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 24       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 25        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 26      Ford Pantera L 15.8   8 351.0  NA 4.22 3.170 14.50  0  1    5    4
## 27        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 28       Maserati Bora 15.0   8 301.0  NA 3.54 3.570 14.60  0  1    5    8
## 29          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2


Easy. Next, we keep only complete rows (no NA in any column). Wait, there’s a shortcut for that:

mtcars %>%
na.omit %>%
nrow

## [1] 22


But what if we only care about NAs in mpg and hp? Say we want to drop any row with an NA in either of these two columns. Here’s a way:

mtcars %>%
filter(!is.na(mpg) & !is.na(hp))

##                rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3           Datsun 710 22.8  NA 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7           Duster 360 14.3   8 360.0 245 3.21    NA 15.84  0  0    3    4
## 8            Merc 240D 24.4  NA 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9             Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 10           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 11          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 12          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 13  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 14 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 15   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 16            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 17      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 18       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 19    Dodge Challenger 15.5   8 318.0 150 2.76    NA 16.87  0  0    3    2
## 20         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 21          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 22    Pontiac Firebird 19.2  NA 400.0 175 3.08 3.845 17.05  0  0    3    2
## 23           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 24       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 25        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 26        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 27          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
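As an aside, the same result can be obtained with tidyr’s drop_na(), which takes the columns to check as arguments (tidyr is loaded with the tidyverse):

```r
library(tidyr)

mtcars %>%
  drop_na(mpg, hp)  # drops any row with NA in mpg or in hp
```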


Finally, assume we want to inspect each row where there is an NA in mpg or in hp.

mtcars %>%
filter((is.na(mpg) | is.na(hp)))

##          rowname  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1       Merc 230   NA   4 140.8  NA 3.92 3.150 22.90  1  0    4    2
## 2    Merc 450SLC   NA   8 275.8 180 3.07    NA 18.00  0  0    3    3
## 3    Honda Civic   NA   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 4 Ford Pantera L 15.8   8 351.0  NA 4.22 3.170 14.50  0  1    5    4
## 5  Maserati Bora 15.0   8 301.0  NA 3.54 3.570 14.60  0  1    5    8


Happy filtering!

Some thoughts on 'Dear stats curriculum developers'

Recently, Andrew Gelman (@StatModeling on Twitter) published a post with this title: “‘Dear Major Textbook Publisher’: A Rant”.

In essence, he discussed what a good intro stats textbook should look like, and complained about the low quality of many textbooks out there.

As I am also in the business (guilty!) of coming up with stats curricula for my students (mostly applied courses for business-type students), I discuss some thoughts for “stats curriculum developers” (like myself).

Of course, the GAISE report provides an authoritative overview on what is considered helpful (by the ASA and by many others). My compilation builds on that but adds/drops some points.

Principles in stats curriculum

• Statistical thinking above anything else. What is statistics all about? Get a bunch of data, see the essence (some summary, often). But wait, there is variability; stuff comes in different shades of grey. How certain (or rather uncertain) are we about some hypothesis? Ah, that’s why probability comes into play. My great-grandpa smoked, loved his beer, and see, turned 94 recently, and you keep telling me that would not be enough to “demonstrate”…. Ah, OK… What does statistics say about causality? (Not much, to my mind.) This commitment implies that procedural know-how needs to be downgraded (we cannot have the cake and eat it).

• Problem solving. Give students some dataset and some question, and let them investigate. Discuss, refine, correct their actions. Explain better ways afterwards (I heard this is called inductive learning).

• Real data, real problems. Don’t stay with urns and coin tosses; they are fine as a starter, but then go on to real data and real problems. For example, I used extraversion and the number of Facebook friends as a running example, based on the data students gave in this survey. So they analyzed their own data. If you want to make use of the data: you are welcome, here it is (free as in free).

• A lot of ‘guide at your side’ (activation), rather than ‘sage on stage’. However, I found that surprisingly many students appear overwhelmed if the “activation dose” gets too strong. So I try to balance the time when I speak, when students work on their own, and when they work in groups. Honestly, what’s the advantage of listening to me in person compared to watching a YouTube video? OK, at times I may respond (YouTube not so much), but the real benefit would be joint problem solving. So let’s do that.

• Technology (R). Computers are already ubiquitous, and penetrating even more of everyday life. So we as teachers should not dream of the “good old days” when we solved triple integrals with formulas as wide as my arm span (I never did). The future is likely to be dominated by computers when it comes to our working style and working tools. So let’s use them and show students how to code. A bit, at least (though many are frightened initially). R is a great environment for that purpose.

• Ongoing self-assessment. Cognitive and educational psychology points out that frequent assessment helps a lot in learning [citation missing]. So let’s give the students ongoing feedback. For example, I use a Google Form with the quiz feature for that purpose. Minimal preparation on my side for tomorrow’s class, because it is all pre-prepared.

• Focus on/start with intuition. Don’t start by throwing some strange (\LaTeX-typeset, of course) equation in the students’ faces (unless you are teaching in some math-heavy field, but not in business). Start by explaining the essence in simple words and giving an intuition. For example, variance can be depicted as (something like) “average deviation sticks”; correlation as “average deviation rectangles” (see this post). When the essence is understood, refine. Now come your long equations. This post speaks of a refinement process from “sorta-true” to very true. Quite true.

• Be nice. I think there is no point in being overly austere; surely you must be joking, Mr. Feynman. Even stats classes can be fun… Personally, I find Andy Field’s tone in his textbook quite refreshing.
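The “average deviation sticks” intuition above can be demonstrated in a couple of lines (a sketch; note that R’s var() divides by n - 1, so it differs slightly from the plain average):

```r
x <- c(2, 4, 6, 8)
devs <- x - mean(x)   # the "deviation sticks"
mean(devs^2)          # average squared stick length: 5
var(x)                # var() divides by n - 1 instead: ~6.67
```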

Aims of a stats curriculum

If your class belongs to a business-type field, don’t expect that they all will now turn to science as a career. More realistically, they will have to face some stats questions in their working lives. Two things may happen then.

First, they hire someone who really knows what to do (at least, someone who thinks so). Then your students of today need to be prepared to speak to that expert. They need enough understanding to get the gist of the expert’s strategy. They need to be able to ask some good questions. But they clearly do not need to understand many details, if any.

Second scenario: the future ego of your today-student will do some stats problem solving by her- or himself. I think even our applied business students should have some working knowledge, maybe of regression plus some Random Forests, based on some general ideas plus visualization and data wrangling.

In sum, critical understanding comes first, and some partial, locally constrained, actionable know-how comes second.

Content

Some thought blobs.

• Descriptive stats can be paired with some cool visualization. I like qplot from ggplot2 because it allows an easy start, but also allows combining different steps, which fosters creative problem solving (I hope).

• Data wrangling. dplyr is great because with ~5 functions you can rule them all (see here for an intro). That is, a couple of easy functions, such as filter, summarise, or count, do the bulk of typical data preparation tasks (yes, these functions pretty much do what their names suggest).

• Wee p. I tend(ed) to focus on this point in class, not because I love it, but because it is still the way papers are built. So students need to understand it. And it is not easy to comprehend. If they do not understand it, how can the new generation overcome the problems of the past? I try to lay bare the problems of the p-value (mainly that it is not the answer we need).

• Bayes. It is (probably) too strong to state that without understanding Bayes, one cannot understand the p-value. But understanding posterior probabilities is a relief, as the p-value can then be accepted for what it is: some conditional probability (of some data given some hypothesis). Bayes puts p into context. But Bayes, I fear (not being deeply versed in Bayes myself), can necessitate a lot of groundwork and time-consuming fundamentals if you really want some working knowledge. That’s why, for the moment at least, I take a shortcut and present only the basic concept, alongside, maybe, some off-the-shelf software such as JASP.

• Basic statistical learning ideas. Overfitting is an essential concept; it should be grasped. There are a number of similarly important points (particularly resampling, cross-validation, and the bias-variance trade-off); the book An Introduction to Statistical Learning is of great help here.

• Algorithmic methods, in the sense of Breiman’s two cultures. One particularly nice candidate here are tree-based methods, including the notorious Random Forests: conceptually easy, yet with high predictive performance in some situations.

• Text mining. 90% of all… I mean, 90% of all information is unstructured, mostly image and text, some say. So let’s look at that. It’s fun, not yet so much looked at, and of some relevance for business students. Julia Silge’s and David Robinson’s book on text mining appears great (and vast, for that matter).

• Repro crisis. Let’s not be silent about the problems we face in Science. True, there are great challenges, but hey, what we need is not less, but more science. A good dose of open science can and should be plugged in here.

Byebye PowerPoint

In my experience, it can be difficult to bypass PowerPoint, even when you are preparing a curriculum. At my university, several teachers may be (or are encouraged to be) making use of your material, and they may (or will) utterly complain if you do not use PowerPoint (Keynote as a surrogate is not a solution).

But for the next curriculum, my plan is to use RMarkdown to write a real script, not only slides. That will be lots of work. But then it will be easy to come up with lean slides, without much text, because the script will hold the details. Then slides can be used for what they are intended: one idea per slide, no or little text, primarily graphics or schemas. The slides can be prepared using RMarkdown, too. RMarkdown is not particularly fit for that purpose, though; I mean, you will not have the full feature set of PPT, nor its ease. But you will not need it. Simple slides can very well be made with RMarkdown. Once there is a real “mini-book”, a.k.a. script, no detailed slides are needed any more.

Git and friends can then be used for version control, collaboration, and so on…

However, a lot of colleagues will complain, I fear. Let’s see. I hope to convince them that this is a far better way. Bye bye PowerPoint.

Conclusions

All this is some quick thinking in progress, more to make up my mind and to stimulate your thinking. I believe everything written here, but I am sure I missed a lot and made a number of mistakes. Let me know your thoughts!

Acknowledgements

I learnt a lot, both statistically as well as didactically, from my colleague Karsten Luebke, whose thoughts also went into this post.

Simulation of p-values

Teaching or learning stats can be a challenging endeavor. In my experience, starting with concrete (as opposed to abstract) examples helps many a learner. What also helps (for me) is visualizing.

As p-values are still part and parcel of probably any given stats curriculum, here is a convenient function to simulate p-values and to plot them.

“Simulating p-values” amounts to drawing many samples from a given, specified population (e.g., µ = 100, s = 15, normally distributed). We could go out ourselves and draw samples (e.g., testing the IQ of strangers). As the computer can do that too, and does not moan about repetitive tasks, let’s leave that task to the machine.

Simulation can be easier to comprehend than working with theoretical distributions. It is more tangible to say “we are drawing many samples” than to speak about some theoretical distribution. That’s why simulation can be of didactic value.
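The core of such a simulation fits in a few lines of base R (a sketch, independent of the function sourced below; the parameter values are just examples):

```r
set.seed(42)   # for reproducibility
n_samples <- 1000
# draw 1000 samples of size 30 from N(mu = 100, sd = 15), keep the means
sample_means <- replicate(n_samples, mean(rnorm(n = 30, mean = 100, sd = 15)))
# simulated p-value: share of sample means at least as extreme as 104
mean(sample_means >= 104)
```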

So, here you can source an R function which will do the simulation plus plotting job for you.

source("https://sebastiansauer.github.io/Rcode/clt_1.R")


Note that dplyr and ggplot2 need to be installed.

Use the function simu_p(). It can be run without parameters:

s1 <- simu_p()


The plot shows the sampling distribution, the 5% of sample means with the highest values, the p-value, and the mean of the sample actually drawn (dashed line).

However, you are free to adapt the parameters, e.g., experimenting with smaller or larger sample sizes, or a different sample mean:

s2 <- simu_p(n_samples = 1000, mean = 100, sd = 15, n = 5, distribution = "normal", sample_mean = 104)


Note that two types of distribution to draw from are implemented: normal and uniform:

s3 <- simu_p(distribution = "uniform", sample_mean = .6)


Initially, I wrote the function to demonstrate the central limit theorem. That’s why two different distributions (normal and uniform) are implemented.

Note that the p-value presented here is of type greater than only, for didactic reasons.

The object returned (e.g., s2) contains the sample means, a cumulative percentage, and indications of whether a given sample mean falls in the top 5% and whether it is larger than the specified sample mean, respectively:

head(s1)

##     samples   perc_num max_5perc greater_than_sample
## 1 101.73581 0.74374374         0                   0
## 2 102.74671 0.84484484         0                   0
## 3  93.98155 0.01501502         0                   0
## 4  98.10223 0.25825826         0                   0
## 5 100.50964 0.56456456         0                   0
## 6 100.17396 0.52152152         0                   0


Here is the source code of the function simu_p():

simu_p <- function(n_samples = 1000, mean = 100, sd = 15, n = 30, distribution = "normal", sample_mean = 107){

library(ggplot2)
library(dplyr)

if (distribution == "normal") {
samples <- replicate(n_samples, mean(rnorm(n = n, mean = mean, sd = sd)))

df <- data.frame(samples = samples)

df %>%
mutate(perc_num = percent_rank(samples),
max_5perc = ifelse(perc_num > .95, 1, 0),  # percent_rank() returns values in [0, 1]
greater_than_sample = ifelse(samples > sample_mean, 1, 0)) -> df

p_value <- round(mean(df$greater_than_sample), 3)

p <- ggplot(df) +
aes(x = samples) +
geom_histogram() +
labs(title = paste("Histogram of ", n_samples, " samples", "\n from a normal distribution", sep = ""),
caption = paste("mean-pop=", mean, ", sd-pop=", sd, ", mean in sample=", sample_mean, sep = ""),
x = "sample means") +
geom_histogram(data = dplyr::filter(df, perc_num > .95), fill = "red") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept = sample_mean, linetype = "dashed", color = "grey40") +
ggplot2::annotate("text", x = Inf, y = Inf, label = paste("p =", p_value), hjust = 1, vjust = 1)
print(p)
}

if (distribution == "uniform") {
samples <- replicate(n_samples, mean(runif(n = n, min = 0, max = 1)))
df <- data.frame(samples = samples)
if (sample_mean > 1) sample_mean <- .99
df %>%
mutate(perc_num = percent_rank(samples),
max_5perc = ifelse(perc_num > .95, 1, 0),  # percent_rank() returns values in [0, 1]
greater_than_sample = ifelse(samples > sample_mean, 1, 0)) -> df
p_value <- round(mean(df$greater_than_sample), 3)

p <- ggplot(df) +
aes(x = samples) +
geom_histogram() +
labs(title = paste("Histogram of ", n_samples, " samples", "\n from a uniform distribution", sep = ""),
caption = paste("sample mean =", sample_mean, ", min-pop = 0, max-pop = 1"),
x = "sample means") +
geom_histogram(data = dplyr::filter(df, perc_num > .95), fill = "red") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept = sample_mean, linetype = "dashed", color = "grey40") +
ggplot2::annotate("text", x = Inf, y = Inf, label = paste("p =",p_value), hjust = 1, vjust = 1)
print(p)

}
return(df)

}


Pipe the Variance

One idea of problem solving is, or should be, I think, that one should tackle problems of high, but not too high, complexity. That sounds trivial; a cooler way to put it would be “as hard as possible, as easy as necessary”, which is basically the same thing.

In software development, including Rstats, a similar principle applies. Sounds theoretical, I admit. So here are some lines of code that have bitten me recently:

obs <-  c(1,2,3)
pred <- c(1,2,4)

monster <- 1 - (sum((obs - pred)^2))/(sum((obs - mean(obs))^2))
monster

## [1] 0.5


The important line is of course

 1 - (sum((obs - pred)^2))/(sum((obs - mean(obs))^2))


The bug I incorporated was something like this:

 1 - ((obs - pred)^2)/((obs - mean(obs))^2)


Friendly bugs yield a function that dies trying to do its job. Nasty bugs silently infuse problems. In this case, not a single number but a vector of numbers resulted. Once you know it, surely, it is easy to spot. But staring at some 1000 lines of code can make it much more difficult to see…

So, one could argue that the complexity was (too) high… for me, at least. Sigh. To put it differently: Can the code be rendered more robust against bugs?

I think the pipe (%>%) helps to reduce the complexity. The pipe chops one big package of computation into tiny pieces, to be dealt with step by step, one after the other, and not all in one gobble-dee-gopp.

So, compare the pipe code:

library(dplyr)  # make sure it is installed

obs %>%
`-`(pred) %>%
`^`(2) %>%
sum -> SS_b

obs %>%
`-`(mean(obs)) %>%
`^`(2) %>%
sum -> SS_t

R2 <- 1 - (SS_b/SS_t)
R2

## [1] 0.5


In words, what does this code do:

1. Take obs, then
2. from that (obs) subtract pred, then
3. square each number, then
4. sum up the resulting numbers, then
5. save that number as SS_b

That’s really simple, isn’t it!

A similar procedure applies to the second bit; tl;dr.

Note that the pipe comes from the package magrittr, but it is re-exported by dplyr. (The operators `-` and `^` need backticks to be piped into, since they are used here as ordinary functions.)

Admittedly, I have allowed myself one or two intermediate results (SS_b, SS_t), which make the code easier to follow, too. But that is also a way to strip unnecessary (I would say) complexity.
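The same step-by-step idea works without the pipe, too: name each intermediate result, so every step can be inspected on its own. A minimal base-R sketch (the intermediate variable names are mine):

```r
# Same R^2 computation, one named step at a time (base R only)
obs  <- c(1, 2, 3)
pred <- c(1, 2, 4)

resid_sq <- (obs - pred)^2        # squared residuals
SS_b     <- sum(resid_sq)         # residual sum of squares
dev_sq   <- (obs - mean(obs))^2   # squared deviations from the mean
SS_t     <- sum(dev_sq)           # total sum of squares

R2 <- 1 - SS_b / SS_t
R2

## [1] 0.5
```

Each of resid_sq, SS_b, and so on can be printed and sanity-checked in isolation, which is exactly what makes the vector-vs-scalar bug above easy to catch.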

Some musings on the validation of Satow's Extraversion questionnaire

Measuring personality traits is one of the bread-and-butter businesses of psychologists, at least of the quantitatively oriented ones. Literally thousands of psychometric questionnaires exist; measures abound. Extraversion, part of the Big Five approach to personality, is one of the most widely used and most extensively scrutinized traits of human personality.

One rather new, but quite often used, questionnaire is Satow's (2012) B5T. A reason for the popularity of this instrument is that it runs under a CC licence, in contrast to the old ducks, which cost dearly. The B5T has undergone some psychometric scrutiny, and a number of results support the notion that it is a valid instrument.

Let's look here into some aspects of the validity of the instrument. More bluntly: Does the B5T (here: only the extraversion scale) really measure extraversion? My point is not so much a judgement on this particular instrument, but rather which checks can guide us in answering that question.

First, we do not know whether extraversion per se is of metric level; can the mind distinguish between infinitely many distinct yet equidistant steps on an imagined continuum? Surely this is a strong assumption. But let's put that aside for now (as nearly everybody does, not only for a moment but forever).

External validity first

A primary check should be the external validity of the scale. By external validity I mean, e.g., whether the instrument is correlated with some variable it should be correlated with according to the theory we hold. That is to say, if some "obvious" fact is not picked up by a measurement device, we would and should be unwilling to accept that it really measures what it ought to measure.

Note that reliability is no more than a premise or prerequisite for (external) validity. Even if we know some measure is reliable (by standard procedures), we do not know whether it measures what it should measure. Conversely, if there is some sound association with the right external measure, we are more confident (though maybe not yet satisfied).

Data & packages

Let’s have a look at our extraversion (e) data. Here we go:

e <- read.csv("https://osf.io/meyhp/?action=download")


The DOI of this data is 10.17605/OSF.IO/4KGZH

We will need these packages:

library(tidyverse)

## Loading tidyverse: ggplot2

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(htmlTable)
library(corrr)
library(psych)

##
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
##
##     %+%, alpha

library(plotluck)


Note that recoding has already taken place, and a mean value has been computed; see here for details. The extraversion scale consists of 10 items, with 4 answer categories each. A codebook can be found here: https://osf.io/4kgzh/.

Plotting means and frequencies

Let’s have a brief look at some stats as first step. Here are the histograms:

e %>%
select(i01:i10) %>%
gather %>%
ggplot +
aes(x = value) +
geom_bar()+
facet_wrap(~key, ncol = 4)

## Warning: Removed 24 rows containing non-finite values (stat_count).


Let’s save this dataframe because we will use it frequently:

e %>%
select(i01:i10) %>%
gather(key = item, value = value) -> e2


And means and SD:

e %>%
select(i01:i10) %>%
gather(key = item, value = value) %>%
group_by(item) %>%
summarise(item_mean = round(mean(value, na.rm = T), 2),
item_sd = round(sd(value, na.rm = T), 2)) -> e_summaries

htmlTable(e_summaries)

item item_mean item_sd
1 i01 3.34 0.68
2 i02r 3.12 0.8
3 i03 1.91 0.92
4 i04 3.24 0.73
5 i05 3.05 0.77
6 i06r 2.94 0.77
7 i07 2.97 0.73
8 i08 2.93 0.86
9 i09 3.35 0.71
10 i10 2.24 0.9

Maybe let’s plot that.

e_summaries %>%
mutate(mean = item_mean,
lower = mean - item_sd,
upper = mean + item_sd) %>%
select(-item_mean) -> e_summaries

e_summaries %>%
ggplot(aes(x = item, y = mean)) +
geom_errorbar(aes(ymin = lower,
ymax = upper),
color = "grey40") +
geom_point(color = "blue") +
labs(caption = "Error bars denote SDs")


Note that scale_y_continuous(limits = c(1,4)) will not work here, because this method kicks out all values beyond those limits before the statistics are computed, which will screw up your error bars (I ran into that problem, and it took me a while to see it).
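If you only want to zoom the axis without dropping data, coord_cartesian() does that. A quick sketch of the difference, with made-up limits on a built-in data set:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  stat_summary(fun.data = mean_sdl, geom = "errorbar")

# Drops observations outside the limits BEFORE the stats are computed,
# so the error bars are based on a truncated sample:
p + scale_y_continuous(limits = c(15, 25))

# Zooms the view only; all observations still enter the computation:
p + coord_cartesian(ylim = c(15, 25))
```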

Hm, there appear to be two types of items: one with a rather large mean, and one with a small mean. The SDs appear roughly equal.

Let's sort the plot by the item means.

e_summaries %>%
ggplot(aes(x = reorder(item, mean), y = mean)) +
geom_errorbar(aes(ymin = lower,
ymax = upper),
color = "grey40") +
geom_point(color = "blue") +
labs(caption = "Error bars denote SDs")


Maybe plot stacked bar plots.

e2 %>%
na.omit %>%
count(item, value) %>%
ggplot +
aes(x = item, y = n, fill = value) +
geom_col() +
coord_flip()


We see that the difficulty of the items ranges quite a bit. This last plot surely gives us the best impression of what's going on. Note that the actual values are depicted, not functions of them such as the mean or SD.

For example, for nearly all items, mainly categories 3 and 4 were picked. So the items were comparatively easy, which is not uncommon. Maybe a finer granularity at the "bottom" would help to extract more information.

Correlation of means score with external criteria

OK, the correlation of the mean score with some external criteria surely is the central (first) idea to see how well the scale may have measured what it should measure.

We could assume that there is an association of extraversion and the number of Facebook friends.

cor(x = e$extra_mean, y = e$n_facebook_friends, use = "complete.obs") -> cor_mean_FB
cor_mean_FB

## [1] 0.2558908


Hm, not really that much; but then, in real life you have to be happy with non-perfect things…

There are a number of similar behavioral indicators in the data set besides number of FB friends. Let’s look at each in turn.

e %>%
select(extra_mean, time_conversation, n_party, n_hangover) %>%
correlate %>%
focus(extra_mean) %>%
mutate_if(is.numeric, funs(round), digits = 2) -> corr_emean_criteria

htmlTable(corr_emean_criteria)

rowname extra_mean
1 time_conversation -0.11
2 n_party 0.25
3 n_hangover 0.11

Basically not much going on; maybe n_party is of some interest.

Let’s plot a scatterplot.

e %>%
qplot(x = extra_mean, y = n_facebook_friends, data = .) +
geom_smooth(method = "lm", se = FALSE) +
geom_label(x = 4, y = 2500, label = paste("r = ", round(cor_mean_FB, 2), sep = ""), hjust = 1)

## Warning: Removed 90 rows containing non-finite values (stat_smooth).

## Warning: Removed 90 rows containing missing values (geom_point).


Let's skip the extremely high values of FB friends. The correlation appears to get stronger:

e %>%
correlate %>%
focus(extra_mean) -> cor_mean_FB_filtered
cor_mean_FB_filtered

## # A tibble: 1 × 2
##              rowname extra_mean
##                <chr>      <dbl>

e %>%
qplot(x = extra_mean, y = n_facebook_friends, data = .) +
geom_smooth(method = "lm", se = FALSE) +
geom_label(x = 4, y = 1000, label = round(cor_mean_FB_filtered$extra_mean[1], 2), hjust = 1)


And now a scatterplot of extra_mean and n_party.

e %>%
select(extra_mean, n_party) %>%
qplot(x = extra_mean, y = n_party, data = .) +
geom_smooth(method = "lm", se = FALSE) +
geom_label(x = 4, y = 120, label = paste("r = ", corr_emean_criteria$extra_mean[2], sep = ""), hjust = 1)

## Warning: Removed 18 rows containing non-finite values (stat_smooth).

## Warning: Removed 18 rows containing missing values (geom_point).


Correlation of extra_description with external criteria

Conventional psychometric wisdom holds that a psychometric instrument ought to consist of several items. However, some renegades argue that "one-item wonders" can do an equally good job of predicting external criteria. That's no small claim, because if they do, there is little to argue in favor of the longer, and supposedly more sophisticated, instruments of psychometrics.

Of course, we are only playing around a little here, but let's see what happens:

• extra_description: One item self-description of extroversion
• extra_vignette: One item vignette (self-description) of extraversion

We start with extra_description:

e %>%
select(extra_description, time_conversation, n_party, n_hangover, n_facebook_friends, extra_mean) %>%
correlate %>%
focus(extra_description) %>%
ggplot(aes(x = rowname, extra_description)) +
geom_point() +
ggtitle("Correlation of extra_description with external criteria") +
ylab("correlation with extra_description")


Let's compare the correlation of extra_description with the external criteria to the correlation of extra_mean with the external criteria.

e %>%
select(extra_description, time_conversation, n_party, n_hangover, n_facebook_friends) %>%
correlate %>%
focus(extra_description) %>%
mutate(predictor = "extra_description") %>%
rename(correlation = extra_description) -> corr_extra_description

e %>%
select(extra_mean, time_conversation, n_party, n_hangover, n_facebook_friends) %>%
correlate %>%
focus(extra_mean) %>%
mutate(predictor = "extra_mean") %>%
rename(correlation = extra_mean) -> corr_extra_mean

rbind(corr_extra_description, corr_extra_mean) -> corr_compare

corr_compare %>%
rename(criterion = rowname) -> corr_compare


Now let’s plot the difference in correlation strength for the two predictors of the external behavioral criteria.

corr_compare %>%
ggplot +
aes(x = criterion, y = correlation, color = predictor, fill = predictor) +
geom_rect(ymin = -0.1, ymax = 0.1, xmin = -10, xmax = 10, fill = "grey80", alpha = .2, color = NA) +
geom_point(aes(shape = predictor), size = 3) +
theme(legend.position = "bottom") +
coord_flip()


Of interest: the correlation with the number of Facebook friends is quite different for extra_description compared with extra_mean. For the rest, the differences are not "too" strong.

Association of extra_vignette with external criteria

extra_vignette needs some upfront preparation:

count(e, extra_vignette)

## # A tibble: 4 × 2
##          extra_vignette     n
##                  <fctr> <int>
## 1                         119
## 2         keine Antwort    65
## 3       passt insgesamt   238
## 4 passt insgesamt nicht    79

recode(e$extra_vignette, "keine Antwort" = "no_answer", "passt insgesamt" = "fit", "passt insgesamt nicht" = "no_fit", .default = NA_character_) -> e$extra_vignette

count(e, extra_vignette)

## # A tibble: 4 × 2
##   extra_vignette     n
##           <fctr> <int>
## 1      no_answer    65
## 2            fit   238
## 3         no_fit    79
## 4             NA   119


Let’s compute the mean value for each group of extra_vignette:

e %>%
select(extra_vignette, extra_description, time_conversation, n_party, n_hangover, n_facebook_friends, extra_mean) %>%
filter(extra_vignette %in% c("fit", "no_fit")) %>%
group_by(extra_vignette) %>%
summarise_all(funs(mean), na.rm = TRUE) %>%
mutate_if(is.numeric, funs(round), digits = 1) -> comp_extra_vignette

comp_extra_vignette %>%
htmlTable

extra_vignette extra_description time_conversation n_party n_hangover n_facebook_friends extra_mean
1 fit 2.7 12.1 20.7 11.2 417 3
2 no_fit 4 23.7 13.5 6.7 274.7 2.6

Let’s plot that too.

comp_extra_vignette %>%
gather(key = variable, value = value, -extra_vignette) %>%
ggplot +
aes(x = variable, y = value, color = extra_vignette, fill = extra_vignette) +
geom_point() +
scale_y_log10() +
coord_flip() +
labs(y = "value [log10]")


We see that the fit dot is sometimes at a higher value (further to the right) than the no_fit dot, which matches expectations. But, as can be seen, this is not always the case. Of interest: extra_mean shows only small differences.

OK, but what's a big difference? Let's compute the CLES measure of effect size for each criterion (each column of the table above), with extra_vignette as the grouping variable.

First we prepare the data:

library(compute.es)

dplyr::count(e, extra_vignette)

e %>%
select(time_conversation, n_party, n_hangover, n_facebook_friends, extra_mean, extra_description, extra_vignette) %>%
filter(extra_vignette %in% c("fit", "no_fit")) %>%
na.omit() -> e3

e3 %>%
select(-extra_vignette) %>%
map(~t.test(. ~ e3$extra_vignette)) %>%
map(~compute.es::tes(.$statistic,
n.1 = nrow(dplyr::filter(e, extra_vignette == "fit")),
n.2 = nrow(dplyr::filter(e, extra_vignette == "no_fit")))) %>%
map(~do.call(rbind, .)) %>%
as.data.frame %>%
t %>%
data.frame %>%
rownames_to_column %>%
rename(outcomes = rowname) ->
extra_effsize


Then we plot ‘em.

extra_effsize %>%
dplyr::select(outcomes, cl.d) %>%
mutate(sign = ifelse(cl.d > 50, "+", "-")) %>%
ggplot(aes(x = reorder(outcomes, cl.d), y = cl.d, color = sign)) +
geom_hline(yintercept = 50, alpha = .4) +
geom_point(aes(shape = sign)) +
coord_flip() +
ylab("effect size (Common Language Effect Size)") +
xlab("outcome variables") +
labs(title = "CLES plot for differences between groups of 'extra_vignette'",
caption = "positive values are in favor of group 'fit'") -> CLES_plot

CLES_plot


Pooh, quite some fuzz. See this post for some explanation on what we have just done.

Now, what do we see? extra_description went plain wrong; extra_mean captures some information. We should note that some overcertainty may creep in here (and elsewhere): our estimates are based on sample values, not population values. So it is doubtful whether future samples (all we are interested in) will behave as predicted here.

Reliability

The only type of reliability we can compute is internal consistency, for we have no second measurement. It should be stressed that even if reliability is high, we do not know whether the measurement measured what we hoped for. If reliability is not high, then we do know that our measurement failed. So reliability can be seen as necessary, but not sufficient, for gaining confidence in an instrument. That's why validity (often read as correlation) is more telling: if the association is there, we probably measured what we wanted to measure; our measurement device behaved as expected.

e %>%
select(i01:i10) %>%
psych::alpha()

##
## Reliability analysis
## Call: psych::alpha(x = .)
##
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd
##       0.78      0.79    0.81      0.28 3.9 0.015  2.9 0.46
##
##  lower alpha upper     95% confidence boundaries
## 0.75 0.78 0.81
##
##  Reliability if an item is dropped:
##      raw_alpha std.alpha G6(smc) average_r S/N alpha se
## i01       0.75      0.76    0.77      0.26 3.2    0.017
## i02r      0.76      0.77    0.78      0.27 3.4    0.016
## i03       0.81      0.81    0.82      0.33 4.4    0.013
## i04       0.75      0.76    0.77      0.26 3.2    0.017
## i05       0.75      0.76    0.77      0.26 3.2    0.017
## i06r      0.76      0.77    0.78      0.27 3.4    0.016
## i07       0.76      0.77    0.79      0.27 3.4    0.016
## i08       0.76      0.78    0.79      0.28 3.5    0.016
## i09       0.77      0.78    0.79      0.28 3.6    0.016
## i10       0.78      0.79    0.80      0.30 3.8    0.015
##
##  Item statistics
##        n raw.r std.r r.cor r.drop mean   sd
## i01  499  0.69  0.71  0.68   0.60  3.3 0.68
## i02r 498  0.63  0.63  0.59   0.50  3.1 0.80
## i03  500  0.33  0.30  0.16   0.14  1.9 0.92
## i04  498  0.68  0.69  0.67   0.57  3.2 0.73
## i05  498  0.69  0.70  0.68   0.59  3.1 0.77
## i06r 499  0.62  0.62  0.57   0.50  2.9 0.77
## i07  498  0.62  0.62  0.57   0.50  3.0 0.73
## i08  499  0.61  0.60  0.53   0.47  2.9 0.86
## i09  499  0.55  0.57  0.49   0.43  3.4 0.71
## i10  498  0.51  0.49  0.39   0.34  2.2 0.90
##
## Non missing response frequency for each item
##         1    2    3    4 miss
## i01  0.01 0.08 0.46 0.45 0.00
## i02r 0.04 0.16 0.45 0.35 0.01
## i03  0.41 0.33 0.19 0.06 0.00
## i04  0.01 0.16 0.43 0.41 0.01
## i05  0.02 0.20 0.47 0.30 0.01
## i06r 0.05 0.18 0.54 0.22 0.00
## i07  0.02 0.21 0.54 0.23 0.01
## i08  0.06 0.23 0.44 0.27 0.00
## i09  0.01 0.10 0.41 0.47 0.00
## i10  0.21 0.43 0.26 0.10 0.01


Internal consistency, as an intuition, can be understood as something like the mean inter-item correlation (the numbers differ, but the idea is similar). Note that alpha is also a function of the number of items (the more items, the larger alpha will be, which is not a nice property). And even a high alpha is no sufficient evidence that the data are unidimensional.
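The dependence on the number of items can be seen directly in the formula for standardized alpha, $\alpha = k \bar{r} / (1 + (k - 1)\bar{r})$, with $k$ items and mean inter-item correlation $\bar{r}$. A tiny sketch (the function name is mine):

```r
# Standardized alpha from the number of items k and
# the mean inter-item correlation r_bar
alpha_std <- function(k, r_bar) k * r_bar / (1 + (k - 1) * r_bar)

alpha_std(k = 10, r_bar = 0.28)  # ~0.80, close to the std.alpha reported above
alpha_std(k = 20, r_bar = 0.28)  # ~0.89: more items, higher alpha, same r_bar
```

So doubling the scale length raises alpha substantially even though the items do not correlate any more strongly with each other.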

With the help of the manual of psych let’s dissect the output:

• raw_alpha: That’s the typical alpha you might expect (based on the covariances)
• average_r: Mean inter-item correlation
• S/N: signal-to-noise ratio (see the manual for details)
• se: standard error (precision of the estimate, in frequentist terms)
• raw.r: “The correlation of each item with the total score, not corrected for item overlap.”
• std.r: “The correlation of each item with the total score (not corrected for item overlap) if the items were all standardized”
• r.cor: “Item whole correlation corrected for item overlap and scale reliability”
• r.drop: “Item whole correlation for this item against the scale without this item”

Here’s some input on how to work with this function.

In sum, we may conclude that the internal consistency is good; item i03 appears somewhat problematic, as internal consistency would be higher without that guy.

Remember that i03 had the lowest mean? Fits together.

Validity, next step

Let’s go back to some “validity” concerns, as this step is more important than reliability.

Single-item external validity

Let's compute the correlation of each item with a given external criterion. If the correlation of a single item with the criterion is no worse than that of the total score, the instrument is in trouble.

e %>%
correlate %>%
mutate_if(is.numeric, round, digits = 2) %>%
rename(item = rowname, cor_with_n_FB_friends = n_facebook_friends) %>%
ggplot +
aes(x = reorder(item, cor_with_n_FB_friends), y = cor_with_n_FB_friends) +
geom_point() +
xlab("item") +
geom_hline(yintercept = cor_mean_FB, linetype = "dashed") +
geom_label(x = 5, y = cor_mean_FB, label = "correlation of mean score")


OK, the mean score is "somewhat" better than the most predictive single item, i07, but really, really not by much.

Dichotomization

I know, some will say "Don't do that, my goodness! Power!". Yes, that's true, but hey, let's lose some power! After all, we are being more conservative, in a way. And what do we get for this price? A quite tangible interpretation. More precisely, let's median-split extra_mean and n_facebook_friends.
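dplyr's ntile() does exactly this kind of split: it assigns each value to one of n (nearly) equally sized groups. A toy example:

```r
library(dplyr)

x <- c(1, 3, 5, 7, 9, 11)
ntile(x, 2)  # bucket 1 = below the median, bucket 2 = above

## [1] 1 1 1 2 2 2
```

With n = 2 and an even group split, this amounts to a median split.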

e %>%
mutate(e_bin = ntile(extra_mean, 2),
n_fb_bin = ntile(n_facebook_friends, 2)) %>%
na.omit %>%
plotluck(e_bin ~ n_fb_bin)


Hm, at least some association seems to be present.

Non-linear association of extraversion with FB friends

We could apply different models, not only the linear model (i.e., correlation), to discern possible patterns between the variables. This is an interesting field, but not my goal for today. Let's dive in some other day; a while ago we used some "machine learning" methods for this purpose in this paper.

Finer categorial analysis

The idea behind the dichotomization can be taken to a finer level of granularity. Why 2 buckets (bins)? Why not a few more? That would give us a more precise picture. Additionally, it may be argued that no measurement device exists with infinite granularity, which is what continuous variables posit (see Briggs' book on that).

Of course, many ways exist to bin a metric variable into discrete buckets, including Sturges' rule or the Freedman–Diaconis rule (see here for some explanation). But let's go easy and pick, arbitrarily, ten buckets per variable of equal width (not of equal n).
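The difference between equal-width and equal-n buckets is easy to see on a skewed toy vector. Note how cut_interval() leaves the middle bucket empty, while ntile() fills every bucket:

```r
library(ggplot2)  # cut_interval() lives here
library(dplyr)    # ntile()

x <- c(1, 2, 2, 3, 10)

cut_interval(x, n = 3)  # equal width: [1,4] (4,7] (7,10]; (4,7] stays empty
ntile(x, n = 3)         # equal n (as far as possible): 1 1 2 2 3
```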

e %>%
select(extra_mean, n_facebook_friends) %>%
na.omit %>%
mutate_all(cut_interval, n = 10) %>%
count(extra_mean, n_facebook_friends) %>%
ggplot(aes(x = extra_mean, y = n_facebook_friends, fill = n)) +
geom_tile()


One problem with this method here is that empty buckets are silently dropped which renders the procedure fruitless for our purpose.

Let's try a different tack: equal-n buckets, which also defuse the extreme values.

e %>%
select(extra_mean, n_facebook_friends) %>%
na.omit %>%
mutate_all(ntile, n = 10) %>%
count(extra_mean, n_facebook_friends) %>%
ggplot(aes(x = factor(extra_mean), y = factor(n_facebook_friends), fill = n)) +
geom_tile()


Ah, that's way better. Well, we still do not see much, but at least the (absolute) counts are not equal across the buckets.

But even better would be if we could devise the probability of $y$ given $x$. More verbosely: if the probability of observing some value of $X$ given that $Y$ is, say, 5 equals the probability of that value of $X$ given any other value of $Y$ (formally, $p(X = x \mid Y = y)$ does not depend on $y$), then we will say that these two variables are not related, or, more precisely, independent.
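Independence is easy to check on a toy contingency table: if the column-wise conditional probabilities are identical across columns, the two variables are independent. A base-R sketch with made-up counts:

```r
# Toy 2x2 table of counts; rows = values of Y, columns = values of X
tab <- matrix(c(20, 30,    # column X = 1
                40, 60),   # column X = 2
              nrow = 2)

prop.table(tab, margin = 2)  # p(Y = y | X = x), one column per x

##      [,1] [,2]
## [1,]  0.4  0.4
## [2,]  0.6  0.6
```

Both columns are identical, so knowing $X$ tells us nothing about $Y$; with real data, the question is how far the columns drift apart.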

Let’s try it this way:

e %>%
select(extra_mean, n_facebook_friends) %>%
na.omit %>%
mutate_all(ntile, n = 10) %>%
count(extra_mean, n_facebook_friends) %>%
group_by(n_facebook_friends) %>%   # proportions within each row
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = factor(extra_mean), y = factor(n_facebook_friends), fill = prop)) +
geom_tile()


Here we summed up rowwise. Let’s compare to what happens if we sum up colwise.

e %>%
select(extra_mean, n_facebook_friends) %>%
na.omit %>%
mutate_all(ntile, n = 10) %>%
count(extra_mean, n_facebook_friends) %>%
group_by(extra_mean) %>%   # proportions within each column
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = factor(extra_mean), y = factor(n_facebook_friends), fill = prop)) +
geom_tile()


Hm, maybe the association is too weak to see with that many buckets. Let's halve the number of buckets (5 buckets) and check the picture again.

e %>%
select(extra_mean, n_facebook_friends) %>%
na.omit %>%
mutate_all(ntile, n = 5) %>%
count(extra_mean, n_facebook_friends) %>%
group_by(extra_mean) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = factor(extra_mean), y = factor(n_facebook_friends), fill = prop)) +
geom_tile()


This level of granularity seems more appropriate. Some association can be eyeballed. The conditional probabilities $p(Y = y \mid X = x)$ are not equal, thus we can say some dependencies exist. Note that this analysis is, in a way, more concrete than the correlation. Why? For two reasons. First, we speak about real observations, not about abstract parameters such as $r$ (see James Grice's ideas on Observation Oriented Modeling). Second, we do not shrink the information down to one number only, but to a few more (here 25 = 5 × 5), an amount of information that can still easily be taken in by eye.

There is a wealth of categorical data analysis tools available, notably Agresti's work. Thus, much more could be said and done.

CLES

The "common language effect size" (CLES) is an interesting approach to translate effect sizes, such as $r$, into a more tangible language. Basically, CLES tells us: if we were to draw two observations, one from each of two groups (here: high vs. low extraversion), what is the probability that the observation from the "experimental" group (or reference group, more generally) shows the higher value in the outcome variable? Note that this is not the same as a Bayesian predictive distribution.

An R function for that purpose exists, but see this great post for more in depth info.
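For intuition, CLES can also be hand-rolled in a few lines of base R: compare every cross-group pair and count how often the reference-group member wins. A sketch with made-up scores (ties here count as half a win, one common convention):

```r
# CLES by brute force: proportion of cross-group pairs won by `high`
cles <- function(high, low) {
  mean(outer(high, low, ">") + 0.5 * outer(high, low, "=="))
}

high_extra <- c(3, 4, 5, 4)  # made-up scores, "high extraversion" group
low_extra  <- c(2, 3, 3, 1)  # made-up scores, "low extraversion" group
cles(high_extra, low_extra)

## [1] 0.9375
```

So for these toy numbers, a randomly drawn member of the high group beats a randomly drawn member of the low group about 94% of the time.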

Final thoughts

The central idea of this post was to see whether the mean score is more predictive of some criterion than single items are. I argue that validity is more important than reliability; yet reliability (Cronbach's alpha) is typically reported in (applied) papers, while validity is reported to a lesser degree.

Second, we explored (a bit) the notion that an association may be better captured in several numbers than in the single number the correlation coefficient provides. This idea is based on the definition of independence in probability theory.

We did not dive into factor-analytic waters here, nor did we discuss Item Response Theory, among other ideas left out. These ideas are certainly worthwhile, but beyond the primary aims of this post.

Related work

There's a lot on scale validation on the internet; see here for an example. For skepticism about psychometric scales, and for advocates of single-item scales, see here or here, and here for a broader discussion.

SessionInfo

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] compute.es_0.2-4 plotluck_1.1.0   psych_1.6.9      corrr_0.2.1
##  [5] htmlTable_1.7    dplyr_0.5.0      purrr_0.2.2.9000 readr_1.0.0
##  [9] tidyr_0.6.0      tibble_1.2       ggplot2_2.2.0    tidyverse_1.0.0
## [13] knitr_1.15
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7         magrittr_1.5        mnormt_1.5-5
##  [4] munsell_0.4.3       colorspace_1.3-0    R6_2.2.0
##  [7] highr_0.6           stringr_1.1.0       plyr_1.8.4
## [10] tools_3.3.2         parallel_3.3.2      grid_3.3.2
## [13] gtable_0.2.0        DBI_0.5-1           htmltools_0.3.5
## [16] yaml_2.1.14         lazyeval_0.2.0.9000 rprojroot_1.1
## [19] digest_0.6.10       assertthat_0.1      RColorBrewer_1.1-2
## [22] evaluate_0.10       rmarkdown_1.1.9016  labeling_0.3
## [25] stringi_1.1.2       scales_0.4.1        backports_1.0.4
## [28] foreign_0.8-67
`