# Dataset 'performance in stats test'

This post shows data cleaning and preparation for a dataset on a statistics test (NHST inference). The data are published under a CC licence, see here.

The data were collected from 2015 to 2017 in statistics courses at the FOM university at various locations in Germany. Several colleagues helped to collect the data. Thanks a lot! Now let’s enjoy the outcome (and make it freely available to all).

Raw N is 743. The test consists of 40 items framed as propositions; students are asked to respond with either “true” or “false” to each item. In addition, students are asked for a self-rating of the proportion of correct responses, for their study time, and for their interest in the subject. The last column notes the number (proportion) of correct responses.

Raw data can be accessed here

df_raw <- read.csv("https://sebastiansauer.github.io/data/test_inf_raw.csv")



DOI of this data is: DOI 10.17605/OSF.IO/SJHUY, URL: https://osf.io/sjhuy

Some packages:

library(tidyverse)


To make the data set less clumsy, let’s replace the lengthy variable names with short ones.

test_inf_names <- names(df_raw)
df <- df_raw
names(df) <- paste("V", 1:ncol(df_raw), sep = "_")

df <- rename(df, study_time = V_43, self_eval = V_44, interest = V_45)



The vector test_inf_names can now serve as a codebook; the variable names are stored there.

The correct answers for the 40 questions are:

correct <- c(
T, T, T, T, F,
F, F, F, F, T,
T, T, F, T, T,
T, T, T, T, F,
F, T, T, F, T,
F, F, F, T, F,
T, T, F, T, T,
F, T, T, T, F
)


We can now compare the actual answers to the correct ones for each respondent. Let’s leave that for later :-) What’s more, the items (questions) have changed over time. Unfortunately, the software used for delivering the survey (Google Forms) does not save the history (and I did not really either, I admit). So I am not 100% sure whether the “solution vector” above is correct for each and every (older) respondent.
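Such a comparison could be sketched as follows. This is only a sketch: it assumes (unchecked against the raw data) that the item responses sit in columns V_1 to V_40 and are coded as the strings "TRUE"/"FALSE".

```
# Sketch only: score respondents against the solution vector `correct`.
# Assumption (unverified): item responses live in columns V_1 ... V_40
# and are coded as the strings "TRUE" / "FALSE".
answers <- df[, paste("V", 1:40, sep = "_")] == "TRUE"   # logical matrix
solution <- matrix(correct, nrow = nrow(answers), ncol = 40, byrow = TRUE)
df$score_manual <- rowSums(answers == solution, na.rm = TRUE)
```

The row sums count, per respondent, how many responses match the solution vector.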

Instead, in this first step, let’s focus for brevity on the data where the solution is already provided. This information is stored in V_46 (Punkte, i.e., points). Let’s convert that string to a number.

library(readr)
library(stringr)
df$score <- str_sub(df$V_46, start = 1, end = 2)
df$score <- as.numeric(df$score)
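Note that str_sub grabs the first two characters only. A more robust alternative would be readr’s parse_number, which extracts the first number from a string (the example string format below is an assumption, not taken from the raw data):

```
# More robust alternative: extract the leading number from strings
# such as "31 / 40" (the exact string format is an assumption here)
df$score <- readr::parse_number(as.character(df$V_46))
```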


Out of curiosity, let’s look at the distribution of the score.

ggplot(df, aes(x = score)) +
geom_histogram()
#> stat_bin() using bins = 30. Pick better value with binwidth.
#> Warning: Removed 443 rows containing non-finite values (stat_bin).


Note that the NAs are not shown. If a given student didn’t know anything but flipped a coin for each answer, what’s the probability of getting x correct answers? In other words, in a sequence of 40 coin flips, what’s the probability of getting at most x right?

x <- 0:40
cdf <- pbinom(x, 40, prob = .5)

temp <- tibble(x, cdf)

ggplot() +
geom_rect(aes(xmin = 15, xmax = 25, ymin = 0, ymax = 1), fill = "red", alpha = .3) +
geom_line(data = temp, aes(x = x, y = cdf) )



The diagram shows the probability of getting not more than x answers right. From a crude glance, say, 15 to 25 correct answers are quite compatible with chance (coin tossing).
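Such bounds can also be computed directly from the binomial distribution; for instance, the central 95% region of correct answers under pure guessing:

```
# Central 95% region of correct answers for 40 fair coin flips
qbinom(c(.025, .975), size = 40, prob = .5)
```

Values outside this region would be rather surprising under a pure-guessing model.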

Let’s extract these and see how many cases are left.

df %>% filter(!is.na(score)) %>% nrow
#> [1] 321
df %>% filter(score > 16) %>% nrow
#> [1] 306


As the initial score variable was V_46 (Punkte), let’s see how many non-missing values we had there.

df %>% filter(!is.na(V_46)) %>% nrow
#> [1] 764


OK, let’s assume all responses > 15 are “real” trials, not just guessing and clicking.

df <- filter(df, score > 15)


# Association of study time and score

Now, what has been bothering me for years is whether (and how strongly) score and study time are associated. Finally, let’s take a look.


r1 <- round(cor(df$study_time, df$score, use = "complete.obs"), 2)

p1 <- ggplot(df) +
aes(x = study_time, y = score) +
geom_jitter() + geom_smooth() +
annotate(label = paste("r = ", r1), geom = "label", x = 4, y = 20, hjust = 0)

p1
#> geom_smooth() using method = 'loess' and formula 'y ~ x'
#> Warning: Removed 68 rows containing non-finite values (stat_smooth).
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : pseudoinverse used at 3
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : neighborhood radius 1
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : reciprocal condition number 0
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
#> at 3
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
#> 1
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : reciprocal
#> condition number 0
#> Warning: Removed 68 rows containing missing values (geom_point).

library(ggExtra)
#ggMarginal(p1)


And the correlation is 0.441; hey, that’s quite strong! Let’s also check the ordinal correlation:

cor(df$study_time, df$score, method = "spearman", use = "complete.obs")
#> [1] 0.452
cor(df$study_time, df$score, method = "kendall", use = "complete.obs")
#> [1] 0.349


Given some measurement error, it can be speculated that the real, unattenuated correlation is quite substantial indeed.

Maybe also have a look at boxplots, as study time is not really metric:

ggplot(df) +
  aes(x = factor(study_time), y = score) +
  geom_boxplot()


# Association of interest and score

Similarly, if one is interested in the subject, does one score higher?

r2 <- round(cor(df$interest, df$score, use = "complete.obs"), 2)

p2 <- ggplot(df) +
  aes(x = interest, y = score) +
  geom_jitter() + geom_smooth() +
  annotate(label = paste("r = ", r2), geom = "label", x = 4, y = 20, hjust = 0)

p2
#> geom_smooth() using method = 'loess' and formula 'y ~ x'
#> Warning: Removed 68 rows containing non-finite values (stat_smooth).
#> Warning: Removed 68 rows containing missing values (geom_point).

#ggMarginal(p2)


Weaker, but detectable.

# Association of self-evaluation and score

Well, if I think I will score poorly (superbly), will I do so? Does my self-image match the actual outcome?

r3 <- round(cor(df$self_eval, df$score, use = "complete.obs"), 2)

p3 <- ggplot(df) +
  aes(x = self_eval, y = score) +
  geom_jitter() + geom_smooth() +
  annotate(label = paste("r = ", r3), geom = "label", x = 8, y = 20, hjust = 0)

p3
#> geom_smooth() using method = 'loess' and formula 'y ~ x'
#> Warning: Removed 68 rows containing non-finite values (stat_smooth).
#> Warning: Removed 68 rows containing missing values (geom_point).
#ggMarginal(p3)


Oh, that’s strong; folks know when they’ll nail it. Good.

# Correlation matrix

Finally, let’s look at the correlation matrix of the variables mentioned above.

library(corrr)
df %>%
  select(score, study_time, interest, self_eval) %>%
  correlate %>%
  shave -> r_mat

r_mat
#> # A tibble: 4 × 5
#>      rowname score study_time interest self_eval
#>        <chr> <dbl>      <dbl>    <dbl>     <dbl>
#> 1      score    NA         NA       NA        NA
#> 2 study_time 0.441         NA       NA        NA
#> 3   interest 0.223      0.461       NA        NA
#> 4  self_eval 0.628      0.559     0.36        NA

corrr::rplot(r_mat)


A scatter plot matrix can be of interest, too.

library(GGally)
df %>%
  select(score, study_time, interest, self_eval) %>%
  ggpairs(
    lower = list(
      continuous = "smooth",
      combo = "facetdensity"
    ))
#> Warning in (function (data, mapping, alignPercent = 0.6, method =
#> "pearson", : Removed 68 rows containing missing values
#> Warning: Removed 68 rows containing non-finite values (stat_smooth).
#> Warning: Removed 68 rows containing missing values (geom_point).
#> Warning: Removed 68 rows containing non-finite values (stat_density).
#> Warning in (function (data, mapping, alignPercent = 0.6, method =
#> "pearson", : Removed 68 rows containing missing values
#> Warning: Removed 68 rows containing non-finite values (stat_smooth).
#> Warning: Removed 68 rows containing missing values (geom_point).
#> Warning: Removed 68 rows containing non-finite values (stat_density).


# Debrief

As a teacher, I feel reassured that study time is associated with test performance.

Processed data can also be downloaded, here.

# Convert logit to probability

Logistic regression may give a headache initially. While the structure and idea are the same as in “normal” regression, the interpretation of the b’s (i.e., the regression coefficients) can be more challenging.

This post provides a convenience function for converting the output of the glm function to a probability. Or, more generally, for converting logits (that’s what glm spits out) to a probability.

Note 1: The objective of this post is to explain the mechanics of logits. There are more convenient tools out there. (Thanks to Jack’s comment, which made me add this note.)

Note 2: I have corrected an error pointed out by Jana’s comment, below (you can always check older versions on the Github repo).

# Example time

So, let’s look at an example. First load some data (the package needs to be installed!):

data(titanic_train, package = "titanic")
d <- titanic_train # less typing


Compute a simple glm:

glm1 <- glm(Survived ~ Pclass, data = d, family = "binomial")


The coefficients are the interesting thing:

coef(glm1)

## (Intercept)      Pclass 
##   1.4467895  -0.8501067


These coefficients are in a form called ‘logits’.
# Takeaway

If a coefficient (logit) is positive, the effect of this predictor (on the survival rate) is positive, and vice versa. Here, the Pclass coefficient is negative, indicating that the higher Pclass is, the lower the probability of survival.

# Conversion rule

To convert a logit (glm output) to a probability, follow these 3 steps:

- Take the glm output coefficient (logit).
- Compute the e-function on the logit using exp() to “de-logarithmize” it (you’ll get odds then).
- Convert the odds to a probability using the formula prob = odds / (1 + odds). For example, say odds = 2/1; then the probability is 2 / (1 + 2) = 2/3 (~.67).

# R function to rule ‘em all (ahem, to convert logits to probability)

This function converts logits to probability.

logit2prob <- function(logit){
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}


For convenience, you can source the function like this:

source("https://sebastiansauer.github.io/Rcode/logit2prob.R")


For our glm:

logit2prob(coef(glm1))

## (Intercept)      Pclass 
##   0.8095038   0.2994105


# How to convert logits to probability

How to interpret:

- The survival probability is 0.8095038 if Pclass were zero (intercept).
- However, you cannot just add the probability of, say, Pclass == 1 to the survival probability of Pclass == 0 to get the survival chance of 1st-class passengers.

Instead, consider that the logistic regression can be interpreted as a normal regression as long as you use logits. So the general regression formula applies as always:

y = intercept + b*x

That is, in our example:

logits_survival = intercept + b_survival*Pclass

where b_survival is given in logits (it’s just the b-coefficient of Pclass).
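The conversion also works in reverse. Here is a minimal inverse of logit2prob (my own sketch, not part of the original function; it mirrors R’s built-in qlogis(), just as logit2prob mirrors plogis()):

```
# Sketch: convert a probability back to a logit (inverse of logit2prob)
prob2logit <- function(prob){
  odds <- prob / (1 - prob)
  logit <- log(odds)
  return(logit)
}

prob2logit(0.5)
## [1] 0
```

A 50% probability corresponds to odds of 1:1, hence to a logit of 0.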
So it’s simple to calculate by hand, e.g., the survival logits for a 2nd-class passenger:

(intercept <- coef(glm1)[1])

## (Intercept) 
##     1.44679

(b_survival <- coef(glm1)[2])

##     Pclass 
## -0.8501067

(logits_survival <- intercept + 2 * b_survival)

## (Intercept) 
##   -0.253424


Thus, the logits of survival are -0.25. Now we can convert to a probability:

logit2prob(logits_survival)

## (Intercept) 
##   0.4369809


# Lumping logits to probability

Remember that $e^1 \approx 2.71$. That is, if your logit is 1, your odds will be approx. 2.7 to 1, so the probability is 2.7 / 3.7, or about 3/4, 75%.

Similarly important, $e^0 = 1$. Hence, your odds will be 1:1, i.e., 50%.

Hence, whenever your logit is negative, the associated probability is below 50%, and vice versa (positive logit <–> probability above 50%).

# Predict as convenience function

However, it would be more convenient to use the predict function instead of working with the glm coefficients by hand; this post is aimed at explaining the idea. In practice, rather use:

predict(glm1, data.frame(Pclass = 1), type = "response")

##        1 
## 0.644897


In the 1st class, the survival chance is ~65%, and for the 2nd class it is about 44%.

predict(glm1, data.frame(Pclass = 2), type = "response")

##         1 
## 0.4369809


# Conversion table

Here’s a look-up table for the conversion:

logit_seq <- seq(-10, 10, by = 2)
prob_seq <- round(logit2prob(logit_seq), 3)
df <- data.frame(Logit = logit_seq, Probability = prob_seq)

library(htmlTable)
htmlTable(df)

| Logit | Probability |
|------:|------------:|
|   -10 |       0     |
|    -8 |       0     |
|    -6 |       0.002 |
|    -4 |       0.018 |
|    -2 |       0.119 |
|     0 |       0.5   |
|     2 |       0.881 |
|     4 |       0.982 |
|     6 |       0.998 |
|     8 |       1     |
|    10 |       1     |

A handy function is datatable, though it appears not to work in this environment.

library(DT)
datatable(df)


The package mfx provides convenient functions for getting odds out of a logistic regression (thanks to Henry Cann’s comment for pointing that out!).

# Conversion plot

More convenient for an overview is a plot like this:
library(ggplot2)

logit_seq <- seq(-10, 10, by = .1)
prob_seq <- logit2prob(logit_seq)
rm(df)
df <- data.frame(Logit = logit_seq, Probability = prob_seq)

ggplot(df) +
  aes(x = logit_seq, y = prob_seq) +
  geom_point(size = 2, alpha = .3) +
  labs(x = "logit", y = "probability of success")


# Takeaway

The relationship between logit and probability is not linear, but of s-curve type. The coefficients in logit form can be treated as in normal regression in terms of computing the y-value. Transform the logit of your y-value to probability to get a sense of the probability of the modeled event.

Happy glming!

# sessionInfo

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.4
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 DT_0.2        htmlTable_1.9
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10        knitr_1.16          magrittr_1.5       
##  [4] munsell_0.4.3       colorspace_1.3-2    R6_2.2.1           
##  [7] rlang_0.1           plyr_1.8.4          stringr_1.2.0      
## [10] highr_0.6           dplyr_0.5.0         tools_3.3.2        
## [13] grid_3.3.2          webshot_0.4.0       gtable_0.2.0       
## [16] checkmate_1.8.2     DBI_0.6-1           htmltools_0.3.6    
## [19] yaml_2.1.14         lazyeval_0.2.0.9000 assertthat_0.2.0   
## [22] rprojroot_1.2       digest_0.6.12       tibble_1.3.0.9002  
## [25] htmlwidgets_0.8     rsconnect_0.8       evaluate_0.10      
## [28] rmarkdown_1.5       labeling_0.3        stringi_1.1.5      
## [31] scales_0.4.1        backports_1.0.5     jsonlite_1.4


# The two ggplot2 ways of plotting bars

Bar plots, while not appropriate for means, are helpful for conveying impressions of frequencies, particularly relative frequencies, i.e., proportions.

Intuition: Bar plots and histograms alike can be thought of as piles of Lego pieces put onto each other, where each Lego piece represents (is) one observation.
Presenting tables of frequencies is often not insightful to the eye. Bar plots are often much more accessible and present the story more clearly.

In R, using ggplot2, there are basically two ways of plotting bar plots (as far as I know). First, we can just take a data frame in its raw form and let ggplot2 count the rows, so as to compute frequencies, and map that to the height of the bar. Second, we can do the computation of frequencies ourselves and just give the condensed numbers to ggplot2. Let’s look at each of the two ways in turn.

# Way one: Give raw, unprocessed data to ggplot

First, let’s give the raw (unprocessed) data frame to ggplot2 and let ggplot2 compute frequencies. For example:

data(Affairs, package = "AER") # package must be installed upfront

library(ggplot2)
library(dplyr)

ggplot(Affairs) +
  aes(x = rating, fill = factor(affairs)) +
  geom_bar(position = "fill")


# Recode: 0 -> "Iron Heart"; else -> "Heart Breaker"
Affairs$halodrie <- recode(Affairs$affairs, `0` = "Iron Heart", .default = "Heart Breaker")

ggplot(Affairs) +
  aes(x = rating, fill = factor(halodrie)) +
  geom_bar(position = "fill")


Data come from Fair (1978): Fair, R.C. (1978). A Theory of Extramarital Affairs. Journal of Political Economy, 86, 45–61.

Some notes. We use fill = ... to say “put the number of ‘Heart Breakers’ and ‘Iron Hearts’ onto each other”. So the columns for ‘Heart Breakers’ and for ‘Iron Hearts’ pile up on each other.

Of importance, position = "fill" (note the quotation marks) will make “proportion bars”, so the height of the bar goes up to 100%. This is obviously only of use if different colors indicate the relative height (frequencies) of the categories in the bar.

Don’t forget to make the filling variable of type factor, otherwise the filling can go awry.
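If dplyr’s recode feels unfamiliar, base R’s ifelse does the same two-level recoding (an equivalent sketch):

```
# Same recoding without dplyr: 0 affairs -> "Iron Heart", else -> "Heart Breaker"
Affairs$halodrie <- ifelse(Affairs$affairs == 0, "Iron Heart", "Heart Breaker")
```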
Also note that you cannot make use of qplot here, as qplot does not support the position parameter (see the documentation; qplot is designed for easy, quick plotting):

qplot(data = Affairs, x = rating, fill = halodrie, geom = "bar", position = "fill")

## Warning: position is deprecated


# Way two: Summarise the frequencies upfront, and give the summaries to ggplot

Way two basically consists of computing the frequencies oneself and then handing the frequencies per category/group over for plotting. See:

A2 <- count(Affairs, rating, halodrie)
A2 <- group_by(A2, rating)
A2 <- mutate(A2, prop = n / sum(n))


Now we hand over A2, the data frame consisting of summarized data (frequencies), to qplot:

qplot(x = rating, y = prop, fill = halodrie, data = A2, geom = "col")


Easy, right? Note that geom col (as in column) expects single numbers; it does not count rows (as geom_bar does). In effect, geom_col(...) is geom_bar(stat = "identity", ...). As our proportions sum up to 100%, we do not need to care about telling ggplot about relative-height plotting; it works out automatically.

To the contrary, geom_bar won’t work:

qplot(x = rating, fill = halodrie, data = A2, geom = "bar")


Translating qplot to ggplot is easy enough:

ggplot(A2) +
  aes(x = rating, y = prop, fill = halodrie) +
  geom_col()


Some polishing:

qplot(x = rating, y = prop, fill = halodrie, geom = "col", data = A2) +
  theme(legend.position = "bottom") +
  labs(title = "Halodrie-status \nacross different levels of marital satisfaction") +
  theme(plot.title = element_text(hjust = 0.5))


Now suppose we do not want to show the proportions, i.e., the bars should not have height 100% but the real count-height. position = "stack" will accomplish that.

ggplot(A2) +
  geom_col(aes(x = rating, y = n, fill = halodrie), position = "stack")


Note that this view makes it more difficult to tell in which column the proportion of “Heart Breakers” is high(er).
In qplot:

qplot(data = A2, x = rating, y = n, fill = halodrie, geom = "col")


Happy plotting!

# Gentle intro to 'R-squared equals squared r'

It comes as no surprise that $R^2$ (“coefficient of determination”) equals $r^2$ in simple regression (predictor X, criterion Y), where $r(X,Y)$ is Pearson’s correlation coefficient. $R^2$ equals the fraction of explained variance in a simple regression. However, the statistical (mathematical) background is often less clear or buried in less intuitive formulas.

The goal of this post is to offer a gentle explanation of why $R^2 = r^2$, where $r$ is $r(Y,\hat{Y})$ and $\hat{Y}$ are the predicted values.

Some more notation: $Y_i$ are the individual values of the criterion, and $\bar{Y}$ is the mean of the criterion.

As an example, suppose we wanted to predict the grades ($Y$) of some students based on their studying time ($X$). Of course, our prediction won’t be perfect most of the time, so there will be gaps (errors or residuals) between what we predict and what the actual grade in some particular case $Y_i$ will be. So $Y_i - \hat{Y_i} \ne 0$, or, similarly, $(Y_i - \hat{Y_i})^2 > 0$.

Let’s start with the idea that the variance is additive; see here or here for details. The additivity of variance implies that the whole variation in the students’ grades is the sum of the variance of our prediction errors and the improvement due to our prediction/regression model:

$\sum_i{(Y_i - \bar{Y})}^2 = \sum_i{(Y_i - \hat{Y_i})}^2 + \sum_i{(\hat{Y_i} - \bar{Y})}^2$.

In words: Total = Error^2 + Improvement^2. Or: The total variance in simple regression equals the error variance plus the variance explained by the model.

This expression can be rearranged to

1 = Error^2 / Total + Improvement^2 / Total.

As a formula:

$1 = \frac{\sum_i{(Y_i - \hat{Y_i})}^2}{\sum_i{(Y_i - \bar{Y})}^2} + \frac{\sum_i{(\hat{Y_i} - \bar{Y})}^2}{\sum_i{(Y_i - \bar{Y})}^2}$.

Puh, quite some stuff.
Now we apply a property of least squares regression stating that there is no correlation between the residuals ($e$) and the predicted values (we don’t discuss that property here further). More formally: $r(Y_i - \hat{Y_i}, \hat{Y_i}) = 0$. A bit more readable: $r(e, \hat{Y}) = 0$.

Now we write the (sample) correlation between actual (observed) and predicted (fitted) values as follows:

$r(Y, \hat{Y}) = \frac{n^{-1}\sum_i{(Y_i - \bar{Y})(\hat{Y_i} - \bar{Y})}}{\sqrt{n^{-1}\sum_i{(Y_i - \bar{Y})^2}} \cdot \sqrt{n^{-1}\sum_i{(\hat{Y_i} - \bar{Y})^2}}}$

Note that the $n^{-1}$ cancel out, leaving

$r(Y, \hat{Y}) = \frac{\sum_i{(Y_i - \bar{Y})(\hat{Y_i} - \bar{Y})}}{\sqrt{\sum_i{(Y_i - \bar{Y})^2} \cdot \sum_i{(\hat{Y_i} - \bar{Y})^2}}}$

The variance can be seen as the average of the squared sums (SS) of some delta (d): $Var(X) = n^{-1}\sum{d}$. If we on our way “lose” the $n^{-1}$ part (as happened in the line above), we are left with the sums of squares (SS), as they are called. Now, as we are concerned with the error variance, the variance of the model improvement, and the total variance, we can speak in the same way of the corresponding SS-terms. We will come back to the SS-terms later on.

Now a little trick. We add the term $-\hat{Y_i} + \hat{Y_i}$, which is zero, so no harm caused. We get:

$r(Y, \hat{Y}) = \frac{\sum_i{(Y_i - \hat{Y_i} + \hat{Y_i} - \bar{Y})(\hat{Y_i} - \bar{Y})}}{\sqrt{SS_T \cdot SS_M}}$

As the denominator did not change, we just abbreviated it in the obvious way ($SS_T$ for the total, $SS_M$ for the model sum of squares).

Now stay close. Observe that we have three terms in the left bracket of the numerator; let’s call $Y_i - \hat{Y_i}$ the term $a$, $\hat{Y_i} - \bar{Y}$ the term $b$, and the right bracket $c$. We can multiply that out, resulting in $(a+b)c = ac + bc$. So:

$r(Y, \hat{Y}) = \frac{\sum_i{(Y_i - \hat{Y_i})(\hat{Y_i} - \bar{Y})} + \sum_i{(\hat{Y_i} - \bar{Y})(\hat{Y_i} - \bar{Y})}}{\sqrt{SS_T \cdot SS_M}}$

Note that the numerator now is of the form $ac + cc$ (as $b = c$). Thus we can further simplify to

$r(Y, \hat{Y}) = \frac{\sum_i{(Y_i - \hat{Y_i})(\hat{Y_i} - \bar{Y})} + \sum_i{(\hat{Y_i} - \bar{Y})^2}}{\sqrt{SS_T \cdot SS_M}}$

Referring back to our notion of the two parts of the total variance in regression, we can think of the numerator above as $\sum{(e \cdot m + m^2)}$, where $e$ indicates the error or residual term, and $m$ the model improvement. However, as the correlation between $m$ and $e$ (which is zero, see above) is nothing but the average product of $m$ and $e$, their sum also is zero. Thus we are left with

$r(Y, \hat{Y}) = \frac{SS_M}{\sqrt{SS_T \cdot SS_M}}$

Think of the numerator as $a$ and then use $a = \sqrt{a^2}$, thus

$r(Y, \hat{Y}) = \frac{\sqrt{SS_M \cdot SS_M}}{\sqrt{SS_T \cdot SS_M}}$

After canceling out the obvious we are left with

$r(Y, \hat{Y}) = \sqrt{\frac{SS_M}{SS_T}}$

This term is nothing but the square root of Model variance / Total variance, so squaring it yields $r^2 = R^2$. And we are there!

# To summarise

OK, quite some ride. At least we arrived where we headed to.
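The result can be double-checked numerically in R (a quick sketch with simulated data; any bivariate dataset would do):

```
# Verify numerically that R^2 from lm() equals r(Y, Y-hat)^2
set.seed(42)
x <- rnorm(50)              # simulated predictor (study time)
y <- 0.5 * x + rnorm(50)    # simulated criterion (grades)
fit <- lm(y ~ x)

all.equal(summary(fit)$r.squared, cor(y, fitted(fit))^2)
## [1] TRUE
```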
Some algebra kung-fu turned out to be handy. It took me some time to decipher the algebra without any explanation, so I hope my comments make the voyage a bit shorter for you. Without some expertise in algebra (multiplying out, adding zero-terms, …) and some knowledge of regression properties, we would not have arrived at the end. The good news is that each of the steps is not really difficult to understand (I am not saying it is easy to discover these steps).

# Case study (YACSDA) in practical data analysis with dplyr

A case study in data analysis using the R package dplyr (originally published in German).

# Practical data analysis with dplyr

The R package dplyr by Hadley Wickham is a star guest on the R show stage, frequently discussed in the relevant forums. With dplyr you can “wrangle” data, that is, reshape and prepare it; “practical data analysis” is perhaps a good label. Many introductions can be found online, e.g., here or here. This text is not meant as an introduction or explanation, but as an exercise for trying out (newly acquired) skills in practical data analysis in the context of a case study.

Imagine the following scenario for the case study: You are a consultant at a (by its own account) renowned firm. Your first assignment takes you straight to New York City (naturally). Your boss somehow has the impression that you have a thing for numbers and computers… “Say, ever heard of R?”, he asks one evening (10 pm, still hard at work). “A programming language for data analysis and visualization”, you answer like a shot. Your boss likes that. “Listen. By tomorrow I need an analysis of all flights from NYC: counts, by origin, by destination… You know the drill…”. Of course you know… “Eight o’clock tomorrow morning will be fine”, he adds before leaving the office.
OK, so we should not lose any time…

# Tasks (and solutions)

First, let’s load the necessary packages and data; remember that R packages must be installed (once) before you can load them.

# install.packages("nycflights13")
library(dplyr)
library(ggplot2) # diagrams
data(flights, package = "nycflights13")


## How many flights departed from the NYC airports in 2013?

flights %>%
  summarise(n_flights = n())

## # A tibble: 1 × 1
##   n_flights
##       <int>
## 1    336776


Ah, quite a lot :-).

## Which airports are there in NYC? How many flights departed from each of them?

flights %>%
  group_by(origin) %>%
  summarise(Anzahl = n())

## # A tibble: 3 × 2
##   origin Anzahl
##    <chr>  <int>
## 1    EWR 120835
## 2    JFK 111279
## 3    LGA 104662


We could plot that, too. But… three numbers can be grasped just fine without a diagram… The international airport codes can be looked up, e.g., here.

## How many flights departed from NYC per month?

That is practically the same question again…

flights %>%
  group_by(month) %>%
  summarise(Anzahl = n())

## # A tibble: 12 × 2
##    month Anzahl
##    <int>  <int>
## 1      1  27004
## 2      2  24951
## 3      3  28834
## 4      4  28330
## 5      5  28796
## 6      6  28243
## 7      7  29425
## 8      8  29327
## 9      9  27574
## 10    10  28889
## 11    11  27268
## 12    12  28135


This one is more worthwhile as a diagram:

flights %>%
  group_by(month) %>%
  summarise(Anzahl = n()) %>%
  ggplot(aes(x = month, y = Anzahl)) +
  geom_point(color = "firebrick") +
  geom_line()


## Which destinations were served? Were MUC and FRA among them?
flights %>%
  group_by(dest) %>%
  summarise(Anzahl = n()) %>%
  select(dest) %>%
  print(n = 200)

## # A tibble: 105 × 1
##     dest
##    <chr>
## 1    ABQ
## 2    ACK
## 3    ALB
## 4    ANC
## 5    ATL
## 6    AUS
## 7    AVL
## 8    BDL
## 9    BGR
## 10   BHM
## 11   BNA
## 12   BOS
## 13   BQN
## 14   BTV
## 15   BUF
## 16   BUR
## 17   BWI
## 18   BZN
## 19   CAE
## 20   CAK
## 21   CHO
## 22   CHS
## 23   CLE
## 24   CLT
## 25   CMH
## 26   CRW
## 27   CVG
## 28   DAY
## 29   DCA
## 30   DEN
## 31   DFW
## 32   DSM
## 33   DTW
## 34   EGE
## 35   EYW
## 36   FLL
## 37   GRR
## 38   GSO
## 39   GSP
## 40   HDN
## 41   HNL
## 42   HOU
## 43   IAD
## 44   IAH
## 45   ILM
## 46   IND
## 47   JAC
## 48   JAX
## 49   LAS
## 50   LAX
## 51   LEX
## 52   LGA
## 53   LGB
## 54   MCI
## 55   MCO
## 56   MDW
## 57   MEM
## 58   MHT
## 59   MIA
## 60   MKE
## 61   MSN
## 62   MSP
## 63   MSY
## 64   MTJ
## 65   MVY
## 66   MYR
## 67   OAK
## 68   OKC
## 69   OMA
## 70   ORD
## 71   ORF
## 72   PBI
## 73   PDX
## 74   PHL
## 75   PHX
## 76   PIT
## 77   PSE
## 78   PSP
## 79   PVD
## 80   PWM
## 81   RDU
## 82   RIC
## 83   ROC
## 84   RSW
## 85   SAN
## 86   SAT
## 87   SAV
## 88   SBN
## 89   SDF
## 90   SEA
## 91   SFO
## 92   SJC
## 93   SJU
## 94   SLC
## 95   SMF
## 96   SNA
## 97   SRQ
## 98   STL
## 99   STT
## 100  SYR
## 101  TPA
## 102  TUL
## 103  TVC
## 104  TYS
## 105  XNA


A long list… it might be clearer not to print it at all ;)

Let’s check whether MUC (Munich) or FRA (Frankfurt) were among them.

flights %>%
  filter(dest == "MUC" | dest == "FRA")

## # A tibble: 0 × 19
## # ... with 19 variables: year <int>, month <int>, day <int>,
## #   dep_time <int>, sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>


The resulting table (“tibble”) has 0 rows, so these destinations were not served. The character “|” means “or” (in the logical sense).
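An equivalent filter, often considered more readable, uses the %in% operator instead of two comparisons joined by “or” (same result):

```
# Keep rows whose destination is in the given set
flights %>%
  filter(dest %in% c("MUC", "FRA"))
```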
Accordingly, the whole “pipe” can be read like this: Take flights, then filter the rows with destination MUC or destination FRA.

## Which destinations were served most often?

flights %>%
  group_by(dest) %>%
  summarise(n_per_dest = n()) %>%
  arrange(desc(n_per_dest))

## # A tibble: 105 × 2
##     dest n_per_dest
##    <chr>      <int>
## 1    ORD      17283
## 2    ATL      17215
## 3    LAX      16174
## 4    BOS      15508
## 5    MCO      14082
## 6    CLT      14064
## 7    SFO      13331
## 8    FLL      12055
## 9    MIA      11728
## 10   DCA       9705
## # ... with 95 more rows


We could plot that again, too, but preferably only the top 10:

flights %>%
  group_by(dest) %>%
  summarise(n_per_dest = n()) %>%
  arrange(desc(n_per_dest)) %>%
  filter(min_rank(desc(n_per_dest)) < 11) %>%
  ggplot(aes(x = dest, y = n_per_dest)) +
  geom_bar(stat = "identity")


The expression min_rank(desc(n_per_dest)) < 11 keeps the 10 largest values of n_per_dest, i.e., the rows whose rank after sorting in descending order is smaller than 11. When plotting, the bar geom needs the addition stat = "identity", because by default geom_bar wants to count, e.g., how many rows contain “LGA”. We have already done the counting of rows with n(), so the command should simply take the value as it stands in our data frame (hence identity).

## Which destinations are served more than 10,000 times per year?

flights %>%
  group_by(dest) %>%
  summarise(n_dest = n()) %>%
  filter(n_dest > 10000)

## # A tibble: 9 × 2
##    dest n_dest
##   <chr>  <int>
## 1   ATL  17215
## 2   BOS  15508
## 3   CLT  14064
## 4   FLL  12055
## 5   LAX  16174
## 6   MCO  14082
## 7   MIA  11728
## 8   ORD  17283
## 9   SFO  13331


## Which flights went from JFK to PWM (Portland) in January between midnight and 5 am?
library(knitr)

filter(flights, origin == "JFK" & month == 1, dest == "PWM", dep_time < 500) %>% kable

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 4 106 2245 141 201 2356 125 B6 608 N192JB JFK PWM 44 273 22 45 2013-01-04 22:00:00
2013 1 31 54 2250 124 152 2359 113 B6 608 N281JB JFK PWM 41 273 22 50 2013-01-31 22:00:00

The command knitr::kable produces a (reasonably) pretty table (but the knitr package has to be loaded first). Why your boss wants to know this, only he can tell…

## Which flights departed from JFK in January, had an arrival delay more than twice as large as their departure delay, and flew to Atlanta?

Rarely have I read a task that consisted of such a long sentence…

filter(flights, origin == "JFK", arr_delay > 2 * dep_delay, month == 1, dest == "ATL") %>% kable

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 807 810 -3 1043 1043 0 DL 269 N308DE JFK ATL 126 760 8 10 2013-01-01 08:00:00
2013 1 1 1325 1330 -5 1606 1605 1 DL 2043 N318US JFK ATL 131 760 13 30 2013-01-01 13:00:00
2013 1 2 606 610 -4 846 845 1 DL 1743 N387DA JFK ATL 129 760 6 10 2013-01-02 06:00:00
2013 1 2 808 810 -2 1049 1045 4 DL 269 N971DL JFK ATL 124 760 8 10 2013-01-02 08:00:00
2013 1 2 1551 1548 3 1838 1830 8 DL 95 N702TW JFK ATL 119 760 15 48 2013-01-02 15:00:00
2013 1 2 2027 2030 -3 2309 2314 -5 DL 1447 N947DL JFK ATL 127 760 20 30 2013-01-02 20:00:00
2013 1 3 612 615 -3 912 850 22 DL 2057 N707TW JFK ATL 135 760 6 15 2013-01-03 06:00:00
2013 1 3 811 810 1 1053 1042 11 DL 269 N319NB JFK ATL 134 760 8 10 2013-01-03 08:00:00
2013 1 3 1847 1855 -8 2130 2142 -12 DL 951 N181DN JFK ATL 129 760 18 55 2013-01-03 18:00:00
2013 1 5 805 810 -5 1039 1041 -2 DL 269 N339NB JFK ATL 116 760 8 10 2013-01-05 08:00:00
2013 1 5 1540 1548 -8 1820 1829 -9 DL 95 N710TW JFK ATL 119 760 15 48 2013-01-05 15:00:00
2013 1 6 612 615 -3 846 848 -2 DL 2057 N397DA JFK ATL 126 760 6 15 2013-01-06 06:00:00
2013 1 6 809 810 -1 1044 1042 2 DL 269 N316NB JFK ATL 125 760 8 10 2013-01-06 08:00:00
2013 1 6 1326 1330 -4 1605 1605 0 DL 2043 N3734B JFK ATL 126 760 13 30 2013-01-06 13:00:00
2013 1 6 1545 1548 -3 1842 1830 12 DL 95 N387DA JFK ATL 139 760 15 48 2013-01-06 15:00:00
2013 1 7 803 810 -7 1029 1042 -13 DL 269 N344NB JFK ATL 105 760 8 10 2013-01-07 08:00:00
2013 1 8 612 615 -3 901 855 6 9E 3856 N153PQ JFK ATL 124 760 6 15 2013-01-08 06:00:00
2013 1 8 823 810 13 1112 1043 29 DL 269 N377NW JFK ATL 126 760 8 10 2013-01-08 08:00:00
2013 1 8 1549 1548 1 1846 1830 16 DL 95 N376DA JFK ATL 117 760 15 48 2013-01-08 15:00:00
2013 1 8 1843 1855 -12 2142 2142 0 DL 951 N1611B JFK ATL 126 760 18 55 2013-01-08 18:00:00
2013 1 9 807 810 -3 1045 1042 3 DL 269 N316NB JFK ATL 124 760 8 10 2013-01-09 08:00:00
2013 1 9 1857 1855 2 2151 2142 9 DL 951 N173DZ JFK ATL 120 760 18 55 2013-01-09 18:00:00
2013 1 10 809 810 -1 1041 1042 -1 DL 269 N361NB JFK ATL 111 760 8 10 2013-01-10 08:00:00
2013 1 11 807 810 -3 1056 1042 14 DL 269 N345NB JFK ATL 117 760 8 10 2013-01-11 08:00:00
2013 1 13 612 615 -3 853 855 -2 9E 3856 N146PQ JFK ATL 118 760 6 15 2013-01-13 06:00:00
2013 1 13 1848 1855 -7 2204 2142 22 DL 951 N1609 JFK ATL 116 760 18 55 2013-01-13 18:00:00
2013 1 15 612 615 -3 927 855 32 9E 3856 N181PQ JFK ATL 134 760 6 15 2013-01-15 06:00:00
2013 1 15 1324 1330 -6 1605 1605 0 DL 2043 N704X JFK ATL 133 760 13 30 2013-01-15 13:00:00
2013 1 16 615 615 0 905 855 10 9E 3856 N232PQ JFK ATL 131 760 6 15 2013-01-16 06:00:00
2013 1 16 1324 1330 -6 1603 1605 -2 DL 2043 N3763D JFK ATL 129 760 13 30 2013-01-16 13:00:00
2013 1 17 612 615 -3 906 855 11 9E 3856 N176PQ JFK ATL 129 760 6 15 2013-01-17 06:00:00
2013 1 17 810 810 0 1048 1042 6 DL 269 N340NB JFK ATL 127 760 8 10 2013-01-17 08:00:00
2013 1 17 1325 1330 -5 1600 1605 -5 DL 2043 N3756 JFK ATL 127 760 13 30 2013-01-17 13:00:00
2013 1 19 614 615 -1 857 855 2 9E 3856 N187PQ JFK ATL 127 760 6 15 2013-01-19 06:00:00
2013 1 19 803 810 -7 1031 1041 -10 DL 269 N320NB JFK ATL 115 760 8 10 2013-01-19 08:00:00
2013 1 20 1851 1855 -4 2135 2142 -7 DL 951 N1605 JFK ATL 114 760 18 55 2013-01-20 18:00:00
2013 1 21 1723 1730 -7 2010 2017 -7 DL 951 N175DN JFK ATL 131 760 17 30 2013-01-21 17:00:00
2013 1 22 614 615 -1 857 855 2 9E 3856 N153PQ JFK ATL 132 760 6 15 2013-01-22 06:00:00
2013 1 22 820 810 10 1107 1042 25 DL 269 N359NB JFK ATL 130 760 8 10 2013-01-22 08:00:00
2013 1 24 612 615 -3 855 855 0 9E 3856 N197PQ JFK ATL 117 760 6 15 2013-01-24 06:00:00
2013 1 25 810 810 0 1056 1042 14 DL 269 N366NB JFK ATL 123 760 8 10 2013-01-25 08:00:00
2013 1 25 1850 1855 -5 2151 2142 9 DL 951 N646DL JFK ATL 150 760 18 55 2013-01-25 18:00:00
2013 1 26 1543 1548 -5 1821 1829 -8 DL 95 N723TW JFK ATL 107 760 15 48 2013-01-26 15:00:00
2013 1 28 610 615 -5 856 855 1 9E 3856 N187PQ JFK ATL 126 760 6 15 2013-01-28 06:00:00
2013 1 28 808 810 -2 1041 1042 -1 DL 269 N327NB JFK ATL 120 760 8 10 2013-01-28 08:00:00
2013 1 28 1320 1330 -10 1547 1605 -18 DL 2043 N3748Y JFK ATL 117 760 13 30 2013-01-28 13:00:00
2013 1 28 1546 1548 -2 1835 1830 5 DL 95 N3745B JFK ATL 160 760 15 48 2013-01-28 15:00:00
2013 1 29 1325 1330 -5 1559 1605 -6 DL 2043 N712TW JFK ATL 119 760 13 30 2013-01-29 13:00:00
2013 1 30 807 810 -3 1105 1042 23 DL 269 N361NB JFK ATL 128 760 8 10 2013-01-30 08:00:00
2013 1 30 1853 1855 -2 2152 2142 10 DL 951 N686DA JFK ATL 131 760 18 55 2013-01-30 18:00:00
2013 1 31 616 615 1 905 855 10 9E 3856 N181PQ JFK ATL 126 760 6 15 2013-01-31 06:00:00
2013 1 31 810 810 0 1054 1042 12 DL 269 N355NB JFK ATL 127 760 8 10 2013-01-31 08:00:00
2013 1 31 1853 1855 -2 2149 2142 7 DL 951 N175DZ JFK ATL 129 760 18 55 2013-01-31 18:00:00

This table is quite long, too. But so be it :)

## Which airlines had the most "net delay"?
("Net delay" is computed here as dep_delay - arr_delay, i.e., the delay shed, or added, between departure and arrival.)

f_2 <- group_by(flights, carrier)
f_3 <- mutate(f_2, delay = dep_delay - arr_delay)
f_4 <- filter(f_3, !is.na(delay))
f_5 <- summarise(f_4, delay_mean = mean(delay))
arrange(f_5, delay_mean)

## # A tibble: 16 × 2
##    carrier delay_mean
##      <chr>      <dbl>
## 1       F9 -1.7195301
## 2       FL -1.5099213
## 3       MQ -0.3293526
## 4       OO  0.6551724
## 5       US  1.6150976
## 6       YV  3.3419118
## 7       B6  3.5095746
## 8       EV  4.0424982
## 9       DL  7.5796089
## 10      WN  8.0125374
## 11      AA  8.2048393
## 12      UA  8.4588972
## 13      9E  9.0599052
## 14      VX 10.9921814
## 15      HA 11.8157895
## 16      AS 15.7616361

Somewhat clumsy with all those intermediate objects… Perhaps better like this:

flights %>%
  group_by(carrier) %>%
  mutate(delay = dep_delay - arr_delay) %>%
  filter(!is.na(delay)) %>%
  summarise(delay_mean = mean(delay)) %>%
  arrange(-delay_mean)

## # A tibble: 16 × 2
##    carrier delay_mean
##      <chr>      <dbl>
## 1       AS 15.7616361
## 2       HA 11.8157895
## 3       VX 10.9921814
## 4       9E  9.0599052
## 5       UA  8.4588972
## 6       AA  8.2048393
## 7       WN  8.0125374
## 8       DL  7.5796089
## 9       EV  4.0424982
## 10      B6  3.5095746
## 11      YV  3.3419118
## 12      US  1.6150976
## 13      OO  0.6551724
## 14      MQ -0.3293526
## 15      FL -1.5099213
## 16      F9 -1.7195301

We could visualize this once more:

flights %>%
  group_by(carrier) %>%
  mutate(delay = dep_delay - arr_delay) %>%
  filter(!is.na(delay)) %>%
  summarise(delay_mean = mean(delay)) %>%
  arrange(-delay_mean) -> f_summarised

ggplot(f_summarised, aes(x = carrier, y = delay_mean)) +
  geom_point(color = "firebrick")

Here ggplot2 orders the x-axis alphanumerically by default. If we want the axis ordered by the values on the y-axis (delay_mean), which makes sense, we can achieve it like this:

ggplot(f_summarised, aes(x = reorder(carrier, delay_mean), y = delay_mean)) +
  geom_point(color = "firebrick")

The call reorder(carrier, delay_mean) orders the values of the variable carrier by the values of the variable delay_mean.

## Compute the mean delay of all flights with a substantial delay (> 1 hour)!
flights %>%
  mutate(delay = dep_delay - arr_delay) %>%
  filter(delay > 60) %>%
  summarise(delay_mean = mean(delay),
            n = n()) %>%  # number of such flights
  arrange(delay_mean)

## # A tibble: 1 × 2
##   delay_mean     n
##        <dbl> <int>
## 1   65.18182   154

## How are the delays distributed?

ggplot(f_summarised, aes(x = delay_mean)) + geom_histogram()

## stat_bin() using bins = 30. Pick better value with binwidth.

## Are flight distance and delay related?

flights %>%
  mutate(delay = dep_delay - arr_delay) %>%
  na.omit() %>%
  ggplot(aes(x = distance, y = delay)) +
  geom_point(alpha = .1) +
  geom_smooth()

## geom_smooth() using method = 'gam'

Wait, are we really plotting some 300,000 points right now??? That can take a while… The alpha = .1 makes the points pale, almost transparent. Quite handy when many points lie on top of each other.

## Are delay and season related?

Another quite interesting question. Let's have a look:

cor(flights$month, flights$dep_delay, use = "complete")

## [1] -0.02005702


The use = "complete" argument (shorthand for "complete.obs") tells cor() to ignore rows with missing values.
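As a quick illustration with toy vectors (not the flights data), here is what the argument changes:

```r
# Toy example: cor() returns NA as soon as either input contains a
# missing value -- unless we restrict it to complete pairs.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)      # one missing value

cor(x, y)                    # NA -- the missing pair poisons the result
cor(x, y, use = "complete")  # 1 -- the pair (3, NA) is dropped first
```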

So this does not look like an association. A chart should confirm that for us:

flights %>%
group_by(month) %>%
na.omit() %>%  # drop all rows with missing values
mutate(delay = dep_delay - arr_delay) %>%
ggplot(aes(x = month, y = delay)) + geom_boxplot()

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?


Oops, that looks odd… Hm… ggplot suggests that we somehow work a group aesthetic into the call… Well, our groups could be the months. So let's give it a try…

flights %>%
group_by(month) %>%
na.omit() %>%  # drop all rows with missing values
mutate(delay = dep_delay - arr_delay) %>%
ggplot(aes(x = month, y = delay, group = month)) + geom_boxplot()


The x-axis does not look great yet (with those decimal places), but we'll save that for another occasion :-)
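For the curious, a sketch of two common fixes (using a made-up toy data frame, not the flights data): force whole-number breaks, or turn month into a factor.

```r
library(ggplot2)

# Toy stand-in for the flights data: month is numeric, so ggplot2
# places decimal tick marks on a continuous x-axis by default.
toy <- data.frame(month = rep(1:12, each = 20),
                  delay = rnorm(240))

# Option 1: keep month numeric, but request integer breaks.
p1 <- ggplot(toy, aes(x = month, y = delay, group = month)) +
  geom_boxplot() +
  scale_x_continuous(breaks = 1:12)

# Option 2: convert month to a factor; the grouping then comes for free.
p2 <- ggplot(toy, aes(x = factor(month), y = delay)) +
  geom_boxplot()
```

Option 2 is usually the more idiomatic choice when a numeric variable is really a discrete category.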

One more little bonus to finish: interactive charts!

For that we first need to load a new package: plotly (and install it beforehand if necessary).

# install.packages("plotly")
library(plotly)


You can hand plotly a ggplot object, which is then automatically translated into an interactive chart. Of course this only makes sense when viewed on a computer; printed out, it is no longer interactive…

flights %>%
group_by(month) %>%
na.omit() %>%  # drop all rows with missing values
mutate(delay = dep_delay - arr_delay) %>%
ggplot(aes(x = month, y = delay, group = month, color = month)) + geom_boxplot() -> flights_plot

ggplotly(flights_plot)


That's enough for today!