Exercises

Exercise

Load the gapminder package (install it first if necessary) and load the gapminder dataset from it. More information about the Gapminder project is available here.

Tip: help(gapminder) gives you more information about the dataset.

Note: Refer to the material in this book: Ismay, C., & Kim, A. (2019). ModernDive: An Introduction to Statistical and Data Sciences via R. http://moderndive.com/.
1. Visualize the number of countries per continent with a suitable plot.
2. Map the values of the continent variable to the fill color.
3. Add + scale_fill_viridis_d(). What changes?
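As a possible starting point for checking your own approach to the three steps above, here is a minimal sketch. It assumes the packages gapminder, dplyr and ggplot2 are installed; the object names `countries` and `p` are illustrative and not part of the exercise. Because gapminder holds one row per country and year, the sketch first reduces the data to one row per country.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# gapminder has one row per country and year, so keep each country once
countries <- gapminder %>%
  distinct(country, continent)

# geom_bar() counts the rows (here: countries) per continent
p <- ggplot(countries) +
  aes(x = continent, fill = continent) +
  geom_bar() +
  scale_fill_viridis_d()  # discrete viridis palette for the fill colors
p
```

Without the `distinct()` step, the bars would show the number of rows (country-year combinations) rather than the number of countries.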
Exercise

Load the gapminder package (install it first if necessary) and load the gapminder dataset from it. More information about the Gapminder project is available here.

Tip: help(gapminder) gives you more information about the dataset.
1. Create a boxplot for each country in the dataset to visualize the distribution of life expectancy.
2. Create a boxplot for each continent in the dataset to visualize the distribution of life expectancy.
3. Create a boxplot for each continent in the dataset to visualize the distribution of life expectancy. This time, however, first summarize the dataset so that you obtain a (row-)reduced dataset with one row per continent. How sensible is this approach?
4. Create one boxplot per continent and map the continent variable to the fill color.
Note: Refer to the material in this book: Ismay, C., & Kim, A. (2019). ModernDive: An Introduction to Statistical and Data Sciences via R. http://moderndive.com/.
Exercise

Load the gapminder package (install it first if necessary) and load the gapminder dataset from it. More information about the Gapminder project is available here.
1. Create a histogram to sketch the distribution of life expectancy.
2. Split the histogram into several facets, one per continent.
3. Now create a histogram again, but additionally specify in aes() that the fill color is mapped to the continent.
4. In the last plot, replace the geom “histogram” with the geom “density” (a so-called density plot, i.e. geom_density). Reduce the alpha of the fill to 50%. Which variant (histogram or density plot) is more useful, and under which circumstances? Why?
Tip: help(gapminder) gives you more information about the dataset.

Note: Refer to the material in this book: Ismay, C., & Kim, A. (2019). ModernDive: An Introduction to Statistical and Data Sciences via R. http://moderndive.com/.
Exercise

Load the gapminder package (install it first if necessary) and load the gapminder dataset from it. More information about the Gapminder project is available here.

Tip: help(gapminder) gives you more information about the dataset.
1. Create a scatter plot that shows the relationship between gross domestic product per capita (gdpPercap) and life expectancy.
2. Log-transform the gross domestic product variable and recreate the plot on that basis.
3. Interpret this (second) plot.
4. The points in the scatter plot overlap heavily. How can this “overplotting” be reduced?
Note: Refer to the material in this book: Ismay, C., & Kim, A. (2019). ModernDive: An Introduction to Statistical and Data Sciences via R. http://moderndive.com/.
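One common way to approach the overplotting question is to make the points semi-transparent. The following is only a sketch under assumptions: the alpha value of 0.2 is an arbitrary illustrative choice, and `scale_x_log10()` is used here as an alternative to log-transforming the column yourself, which is what the exercise asks for.

```r
library(gapminder)
library(ggplot2)

p <- ggplot(gapminder) +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point(alpha = 0.2) +  # alpha < 1: densely populated regions appear darker
  scale_x_log10()            # log10 axis instead of a transformed column
p
```

Jittering the points (geom_jitter()) is another option, although it is mainly helpful when many points share exactly the same coordinates.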
Exercise

Load the gapminder package (install it first if necessary) and load the gapminder dataset from it. More information about the Gapminder project is available here.

Recreate the following scatter plot.

Tip: help(gapminder) gives you more information about the dataset.

Note: Refer to the material in this book: Ismay, C., & Kim, A. (2019). ModernDive: An Introduction to Statistical and Data Sciences via R. http://moderndive.com/.

Exercise

In this exercise, we analyze the diamonds dataset, which lists characteristics of diamonds (such as price, type of cut, etc.). Here is a glimpse of the dataset:

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, …
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, …
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, …
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, V…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, …
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, …
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, …
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, …
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, …

diamonds is part of the tidyverse (more precisely, of the ggplot2 package).

Here is an overview of the descriptive, univariate statistics:

Data summary
Name	diamonds
Number of rows	53940
Number of columns	10
_______________________
Column type frequency:
factor	3
numeric	7
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
cut	1	TRUE	5	Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color	1	TRUE	7	G: 11292, E: 9797, F: 9542, H: 8304
clarity	1	TRUE	8	SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
carat	1	0.80	0.47	0.2	0.40	0.70	1.04	5.01	▇▂▁▁▁
depth	1	61.75	1.43	43.0	61.00	61.80	62.50	79.00	▁▁▇▁▁
table	1	57.46	2.23	43.0	56.00	57.00	59.00	95.00	▁▇▁▁▁
price	1	3932.80	3989.44	326.0	950.00	2401.00	5324.25	18823.00	▇▂▁▁▁
x	1	5.73	1.12	0.0	4.71	5.70	6.54	10.74	▁▁▇▃▁
y	1	5.73	1.14	0.0	4.72	5.71	6.54	58.90	▇▁▁▁▁
z	1	3.54	0.71	0.0	2.91	3.53	4.04	31.80	▇▁▁▁▁

Consider the histogram; which R code produced it?

The following applies to all plots:

diamonds %>% 
  mutate(price = price/1000 %>% round) %>%
  ggplot() +
  aes(x = price) +
  geom_histogram()  +
  geom_vline(aes(xintercept = mean(price))) +
  theme_light() +
  facet_wrap( ~ cut, scales = "free")

diamonds %>% 
  drop_na(cut, price) %>% 
  filter(cut %in% c("Fair", "Premium", "Ideal")) %>% 
  mutate(price = price/1000 %>% round) %>% 
  ggplot() +
  aes(x = price) +
  geom_histogram()  +
  geom_vline(aes(xintercept = median(price))) +
  theme_light() +
  facet_wrap( ~ cut, scales = "free")

diamonds %>% 
  drop_na(cut, price) %>% 
  filter(cut %in% c("Fair", "Premium", "Ideal")) %>%      
  ggplot() +
  aes(x = price) +
  geom_histogram()  +
  geom_vline(aes(xintercept = median(price))) 
  facet_wrap(cut)

diamonds %>% 
  drop_na(cut, price) %>% 
  filter(cut %in% c("Fair", "Premium", "Ideal")) %>%      
  ggplot() +
  aes(x = price) +
  geom_histogram()  +
  geom_vline(aes(xintercept = median(price))) +
  theme_light() +
  facet_wrap( ~ cut, scales = "free")

Code example C
Code example D
Code example A
Code example B
No answer possible
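When reading the code examples above, it helps to know how R parses `price/1000 %>% round`: the pipe `%>%` binds more tightly than `/`. The sketch below illustrates this on three prices taken from the summary table (minimum, median, maximum); the vector `price` and the names `unrounded` and `rounded` are illustrative only.

```r
library(magrittr)  # provides %>% (also available after loading dplyr or tidyverse)

price <- c(326, 2401, 18823)

# `%>%` has higher precedence than `/`, so this is price / round(1000)
unrounded <- price / 1000 %>% round
unrounded  # 0.326, 2.401, 18.823: nothing was rounded

# Parentheses are needed to round the quotient itself
rounded <- (price / 1000) %>% round
rounded    # 0, 2, 19
```

So the `mutate()` calls in the examples rescale the price to thousands but do not round it.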

Exercise

Which of the following statements about the plot is correct?
1. The vertical line in each panel matches the position of the overall median well.
2. Frequencies are plotted on the x-axis.
3. A nominal-scaled variable is plotted on the x-axis.
4. The grouping variable cut is used here as an ordinal variable, that is, with an ordering structure.
5. The variable cut is an interval-scaled variable.
Exercise

Consider the two plots, A and B; they show the distribution of diamond prices by type of cut. The vertical line shows a measure of central tendency. Which of the statements is correct?
1. The vertical line does not match the median.
2. It is not useful to show the overall distribution in each facet in addition to the per-group distribution.
3. Showing the global median (for the entire dataset, that is, across all groups) in each facet is redundant. It is therefore better to show the per-group median in each facet.
4. Using a fill color (plot B) is not useful here.
Exercise

Consider the two plots, A and B; they show the distribution of diamond prices by type of cut. The vertical line shows a measure of central tendency. Which of the statements is correct?
1. Plot A shows a measure of central tendency per group, that is, the respective statistic of each group (cut).
2. Overall, the distributions are left-skewed.
3. Plot B shows the overall distribution across the three groups (in light gray); the stronger colors show the distribution per group (cut).
4. Overall, the distributions are steep on the right.

Exercise

Which R code correctly produces the plot?

**R-Code A** 

flights %>%
  ggplot(aes(x = hour, y = dep_delay)) + 
  geom_boxplot(aes(group = hour)) + 
  geom_smooth() + 
  coord_cartesian(ylim = c(-10, 60)) + 
  scale_x_continuous(breaks = 1:24)


**R-Code B** 

flights %>%
  select(arr_delay, hour) %>% 
  ggplot(aes(x = hour, y = arr_delay)) + 
  geom_boxplot(aes(group = hour)) + 
  geom_smooth(method = 'lm') + 
  coord_cartesian(ylim = c(-10, 60))


**R-Code C** 

flights %>%
  select(dep_delay, hour) %>% 
  ggplot(aes(x = hour, y = dep_delay)) + 
  geom_boxplot(aes(group = hour)) + 
  geom_smooth(method = 'lm') + 
  coord_cartesian(ylim = c(-10, 60)) + 
  scale_x_continuous(breaks = 1:24)


**R-Code D** 

flights %>%
  select(dep_delay, hour) %>% 
  ggplot(aes(y = hour, x = dep_delay)) + 
  geom_boxplot(aes(group = dep_delay)) + 
  geom_smooth(method = 'lm') + 
  coord_cartesian(ylim = c(-10, 60)) + 
  scale_x_continuous(breaks = 1:24)


**R-Code E** 

flights %>%
  select(dep_delay, hour) %>% 
  ggplot(aes(x = hour, y = dep_delay)) + 
  geom_boxplot(aes(group = hour)) + 
  coord_cartesian(ylim = c(-10, 60)) + 
  scale_x_continuous(breaks = 1:24)

Exercise

Which R code correctly produces the plot?

**R-Code A** 

tips %>%
  ggplot(aes(x = total_bill, y = tip, 
             color = sex, shape = sex)) + 
  geom_point() + 
  scale_color_viridis_d() + 
  theme_bw()


**R-Code B** 

tips %>%
  ggplot(aes(x = total_bill, y = tip, 
             color = sex, shape = sex)) + 
  geom_point(size = 2) + 
  geom_smooth() + 
  scale_color_viridis_d() + 
  theme_bw()


**R-Code C** 

tips %>%
  ggplot(aes(y = total_bill, x = tip, 
             color = sex, shape = sex)) + 
  geom_point(size = 2) + 
  geom_smooth() + 
  scale_color_viridis_d() + 
  theme_bw()


**R-Code D** 

tips %>%
  ggplot(aes(x = total_bill, y = tip, 
             color = sex, shape = sex)) + 
  geom_point(size = 2) + 
  geom_smooth() 


**R-Code E** 

tips %>%
  ggplot(aes(x = total_bill, y = tip, 
             color = sex)) + 
  geom_point(size = 1) + 
  geom_smooth() + 
  scale_color_viridis_d()

Exercise

Depending on the data analysis, different types of plots are appropriate. A plot type called a heatmap can be created in R as follows, for example:
```
library(tidyverse)
data("diamonds")
```
```
p1 <- 
  ggplot(diamonds) +
  aes(x = cut, y = clarity) +
  geom_bin2d()
p1
```
Let us also change the color scheme so that the color differences stand out more clearly; we are, so to speak, putting on ski goggles with yellow lenses.
```
p1 +
  scale_fill_viridis_c()
```

Which statement fits this plot best?
1. This is a univariate frequency analysis.
2. The darker the color, the more frequent the category.
3. This is a visualization of a contingency table.
4. Two metric variables were examined.
5. Heatmaps are, overall, a poorly suited form of data analysis.
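The counts that geom_bin2d() maps to the fill color for the two variables used above can also be inspected numerically. The following sketch uses base R's table() for this; the object name `tab` is an illustrative choice.

```r
library(ggplot2)  # provides the diamonds dataset

# Cross-tabulate the two variables that span the heatmap
tab <- table(cut = diamonds$cut, clarity = diamonds$clarity)
tab

dim(tab)  # 5 levels of cut by 8 levels of clarity
sum(tab)  # every diamond falls into exactly one cell
```

Comparing a few cells of `tab` with the tile colors is a quick way to check how the viridis scale relates color to frequency.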
Exercise

Consider the histogram. Which boxplot reflects the histogram most accurately?

Histogram:

Boxplots:
1. Boxplot A
2. Boxplot B
3. Boxplot C
4. Boxplot D
5. Boxplot E
Exercise

We analyze the distribution of the price (price) of diamonds, grouped by type of cut (cut). Consider the histograms. Which statement is correct?
1. The coloring (fill color) encodes the type of cut (cut).
2. The means of the histograms are identical.
3. The medians of the histograms are identical.
4. Some histograms are normally distributed.
5. The histograms are (all) right-skewed.
Exercise

We analyze the distribution of the price (price) of diamonds, grouped by type of cut (cut). Consider the histograms. Which statement is correct?
1. The cut Premium has a bimodal distribution.
2. The histograms are (all) symmetric.
3. The cut Fair has a bimodal distribution.
4. For all cuts, the median is smaller than the mean.
Exercise

Choose the value of the correlation coefficient that best matches the scatter plot.
1. +.90
2. 1
3. -.90
4. 0
5. -1
Exercise

Which statement about this R syntax is false?
```
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
```
1. The dataset is called mpg.
2. Points are specified as the geom.
3. The variable displ is to be placed on the x-axis.
4. Two plots are to be drawn (hence the +).
5. The variable hwy is to be placed on the y-axis.

Exercise

Find the syntax that creates the following plot (dataset mpg):

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping = aes(y = hwy, x = cyl))

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = hwy, y = cyl))

data(mpg)
ggplot(data = data) + 
  geom_point(aes(x = hwy, y = cyl))

data(mpg)
  geom_point(mapping = aes(x = hwy, y = cyl)

data(mpg)
ggplot(data = mpg) + 
  geom_point((x = hwy, y = cyl)

Answer A
Answer B
Answer C
Answer D
Answer E

Exercise

Find the syntax that creates the following plot (dataset mpg):

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = class)

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

data(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping = aes(color = class), x = displ, y = hwy)

Answer A
Answer B
Answer C
Answer D
Answer E

Exercise

Find the syntax that creates the following plot (dataset mpg):

data(mpg)
ggplot(data = mpg) 
  + geom_point(mapping = aes(x = displ, y = hwy)) 
  + facet_wrap(~ class, nrow = 2)

data(mpg)
ggplot(data = mpg) + 
  geom_point(x = displ, y = hwy) + 
  facet_wrap(~ class, nrow = 2)

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping(x = displ, y = hwy) + 
  facet_wrap(~ class, nrow = 2)

data(mpg) +
 ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ; y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Answer A
Answer B
Answer C
Answer D
Answer E

Exercise

Which of the following extensions for ggplot2 does not exist?
1. Anatomy diagrams
2. Animations
3. Diagnostics for checking the model assumptions of regression models
4. Halving of geoms (who needs that?)
5. Advanced pie charts
6. 3D

Exercise

Find the syntax that creates the following plot (dataset diamonds):

data(diamonds)
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = prop, group = 1))

data(diamonds)
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = prop))

data(diamonds)
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = stat(prop), group = 1))

data(mpg) +
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = stat(prop)), group = 1)

data(mpg)
ggplot(data = diamonds) 
  + geom_bar(mapping == aes(x = cut, y = stat(prop), group = 1))

Answer A
Answer B
Answer C
Answer D
Answer E
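A note on the `stat(prop)` notation used in some of the options above: in current ggplot2 versions, `stat()` is deprecated in favor of `after_stat()`, which has been available since ggplot2 3.3.0. The sketch below shows the newer spelling for a bar chart of proportions; the object names `p` and `bars` are illustrative.

```r
library(ggplot2)

# after_stat(prop) refers to the proportion computed by the count stat;
# group = 1 makes the proportions relative to the whole dataset
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p

# The bar heights are proportions of the whole, so they add up to 1
bars <- layer_data(p, 1)
sum(bars$y)
```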

Exercise

Which statement about stat_summary() (in ggplot2) is correct?
1. summarises the x values for each y value, to draw attention to the summary that you’re computing
2. summarises the unique x values for each unique y value, to draw attention to the summary that you’re computing
3. summarises the x values
4. summarises the x values for each unique y value, to draw attention to the summary that you’re computing
5. summarises the y values for each unique x value, to draw attention to the summary that you’re computing
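To see stat_summary() in action before deciding, it can help to run it once and look at the data it computes. This is a minimal sketch on the diamonds dataset; the choice of the variables cut and depth and of median, min and max as summary functions is an assumption made for illustration. The arguments `fun`, `fun.min` and `fun.max` require ggplot2 3.3.0 or later.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun = median,   # drawn as the point
    fun.min = min,  # lower end of the line
    fun.max = max   # upper end of the line
  )
p

# One summary row per group that stat_summary() formed
summary_data <- layer_data(p, 1)
summary_data[, c("x", "ymin", "y", "ymax")]
```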

Exercise

Find the syntax that creates the following plot (dataset diamonds):

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity),)

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity, position = "fill"))

Answer A
Answer B
Answer C
Answer D
Answer E

Exercise

Find the syntax that creates the following plot:

de <- map_data("world", region = "Germany")

ggplot(de, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

de <- map_data("world")

ggplot(de, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

de <- map_data("world")

ggplot(de, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

de <- map_data("world", region = "Germany")

ggplot(de, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

de <- map_data("world", region = "France")

ggplot(de, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

Answer A
Answer B
Answer C
Answer D
Answer E

Exercise

ggplot2 consists of several “building blocks” or “layers” that can be combined to create a plot.

Which of the following “layers” is not part of ggplot2?
1. DATA
2. GEOM_FUNCTION
3. FACET_FUNCTION
4. VARIABLES
5. POSITION