Analysis of the HaNS Matomo Data

Author

Sebastian Sauer

Published

September 29, 2025

1 Background

This working paper describes the technical procedure used to analyse the Matomo data of the BMBF project “HaNS”.

1.1 Procedure

The Matomo click data from all semesters of the project period were processed for this analysis. An R pipeline was used to analyse a series of research questions.

The complete code is documented online at https://github.com/sebastiansauer/hans. For data protection reasons, no data are published online.

The central analysis pipeline file is https://github.com/sebastiansauer/hans/blob/main/_targets.R.

1.2 Research questions

  1. How many users are there, and over what period?
  2. How frequently is HaNS visited? How long are the intervals between uses of the platform?
  3. How often is HaNS visited per period (e.g., per month)?
  4. How does usage change over time?
  5. How many actions does a visit involve? What is the statistical distribution of actions per visit?
  6. How long do users stay per visit?
  7. How does the duration of use per visit change over time?
  8. Which actions do users perform on HaNS?
  9. How do the distributions of action frequencies change over time?
  10. On which days and at what times do users come to HaNS?
  11. How often and in what ways do users interact with the LLM in HaNS?
  12. What share of users interact with the LLM?
  13. How does the share of users who interact with the LLM change over time?
  14. How often is a word in the LLM transcript clicked?
  15. How often is a transcript service in HaNS used?
  16. How does the use of transcript services in HaNS change over time?
  17. How long are videos watched?
  18. How does viewing duration change over time?

2 Setup

2.1 Loading R packages

Show the code
library(targets)
library(tidyverse)
library(ggokabeito)
library(easystats)
library(gt)
library(ggfittext)
library(scales)
library(visdat)
library(collapse)
library(ggpubr)
library(knitr)
library(tinytable)
library(data.table)
Show the code
library(ggplot2)
theme_set(theme_minimal())

2.2 Setting options

Show the code
options(lubridate.week.start = 1) # Monday as first day
#options(collapse_mask = "all") # use collapse for all dplyr operations
options(chromote.headless = "new") # headless Chrome is needed for gtsave
Show the code
scale_colour_discrete <- function(...) scale_colour_brewer(palette = "Set2")
scale_fill_discrete <- function(...) scale_fill_brewer(palette = "Set2")

2.3 Loading data

Show the code
tar_load(ai_transcript_clicks_per_month)
tar_load(config)
tar_load(course_and_uni_per_visit)
tar_load(data_all_fct)
tar_load(data_long)
tar_load(data_prepped)
tar_load(data_separated_distinct_slice)
tar_load(data_separated_filtered)
#tar_load(data_users_only)
tar_load(idvisit_has_llm)
tar_load(llm_response_text)
tar_load(n_action)
tar_load(n_action_type)
tar_load(n_action_w_date)
tar_load(time_duration)
tar_load(time_since_last_visit)
tar_load(time_spent)
tar_load(time_spent_w_course_university)
tar_load(time_visit_wday)
tar_load(n_mc_answers_selected)
tar_load(mc_answers_with_timestamps)
tar_load(n_action_fingerprint)
tar_load(time_visit_wday_fingerprint)
tar_load(n_action_w_date_fingerprint)
tar_load(time_spent_fingerprint)

3 Data preparation and analysis pipeline

3.1 The targets pipeline provides an overview of all analysis steps

The analysis is specified as a targets pipeline and is openly available on GitHub.
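For readers unfamiliar with the targets package, a pipeline file has roughly the following shape. This is a minimal sketch, not the project's actual _targets.R: the file path and the target names raw_file, data_raw, and n_rows are hypothetical and chosen only for illustration.

```r
# _targets.R (illustrative sketch; target names and file path are hypothetical)
library(targets)

list(
  # Track the input file so that downstream targets rerun when it changes
  tar_target(raw_file, "data/matomo_export.csv", format = "file"),
  # Read the tracked file
  tar_target(data_raw, read.csv(raw_file)),
  # A derived summary that depends on data_raw
  tar_target(n_rows, nrow(data_raw))
)
```

With such a file in place, tar_make() rebuilds only the targets that are out of date, and tar_load() makes a finished target available in a report, as done in section 2.3.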

3.2 Long format

Because of the ragged data format (i.e., rows of differing lengths), the data were converted to long format to simplify the analysis.

For this, the actionDetails_ variables were used (along with the ID variables, chiefly idvisit). The code that pivots the data into long format is in the function longify-data.R.

The long-format data were then tidied further with the function slimify-data.R.
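The idea of the pivoting step can be sketched in base R as follows. This is not the code of longify-data.R; it assumes, for illustration only, that the wide columns follow the naming pattern actionDetails_<nr>_<field>, and it uses a made-up two-visit example.

```r
# Toy wide data: one row per visit, a varying number of filled action columns.
# The column naming pattern is an assumption made for this sketch.
wide <- data.frame(
  idvisit = c(1L, 2L),
  actionDetails_0_type = c("action", "event"),
  actionDetails_0_timestamp = c("2023-03-23 18:37:56", "2023-03-24 09:00:00"),
  actionDetails_1_type = c("event", NA),
  actionDetails_1_timestamp = c("2023-03-23 18:38:10", NA),
  stringsAsFactors = FALSE
)

action_cols <- grep("^actionDetails_", names(wide), value = TRUE)

# One small long table per wide column, then stack them
pieces <- lapply(action_cols, function(col) {
  parts <- strsplit(col, "_", fixed = TRUE)[[1]] # "actionDetails", nr, field
  data.frame(
    idvisit = wide$idvisit,
    nr = as.integer(parts[2]),
    type = parts[3],
    value = wide[[col]],
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, pieces)

# Drop the padding NAs that made the wide format ragged, then sort
long <- long[!is.na(long$value), ]
long <- long[order(long$idvisit, long$nr), ]
rownames(long) <- NULL
long
```

The result has one row per visit, action number, and field, which matches the column layout (nr, type, value, idvisit) of the long-format object shown in this report.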

Show the code
data_separated_filtered |>
  head(30)

4 Overview of the data

4.1 Loading and inspecting the raw data (data_all_fct)

4.1.1 Dimensions

The raw dataset has

  • 14207 rows
  • 3181 columns (duplicates and columns containing images already removed)

Each row corresponds to one “visit”.

4.1.2 First look

Show the code
data_all_fct_head100 <-
  data_all_fct %>%
  select(1:100) %>%
  slice_head(n = 100)
Show the code
data_all_fct_head100 %>%
  visdat::vis_dat()

4.1.3 (Almost) empty columns

Show the code
d_na_cols <- data.frame(
  id = 1:ncol(data_all_fct),
  names = names(data_all_fct),
  na_prop = colMeans(is.na(data_all_fct))
)

4.1.3.1 Empty columns

Show the code
d_na_cols |>
  filter(na_prop == 1)

4.1.3.2 Almost empty columns

Show the code
no_na_cols <-
  d_na_cols |>
  filter(na_prop > .9) |>
  nrow()

no_na_cols
[1] 1951
Important

A great many columns (1951) are almost empty, i.e., more than 90% of their values are missing.

4.1.4 Names (1-100)

Show the code
d_100_names <- data.frame(
  id = 1:100,
  col_name = data_all_fct_head100 %>% names()
)

d_100_names

4.1.5 Values of the first 100 columns

Show the code
data_all_fct_head100

4.1.6 Dataset data_separated_filtered, rows 1-100

Show the code
data_separated_filtered %>%
  slice(1:100)

4.2 Number of cases in the users-only dataset

After removing developers, admins, and lecturers from the raw dataset, fewer rows remain:

  • 14207 rows
  • 3181 columns

4.3 Dataset with the number of actions per user

4.3.1 idvisit

Show the code
n_action |> dim()
[1] 14207     2
Show the code
n_action |>
  head(30)
Show the code
n_action |>
  ggplot(aes(x = nr_max)) +
  geom_histogram()

4.3.2 fingerprint

Show the code
n_action_fingerprint |> head(30)
Show the code
n_action_fingerprint |>
  ggplot(aes(x = nr_max)) +
  geom_histogram()

5 Time period

5.1 Start/end of the data

Show the code
n_action_w_date |>
  head(30)
Show the code
min_max_time <-
  n_action_w_date |>
  summarise(
    time_min = min(date_time_start, na.rm = TRUE),
    time_max = max(date_time_start, na.rm = TRUE)
  )

min_max_time |>
  gt()
time_min time_max
2022-12-05 15:33:45 2025-07-14 23:40:45
Important

First visit in the dataset: 2022-12-05 15:33:45.

Last visit in the dataset: 2025-07-14 23:40:45.

These statistics were computed from the data object data_separated_filtered; see the target for this object in the pipeline.

5.2 Days since last visit

5.2.1 Overall

5.2.1.1 idvisit

Show the code
time_visit_wday |>
  head(30)
Show the code
time_since_last_visit <-
  time_since_last_visit |>
  mutate(dayssincelastvisit = as.numeric(dayssincelastvisit)) |>
  distinct(idvisit, .keep_all = TRUE)

time_since_last_visit |>
  datawizard::describe_distribution(dayssincelastvisit) |>
  knitr::kable(digits = 2)
Variable Mean SD IQR Min Max Skewness Kurtosis n n_Missing
dayssincelastvisit 6.89 15.75 0 1 87 2.98 8.26 14207 0
Show the code
time_since_last_visit |>
  ggplot(aes(x = dayssincelastvisit)) +
  geom_density() +
  labs(
    title = "If visitors return, they mostly do so within a few days."
  )

Important

Users seem to return to the site at intervals of a few days.

5.2.1.2 fingerprint

Show the code
time_visit_wday_fingerprint |> head()
Show the code
time_since_last_visit_fingerprint <-
  time_since_last_visit |>
  mutate(dayssincelastvisit = as.numeric(dayssincelastvisit)) |>
  distinct(fingerprint, .keep_all = TRUE)

time_since_last_visit_fingerprint |>
  datawizard::describe_distribution(dayssincelastvisit) |>
  knitr::kable(digits = 2)
Variable Mean SD IQR Min Max Skewness Kurtosis n n_Missing
dayssincelastvisit 6.89 15.75 0 1 87 2.98 8.26 14207 0
Show the code
time_since_last_visit_fingerprint |>
  ggplot(aes(x = dayssincelastvisit)) +
  geom_density() +
  labs(
    title = "If visitors return, they mostly do so within a few days."
  )

5.2.2 By course

Show the code
time_since_last_visit_per_course <-
  time_since_last_visit |>
  left_join(course_and_uni_per_visit) |>
  drop_na()
Show the code
time_since_last_visit_per_course_summary <-
  time_since_last_visit_per_course |>
  group_by(course) |>
  summarise(
    dayssincelastvisit_mean = mean(dayssincelastvisit),
    dayssincelastvisit_sd = sd(dayssincelastvisit),
    dayssincelastvisit_n = n()
  ) |>
  mutate(
    dayssincelastvisit_n_log = log(dayssincelastvisit_n, base = 10) + 0.001
  )
Show the code
time_since_last_visit_per_course_summary
Show the code
time_since_last_visit_per_course_summary |>
  ggplot(aes(
    y = reorder(course, dayssincelastvisit_mean),
    x = dayssincelastvisit_mean
  )) +
  geom_errorbar(aes(
    xmin = dayssincelastvisit_mean - dayssincelastvisit_sd,
    xmax = dayssincelastvisit_mean + dayssincelastvisit_sd
  )) +
  geom_point(aes(alpha = dayssincelastvisit_n_log), show.legend = FALSE) +
  labs(
    x = "Days since last visit (mean±sd)",
    y = "course",
    title = "In some courses, users use HaNS frequently.",
    caption = "Grey saturation of the mean dots refers to the log10 of the sample size (N)"
  ) +
  geom_text(
    aes(label = round(dayssincelastvisit_n)),
    x = Inf,
    hjust = 1.2,
    size = 2
  ) +
  annotate(
    x = Inf,
    y = Inf,
    label = "N",
    geom = "label",
    hjust = 1,
    vjust = 1
  ) +
  scale_y_discrete(expand = expansion(mult = 0.1)) +
  theme_minimal()

5.3 Visits over time

How many visits to HaNS were there?

5.3.1 Per month

5.3.1.1 idvisit

Show the code
time_visit_wday_summary <-
  time_visit_wday |>
  ungroup() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  mutate(
    month_name = month(date_time, label = TRUE, abbr = FALSE),
    month_num = month(date_time, label = FALSE),
    year_num = year(date_time)
  )
Show the code
time_visit_wday_summary |>
  group_by(year_num, month_num) |>
  summarise(n = n()) |>
  gt()
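year_num  month_num  n
2022      12         329
2023      1          455
2023      2          561
2023      3          149
2023      4          253
2023      5          391
2023      6          292
2023      7          441
2023      8          26
2023      9          39
2023      10         614
2023      11         660
2023      12         519
2024      1          783
2024      2          85
2024      3          138
2024      4          329
2024      5          413
2024      6          593
2024      7          743
2024      8          16
2024      9          23
2024      10         731
2024      11         918
2024      12         765
2025      1          959
2025      2          155
2025      3          507
2025      4          1011
2025      5          557
2025      6          321
2025      7          430
NA        NA         1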
Show the code
time_visit_wday_summary |>
  group_by(year_num, month_start) |>
  summarise(n = n()) |>
  ggplot(aes(x = month_start, y = n)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  labs(
    title = "The number of visits reflects the teaching periods of the semesters.",
    x = "month/year"
  )

5.3.1.2 fingerprint

Show the code
time_visit_wday_summary_fingerprint <-
  time_visit_wday_fingerprint |>
  ungroup() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  mutate(
    month_name = month(date_time, label = TRUE, abbr = FALSE),
    month_num = month(date_time, label = FALSE),
    year_num = year(date_time)
  )
Show the code
time_visit_wday_summary_fingerprint |>
  group_by(year_num, month_num) |>
  summarise(n = n()) |>
  gt()
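year_num  month_num  n
2022      12         235
2023      1          248
2023      2          303
2023      3          99
2023      4          160
2023      5          226
2023      6          195
2023      7          227
2023      8          17
2023      9          23
2023      10         402
2023      11         412
2023      12         325
2024      1          445
2024      2          50
2024      3          94
2024      4          179
2024      5          204
2024      6          274
2024      7          214
2024      8          10
2024      9          16
2024      10         365
2024      11         417
2024      12         317
2025      1          347
2025      2          74
2025      3          217
2025      4          424
2025      5          273
2025      6          171
2025      7          196
NA        NA         1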
Show the code
time_visit_wday_summary_fingerprint |>
  group_by(year_num, month_start) |>
  summarise(n = n()) |>
  ggplot(aes(x = month_start, y = n)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  labs(
    title = "The number of visits reflects the teaching periods of the semesters.",
    x = "month/year"
  )

5.3.2 Per week

Show the code
time_visit_wday_summary_week <-
  time_visit_wday |>
  ungroup() |>
  mutate(week_start = floor_date(date_time, "week")) |>
  mutate(week_num = week(date_time), year_num = year(date_time))
Show the code
time_visit_wday_summary_week_summarized <-
  time_visit_wday_summary_week |>
  group_by(year_num, week_num) |>
  summarise(n = n())

time_visit_wday_summary_week_summarized
Show the code
time_visit_wday_summary_week_summarized_dateformat <-
  time_visit_wday_summary_week |>
  group_by(week_start) |>
  summarise(n = n())
Show the code
time_visit_wday_summary_week_summarized_dateformat |>
  ggplot(aes(x = week_start, y = n)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  geom_smooth(method = "gam", se = FALSE, color = "blue") +
  labs(
    title = "The number of visits is increasing and reflects the teaching periods of the semesters.",
    x = "week number/year"
  )

Important

The number of visits has increased over time.

5.3.3 Cumulative visits over time

5.3.3.1 Month - idvisit

Show the code
time_visit_wday_summary |>
  group_by(year_num, month_start) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(n_cumsum = cumsum(n)) |>
  ggplot(aes(x = month_start, y = n_cumsum)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm") +
  labs(title = "Visits have increased linearly over time.", x = "month/year")

5.3.3.2 Month - fingerprint

Show the code
time_visit_wday_summary_fingerprint |>
  group_by(year_num, month_start) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(n_cumsum = cumsum(n)) |>
  ggplot(aes(x = month_start, y = n_cumsum)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm") +
  labs(title = "Visits have increased linearly over time.", x = "month/year")

5.3.3.3 Week

Show the code
time_visit_wday_summary_week |>
  group_by(year_num, week_start) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(n_cumsum = cumsum(n)) |>
  ggplot(aes(x = week_start, y = n_cumsum)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm") +
  labs(
    title = "Visits have increased approx. linearly over time.",
    x = "week/year"
  )

5.4 Statistics

The following statistics are based on the dataset data_separated_filtered:

5.4.1 idvisit

Show the code
glimpse(data_separated_filtered)
Rows: 4,477,584
Columns: 5
$ nr          <int> 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5…
$ type        <fct> subtitle, timestamp, eventcategory, eventaction, timestamp…
$ value       <fct> "https://hans.th-nuernberg.de/", "2023-03-23 18:37:56", "c…
$ idvisit     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ fingerprint <fct> aa8a78771b4f21ff, aa8a78771b4f21ff, aa8a78771b4f21ff, aa8a…

nr holds the running number of the action within a given visit.

5.4.2 fingerprint

Show the code
data_separated_filtered |>
  distinct(fingerprint, .keep_all = TRUE) |>
  glimpse()
Rows: 7,160
Columns: 5
$ nr          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ type        <fct> subtitle, subtitle, subtitle, subtitle, subtitle, subtitle…
$ value       <fct> "https://hans.th-nuernberg.de/", "https://hans.th-nuernber…
$ idvisit     <int> 1, 3, 6, 7, 8, 10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,…
$ fingerprint <fct> aa8a78771b4f21ff, 1f026ad3cbbdf325, 518965d4e1ae7e2d, aa95…

5.5 With all data (including the visits capped at 499)

5.5.1 idvisit

Show the code
tbl_n_action <-
  n_action |>
  describe_distribution(nr_max, centrality = c("median", "mean"))

tbl_n_action

nr_max is the maximum value of nr and thus indicates how many actions were performed during a visit.

A closer look at the number of actions per visit shows that the maximum value (499) occurs very often:

Show the code
n_action |>
  count(nr_max) |>
  ggplot(aes(x = nr_max, y = n)) +
  geom_col() +
  geom_vline(
    xintercept = tbl_n_action$Median,
    color = "blue",
    linetype = "dashed"
  ) +
  labs(
    caption = "The vertical dashed line shows the median.",
    title = "Most users do only a few actions, but some do many."
  )

Important

Most users perform only a few actions per visit, but some perform a great many.

Here is the same distribution in a different representation:

Show the code
n_action |>
  count(nr_max) |>
  ggplot(aes(x = nr_max, y = n)) +
  geom_point()

The maximum value is conspicuously frequent:

Show the code
n_action |>
  count(nr_max == 499) |>
  gt()
nr_max == 499 n
FALSE 13626
TRUE 581

It seems plausible that the maximum value collects all “capped” (censored, truncated) values, i.e., many values that would actually be larger but were cut off at 499.
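To put the size of this group into perspective, the share of capped visits can be computed from the two counts in the table above. The second half of the sketch uses made-up numbers (not project data) to illustrate why capped values bias the mean downwards, which is the reason for reporting results both with and without these visits.

```r
# Counts reported in the table above
n_capped <- 581
n_not_capped <- 13626
share_capped <- n_capped / (n_capped + n_not_capped)
round(100 * share_capped, 1) # about 4.1 percent of all visits

# Toy illustration with made-up values: capping pulls the mean downwards
true_actions <- c(10, 50, 700, 1200)
logged_actions <- pmin(true_actions, 499) # what a log capped at 499 would record
mean(true_actions) # 490
mean(logged_actions) # 264.5
```

If the cap interpretation is right, means that include the 499 group underestimate the true number of actions, whereas the median is unaffected as long as fewer than half of the visits are capped.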

5.5.2 fingerprint

Show the code
tbl_n_action_fingerprint <-
  n_action_fingerprint |>
  describe_distribution(nr_max, centrality = c("median", "mean"))

tbl_n_action_fingerprint
Show the code
n_action_fingerprint |>
  count(nr_max) |>
  ggplot(aes(x = nr_max, y = n)) +
  geom_col() +
  geom_vline(
    xintercept = tbl_n_action_fingerprint$Median,
    color = "blue",
    linetype = "dashed"
  ) +
  labs(
    caption = "The vertical dashed line shows the median.",
    title = "Most users do only a few actions, but some do many."
  )

5.6 Only visitors with fewer than 500 logged actions

5.6.1 idvisit

Show the code
n_action_lt_500 <-
  n_action |>
  filter(nr_max != 499)

n_action_lt_500 |>
  describe_distribution(nr_max) |>
  gt() |>
  fmt_number(columns = where(is.numeric), decimals = 2)
Variable Mean SD IQR Min Max Skewness Kurtosis n n_Missing
nr_max 61.88 88.53 77.00 1.00 496.00 2.27 5.47 13,626.00 0.00

5.6.2 fingerprint

Show the code
n_action_lt_500_fingerprint <-
  n_action_fingerprint |>
  filter(nr_max != 499)

n_action_lt_500_fingerprint |>
  describe_distribution(nr_max) |>
  gt() |>
  fmt_number(columns = where(is.numeric), decimals = 2)
Variable Mean SD IQR Min Max Skewness Kurtosis n n_Missing
nr_max 75.78 99.73 100.00 1.00 496.00 1.88 3.31 6,771.00 0.00

6 Courses

6.1 Number of courses by university

6.1.1 idvisit

Show the code
course_and_uni_per_visit |>
  count(university)
Show the code
course_and_uni_per_visit |>
  count(university) |>
  drop_na() |>
  ggplot(aes(y = reorder(university, n), x = n)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "TH Nürnberg hosts the most courses on HaNS by far.",
    y = "University"
  )

6.1.2 fingerprint

Show the code
course_and_uni_per_visit |>
  distinct(fingerprint, .keep_all = TRUE) |>
  count(university) |>
  ggplot(aes(y = reorder(university, n), x = n)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "TH Nürnberg hosts the most courses on HaNS by far.",
    y = "University"
  )

6.2 Visits by course and year

6.2.1 idvisit

Show the code
time_spent_w_course_university |>
  count(year, course)
Show the code
time_spent_w_course_university |>
  count(year, course) |>
  drop_na() |>
  ggplot(aes(x = n, y = course, fill = factor(year))) +
  geom_col(position = "dodge") +
  labs(title = "The course 'GeSOA' is the most active course on HaNS.")

6.2.2 fingerprint

Show the code
time_spent_w_course_university |>
  distinct(fingerprint, .keep_all = TRUE) |>
  count(year, course) |>
  drop_na() |>
  ggplot(aes(x = n, y = course, fill = factor(year))) +
  geom_col(position = "dodge") +
  labs(title = "The course 'GeSOA' is the most active course on HaNS.")

7 Actions per visit/fingerprint

7.1 Statistics per visit

Show the code
n_actions_searches_interactions <-
  data_prepped |>
  select(
    idvisit,
    fingerprint,
    any_of(c(
      "searches",
      "actions",
      "interactions",
      "referrertype",
      "referrername",
      "language",
      "devicetype",
      "devicemodel",
      "operatingsystem",
      "browsername"
    ))
  )

7.1.1 Unique IDs, Fingerprints, Mean searches, Mean actions

This section reports the number of unique visit IDs and unique fingerprints, as well as the mean number of searches and actions per visit.

7.1.1.1 idvisit and fingerprint

Show the code
n_actions_searches_interactions |>
  as.data.frame() |>
  summarise(
    idvisit_n = length(unique(idvisit)),
    fingerprint_n = length(unique(fingerprint)),
    actions_mean = mean(as.integer(actions), na.rm = TRUE),
    searches_mean = mean(as.integer(searches), na.rm = TRUE)
  )
Note

There are about twice as many visits as unique users (fingerprints).
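The ratio can be checked against counts printed elsewhere in this report: 14207 visits (rows of n_action, one per idvisit) and 7160 distinct fingerprints in data_separated_filtered. The sketch below only redoes that arithmetic and assumes that both counts refer to the same underlying set of visits.

```r
n_visits <- 14207 # rows in n_action, one per idvisit
n_fingerprints <- 7160 # distinct fingerprints in data_separated_filtered
visits_per_fingerprint <- n_visits / n_fingerprints
round(visits_per_fingerprint, 2) # 1.98
```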

7.1.2 Referrer type per visit

Show the code
n_actions_searches_interactions |>
  count(referrertype, sort = TRUE)

7.1.3 Referrer name per visit

Show the code
n_actions_searches_interactions |>
  count(referrername, sort = TRUE)

7.1.4 devicemodel

Show the code
n_actions_searches_interactions |>
  count(devicemodel, sort = TRUE) |>
  slice_head(n = 10)

7.1.5 operatingsystem

Show the code
n_actions_searches_interactions |>
  count(operatingsystem, sort = TRUE) |>
  slice_head(n = 10)

7.1.6 browsername

Show the code
n_actions_searches_interactions |>
  count(browsername, sort = TRUE) |>
  slice_head(n = 10)

Mac users appear to be particularly active on HaNS.

7.2 Actions per idvisit/fingerprint - with the visits capped at 499

7.2.1 idvisit

Show the code
n_action_avg = mean(n_action$nr_max) |> round(0)
n_action_median = median(n_action$nr_max) |> round(0)
n_action_sd = sd(n_action$nr_max) |> round(0)
n_action_iqr = IQR(n_action$nr_max) |> round(0)

n_action |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    caption = "The vertical line shows the mean; the horizontal one mean±SD"
  ) +
  theme_minimal() +
  geom_vline(xintercept = n_action_avg, color = palette_okabe_ito()[1]) +
  geom_segment(
    x = n_action_avg - n_action_sd,
    y = 0,
    xend = n_action_avg + n_action_sd,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_avg,
    y = 1500,
    label = paste0("Mean = ", n_action_avg)
  ) +
  annotate(
    "label",
    x = n_action_avg + n_action_sd,
    y = 0,
    label = paste0("SD = ", n_action_sd)
  )

Show the code
#geom_label(aes(x = n_action_avg), y = 1, label = "Mean")

n_action |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    caption = "The vertical line shows the median; the horizontal one median±IQR"
  ) +
  theme_minimal() +
  geom_vline(xintercept = n_action_median, color = palette_okabe_ito()[1]) +
  geom_segment(
    x = n_action_median - n_action_iqr,
    y = 0,
    xend = n_action_median + n_action_iqr,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_median,
    y = 1500,
    label = paste0("Md = ", n_action_median)
  ) +
  annotate(
    "label",
    x = n_action_median + n_action_iqr,
    y = 0,
    label = paste0("IQR = ", n_action_iqr)
  )

Show the code
#geom_label(aes(x = n_action_avg), y = 1, label = "Mean")
  • Mean number of actions per visit: 80.
  • SD of actions per visit: 123.
  • Median: 27.
  • IQR: 88.

7.2.2 fingerprint

Show the code
n_action_fingerprint_avg = mean(n_action_fingerprint$nr_max) |> round(0)
n_action_fingerprint_median = median(n_action_fingerprint$nr_max) |> round(0)
n_action_fingerprint_sd = sd(n_action_fingerprint$nr_max) |> round(0)
n_action_fingerprint_iqr = IQR(n_action_fingerprint$nr_max) |> round(0)

n_action_fingerprint |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    caption = "The vertical line shows the mean; the horizontal one mean±SD"
  ) +
  theme_minimal() +
  geom_vline(
    xintercept = n_action_fingerprint_avg,
    color = palette_okabe_ito()[1]
  ) +
  geom_segment(
    x = n_action_fingerprint_avg - n_action_fingerprint_sd,
    y = 0,
    xend = n_action_fingerprint_avg + n_action_fingerprint_sd,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_avg,
    y = 1500,
    label = paste0("Mean = ", n_action_fingerprint_avg)
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_avg + n_action_fingerprint_sd,
    y = 0,
    label = paste0("SD = ", n_action_fingerprint_sd)
  )

Show the code
#geom_label(aes(x = n_action_fingerprint_avg), y = 1, label = "Mean")

n_action_fingerprint |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    caption = "The vertical line shows the median; the horizontal one median±IQR"
  ) +
  theme_minimal() +
  geom_vline(
    xintercept = n_action_fingerprint_median,
    color = palette_okabe_ito()[1]
  ) +
  geom_segment(
    x = n_action_fingerprint_median - n_action_fingerprint_iqr,
    y = 0,
    xend = n_action_fingerprint_median + n_action_fingerprint_iqr,
    yend = 0,
    color = palette_okabe_ito()[2],
    size = 2
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_median,
    y = 1500,
    label = paste0("Md = ", n_action_fingerprint_median)
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_median + n_action_fingerprint_iqr,
    y = 0,
    label = paste0("IQR = ", n_action_fingerprint_iqr)
  )

Show the code
#geom_label(aes(x = n_action_fingerprint_avg), y = 1, label = "Mean")

7.3 Without the visits capped at 499

7.3.1 idvisit

Show the code
n_action_avg2 = mean(n_action_lt_500$nr_max) |> round(0)
n_action_sd2 = sd(n_action_lt_500$nr_max) |> round(2)

n_action_lt_500 |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    title = "Distribution of user actions per visit",
    caption = "The vertical line shows the mean; the horizontal one the SD"
  ) +
  theme_minimal() +
  geom_vline(xintercept = n_action_avg2, color = palette_okabe_ito()[1]) +
  geom_segment(
    x = n_action_avg2 - n_action_sd2, # mean of the filtered data, not n_action_avg
    y = 0,
    xend = n_action_avg2 + n_action_sd2,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_avg2,
    y = 1500,
    label = paste0("Mean = ", n_action_avg2)
  ) +
  annotate(
    "label",
    x = n_action_avg2 + n_action_sd2,
    y = 0,
    label = paste0("SD = ", n_action_sd2)
  )

Show the code
#geom_label(aes(x = n_action_avg), y = 1, label = "Mean")
  • Mean number of actions per visit: 62.
  • SD of actions per visit: 88.53.

7.3.2 fingerprint

Show the code
n_action_fingerprint_avg2 = mean(n_action_lt_500_fingerprint$nr_max) |> round(0)
n_action_fingerprint_sd2 = sd(n_action_lt_500_fingerprint$nr_max) |> round(2)

n_action_lt_500_fingerprint |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    title = "Distribution of user actions per visit",
    caption = "The vertical line shows the mean; the horizontal one the SD"
  ) +
  theme_minimal() +
  geom_vline(
    xintercept = n_action_fingerprint_avg2,
    color = palette_okabe_ito()[1]
  ) +
  geom_segment(
    x = n_action_fingerprint_avg2 - n_action_fingerprint_sd2, # mean of the filtered data
    y = 0,
    xend = n_action_fingerprint_avg2 + n_action_fingerprint_sd2,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_avg2,
    y = 1500,
    label = paste0("Mean = ", n_action_fingerprint_avg2)
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_avg2 + n_action_fingerprint_sd2,
    y = 0,
    label = paste0("SD = ", n_action_fingerprint_sd2)
  )

Show the code
#geom_label(aes(x = n_action_avg), y = 1, label = "Mean")

7.4 Number of actions over time

7.4.1 Month

7.4.1.1 idvisit

Show the code
n_action_w_date |>
  ggplot(aes(x = month_date, y = nr_max)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(
    fun.data = mean_sdl,
    fun.args = list(mult = 1),
    geom = "errorbar",
    width = 0.2
  ) +
  geom_smooth(method = "lm") +
  labs(title = "The number of actions per visit has increased over time")

Show the code
n_action_w_date |>
  ggplot(aes(x = month_date, y = nr_max)) +
  geom_jitter(alpha = .1)

7.4.1.2 fingerprint

Show the code
n_action_w_date_fingerprint |>
  ggplot(aes(x = month_date, y = nr_max)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(
    fun.data = mean_sdl,
    fun.args = list(mult = 1),
    geom = "errorbar",
    width = 0.2
  ) +
  geom_smooth(method = "lm") +
  labs(title = "The number of actions per visit has increased over time")

Show the code
n_action_w_date_fingerprint |>
  ggplot(aes(x = month_date, y = nr_max)) +
  geom_jitter(alpha = .1)

7.4.2 Regression (month)

7.4.2.1 idvisit

Show the code
lm(nr_max ~ month_date, data = n_action_w_date)

Call:
lm(formula = nr_max ~ month_date, data = n_action_w_date)

Coefficients:
(Intercept)   month_date  
 -5.956e+02    3.937e-07  
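The slope is hard to read because the predictor is a date-time. Assuming month_date is stored as POSIXct (an assumption; the column type is not shown in this report), the coefficient is the change in nr_max per second. The following arithmetic sketch converts the reported idvisit slope into more interpretable units.

```r
# Slope from the idvisit regression above, in actions per second
# (assumes month_date is POSIXct, i.e. measured in seconds)
slope_per_second <- 3.937e-07

seconds_per_day <- 60 * 60 * 24
slope_per_day <- slope_per_second * seconds_per_day
slope_per_year <- slope_per_day * 365.25

round(slope_per_day, 3) # 0.034 additional actions per visit and day
round(slope_per_year, 1) # 12.4 additional actions per visit and year
```

Under this assumption, the fitted line corresponds to roughly twelve more actions per visit for each year of the project; the same conversion applies to the other regressions in this section.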

7.4.2.2 fingerprint

Show the code
lm(nr_max ~ month_date, data = n_action_w_date_fingerprint)

Call:
lm(formula = nr_max ~ month_date, data = n_action_w_date_fingerprint)

Coefficients:
(Intercept)   month_date  
 -1.186e+03    7.503e-07  

7.4.3 Week

7.4.3.1 idvisit

Show the code
n_action_w_date |>
  mutate(week_date = as.Date(week_date)) |>
  ggplot(aes(x = week_date, y = nr_max)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(fun.data = mean_sdl, geom = "errorbar", width = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "The number of actions per visit has increased over time")

7.4.3.2 fingerprint

Show the code
n_action_w_date_fingerprint |>
  mutate(week_date = as.Date(week_date)) |>
  ggplot(aes(x = week_date, y = nr_max)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(fun.data = mean_sdl, geom = "errorbar", width = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "The number of actions per fingerprint has increased over time")

7.4.4 Regression (week)

7.4.4.1 idvisit

Show the code
lm(nr_max ~ week_date, data = n_action_w_date)

Call:
lm(formula = nr_max ~ week_date, data = n_action_w_date)

Coefficients:
(Intercept)    week_date  
  -5.93e+02     3.92e-07  

7.4.4.2 fingerprint

Show the code
lm(nr_max ~ week_date, data = n_action_w_date_fingerprint)

Call:
lm(formula = nr_max ~ week_date, data = n_action_w_date_fingerprint)

Coefficients:
(Intercept)    week_date  
 -1.178e+03    7.453e-07  

7.5 Grouping visits/fingerprints by number of actions

7.5.1 idvisit

Show the code
n_action_lt_500 <-
  n_action_lt_500 |>
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  )
Show the code
n_action_lt_500 |>
  count(n_actions_type) |>
  gt()
n_actions_type n
glimpser 7388
heavy user 465
serious user 5773
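The same three-way grouping can also be expressed with base R's cut(), which returns an ordered set of factor levels so that tables and plots list the groups from light to heavy use rather than alphabetically. This is an alternative sketch on made-up values, not the code used in the pipeline.

```r
# Made-up example values around the two thresholds (30 and 300)
nr_max <- c(1, 29, 30, 299, 300, 496)

# right = FALSE gives the intervals [-Inf, 30), [30, 300), [300, Inf),
# matching nr_max < 30 and nr_max < 300 in the case_when() call above
n_actions_type <- cut(
  nr_max,
  breaks = c(-Inf, 30, 300, Inf),
  labels = c("glimpser", "serious user", "heavy user"),
  right = FALSE
)

table(n_actions_type)
```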
Show the code
ggplot(n_action_lt_500) +
  aes(x = n_actions_type) +
  geom_bar()

7.5.2 fingerprint

Show the code
n_action_lt_500_fingerprint <-
  n_action_lt_500_fingerprint |>
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  )
Show the code
n_action_lt_500_fingerprint |>
  count(n_actions_type) |>
  gt()
n_actions_type n
glimpser 3269
heavy user 334
serious user 3168
Show the code
ggplot(n_action_lt_500_fingerprint) +
  aes(x = n_actions_type) +
  geom_bar()

7.6 Grouping of visits over time

7.6.1 idvisit

Show the code
n_action_w_date |>
  group_by(month_date) |>
  count(nr_max) |>
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  ) |>
  count(n_actions_type) |>
  ggplot(aes(
    x = month_date,
    y = n,
    color = n_actions_type,
    group = n_actions_type
  )) +
  geom_point() +
  geom_line()

7.6.2 fingerprint

Show the code
n_action_w_date_fingerprint |>
  group_by(month_date) |>
  count(nr_max) |>
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  ) |>
  count(n_actions_type) |>
  ggplot(aes(
    x = month_date,
    y = n,
    color = n_actions_type,
    group = n_actions_type
  )) +
  geom_point() +
  geom_line()

8 Time spent per visit

8.1 How time spent is computed

Time spent was computed as the difference between the smallest and the largest datetime value (POSIXct) of a visit (that is, per value of the variable idvisit), see [function diff_time](https://github.com/sebastiansauer/hans/blob/main/funs/diff_time.R). This variable is called time_diff in the object time_spent.

The computation is based on the object data_separated_filtered, see the definition of the target “time_spent” in the targets pipeline.
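
The computation described above can be sketched as follows. This is a minimal illustration on hypothetical toy data, not the project's diff_time function: the object names actions and time_spent_sketch and the column name timestamp are assumptions made for this example.

```r
library(dplyr)

# Hypothetical toy data: one row per logged action (not the real HaNS data)
actions <- tibble(
  idvisit = c(1, 1, 1, 2, 2),
  timestamp = as.POSIXct(
    c(
      "2024-01-10 10:00:00", "2024-01-10 10:05:30", "2024-01-10 10:20:00",
      "2024-01-11 09:00:00", "2024-01-11 09:00:45"
    ),
    tz = "UTC"
  )
)

time_spent_sketch <-
  actions |>
  group_by(idvisit) |>
  summarise(
    time_min = min(timestamp), # first action of the visit
    time_max = max(timestamp), # last action of the visit
    # last minus first timestamp, with the unit fixed to seconds
    time_diff = difftime(time_max, time_min, units = "secs")
  )

time_spent_sketch
```

Fixing the unit explicitly in difftime() avoids the automatic unit selection that would otherwise depend on the size of the difference.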

8.2 Preprocessing

Visits lasting 600 minutes or longer were excluded.

8.2.1 idvisit

Show the code
time_spent |>
  head(30)
Show the code
time_spent <-
  time_spent |>
  # compute time (t) in minutes (min):
  mutate(t_minutes = as.numeric(time_diff, units = "mins")) |>
  filter(t_minutes < 600)

8.2.2 fingerprint

Show the code
time_spent_fingerprint |>
  head(30)
Show the code
time_spent_fingerprint <-
  time_spent_fingerprint |>
  # compute time (t) in minutes (min):
  mutate(t_minutes = as.numeric(time_diff, units = "mins")) |>
  filter(t_minutes < 600)
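
Because this chapter works with both seconds and minutes, it can help to make the unit conversion explicit. The following base R sketch uses hypothetical durations (the object d is an assumption for illustration) and mirrors the conversion and the 600-minute filter used above.

```r
# Hypothetical durations; a difftime object carries its own unit
d <- as.difftime(c(90, 600, 36000), units = "secs")

# Unit currently attached to d
units(d)

# Explicit conversion to minutes, independent of the attached unit
t_minutes <- as.numeric(d, units = "mins")
t_minutes

# Keep only durations shorter than 600 minutes, as in the filter above
t_minutes_kept <- t_minutes[t_minutes < 600]
t_minutes_kept
```

Dividing a difftime by 60 is only correct when its unit is known to be seconds, whereas as.numeric() with an explicit units argument does not rely on that assumption.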

8.3 Time-spent statistics in seconds

The statistics below are based on the computation described above (in seconds).

8.3.1 idvisit

Show the code
time_spent |>
  summarise(
    mean_time_diff = round(mean(time_diff), 2),
    sd_time_diff = sd(time_diff),
    min_time_diff = min(time_diff), # shortest duration
    max_time_diff = max(time_diff) # longest
  )

8.3.2 fingerprint

Show the code
time_spent_fingerprint |>
  summarise(
    mean_time_diff = round(mean(time_diff), 2),
    sd_time_diff = sd(time_diff),
    min_time_diff = min(time_diff), # shortest duration
    max_time_diff = max(time_diff) # longest
  )

8.4 Time spent based on the variable visitduration

8.4.1 For all data

As an alternative to the computed time spent, the data contain a variable, visitduration, which apparently measures (or is meant to measure) the duration of a visit.

However, using this variable yields substantially different values, see the target time_duration in the targets pipeline.

Show the code
time_duration |>
  head(30)
Show the code
time_duration |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)

8.4.2 For unique idvisits

Show the code
time_duration |>
  distinct(idvisit, .keep_all = TRUE) |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)

8.4.3 For unique fingerprints

Show the code
time_duration |>
  distinct(fingerprint, .keep_all = TRUE) |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)

8.5 Time-spent statistics in minutes

Show the code
# store the summary so that the table below can use it:
time_spent_summary <-
  time_spent |>
  mutate(time_diff_minutes = time_length(time_diff, unit = "minute")) |>
  summarise(
    mean_time_diff = round(mean(time_diff_minutes), 2),
    sd_time_diff = sd(time_diff_minutes),
    min_time_diff = min(time_diff_minutes), # shortest duration
    max_time_diff = max(time_diff_minutes) # longest
  )

time_spent_summary
Show the code
small_padding_theme <- ggpubr::ttheme(
  tbody.style = tbody_style(size = 8), # Smaller font size can help
  colnames.style = colnames_style(size = 9, face = "bold"),
  padding = unit(c(2, 2), "mm") # Reduce horizontal and vertical padding
)
Show the code
ggpubr::ggtexttable(
  time_spent_summary,
  rows = NULL,
  theme = small_padding_theme
)

8.6 Visualizing time spent

8.6.1 Bin width = 10 minutes

Show the code
time_spent |>
  mutate(time_diff_minutes = time_diff / 60) |>
  ggplot(aes(x = time_diff_minutes)) + # minutes
  geom_histogram(binwidth = 10) +
  #scale_x_time() +
  theme_minimal() +
  labs(y = "n", x = "Verweildauer in HaNS pro Visit in d:h:m") +
  scale_x_time(breaks = pretty_breaks())

8.6.2 Bin width = 20 minutes

Show the code
time_spent |>
  mutate(time_diff_minutes = time_diff / 60) |>
  ggplot(aes(x = time_diff_minutes)) + # minutes
  geom_histogram(binwidth = 20) +
  theme_minimal() +
  labs(
    y = "n",
    x = "Verweildauer",
    title = "Verweildauer in HaNS pro Visit in d:h:m"
  ) +
  scale_x_time(breaks = pretty_breaks())

8.6.3 Duration restricted to 1-120 min.

Show the code
time_spent2 <-
  time_spent |>
  # time_diff is in seconds; t_minutes holds the duration in minutes
  filter(t_minutes > 1, t_minutes < 120)

time_spent2 |>
  ggplot(aes(x = t_minutes)) +
  geom_histogram(binwidth = 10) +
  theme_minimal() +
  labs(
    y = "n",
    x = "Verweildauer in HaNS pro Visit in Minuten",
    title = "Verweildauer begrenzt auf 1-120 Minuten",
    caption = "binwidth = 10 Min."
  )

8.6.4 Change in time spent over time

8.6.4.1 Month

The variable time_diff in time_spent is measured in seconds.

Show the code
time_spent_by_month <-
  time_spent |>
  mutate(month_start = floor_date(time_min, "month")) |>
  mutate(
    month_name = month(month_start, label = TRUE, abbr = FALSE),
    month_num = month(month_start, label = FALSE),
    year = year(month_start)
  ) |>
  group_by(month_num, year) |>
  summarise(
    time_spent_month_avg = mean(time_diff, na.rm = TRUE),
    time_spent_month_sd = sd(time_diff, na.rm = TRUE)
  ) |>
  arrange(year, month_num)

time_spent_by_month
Show the code
time_spent_by_month |>
  mutate(
    time_spent_month_avg = round(time_spent_month_avg, 2),
    time_spent_month_sd = round(time_spent_month_sd, 2)
  ) |>
  ggtexttable()

Show the code
time_spent_by_month_name <-
  time_spent |>
  mutate(month_start = floor_date(time_min, "month")) |>
  mutate(
    month_name = month(month_start, label = TRUE, abbr = FALSE),
    month_num = month(month_start, label = FALSE),
    year = year(month_start)
  ) |>
  group_by(month_start, year) |>
  summarise(
    time_spent_month_avg = mean(time_diff, na.rm = TRUE),
    time_spent_month_sd = sd(time_diff, na.rm = TRUE)
  )

time_spent_by_month_name |>
  ggplot(aes(x = month_start, y = time_spent_month_avg)) +
  geom_line(group = 1, color = "grey60") +
  geom_point()

8.6.4.2 Year

Show the code
time_spent_by_year <-
  time_spent |>
  mutate(month_start = floor_date(time_min, "month")) |>
  mutate(
    month_name = month(month_start, label = TRUE, abbr = FALSE),
    month_num = month(month_start, label = FALSE),
    year = year(month_start)
  ) |>
  group_by(year) |>
  summarise(
    time_spent_avg = mean(time_diff, na.rm = TRUE),
    time_spent_sd = sd(time_diff, na.rm = TRUE)
  )

time_spent_by_year
Show the code
time_spent_by_year |>
  ggplot(aes(x = year, y = time_spent_avg)) +
  geom_line(group = 1, color = "grey60") +
  geom_point()

8.6.4.3 Week

Show the code
time_spent_by_week_name <-
  time_spent |>
  mutate(week_start = floor_date(time_min, "week")) |>
  mutate(week_num = week(week_start), year = year(week_start)) |>
  group_by(week_start, year) |>
  summarise(
    time_spent_week_avg = mean(time_diff, na.rm = TRUE),
    time_spent_week_sd = sd(time_diff, na.rm = TRUE)
  )

time_spent_by_week_name |>
  ggplot(aes(x = week_start, y = time_spent_week_avg)) +
  geom_line(group = 1, color = "grey60") +
  geom_point()

8.7 Relationship between courses and time spent

Show the code
time_spent_w_course_university_summary <-
  time_spent_w_course_university |>
  group_by(floor_date_month) |>
  summarise(
    distinct_courses_n = n_distinct(course),
    diff_time_mean = mean(time_diff, na.rm = TRUE),
    n = n()
  )

time_spent_w_course_university_summary
Show the code
time_spent_w_course_university_summary |>
  ggplot(aes(x = distinct_courses_n, y = diff_time_mean)) +
  geom_point()

8.8 Relationship between courses and number of visits

Show the code
time_spent_w_course_university_summary |>
  ggplot(aes(x = distinct_courses_n, y = n)) +
  geom_point() +
  labs(y = "No. of visits per month", x = "No. of distinct courses per month")

9 What do users do?

What do visitors actually do, and how often?

9.1 Frequencies

For the object n_action_type, the column subtitle of the long-format data was evaluated, see the definition of the function count_user_action_type.

Show the code
n_action_type |>
  head(30)

Note: It may be more sensible to use the analysis based on eventcategory instead of this one. That analysis covers all types of events, whereas the present one covers only selected events.

9.1.1 By selected categories

Show the code
n_action_type_counted <-
  n_action_type |>
  drop_na() |>
  count(category, sort = TRUE) |>
  mutate(prop = round(n / sum(n), 2))

n_action_type_counted |>
  gt()
category n prop
video 845813 0.84
click_slideChange 61934 0.06
visit_page 55551 0.06
Media item 17485 0.02
login 6550 0.01
in_media_search 3422 0.00
Search Results Count 2856 0.00
click_topic 2799 0.00
Medien 1646 0.00
logout 1495 0.00
Kanäle 1395 0.00
GESOA 1358 0.00
click_channelcard 848 0.00
Evaluation 183 0.00
Data protection 39 0.00

9.1.2 By category over time

Show the code
n_action_type_per_month <-
  n_action_type |>
  select(nr, idvisit, category) |>
  ungroup() |>
  left_join(time_visit_wday |> ungroup()) |>
  select(-c(dow, hour, nr)) |>
  drop_na() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  count(month_start, category)
Show the code
n_action_type_per_month

9.1.3 Top 3 categories only

9.1.3.1 idvisit

Show the code
time_visit_wday |>
  head(30)
Show the code
n_action_type_per_month_top3 <-
  n_action_type |>
  select(nr, idvisit, category) |>
  ungroup() |>
  filter(category %in% c("video", "click_slideChange", "visit_page")) |>
  left_join(time_visit_wday |> ungroup()) |>
  select(-c(dow, hour, nr)) |>
  drop_na() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  count(month_start, category)
Show the code
n_action_type_per_month_top3
Show the code
n_action_type_per_month_top3 |>
  ggplot(aes(x = month_start, y = n, color = category, group = category)) +
  geom_line()

9.1.3.2 fingerprint

Show the code
time_visit_wday_fingerprint |>
  head(30)
Show the code
n_action_type_per_month_top3_fingerprint <-
  n_action_type |>
  select(nr, fingerprint, category) |>
  ungroup() |>
  filter(category %in% c("video", "click_slideChange", "visit_page")) |>
  left_join(time_visit_wday_fingerprint |> ungroup()) |>
  select(-c(dow, hour, nr)) |>
  drop_na() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  count(month_start, category)
Show the code
n_action_type_per_month_top3_fingerprint
Show the code
n_action_type_per_month_top3_fingerprint |>
  ggplot(aes(x = month_start, y = n, color = category, group = category)) +
  geom_line()

9.1.4 Top 3 per course

Show the code
n_action_type_course_uni <-
  n_action_type |>
  left_join(course_and_uni_per_visit |> mutate(idvisit = as.integer(idvisit)))
Show the code
n_action_type_per_month_top3_per_course <-
  n_action_type_course_uni |>
  filter(category %in% c("video", "click_slideChange", "visit_page")) |>
  drop_na() |>
  mutate(month_start = floor_date(actiondetails_0_timestamp, "month")) |>
  count(course, month_start, category)
Show the code
n_action_type_per_month_top3_per_course |>
  ggplot(aes(x = month_start, y = n, color = category, group = category)) +
  facet_wrap(~course, ncol = 3, scales = "free_y") +
  geom_line() +
  theme(legend.position = "bottom") +
  scale_x_date(date_labels = "%b %Y")

9.1.5 eventcategory

The following analysis uses a different variable than above, namely eventcategory. The results therefore differ somewhat.

Show the code
data_separated_filtered_count <-
  data_separated_filtered |>
  filter(type == "eventcategory") |>
  count(value, sort = TRUE) |>
  mutate(prop = n / sum(n))

data_separated_filtered_count
Show the code
data_separated_filtered_count |>
  ggtexttable()

Save as an Excel file:

Show the code
#data_separated_filtered_count |>
#  writexl::write_xlsx(path = "obj/data_separated_filtered_count.xlsx")

9.1.6 User types by activity

What is the main activity per user? Distribution of main activities.

9.1.6.1 idvisit

Show the code
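n_action_type_distro <-
  n_action_type |>
  ungroup() |>
  drop_na(category) |>
  # most frequent category per visit
  # (max() on strings would return the alphabetically last category):
  count(idvisit, category) |>
  group_by(idvisit) |>
  slice_max(n, n = 1, with_ties = FALSE) |>
  ungroup() |>
  count(category_max = category)

n_action_type_distro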
Show the code
n_action_type_distro |>
  ggplot(aes(x = n, y = category_max)) +
  geom_col()

9.1.6.2 fingerprint

Show the code
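n_action_type_distro_fingerpr <-
  n_action_type |>
  ungroup() |>
  drop_na(category) |>
  # most frequent category per fingerprint
  # (max() on strings would return the alphabetically last category):
  count(fingerprint, category) |>
  group_by(fingerprint) |>
  slice_max(n, n = 1, with_ties = FALSE) |>
  ungroup() |>
  count(category_max = category)

n_action_type_distro_fingerpr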
Show the code
n_action_type_distro_fingerpr |>
  ggplot(aes(x = n, y = category_max)) +
  geom_col()

9.2 Distribution

Show the code
n_action_type_counted <-
  n_action_type |>
  count(category, sort = TRUE)

9.2.1 Overall - raw values

Show the code
n_action_type_counted |>
  ggplot(aes(y = reorder(category, n), x = n)) +
  geom_col() +
  geom_bar_text() +
  labs(
    x = "Anzahl der User-Aktionen",
    y = "Aktion",
    title = "Anzahl der User-Aktionen nach Kategorie"
  ) +
  theme_minimal() +
  scale_x_continuous(labels = scales::comma)

9.2.2 Overall - log scale

Show the code
n_action_type_counted |>
  ggplot(aes(y = reorder(category, n), x = n)) +
  geom_col() +
  geom_bar_text() +
  labs(
    x = "Anzahl der User-Aktionen",
    y = "Aktion",
    title = "Anzahl der User-Aktionen nach Kategorie",
    caption = "Log10-Skala"
  ) +
  theme_minimal() +
  scale_x_log10()

9.2.3 Per course - log10 of the raw counts

Show the code
n_action_type_course_uni_counted <-
  n_action_type_course_uni |>
  group_by(course) |>
  count(category, sort = TRUE) |>
  drop_na()
Show the code
n_action_type_course_uni_counted |>
  ggplot() +
  aes(y = category, x = log(n, base = 10)) +
  geom_col() +
  facet_wrap(~course)

10 On which days and at what time do users come to HaNS?

10.1 Setup

10.1.1 idvisit

Show the code
# Define a vector with the names of the days of the week
# Note: Adjust the start of the week (Sunday or Monday) as per your requirement
days_of_week <- c(
  "Monday",
  "Tuesday",
  "Wednesday",
  "Thursday",
  "Friday",
  "Saturday",
  "Sunday"
)

# Replace numbers with day names
time_visit_wday$dow2 <- factor(
  days_of_week[time_visit_wday$dow],
  levels = days_of_week
)

10.1.2 fingerprint

Show the code
# Define a vector with the names of the days of the week
# Note: Adjust the start of the week (Sunday or Monday) as per your requirement
days_of_week <- c(
  "Monday",
  "Tuesday",
  "Wednesday",
  "Thursday",
  "Friday",
  "Saturday",
  "Sunday"
)

# Replace numbers with day names
time_visit_wday_fingerprint$dow2 <- factor(
  days_of_week[time_visit_wday_fingerprint$dow],
  levels = days_of_week
)

10.2 HaNS logins by time of day

10.2.1 idvisit

Show the code
time_visit_wday |>
  as_tibble() |>
  count(hour) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "HaNS-Nutzer sind keine Frühaufsteher",
    x = "Uhrzeit",
    y = "Anteil"
  )

Show the code
# coord_polar()

10.2.2 fingerprint

Show the code
time_visit_wday_fingerprint |>
  as_tibble() |>
  count(hour) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "HaNS-Nutzer sind keine Frühaufsteher",
    x = "Uhrzeit",
    y = "Anteil"
  )

Show the code
# coord_polar()
Show the code
time_visit_wday |>
  as_tibble() |>
  count(hour) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  theme_minimal() +
  coord_polar()

10.3 Distribution of HaNS visits by day of the week

10.3.1 idvisit

Show the code
time_visit_wday |>
  as_tibble() |>
  count(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = dow2, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen",
    x = "Wochentag",
    y = "Anteil"
  )

Show the code
# coord_polar()
Show the code
time_visit_wday |>
  as_tibble() |>
  count(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = dow2, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen",
    x = "Wochentag",
    y = "Anteil"
  ) +
  coord_polar()

10.3.1.1 fingerprint

Show the code
time_visit_wday_fingerprint |>
  as_tibble() |>
  count(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = dow2, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen",
    x = "Wochentag",
    y = "Anteil"
  )

Show the code
# coord_polar()
Show the code
time_visit_wday_fingerprint |>
  as_tibble() |>
  count(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = dow2, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen",
    x = "Wochentag",
    y = "Anteil"
  ) +
  coord_polar()

10.3.2 HaNS logins by day of the week and time of day

10.3.2.1 idvisit

Show the code
time_visit_wday |>
  as_tibble() |>
  count(dow2, hour) |>
  group_by(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  facet_wrap(~dow2) +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten",
    x = "Uhrzeit",
    y = "Anteil"
  )

Show the code
# coord_polar()
Show the code
time_visit_wday |>
  as_tibble() |>
  count(dow2, hour) |>
  group_by(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  facet_wrap(~dow2) +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten",
    x = "Uhrzeit",
    y = "Anteil"
  ) +
  coord_polar()

10.3.2.2 fingerprint

Show the code
time_visit_wday_fingerprint |>
  as_tibble() |>
  count(dow2, hour) |>
  group_by(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  facet_wrap(~dow2) +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten",
    x = "Uhrzeit",
    y = "Anteil"
  )

Show the code
# coord_polar()
Show the code
time_visit_wday_fingerprint |>
  as_tibble() |>
  count(dow2, hour) |>
  group_by(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  facet_wrap(~dow2) +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten",
    x = "Uhrzeit",
    y = "Anteil"
  ) +
  coord_polar()

10.4 Number of visits by date (days) and time of day (bin2d)

10.4.1 idvisit

Show the code
time2 <-
  time_visit_wday |>
  ungroup() |>
  mutate(date = as.Date(date_time)) |>
  mutate(month_start = floor_date(date_time, "month"))

time2 |>
  ggplot(aes(x = date, y = hour)) +
  geom_bin2d(binwidth = c(1, 1)) + # (1 day, 1 hour)
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(caption = "Each x-bin maps to one day") +
  # only one x scale is kept; a second scale_x_date() replaces the first
  scale_x_date(breaks = breaks_pretty())

10.4.2 fingerprint

Show the code
time2_fingerprint <-
  time_visit_wday_fingerprint |>
  ungroup() |>
  mutate(date = as.Date(date_time)) |>
  mutate(month_start = floor_date(date_time, "month"))

time2_fingerprint |>
  ggplot(aes(x = date, y = hour)) +
  geom_bin2d(binwidth = c(1, 1)) + # (1 day, 1 hour)
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(caption = "Each x-bin maps to one day") +
  # only one x scale is kept; a second scale_x_date() replaces the first
  scale_x_date(breaks = breaks_pretty())

10.5 Number of visits by date (weeks) and time of day (bin2d)

10.5.1 idvisit

Show the code
time2 |>
  ggplot(aes(x = date, y = hour)) +
  geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 hour
  scale_x_date(date_breaks = "1 week", date_labels = "%W") +
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(
    x = "Week number in 2023/2024",
    caption = "Each x-bin maps to one week"
  ) +
  scale_x_date(breaks = breaks_pretty())

10.5.2 fingerprint

Show the code
time2_fingerprint |>
  ggplot(aes(x = date, y = hour)) +
  geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 hour
  scale_x_date(date_breaks = "1 week", date_labels = "%W") +
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(
    x = "Week number in 2023/2024",
    caption = "Each x-bin maps to one week"
  ) +
  scale_x_date(breaks = breaks_pretty())

10.6 Number of visits by date (weeks) and day of the week (bin2d)

10.6.1 idvisit

Show the code
time2 |>
  ggplot(aes(x = date, y = dow)) +
  geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 day of the week
  scale_x_date(date_breaks = "1 week", date_labels = "%W") +
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(
    x = "Week number in 2023/2024",
    caption = "Each x-bin maps to one week",
    y = "Day of Week"
  ) +
  scale_y_continuous(breaks = 1:7) +
  scale_x_date(breaks = breaks_pretty())

10.6.2 fingerprint

Show the code
time2_fingerprint |>
  ggplot(aes(x = date, y = dow)) +
  geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 day of the week
  scale_x_date(date_breaks = "1 week", date_labels = "%W") +
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(
    x = "Week number in 2023/2024",
    caption = "Each x-bin maps to one week",
    y = "Day of Week"
  ) +
  scale_y_continuous(breaks = 1:7) +
  scale_x_date(breaks = breaks_pretty())

11 AI usage

11.1 Interaction with the LLM

Basis of computation: for this analysis, all events whose category contains the string llm were selected.

11.1.1 Type and number of interactions with the LLM

Show the code
data_separated_filtered_ai <-
  data_separated_filtered |>
  filter(type == "eventcategory") |>
  filter(str_detect(value, "llm")) |>
  count(value, sort = TRUE) |>
  mutate(prop = n / sum(n))

data_separated_filtered_ai
Show the code
data_separated_filtered_ai |>
  mutate(prop = round(prop, 3)) |>
  ggtexttable()

11.2 Number of message_to_llm events

Show the code
llm_interactions <-
  data_separated_filtered |>
  filter(str_detect(value, "message_to_llm"))

11.2.1 Distribution

Show the code
llm_interactions_count <-
  llm_interactions |>
  count(idvisit, sort = TRUE) |>
  rename(messages_to_llm_n = n)

llm_interactions_count |>
  describe_distribution(messages_to_llm_n, centrality = c("mean", "median"))

11.2.2 Plot

Show the code
gghistogram(
  llm_interactions_count,
  x = "messages_to_llm_n",
  bins = 10,
  add = "median"
) +
  labs(caption = "The vertical dotted line denotes the median.")

11.2.3 Share of visitors who interact with the LLM

11.2.3.1 idvisit

Show the code
data_separated_filtered_llm_interact <-
  data_separated_filtered |>
  mutate(has_llm = str_detect(value, "llm")) |>
  group_by(idvisit) |>
  summarise(llm_used_during_visit = any(has_llm == TRUE)) |>
  count(llm_used_during_visit) |>
  mutate(prop = round(n / sum(n), 2))

data_separated_filtered_llm_interact |>
  gt()
llm_used_during_visit n prop
FALSE 13419 0.94
TRUE 788 0.06
Show the code
data_separated_filtered_llm_interact |>
  ggtexttable()

11.2.3.2 fingerprint

Show the code
data_separated_filtered_llm_interact_fingerprint <-
  data_separated_filtered |>
  mutate(has_llm = str_detect(value, "llm")) |>
  group_by(fingerprint) |>
  summarise(llm_used_during_visit = any(has_llm == TRUE)) |>
  count(llm_used_during_visit) |>
  mutate(prop = round(n / sum(n), 2))

data_separated_filtered_llm_interact_fingerprint |>
  gt()
llm_used_during_visit n prop
FALSE 6649 0.93
TRUE 511 0.07
Show the code
data_separated_filtered_llm_interact_fingerprint |>
  ggtexttable()

11.2.4 … over time

Show the code
idvisit_has_llm |>
  head(30)
Show the code
idvisit_has_llm_timeline <-
  idvisit_has_llm |>
  count(year_month, uses_llm) |>
  ungroup() |>
  group_by(year_month) |>
  mutate(prop = round(n / sum(n), 2))

idvisit_has_llm_timeline
Show the code
idvisit_has_llm_timeline |>
  ggtexttable()

Show the code
idvisit_has_llm |>
  count(year_month, uses_llm) |>
  ungroup() |>
  mutate(year_month_date = ymd(paste0(year_month, "-01"))) |>
  group_by(year_month_date) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(
    x = year_month_date,
    y = prop,
    color = uses_llm,
    group = uses_llm
  )) +
  geom_point() +
  geom_line(aes(group = uses_llm)) +
  labs(
    title = "Visitors, die mit dem LLM interagieren im Zeitverlauf (Anteile)"
  ) +
  scale_x_date(breaks = pretty_breaks())

Show the code
idvisit_has_llm |>
  count(year_month, uses_llm) |>
  ungroup() |>
  mutate(year_month_date = ymd(paste0(year_month, "-01"))) |>
  group_by(year_month) |>
  ggplot(aes(x = year_month_date, y = n, color = uses_llm, group = uses_llm)) +
  geom_point() +
  geom_line(aes(group = uses_llm)) +
  labs(
    title = "Visitors, die mit dem LLM interagieren im Zeitverlauf (Anzahl)"
  ) +
  scale_x_date(breaks = pretty_breaks())

11.3 Number of interactions among users who interact with the LLM

Show the code
d_n_interactions_w_llm <-
  data_separated_filtered |>
  filter(type == "eventcategory") |>
  filter(str_detect(value, "llm")) |>
  group_by(idvisit) |>
  summarise(n_interactions_w_llm = n())
Show the code
d_n_interactions_w_llm |>
  select(n_interactions_w_llm) |>
  describe_distribution() |>
  print_md()
Variable Mean SD IQR Range Skewness Kurtosis n n_Missing
n_interactions_w_llm 165.59 216.45 459 (1.00, 500.00) 0.69 -1.46 640 0
Show the code
d_n_interactions_w_llm |>
  ggplot(aes(x = n_interactions_w_llm)) +
  geom_histogram()

11.4 Clicks on a word in the transcript

The following analysis evaluates the event “click_transcript_word”.

11.4.1 Overall

Show the code
data_separated_filtered |>
  filter(type == "subtitle") |>
  # rm empty rows:
  filter(!is.na(value) & value != "") |>
  count(click_transcript_word = str_detect(value, "click_transcript_word")) |>
  mutate(prop = round(n / sum(n), 2)) |>
  tt()
click_transcript_word n prop
FALSE 1138774 0.99
TRUE 8439 0.01

11.4.2 Over time

11.4.2.1 idvisit

Show the code
click_transcript_word_per_month <-
  data_separated_filtered |>
  # keep only visits with at least one "click_transcript_word" event:
  group_by(idvisit) |>
  filter(any(str_detect(value, "click_transcript_word"), na.rm = TRUE)) |>
  ungroup() |>
  mutate(date_visit = ymd_hms(value)) |>
  mutate(month_visit = floor_date(date_visit, unit = "month")) |>
  drop_na(date_visit) |>
  group_by(idvisit) |>
  slice(1) |>
  ungroup() |>
  count(month_visit)

click_transcript_word_per_month
Show the code
click_transcript_word_per_month |>
  ggplot(aes(x = month_visit, y = n)) +
  geom_line()

11.4.2.2 fingerprint

Show the code
click_transcript_word_per_month_fingerprint <-
  data_separated_filtered |>
  # keep only fingerprints with at least one "click_transcript_word" event:
  group_by(fingerprint) |>
  filter(any(str_detect(value, "click_transcript_word"), na.rm = TRUE)) |>
  ungroup() |>
  mutate(date_visit = ymd_hms(value)) |>
  mutate(month_visit = floor_date(date_visit, unit = "month")) |>
  drop_na(date_visit) |>
  group_by(fingerprint) |>
  slice(1) |>
  ungroup() |>
  count(month_visit)

click_transcript_word_per_month_fingerprint
Show the code
click_transcript_word_per_month_fingerprint |>
  ggplot(aes(x = month_visit, y = n)) +
  geom_line()

11.5 AI actions

11.5.1 Overall (entire period)

Show the code
data_long |>
  head(300)

11.5.2 In detail

Show the code
regex_pattern <- "Category: \"(.*?)(?=', Action)"

# Explaining this regex_pattern:
# 1. Match the literal string `Category: "` (including the opening double quote).
# 2. Capture any characters that follow, non-greedily (.*?), until ...
# 3. ... the lookahead (?=', Action) finds the literal sequence `', Action`
#    directly after the captured text. The lookahead is not part of the match.

ai_actions_count <-
  data_long |>
  # slice(1:1000) |>
  filter(str_detect(value, "transcript")) |>
  mutate(category = str_extract(value, regex_pattern)) |>
  select(category) |>
  mutate(category = str_replace_all(category, "[\"']", "")) |>
  count(category, sort = TRUE)

ai_actions_count |>
  tt()
category n
NA 217862
Category: clear_transcript_text_for_llm_context 104111
Category: click_transcript_word 8439
Category: select_transcript_text_for_llm_context 576
Category: click_button 43
Category: llm_response_de 3
Category: llm_response_en 3
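
To illustrate how the pattern above behaves, here is a minimal sketch on a hypothetical input string. The exact format of value in the real data is an assumption; only the pattern itself is taken from the code above.

```r
library(stringr)

regex_pattern <- "Category: \"(.*?)(?=', Action)"

# Hypothetical log strings (assumed format, not taken from the real data)
x <- c(
  "Category: \"click_transcript_word', Action: \"click\"",
  "transcript shown without any category"
)

# str_extract() returns the whole match (including the `Category: "` prefix);
# the lookahead part is not included. Strings without a match yield NA.
matched <- str_extract(x, regex_pattern)
matched

# Remove quotation marks, as in the pipeline above
cleaned <- str_replace_all(matched, "[\"']", "")
cleaned

# Alternative: str_match() returns the capture group in its second column
only_group <- str_match(x, regex_pattern)[, 2]
only_group
```

Rows containing "transcript" but no matching `Category: "` prefix end up as NA, which is one possible explanation for the large NA row in the table above.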

11.5.3 AI clicks per month

This object counts how often the string "click_transcript_word" occurs in the (long-format) data, see the target ai_transcript_clicks_per_month in the targets pipeline.

Show the code
ai_transcript_clicks_per_month |>
  head(30)
Show the code
ai_transcript_clicks_per_month_count <-
  ai_transcript_clicks_per_month |>
  count(year_month, clicks_transcript_any) |>
  ungroup() |>
  group_by(year_month) |>
  mutate(prop = round(n / sum(n), 2))

ai_transcript_clicks_per_month_count
Show the code
ai_transcript_clicks_per_month_count |>
  ggtexttable()

Show the code
ai_transcript_clicks_per_month_count |>
  mutate(date = ymd(paste0(year_month, "-01"))) |>
  ggplot(aes(x = date, y = n)) +
  geom_line(group = 1) +
  geom_point() +
  theme_minimal() +
  labs(title = "Number of AI transcript clicks per month", x = "date [months]")

11.6 LLM output: llm_response - tokens and token length

11.6.1 German vs. English

Show the code
llm_response_text |>
  count(lang) |>
  mutate(prob = n / sum(n))

11.6.2 Number of tokens

Show the code
llm_response_text |>
  describe_distribution(select = "tokens_n")

11.6.3 Number of pre-existing questions

11.6.3.1 Number of verify_option_wrong and verify_option_correct events

11.6.3.1.1 idvisit
Show the code
verify_option_summary <-
  data_separated_filtered |>
  group_by(idvisit) |>
  filter(value == "verify_option_wrong" | value == "verify_option_correct") |>
  summarise(verify_option = n())
Show the code
verify_option_summary |>
  gghistogram(x = "verify_option")

Show the code
verify_option_summary |>
  describe_distribution(verify_option) |>
  print_md()
Variable Mean SD IQR Range Skewness Kurtosis n n_Missing
verify_option 35.24 36.76 30 (4.00, 245.00) 2.64 9.11 207 0
11.6.3.1.2 fingerprint
Show the code
# verify_option_summary_fingerprint <-
#   data_separated_filtered |>
#   group_by(fingerprint) |>
#   filter(value == "verify_option_wrong" | value == "verify_option_correct") |>
#   summarise(verify_option = n())

setDT(data_separated_filtered) # Ensure your data frame is a data.table

verify_option_summary_fingerprint <- data_separated_filtered[
  # 1. Filtering (i)
  value %in% c("verify_option_wrong", "verify_option_correct"),

  # 2. Summarize (.j) - calculate the count (n)
  .(verify_option = .N),

  # 3. Grouping (by)
  by = .(fingerprint)
]

verify_option_summary_fingerprint <- as_tibble(
  verify_option_summary_fingerprint
)
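
The data.table call above combines filtering, counting, and grouping in one expression. A minimal sketch on hypothetical toy data (the object names d and counts are assumptions for illustration) shows the same pattern and two of its side effects.

```r
library(data.table)

# Hypothetical toy data in long format (not the real HaNS data)
d <- data.frame(
  fingerprint = c("a", "a", "b", "b", "c"),
  value = c(
    "verify_option_wrong", "verify_option_correct",
    "verify_option_wrong", "login", "login"
  )
)

# setDT() converts d in place (by reference); no copy is made,
# so all later code sees d as a data.table
setDT(d)

counts <- d[
  value %in% c("verify_option_wrong", "verify_option_correct"), # filter (i)
  .(verify_option = .N), # count rows per group (j)
  by = .(fingerprint) # grouping
]

counts
```

Fingerprints without any verify_option event (here "c") do not appear in the result at all, so the n reported for such a summary covers only fingerprints with at least one such event.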
Show the code
verify_option_summary_fingerprint |>
  gghistogram(x = "verify_option")

Show the code
verify_option_summary_fingerprint |>
  describe_distribution(verify_option) |>
  print_md()
Variable Mean SD IQR Range Skewness Kurtosis n n_Missing
verify_option 41.68 46.81 35 (4.00, 252.00) 2.35 5.94 175 0

11.6.3.2 Number of verify_option events divided by 4 (verify_option_div_by_4)

Show the code
verify_option_summary <-
  verify_option_summary |>
  mutate(verify_option_div_by_4 = verify_option / 4)

verify_option_summary |>
  gghistogram(x = "verify_option_div_by_4")

Show the code
verify_option_summary |>
  mutate(verify_option_div_by_4 = verify_option / 4) |>
  describe_distribution(verify_option_div_by_4) |>
  print_md()
Variable Mean SD IQR Range Skewness Kurtosis n n_Missing
verify_option_div_by_4 8.81 9.19 7.50 (1.00, 61.25) 2.64 9.11 207 0

11.6.3.3 Number of “Multiple choice answer selected” events

Show the code
check_if_both_methods_give_same_number <-
  n_mc_answers_selected |>
  full_join(verify_option_summary)

check_if_both_methods_give_same_number |>
  head(20) |>
  gt()
idvisit n verify_option verify_option_div_by_4
1560 6 28 7.00
1569 2 14 3.50
2021 2 21 5.25
2022 10 126 31.50
2394 2 21 5.25
2718 2 7 1.75
2740 2 7 1.75
2883 2 126 31.50
2902 2 126 31.50
2912 2 77 19.25
2932 6 35 8.75
2950 2 56 14.00
2978 14 245 61.25
2979 4 35 8.75
3103 2 14 3.50
3257 2 7 1.75
3691 2 35 8.75
3700 4 84 21.00
3741 2 21 5.25
3804 2 70 17.50

No, the two methods do not yield the same number: in the rows shown, n never equals verify_option_div_by_4.
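
The disagreement can also be quantified rather than read off the table. The sketch below uses the first four rows of the table above as toy data, assuming that the unlabeled first value of each row is the idvisit. The object names check and check_summary are hypothetical.

```r
library(dplyr)

# First four rows of the table above (idvisit is an assumed column name)
check <- tibble(
  idvisit = c(1560, 1569, 2021, 2022),
  n = c(6, 2, 2, 10),
  verify_option_div_by_4 = c(7.00, 3.50, 5.25, 31.50)
)

check_summary <-
  check |>
  mutate(
    diff = verify_option_div_by_4 - n,
    # near() compares numbers with a small tolerance
    same = near(n, verify_option_div_by_4)
  ) |>
  summarise(
    prop_same = mean(same), # share of visits where both methods agree
    mean_abs_diff = mean(abs(diff)) # average size of the disagreement
  )

check_summary
```

Applied to the full joined object, the same two summaries would show how often and by how much the two counting methods differ.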

11.6.3.4 “Multiple choice answer selected” over time

Show the code
mc_answers_with_timestamps <-
  mc_answers_with_timestamps |>
  mutate(month_start = floor_date(timestamp, "month")) |>
  ungroup() |>
  arrange(timestamp) |>
  mutate(n_cumulated = cumsum(n)) |>
  mutate(date = as.Date(timestamp))

lim <- c(
  min(mc_answers_with_timestamps$date),
  max(mc_answers_with_timestamps$date)
)

mc_answers_with_timestamps |>
  ggplot(aes(x = date, y = n_cumulated)) +
  scale_x_date(limits = lim, labels = scales::label_date_short()) +
  geom_point() +
  geom_line()

11.6.3.5 Number of generate_questionaire events

# generate_questionaire_summary <-
#   data_separated_filtered |>
#   group_by(idvisit) |>
#   filter(value == "generate_questionaire") |>
#   summarise(generate_questionaire = n())

setDT(data_separated_filtered) # Convert the data.frame to a data.table in place

generate_questionaire_summary <- data_separated_filtered[
  # 1. Filtering (i)
  value == "generate_questionaire",

  # 2. Summarise (j): count the rows per group (.N) and name the result
  .(generate_questionaire = .N),

  # 3. Grouping (by)
  by = .(idvisit)
]
generate_questionaire_summary |>
  describe_distribution(generate_questionaire) |>
  print_md()
Variable Mean SD IQR Range Skewness Kurtosis n n_Missing
generate_questionaire 3.11 5.93 2 (1.00, 66.00) 5.89 46.07 367 0

11.6.3.6 Number of previously existing questions

setDT(generate_questionaire_summary)
setDT(verify_option_summary)

# 1. Full join (merge)
# merge() with all = TRUE keeps the visits from both tables (full join).
# The join column is assumed to be 'idvisit'.
prior_existing_questions_summary <- merge(
  generate_questionaire_summary,
  verify_option_summary,
  by = "idvisit",
  all = TRUE
)

# 2. Mutate (calculation)
# Use := inside j to add the new column by reference
prior_existing_questions_summary[,
  prior_existing_questions_n := verify_option - generate_questionaire
]

# prior_existing_questions_summary <-
#   generate_questionaire_summary |>
#   full_join(verify_option_summary) |>
#   mutate(prior_existing_questions_n = verify_option - generate_questionaire)
prior_existing_questions_summary |>
  # drop_na() |>
  gghistogram(x = "prior_existing_questions_n")

prior_existing_questions_summary |>
  describe_distribution(prior_existing_questions_n) |>
  print_md()
Variable Mean SD IQR Range Skewness Kurtosis n n_Missing
prior_existing_questions_n 38.87 44.57 39 (-59.00, 236.00) 1.98 5.40 91 392
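
The large number of missing values (n_Missing = 392) is what a full join produces: a visit that appears in only one of the two summaries gets NA for the other count, so the difference is NA as well. The reported figures are consistent with this reading, since 367 + 207 - 91 = 483 = 91 + 392. A minimal sketch with made-up visits (all values and the object names gen, ver and joined are hypothetical), using base R merge() on plain data frames:

```r
# Hypothetical toy data; idvisit values and counts are made up for illustration
gen <- data.frame(idvisit = c(1, 2, 3), generate_questionaire = c(2, 1, 5))
ver <- data.frame(idvisit = c(2, 3, 4), verify_option = c(8, 4, 12))

# Full join: keep the visits from both tables (all = TRUE)
joined <- merge(gen, ver, by = "idvisit", all = TRUE)

# Difference between the two counts; NA whenever one of them is missing
joined$prior_existing_questions_n <-
  joined$verify_option - joined$generate_questionaire

joined

# Visits 1 and 4 occur in only one table, so their difference is NA.
# Visit 3 has more generate events than verify events, hence a negative value.
sum(is.na(joined$prior_existing_questions_n))
```

The same mechanism would also explain the negative lower bound of the range reported above.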

12 Video time

How much time do users spend watching videos (“Glotzdauer”, i.e. viewing time)?

12.1 Viewing time (Glotzdauer) in general

Caution: video time is difficult to analyse. Users do not end a video by pressing “Pause” but by performing other actions, which is hard to capture analytically.

See the definition of the target glotzdauer in the pipeline.

In short, the time difference between two consecutive “Play” and “Pause” actions is computed.

This approach has its difficulties, however: a “Play” is not always followed by a “Pause”, so it is hard to determine when the viewing of a video ends. The results of this analysis should therefore be interpreted with caution.
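
The idea and its limitation can be sketched on a toy event log. This is not the glotzdauer target from the pipeline, only a simplified illustration; the column names idvisit, action and timestamp as well as all values are assumptions.

```r
# Hypothetical event log of a single visit (made-up values)
events <- data.frame(
  idvisit = 1,
  action = c("Play", "Pause", "Play", "Click", "Play", "Pause"),
  timestamp = as.POSIXct(
    c(
      "2024-01-10 10:00:00", "2024-01-10 10:02:30", "2024-01-10 10:05:00",
      "2024-01-10 10:06:00", "2024-01-10 10:07:00", "2024-01-10 10:07:45"
    ),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Keep only the video events and order them in time
video <- events[events$action %in% c("Play", "Pause"), ]
video <- video[order(video$timestamp), ]

# A "Play" is used only if the next video event is a "Pause"
next_action <- c(video$action[-1], NA)
i_play <- which(video$action == "Play" & next_action == "Pause")

# Time difference between each matched "Play" and the "Pause" that follows it
time_diff <- difftime(
  video$timestamp[i_play + 1],
  video$timestamp[i_play],
  units = "secs"
)
time_diff
```

The second “Play” in this toy log is never paused, so it contributes no interval at all. That is the kind of gap that makes the viewing time hard to measure.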

The definition of the function glotzdauer.R is documented online.

data_separated_distinct_slice |>
  head(30)

The following plot uses the absolute time values, i.e. without sign.

data_separated_distinct_slice |>
  # we assume that a negative glotzdauer is equivalent to a positive one:
  mutate(time_diff = abs(time_diff)) |>
  # keep only glotzdauer shorter than 10 minutes:
  filter(time_diff < 60 * 10) |>
  ggplot(aes(x = time_diff)) +
  geom_histogram() +
  scale_x_time() +
  labs(
    x = "Time interval [minutes]",
    caption = "Only time intervals less than 10 minutes. It is assumed that video time is positive only (no negative time intervals)."
  ) +
  theme_minimal()

glotzdauer_prepped <-
  data_separated_distinct_slice |>
  # we assume that a negative glotzdauer is equivalent to a positive one:
  mutate(time_diff_abs_sec = abs(as.numeric(time_diff, units = "secs"))) |>
  # keep only glotzdauer shorter than 10 minutes:
  filter(time_diff_abs_sec < 60 * 10) |>
  mutate(time_diff_abs_min = time_diff_abs_sec / 60)

glotzdauer_tbl <-
  glotzdauer_prepped |>
  select(time_diff_abs_sec, time_diff_abs_min) |>
  describe_distribution()

glotzdauer_tbl
glotzdauer_tbl |>
  mutate(across(where(is.numeric), ~ round(., 2))) |>
  ggpubr::ggtexttable()

12.2 Viewing time (Glotzdauer) over time

glotzdauer_prepped_tbl <-
  glotzdauer_prepped |>
  mutate(first_of_month = floor_date(date, unit = "month")) |>
  group_by(first_of_month) |>
  # use the absolute duration (in seconds) so that negative intervals do not cancel out
  summarise(time_diff_mean = mean(time_diff_abs_sec, na.rm = TRUE))


glotzdauer_prepped_tbl
glotzdauer_prepped_tbl |>
  ggplot(aes(x = first_of_month, y = time_diff_mean)) +
  geom_line() +
  theme_minimal()
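
The monthly aggregation can be sketched without the pipeline data. In this sketch the mean is taken over absolute durations in seconds, in line with the assumption stated earlier that negative and positive intervals are treated alike. The toy values and the object names toy and monthly are hypothetical, and base R stands in for floor_date().

```r
# Hypothetical toy data: dates and absolute viewing times in seconds (made up)
toy <- data.frame(
  date = as.Date(c("2024-01-03", "2024-01-20", "2024-02-05", "2024-02-28")),
  time_diff_abs_sec = c(30, 90, 120, 60)
)

# Floor each date to the first day of its month
# (base R stand-in for lubridate::floor_date(date, unit = "month"))
toy$first_of_month <- as.Date(format(toy$date, "%Y-%m-01"))

# Mean viewing time per month
monthly <- aggregate(time_diff_abs_sec ~ first_of_month, data = toy, FUN = mean)
monthly
```

With only a few observations per month, as in this toy example, a single long interval can dominate the monthly mean, which is one more reason to read the time course with caution.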

13 Conclusion

.