Analysis of the HaNS Matomo Data

Author

Sebastian Sauer

Published

September 29, 2025

1 Background

This working report describes the technical procedure used to analyze the Matomo data of the BMBF project “HaNS”.

1.1 Approach

The Matomo click data from all semesters of the project period were processed for this analysis. A set of research questions was analyzed using an R pipeline.

The complete code is documented online at https://github.com/sebastiansauer/hans. For data protection reasons, no data are published online.

The central analysis pipeline file is https://github.com/sebastiansauer/hans/blob/main/_targets.R.

1.2 Research questions

How many users are there, and over which period?
How frequently is HaNS visited? How long are the intervals between uses of the platform?
How often is HaNS visited per period (e.g., per month)?
How does usage change over time?
How many actions does a visit involve? What is the statistical distribution of actions per visit?
How long do users stay per visit?
How does the time spent per visit change over time?
Which actions do users perform on HaNS?
How do the distributions of action frequencies change over time?
On which days and at what times do users come to HaNS?
How often and in what way do users interact with the LLM in HaNS?
How large is the share of users who interact with the LLM?
How does the share of users who interact with the LLM change over time?
How often is a word in the transcript of the LLM clicked?
How often is a transcript service in HaNS used?
How does the use of the transcript services in HaNS change over time?
How long are videos watched?
How does the viewing duration change over time?

2 Setup

2.1 Loading R packages

Show the code

library(targets)
library(tidyverse)
library(ggokabeito)
library(easystats)
library(gt)
library(ggfittext)
library(scales)
library(visdat)
library(collapse)
library(ggpubr)
library(knitr)
library(tinytable)
library(data.table)

Show the code

library(ggplot2)
theme_set(theme_minimal())

2.2 Setting options

Show the code

options(lubridate.week.start = 1) # Monday as first day
#options(collapse_mask = "all") # use collapse for all dplyr operations
options(chromote.headless = "new") # headless Chrome is needed for gtsave

Show the code

scale_colour_discrete <- function(...) scale_colour_brewer(palette = "Set2")
scale_fill_discrete <- function(...) scale_fill_brewer(palette = "Set2")

2.3 Loading data

Show the code

tar_load(ai_transcript_clicks_per_month)
tar_load(config)
tar_load(course_and_uni_per_visit)
tar_load(data_all_fct)
tar_load(data_long)
tar_load(data_prepped)
tar_load(data_separated_distinct_slice)
tar_load(data_separated_filtered)
#tar_load(data_users_only)
tar_load(idvisit_has_llm)
tar_load(llm_response_text)
tar_load(n_action)
tar_load(n_action_type)
tar_load(n_action_w_date)
tar_load(time_duration)
tar_load(time_since_last_visit)
tar_load(time_spent)
tar_load(time_spent_w_course_university)
tar_load(time_visit_wday)
tar_load(n_mc_answers_selected)
tar_load(mc_answers_with_timestamps)
tar_load(n_action_fingerprint)
tar_load(time_visit_wday_fingerprint)
tar_load(n_action_w_date_fingerprint)
tar_load(time_spent_fingerprint)

3 Data preparation and analysis pipeline

3.1 The targets pipeline provides an overview of all analysis steps

The analysis is specified as a targets pipeline and is openly available on GitHub.

3.2 Long format

Because the data format is “ragged on the right” (i.e., rows differ in length), the data were converted to a long format to simplify the analysis.

For this, the actionDetails_ variables were used (in addition to the ID variables, mainly idvisit). The code for pivoting into the long format can be found in the function longify-data.R.

The long-format data were then prepared further with the function slimify-data.R.

Show the code

data_separated_filtered |>
  head(30)
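
The pivoting step described in this section can be illustrated with a minimal sketch. It is not the code from longify-data.R; the toy column names (actionDetails_0_type and so on) and the regular expression are assumptions chosen only to mimic a ragged wide table with one group of columns per action.

```r
library(dplyr)
library(tidyr)

# Toy wide data (hypothetical): one row per visit, one column group per action.
# Visit 2 has only one action, so its second column group is missing (NA).
visits_wide <- tibble(
  idvisit = c(1L, 2L),
  actionDetails_0_type = c("action", "event"),
  actionDetails_0_timestamp = c("2023-03-23 18:37:56", "2023-03-24 09:00:00"),
  actionDetails_1_type = c("event", NA),
  actionDetails_1_timestamp = c("2023-03-23 18:38:10", NA)
)

visits_long <- visits_wide |>
  pivot_longer(
    cols = starts_with("actionDetails_"),
    # split each column name into the action number and the kind of detail
    names_to = c("nr", "type"),
    names_pattern = "actionDetails_(\\d+)_(.*)",
    values_to = "value",
    # drop cells that exist only because other visits had more actions
    values_drop_na = TRUE
  ) |>
  mutate(nr = as.integer(nr))

visits_long
```

Dropping missing cells with values_drop_na = TRUE is what removes the raggedness: each visit contributes only as many rows as it has recorded action details.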

4 Overview of the data

4.1 Loading and inspecting the raw data (data_all_fct)

4.1.1 Dimensions

The raw dataset has

14207 rows
3181 columns (duplicates and columns containing images already removed)

Each row corresponds to one “visit”.

4.1.2 First look

Show the code

data_all_fct_head100 <-
  data_all_fct %>%
  select(1:100) %>%
  slice_head(n = 100)

Show the code

data_all_fct_head100 %>%
  visdat::vis_dat()

4.1.3 (Almost) empty columns

Show the code

d_na_cols <- data.frame(
  id = 1:ncol(data_all_fct),
  names = names(data_all_fct),
  na_prop = colMeans(is.na(data_all_fct))
)

4.1.3.1 Empty columns

Show the code

d_na_cols |>
  filter(na_prop == 1)

4.1.3.2 Almost empty columns

Show the code

no_na_cols <-
  d_na_cols |>
  filter(na_prop > .9) |>
  nrow()

no_na_cols

[1] 1951

Important

A large share of the columns (1951 of 3181) is almost empty, i.e., more than 90% of their values are missing.

4.1.4 Names (1-100)

Show the code

d_100_names <- data.frame(
  id = 1:100,
  col_name = data_all_fct_head100 %>% names()
)

d_100_names

4.1.5 Values of the first 100 columns

Show the code

data_all_fct_head100

4.1.6 Dataset data_separated_filtered, rows 1-100

Show the code

data_separated_filtered %>%
  slice(1:100)

4.2 Number of cases in the users-only dataset

After removing developers, admins, and lecturers from the raw dataset, the following dimensions remain:

14207 rows
3181 columns

4.3 Dataset with the number of actions per user

4.3.1 idvisit

Show the code

n_action |> dim()

[1] 14207     2

Show the code

n_action |>
  head(30)

Show the code

n_action |>
  ggplot(aes(x = nr_max)) +
  geom_histogram()

4.3.2 fingerprint

Show the code

n_action_fingerprint |> head(30)

Show the code

n_action_fingerprint |>
  ggplot(aes(x = nr_max)) +
  geom_histogram()

5 Time period

5.1 Start/end of the data

Show the code

n_action_w_date |>
  head(30)

Show the code

min_max_time <-
  n_action_w_date |>
  summarise(
    time_min = min(date_time_start, na.rm = TRUE),
    time_max = max(date_time_start, na.rm = TRUE)
  )

min_max_time |>
  gt()

time_min	time_max
2022-12-05 15:33:45	2025-07-14 23:40:45

Important

First visit in the dataset: 2022-12-05 15:33:45.

Last visit in the dataset: 2025-07-14 23:40:45.

These statistics were computed from the data object data_separated_filtered; see the target of this object in the pipeline.

5.2 Days since last visit

5.2.1 Overall

5.2.1.1 idvisit

Show the code

time_visit_wday |>
  head(30)

Show the code

time_since_last_visit <-
  time_since_last_visit |>
  mutate(dayssincelastvisit = as.numeric(dayssincelastvisit)) |>
  distinct(idvisit, .keep_all = TRUE)

time_since_last_visit |>
  datawizard::describe_distribution(dayssincelastvisit) |>
  knitr::kable(digits = 2)

Variable	Mean	SD	IQR	Min	Max	Skewness	Kurtosis	n	n_Missing
dayssincelastvisit	6.89	15.75	0	1	87	2.98	8.26	14207	0

Show the code

time_since_last_visit |>
  ggplot(aes(x = dayssincelastvisit)) +
  geom_density() +
  labs(
    title = "If visitors return, they mostly return within a few days."
  )

Important

Do users return to the site at intervals of only a few days?

5.2.1.2 fingerprint

Show the code

time_visit_wday_fingerprint |> head()

Show the code

time_since_last_visit_fingerprint <-
  time_since_last_visit |>
  mutate(dayssincelastvisit = as.numeric(dayssincelastvisit)) |>
  distinct(fingerprint, .keep_all = TRUE)

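# Note: this summary uses time_since_last_visit (one row per idvisit),
# not the object time_since_last_visit_fingerprint created above.
time_since_last_visit |>
  datawizard::describe_distribution(dayssincelastvisit) |>
  knitr::kable(digits = 2)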

Variable	Mean	SD	IQR	Min	Max	Skewness	Kurtosis	n	n_Missing
dayssincelastvisit	6.89	15.75	0	1	87	2.98	8.26	14207	0

Show the code

time_since_last_visit |>
  ggplot(aes(x = dayssincelastvisit)) +
  geom_density() +
  labs(
    title = "If visitors return, they mostly return within a few days."
  )

5.2.2 By course

Show the code

time_since_last_visit_per_course <-
  time_since_last_visit |>
  left_join(course_and_uni_per_visit) |>
  drop_na()

Show the code

time_since_last_visit_per_course_summary <-
  time_since_last_visit_per_course |>
  group_by(course) |>
  summarise(
    dayssincelastvisit_mean = mean(dayssincelastvisit),
    dayssincelastvisit_sd = sd(dayssincelastvisit),
    dayssincelastvisit_n = n()
  ) |>
  mutate(
    dayssincelastvisit_n_log = log(dayssincelastvisit_n, base = 10) + 0.001
  )

Show the code

time_since_last_visit_per_course_summary

Show the code

time_since_last_visit_per_course_summary |>
  ggplot(aes(
    y = reorder(course, dayssincelastvisit_mean),
    x = dayssincelastvisit_mean
  )) +
  geom_errorbar(aes(
    xmin = dayssincelastvisit_mean - dayssincelastvisit_sd,
    xmax = dayssincelastvisit_mean + dayssincelastvisit_sd
  )) +
  geom_point(aes(alpha = log(dayssincelastvisit_n)), show.legend = FALSE) +
  labs(
    x = "Days since last visit (mean±sd)",
    y = "course",
    title = "In some courses, users use HaNS frequently.",
    caption = "Grey saturation of the mean dots refers to the log10 of the sample size (N)"
  ) +
  geom_text(
    aes(label = round(dayssincelastvisit_n)),
    x = Inf,
    hjust = 1.2,
    size = 2
  ) +
  annotate(
    x = Inf,
    y = Inf,
    label = "N",
    geom = "label",
    hjust = 1,
    vjust = 1
  ) +
  scale_y_discrete(expand = expansion(mult = 0.1)) +
  theme_minimal()

5.3 Visits over time

How many visits (to HaNS) were there?

5.3.1 Per month

5.3.1.1 idvisit

Show the code

time_visit_wday_summary <-
  time_visit_wday |>
  ungroup() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  mutate(
    month_name = month(date_time, label = TRUE, abbr = FALSE),
    month_num = month(date_time, label = FALSE),
    year_num = year(date_time)
  )

Show the code

time_visit_wday_summary |>
  group_by(year_num, month_num) |>
  summarise(n = n()) |>
  gt()

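year_num	month_num	n
2022	12	329
2023	1	455
2023	2	561
2023	3	149
2023	4	253
2023	5	391
2023	6	292
2023	7	441
2023	8	26
2023	9	39
2023	10	614
2023	11	660
2023	12	519
2024	1	783
2024	2	85
2024	3	138
2024	4	329
2024	5	413
2024	6	593
2024	7	743
2024	8	16
2024	9	23
2024	10	731
2024	11	918
2024	12	765
2025	1	959
2025	2	155
2025	3	507
2025	4	1011
2025	5	557
2025	6	321
2025	7	430
NA	NA	1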

Show the code

time_visit_wday_summary |>
  group_by(year_num, month_start) |>
  summarise(n = n()) |>
  ggplot(aes(x = month_start, y = n)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  labs(
    title = "The number of visits reflects the teaching periods of the semesters.",
    x = "month/year"
  )

5.3.1.2 fingerprint

Show the code

time_visit_wday_summary_fingerprint <-
  time_visit_wday_fingerprint |>
  ungroup() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  mutate(
    month_name = month(date_time, label = TRUE, abbr = FALSE),
    month_num = month(date_time, label = FALSE),
    year_num = year(date_time)
  )

Show the code

time_visit_wday_summary_fingerprint |>
  group_by(year_num, month_num) |>
  summarise(n = n()) |>
  gt()

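year_num	month_num	n
2022	12	235
2023	1	248
2023	2	303
2023	3	99
2023	4	160
2023	5	226
2023	6	195
2023	7	227
2023	8	17
2023	9	23
2023	10	402
2023	11	412
2023	12	325
2024	1	445
2024	2	50
2024	3	94
2024	4	179
2024	5	204
2024	6	274
2024	7	214
2024	8	10
2024	9	16
2024	10	365
2024	11	417
2024	12	317
2025	1	347
2025	2	74
2025	3	217
2025	4	424
2025	5	273
2025	6	171
2025	7	196
NA	NA	1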

Show the code

time_visit_wday_summary_fingerprint |>
  group_by(year_num, month_start) |>
  summarise(n = n()) |>
  ggplot(aes(x = month_start, y = n)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  labs(
    title = "The number of visits reflects the teaching periods of the semesters.",
    x = "month/year"
  )

5.3.2 Per week

Show the code

time_visit_wday_summary_week <-
  time_visit_wday |>
  ungroup() |>
  mutate(week_start = floor_date(date_time, "week")) |>
  mutate(week_num = week(date_time), year_num = year(date_time))

Show the code

time_visit_wday_summary_week_summarized <-
  time_visit_wday_summary_week |>
  group_by(year_num, week_num) |>
  summarise(n = n())

time_visit_wday_summary_week_summarized

Show the code

time_visit_wday_summary_week_summarized_dateformat <-
  time_visit_wday_summary_week |>
  group_by(week_start) |>
  summarise(n = n())

Show the code

time_visit_wday_summary_week_summarized_dateformat |>
  ggplot(aes(x = week_start, y = n)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  geom_smooth(method = "gam", se = FALSE, color = "blue") +
  labs(
    title = "The number of visits is increasing and reflects the teaching periods of the semesters.",
    x = "week number/year"
  )

Important

The number of visits has increased over time.

5.3.3 Cumulative visits over time

5.3.3.1 Month - idvisit

Show the code

time_visit_wday_summary |>
  group_by(year_num, month_start) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(n_cumsum = cumsum(n)) |>
  ggplot(aes(x = month_start, y = n_cumsum)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm") +
  labs(title = "Visits have increased linearly over time.", x = "month/year")

5.3.3.2 Month - fingerprint

Show the code

time_visit_wday_summary_fingerprint |>
  group_by(year_num, month_start) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(n_cumsum = cumsum(n)) |>
  ggplot(aes(x = month_start, y = n_cumsum)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm") +
  labs(title = "Visits have increased linearly over time.", x = "month/year")

5.3.3.3 Week

Show the code

time_visit_wday_summary_week |>
  group_by(year_num, week_start) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(n_cumsum = cumsum(n)) |>
  ggplot(aes(x = week_start, y = n_cumsum)) +
  geom_line(group = 1, color = "grey60") +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm") +
  labs(
    title = "Visits have increased approx. linearly over time.",
    x = "week/year"
  )

5.4 Statistics

The following statistics are based on the dataset data_separated_filtered:

5.4.1 idvisit

Show the code

glimpse(data_separated_filtered)

Rows: 4,477,584
Columns: 5
$ nr          <int> 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5…
$ type        <fct> subtitle, timestamp, eventcategory, eventaction, timestamp…
$ value       <fct> "https://hans.th-nuernberg.de/", "2023-03-23 18:37:56", "c…
$ idvisit     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ fingerprint <fct> aa8a78771b4f21ff, aa8a78771b4f21ff, aa8a78771b4f21ff, aa8a…

nr holds the sequence number of the action within a given visit.

5.4.2 fingerprint

Show the code

data_separated_filtered |>
  distinct(fingerprint, .keep_all = TRUE) |>
  glimpse()

Rows: 7,160
Columns: 5
$ nr          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ type        <fct> subtitle, subtitle, subtitle, subtitle, subtitle, subtitle…
$ value       <fct> "https://hans.th-nuernberg.de/", "https://hans.th-nuernber…
$ idvisit     <int> 1, 3, 6, 7, 8, 10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,…
$ fingerprint <fct> aa8a78771b4f21ff, 1f026ad3cbbdf325, 518965d4e1ae7e2d, aa95…

5.5 With all data (including the 499 values)

5.5.1 idvisit

Show the code

tbl_n_action <-
  n_action |>
  describe_distribution(nr_max, centrality = c("median", "mean"))

tbl_n_action

nr_max is the maximum value of nr per visit, i.e., it indicates how many actions were performed during a visit.

A closer look at the number of actions per visit shows that the maximum value (499) occurs very frequently:

Show the code

n_action |>
  count(nr_max) |>
  ggplot(aes(x = nr_max, y = n)) +
  geom_col() +
  geom_vline(
    xintercept = tbl_n_action$Median,
    color = "blue",
    linetype = "dashed"
  ) +
  labs(
    caption = "The vertical dashed line shows the median.",
    title = "Most users perform only a few actions, but some perform many."
  )

Important

Most users perform only a few actions per visit, but some perform a great many.

Here is an alternative display:

Show the code

n_action |>
  count(nr_max) |>
  ggplot(aes(x = nr_max, y = n)) +
  geom_point()

The maximum value is conspicuously frequent:

Show the code

n_action |>
  count(nr_max == 499) |>
  gt()

nr_max == 499	n
FALSE	13626
TRUE	581

It seems plausible that the maximum value collects all “capped” (censored, truncated) values, i.e., many values that would actually be larger but were cut off at 499. This affects 581 of 14207 visits (about 4%).
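
The consequence of such a cap can be shown with a small sketch. The data are invented; only the cap of 499 is taken from the result above. The sketch assumes that the logging simply stops counting at the cap, which the report considers plausible but does not confirm.

```r
library(dplyr)

cap <- 499 # cap value observed in the data above

# Hypothetical "true" action counts for six visits
n_action_toy <- tibble(
  idvisit = 1:6,
  nr_true = c(5, 20, 120, 499, 800, 1500)
) |>
  mutate(
    nr_max = pmin(nr_true, cap), # what a capped log would record
    at_cap = nr_max == cap       # flag values sitting at the cap
  )

summary_toy <- n_action_toy |>
  summarise(
    mean_true = mean(nr_true),
    mean_capped = mean(nr_max),
    prop_at_cap = mean(at_cap)
  )

summary_toy
```

In this toy example the mean of the capped values is clearly smaller than the mean of the true values, so means computed on capped data should be read as lower bounds. Note that the flag cannot distinguish a visit with exactly 499 actions from a censored one.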

5.5.2 fingerprint

Show the code

tbl_n_action_fingerprint <-
  n_action_fingerprint |>
  describe_distribution(nr_max, centrality = c("median", "mean"))

tbl_n_action_fingerprint

Show the code

n_action_fingerprint |>
  count(nr_max) |>
  ggplot(aes(x = nr_max, y = n)) +
  geom_col() +
  geom_vline(
    xintercept = tbl_n_action_fingerprint$Median,
    color = "blue",
    linetype = "dashed"
  ) +
  labs(
    caption = "The vertical dashed line shows the median.",
    title = "Most users perform only a few actions, but some perform many."
  )

5.6 Only visitors with fewer than 500 logged actions

5.6.1 idvisit

Show the code

n_action_lt_500 <-
  n_action |>
  filter(nr_max != 499)

n_action_lt_500 |>
  describe_distribution(nr_max) |>
  gt() |>
  fmt_number(columns = where(is.numeric), decimals = 2)

Variable	Mean	SD	IQR	Min	Max	Skewness	Kurtosis	n	n_Missing
nr_max	61.88	88.53	77.00	1.00	496.00	2.27	5.47	13,626.00	0.00

5.6.2 fingerprint

Show the code

n_action_lt_500_fingerprint <-
  n_action_fingerprint |>
  filter(nr_max != 499)

n_action_lt_500_fingerprint |>
  describe_distribution(nr_max) |>
  gt() |>
  fmt_number(columns = where(is.numeric), decimals = 2)

Variable	Mean	SD	IQR	Min	Max	Skewness	Kurtosis	n	n_Missing
nr_max	75.78	99.73	100.00	1.00	496.00	1.88	3.31	6,771.00	0.00

6 Courses

6.1 Number of courses by university

6.1.1 idvisit

Show the code

course_and_uni_per_visit |>
  count(university)

Show the code

course_and_uni_per_visit |>
  count(university) |>
  drop_na() |>
  ggplot(aes(y = reorder(university, n), x = n)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "TH Nürnberg hosts the most courses on HaNS by far.",
    y = "University"
  )

6.1.2 fingerprint

Show the code

course_and_uni_per_visit |>
  distinct(fingerprint, .keep_all = TRUE) |>
  count(university) |>
  ggplot(aes(y = reorder(university, n), x = n)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "TH Nürnberg hosts the most courses on HaNS by far.",
    y = "University"
  )

6.2 Visits by course and year

6.2.1 idvisit

Show the code

time_spent_w_course_university |>
  count(year, course)

Show the code

time_spent_w_course_university |>
  count(year, course) |>
  drop_na() |>
  ggplot(aes(x = n, y = course, fill = factor(year))) +
  geom_col(position = "dodge") +
  labs(title = "The course 'GeSOA' is the most active course on HaNS.")

6.2.2 fingerprint

Show the code

time_spent_w_course_university |>
  distinct(fingerprint, .keep_all = TRUE) |>
  count(year, course) |>
  drop_na() |>
  ggplot(aes(x = n, y = course, fill = factor(year))) +
  geom_col(position = "dodge") +
  labs(title = "The course 'GeSOA' is the most active course on HaNS.")

7 Actions per visit/fingerprint

7.1 Statistics per visit

Show the code

n_actions_searches_interactions <-
  data_prepped |>
  select(
    idvisit,
    fingerprint,
    any_of(c(
      "searches",
      "actions",
      "interactions",
      "referrertype",
      "referrername",
      "language",
      "devicetype",
      "devicemodel",
      "operatingsystem",
      "browsername"
    ))
  )

7.1.1 Unique IDs, Fingerprints, Mean searches, Mean actions

This section reports the number of unique visit IDs and unique fingerprints as well as the mean number of searches and actions per visit.

7.1.1.1 idvisit and fingerprint

Show the code

n_actions_searches_interactions |>
  as.data.frame() |>
  summarise(
    idvisit_n = length(unique(idvisit)),
    fingerprint_n = length(unique(fingerprint)),
    actions_mean = mean(as.integer(actions), na.rm = TRUE),
    searches_mean = mean(as.integer(searches), na.rm = TRUE)
  )

Note

There are about twice as many visits as unique users (fingerprints).

7.1.2 Referrer type per visit

Show the code

n_actions_searches_interactions |>
  count(referrertype, sort = TRUE)

7.1.3 Referrer name per visit

Show the code

n_actions_searches_interactions |>
  count(referrername, sort = TRUE)

7.1.4 devicemodel

Show the code

n_actions_searches_interactions |>
  count(devicemodel, sort = TRUE) |>
  slice_head(n = 10)

7.1.5 operatingsystem

Show the code

n_actions_searches_interactions |>
  count(operatingsystem, sort = TRUE) |>
  slice_head(n = 10)

7.1.6 browsername

Show the code

n_actions_searches_interactions |>
  count(browsername, sort = TRUE) |>
  slice_head(n = 10)

Mac users appear to be particularly active on HaNS.

7.2 Actions per idvisit/fingerprint - including the 499 values

7.2.1 idvisit

Show the code

n_action_avg = mean(n_action$nr_max) |> round(0)
n_action_median = median(n_action$nr_max) |> round(0)
n_action_sd = sd(n_action$nr_max) |> round(0)
n_action_iqr = IQR(n_action$nr_max) |> round(0)

n_action |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    caption = "The vertical line shows the mean; the horizontal line shows mean±SD"
  ) +
  theme_minimal() +
  geom_vline(xintercept = n_action_avg, color = palette_okabe_ito()[1]) +
  geom_segment(
    x = n_action_avg - n_action_sd,
    y = 0,
    xend = n_action_avg + n_action_sd,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_avg,
    y = 1500,
    label = paste0("Mean = ", n_action_avg)
  ) +
  annotate(
    "label",
    x = n_action_avg + n_action_sd,
    y = 0,
    label = paste0("SD = ", n_action_sd)
  )

Show the code

#geom_label(aes(x = n_action_avg), y = 1, label = "Mean")

n_action |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    caption = "The vertical line shows the median; the horizontal line shows median±IQR"
  ) +
  theme_minimal() +
  geom_vline(xintercept = n_action_median, color = palette_okabe_ito()[1]) +
  geom_segment(
    x = n_action_median - n_action_iqr,
    y = 0,
    xend = n_action_median + n_action_iqr,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_median,
    y = 1500,
    label = paste0("Md = ", n_action_median)
  ) +
  annotate(
    "label",
    x = n_action_median + n_action_iqr,
    y = 0,
    label = paste0("IQR = ", n_action_iqr)
  )

Show the code

#geom_label(aes(x = n_action_avg), y = 1, label = "Mean")

Mean number of actions per visit: 80.
SD of actions per visit: 123.
Median: 27.
IQR: 88.

7.2.2 fingerprint

Show the code

n_action_fingerprint_avg = mean(n_action_fingerprint$nr_max) |> round(0)
n_action_fingerprint_median = median(n_action_fingerprint$nr_max) |> round(0)
n_action_fingerprint_sd = sd(n_action_fingerprint$nr_max) |> round(0)
n_action_fingerprint_iqr = IQR(n_action_fingerprint$nr_max) |> round(0)

n_action_fingerprint |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    caption = "The vertical line shows the mean; the horizontal line shows mean±SD"
  ) +
  theme_minimal() +
  geom_vline(
    xintercept = n_action_fingerprint_avg,
    color = palette_okabe_ito()[1]
  ) +
  geom_segment(
    x = n_action_fingerprint_avg - n_action_fingerprint_sd,
    y = 0,
    xend = n_action_fingerprint_avg + n_action_fingerprint_sd,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_avg,
    y = 1500,
    label = paste0("Mean = ", n_action_fingerprint_avg)
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_avg + n_action_fingerprint_sd,
    y = 0,
    label = paste0("SD = ", n_action_fingerprint_sd)
  )

Show the code

#geom_label(aes(x = n_action_fingerprint_avg), y = 1, label = "Mean")

n_action_fingerprint |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    caption = "The vertical line shows the median; the horizontal line shows median±IQR"
  ) +
  theme_minimal() +
  geom_vline(
    xintercept = n_action_fingerprint_median,
    color = palette_okabe_ito()[1]
  ) +
  geom_segment(
    x = n_action_fingerprint_median - n_action_fingerprint_iqr,
    y = 0,
    xend = n_action_fingerprint_median + n_action_fingerprint_iqr,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_median,
    y = 1500,
    label = paste0("Md = ", n_action_fingerprint_median)
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_median + n_action_fingerprint_iqr,
    y = 0,
    label = paste0("IQR = ", n_action_fingerprint_iqr)
  )

Show the code

#geom_label(aes(x = n_action_fingerprint_avg), y = 1, label = "Mean")

7.3 Excluding the 499 values

7.3.1 idvisit

Show the code

n_action_avg2 = mean(n_action_lt_500$nr_max) |> round(0)
n_action_sd2 = sd(n_action_lt_500$nr_max) |> round(2)

n_action_lt_500 |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    title = "Distribution of user actions per visit",
    caption = "The vertical line shows the mean; the horizontal line shows mean±SD"
  ) +
  theme_minimal() +
  geom_vline(xintercept = n_action_avg2, color = palette_okabe_ito()[1]) +
  geom_segment(
    x = n_action_avg2 - n_action_sd2,
    y = 0,
    xend = n_action_avg2 + n_action_sd2,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_avg2,
    y = 1500,
    label = paste0("Mean = ", n_action_avg2)
  ) +
  annotate(
    "label",
    x = n_action_avg2 + n_action_sd2,
    y = 0,
    label = paste0("SD = ", n_action_sd2)
  )

Show the code

#geom_label(aes(x = n_action_avg), y = 1, label = "Mean")

Mean number of actions per visit: 62.
SD of actions per visit: 88.53.

7.3.2 fingerprint

Show the code

n_action_fingerprint_avg2 = mean(n_action_lt_500_fingerprint$nr_max) |> round(0)
n_action_fingerprint_sd2 = sd(n_action_lt_500_fingerprint$nr_max) |> round(2)

n_action_lt_500_fingerprint |>
  ggplot() +
  geom_histogram(aes(x = nr_max)) +
  labs(
    x = "Number of actions per visit",
    y = "n",
    title = "Distribution of user actions per visit",
    caption = "The vertical line shows the mean; the horizontal line shows mean±SD"
  ) +
  theme_minimal() +
  geom_vline(
    xintercept = n_action_fingerprint_avg2,
    color = palette_okabe_ito()[1]
  ) +
  geom_segment(
    x = n_action_fingerprint_avg2 - n_action_fingerprint_sd2,
    y = 0,
    xend = n_action_fingerprint_avg2 + n_action_fingerprint_sd2,
    yend = 0,
    color = palette_okabe_ito()[2],
    linewidth = 2
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_avg2,
    y = 1500,
    label = paste0("Mean = ", n_action_fingerprint_avg2)
  ) +
  annotate(
    "label",
    x = n_action_fingerprint_avg2 + n_action_fingerprint_sd2,
    y = 0,
    label = paste0("SD = ", n_action_fingerprint_sd2)
  )

Show the code

#geom_label(aes(x = n_action_avg), y = 1, label = "Mean")

7.4 Number of actions over time

7.4.1 Month

7.4.1.1 idvisit

Show the code

n_action_w_date |>
  ggplot(aes(x = month_date, y = nr_max)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(
    fun.data = mean_sdl,
    fun.args = list(mult = 1),
    geom = "errorbar",
    width = 0.2
  ) +
  geom_smooth(method = "lm") +
  labs(title = "The number of actions per visit has increased over time")

Show the code

n_action_w_date |>
  ggplot(aes(x = month_date, y = nr_max)) +
  geom_jitter(alpha = .1)

7.4.1.2 fingerprint

Show the code

n_action_w_date_fingerprint |>
  ggplot(aes(x = month_date, y = nr_max)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(
    fun.data = mean_sdl,
    fun.args = list(mult = 1),
    geom = "errorbar",
    width = 0.2
  ) +
  geom_smooth(method = "lm") +
  labs(title = "The number of actions per visit has increased over time")

Show the code

n_action_w_date_fingerprint |>
  ggplot(aes(x = month_date, y = nr_max)) +
  geom_jitter(alpha = .1)

7.4.2 Regression (month)

7.4.2.1 idvisit

Show the code

lm(nr_max ~ month_date, data = n_action_w_date)


Call:
lm(formula = nr_max ~ month_date, data = n_action_w_date)

Coefficients:
(Intercept)   month_date  
 -5.956e+02    3.937e-07

7.4.2.2 fingerprint

Show the code

lm(nr_max ~ month_date, data = n_action_w_date_fingerprint)


Call:
lm(formula = nr_max ~ month_date, data = n_action_w_date_fingerprint)

Coefficients:
(Intercept)   month_date  
 -1.186e+03    7.503e-07
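
The slopes above are hard to read because the predictor appears to be measured in seconds. Assuming that month_date is a POSIXct variable (an assumption suggested by the magnitude of the coefficients, not confirmed in the report), the slopes can be rescaled to actions per year. The variable names below are introduced only for this illustration.

```r
# Slopes copied from the regression output above
# (assumed unit: additional actions per second of calendar time)
slope_per_sec_idvisit <- 3.937e-07
slope_per_sec_fingerprint <- 7.503e-07

# Seconds in an average year (365.25 days)
secs_per_year <- 60 * 60 * 24 * 365.25

slope_per_year_idvisit <- slope_per_sec_idvisit * secs_per_year
slope_per_year_fingerprint <- slope_per_sec_fingerprint * secs_per_year

round(
  c(idvisit = slope_per_year_idvisit, fingerprint = slope_per_year_fingerprint),
  1
)
```

Under this assumption, the fitted trend corresponds to roughly 12.4 additional actions per visit per year and roughly 23.7 additional actions per fingerprint per year.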

7.4.3 Week

7.4.3.1 idvisit

Show the code

n_action_w_date |>
  mutate(week_date = as.Date(week_date)) |>
  ggplot(aes(x = week_date, y = nr_max)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(fun.data = mean_sdl, geom = "errorbar", width = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "The number of actions per visit has increased over time")

7.4.3.2 fingerprint

Show the code

n_action_w_date_fingerprint |>
  mutate(week_date = as.Date(week_date)) |>
  ggplot(aes(x = week_date, y = nr_max)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(fun.data = mean_sdl, geom = "errorbar", width = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "The number of actions per fingerprint has increased over time")

7.4.4 Regression (week)

7.4.4.1 idvisit

Show the code

lm(nr_max ~ week_date, data = n_action_w_date)


Call:
lm(formula = nr_max ~ week_date, data = n_action_w_date)

Coefficients:
(Intercept)    week_date  
  -5.93e+02     3.92e-07

7.4.4.2 fingerprint

Show the code

lm(nr_max ~ week_date, data = n_action_w_date_fingerprint)


Call:
lm(formula = nr_max ~ week_date, data = n_action_w_date_fingerprint)

Coefficients:
(Intercept)    week_date  
 -1.178e+03    7.453e-07

7.5 Grouping visits/fingerprints by number of actions

7.5.1 idvisit

Show the code

n_action_lt_500 <-
  n_action_lt_500 |>
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  )

Show the code

n_action_lt_500 |>
  count(n_actions_type) |>
  gt()

n_actions_type	n
glimpser	7388
heavy user	465
serious user	5773

Show the code

ggplot(n_action_lt_500) +
  aes(x = n_actions_type) +
  geom_bar()
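
As a possible alternative to case_when(), the same thresholds can be expressed with cut(). This is only a sketch on invented values, not the approach used in the pipeline; its one practical difference is that it returns a factor whose levels follow the order of the thresholds rather than the alphabet.

```r
# Invented action counts around the two thresholds (30 and 300)
nr_max_toy <- c(1, 29, 30, 299, 300, 496)

n_actions_type_toy <- cut(
  nr_max_toy,
  breaks = c(-Inf, 30, 300, Inf),
  labels = c("glimpser", "serious user", "heavy user"),
  right = FALSE # intervals are [lower, upper), matching "nr_max < 30" etc.
)

table(n_actions_type_toy)
```

With this factor, tables and bar charts list the groups as glimpser, serious user, heavy user instead of placing "heavy user" in the middle.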

7.5.2 fingerprint

Show the code

n_action_lt_500_fingerprint <-
  n_action_lt_500_fingerprint |>
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  )

Show the code

n_action_lt_500_fingerprint |>
  count(n_actions_type) |>
  gt()

n_actions_type	n
glimpser	3269
heavy user	334
serious user	3168

Show the code

ggplot(n_action_lt_500_fingerprint) +
  aes(x = n_actions_type) +
  geom_bar()

7.6 Grouping of visits over time

7.6.1 idvisit

Show the code

n_action_w_date |>
  ungroup() |>
  # classify each visit first, then count visits per month and type
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  ) |>
  count(month_date, n_actions_type) |>
  ggplot(aes(
    x = month_date,
    y = n,
    color = n_actions_type,
    group = n_actions_type
  )) +
  geom_point() +
  geom_line()

7.6.2 fingerprint

Show the code

n_action_w_date_fingerprint |>
  ungroup() |>
  # classify each fingerprint first, then count fingerprints per month and type
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  ) |>
  count(month_date, n_actions_type) |>
  ggplot(aes(
    x = month_date,
    y = n,
    color = n_actions_type,
    group = n_actions_type
  )) +
  geom_point() +
  geom_line()

8 Time spent per visit

8.1 Basis for computing time spent

Time spent was computed as the difference between the smallest and the largest date-time value (POSIXct) of a visit (i.e., per value of the variable idvisit), cf. [function diff_time](https://github.com/sebastiansauer/hans/blob/main/funs/diff_time.R). This variable is called `time_diff` in the object `time_spent`.

The object data_separated_filtered is used for this, cf. the definition of the target “time_spent” in the targets pipeline.
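
The computation described here can be sketched as follows. This is a minimal illustration on an invented click log, not the actual `diff_time` function; the object names `toy_log` and `toy_time_spent` and the column `timestamp` are assumptions.

```r
library(dplyr)

# Hypothetical toy click log: one row per action, not the HaNS data
toy_log <- tibble(
  idvisit = c(1, 1, 1, 2, 2),
  timestamp = as.POSIXct(
    c(
      "2024-01-10 09:00:00", "2024-01-10 09:05:30", "2024-01-10 09:20:00",
      "2024-01-11 14:00:00", "2024-01-11 14:00:45"
    ),
    tz = "UTC"
  )
)

toy_time_spent <-
  toy_log |>
  group_by(idvisit) |>
  summarise(
    time_min = min(timestamp),
    time_max = max(timestamp),
    # latest minus earliest timestamp, with the unit fixed to seconds
    time_diff = difftime(time_max, time_min, units = "secs")
  )
```

Fixing `units = "secs"` avoids the automatic unit choice of `difftime()`, which otherwise depends on the size of the differences.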

8.2 Preprocessing

Visit time was limited to 600 minutes: visits lasting 600 minutes or longer were excluded.
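
The preprocessing code in this section uses `filter(t_minutes < 600)`, which drops long visits rather than clipping them. The difference between the two readings of "limited" can be sketched on made-up durations (all values are invented):

```r
# Made-up visit durations in minutes
t_minutes <- c(5, 120, 599, 600, 1500)

# Filtering, as with filter(t_minutes < 600): long visits are dropped
t_filtered <- t_minutes[t_minutes < 600]

# Capping would instead keep every visit and clip the long ones to 600
t_capped <- pmin(t_minutes, 600)

mean(t_filtered)
mean(t_capped)
```

The two approaches lead to different sample sizes and different means, so the choice should be kept in mind when interpreting the statistics that follow.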

8.2.1 idvisit

Show the code

time_spent |>
  head(30)

Show the code

time_spent <-
  time_spent |>
  # compute time (t) in minutes (min):
  mutate(t_minutes = as.numeric(time_diff, units = "mins")) |>
  filter(t_minutes < 600)

8.2.2 fingerprint

Show the code

time_spent_fingerprint |>
  head(30)

Show the code

time_spent_fingerprint <-
  time_spent_fingerprint |>
  # compute time (t) in minutes (min):
  mutate(t_minutes = as.numeric(time_diff, units = "mins")) |>
  filter(t_minutes < 600)

8.3 Time-spent statistics in seconds

Time spent is reported below in seconds, based on the computation described above.

8.3.1 idvisit

Show the code

time_spent |>
  summarise(
    mean_time_diff = round(mean(time_diff), 2),
    sd_time_diff = sd(time_diff),
    min_time_diff = min(time_diff), # shortest duration
    max_time_diff = max(time_diff) # longest
  )

8.3.2 fingerprint

Show the code

time_spent_fingerprint |>
  summarise(
    mean_time_diff = round(mean(time_diff), 2),
    sd_time_diff = sd(time_diff),
    min_time_diff = min(time_diff), # shortest duration
    max_time_diff = max(time_diff) # longest
  )

8.4 Time spent based on the variable `visitduration`

8.4.1 For all data

As an alternative to computing time spent, a variable, visitduration, is available that (apparently) measures, or is meant to measure, the duration of a visit.

However, substantially different values result when this variable is used to compute time spent, cf. target time_duration in the targets pipeline.
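
One way to inspect such a discrepancy is to put both measures side by side per visit. The sketch below uses invented numbers; the column names follow the report, while the objects `toy_time_spent`, `toy_time_duration` and `toy_comparison` are hypothetical.

```r
library(dplyr)

# Hypothetical per-visit values; the numbers are invented for illustration
toy_time_spent <- tibble(
  idvisit = c(1, 2, 3),
  time_diff = as.difftime(c(1200, 45, 300), units = "secs")
)

toy_time_duration <- tibble(
  idvisit = c(1, 2, 3),
  visitduration_sec = c(1260, 45, 900)
)

toy_comparison <-
  toy_time_spent |>
  inner_join(toy_time_duration, by = "idvisit") |>
  mutate(
    time_diff_sec = as.numeric(time_diff, units = "secs"),
    # positive values: visitduration is longer than the computed time
    deviation_sec = visitduration_sec - time_diff_sec
  )
```

A summary or histogram of `deviation_sec` would then show whether the two measures differ systematically or only for some visits.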

Show the code

time_duration |>
  head(30)

Show the code

time_duration |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)

8.4.2 For unique idvisits

Show the code

time_duration |>
  distinct(idvisit, .keep_all = TRUE) |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)

8.4.3 For unique fingerprints

Show the code

time_duration |>
  distinct(fingerprint, .keep_all = TRUE) |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)

8.5 Time-spent statistics in minutes

Show the code

time_spent |>
  mutate(time_diff_minutes = time_length(time_diff, unit = "minute")) |>
  summarise(
    mean_time_diff = round(mean(time_diff_minutes), 2),
    sd_time_diff = sd(time_diff_minutes),
    min_time_diff = min(time_diff_minutes), # shortest duration
    max_time_diff = max(time_diff_minutes) # longest
  )

Show the code

small_padding_theme <- ggpubr::ttheme(
  tbody.style = tbody_style(size = 8), # Smaller font size can help
  colnames.style = colnames_style(size = 9, face = "bold"),
  padding = unit(c(2, 2), "mm") # Reduce horizontal and vertical padding
)

Show the code

ggpubr::ggtexttable(
  time_spent_summary,
  rows = NULL,
  theme = small_padding_theme
)

8.6 Visualization of time spent

8.6.1 Bin width = 10 minutes

Show the code

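time_spent |>
  # scale_x_time() reads its input as seconds, so the duration is kept
  # in seconds and the 10-minute bin width is given as 600 seconds:
  mutate(time_diff_hms = hms::as_hms(as.numeric(time_diff, units = "secs"))) |>
  ggplot(aes(x = time_diff_hms)) +
  geom_histogram(binwidth = 600) +
  theme_minimal() +
  labs(y = "n", x = "Verweildauer in HaNS pro Visit in h:m:s") +
  scale_x_time(breaks = scales::pretty_breaks())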

8.6.2 Bin width = 20 minutes

Show the code

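time_spent |>
  # scale_x_time() reads its input as seconds, so the duration is kept
  # in seconds and the 20-minute bin width is given as 1200 seconds:
  mutate(time_diff_hms = hms::as_hms(as.numeric(time_diff, units = "secs"))) |>
  ggplot(aes(x = time_diff_hms)) +
  geom_histogram(binwidth = 1200) +
  theme_minimal() +
  labs(
    y = "n",
    x = "Verweildauer",
    title = "Verweildauer in HaNS pro Visit in h:m:s"
  ) +
  scale_x_time(breaks = scales::pretty_breaks())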

8.6.3 Duration limited to 1-120 min.

Show the code

time_spent2 <-
  time_spent |>
  # t_minutes (computed in the preprocessing step) holds the duration in
  # minutes, so the limits and the bin width below are in minutes:
  filter(t_minutes > 1, t_minutes < 120)

time_spent2 |>
  ggplot(aes(x = t_minutes)) +
  geom_histogram(binwidth = 10) +
  theme_minimal() +
  labs(
    y = "n",
    x = "Verweildauer in HaNS pro Visit in Minuten",
    title = "Verweildauer begrenzt auf 1-120 Minuten",
    caption = "binwidth = 10 Min."
  )

8.6.4 Change in time spent over time

8.6.4.1 Month

The unit of time_spent is seconds.
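
Because the unit matters for every statistic and axis that follows, it can help to convert `difftime` values explicitly instead of dividing by 60. A minimal base-R sketch on invented durations (the object `d` is hypothetical):

```r
# Invented durations, stored in seconds
d <- as.difftime(c(90, 600, 3600), units = "secs")

# Explicit conversion: returns plain numbers in the requested unit
d_minutes <- as.numeric(d, units = "mins")

# Dividing by 60 rescales the numbers but keeps the "secs" label,
# so the object still claims to be in seconds
d_divided <- d / 60
units(d_divided)
```

With `as.numeric(x, units = "mins")` the result does not depend on the unit in which the `difftime` object happens to be stored.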

Show the code

time_spent_by_month <-
  time_spent |>
  mutate(month_start = floor_date(time_min, "month")) |>
  mutate(
    month_name = month(month_start, label = TRUE, abbr = FALSE),
    month_num = month(month_start, label = FALSE),
    year = year(month_start)
  ) |>
  group_by(month_num, year) |>
  summarise(
    time_spent_month_avg = mean(time_diff, na.rm = TRUE),
    time_spent_month_sd = sd(time_diff, na.rm = TRUE)
  ) |>
  arrange(year, month_num)

time_spent_by_month

Show the code

time_spent_by_month |>
  mutate(
    time_spent_month_avg = round(time_spent_month_avg, 2),
    time_spent_month_sd = round(time_spent_month_sd, 2)
  ) |>
  ggtexttable()

Show the code

time_spent_by_month_name <-
  time_spent |>
  mutate(month_start = floor_date(time_min, "month")) |>
  mutate(
    month_name = month(month_start, label = TRUE, abbr = FALSE),
    month_num = month(month_start, label = FALSE),
    year = year(month_start)
  ) |>
  group_by(month_start, year) |>
  summarise(
    time_spent_month_avg = mean(time_diff, na.rm = TRUE),
    time_spent_month_sd = sd(time_diff, na.rm = TRUE)
  )

time_spent_by_month_name |>
  ggplot(aes(x = month_start, y = time_spent_month_avg)) +
  geom_line(group = 1, color = "grey60") +
  geom_point()

8.6.4.2 Year

Show the code

time_spent_by_year <-
  time_spent |>
  mutate(month_start = floor_date(time_min, "month")) |>
  mutate(
    month_name = month(month_start, label = TRUE, abbr = FALSE),
    month_num = month(month_start, label = FALSE),
    year = year(month_start)
  ) |>
  group_by(year) |>
  summarise(
    time_spent_avg = mean(time_diff, na.rm = TRUE),
    time_spent_sd = sd(time_diff, na.rm = TRUE)
  )

time_spent_by_year

Show the code

time_spent_by_year |>
  ggplot(aes(x = year, y = time_spent_avg)) +
  geom_line(group = 1, color = "grey60") +
  geom_point()

8.6.4.3 Week

Show the code

time_spent_by_week_name <-
  time_spent |>
  mutate(week_start = floor_date(time_min, "week")) |>
  mutate(week_num = week(week_start), year = year(week_start)) |>
  group_by(week_start, year) |>
  summarise(
    time_spent_week_avg = mean(time_diff, na.rm = TRUE),
    time_spent_week_sd = sd(time_diff, na.rm = TRUE)
  )

time_spent_by_week_name |>
  ggplot(aes(x = week_start, y = time_spent_week_avg)) +
  geom_line(group = 1, color = "grey60") +
  geom_point()

8.7 Relationship between courses and time spent

Show the code

time_spent_w_course_university_summary <-
  time_spent_w_course_university |>
  group_by(floor_date_month) |>
  summarise(
    distinct_courses_n = n_distinct(course),
    diff_time_mean = mean(time_diff, na.rm = TRUE),
    n = n()
  )

time_spent_w_course_university_summary

Show the code

time_spent_w_course_university_summary |>
  ggplot(aes(x = distinct_courses_n, y = diff_time_mean)) +
  geom_point()

8.8 Relationship between courses and number of visits

Show the code

time_spent_w_course_university_summary |>
  ggplot(aes(x = distinct_courses_n, y = n)) +
  geom_point() +
  labs(y = "No. of visits per month", x = "No. of distinct courses per month")

9 What do the users do?

What do the visitors actually do? And how often?

9.1 Frequencies

For the object n_action_type, the column subtitle in the long-format data was evaluated; see the function definition of count_user_action_type.

Show the code

n_action_type |>
  head(30)

Caution: It may be more sensible to use the analysis based on eventcategory instead of this one. That analysis takes all types of events into account, whereas the present one covers only selected events.

9.1.1 By selected categories

Show the code

n_action_type_counted <-
  n_action_type |>
  drop_na() |>
  count(category, sort = TRUE) |>
  mutate(prop = round(n / sum(n), 2))

n_action_type_counted |>
  gt()

category	n	prop
video	845813	0.84
click_slideChange	61934	0.06
visit_page	55551	0.06
Media item	17485	0.02
login	6550	0.01
in_media_search	3422	0.00
Search Results Count	2856	0.00
click_topic	2799	0.00
Medien	1646	0.00
logout	1495	0.00
Kanäle	1395	0.00
GESOA	1358	0.00
click_channelcard	848	0.00
Evaluation	183	0.00
Data protection	39	0.00

9.1.2 By category over time

Show the code

n_action_type_per_month <-
  n_action_type |>
  select(nr, idvisit, category) |>
  ungroup() |>
  left_join(time_visit_wday |> ungroup()) |>
  select(-c(dow, hour, nr)) |>
  drop_na() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  count(month_start, category)

Show the code

n_action_type_per_month

9.1.3 Top 3 categories only

9.1.3.1 idvisit

Show the code

time_visit_wday |>
  head(30)

Show the code

n_action_type_per_month_top3 <-
  n_action_type |>
  select(nr, idvisit, category) |>
  ungroup() |>
  filter(category %in% c("video", "click_slideChange", "visit_page")) |>
  left_join(time_visit_wday |> ungroup()) |>
  select(-c(dow, hour, nr)) |>
  drop_na() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  count(month_start, category)

Show the code

n_action_type_per_month_top3

Show the code

n_action_type_per_month_top3 |>
  ggplot(aes(x = month_start, y = n, color = category, group = category)) +
  geom_line()

9.1.3.2 fingerprint

Show the code

time_visit_wday_fingerprint |>
  head(30)

Show the code

n_action_type_per_month_top3_fingerprint <-
  n_action_type |>
  select(nr, fingerprint, category) |>
  ungroup() |>
  filter(category %in% c("video", "click_slideChange", "visit_page")) |>
  left_join(time_visit_wday_fingerprint |> ungroup()) |>
  select(-c(dow, hour, nr)) |>
  drop_na() |>
  mutate(month_start = floor_date(date_time, "month")) |>
  count(month_start, category)

Show the code

n_action_type_per_month_top3_fingerprint

Show the code

n_action_type_per_month_top3_fingerprint |>
  ggplot(aes(x = month_start, y = n, color = category, group = category)) +
  geom_line()

9.1.4 Top 3 - per course

Show the code

n_action_type_course_uni <-
  n_action_type |>
  left_join(course_and_uni_per_visit |> mutate(idvisit = as.integer(idvisit)))

Show the code

n_action_type_per_month_top3_per_course <-
  n_action_type_course_uni |>
  filter(category %in% c("video", "click_slideChange", "visit_page")) |>
  drop_na() |>
  mutate(month_start = floor_date(actiondetails_0_timestamp, "month")) |>
  count(course, month_start, category)

Show the code

n_action_type_per_month_top3_per_course |>
  ggplot(aes(x = month_start, y = n, color = category, group = category)) +
  facet_wrap(~course, ncol = 3, scales = "free_y") +
  geom_line() +
  theme(legend.position = "bottom") +
  scale_x_date(date_labels = "%b %Y")

9.1.5 `eventcategory`

The following analysis uses a different variable than above, namely eventcategory. This yields somewhat different results.

Show the code

data_separated_filtered_count <-
  data_separated_filtered |>
  filter(type == "eventcategory") |>
  count(value, sort = TRUE) |>
  mutate(prop = n / sum(n))

data_separated_filtered_count

Show the code

data_separated_filtered_count |>
  ggtexttable()

Save as an Excel file:

Show the code

#data_separated_filtered_count |>
#  writexl::write_xlsx(path = "obj/data_separated_filtered_count.xlsx")

9.1.6 User types by activity

What is the main activity per user? - Distribution

9.1.6.1 idvisit

Show the code

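n_action_type_distro <-
  n_action_type |>
  ungroup() |>
  drop_na(category) |>
  # max() on a character column returns the alphabetically last value,
  # so the most frequent category per visit is determined explicitly:
  count(idvisit, category, name = "n_actions") |>
  group_by(idvisit) |>
  slice_max(n_actions, n = 1, with_ties = FALSE) |>
  ungroup() |>
  count(category_max = category)

n_action_type_distro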

Show the code

n_action_type_distro |>
  ggplot(aes(x = n, y = category_max)) +
  geom_col()

9.1.6.2 fingerprint

Show the code

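n_action_type_distro_fingerpr <-
  n_action_type |>
  ungroup() |>
  drop_na(category) |>
  # max() on a character column returns the alphabetically last value,
  # so the most frequent category per fingerprint is determined explicitly:
  count(fingerprint, category, name = "n_actions") |>
  group_by(fingerprint) |>
  slice_max(n_actions, n = 1, with_ties = FALSE) |>
  ungroup() |>
  count(category_max = category)

n_action_type_distro_fingerpr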

Show the code

n_action_type_distro_fingerpr |>
  ggplot(aes(x = n, y = category_max)) +
  geom_col()

9.2 Distribution

Show the code

n_action_type_counted <-
  n_action_type |>
  count(category, sort = TRUE)

9.2.1 Overall - raw values

Show the code

n_action_type_counted |>
  ggplot(aes(y = reorder(category, n), x = n)) +
  geom_col() +
  geom_bar_text() +
  labs(
    x = "Anzahl der User-Aktionen",
    y = "Aktion",
    title = "Anzahl der User-Aktionen nach Kategorie"
  ) +
  theme_minimal() +
  scale_x_continuous(labels = scales::comma)

9.2.2 Overall - log scale

Show the code

n_action_type_counted |>
  ggplot(aes(y = reorder(category, n), x = n)) +
  geom_col() +
  geom_bar_text() +
  labs(
    x = "Anzahl der User-Aktionen",
    y = "Aktion",
    title = "Anzahl der User-Aktionen nach Kategorie",
    caption = "Log10-Skala"
  ) +
  theme_minimal() +
  scale_x_log10()

9.2.3 Per course - log10 of the raw counts

Show the code

n_action_type_course_uni_counted <-
  n_action_type_course_uni |>
  group_by(course) |>
  count(category, sort = TRUE) |>
  drop_na()

Show the code

n_action_type_course_uni_counted |>
  ggplot() +
  aes(y = category, x = log(n, base = 10)) +
  geom_col() +
  facet_wrap(~course)

10 On which days and at what time do users come to HaNS?

10.1 Setup

10.1.1 idvisit

Show the code

# Define a vector with the names of the days of the week
# Note: Adjust the start of the week (Sunday or Monday) as per your requirement
days_of_week <- c(
  "Monday",
  "Tuesday",
  "Wednesday",
  "Thursday",
  "Friday",
  "Saturday",
  "Sunday"
)

# Replace numbers with day names
time_visit_wday$dow2 <- factor(
  days_of_week[time_visit_wday$dow],
  levels = days_of_week
)
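
The lookup above relies on `dow` being coded as 1 = Monday through 7 = Sunday. How `dow` is computed in the pipeline is not shown here; the sketch below assumes, for illustration only, that it comes from `lubridate::wday()` with `week_start = 1`. The dates and the object `toy_dates` are made up.

```r
library(lubridate)

days_of_week <- c(
  "Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday", "Saturday", "Sunday"
)

# 2024-01-01 was a Monday, 2024-01-03 a Wednesday, 2024-01-07 a Sunday
toy_dates <- as.Date(c("2024-01-01", "2024-01-03", "2024-01-07"))

# week_start = 1 codes Monday as 1 and Sunday as 7
dow <- wday(toy_dates, week_start = 1)

# Same lookup as in the report: numbers are replaced by ordered day names
dow2 <- factor(days_of_week[dow], levels = days_of_week)
```

If the pipeline used the lubridate default (week starting on Sunday), the vector `days_of_week` would have to start with "Sunday" instead.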

10.1.2 fingerprint

Show the code

# Define a vector with the names of the days of the week
# Note: Adjust the start of the week (Sunday or Monday) as per your requirement
days_of_week <- c(
  "Monday",
  "Tuesday",
  "Wednesday",
  "Thursday",
  "Friday",
  "Saturday",
  "Sunday"
)

# Replace numbers with day names
time_visit_wday_fingerprint$dow2 <- factor(
  days_of_week[time_visit_wday_fingerprint$dow],
  levels = days_of_week
)

10.2 HaNS logins by time of day

10.2.1 idvisit

Show the code

time_visit_wday |>
  as_tibble() |>
  count(hour) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "HaNS-Nutzer sind keine Frühaufsteher",
    x = "Uhrzeit",
    y = "Anteil"
  )

Show the code

# coord_polar()

10.2.2 fingerprint

Show the code

time_visit_wday_fingerprint |>
  as_tibble() |>
  count(hour) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "HaNS-Nutzer sind keine Frühaufsteher",
    x = "Uhrzeit",
    y = "Anteil"
  )

Show the code

# coord_polar()

Show the code

time_visit_wday_fingerprint |>
  as_tibble() |>
  count(hour) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  theme_minimal() +
  coord_polar()

10.3 Distribution of HaNS visits by weekday

10.3.1 idvisit

Show the code

time_visit_wday |>
  as_tibble() |>
  count(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = dow2, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen",
    x = "Wochentag",
    y = "Anteil"
  )

Show the code

# coord_polar()

Show the code

time_visit_wday |>
  as_tibble() |>
  count(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = dow2, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen",
    x = "Wochentag",
    y = "Anteil"
  ) +
  coord_polar()

10.3.1.1 fingerprint

Show the code

time_visit_wday_fingerprint |>
  as_tibble() |>
  count(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = dow2, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen",
    x = "Wochentag",
    y = "Anteil"
  )

Show the code

# coord_polar()

Show the code

time_visit_wday_fingerprint |>
  as_tibble() |>
  count(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = dow2, y = prop)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen",
    x = "Wochentag",
    y = "Anteil"
  ) +
  coord_polar()

10.3.2 HaNS logins by weekday and time of day

10.3.2.1 idvisit

Show the code

time_visit_wday |>
  as_tibble() |>
  count(dow2, hour) |>
  group_by(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  facet_wrap(~dow2) +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten",
    x = "Uhrzeit",
    y = "Anteil"
  )

Show the code

# coord_polar()

Show the code

time_visit_wday |>
  as_tibble() |>
  count(dow2, hour) |>
  group_by(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  facet_wrap(~dow2) +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten",
    x = "Uhrzeit",
    y = "Anteil"
  ) +
  coord_polar()

10.3.2.2 fingerprint

Show the code

time_visit_wday_fingerprint |>
  as_tibble() |>
  count(dow2, hour) |>
  group_by(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  facet_wrap(~dow2) +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten",
    x = "Uhrzeit",
    y = "Anteil"
  )

Show the code

# coord_polar()

Show the code

time_visit_wday_fingerprint |>
  as_tibble() |>
  count(dow2, hour) |>
  group_by(dow2) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = hour, y = prop)) +
  geom_col() +
  facet_wrap(~dow2) +
  theme_minimal() +
  labs(
    title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten",
    x = "Uhrzeit",
    y = "Anteil"
  ) +
  coord_polar()

10.4 Number of visits by date (days) and time of day (bin2d)

10.4.1 idvisit

Show the code

time2 <-
  time_visit_wday |>
  ungroup() |>
  mutate(date = as.Date(date_time)) |>
  mutate(month_start = floor_date(date_time, "month"))

time2 |>
  ggplot(aes(x = date, y = hour)) +
  geom_bin2d(binwidth = c(1, 1)) + # (1 day, 1 hour)
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(caption = "Each x-bin maps to one day") +
  # only one x scale is kept: a later scale_x_date() replaces an earlier one
  scale_x_date(breaks = scales::breaks_pretty())

10.4.2 fingerprint

Show the code

time2_fingerprint <-
  time_visit_wday_fingerprint |>
  ungroup() |>
  mutate(date = as.Date(date_time)) |>
  mutate(month_start = floor_date(date_time, "month"))

time2_fingerprint |>
  ggplot(aes(x = date, y = hour)) +
  geom_bin2d(binwidth = c(1, 1)) + # (1 day, 1 hour)
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(caption = "Each x-bin maps to one day") +
  # only one x scale is kept: a later scale_x_date() replaces an earlier one
  scale_x_date(breaks = scales::breaks_pretty())

10.5 Number of visits by date (weeks) and time of day (bin2d)

10.5.1 idvisit

Show the code

time2 |>
  ggplot(aes(x = date, y = hour)) +
  geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 hour
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(
    x = "Date",
    caption = "Each x-bin maps to one week"
  ) +
  # only one x scale is kept: a later scale_x_date() replaces an earlier
  # one, so week-number labels ("%W") would not be shown
  scale_x_date(breaks = scales::breaks_pretty())

10.5.2 fingerprint

Show the code

time2_fingerprint |>
  ggplot(aes(x = date, y = hour)) +
  geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 hour
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(
    x = "Date",
    caption = "Each x-bin maps to one week"
  ) +
  # only one x scale is kept: a later scale_x_date() replaces an earlier
  # one, so week-number labels ("%W") would not be shown
  scale_x_date(breaks = scales::breaks_pretty())

10.6 Number of visits by date (weeks) and weekday (bin2d)

10.6.1 idvisit

Show the code

time2 |>
  ggplot(aes(x = date, y = dow)) +
  geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 day of the week
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(
    x = "Date",
    caption = "Each x-bin maps to one week",
    y = "Day of Week"
  ) +
  scale_y_continuous(breaks = 1:7) +
  # only one x scale is kept: a later scale_x_date() replaces an earlier
  # one, so week-number labels ("%W") would not be shown
  scale_x_date(breaks = scales::breaks_pretty())

10.6.2 fingerprint

Show the code

time2_fingerprint |>
  ggplot(aes(x = date, y = dow)) +
  geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 day of the week
  theme(legend.position = "bottom") +
  scale_fill_viridis_c() +
  labs(
    x = "Date",
    caption = "Each x-bin maps to one week",
    y = "Day of Week"
  ) +
  scale_y_continuous(breaks = 1:7) +
  # only one x scale is kept: a later scale_x_date() replaces an earlier
  # one, so week-number labels ("%W") would not be shown
  scale_x_date(breaks = scales::breaks_pretty())

11 AI usage

11.1 Interaction with the LLM

Basis of computation: for this analysis, all events whose category contains the string llm were selected.

11.1.1 Type and number of interactions with the LLM

Show the code

data_separated_filtered_ai <-
  data_separated_filtered |>
  filter(type == "eventcategory") |>
  filter(str_detect(value, "llm")) |>
  count(value, sort = TRUE) |>
  mutate(prop = n / sum(n))

data_separated_filtered_ai

Show the code

data_separated_filtered_ai |>
  mutate(prop = round(prop, 3)) |>
  ggtexttable()

11.2 Number of `message_to_llm` events

Show the code

llm_interactions <-
  data_separated_filtered |>
  filter(str_detect(value, "message_to_llm"))

11.2.1 Distribution

Show the code

llm_interactions_count <-
  llm_interactions |>
  count(idvisit, sort = TRUE) |>
  rename(messages_to_llm_n = n)

llm_interactions_count |>
  describe_distribution(messages_to_llm_n, centrality = c("mean", "median"))

11.2.2 Plot

Show the code

gghistogram(
  llm_interactions_count,
  x = "messages_to_llm_n",
  bins = 10,
  add = "median"
) +
  labs(caption = "The vertical dotted line denotes the median.")

11.2.3 Share of visitors who interact with the LLM

11.2.3.1 idvisit

Show the code

data_separated_filtered_llm_interact <-
  data_separated_filtered |>
  mutate(has_llm = str_detect(value, "llm")) |>
  group_by(idvisit) |>
  summarise(llm_used_during_visit = any(has_llm == TRUE)) |>
  count(llm_used_during_visit) |>
  mutate(prop = round(n / sum(n), 2))

data_separated_filtered_llm_interact |>
  gt()

llm_used_during_visit	n	prop
FALSE	13419	0.94
TRUE	788	0.06

Show the code

data_separated_filtered_llm_interact |>
  ggtexttable()

11.2.3.2 fingerprint

Show the code

data_separated_filtered_llm_interact_fingerprint <-
  data_separated_filtered |>
  mutate(has_llm = str_detect(value, "llm")) |>
  group_by(fingerprint) |>
  summarise(llm_used_during_visit = any(has_llm == TRUE)) |>
  count(llm_used_during_visit) |>
  mutate(prop = round(n / sum(n), 2))

data_separated_filtered_llm_interact_fingerprint |>
  gt()

llm_used_during_visit	n	prop
FALSE	6649	0.93
TRUE	511	0.07

Show the code

data_separated_filtered_llm_interact_fingerprint |>
  ggtexttable()

11.2.4 … over time

Show the code

idvisit_has_llm |>
  head(30)

Show the code

idvisit_has_llm_timeline <-
  idvisit_has_llm |>
  count(year_month, uses_llm) |>
  ungroup() |>
  group_by(year_month) |>
  mutate(prop = round(n / sum(n), 2))

idvisit_has_llm_timeline

Show the code

idvisit_has_llm_timeline |>
  ggtexttable()

Show the code

idvisit_has_llm |>
  count(year_month, uses_llm) |>
  ungroup() |>
  mutate(year_month_date = ymd(paste0(year_month, "-01"))) |>
  group_by(year_month_date) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(
    x = year_month_date,
    y = prop,
    color = uses_llm,
    group = uses_llm
  )) +
  geom_point() +
  geom_line(aes(group = uses_llm)) +
  labs(
    title = "Visitors, die mit dem LLM interagieren im Zeitverlauf (Anteile)"
  ) +
  scale_x_date(breaks = pretty_breaks())

Show the code

idvisit_has_llm |>
  count(year_month, uses_llm) |>
  ungroup() |>
  mutate(year_month_date = ymd(paste0(year_month, "-01"))) |>
  ggplot(aes(x = year_month_date, y = n, color = uses_llm, group = uses_llm)) +
  geom_point() +
  geom_line(aes(group = uses_llm)) +
  labs(
    title = "Visitors, die mit dem LLM interagieren im Zeitverlauf (Anzahl)"
  ) +
  scale_x_date(breaks = pretty_breaks())

11.3 Number of interactions among users who interact with the LLM

Show the code

d_n_interactions_w_llm <-
  data_separated_filtered |>
  filter(type == "eventcategory") |>
  filter(str_detect(value, "llm")) |>
  group_by(idvisit) |>
  summarise(n_interactions_w_llm = n())

Show the code

d_n_interactions_w_llm |>
  select(n_interactions_w_llm) |>
  describe_distribution() |>
  print_md()

Variable	Mean	SD	IQR	Range	Skewness	Kurtosis	n	n_Missing
n_interactions_w_llm	165.59	216.45	459	(1.00, 500.00)	0.69	-1.46	640	0

Show the code

d_n_interactions_w_llm |>
  ggplot(aes(x = n_interactions_w_llm)) +
  geom_histogram()

11.4 Click on a word in the transcript

The following analysis evaluates the variable “click_transcript_word”.

11.4.1 Overall

Show the code

data_separated_filtered |>
  filter(type == "subtitle") |>
  # rm empty rows:
  filter(!is.na(value) & value != "") |>
  count(click_transcript_word = str_detect(value, "click_transcript_word")) |>
  mutate(prop = round(n / sum(n), 2)) |>
  tt()

click_transcript_word	n	prop
FALSE	1138774	0.99
TRUE	8439	0.01

11.4.2 Over time

11.4.2.1 idvisit

Show the code

click_transcript_word_per_month <-
  data_separated_filtered |>
  # keep only visits WITH at least one "click_transcript_word" event:
  group_by(idvisit) |>
  filter(any(str_detect(value, "click_transcript_word"), na.rm = TRUE)) |>
  ungroup() |>
  mutate(date_visit = ymd_hms(value)) |>
  mutate(month_visit = floor_date(date_visit, unit = "month")) |>
  drop_na(date_visit) |>
  group_by(idvisit) |>
  slice(1) |>
  ungroup() |>
  count(month_visit)

click_transcript_word_per_month

Show the code

click_transcript_word_per_month |>
  ggplot(aes(x = month_visit, y = n)) +
  geom_line()

11.4.2.2 fingerprint

Show the code

click_transcript_word_per_month_fingerprint <-
  data_separated_filtered |>
  # keep only fingerprints WITH at least one "click_transcript_word" event:
  group_by(fingerprint) |>
  filter(any(str_detect(value, "click_transcript_word"), na.rm = TRUE)) |>
  ungroup() |>
  mutate(date_visit = ymd_hms(value)) |>
  mutate(month_visit = floor_date(date_visit, unit = "month")) |>
  drop_na(date_visit) |>
  group_by(fingerprint) |>
  slice(1) |>
  ungroup() |>
  count(month_visit)

click_transcript_word_per_month_fingerprint

Show the code

click_transcript_word_per_month_fingerprint |>
  ggplot(aes(x = month_visit, y = n)) +
  geom_line()

11.5 AI actions

11.5.1 Overall (entire period)

Show the code

data_long |>
  head(300)

11.5.2 In detail

Show the code

regex_pattern <- "Category: \"(.*?)(?=', Action)"

# Explaining this regex_pattern:
# 1. Match the literal string `Category: "` (including the opening double quote).
# 2. Capture any characters `(.*?)` that follow, non-greedily, until ...
# 3. ... the lookahead `(?=', Action)` finds the literal sequence `', Action`
#    immediately after the captured string (the lookahead is not extracted).

ai_actions_count <-
  data_long |>
  # slice(1:1000) |>
  filter(str_detect(value, "transcript")) |>
  mutate(category = str_extract(value, regex_pattern)) |>
  select(category) |>
  mutate(category = str_replace_all(category, "[\"']", "")) |>
  count(category, sort = TRUE)

ai_actions_count |>
  tt()

category	n
NA	217862
Category: clear_transcript_text_for_llm_context	104111
Category: click_transcript_word	8439
Category: select_transcript_text_for_llm_context	576
Category: click_button	43
Category: llm_response_de	3
Category: llm_response_en	3
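
To make the extraction step easier to follow, the regular expression can be tried on a small example. The raw strings in `toy_value` are hypothetical, since the exact format of `value` in the long-format data is an assumption; only the pattern itself is taken from the code above.

```r
library(stringr)

regex_pattern <- "Category: \"(.*?)(?=', Action)"

# Hypothetical raw strings; the real format of `value` is an assumption
toy_value <- c(
  "'Category: \"click_transcript_word', Action: \"click\"",
  "some value without a category"
)

# str_extract() returns the whole match (not only the capture group)
# and NA where the pattern does not occur
extracted <- str_extract(toy_value, regex_pattern)

# Same clean-up as in the report: remove single and double quotes
cleaned <- str_replace_all(extracted, "[\"']", "")
cleaned
```

Values that do not match the pattern become `NA`, which is one possible reason for the large `NA` row in the table.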

11.5.3 AI clicks per month

The object counts how often the string "click_transcript_word" is found in the data (long format); see target ai_transcript_clicks_per_month in the targets pipeline.
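
The kind of counting described here can be sketched on a toy table. This is an illustration under assumptions, not the definition of the target: the object `toy_long` and its `timestamp` column are hypothetical, while `year_month` and `clicks_transcript_any` mirror the column names used in the code below.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical long-format data: one row per logged value
toy_long <- tibble(
  timestamp = ymd_hms(c(
    "2024-01-05 10:00:00", "2024-01-20 11:00:00",
    "2024-02-02 09:30:00", "2024-02-03 16:45:00"
  )),
  value = c(
    "click_transcript_word", "video",
    "click_transcript_word", "click_transcript_word"
  )
)

toy_clicks_per_month <-
  toy_long |>
  mutate(
    # month label such as "2024-01"
    year_month = format(floor_date(timestamp, "month"), "%Y-%m"),
    # TRUE if the row contains the search string
    clicks_transcript_any = str_detect(value, "click_transcript_word")
  ) |>
  count(year_month, clicks_transcript_any)
```

Each month then has up to two rows, one for rows with and one for rows without the search string.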

Show the code

ai_transcript_clicks_per_month |>
  head(30)

Show the code

ai_transcript_clicks_per_month_count <-
  ai_transcript_clicks_per_month |>
  count(year_month, clicks_transcript_any) |>
  ungroup() |>
  group_by(year_month) |>
  mutate(prop = round(n / sum(n), 2))

ai_transcript_clicks_per_month_count

Show the code

ai_transcript_clicks_per_month_count |>
  ggtexttable()

Show the code

ai_transcript_clicks_per_month_count |>
  mutate(date = ymd(paste0(year_month, "-01"))) |>
  ggplot(aes(x = date, y = n)) +
  geom_line(group = 1) +
  geom_point() +
  theme_minimal() +
  labs(title = "Number of AI transcript clicks per month", x = "date [months]")

11.6 Output of the LLM: `llm_response` - tokens and token length

11.6.1 German vs. English

Show the code

llm_response_text |>
  count(lang) |>
  mutate(prob = n / sum(n))

11.6.2 Number of tokens

Show the code

llm_response_text |>
  describe_distribution(select = "tokens_n")

11.6.3 Number of pre-existing questions

11.6.3.1 Number of `verify_option_wrong` and `verify_option_correct`

11.6.3.1.1 idvisit

Show the code

verify_option_summary <-
  data_separated_filtered |>
  group_by(idvisit) |>
  filter(value == "verify_option_wrong" | value == "verify_option_correct") |>
  summarise(verify_option = n())

Show the code

verify_option_summary |>
  gghistogram(x = "verify_option")

Show the code

verify_option_summary |>
  describe_distribution(verify_option) |>
  print_md()

Variable	Mean	SD	IQR	Range	Skewness	Kurtosis	n	n_Missing
verify_option	35.24	36.76	30	(4.00, 245.00)	2.64	9.11	207	0

11.6.3.1.2 fingerprint

Show the code

# verify_option_summary_fingerprint <-
#   data_separated_filtered |>
#   group_by(fingerprint) |>
#   filter(value == "verify_option_wrong" | value == "verify_option_correct") |>
#   summarise(verify_option = n())

setDT(data_separated_filtered) # Ensure your data frame is a data.table

verify_option_summary_fingerprint <- data_separated_filtered[
  # 1. Filtering (i)
  value %in% c("verify_option_wrong", "verify_option_correct"),

  # 2. Summarize (.j) - calculate the count (n)
  .(verify_option = .N),

  # 3. Grouping (by)
  by = .(fingerprint)
]

verify_option_summary_fingerprint <- as_tibble(
  verify_option_summary_fingerprint
)

Show the code

verify_option_summary_fingerprint |>
  gghistogram(x = "verify_option")

Show the code

verify_option_summary_fingerprint |>
  describe_distribution(verify_option) |>
  print_md()

Variable	Mean	SD	IQR	Range	Skewness	Kurtosis	n	n_Missing
verify_option	41.68	46.81	35	(4.00, 252.00)	2.35	5.94	175	0

11.6.3.2 Number of `verify_option_wrong` and `verify_option_correct` events divided by 4 (`verify_option_div_by_4`)

Show the code

verify_option_summary <-
  verify_option_summary |>
  mutate(verify_option_div_by_4 = verify_option / 4)

verify_option_summary |>
  gghistogram(x = "verify_option_div_by_4")

Show the code

verify_option_summary |>
  mutate(verify_option_div_by_4 = verify_option / 4) |>
  describe_distribution(verify_option_div_by_4) |>
  print_md()

Variable	Mean	SD	IQR	Range	Skewness	Kurtosis	n	n_Missing
verify_option_div_by_4	8.81	9.19	7.50	(1.00, 61.25)	2.64	9.11	207	0

11.6.3.3 Number of “Multiple choice answer selected”

Show the code

check_if_both_methods_give_same_number <-
  n_mc_answers_selected |>
  full_join(verify_option_summary)

check_if_both_methods_give_same_number |>
  head(20) |>
  gt()

idvisit	n	verify_option	verify_option_div_by_4
1560	6	28	7.00
1569	2	14	3.50
2021	2	21	5.25
2022	10	126	31.50
2394	2	21	5.25
2718	2	7	1.75
2740	2	7	1.75
2883	2	126	31.50
2902	2	126	31.50
2912	2	77	19.25
2932	6	35	8.75
2950	2	56	14.00
2978	14	245	61.25
2979	4	35	8.75
3103	2	14	3.50
3257	2	7	1.75
3691	2	35	8.75
3700	4	84	21.00
3741	2	21	5.25
3804	2	70	17.50

No, the two methods do not yield the same number.
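
The comparison can also be made explicit in code. The sketch below re-enters the first three rows of the table above by hand; reading the first number of each row as `idvisit` is an assumption, and the object `toy_check` is hypothetical.

```r
library(dplyr)

# First three rows of the table above, entered by hand
# (the label idvisit for the first number is an assumption)
toy_check <- tibble(
  idvisit = c(1560, 1569, 2021),
  n = c(6, 2, 2),
  verify_option = c(28, 14, 21)
)

toy_check <-
  toy_check |>
  mutate(
    verify_option_div_by_4 = verify_option / 4,
    # TRUE where both counting methods agree
    same_number = n == verify_option_div_by_4
  )

# Share of visits for which both methods give the same number
share_same <- mean(toy_check$same_number)
share_same
```

Applied to the full joined table, the same `mean()` would quantify how often the two methods agree.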

11.6.3.4 “Multiple choice answer selected” over time

Show the code

mc_answers_with_timestamps <-
  mc_answers_with_timestamps |>
  mutate(month_start = floor_date(timestamp, "month")) |>
  ungroup() |>
  arrange(timestamp) |>
  mutate(n_cumulated = cumsum(n)) |>
  mutate(date = as.Date(timestamp))

lim <- c(
  min(mc_answers_with_timestamps$date),
  max(mc_answers_with_timestamps$date)
)

mc_answers_with_timestamps |>
  ggplot(aes(x = date, y = n_cumulated)) +
  scale_x_date(limits = lim, labels = scales::label_date_short()) +
  geom_point() +
  geom_line()
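
The cumulative count only makes sense if the rows are sorted by time before `cumsum()` is applied. A minimal sketch in base R with hypothetical toy values (the data frame `toy` is not part of the pipeline):

```r
# Toy data (hypothetical), deliberately not in chronological order
toy <- data.frame(
  timestamp = as.POSIXct(
    c("2024-03-02 12:00:00", "2024-01-15 09:00:00", "2024-02-10 18:30:00"),
    tz = "UTC"
  ),
  n = c(4, 2, 6)
)

# Sort by time first, then accumulate the counts
toy <- toy[order(toy$timestamp), ]
toy$n_cumulated <- cumsum(toy$n)

# Calendar date for plotting on a date axis
toy$date <- as.Date(toy$timestamp)

toy
```

Without the sorting step the running total would follow the storage order of the rows rather than the timeline.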

11.6.3.5 Number of `generate_questionaire` events

Show the code

# generate_questionaire_summary <-
#   data_separated_filtered |>
#   group_by(idvisit) |>
#   filter(value == "generate_questionaire") |>
#   summarise(generate_questionaire = n())

setDT(data_separated_filtered) # Convert the data.frame to a data.table in place

generate_questionaire_summary <- data_separated_filtered[
  # 1. Filtering (i)
  value == "generate_questionaire",

  # 2. Summarize (j) - calculate the count (.N) and rename it
  .(generate_questionaire = .N),

  # 3. Grouping (by)
  by = .(idvisit)
]

Show the code

generate_questionaire_summary |>
  describe_distribution(generate_questionaire) |>
  print_md()

Variable	Mean	SD	IQR	Range	Skewness	Kurtosis	n	n_Missing
generate_questionaire	3.11	5.93	2	(1.00, 66.00)	5.89	46.07	367	0

11.6.3.6 Number of previously existing questions

Show the code

setDT(generate_questionaire_summary)
setDT(verify_option_summary)

# 1. Full Join (Merge)
# Use merge() with all = TRUE (i.e. all.x = TRUE and all.y = TRUE) for a full join
# Assumes the join column is 'idvisit'
prior_existing_questions_summary <- merge(
  generate_questionaire_summary,
  verify_option_summary,
  by = "idvisit",
  all = TRUE
)

# 2. Mutate (Calculation)
# Use := in j to add the new column by reference
prior_existing_questions_summary[,
  prior_existing_questions_n := verify_option - generate_questionaire
]

# prior_existing_questions_summary <-
#   generate_questionaire_summary |>
#   full_join(verify_option_summary) |>
#   mutate(prior_existing_questions_n = verify_option - generate_questionaire)

Show the code

prior_existing_questions_summary |>
  # drop_na() |>
  gghistogram(x = "prior_existing_questions_n")

Show the code

prior_existing_questions_summary |>
  describe_distribution(prior_existing_questions_n) |>
  print_md()

Variable	Mean	SD	IQR	Range	Skewness	Kurtosis	n	n_Missing
prior_existing_questions_n	38.87	44.57	39	(-59.00, 236.00)	1.98	5.40	91	392
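
The 392 missing values are consistent with the full join: the two input tables hold 367 and 207 visits, of which 91 occur in both, so the joined table has 367 + 207 - 91 = 483 rows, and the difference is NA for the 392 visits that appear in only one table. A minimal sketch in base R with hypothetical toy values (the data frames `generated` and `verified` are illustrative names, not pipeline objects):

```r
# Toy data (hypothetical values, not taken from the HaNS data)
generated <- data.frame(
  idvisit = c(1, 2, 3),
  generate_questionaire = c(2, 1, 4)
)
verified <- data.frame(
  idvisit = c(2, 3, 4),
  verify_option = c(8, 3, 5)
)

# Full join: keep visits that occur in either table
joined <- merge(generated, verified, by = "idvisit", all = TRUE)

# The difference is NA wherever one of the two counts is missing
joined$prior_existing_questions_n <-
  joined$verify_option - joined$generate_questionaire

joined
n_missing <- sum(is.na(joined$prior_existing_questions_n))
n_missing
```

Visit 3 in the toy data also shows how a negative difference arises: more questionnaires were generated than options verified.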

12 Video time

How much time do users spend watching videos (“Glotzdauer”, i.e. viewing time)?

12.1 Viewing time (Glotzdauer) in general

Caution: Video time is difficult to analyze. Users do not end a video by pressing “Pause” but by performing other actions, which is hard to capture analytically.

See the definition of the target glotzdauer in the pipeline.

In short, the time difference between two consecutive “Play” and “Pause” actions is computed.

However, this approach has limitations: a “Play” is not always followed by a “Pause”, so it is hard to determine when the viewing of a video ends. This analysis should therefore be interpreted with caution.

The definition of the function glotzdauer.R is documented online.
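
The pairing idea can be illustrated with a small sketch in base R. It is not the project's glotzdauer function: the event log `events`, its columns, and the pairing rule (a “Play” counts only if the very next event of the same visit is a “Pause”) are assumptions made for illustration.

```r
# Toy event log (hypothetical), ordered by time within each visit
events <- data.frame(
  idvisit = c(1, 1, 1, 1, 2, 2),
  action = c("Play", "Pause", "Play", "Click", "Play", "Pause"),
  timestamp = as.POSIXct(
    c(
      "2024-01-01 10:00:00", "2024-01-01 10:02:30",
      "2024-01-01 10:05:00", "2024-01-01 10:06:00",
      "2024-01-02 09:00:00", "2024-01-02 09:00:45"
    ),
    tz = "UTC"
  )
)

idx <- seq_len(nrow(events) - 1)

# A "Play" is paired only if the next row is a "Pause" of the same visit
is_pair <- events$action[idx] == "Play" &
  events$action[idx + 1] == "Pause" &
  events$idvisit[idx] == events$idvisit[idx + 1]

play_rows <- idx[is_pair]

# Time difference in seconds between each paired "Play" and "Pause"
watch_time <- data.frame(
  idvisit = events$idvisit[play_rows],
  time_diff_sec = as.numeric(difftime(
    events$timestamp[play_rows + 1],
    events$timestamp[play_rows],
    units = "secs"
  ))
)

watch_time
```

The second “Play” of visit 1 is followed by a click rather than a “Pause” and is therefore dropped, which is exactly the case that makes the viewing time hard to measure.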

Show the code

data_separated_distinct_slice |>
  head(30)

The following plot uses absolute time values, i.e. values without sign.

Show the code

data_separated_distinct_slice |>
  # we assume that a negative glotzdauer means the same as a positive one:
  mutate(time_diff = abs(time_diff)) |>
  # keep only glotzdauer values shorter than 10 minutes:
  filter(time_diff < 60 * 10) |>
  ggplot(aes(x = time_diff)) +
  geom_histogram() +
  scale_x_time() +
  labs(
    x = "Time interval [minutes]",
    caption = "Only time intervals less than 10 minutes. It is assumed that video time is positive only (no negative time intervals)."
  ) +
  theme_minimal()

Show the code

glotzdauer_prepped <-
  data_separated_distinct_slice |>
  # we assume that a negative glotzdauer means the same as a positive one:
  mutate(time_diff_abs_sec = abs(as.numeric(time_diff, units = "secs"))) |>
  # keep only glotzdauer values shorter than 10 minutes (600 seconds):
  filter(time_diff_abs_sec < 60 * 10) |>
  mutate(time_diff_abs_min = time_diff_abs_sec / 60)

glotzdauer_tbl <-
  glotzdauer_prepped |>
  select(time_diff_abs_sec, time_diff_abs_min) |>
  describe_distribution()

glotzdauer_tbl

Show the code

glotzdauer_tbl |>
  mutate(across(where(is.numeric), ~ round(., 2))) |>
  ggpubr::ggtexttable()

12.2 Viewing time (Glotzdauer) over time

Show the code

glotzdauer_prepped_tbl <-
  glotzdauer_prepped |>
  mutate(first_of_month = floor_date(date, unit = "month")) |>
  group_by(first_of_month) |>
  # average the absolute durations (in seconds) so that signed values do not cancel out:
  summarise(time_diff_mean = mean(time_diff_abs_sec, na.rm = TRUE))


glotzdauer_prepped_tbl

Show the code

glotzdauer_prepped_tbl |>
  ggplot(aes(x = first_of_month, y = time_diff_mean)) +
  geom_line() +
  theme_minimal()
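
The monthly aggregation can be sketched in base R as follows. The toy data frame and its values are hypothetical, and `format(date, "%Y-%m-01")` is used here as a stand-in for `floor_date(date, unit = "month")`.

```r
# Toy data (hypothetical): absolute viewing times in seconds
toy <- data.frame(
  date = as.Date(c("2024-01-05", "2024-01-20", "2024-02-03", "2024-02-28")),
  time_diff_abs_sec = c(60, 120, 30, 90)
)

# Map every date to the first day of its month
toy$first_of_month <- as.Date(format(toy$date, "%Y-%m-01"))

# Mean viewing time per month
monthly <- aggregate(time_diff_abs_sec ~ first_of_month, data = toy, FUN = mean)

monthly
```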

13 Conclusion

--- title: "Analyse der HaNS-Matomo-Daten" date: now author: Sebastian Sauer toc: true number-sections: true format: html: theme: lumen embed-resources: true toc: true toc-location: right toc-depth: 3 number-sections: true code-fold: true code-summary: "Show the code" code-tools: true df-print: paged lightbox: true execute: warning: false cache: true params: recompute_gt: false --- # Hintergrund Dieser Arbeitsbericht schildert das technische Vorgehen im Rahmen der Analyse der Matomo-Daten des BMBF-Projekt "HaNS". ## Vorgehen Die Matomo-Klickdaten aller Semester der Projektlaufzeit wurden für diese Analyse verarbeitet. Mit Hilfe einer R-Pipeline wurden eine Reihe von Forschungsfragen analysiert. Der komplette Code ist online dokumentiert unter <https://github.com/sebastiansauer/hans>. Aus Datenschutzgründen sind online keine Daten eingestellt. Die zentrale Analyse-Pipeline-Datei ist <https://github.com/sebastiansauer/hans/blob/main/_targets.R>. ## Forschungsfragen 1. Wie viele Nutzer gibt es und in welchem Zeitraum? 2. In welcher Frequenz wird HaNS aufgesucht? Wie groß sind die zeitlichen Zwischenräume zwischen der Benutzung der Plattform? 3. Wie oft wird HaNS pro Zeitraum (z.B. Monat) besucht? 4. Wie verändert sich die Nutzung im Zeitverlauf? 5. Wie viele Aktionen bringt ein Visit mit sich? Wie ist die statistische Verteilung der Aktionen pro Visit? 6. Wie lang verweilen die Nutzer pro Visit? 7. Wie verändert sich die Nutzungsdauer pro Visit im Zeitverlauf? 8. Welche Aktionen führen die Nutzer auf Hans aus? 9. Wie verändern sich die Verteilungen der Aktionshäufigkeiten im Zeitverlauf? 10. An welchen Tagen und zu welcher Zeit kommen die User zu HaNS? 11. Wie häufig und in welcher Art inteagieren die Nutzer mit dem LLM in HaNS? 12. Wie groß ist der Anteil der Nutzer, die mit dem LLM interagieren? 13. Wie verändert sich der Anteil der Nutzer, die mit dem LLM interagieren, im Zeitverlauf? 14. Wie oft wird auf ein Wort im Transkript des LLM geklickt? 15. 
Wie oft wird ein Transkript-Dienst in HaNS in Anspruch genommen? 16. Wie verändert sich die Nutzung der Transkript-Dienste in HaNS im Zeitverlauf? 17. Wie lange werden Videos angeschaut? 18. Wie verändert sich die Betrachtungsdauer im Zeitverlauf? # Setup ## R-Pakete starten ```{r load-libs} library(targets) library(tidyverse) library(ggokabeito) library(easystats) library(gt) library(ggfittext) library(scales) library(visdat) library(collapse) library(ggpubr) library(knitr) library(tinytable) library(data.table) ``` ```{r} #| cache: false library(ggplot2) theme_set(theme_minimal()) ``` ## Optionen setzen ```{r options} options(lubridate.week.start = 1) # Monday as first day #options(collapse_mask = "all") # use collapse for all dplyr operations options(chromote.headless = "new") # Chrome headleass needed for gtsave ``` ```{r} scale_colour_discrete <- function(...) scale_colour_brewer(palette = "Set2") scale_fill_discrete <- function(...) scale_fill_brewer(palette = "Set2") ```       ## Daten laden ```{r import-tar-objects-data} tar_load(ai_transcript_clicks_per_month) tar_load(config) tar_load(course_and_uni_per_visit) tar_load(data_all_fct) tar_load(data_long) tar_load(data_prepped) tar_load(data_separated_distinct_slice) tar_load(data_separated_filtered) #tar_load(data_users_only) tar_load(idvisit_has_llm) tar_load(llm_response_text) tar_load(n_action) tar_load(n_action_type) tar_load(n_action_w_date) tar_load(time_duration) tar_load(time_since_last_visit) tar_load(time_spent) tar_load(time_spent_w_course_university) tar_load(time_visit_wday) tar_load(n_mc_answers_selected) tar_load(mc_answers_with_timestamps) tar_load(n_action_fingerprint) tar_load(time_visit_wday_fingerprint) tar_load(n_action_w_date_fingerprint) tar_load(time_spent_fingerprint) ``` # Datenaufbereitung und Analysepipeline ## Targets-Pipeline stellt Überblick aller Analyseschritte dar Die Analyse wird im Rahmen einer 
[Targets-Pipeline](https://github.com/sebastiansauer/hans/blob/main/_targets.R) beschrieben und ist offen auf Github einsehbar. ## Langformat Aufgrund des "rechts flatternden" Datenformat (d.h. unterschiedliche Zeilenlängen) wurden die Daten in ein Langformat überführt, zwecks besserer/einfacherer Analyse. Dazu wurden (neben den ID-Variablen, v.a. `idvisit`) die `actionDetails_`-Variablen verwendet. Der Code des Pivotierens in das Langformat ist in der Funktion [longify-data.R](https://github.com/sebastiansauer/hans/blob/main/funs/longify-data.R) einsehbar. Die Daten im Langformat wurden dann noch etwas aufbereitet mt der Funktion [slimify-data.R](https://github.com/sebastiansauer/hans/blob/main/funs/slimify_data.R). ```{r data_separated_filtered_head} data_separated_filtered |> head(30) ``` # Überblick über die Daten ## Roh-Daten laden und inspizieren (data_all_fact) ### Dimension Der Roh-Datensatz verfügt über - `r nrow(data_all_fct)` Zeilen - `r ncol(data_all_fct)` Spalten (Dubletten und Spalten mit Bildern bereits entfernt) Jede Zeile entspricht einem "Visit". ### Erster Blick ```{r data_all_fct_head100} data_all_fct_head100 <- data_all_fct %>% select(1:100) %>% slice_head(n = 100) ``` ```{r vis-dat} data_all_fct_head100 %>% visdat::vis_dat() ``` ### (Fast) leere Spalten ```{r} d_na_cols <- data.frame( id = 1:ncol(data_all_fct), names = names(data_all_fct), na_prop = colMeans(is.na(data_all_fct)) ) ``` #### Leere Spalten ```{r} d_na_cols |> filter(na_prop == 1) ``` #### Fast leere Spalten ```{r} no_na_cols <- d_na_cols |> filter(na_prop > .9) |> nrow() no_na_cols ``` :::{.callout-important} Sehr viele Spalten, `r no_na_cols` sind fast leer. 
::: ### Namen (1-100) ```{r data_all_fct_head100-2} d_100_names <- data.frame( id = 1:100, col_name = data_all_fct_head100 %>% names() ) d_100_names ``` ### Werte der erst 100 Spalten ```{r} data_all_fct_head100 ``` ### Datensatz data_separated_filtered, Zeilen 1-100 ```{r data_separated_filtered} data_separated_filtered %>% slice(1:100) ``` ## Fallzahl im Nur-User-Datensatz Entfernt man *Developer*, *Admins* und *Lecturers* aus dem Roh-Datensatz so bleiben weniger Zeilen übrig: - `r nrow(data_prepped)` Zeilen - `r ncol(data_prepped)` Spalten ## Datensatz mit Anzahl der Aktionen pro User ### idvisit ```{r} n_action |> dim() ``` ```{r load-count-action} n_action |> head(30) ``` ```{r} n_action |> ggplot(aes(x = nr_max)) + geom_histogram() ``` ### fingerprint ```{r} n_action_fingerprint |> head(30) ``` ```{r} n_action_fingerprint |> ggplot(aes(x = nr_max)) + geom_histogram() ``` # Zeitraum ## Beginn/Ende der Daten ```{r time_minmax-load} n_action_w_date |> head(30) ``` ```{r comp-time-min-max} min_max_time <- n_action_w_date |> summarise( time_min = min(date_time_start, na.rm = T), time_max = max(date_time_start, na.rm = T) ) min_max_time |> gt() ``` :::{.callout-important} Erster Visit im Datensatz: `r min_max_time$time_min`. Letzter Visit im Datensatz: `r min_max_time$time_max`. ::: Diese Statistik wurde auf Basis des Datenobjekts `data_separated_filtered` berechnet, vgl. [das Target dieses Objekts in der Pipeline](https://github.com/sebastiansauer/hans/blob/main/_targets.R#L170). 
## Days since last visit ### Insgesamt #### idvisit ```{r} time_visit_wday |> head(30) ``` ```{r days-since-last-visit} time_since_last_visit <- time_since_last_visit |> mutate(dayssincelastvisit = as.numeric(dayssincelastvisit)) |> distinct(idvisit, .keep_all = TRUE) time_since_last_visit |> datawizard::describe_distribution(dayssincelastvisit) |> knitr::kable(digits = 2) time_since_last_visit |> ggplot(aes(x = dayssincelastvisit)) + geom_density() + labs( title = "If visitor return, they return mostly not later than a few days." ) ``` :::{.callout-important} Die Nutzer nutzen die Seite in Abständen von wenigen Tagen? ::: #### fingerprint ```{r} time_visit_wday_fingerprint |> head() ``` ```{r days-since-last-visit_finggerprint} time_since_last_visit_fingerprint <- time_since_last_visit |> mutate(dayssincelastvisit = as.numeric(dayssincelastvisit)) |> distinct(fingerprint, .keep_all = TRUE) time_since_last_visit |> datawizard::describe_distribution(dayssincelastvisit) |> knitr::kable(digits = 2) time_since_last_visit |> ggplot(aes(x = dayssincelastvisit)) + geom_density() + labs( title = "If visitor return, they return mostly not later than a few days." 
) ``` ### Nach Lehrveranstaltungen ```{r time_since_last_visit_per_course} time_since_last_visit_per_course <- time_since_last_visit |> left_join(course_and_uni_per_visit) |> drop_na() ``` ```{r time_since_last_visit_per_course_summary} time_since_last_visit_per_course_summary <- time_since_last_visit_per_course |> group_by(course) |> summarise( dayssincelastvisit_mean = mean(dayssincelastvisit), dayssincelastvisit_sd = sd(dayssincelastvisit), dayssincelastvisit_n = n() ) |> mutate( dayssincelastvisit_n_log = log(dayssincelastvisit_n, base = 10) + 0.001 ) ``` ```{r time_since_last_visit_per_course_summary-print} time_since_last_visit_per_course_summary ``` ```{r time_since_last_visit_per_course_summary-plot} time_since_last_visit_per_course_summary |> ggplot(aes( y = reorder(course, dayssincelastvisit_mean), x = dayssincelastvisit_mean )) + geom_errorbar(aes( xmin = dayssincelastvisit_mean - dayssincelastvisit_sd, xmax = dayssincelastvisit_mean + dayssincelastvisit_sd )) + geom_point(aes(alpha = log(dayssincelastvisit_n)), show.legend = FALSE) + labs( x = "Days since last visit (mean±sd)", y = "course", title = "In some courses, users use HaNS frequently.", caption = "Grey saturation of the mean dots refers to the log10 of the sample size (N)" ) + geom_text( aes(label = round(dayssincelastvisit_n)), x = Inf, hjust = 1.2, size = 2 ) + annotate( x = Inf, y = Inf, label = "N", geom = "label", hjust = 1, vjust = 1 ) + scale_y_discrete(expand = expansion(mult = 0.1)) + theme_minimal() ``` ## Visits im Zeitverlauf Wie viele Visits (von Hans) gab es? 
### Pro Monat #### idivisit ```{r time_visit_wday_summary} time_visit_wday_summary <- time_visit_wday |> ungroup() |> mutate(month_start = floor_date(date_time, "month")) |> mutate( month_name = month(date_time, label = TRUE, abbr = FALSE), month_num = month(date_time, label = FALSE), year_num = year(date_time) ) ``` ```{r time_visit_wday_summary-table} time_visit_wday_summary |> group_by(year_num, month_num) |> summarise(n = n()) |> gt() ``` ```{r time_visit_wday_summary2-plot} time_visit_wday_summary |> group_by(year_num, month_start) |> summarise(n = n()) |> ggplot(aes(x = month_start, y = n)) + geom_line(group = 1, color = "grey60") + geom_point() + labs( title = "The number of visits reflect the teaching periods of the semesters.", x = "month/year" ) ``` #### fingerprint ```{r time_visit_wday_summary_2} time_visit_wday_summary_fingerprint <- time_visit_wday_fingerprint |> ungroup() |> mutate(month_start = floor_date(date_time, "month")) |> mutate( month_name = month(date_time, label = TRUE, abbr = FALSE), month_num = month(date_time, label = FALSE), year_num = year(date_time) ) ``` ```{r time_visit_wday_summary-table_2} time_visit_wday_summary_fingerprint |> group_by(year_num, month_num) |> summarise(n = n()) |> gt() ``` ```{r time_visit_wday_summary2-plot_2} time_visit_wday_summary_fingerprint |> group_by(year_num, month_start) |> summarise(n = n()) |> ggplot(aes(x = month_start, y = n)) + geom_line(group = 1, color = "grey60") + geom_point() + labs( title = "The number of visits reflect the teaching periods of the semesters.", x = "month/year" ) ``` ### Pro Woche ```{r time_visit_wday_summary_week} time_visit_wday_summary_week <- time_visit_wday |> ungroup() |> mutate(week_start = floor_date(date_time, "week")) |> mutate(week_num = week(date_time), year_num = year(date_time)) ``` ```{r time_visit_wday_summary_week_summarized} time_visit_wday_summary_week_summarized <- time_visit_wday_summary_week |> group_by(year_num, week_num) |> summarise(n = n()) 
time_visit_wday_summary_week_summarized ``` ```{r time_visit_wday_summary_week_summarized_dateformat} time_visit_wday_summary_week_summarized_dateformat <- time_visit_wday_summary_week |> group_by(week_start) |> summarise(n = n()) ``` ```{r time_visit_wday_summary_week_summarized_dateformat-plot} time_visit_wday_summary_week_summarized_dateformat |> ggplot(aes(x = week_start, y = n)) + geom_line(group = 1, color = "grey60") + geom_point() + geom_smooth(method = "gam", se = FALSE, color = "blue") + labs( title = "The number of visits is increasing and reflects the teaching periods of the semesters.", x = "week number/year" ) ``` :::{.callout-important} The number of visits has increased over time. ::: ### Akkumulierte Seitenaufrufe im Zeitverlauf #### Monat - idvisit ```{r time_visit_wday_summary-plot} time_visit_wday_summary |> group_by(year_num, month_start) |> summarise(n = n()) |> ungroup() |> mutate(n_cumsum = cumsum(n)) |> ggplot(aes(x = month_start, y = n_cumsum)) + geom_line(group = 1, color = "grey60") + geom_point() + theme_minimal() + geom_smooth(method = "lm") + labs(title = "Visits have increased linearly over time.", x = "month/year") ``` #### Monat - fingerprint ```{r time_visit_wday_summary-plot_2} time_visit_wday_summary_fingerprint |> group_by(year_num, month_start) |> summarise(n = n()) |> ungroup() |> mutate(n_cumsum = cumsum(n)) |> ggplot(aes(x = month_start, y = n_cumsum)) + geom_line(group = 1, color = "grey60") + geom_point() + theme_minimal() + geom_smooth(method = "lm") + labs(title = "Visits have increased linearly over time.", x = "month/year") ``` #### Woche ```{r} time_visit_wday_summary_week |> group_by(year_num, week_start) |> summarise(n = n()) |> ungroup() |> mutate(n_cumsum = cumsum(n)) |> ggplot(aes(x = week_start, y = n_cumsum)) + geom_line(group = 1, color = "grey60") + geom_point() + theme_minimal() + geom_smooth(method = "lm") + labs( title = "Visits have increased approx. 
linearly over time.", x = "week/year" ) ``` ## Statistiken Die folgenden Statistiken beruhen auf dem Datensatz `data_separated_filtered`: ### idivisit ```{r} glimpse(data_separated_filtered) ``` `nr` fasst die Nummer der Aktion innerhalb eines bestimmten Visits. ### fingerprint ```{r} data_separated_filtered |> distinct(fingerprint, .keep_all = TRUE) |> glimpse() ``` ## Mit allen Daten (den 499er-Daten) ### idvisit ```{r tbl_n_action} tbl_n_action <- n_action |> describe_distribution(nr_max, centrality = c("median", "mean")) tbl_n_action ``` ![](tbl_count_action.png) `nr_max` gibt den Maximalwert von `nr` zurück, sagt also, wie viele Aktionen maximal während eines Vitis ausgeführt wurden. Betrachtet man die Anzahl der Aktionen pro Visit näher, so fällt auf, dass der Maximalwert (499) sehr häufig vorkommt: ```{r count-action-plot} n_action |> count(nr_max) |> ggplot(aes(x = nr_max, y = n)) + geom_col() + geom_vline( xintercept = tbl_n_action$Median, color = "blue", linetype = "dashed" ) + labs( caption = "Vertical dashed lines shows the median.", title = "Most users to only a few actions, but some do many." ) ``` :::{.callout-important} Die meisten Nutzer machen nur wenige Aktionen pro Visit, aber einige machen sehr viele. ::: Hier noch in einer anderen Darstellung: ```{r count-action-plot2} n_action |> count(nr_max) |> ggplot(aes(x = nr_max, y = n)) + geom_point() ``` Der Maximalwert ist einfach auffällig häufig: ```{r n-action-table} n_action |> count(nr_max == 499) |> gt() ``` Es erscheint plausibel, dass der Maximalwert alle "gekappten" (*zensierten*, abgeschnittenen) Werte fasst, also viele Werte, die eigentlich größer wären (aber dann zensiert wurden). 
### fingerprint ```{r tbl_n_action_2} tbl_n_action_fingerprint <- n_action_fingerprint |> describe_distribution(nr_max, centrality = c("median", "mean")) tbl_n_action_fingerprint ``` ```{r count-action-plot-fingerprint} n_action_fingerprint |> count(nr_max) |> ggplot(aes(x = nr_max, y = n)) + geom_col() + geom_vline( xintercept = tbl_n_action_fingerprint$Median, color = "blue", linetype = "dashed" ) + labs( caption = "Vertical dashed lines shows the median.", title = "Most users to only a few actions, but some do many." ) ``` ## Nur Visitors, für die weniger als 500 Aktionen protokolliert sind ### idvisit ```{r count-action-tbl2} n_action_lt_500 <- n_action |> filter(nr_max != 499) n_action_lt_500 |> describe_distribution(nr_max) |> gt() |> fmt_number(columns = where(is.numeric), decimals = 2) ``` ### fingerprint ```{r count-action-tbl2-fingerpint} n_action_lt_500_fingerprint <- n_action_fingerprint |> filter(nr_max != 499) n_action_lt_500_fingerprint |> describe_distribution(nr_max) |> gt() |> fmt_number(columns = where(is.numeric), decimals = 2) ``` # Lehrveranstaltungen ## Anzahl an Lehrveranstaltungen nach Hochschule ### fingerprint ```{r course_and_uni_per_visit-count} course_and_uni_per_visit |> count(university) ``` ```{r course_and_uni_per_visit_plot} course_and_uni_per_visit |> count(university) |> drop_na() |> ggplot(aes(y = reorder(university, n), x = n)) + geom_col() + theme_minimal() + labs( title = "TH Nürnberg hosts the most courses on HaNS by far.", y = "University" ) ``` ### fingerprint ```{r course_and_uni_per_visit_plot_fingerprint} course_and_uni_per_visit |> distinct(fingerprint, .keep_all = TRUE) |> count(university) |> ggplot(aes(y = reorder(university, n), x = n)) + geom_col() + theme_minimal() + labs( title = "TH Nürnberg hosts the most courses on HaNS by far.", y = "University" ) ``` ## Visits nach Lehrveranstaltung und Jahr ### idvisit ```{r time_spent_w_course_university_count} time_spent_w_course_university |> count(year, course) ``` 
```{r ime_spent_w_course_university_count_plot} time_spent_w_course_university |> count(year, course) |> drop_na() |> ggplot(aes(x = n, y = course, fill = factor(year), )) + geom_col(position = "dodge") + labs(title = "The course 'GeSOA' is the most active course on HaNS.") ``` ### fingerprint ```{r time_spent_w_course_university_count_plot_fingerprint} time_spent_w_course_university |> distinct(fingerprint, .keep_all = TRUE) |> count(year, course) |> drop_na() |> ggplot(aes(x = n, y = course, fill = factor(year), )) + geom_col(position = "dodge") + labs(title = "The course 'GeSOA' is the most active course on HaNS.") ``` # Aktionen pro Visit/Fingerprint ## Statistiken pro Visit ```{r data_prepped} n_actions_searches_interactions <- data_prepped |> select( idvisit, fingerprint, any_of(c( "searches", "actions", "interactions", "referrertype", "referrername", "language", "devicetype", "devicemodel", "operatingsystem", "browsername" )) ) ``` ### Unique IDs, Fingerprints, Mean searches, Mean actions Auswertung - der Anzahlen der uniquen visitids und uniquen Fingerprints - Mittelwerte der Anzahl der Suchen und Aktionen pro Besuch #### idivisit und fingerprint ```{r n_actions_searches_interactions-summary} n_actions_searches_interactions |> as.data.frame() |> summarise( idvisit_n = length(unique(idvisit)), fingerprint_n = length(unique(fingerprint)), actions_mean = mean(as.integer(actions), na.rm = TRUE), searches_mean = mean(as.integer(searches), na.rm = TRUE) ) ``` :::{.callout-note} Es gibt etwa doppelt so viele Besucher wie unique Nutzer. 
::: ### Referrer Type pro Visit ```{r} #| error: true n_actions_searches_interactions |> count(referrertype, sort = TRUE) ``` ### Referrer Type Name pro Visit ```{r} #| error: true n_actions_searches_interactions |> count(referrername, sort = TRUE) ``` ### devicemodel ```{r} #| error: true n_actions_searches_interactions |> count(devicemodel, sort = TRUE) |> slice_head(n = 10) ``` ### operatingsystem ```{r} #| error: true n_actions_searches_interactions |> count(operatingsystem, sort = TRUE) |> slice_head(n = 10) ``` ### browsername ```{r} #| error: true n_actions_searches_interactions |> count(browsername, sort = TRUE) |> slice_head(n = 10) ``` Die Mac-User scheinen besonders aktiv zu sein auf HaNS. ## Aktionen pro idvisit/fingerprint - Mit den 499er-Daten ### idvisit ```{r plot-count-action} #| error: true n_action_avg = mean(n_action$nr_max) |> round(0) n_action_median = median(n_action$nr_max) |> round(0) n_action_sd = sd(n_action$nr_max) |> round(0) n_action_iqr = IQR(n_action$nr_max) |> round(0) n_action |> ggplot() + geom_histogram(aes(x = nr_max)) + labs( x = "Anzahl von Aktionen pro Visit", y = "n", caption = "Der vertikale Strich zeigt den Mittelwert; der horizontale MW±SD" ) + theme_minimal() + geom_vline(xintercept = n_action_avg, color = palette_okabe_ito()[1]) + geom_segment( x = n_action_avg - n_action_sd, y = 0, xend = n_action_avg + n_action_sd, yend = 0, color = palette_okabe_ito()[2], size = 2 ) + annotate( "label", x = n_action_avg, y = 1500, label = paste0("MW = ", n_action_avg) ) + annotate( "label", x = n_action_avg + n_action_sd, y = 0, label = paste0("SD = ", n_action_sd) ) #geom_label(aes(x = n_action_avg), y = 1, label = "Mean") n_action |> ggplot() + geom_histogram(aes(x = nr_max)) + labs( x = "Anzahl von Aktionen pro Visit", y = "n", caption = "Der vertikale Strich zeigt den Median; der horizontale Median±IQR" ) + theme_minimal() + geom_vline(xintercept = n_action_median, color = palette_okabe_ito()[1]) + geom_segment( x = 
n_action_median - n_action_iqr, y = 0, xend = n_action_median + n_action_iqr, yend = 0, color = palette_okabe_ito()[2], size = 2 ) + annotate( "label", x = n_action_median, y = 1500, label = paste0("Md = ", n_action_median) ) + annotate( "label", x = n_action_median + n_action_iqr, y = 0, label = paste0("IQR = ", n_action_iqr) ) #geom_label(aes(x = n_action_avg), y = 1, label = "Mean") ``` - Mittelwert der Aktionen pro Visit: `r round(n_action_avg, 2)`. - SD der Aktionen pro Visit: `r round(n_action_sd, 2)`. - MD: `r round(n_action_median, 2)`. - IQR: : `r round(n_action_iqr, 2)`. ### fingerprint ```{r plot-count-action-fingerprint} #| error: true n_action_fingerprint_avg = mean(n_action_fingerprint$nr_max) |> round(0) n_action_fingerprint_median = median(n_action_fingerprint$nr_max) |> round(0) n_action_fingerprint_sd = sd(n_action_fingerprint$nr_max) |> round(0) n_action_fingerprint_iqr = IQR(n_action_fingerprint$nr_max) |> round(0) n_action_fingerprint |> ggplot() + geom_histogram(aes(x = nr_max)) + labs( x = "Anzahl von Aktionen pro Visit", y = "n", caption = "Der vertikale Strich zeigt den Mittelwert; der horizontale MW±SD" ) + theme_minimal() + geom_vline( xintercept = n_action_fingerprint_avg, color = palette_okabe_ito()[1] ) + geom_segment( x = n_action_fingerprint_avg - n_action_fingerprint_sd, y = 0, xend = n_action_fingerprint_avg + n_action_fingerprint_sd, yend = 0, color = palette_okabe_ito()[2], size = 2 ) + annotate( "label", x = n_action_fingerprint_avg, y = 1500, label = paste0("MW = ", n_action_fingerprint_avg) ) + annotate( "label", x = n_action_fingerprint_avg + n_action_fingerprint_sd, y = 0, label = paste0("SD = ", n_action_fingerprint_sd) ) #geom_label(aes(x = n_action_fingerprint_avg), y = 1, label = "Mean") n_action_fingerprint |> ggplot() + geom_histogram(aes(x = nr_max)) + labs( x = "Anzahl von Aktionen pro Visit", y = "n", caption = "Der vertikale Strich zeigt den Median; der horizontale Median±IQR" ) + theme_minimal() + geom_vline( 
xintercept = n_action_fingerprint_median, color = palette_okabe_ito()[1] ) + geom_segment( x = n_action_fingerprint_median - n_action_fingerprint_iqr, y = 0, xend = n_action_fingerprint_median + n_action_fingerprint_iqr, yend = 0, color = palette_okabe_ito()[2], size = 2 ) + annotate( "label", x = n_action_fingerprint_median, y = 1500, label = paste0("Md = ", n_action_fingerprint_median) ) + annotate( "label", x = n_action_fingerprint_median + n_action_fingerprint_iqr, y = 0, label = paste0("IQR = ", n_action_fingerprint_iqr) ) #geom_label(aes(x = n_action_fingerprint_avg), y = 1, label = "Mean") ``` ## Ohne 499er-Daten ### idvisit ```{r plot-count-action-2} n_action_avg2 = mean(n_action_lt_500$nr_max) |> round(0) n_action_sd2 = sd(n_action_lt_500$nr_max) |> round(2) n_action_lt_500 |> ggplot() + geom_histogram(aes(x = nr_max)) + labs( x = "Anzahl von Aktionen pro Visit", y = "n", title = "Verteilung der User-Aktionen pro Visit", caption = "Der vertikale Strich zeigt den Mittelwert; der horizontale die SD" ) + theme_minimal() + geom_vline(xintercept = n_action_avg2, color = palette_okabe_ito()[1]) + geom_segment( x = n_action_avg - n_action_sd2, y = 0, xend = n_action_avg2 + n_action_sd2, yend = 0, color = palette_okabe_ito()[2], size = 2 ) + annotate( "label", x = n_action_avg2, y = 1500, label = paste0("MW = ", n_action_avg2) ) + annotate( "label", x = n_action_avg2 + n_action_sd2, y = 0, label = paste0("SD = ", n_action_sd2) ) #geom_label(aes(x = n_action_avg), y = 1, label = "Mean") ``` - Mittelwert der Aktionen pro Visit: `r round(n_action_avg2, 2)`. - SD der Aktionen pro Visit: `r round(n_action_sd2, 2)`. 
### fingerprint ```{r plot-count-action-2_2} n_action_fingerprint_avg2 = mean(n_action_lt_500_fingerprint$nr_max) |> round(0) n_action_fingerprint_sd2 = sd(n_action_lt_500_fingerprint$nr_max) |> round(2) n_action_lt_500_fingerprint |> ggplot() + geom_histogram(aes(x = nr_max)) + labs( x = "Anzahl von Aktionen pro Visit", y = "n", title = "Verteilung der User-Aktionen pro Visit", caption = "Der vertikale Strich zeigt den Mittelwert; der horizontale die SD" ) + theme_minimal() + geom_vline( xintercept = n_action_fingerprint_avg2, color = palette_okabe_ito()[1] ) + geom_segment( x = n_action_fingerprint_avg - n_action_fingerprint_sd2, y = 0, xend = n_action_fingerprint_avg2 + n_action_fingerprint_sd2, yend = 0, color = palette_okabe_ito()[2], size = 2 ) + annotate( "label", x = n_action_fingerprint_avg2, y = 1500, label = paste0("MW = ", n_action_fingerprint_avg2) ) + annotate( "label", x = n_action_fingerprint_avg2 + n_action_fingerprint_sd2, y = 0, label = paste0("SD = ", n_action_fingerprint_sd2) ) #geom_label(aes(x = n_action_avg), y = 1, label = "Mean") ``` ## Anzahl Aktionen im Zeitverlauf ### Monat #### idvisit ```{r} n_action_w_date |> ggplot(aes(x = month_date, y = nr_max)) + stat_summary(fun = mean, geom = "point", size = 2) + stat_summary( fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.2 ) + geom_smooth(method = "lm") + labs(title = "The number of actions per visit has incresed over time") ``` ```{r n_action_w_date-plot} n_action_w_date |> ggplot(aes(x = month_date, y = nr_max)) + geom_jitter(alpha = .1) ``` #### fingerprint ```{r} n_action_w_date_fingerprint |> ggplot(aes(x = month_date, y = nr_max)) + stat_summary(fun = mean, geom = "point", size = 2) + stat_summary( fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.2 ) + geom_smooth(method = "lm") + labs(title = "The number of actions per visit has incresed over time") ``` ```{r n_action_w_date-plot_2} n_action_w_date_fingerprint |> ggplot(aes(x = 
month_date, y = nr_max)) + geom_jitter(alpha = .1) ``` ### Regression (Monat) #### idvisit ```{r} lm(nr_max ~ month_date, data = n_action_w_date) ``` #### fingerprint ```{r} lm(nr_max ~ month_date, data = n_action_w_date_fingerprint) ``` ### Woche #### idvisit ```{r} n_action_w_date |> mutate(week_date = as.Date(week_date)) |> ggplot(aes(x = week_date, y = nr_max)) + stat_summary(fun = mean, geom = "point", size = 2) + stat_summary(fun.data = mean_sdl, geom = "errorbar", width = 0.2) + geom_smooth(method = "lm") + labs(title = "The number of actions per visit has incresed over time") ``` #### fingerprint ```{r} n_action_w_date_fingerprint |> mutate(week_date = as.Date(week_date)) |> ggplot(aes(x = week_date, y = nr_max)) + stat_summary(fun = mean, geom = "point", size = 2) + stat_summary(fun.data = mean_sdl, geom = "errorbar", width = 0.2) + geom_smooth(method = "lm") + labs(title = "The number of actions per fingerprint has incresed over time") ``` ### Regression (Woche) #### idvisit ```{r} lm(nr_max ~ week_date, data = n_action_w_date) ``` #### fingerprint ```{r} lm(nr_max ~ week_date, data = n_action_w_date_fingerprint) ``` ## Gruppierung der Visits/fingerprints nach Anzahl der Aktionen ### idvisit ```{r} n_action_lt_500 <- n_action_lt_500 |> mutate( n_actions_type = case_when( nr_max < 30 ~ "glimpser", nr_max < 300 ~ "serious user", TRUE ~ "heavy user" ) ) ``` ```{r} n_action_lt_500 |> count(n_actions_type) |> gt() ``` ```{r} ggplot(n_action_lt_500) + aes(x = n_actions_type) + geom_bar() ``` #### fingerprint ```{r} n_action_lt_500_fingerprint <- n_action_lt_500_fingerprint |> mutate( n_actions_type = case_when( nr_max < 30 ~ "glimpser", nr_max < 300 ~ "serious user", TRUE ~ "heavy user" ) ) ``` ```{r} n_action_lt_500_fingerprint |> count(n_actions_type) |> gt() ``` ```{r} ggplot(n_action_lt_500_fingerprint) + aes(x = n_actions_type) + geom_bar() ``` ## Gruppierung der Visits im Zeitverlauf ### idvisit ```{r} n_action_w_date |> group_by(month_date) |> 
  count(nr_max) |>
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  ) |>
  # wt = n sums the visits per type; without it, only distinct nr_max values are counted
  count(n_actions_type, wt = n) |>
  ggplot(aes(
    x = month_date,
    y = n,
    color = n_actions_type,
    group = n_actions_type
  )) +
  geom_point() +
  geom_line()
```

### fingerprint

```{r}
n_action_w_date_fingerprint |>
  group_by(month_date) |>
  count(nr_max) |>
  mutate(
    n_actions_type = case_when(
      nr_max < 30 ~ "glimpser",
      nr_max < 300 ~ "serious user",
      TRUE ~ "heavy user"
    )
  ) |>
  # wt = n sums the fingerprints per type; without it, only distinct nr_max values are counted
  count(n_actions_type, wt = n) |>
  ggplot(aes(
    x = month_date,
    y = n,
    color = n_actions_type,
    group = n_actions_type
  )) +
  geom_point() +
  geom_line()
```

# Dwell time per visit

## Calculation basis of the dwell time

The dwell time was computed as the difference between the smallest and the largest date-time value (POSIXct) of a visit (i.e. per value of the variable `idvisit`); cf. the [function `diff_time`](https://github.com/sebastiansauer/hans/blob/main/funs/diff_time.R). This variable is called `time_diff` in the object `time_spent`. The computation is based on the object `data_separated_filtered`; cf. [the definition of the target "time_spent" in the targets pipeline](https://github.com/sebastiansauer/hans/blob/main/_targets.R#L205).

## Preprocessing

The visit time was capped at 600 minutes: visits lasting 600 minutes or longer were excluded.

### idvisit

```{r}
time_spent |>
  head(30)
```

```{r load-time-spent}
time_spent <-
  time_spent |>
  # compute time (t) in minutes (min):
  mutate(t_minutes = as.numeric(time_diff, units = "mins")) |>
  filter(t_minutes < 600)
```

### fingerprint

```{r}
time_spent_fingerprint |>
  head(30)
```

```{r load-time-spent_2}
time_spent_fingerprint <-
  time_spent_fingerprint |>
  # compute time (t) in minutes (min):
  mutate(t_minutes = as.numeric(time_diff, units = "mins")) |>
  filter(t_minutes < 600)
```

## Dwell time statistics in seconds

The dwell time statistics below are based on the calculation described above (in seconds).
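The calculation basis described above (largest minus smallest timestamp per visit, followed by the 600-minute cap) can be illustrated on toy data. This is only a minimal sketch in base R so that it runs without the pipeline's packages; the data frame `clicks` and its values are made up, and the real computation lives in the `diff_time` function linked above.

```r
# Toy click log, one row per action (made-up values, not HaNS data)
clicks <- data.frame(
  idvisit = c(1, 1, 1, 2, 2, 3, 4, 4),
  timestamp = as.POSIXct(
    c(
      "2024-01-10 09:00:00", "2024-01-10 09:05:30", "2024-01-10 09:20:00",
      "2024-01-11 14:00:00", "2024-01-11 14:00:45",
      "2024-01-12 08:00:00",
      "2024-01-13 00:00:00", "2024-01-13 11:00:00"
    ),
    tz = "UTC"
  )
)

# Dwell time per visit: latest minus earliest timestamp, in seconds
secs_per_visit <- vapply(
  split(clicks$timestamp, clicks$idvisit),
  function(x) as.numeric(difftime(max(x), min(x), units = "secs")),
  numeric(1)
)

dwell <- data.frame(
  idvisit = as.integer(names(secs_per_visit)),
  time_diff_sec = unname(secs_per_visit)
)
dwell$t_minutes <- dwell$time_diff_sec / 60

# Exclude visits of 600 minutes or more (visit 4 lasts 660 minutes)
dwell_capped <- dwell[dwell$t_minutes < 600, ]
print(dwell_capped)
```

Note that a visit with a single action (visit 3 in the toy data) gets a dwell time of zero under this definition, which pulls the mean downwards.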
### idvisit

```{r comp-diff-time-stats}
time_spent |>
  summarise(
    mean_time_diff = round(mean(time_diff), 2),
    sd_time_diff = sd(time_diff),
    min_time_diff = min(time_diff), # shortest duration
    max_time_diff = max(time_diff) # longest
  )
```

### fingerprint

```{r comp-diff-time-stats_2}
time_spent_fingerprint |>
  summarise(
    mean_time_diff = round(mean(time_diff), 2),
    sd_time_diff = sd(time_diff),
    min_time_diff = min(time_diff), # shortest duration
    max_time_diff = max(time_diff) # longest
  )
```

## Dwell time based on the variable `visitduration`

### For all data

As an alternative to computing the dwell time, the variable `visitduration` is available, which (apparently) measures, or is meant to measure, the duration of a visit. However, using this variable yields substantially different dwell times; cf. the [target `time_duration` in the targets pipeline](https://github.com/sebastiansauer/hans/blob/main/_targets.R#L211).

```{r}
time_duration |>
  head(30)
```

```{r time-duration}
time_duration |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)
```

### For unique idvisits

```{r time-duration_2}
time_duration |>
  distinct(idvisit, .keep_all = TRUE) |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)
```

### For unique fingerprints

```{r time-duration-fingerprint}
time_duration |>
  distinct(fingerprint, .keep_all = TRUE) |>
  summarise(duration_sec_avg = mean(visitduration_sec, na.rm = TRUE)) |>
  mutate(duration_min_avg = duration_sec_avg / 60)
```

## Dwell time statistics in minutes

```{r time-spent-tbl}
time_spent |>
  mutate(time_diff_minutes = time_length(time_diff, unit = "minute")) |>
  summarise(
    mean_time_diff = round(mean(time_diff_minutes), 2),
    sd_time_diff = sd(time_diff_minutes),
    min_time_diff = min(time_diff_minutes), # shortest duration
    max_time_diff = max(time_diff_minutes) # longest
  )
```
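The `visitduration` averages above are computed once over all rows and once over unique visits. A small sketch may show why the two can differ, under the assumption (not confirmed here) that `time_duration` repeats the visit-level value on every action row of a visit. The data frame `d` is hypothetical.

```r
# Hypothetical long-format data: visitduration_sec is a visit-level value
# that is repeated on every action row of the same visit
d <- data.frame(
  idvisit = c(1, 1, 1, 1, 2, 3),
  visitduration_sec = c(600, 600, 600, 600, 60, 120)
)

# Mean over all rows: visits with many actions are weighted more heavily
mean_all_rows <- mean(d$visitduration_sec)

# Mean over unique visits: every visit counts exactly once
d_unique <- d[!duplicated(d$idvisit), ]
mean_unique <- mean(d_unique$visitduration_sec)

print(c(all_rows = mean_all_rows, unique_visits = mean_unique))
```

If long visits also generate more rows, the row-wise mean is biased upwards, so the per-visit mean is usually the more interpretable figure.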
```{r}
small_padding_theme <- ggpubr::ttheme(
  tbody.style = tbody_style(size = 8), # Smaller font size can help
  colnames.style = colnames_style(size = 9, face = "bold"),
  padding = unit(c(2, 2), "mm") # Reduce horizontal and vertical padding
)
```

```{r}
#| eval: false
ggpubr::ggtexttable(
  time_spent_summary,
  rows = NULL,
  theme = small_padding_theme
)
```

## Visualization of the dwell time

### Bin width = 10 minutes

```{r plot-time-spent1}
time_spent |>
  mutate(time_diff_minutes = time_diff / 60) |>
  ggplot(aes(x = time_diff_minutes)) + # minutes
  geom_histogram(binwidth = 10) +
  #scale_x_time() +
  theme_minimal() +
  labs(y = "n", x = "Dwell time in HaNS per visit in d:h:m") +
  scale_x_time(breaks = pretty_breaks())
```

### Bin width = 20 minutes

```{r plot-time-spent2}
time_spent |>
  mutate(time_diff_minutes = time_diff / 60) |>
  ggplot(aes(x = time_diff_minutes)) + # minutes
  geom_histogram(binwidth = 20) +
  theme_minimal() +
  labs(
    y = "n",
    x = "Dwell time",
    title = "Dwell time in HaNS per visit in d:h:m"
  ) +
  scale_x_time(breaks = pretty_breaks())
```

### Duration restricted to 1-120 minutes

```{r plot-time-spent3}
# time_diff is in seconds, so the 1-120 minute window is applied to
# t_minutes (computed in the chunk load-time-spent)
time_spent2 <-
  time_spent |>
  filter(t_minutes > 1, t_minutes < 120)

time_spent2 |>
  ggplot(aes(x = t_minutes)) +
  geom_histogram(binwidth = 10) +
  theme_minimal() +
  labs(
    y = "n",
    x = "Dwell time in HaNS per visit in minutes",
    title = "Dwell time restricted to 1-120 minutes",
    caption = "binwidth = 10 min"
  )
```

### Change of the dwell time over time

#### Month

The unit of `time_diff` in `time_spent` is seconds.
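Because the values are stored in seconds, monthly means have to be divided by 60 to be read in minutes. The following base R sketch shows the idea on made-up data; the column names `time_min` and `time_diff_sec` are chosen for illustration, and the document's own code uses `floor_date()` from lubridate for the monthly grouping.

```r
# Made-up visits: start time of the visit and dwell time in seconds
visits <- data.frame(
  time_min = as.POSIXct(
    c("2024-01-03 10:00:00", "2024-01-20 16:30:00", "2024-02-05 09:15:00"),
    tz = "UTC"
  ),
  time_diff_sec = c(300, 900, 120)
)

# Group by calendar month ("YYYY-MM") and average the dwell time in seconds
month_key <- format(visits$time_min, "%Y-%m")
mean_sec <- tapply(visits$time_diff_sec, month_key, mean)

# Convert the monthly means from seconds to minutes
mean_min <- mean_sec / 60
print(mean_min)
```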
```{r} time_spent_by_month <- time_spent |> mutate(month_start = floor_date(time_min, "month")) |> mutate( month_name = month(month_start, label = TRUE, abbr = FALSE), month_num = month(month_start, label = FALSE), year = year(month_start) ) |> group_by(month_num, year) |> summarise( time_spent_month_avg = mean(time_diff, na.rm = TRUE), time_spent_month_sd = sd(time_diff, na.rm = TRUE) ) |> arrange(year, month_num) time_spent_by_month ``` ```{r time_spent_by_month} time_spent_by_month |> mutate( time_spent_month_avg = round(time_spent_month_avg, 2), time_spent_month_sd = round(time_spent_month_sd, 2) ) |> ggtexttable() ``` ```{r time_spent_by_month_name} time_spent_by_month_name <- time_spent |> mutate(month_start = floor_date(time_min, "month")) |> mutate( month_name = month(month_start, label = TRUE, abbr = FALSE), month_num = month(month_start, label = FALSE), year = year(month_start) ) |> group_by(month_start, year) |> summarise( time_spent_month_avg = mean(time_diff, na.rm = TRUE), time_spent_month_sd = sd(time_diff, na.rm = TRUE) ) time_spent_by_month_name |> ggplot(aes(x = month_start, y = time_spent_month_avg)) + geom_line(group = 1, color = "grey60") + geom_point() ``` #### Jahr ```{r} time_spent_by_year <- time_spent |> mutate(month_start = floor_date(time_min, "month")) |> mutate( month_name = month(month_start, label = TRUE, abbr = FALSE), month_num = month(month_start, label = FALSE), year = year(month_start) ) |> group_by(year) |> summarise( time_spent_avg = mean(time_diff, na.rm = TRUE), time_spent_sd = sd(time_diff, na.rm = TRUE) ) time_spent_by_year ``` ```{r} time_spent_by_year |> ggplot(aes(x = year, y = time_spent_avg)) + geom_line(group = 1, color = "grey60") + geom_point() ``` #### Woche ```{r time_spent_by_week_name} time_spent_by_week_name <- time_spent |> mutate(week_start = floor_date(time_min, "week")) |> mutate(week_num = week(week_start), year = year(week_start)) |> group_by(week_start, year) |> summarise( time_spent_week_avg = 
      mean(time_diff, na.rm = TRUE),
    time_spent_week_sd = sd(time_diff, na.rm = TRUE)
  )

time_spent_by_week_name |>
  ggplot(aes(x = week_start, y = time_spent_week_avg)) +
  geom_line(group = 1, color = "grey60") +
  geom_point()
```

## Relationship between courses and dwell time

```{r}
time_spent_w_course_university_summary <-
  time_spent_w_course_university |>
  group_by(floor_date_month) |>
  summarise(
    distinct_courses_n = n_distinct(course),
    diff_time_mean = mean(time_diff, na.rm = TRUE),
    n = n()
  )

time_spent_w_course_university_summary
```

```{r}
time_spent_w_course_university_summary |>
  ggplot(aes(x = distinct_courses_n, y = diff_time_mean)) +
  geom_point()
```

## Relationship between courses and number of visits

```{r}
time_spent_w_course_university_summary |>
  ggplot(aes(x = distinct_courses_n, y = n)) +
  geom_point() +
  labs(y = "No. of visits per month", x = "No. of distinct courses per month")
```

# What do the users do?

What do the visitors actually do? And how often?

## Frequencies

The object `n_action_type` is based on the column `subtitle` of the long-format data; see the [definition of the function `count_user_action_type`](https://github.com/sebastiansauer/hans/blob/main/funs/count_user_action.R).

```{r}
n_action_type |>
  head(30)
```

Note: It may be more sensible to use the analysis based on `eventcategory` instead of this one. That analysis covers all kinds of events, whereas the present one covers only selected events.
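Counting actions by category and turning the counts into proportions, as done in the following subsections, can be sketched in base R as follows. The category labels are taken from the Top 3 filter used further below; the vector `actions` itself is made up.

```r
# Made-up action categories, one entry per logged action (NA = no category)
actions <- c("video", "video", "visit_page", "click_slideChange", "video", NA)

# table() drops NA by default, which mirrors drop_na() followed by count()
counts <- sort(table(actions), decreasing = TRUE)

# Proportion of each category among all categorised actions
props <- round(prop.table(counts), 2)

print(counts)
print(props)
```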
### Nach bestimmten Kategorien ```{r category-tab} n_action_type_counted <- n_action_type |> drop_na() |> count(category, sort = TRUE) |> mutate(prop = round(n / sum(n), 2)) n_action_type_counted |> gt() ``` ### Nach Kategorien im Zeitverlauf ```{r} n_action_type_per_month <- n_action_type |> select(nr, idvisit, category) |> ungroup() |> left_join(time_visit_wday |> ungroup()) |> select(-c(dow, hour, nr)) |> drop_na() |> mutate(month_start = floor_date(date_time, "month")) |> count(month_start, category) ``` ```{r} n_action_type_per_month ``` ### Nur die Top3-Kategorien #### idvisit ```{r} time_visit_wday |> head(30) ``` ```{r n_action_type_per_month_top3} n_action_type_per_month_top3 <- n_action_type |> select(nr, idvisit, category) |> ungroup() |> filter(category %in% c("video", "click_slideChange", "visit_page")) |> left_join(time_visit_wday |> ungroup()) |> select(-c(dow, hour, nr)) |> drop_na() |> mutate(month_start = floor_date(date_time, "month")) |> count(month_start, category) ``` ```{r n_action_type_per_month_top3-gt} n_action_type_per_month_top3 ``` ```{r n_action_type_per_month_top3-ggplot} n_action_type_per_month_top3 |> ggplot(aes(x = month_start, y = n, color = category, group = category)) + geom_line() ``` #### fingerprint ```{r} time_visit_wday_fingerprint |> head(30) ``` ```{r n_action_type_per_month_top3_2} n_action_type_per_month_top3_fingerprint <- n_action_type |> select(nr, fingerprint, category) |> ungroup() |> filter(category %in% c("video", "click_slideChange", "visit_page")) |> left_join(time_visit_wday_fingerprint |> ungroup()) |> select(-c(dow, hour, nr)) |> drop_na() |> mutate(month_start = floor_date(date_time, "month")) |> count(month_start, category) ``` ```{r n_action_type_per_month_top3-gt_2} n_action_type_per_month_top3_fingerprint ``` ```{r n_action_type_per_month_top3-ggplot_2} n_action_type_per_month_top3_fingerprint |> ggplot(aes(x = month_start, y = n, color = category, group = category)) + geom_line() ``` ### Top3 - Pro 
Kurs ```{r n_action_type_course_uni} n_action_type_course_uni <- n_action_type |> left_join(course_and_uni_per_visit |> mutate(idvisit = as.integer(idvisit))) ``` ```{r n_action_type_per_month_top3_per_course} n_action_type_per_month_top3_per_course <- n_action_type_course_uni |> filter(category %in% c("video", "click_slideChange", "visit_page")) |> drop_na() |> mutate(month_start = floor_date(actiondetails_0_timestamp, "month")) |> count(course, month_start, category) ``` ```{r n_action_type_per_month_top3_per_course-plot} #| fig-asp: 1.5 n_action_type_per_month_top3_per_course |> ggplot(aes(x = month_start, y = n, color = category, group = category)) + facet_wrap(~course, ncol = 3, scales = "free_y") + geom_line() + theme(legend.position = "bottom") + scale_x_date(date_labels = "%b %Y") ``` ### `eventcategory` Für folgende Analyse wurde eine andere Variable als oben herangezogen, nämlich `eventcategory`. Dadurch resultieren etwas andere Ergebnisse. ```{r data_separated_filtered_count} data_separated_filtered_count <- data_separated_filtered |> filter(type == "eventcategory") |> count(value, sort = TRUE) |> mutate(prop = n / sum(n)) data_separated_filtered_count ``` ```{r} data_separated_filtered_count |> ggtexttable() ``` Als Excel-Datei abspeichern: ```{r} #data_separated_filtered_count |> # writexl::write_xlsx(path = "obj/data_separated_filtered_count.xlsx") ``` ### User-Typen nach Aktivitäten Was ist die Hauptaktivität pro User? 
- Distribution

#### idvisit

```{r}
# Main activity = most frequent category per visit (ties: first one is kept).
# max() on a character column would return the alphabetically last category.
n_action_type_distro <-
  n_action_type |>
  ungroup() |>
  drop_na(category) |>
  count(idvisit, category) |>
  group_by(idvisit) |>
  slice_max(n, n = 1, with_ties = FALSE) |>
  ungroup() |>
  count(category_max = category)

n_action_type_distro
```

```{r}
n_action_type_distro |>
  ggplot(aes(x = n, y = category_max)) +
  geom_col()
```

#### fingerprint

```{r}
# Same logic as above, per fingerprint instead of per visit
n_action_type_distro_fingerpr <-
  n_action_type |>
  ungroup() |>
  drop_na(category) |>
  count(fingerprint, category) |>
  group_by(fingerprint) |>
  slice_max(n, n = 1, with_ties = FALSE) |>
  ungroup() |>
  count(category_max = category)

n_action_type_distro_fingerpr
```

```{r}
n_action_type_distro_fingerpr |>
  ggplot(aes(x = n, y = category_max)) +
  geom_col()
```

## Distribution

```{r}
n_action_type_counted <-
  n_action_type |>
  count(category, sort = TRUE)
```

### Overall - raw values

```{r vis-count-action-type}
n_action_type_counted |>
  ggplot(aes(y = reorder(category, n), x = n)) +
  geom_col() +
  geom_bar_text() +
  labs(
    x = "Number of user actions",
    y = "Action",
    title = "Number of user actions by category"
  ) +
  theme_minimal() +
  scale_x_continuous(labels = scales::comma)
```

### Overall - log scale

```{r vis-count-action-type-log}
#| fig-width: 9
n_action_type_counted |>
  ggplot(aes(y = reorder(category, n), x = n)) +
  geom_col() +
  geom_bar_text() +
  labs(
    x = "Number of user actions",
    y = "Action",
    title = "Number of user actions by category",
    caption = "Log10 scale"
  ) +
  theme_minimal() +
  scale_x_log10()
```

### Per course - raw values

```{r n_action_type_course_uni_counted}
n_action_type_course_uni_counted <-
  n_action_type_course_uni |>
  group_by(course) |>
  count(category, sort = TRUE) |>
  drop_na()
```

```{r n_action_type_course_uni_counted_plot}
#| fig-asp: 1.5
n_action_type_course_uni_counted |>
  ggplot() +
  aes(y = category, x = log(n, base = 10)) +
  geom_col() +
  facet_wrap(~course)
```

# On which days and at what time do users come to HaNS?
## Setup ### idvisit ```{r setup-dates} # Define a vector with the names of the days of the week # Note: Adjust the start of the week (Sunday or Monday) as per your requirement days_of_week <- c( "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday" ) # Replace numbers with day names time_visit_wday$dow2 <- factor( days_of_week[time_visit_wday$dow], levels = days_of_week ) ``` ### fingerprint ```{r setup-dates-fin} # Define a vector with the names of the days of the week # Note: Adjust the start of the week (Sunday or Monday) as per your requirement days_of_week <- c( "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday" ) # Replace numbers with day names time_visit_wday_fingerprint$dow2 <- factor( days_of_week[time_visit_wday_fingerprint$dow], levels = days_of_week ) ``` ## HaNS-Login nach Uhrzeit ### idvisit ```{r vis-hans-login-hour} time_visit_wday |> as_tibble() |> count(hour) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = hour, y = prop)) + geom_col() + theme_minimal() + labs( title = "HaNS-Nutzer sind keine Frühaufsteher", x = "Uhrzeit", y = "Anteil" ) # coord_polar() ``` ### fingerprint ```{r vis-hans-login-hour-fi} time_visit_wday_fingerprint |> as_tibble() |> count(hour) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = hour, y = prop)) + geom_col() + theme_minimal() + labs( title = "HaNS-Nutzer sind keine Frühaufsteher", x = "Uhrzeit", y = "Anteil" ) # coord_polar() ``` ```{r vis-hans-login-hour-polar} time_visit_wday |> as_tibble() |> count(hour) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = hour, y = prop)) + geom_col() + theme_minimal() + coord_polar() ``` ## Verteilung der HaNS-Besuche nach Wochentagen ### idvisit ```{r vis-hans-login-wday-bar} time_visit_wday |> as_tibble() |> count(dow2) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = dow2, y = prop)) + geom_col() + theme_minimal() + labs( title = "Verteilung der HaNS-Logins nach Wochentagen", x = "Wochentag", y = "Anteil" ) # coord_polar() ``` ```{r 
vis-hans-login-wday-polar-fi} time_visit_wday |> as_tibble() |> count(dow2) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = dow2, y = prop)) + geom_col() + theme_minimal() + labs( title = "Verteilung der HaNS-Logins nach Wochentagen", x = "Wochentag", y = "Anteil" ) + coord_polar() ``` #### fingerprint ```{r vis-hans-login-wday-bar-fi} time_visit_wday_fingerprint |> as_tibble() |> count(dow2) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = dow2, y = prop)) + geom_col() + theme_minimal() + labs( title = "Verteilung der HaNS-Logins nach Wochentagen", x = "Wochentag", y = "Anteil" ) # coord_polar() ``` ```{r vis-hans-login-wday-polar} time_visit_wday_fingerprint |> as_tibble() |> count(dow2) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = dow2, y = prop)) + geom_col() + theme_minimal() + labs( title = "Verteilung der HaNS-Logins nach Wochentagen", x = "Wochentag", y = "Anteil" ) + coord_polar() ``` ### HaNS-Login nach Wochentagen Uhrzeit #### idvisit ```{r vis-hans-login-wday-hour} time_visit_wday |> as_tibble() |> count(dow2, hour) |> group_by(dow2) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = hour, y = prop)) + geom_col() + facet_wrap(~dow2) + theme_minimal() + labs( title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten", x = "Wochentag", y = "Anteil" ) # coord_polar() ``` ```{r vis-hans-login-wday-hour-polar} #| fig-width: 9 #| fig-asp: 1.5 time_visit_wday |> as_tibble() |> count(dow2, hour) |> group_by(dow2) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = hour, y = prop)) + geom_col() + facet_wrap(~dow2) + theme_minimal() + labs( title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten", x = "Wochentag", y = "Anteil" ) + coord_polar() ``` #### fingerprint ```{r vis-hans-login-wday-hour_2} time_visit_wday_fingerprint |> as_tibble() |> count(dow2, hour) |> group_by(dow2) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = hour, y = prop)) + geom_col() + facet_wrap(~dow2) + theme_minimal() + labs( title = "Verteilung der HaNS-Logins nach Wochentagen 
und Uhrzeiten", x = "Wochentag", y = "Anteil" ) # coord_polar() ``` ```{r vis-hans-login-wday-hour-polar_2} #| fig-width: 9 #| fig-asp: 1.5 time_visit_wday_fingerprint |> as_tibble() |> count(dow2, hour) |> group_by(dow2) |> mutate(prop = n / sum(n)) |> ggplot(aes(x = hour, y = prop)) + geom_col() + facet_wrap(~dow2) + theme_minimal() + labs( title = "Verteilung der HaNS-Logins nach Wochentagen und Uhrzeiten", x = "Wochentag", y = "Anteil" ) + coord_polar() ``` ## Anzahl der Visits nach Datum (Tagen) und Uhrzeit (bin2d) ### idvisit ```{r} time2 <- time_visit_wday |> ungroup() |> mutate(date = as.Date(date_time)) |> mutate(month_start = floor_date(date_time, "month")) time2 |> ggplot(aes(x = date, y = hour)) + geom_bin2d(binwidth = c(1, 1)) + # (1 day, 1 hour) scale_x_date(date_breaks = "1 month") + theme(legend.position = "bottom") + scale_fill_viridis_c() + labs(caption = "Each x-bin maps to one week") + scale_x_date(breaks = breaks_pretty()) ``` ### fingerprint ```{r} time2_fingerprint <- time_visit_wday_fingerprint |> ungroup() |> mutate(date = as.Date(date_time)) |> mutate(month_start = floor_date(date_time, "month")) time2_fingerprint |> ggplot(aes(x = date, y = hour)) + geom_bin2d(binwidth = c(1, 1)) + # (1 day, 1 hour) scale_x_date(date_breaks = "1 month") + theme(legend.position = "bottom") + scale_fill_viridis_c() + labs(caption = "Each x-bin maps to one week") + scale_x_date(breaks = breaks_pretty()) ``` ## Anzahl der Visits nach Datum (Wochen) und Uhrzeit (bin2d) ### idvisit ```{r} time2 |> ggplot(aes(x = date, y = hour)) + geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 hour scale_x_date(date_breaks = "1 week", date_labels = "%W") + theme(legend.position = "bottom") + scale_fill_viridis_c() + labs( x = "Week number in 2023/2024", caption = "Each x-bin maps to one week" ) + scale_x_date(breaks = breaks_pretty()) ``` ### fingerprint ```{r} time2_fingerprint |> ggplot(aes(x = date, y = hour)) + geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 hour 
scale_x_date(date_breaks = "1 week", date_labels = "%W") + theme(legend.position = "bottom") + scale_fill_viridis_c() + labs( x = "Week number in 2023/2024", caption = "Each x-bin maps to one week" ) + scale_x_date(breaks = breaks_pretty()) ``` ## Anzahl der Visits nach Datum (Wochen) und Wochentag (bin2d) ### idvisit ```{r p-visits-day-wday} time2 |> ggplot(aes(x = date, y = dow)) + geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 hour scale_x_date(date_breaks = "1 week", date_labels = "%W") + theme(legend.position = "bottom") + scale_fill_viridis_c() + labs( x = "Week number in 2023/2024", caption = "Each x-bin maps to one week", y = "Day of Week" ) + scale_y_continuous(breaks = 1:7) + scale_x_date(breaks = breaks_pretty()) ``` ### fingerprint ```{r p-visits-day-wday_2} time2_fingerprint |> ggplot(aes(x = date, y = dow)) + geom_bin2d(binwidth = c(7, 1)) + # 1 week, 1 hour scale_x_date(date_breaks = "1 week", date_labels = "%W") + theme(legend.position = "bottom") + scale_fill_viridis_c() + labs( x = "Week number in 2023/2024", caption = "Each x-bin maps to one week", y = "Day of Week" ) + scale_y_continuous(breaks = 1:7) + scale_x_date(breaks = breaks_pretty()) ``` # KI-Gebrauch ## Interaktion mit dem LLM Berechnungsgrundlage: Für diese Analyse wurden alle Events der Kategorie `llm` gefiltert. 
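The filtering step named above can be sketched on a toy long-format table. This is a hedged illustration in base R: the column names `type` and `value` follow the code in this report, but the event values other than `llm` and all rows are invented.

```r
# Toy long-format events (hypothetical values)
events <- data.frame(
  idvisit = c(1, 1, 2, 2, 3, 3, 3),
  type = c(
    "eventcategory", "eventcategory", "eventcategory", "subtitle",
    "eventcategory", "eventcategory", "eventcategory"
  ),
  value = c("llm", "video", "video", "click_slideChange", "llm", "llm", "search")
)

# Keep only event-category rows whose value mentions "llm"
is_llm <- events$type == "eventcategory" & grepl("llm", events$value, fixed = TRUE)
llm_events <- events[is_llm, ]

# Number of LLM events per visit (only visits with at least one LLM event)
llm_per_visit <- table(llm_events$idvisit)

# Share of all visits with at least one LLM event
visit_uses_llm <- tapply(is_llm, events$idvisit, any)
share_llm <- mean(visit_uses_llm)

print(llm_per_visit)
print(share_llm)
```

The per-visit counts describe only those visits that used the LLM at all, whereas the share relates them to all visits; both views appear in the subsections that follow.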
### Art und Anzahl der Interaktionen mit dem LLM ```{r data_separated_filtered_ai} data_separated_filtered_ai <- data_separated_filtered |> filter(type == "eventcategory") |> filter(str_detect(value, "llm")) |> count(value, sort = TRUE) |> mutate(prop = n / sum(n)) data_separated_filtered_ai ``` ```{r} data_separated_filtered_ai |> mutate(prop = round(prop, 3)) |> ggtexttable() ``` ## Anzahl der `message_to_llm` ```{r llm_interactions} llm_interactions <- data_separated_filtered |> filter(str_detect(value, "message_to_llm")) ``` ### Verteilung ```{r} llm_interactions_count <- llm_interactions |> count(idvisit, sort = TRUE) |> rename(messages_to_llm_n = n) llm_interactions_count |> describe_distribution(messages_to_llm_n, centrality = c("mean", "median")) ``` ### Diagramm ```{r} gghistogram( llm_interactions_count, x = "messages_to_llm_n", bins = 10, add = "median" ) + labs(caption = "The vertical dotted line denotes the median.") ``` ### Anteil Visitors, die mit dem LLM interagieren #### idvisit ```{r} data_separated_filtered_llm_interact <- data_separated_filtered |> mutate(has_llm = str_detect(value, "llm")) |> group_by(idvisit) |> summarise(llm_used_during_visit = any(has_llm == TRUE)) |> count(llm_used_during_visit) |> mutate(prop = round(n / sum(n), 2)) data_separated_filtered_llm_interact |> gt() ``` ```{r} data_separated_filtered_llm_interact |> ggtexttable() ``` #### fingerprint ```{r} data_separated_filtered_llm_interact_fingerprint <- data_separated_filtered |> mutate(has_llm = str_detect(value, "llm")) |> group_by(fingerprint) |> summarise(llm_used_during_visit = any(has_llm == TRUE)) |> count(llm_used_during_visit) |> mutate(prop = round(n / sum(n), 2)) data_separated_filtered_llm_interact_fingerprint |> gt() ``` ```{r} data_separated_filtered_llm_interact_fingerprint |> ggtexttable() ``` ### ... 
Im Zeitverlauf ```{r idvisit_has_llm} idvisit_has_llm |> head(30) ``` ```{r idvisit_has_llm_timeline} idvisit_has_llm_timeline <- idvisit_has_llm |> count(year_month, uses_llm) |> ungroup() |> group_by(year_month) |> mutate(prop = round(n / sum(n), 2)) idvisit_has_llm_timeline ``` ```{r} idvisit_has_llm_timeline |> ggtexttable() ``` ```{r idvisit_has_llm_plot} idvisit_has_llm |> count(year_month, uses_llm) |> ungroup() |> mutate(year_month_date = ymd(paste0(year_month, "-01"))) |> group_by(year_month_date) |> mutate(prop = n / sum(n)) |> ggplot(aes( x = year_month_date, y = prop, color = uses_llm, groups = uses_llm )) + geom_point() + geom_line(aes(group = uses_llm)) + labs( title = "Visitors, die mit dem LLM interagieren im Zeitverlauf (Anteile)" ) + scale_x_date(breaks = pretty_breaks()) ``` ```{r} idvisit_has_llm |> count(year_month, uses_llm) |> ungroup() |> mutate(year_month_date = ymd(paste0(year_month, "-01"))) |> group_by(year_month) |> ggplot(aes(x = year_month_date, y = n, color = uses_llm, groups = uses_llm)) + geom_point() + geom_line(aes(group = uses_llm)) + labs( title = "Visitors, die mit dem LLM interagieren im Zeitverlauf (Anzahl)" ) + scale_x_date(breaks = pretty_breaks()) ``` ## Anzahl der Interaktionen bei den Usern, die mit dem LLM interagieren ```{r d_n_interactions_w_llm} d_n_interactions_w_llm <- data_separated_filtered |> filter(type == "eventcategory") |> filter(str_detect(value, "llm")) |> group_by(idvisit) |> summarise(n_interactions_w_llm = n()) ``` ```{r d_n_interactions_w_llm-describe-distro} d_n_interactions_w_llm |> select(n_interactions_w_llm) |> describe_distribution() |> print_md() ``` ```{r d_n_interactions_w_llm-plot} d_n_interactions_w_llm |> ggplot(aes(x = n_interactions_w_llm)) + geom_histogram() ``` ## Klick auf ein Wort im Transkript Ausgewertet wird im Folgenden die Variable "click_transcript_word". 
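Selecting whole visits that contain at least one `click_transcript_word` event is a grouped filter: the condition is evaluated once per visit and then applied to all rows of that visit. A minimal base R sketch on invented data (the values `play` and `pause` are placeholders):

```r
# Toy events: visits 1 and 3 contain a transcript word click, visit 2 does not
d <- data.frame(
  idvisit = c(1, 1, 2, 2, 3),
  value = c("click_transcript_word", "play", "play", "pause", "click_transcript_word")
)

# Row-level flag: does this row record a click on a transcript word?
has_click <- grepl("click_transcript_word", d$value, fixed = TRUE)

# Visit-level flag, repeated on every row of the visit
visit_has_click <- as.logical(ave(has_click, d$idvisit, FUN = any))

# Keeping visits WITH a click uses the flag as is;
# negating it would keep the visits WITHOUT a click instead
visits_with_click <- d[visit_has_click, ]
visits_without_click <- d[!visit_has_click, ]

print(unique(visits_with_click$idvisit))
```

In dplyr, the same selection corresponds to `group_by(idvisit) |> filter(any(...))`; the sign of the condition decides which of the two groups survives.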
### Overall

```{r ai-click-transcript-word}
data_separated_filtered |>
  filter(type == "subtitle") |>
  # rm empty rows:
  filter(!is.na(value) & value != "") |>
  count(click_transcript_word = str_detect(value, "click_transcript_word")) |>
  mutate(prop = round(n / sum(n), 2)) |>
  tt()
```

### Over time

#### idvisit

```{r click_transcript_word_per_month}
click_transcript_word_per_month <-
  data_separated_filtered |>
  # keep only visits WITH at least one "click_transcript_word"
  # (i.e. rm all groups WITHOUT it):
  group_by(idvisit) |>
  filter(any(str_detect(value, "click_transcript_word"), na.rm = TRUE)) |>
  ungroup() |>
  mutate(date_visit = ymd_hms(value)) |>
  mutate(month_visit = floor_date(date_visit, unit = "month")) |>
  drop_na(date_visit) |>
  group_by(idvisit) |>
  slice(1) |>
  ungroup() |>
  count(month_visit)

click_transcript_word_per_month
```

```{r click_transcript_word_per_month_plot}
click_transcript_word_per_month |>
  ggplot(aes(x = month_visit, y = n)) +
  geom_line()
```

#### fingerprint

```{r click_transcript_word_per_month_2}
click_transcript_word_per_month_fingerprint <-
  data_separated_filtered |>
  # keep only fingerprints WITH at least one "click_transcript_word"
  # (i.e. rm all groups WITHOUT it):
  group_by(fingerprint) |>
  filter(any(str_detect(value, "click_transcript_word"), na.rm = TRUE)) |>
  ungroup() |>
  mutate(date_visit = ymd_hms(value)) |>
  mutate(month_visit = floor_date(date_visit, unit = "month")) |>
  drop_na(date_visit) |>
  group_by(fingerprint) |>
  slice(1) |>
  ungroup() |>
  count(month_visit)

click_transcript_word_per_month_fingerprint
```

```{r click_transcript_word_per_month_plot_2}
click_transcript_word_per_month_fingerprint |>
  ggplot(aes(x = month_visit, y = n)) +
  geom_line()
```

## AI actions

### Overall (entire period)

```{r}
data_long |>
  head(300)
```

### In detail

```{r ai-actions-count}
regex_pattern <- "Category: \"(.*?)(?=', Action)"
# Explaining this regex_pattern:
# Find the literal string
# 1. `Category: "` (the label followed by an opening double quote)
# 2. Capture any characters (.*?) that follow, non-greedily, until...
# 3.
...it encounters the literal sequence, ` Action`) immediately after the captured string. ai_actions_count <- data_long |> # slice(1:1000) |> filter(str_detect(value, "transcript")) |> mutate(category = str_extract(value, regex_pattern)) |> select(category) |> mutate(category = str_replace_all(category, "[\"']", "")) |> count(category, sort = TRUE) ai_actions_count |> tt() ``` ### KI-Klicks pro Monat Im Objekt wird gezählt, wie oft der String `"click_transcript_word"` in den Daten (Langformat) gefunden wird, s. Target `ai_transcript_clicks_per_month` in der Targets-Pipeline. ```{r ai_transcript_clicks_per_month} ai_transcript_clicks_per_month |> head(30) ``` ```{r ai-click-transcript-word-months} ai_transcript_clicks_per_month_count <- ai_transcript_clicks_per_month |> count(year_month, clicks_transcript_any) |> ungroup() |> group_by(year_month) |> mutate(prop = round(n / sum(n), 2)) ai_transcript_clicks_per_month_count ``` ```{r} ai_transcript_clicks_per_month_count |> ggtexttable() ``` ```{r ai_transcript_clicks_per_month_count-plot} ai_transcript_clicks_per_month_count |> mutate(date = ymd(paste0(year_month, "-01"))) |> ggplot(aes(x = date, y = n)) + geom_line(group = 1) + geom_point() + theme_minimal() + labs(title = "Number of AI transcript clicks per month", x = "date [months]") ``` ## Output des LLMs: `llm_response` - Tokens und Tokenlänge ### Deutsch vs. 
Englisch ```{r llm_response_text-count} llm_response_text |> count(lang) |> mutate(prob = n / sum(n)) ``` ### Anzahl der Tokens ```{r} llm_response_text |> describe_distribution(select = "tokens_n") ``` ### Anzahl vorab existierender Fragen #### Anzahl `verify_option_wrong` und `verify_option_correct` ##### idvisit ```{r verify_option_summary} verify_option_summary <- data_separated_filtered |> group_by(idvisit) |> filter(value == "verify_option_wrong" | value == "verify_option_correct") |> summarise(verify_option = n()) ``` ```{r} verify_option_summary |> gghistogram(x = "verify_option") ``` ```{r} verify_option_summary |> describe_distribution(verify_option) |> print_md() ``` ##### fingerprint ```{r verify_option_summary_2} # verify_option_summary_fingerprint <- # data_separated_filtered |> # group_by(fingerprint) |> # filter(value == "verify_option_wrong" | value == "verify_option_correct") |> # summarise(verify_option = n()) setDT(data_separated_filtered) # Ensure your data frame is a data.table verify_option_summary_fingerprint <- data_separated_filtered[ # 1. Filtering (i) value %in% c("verify_option_wrong", "verify_option_correct"), # 2. Summarize (.j) - calculate the count (n) .(verify_option = .N), # 3. 
Grouping (by) by = .(fingerprint) ] verify_option_summary_fingerprint <- as_tibble( verify_option_summary_fingerprint ) ``` ```{r} verify_option_summary_fingerprint |> gghistogram(x = "verify_option") ``` ```{r} verify_option_summary_fingerprint |> describe_distribution(verify_option) |> print_md() ``` #### Anzahl `verify_option_wrong` verify_option_div_by_4 - geteilt durch 4 ```{r} verify_option_summary <- verify_option_summary |> mutate(verify_option_div_by_4 = verify_option / 4) verify_option_summary |> gghistogram(x = "verify_option_div_by_4") ``` ```{r} verify_option_summary |> mutate(verify_option_div_by_4 = verify_option / 4) |> describe_distribution(verify_option_div_by_4) |> print_md() ``` #### Anzahl "Multiple choice answer selected" ```{r} check_if_both_methods_give_same_number <- n_mc_answers_selected |> full_join(verify_option_summary) check_if_both_methods_give_same_number |> head(20) |> gt() ``` Nein, beide Methoden liefern *nicht* die gleiche Zahl. #### "Multiple choice answer selected" im Zeitverlauf ```{r} mc_answers_with_timestamps <- mc_answers_with_timestamps |> mutate(month_start = floor_date(timestamp, "month")) |> ungroup() |> arrange(timestamp) |> mutate(n_cumulated = cumsum(n)) |> mutate(date = as.Date(timestamp)) lim <- c( min(mc_answers_with_timestamps$date), max(mc_answers_with_timestamps$date) ) mc_answers_with_timestamps |> ggplot(aes(x = date, y = n_cumulated)) + scale_x_date(limits = lim, labels = scales::label_date_short()) + geom_point() + geom_line() ``` #### Anzahl `generate_questionaire` ```{r generate_questionaire_summary} # generate_questionaire_summary <- # data_separated_filtered |> # group_by(idvisit) |> # filter(value == "generate_questionaire") |> # summarise(generate_questionaire = n()) setDT(data_separated_filtered) # Convert the data.frame to a data.table in place generate_questionaire_summary <- data_separated_filtered[ # 1. Filtering (i) value == "generate_questionaire", # 2. 
  # Summarize (j) - calculate the count (.N) and rename it
  .(generate_questionaire = .N),
  # 3. Grouping (by)
  by = .(idvisit)
]
```

```{r}
generate_questionaire_summary |>
  describe_distribution(generate_questionaire) |>
  print_md()
```

#### Number of pre-existing questions

```{r prior_existing_questions_summary}
setDT(generate_questionaire_summary)
setDT(verify_option_summary)

# 1. Full join (merge)
# merge() with all = TRUE (i.e. all.x = TRUE and all.y = TRUE) is a full join;
# the join column is 'idvisit'
prior_existing_questions_summary <- merge(
  generate_questionaire_summary,
  verify_option_summary,
  by = "idvisit",
  all = TRUE
)

# 2. Mutate (calculation)
# Use j with := to create the new column by reference
prior_existing_questions_summary[,
  prior_existing_questions_n := verify_option - generate_questionaire
]

# prior_existing_questions_summary <-
#   generate_questionaire_summary |>
#   full_join(verify_option_summary) |>
#   mutate(prior_existing_questions_n = verify_option - generate_questionaire)
```

```{r}
prior_existing_questions_summary |>
  # drop_na() |>
  gghistogram(x = "prior_existing_questions_n")
```

```{r}
prior_existing_questions_summary |>
  describe_distribution(prior_existing_questions_n) |>
  print_md()
```

# Video time

How much time do users spend watching videos (viewing time, "Glotzdauer")?

## Viewing time in general

Note: Video time is difficult to analyse. Users do not end a video by pressing "Pause" but by performing other actions, which is hard to capture analytically. Cf. the definition of the target `glotzdauer` in the [pipeline](https://github.com/sebastiansauer/hans/blob/main/_targets.R#L269). In short, the time difference between two consecutive "Play" and "Pause" actions is computed. This approach has its difficulties, however: a "Play" is not always followed by a "Pause", so it is hard to determine when the viewing of a video ends.
This analysis should therefore be interpreted with caution. The definition of [the function glotzdauer.R](https://github.com/sebastiansauer/hans/blob/main/funs/glotzdauer.R) is documented online.

```{r}
data_separated_distinct_slice |>
  head(30)
```

The following plot uses the *absolute* time values, i.e. without sign.

```{r p-plotzdauer}
data_separated_distinct_slice |>
  # we assume that a negative glotzdauer means the same as a positive one:
  mutate(time_diff = abs(time_diff)) |>
  # keep only glotzdauer values shorter than 10 minutes:
  filter(time_diff < 60 * 10) |>
  ggplot(aes(x = time_diff)) +
  geom_histogram() +
  scale_x_time() +
  labs(
    x = "Time interval [minutes]",
    caption = "Only time intervals less than 10 minutes. It is assumed that video time is positive only (no negative time intervals)."
  ) +
  theme_minimal()
```

```{r glotzdauer-stats}
glotzdauer_prepped <-
  data_separated_distinct_slice |>
  # we assume that a negative glotzdauer means the same as a positive one:
  mutate(time_diff_abs_sec = abs(as.numeric(time_diff, units = "secs"))) |>
  # keep only glotzdauer values shorter than 10 minutes:
  filter(time_diff_abs_sec < 60 * 10) |>
  mutate(time_diff_abs_min = time_diff_abs_sec / 60)

glotzdauer_tbl <-
  glotzdauer_prepped |>
  select(time_diff_abs_sec, time_diff_abs_min) |>
  describe_distribution()

glotzdauer_tbl
```

```{r glotzdauer_tbl-ggtexttable}
glotzdauer_tbl |>
  mutate(across(where(is.numeric), ~ round(., 2))) |>
  ggpubr::ggtexttable()
```

## Viewing time over time

```{r glotzdauer_prepped_tbl}
glotzdauer_prepped_tbl <-
  glotzdauer_prepped |>
  mutate(first_of_month = floor_date(date, unit = "month")) |>
  group_by(first_of_month) |>
  summarise(time_diff_mean = mean(time_diff, na.rm = TRUE))

glotzdauer_prepped_tbl
```

```{r glotzdauer_prepped_plot}
glotzdauer_prepped_tbl |>
  ggplot(aes(x = first_of_month, y = time_diff_mean)) +
  geom_line() +
  theme_minimal()
```

# Conclusion

.