library(tidyverse)
data("germeval_train", package = "pradadata")
data("germeval_test", package = "pradadata")
germeval03-sent-wordvec-xgb
Task
Build a predictive model for text data. Use sentiments and text features as part of the feature engineering. In addition, use German word vectors for the feature engineering.
Use XGBoost (XGB) as the learning algorithm.
Use the GermEval 2018 data.
The data are licensed under CC-BY-4.0. Author: Wiegand, Michael (Spoken Language Systems, Saarland University (2010-2018); Leibniz Institute for the German Language (since 2019)).
The data can also be obtained via the R package pradadata.
The outcome variable (AV) is c1. The (only) predictor variable (UV) is text.
Hints:
- Otherwise, follow the general notes of the Datenwerk.
- Use tidymodels.
- Use the sentiws lexicon.
- ❗ Make sure to remove the variable c2, i.e. not to use it.
Solution
d_train <-
  germeval_train |>
  select(id, c1, text)
library(tictoc)
library(tidymodels)
library(syuzhet)
library(beepr)
library(lobstr) # object size
data("sentiws", package = "pradadata")
A template for a tidymodels pipeline can be found here.
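For orientation, here is a minimal sketch of such a pipeline (this is not the linked template; the simple bag-of-words recipe is only a placeholder for the feature engineering built below):

library(textrecipes)

# Minimal pipeline skeleton: recipe, model spec, workflow; fitting and
# predicting then follow as in the sections below.
rec_tmpl <- recipe(c1 ~ text, data = d_train) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 100) |>
  step_tf(text)

wf_tmpl <- workflow() |>
  add_recipe(rec_tmpl) |>
  add_model(boost_tree(mode = "classification"))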
Learner/Model
mod <-
  boost_tree(mode = "classification",
             learn_rate = .01,
             tree_depth = 5)
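Note that the number of boosting rounds is not set here, so the parsnip default of 15 rounds applies (visible as nrounds = 15 in the fitted model further down). A variant with more rounds could be specified explicitly, for example:

# Variant (not used here): more boosting rounds, to give the low learning
# rate a chance to converge.
mod_more_trees <- boost_tree(mode = "classification",
                             trees = 500,
                             learn_rate = .01,
                             tree_depth = 5)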
Recipe
Path to the word vectors:
<- "/Users/sebastiansaueruser/datasets/word-embeddings/wikipedia2vec/part-0.arrow" path_wordvec
source("https://raw.githubusercontent.com/sebastiansauer/Datenwerk2/main/funs/def_recipe_wordvec_senti.R")
rec <- def_recipe_wordvec_senti(data_train = d_train,
                                path_wordvec = path_wordvec)
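The sourced helper def_recipe_wordvec_senti() is not reproduced here. Judging from the step list printed further down, such a recipe could look roughly like the following sketch; the way the embeddings file is read, the mutate steps (which create the sentiment and count features seen in the baked data), and the column roles are assumptions, not the actual implementation.

library(textrecipes)

# Assumption: the arrow file can be read as a Feather/IPC table with one
# token column followed by the embedding dimensions.
wordvec <- arrow::read_feather(path_wordvec)

# Sketch of a comparable recipe (not the sourced function itself):
# normalization, text features, tokenization, stopword removal, word
# embeddings, plus housekeeping steps.
rec_sketch <- recipe(c1 ~ ., data = d_train) |>
  update_role(id, new_role = "id") |>
  step_text_normalization(text) |>
  step_mutate(text_copy = text) |>
  step_textfeature(text_copy) |>
  step_tokenize(text) |>
  step_stopwords(text, language = "de") |>
  step_word_embeddings(text, embeddings = wordvec) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_impute_mean(all_numeric_predictors())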
Workflow
source("https://raw.githubusercontent.com/sebastiansauer/Datenwerk2/main/funs/def_df.R")
wf <- def_wf()
wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
13 Recipe Steps
• step_text_normalization()
• step_mutate()
• step_mutate()
• step_mutate()
• step_mutate()
• step_textfeature()
• step_tokenize()
• step_stopwords()
• step_word_embeddings()
• step_zv()
• ...
• and 3 more steps.
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
tree_depth = 5
learn_rate = 0.01
Computational engine: xgboost
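def_wf() is also a sourced helper and not shown here; judging from the output above, it presumably just bundles the recipe and the model into a workflow, roughly like this sketch:

# Sketch of what def_wf() presumably does: combine recipe and model spec.
def_wf_sketch <- function(recipe_obj = rec, model_obj = mod) {
  workflow() |>
    add_recipe(recipe_obj) |>
    add_model(model_obj)
}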
Check
tic()
rec_prepped <- prep(rec)
toc()
67.325 sec elapsed
rec_prepped
obj_size(rec_prepped)
3.17 GB
Large!
tidy(rec_prepped)
# A tibble: 13 × 6
number operation type trained skip id
<int> <chr> <chr> <lgl> <lgl> <chr>
1 1 step text_normalization TRUE FALSE text_normalization_QTRCS
2 2 step mutate TRUE FALSE mutate_z4zTn
3 3 step mutate TRUE FALSE mutate_bjCuT
4 4 step mutate TRUE FALSE mutate_OVxpj
5 5 step mutate TRUE FALSE mutate_TRK3c
6 6 step textfeature TRUE FALSE textfeature_6BkkC
7 7 step tokenize TRUE FALSE tokenize_csz3N
8 8 step stopwords TRUE FALSE stopwords_HU9cX
9 9 step word_embeddings TRUE FALSE word_embeddings_2ZNxu
10 10 step zv TRUE FALSE zv_FNUiA
11 11 step normalize TRUE FALSE normalize_bOlig
12 12 step impute_mean TRUE FALSE impute_mean_kRaUZ
13 13 step mutate TRUE FALSE mutate_PpudL
d_rec_baked <- bake(rec_prepped, new_data = NULL)
head(d_rec_baked)
# A tibble: 6 × 121
id c1 emo_count schimpf_count emoji_count textfeature_text_copy_n_wo…¹
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 OTHER 0.575 -0.450 -0.353 -0.495
2 2 OTHER -1.11 -0.450 -0.353 -0.0874
3 3 OTHER 0.186 -0.450 0.774 -0.903
4 4 OTHER 0.202 -0.450 -0.353 -0.0874
5 5 OFFENSE 0.168 -0.450 -0.353 -0.393
6 6 OTHER -1.12 -0.450 -0.353 2.46
# ℹ abbreviated name: ¹textfeature_text_copy_n_words
# ℹ 115 more variables: textfeature_text_copy_n_uq_words <dbl>,
# textfeature_text_copy_n_charS <dbl>,
# textfeature_text_copy_n_uq_charS <dbl>,
# textfeature_text_copy_n_digits <dbl>,
# textfeature_text_copy_n_hashtags <dbl>,
# textfeature_text_copy_n_uq_hashtags <dbl>, …
sum(is.na(d_rec_baked))
[1] 0
obj_size(d_rec_baked)
4.85 MB
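The baked data itself is small; only the prepped recipe is heavy. As an optional check (not part of the original solution), the 121 baked columns can be counted by feature family via their name prefixes:

# Count baked columns by name prefix (e.g. textfeature_*, the word-embedding
# columns, the sentiment counts).
tibble(feature = names(d_rec_baked)) |>
  mutate(family = str_extract(feature, "^[a-z]+")) |>
  count(family, sort = TRUE)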
Fit
tic()
fit_wordvec_senti_xgb <-
  fit(wf,
      data = d_train)
toc()
35.314 sec elapsed
beep()
fit_wordvec_senti_xgb
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
13 Recipe Steps
• step_text_normalization()
• step_mutate()
• step_mutate()
• step_mutate()
• step_mutate()
• step_textfeature()
• step_tokenize()
• step_stopwords()
• step_word_embeddings()
• step_zv()
• ...
• and 3 more steps.
── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 42.4 Kb
call:
xgboost::xgb.train(params = list(eta = 0.01, max_depth = 5, gamma = 0,
colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist,
verbose = 0, nthread = 1, objective = "binary:logistic")
params (as set within xgb.train):
eta = "0.01", max_depth = "5", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
xgb.attributes:
niter
callbacks:
cb.evaluation.log()
# of features: 119
niter: 15
nfeatures : 119
evaluation_log:
iter training_logloss
1 0.6904064
2 0.6877236
---
14 0.6590144
15 0.6568817
Object size:
lobstr::obj_size(fit_wordvec_senti_xgb)
3.17 GB
Large!
As we saw above, the prepped recipe is huge.
library(butcher)
out <- butcher(fit_wordvec_senti_xgb)
lobstr::obj_size(out)
3.16 GB
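butcher() barely shrinks the object here. A plausible explanation (an assumption, not verified above) is that the bulk of the size sits in the prepped recipe, which carries the embedding table; the parts of the trained workflow can be sized separately:

# Compare the sizes of the trained workflow's components.
lobstr::obj_size(extract_fit_parsnip(fit_wordvec_senti_xgb))  # the xgboost fit itself
lobstr::obj_size(extract_recipe(fit_wordvec_senti_xgb))       # the prepped recipe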
Test set performance
Predictions on the test set:
tic()
preds <-
  predict(fit_wordvec_senti_xgb, new_data = germeval_test)
toc()
22.669 sec elapsed
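Class predictions suffice for accuracy and F1. If threshold-based metrics such as ROC AUC were of interest, class probabilities could be requested as well (an optional extension, not part of the original solution):

# Optional: predicted class probabilities, e.g. for roc_auc().
preds_prob <- predict(fit_wordvec_senti_xgb,
                      new_data = germeval_test,
                      type = "prob")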
And add the predictions to the test set, so that truth and estimate can be compared:
d_test <-
  germeval_test |>
  bind_cols(preds) |>
  mutate(c1 = as.factor(c1))
my_metrics <- metric_set(accuracy, f_meas)

my_metrics(d_test,
           truth = c1,
           estimate = .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.689
2 f_meas binary 0.400
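Accuracy is around .69, but F1 is only .40; note that yardstick treats the first factor level (here OFFENSE) as the event by default. A confusion matrix, as an optional follow-up, would show how the errors split across the two classes:

# Optional follow-up: confusion matrix for the test set predictions.
conf_mat(d_test, truth = c1, estimate = .pred_class)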