tidymodels-tree1
statlearning
trees
tidymodels
string
mtcars
Exercise
Compute the following predictive models and compare their performance:
- Decision tree
- Bagging (bootstrap trees)
Model formula: am ~ . (dataset mtcars)
Report the model performance (ROC-AUC).
Notes:
- Tune all parameters that the engine offers.
- Use defaults unless stated otherwise.
- Use \(v=2\)-fold cross-validation (because the sample is so small).
- Follow the usual guidelines.
Solution
Setup
library(tidymodels)
data(mtcars)
library(tictoc)    # timing
library(baguette)  # provides bag_tree()
For classification, tidymodels requires a nominal (factor) outcome variable, not a numeric one:
mtcars <-
  mtcars %>%
  mutate(am = factor(am))
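The conversion can be checked quickly. The following is a minimal sketch in base R (equivalent in effect to the `mutate()` call) that inspects the resulting factor levels and class counts. The level order matters because, by default, yardstick treats the first factor level (here `"0"`, automatic transmission) as the event of interest when computing binary metrics such as ROC-AUC.

```r
data(mtcars)

# Base-R equivalent of mutate(am = factor(am))
mtcars$am <- factor(mtcars$am)

# Inspect the level order: the first level is the default "event" in yardstick
levels(mtcars$am)

# Class counts for automatic (0) and manual (1) transmission
table(mtcars$am)
```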
Splitting the data
d_split <- initial_split(mtcars)
d_train <- training(d_split)
d_test <- testing(d_split)
Model(s)
mod_tree <-
  decision_tree(mode = "classification",
                cost_complexity = tune(),
                tree_depth = tune(),
                min_n = tune())
mod_bag <-
  bag_tree(mode = "classification",
           cost_complexity = tune(),
           tree_depth = tune(),
           min_n = tune())
Recipe(s)
rec_plain <-
  recipe(am ~ ., data = d_train)
Resampling
rsmpl <- vfold_cv(d_train, v = 2)
Workflows
wf_tree <-
  workflow() %>%
  add_recipe(rec_plain) %>%
  add_model(mod_tree)
wf_bag <-
  workflow() %>%
  add_recipe(rec_plain) %>%
  add_model(mod_bag)
Tuning/Fitting
Tuning grid:
tune_grid <- grid_regular(extract_parameter_set_dials(mod_tree), levels = 5)
tune_grid
# A tibble: 125 × 3
cost_complexity tree_depth min_n
<dbl> <int> <int>
1 0.0000000001 1 2
2 0.0000000178 1 2
3 0.00000316 1 2
4 0.000562 1 2
5 0.1 1 2
6 0.0000000001 4 2
7 0.0000000178 4 2
8 0.00000316 4 2
9 0.000562 4 2
10 0.1 4 2
# ℹ 115 more rows
Since both models have the same tuning parameters, we only need to create one grid.
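That both specifications share the same tuning parameters can be verified directly. The sketch below rebuilds the two model specifications from this solution and compares the parameter names; `params_tree` and `params_bag` are illustrative names, not part of the original.

```r
library(tidymodels)
library(baguette)  # provides bag_tree()

mod_tree <-
  decision_tree(mode = "classification",
                cost_complexity = tune(),
                tree_depth = tune(),
                min_n = tune())

mod_bag <-
  bag_tree(mode = "classification",
           cost_complexity = tune(),
           tree_depth = tune(),
           min_n = tune())

# Extract the parameters marked with tune() from each specification
params_tree <- extract_parameter_set_dials(mod_tree)
params_bag <- extract_parameter_set_dials(mod_bag)

# TRUE if both specifications tune the same set of parameters
setequal(params_tree$name, params_bag$name)
```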
tic()
fit_tree <-
  tune_grid(object = wf_tree,
            grid = tune_grid,
            metrics = metric_set(roc_auc),
            resamples = rsmpl)
toc()
18.258 sec elapsed
fit_tree
# Tuning results
# 2-fold cross-validation
# A tibble: 2 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [12/12]> Fold1 <tibble [125 × 7]> <tibble [75 × 3]>
2 <split [12/12]> Fold2 <tibble [125 × 7]> <tibble [75 × 3]>
There were issues with some computations:
- Warning(s) x50: 21 samples were requested but there were 12 rows in the data. 12 ...
- Warning(s) x50: 30 samples were requested but there were 12 rows in the data. 12 ...
- Warning(s) x50: 40 samples were requested but there were 12 rows in the data. 12 ...
Run `show_notes(.Last.tune.result)` for more information.
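The warnings arise because the regular grid contains `min_n` values of 21, 30, and 40, while each analysis set in the 2-fold cross-validation has only 12 rows, so the requested value is reduced to 12. One possible way to avoid them, sketched here as an assumption rather than as the approach taken in this solution, is to restrict the range of `min_n` when building the grid. The name `grid_small` and the upper bound of 10 are illustrative choices.

```r
library(tidymodels)

# Hypothetical alternative grid: keep min_n below the 12 rows per analysis set.
# The upper bound of 10 is an illustrative choice, not taken from the original.
grid_small <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(range = c(2L, 10L)),
  levels = 5
)

# Largest and smallest min_n in the restricted grid
range(grid_small$min_n)

# Number of candidate parameter combinations
nrow(grid_small)
```

Such a grid could be passed to the `grid` argument of the tuning call in place of the default one; whether that changes the ranking of candidates would need to be checked by rerunning the tuning.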
tic()
fit_bag <-
  tune_grid(object = wf_bag,
            grid = tune_grid,
            metrics = metric_set(roc_auc),
            resamples = rsmpl)
toc()
97.331 sec elapsed
Best candidate
show_best(fit_tree)
# A tibble: 5 × 9
cost_complexity tree_depth min_n .metric .estimator mean n std_err
<dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
1 0.0000000001 1 2 roc_auc binary 0.845 2 0.0119
2 0.0000000178 1 2 roc_auc binary 0.845 2 0.0119
3 0.00000316 1 2 roc_auc binary 0.845 2 0.0119
4 0.000562 1 2 roc_auc binary 0.845 2 0.0119
5 0.1 1 2 roc_auc binary 0.845 2 0.0119
# ℹ 1 more variable: .config <chr>
show_best(fit_bag)
# A tibble: 5 × 9
cost_complexity tree_depth min_n .metric .estimator mean n std_err
<dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
1 0.00000316 8 2 roc_auc binary 0.967 2 0.00423
2 0.00000316 11 2 roc_auc binary 0.967 2 0.00423
3 0.0000000178 15 2 roc_auc binary 0.967 2 0.00423
4 0.0000000001 4 11 roc_auc binary 0.967 2 0.00423
5 0.0000000178 11 2 roc_auc binary 0.965 2 0.0206
# ℹ 1 more variable: .config <chr>
Bagging achieved clearly better model performance in the validation samples than the decision tree model (mean ROC-AUC of 0.967 vs. 0.845).
Finalizing
wf_best_finalized <-
  wf_bag %>%
  finalize_workflow(select_best(fit_bag))
Last Fit
final_fit <-
  last_fit(object = wf_best_finalized, d_split)
collect_metrics(final_fit)
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 1 Preprocessor1_Model1
2 roc_auc binary 1 Preprocessor1_Model1
3 brier_class binary 0.00103 Preprocessor1_Model1
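In this run, the ROC-AUC in the test sample (1) is higher than the mean in the validation samples (0.967), not lower. Typically, model performance in the test sample turns out somewhat worse than in the training or validation samples; with a test sample this small, the estimate is very noisy and should not be over-interpreted.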
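To put the test-sample result in perspective, it helps to look at how few observations it is based on. The following minimal sketch reproduces the default split sizes for `mtcars`; the seed is an arbitrary choice for illustration and was not used in the original, so the particular rows assigned to each set will differ from the run shown here.

```r
library(rsample)
data(mtcars)

set.seed(1)  # arbitrary seed, for illustration only

# Default split: 3/4 of the rows go to the training set
d_split <- initial_split(mtcars)

n_train <- nrow(training(d_split))
n_test <- nrow(testing(d_split))

# Sizes of the training and test sets
c(train = n_train, test = n_test)
```

With so few test rows, a single misranked observation shifts ROC-AUC noticeably, which is one reason a perfect test score here is weak evidence on its own.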