library(tidymodels)
data(penguins, package = "palmerpenguins")
wfsets_penguins01
Exercise
Compute the predictive performance (RMSE) for the following learning algorithms:
- linear model
- kNN (neighbors: tune)
Model equation: body_mass_g ~ bill_length_mm, data = d_train
Use minimal preprocessing.
Solution
Setup
Data
d <-
  penguins %>%
  drop_na()

d_split <- initial_split(d)
d_train <- training(d_split)
d_test <- testing(d_split)
Models
Linear model:

mod_lin <- linear_reg()

kNN model:

mod_knn <- nearest_neighbor(mode = "regression",
                            neighbors = tune())
Recipes
rec_basic <- recipe(body_mass_g ~ bill_length_mm, data = d_train) %>%
  step_normalize(all_predictors())

rec_basic
Resampling
rsmpls <- vfold_cv(d_train)
Workflow Set
wf_set <-
  workflow_set(
    preproc = list(rec_simple = rec_basic),
    models = list(mod_lm = mod_lin,
                  mod_nn = mod_knn)
  )
wf_set
# A workflow set/tibble: 2 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 rec_simple_mod_lm <tibble [1 × 4]> <opts[0]> <list [0]>
2 rec_simple_mod_nn <tibble [1 × 4]> <opts[0]> <list [0]>
Fitting
wf_fit <-
  wf_set %>%
  workflow_map(resamples = rsmpls)
wf_fit
# A workflow set/tibble: 2 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 rec_simple_mod_lm <tibble [1 × 4]> <opts[1]> <rsmp[+]>
2 rec_simple_mod_nn <tibble [1 × 4]> <opts[1]> <tune[+]>
Check:
wf_fit %>% pluck("result")
[[1]]
# Resampling results
# 10-fold cross-validation
# A tibble: 10 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [224/25]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [224/25]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [224/25]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
4 <split [224/25]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
5 <split [224/25]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
6 <split [224/25]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
7 <split [224/25]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
8 <split [224/25]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
9 <split [224/25]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
10 <split [225/24]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
[[2]]
# Tuning results
# 10-fold cross-validation
# A tibble: 10 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [224/25]> Fold01 <tibble [16 × 5]> <tibble [0 × 3]>
2 <split [224/25]> Fold02 <tibble [16 × 5]> <tibble [0 × 3]>
3 <split [224/25]> Fold03 <tibble [16 × 5]> <tibble [0 × 3]>
4 <split [224/25]> Fold04 <tibble [16 × 5]> <tibble [0 × 3]>
5 <split [224/25]> Fold05 <tibble [16 × 5]> <tibble [0 × 3]>
6 <split [224/25]> Fold06 <tibble [16 × 5]> <tibble [0 × 3]>
7 <split [224/25]> Fold07 <tibble [16 × 5]> <tibble [0 × 3]>
8 <split [224/25]> Fold08 <tibble [16 × 5]> <tibble [0 × 3]>
9 <split [224/25]> Fold09 <tibble [16 × 5]> <tibble [0 × 3]>
10 <split [225/24]> Fold10 <tibble [16 × 5]> <tibble [0 × 3]>
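Note that workflow_map() applied tune_grid() to the tunable kNN workflow and plain resampling to the linear model. If you want more control, you can pass options through to the tuning function; a sketch (the grid size, metric set, and seed below are just example values, not part of the solution above):

```r
library(tidymodels)

# Sketch: passing tuning options through workflow_map().
# grid, metrics, and seed are example values.
wf_fit2 <-
  wf_set %>%
  workflow_map(
    fn = "tune_grid",            # the default tuning function
    resamples = rsmpls,
    grid = 10,                   # number of candidate values for `neighbors`
    metrics = metric_set(rmse),  # restrict evaluation to RMSE
    seed = 42,
    verbose = TRUE
  )
```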
Best candidate
autoplot(wf_fit)
autoplot(wf_fit, select_best = TRUE)
rank_results(wf_fit, rank_metric = "rmse") %>%
  filter(.metric == "rmse")
# A tibble: 9 × 9
wflow_id .config .metric mean std_err n preprocessor model rank
<chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
1 rec_simple_mod_nn Prepro… rmse 642. 31.3 10 recipe near… 1
2 rec_simple_mod_nn Prepro… rmse 646. 30.9 10 recipe near… 2
3 rec_simple_mod_lm Prepro… rmse 647. 24.0 10 recipe line… 3
4 rec_simple_mod_nn Prepro… rmse 648. 32.2 10 recipe near… 4
5 rec_simple_mod_nn Prepro… rmse 659. 31.7 10 recipe near… 5
6 rec_simple_mod_nn Prepro… rmse 660. 32.2 10 recipe near… 6
7 rec_simple_mod_nn Prepro… rmse 687. 36.4 10 recipe near… 7
8 rec_simple_mod_nn Prepro… rmse 729. 39.7 10 recipe near… 8
9 rec_simple_mod_nn Prepro… rmse 786. 47.6 10 recipe near… 9
According to rank_results(), the best-ranked candidate is actually the kNN model, but the linear model is practically on par (RMSE 647 vs. 642, well within one standard error) and simpler. Let's take a look at the kNN model anyway, mainly to see how to read off the best tuning-parameter value:
extract_workflow_set_result(wf_fit, "rec_simple_mod_nn") %>%
  select_best()
Warning: No value of `metric` was given; metric 'rmse' will be used.
# A tibble: 1 × 2
neighbors .config
<int> <chr>
1 14 Preprocessor1_Model8
Last Fit
best_wf <-
  wf_fit %>%
  extract_workflow("rec_simple_mod_lm")
We don't need to finalize this workflow, since it has no tuning parameters.
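For comparison, here is a sketch of what finalizing would look like had the kNN workflow won (the name `best_wf_knn` is introduced here for illustration):

```r
# Sketch: finalizing a tuned workflow before last_fit().
# Only needed for workflows with tuning parameters, such as the kNN model.
best_params <-
  extract_workflow_set_result(wf_fit, "rec_simple_mod_nn") %>%
  select_best(metric = "rmse")

best_wf_knn <-
  wf_fit %>%
  extract_workflow("rec_simple_mod_nn") %>%
  finalize_workflow(best_params)
```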
fit_final <-
  best_wf %>%
  last_fit(d_split)
Model performance on the test set
collect_metrics(fit_final)
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 658. Preprocessor1_Model1
2 rsq standard 0.342 Preprocessor1_Model1
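If you also want to inspect the individual test-set predictions, collect_predictions() works on the last_fit() object; a minimal sketch:

```r
# Sketch: test-set predictions from the final fit.
fit_final %>%
  collect_predictions() %>%
  select(.pred, body_mass_g) %>%
  head()
```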
Categories:
- R
- statlearning
- tidymodels
- num