Exercise
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom 1.0.5 ✔ recipes 1.0.8
✔ dials 1.2.0 ✔ rsample 1.2.0
✔ dplyr 1.1.4 ✔ tibble 3.2.1
✔ ggplot2 3.5.0 ✔ tidyr 1.3.1
✔ infer 1.0.5 ✔ tune 1.1.2
✔ modeldata 1.3.0 ✔ workflows 1.1.3
✔ parsnip 1.2.0 ✔ workflowsets 1.0.1
✔ purrr 1.0.2 ✔ yardstick 1.3.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(tictoc)
# Data:
d_path <- "https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv"
d <- read.csv(d_path)
The following pipeline contains an error. What is it?
set.seed(42)
d_split <- initial_split(d)
d_train <- training(d_split)
d_test <- testing(d_split)
# model:
mod1 <-
  rand_forest(mode = "regression")
# cv:
set.seed(42)
rsmpl <- vfold_cv(d_train)
# recipe:
rec1 <- recipe(body_mass_g ~ ., data = d_train) |>
  # step_unknown(all_nominal_predictors(), new_level = "NA") |>
  # step_novel(all_nominal_predictors()) |>
  step_naomit(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_normalize(all_predictors())
# workflow:
wf1 <-
  workflow() %>%
  add_model(mod1) %>%
  add_recipe(rec1)
# fitting:
tic()
wf1_fit <-
  wf1 %>%
  fit(data = d_train)
toc()
preds <- predict(wf1_fit, new_data = d_test)
Error: Missing data in columns: bill_length_mm, bill_depth_mm, flipper_length_mm.
As a check, here is the prepped and baked recipe:
rec1_prepped <- prep(rec1)
d_train_baked <- bake(rec1_prepped, new_data = NULL)
# A tibble: 6 × 12
  rownames bill_length_mm bill_depth_mm flipper_length_mm    year body_mass_g
     <dbl>          <dbl>         <dbl>             <dbl>   <dbl>       <int>
1    -1.24          -1.53         0.386            -0.794   -1.29        3450
2     1.45           1.32         0.386            -0.365    1.14        3675
3   -0.212          0.401         -1.97             0.707   -1.29        4500
4   -0.993          0.343         0.887            -0.294 -0.0757        4150
5    0.530          0.879        -0.566              2.07 -0.0757        5800
6   -0.281         -0.957         0.787             -1.15    1.14        3650
# ℹ 6 more variables: species_Chinstrap <dbl>, species_Gentoo <dbl>,
#   island_Dream <dbl>, island_Torgersen <dbl>, sex_female <dbl>,
#   sex_male <dbl>
d_train_baked |>
map_int(~ sum(is.na(.)))
         rownames    bill_length_mm     bill_depth_mm flipper_length_mm
                0                 0                 0                 0
             year       body_mass_g species_Chinstrap    species_Gentoo
                0                 0                 0                 0
     island_Dream  island_Torgersen        sex_female          sex_male
                0                 0                 0                 0
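The per-column count of missing values used in the check above can also be written in base R. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the penguins data); it also shows that an empty string is not counted as `NA`.

```r
# Toy data frame (hypothetical values), standing in for any data set
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3),
  body_mass_g    = c(3750, NA, 3250),
  sex            = c("male", "", "female")
)

# is.na() returns a logical matrix; colSums() adds up the TRUEs per column.
# Note: the empty string in `sex` is NOT counted as missing.
na_counts <- colSums(is.na(toy))
na_counts

# Indices of rows that contain at least one NA
rows_with_na <- which(!complete.cases(toy))
rows_with_na
```

Running the same two calls on every data set that enters the pipeline, not only on the baked training data, is a cheap way to see where missing values remain.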
Solution
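The error is that the recipe makes no changes to the outcome variable, yet the outcome has missing values (NA) in the test set. The same row also lacks the three predictors named in the error message: step_naomit() defaults to skip = TRUE, so rows with missing values are dropped while fitting on the training data but not when predicting on new data. Missing values per column in the test set: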
         rownames           species            island    bill_length_mm
                0                 0                 0                 1
    bill_depth_mm flipper_length_mm       body_mass_g               sex
                1                 1                 1                 0
             year
                0
One missing value, to be precise. That single missing value spoils things for us, so we drop the incomplete rows from the test set:
d_test_nona <-
  d_test |>
  na.omit()
And now it works.
preds <- predict(wf1_fit, new_data = d_test_nona)
preds |>
head()
# A tibble: 6 × 1
.pred
<dbl>
1 3952.
2 3675.
3 3615.
4 3806.
5 3490.
6 3390.
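Why did the recipe not remove the incomplete row on its own? The following minimal sketch illustrates the mechanism on made-up data (`toy_train`, `toy_test` and their values are hypothetical). It assumes the documented default `skip = TRUE` of `step_naomit()`: the step is applied when the recipe is prepped on the training data, but skipped when new data is baked, which is what happens inside `predict()` on a workflow.

```r
library(recipes)

# Toy data (hypothetical values), standing in for d_train and d_test
toy_train <- data.frame(
  y = c(10, 12, 14, 16),
  x = c(1, 2, NA, 4)
)
toy_test <- data.frame(
  y = c(11, 13),
  x = c(NA, 3)
)

rec_toy <- recipe(y ~ x, data = toy_train) |>
  step_naomit(all_predictors())  # default: skip = TRUE

rec_toy_prepped <- prep(rec_toy)

# Training data: the step is applied, so the row with NA is removed
baked_train <- bake(rec_toy_prepped, new_data = NULL)
n_train_rows <- nrow(baked_train)

# New data: the step is skipped, so the row with NA is still there
baked_test <- bake(rec_toy_prepped, new_data = toy_test)
n_test_na <- sum(is.na(baked_test$x))

n_train_rows
n_test_na
```

Removing the incomplete rows from the test set by hand, as done above with `na.omit()`, keeps the predictions aligned row by row with `d_test_nona`.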
This Stack Overflow post deals with a comparable problem.
Categories:
- tidymodels
- statlearning
- error
- NA
- string