predictioncontest1

ds1

string

Published

May 17, 2023

Question

Task

Create an analysis that corresponds to a typical prediction project!

Use the penguins dataset!

Predict the variable body_mass_g.

Hints:

Keep the analysis simple.
Split the data into test and train sets of equal size (50:50).
Split the train set into analysis and assessment sets 3:1.
You can obtain the penguins dataset either from the palmerpenguins package or, for example, import it from here via read_csv().
For everything else, follow the general guidelines of the Datenwerk.

Solution

Load packages:

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ recipes      1.0.8
✔ dials        1.2.0     ✔ rsample      1.2.0
✔ dplyr        1.1.4     ✔ tibble       3.2.1
✔ ggplot2      3.5.0     ✔ tidyr        1.3.1
✔ infer        1.0.5     ✔ tune         1.1.2
✔ modeldata    1.3.0     ✔ workflows    1.1.3
✔ parsnip      1.2.0     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.3.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ lubridate 1.9.3     ✔ stringr   1.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(easystats)

# Attaching packages: easystats 0.7.0 (red = needs update)
✔ bayestestR  0.13.2   ✔ correlation 0.8.4 
✖ datawizard  0.9.1    ✖ effectsize  0.8.6 
✖ insight     0.19.8   ✔ modelbased  0.8.7 
✖ performance 0.10.9   ✖ parameters  0.21.5
✔ report      0.5.8    ✖ see         0.8.2 

Restart the R-Session and update packages with `easystats::easystats_update()`.

data("penguins", package = "palmerpenguins")

Remember that an R package must be installed (once) before you can access it, for example to obtain data from it, such as the penguins dataset.

Shuffle the rows and split into train and test sets:

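penguins2 <-
  penguins %>% 
  sample_n(size = nrow(.))  # shuffle all rows (no seed is set, so the order differs between runs)

d_train <- penguins2 %>% slice(1:(344/2))  # first half: rows 1 to 172
d_test <- penguins2 %>% slice(173:nrow(penguins))  # second half: rows 173 to 344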
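The manual shuffle-and-slice above can also be expressed with rsample, which is loaded as part of tidymodels. The following is a minimal sketch, not the author's method: the seed value and the object names `split_half`, `d_train_alt`, and `d_test_alt` are assumptions introduced for illustration.

```r
library(rsample)

data("penguins", package = "palmerpenguins")

set.seed(42)  # assumed seed; any fixed value makes the split reproducible

# prop = 0.5 puts half of the rows into the training part
split_half <- initial_split(penguins, prop = 0.5)

d_train_alt <- training(split_half)  # hypothetical name, mirrors d_train
d_test_alt  <- testing(split_half)   # hypothetical name, mirrors d_test

nrow(d_train_alt)
nrow(d_test_alt)
```

Note that with this variant the test rows are no longer rows 173 to 344 of a shuffled table, so any id column for a submission would have to be created before splitting.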

Split the train set further:

d_split <- initial_split(d_train)  # default prop = 3/4, i.e. analysis : assessment = 3:1

d_analysis <- training(d_split)
d_assessment <- testing(d_split)
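The 3:1 requirement is met here because `initial_split()` uses `prop = 3/4` by default. As a small sanity check, the sketch below makes the proportion explicit on a toy data frame with 172 rows (the size of d_train); `toy_train`, `toy_split`, and the seed are assumptions for illustration. The resulting 129 analysis rows agree with the n = 129 that describe_distribution() reports for the baked analysis set.

```r
library(rsample)

# Toy data frame with as many rows as d_train (172); contents are arbitrary
toy_train <- data.frame(row_id = 1:172)

set.seed(123)  # assumed seed

# Explicit 3:1 split: three quarters analysis, one quarter assessment
toy_split <- initial_split(toy_train, prop = 3/4)

nrow(training(toy_split))  # analysis rows: 129
nrow(testing(toy_split))   # assessment rows: 43
```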

Define the recipe:

rec1 <-
  recipe(body_mass_g ~ ., data = d_analysis) %>% 
  step_impute_knn(all_predictors()) %>% 
  step_normalize(all_numeric(), -all_outcomes())

Check the recipe:

d_analysis_baked <- 
  rec1 %>% 
  prep() %>% 
  bake(new_data = NULL)

describe_distribution(d_analysis_baked)

Variable          |      Mean |     SD |     IQR |              Range | Skewness | Kurtosis |   n | n_Missing
-------------------------------------------------------------------------------------------------------------
bill_length_mm    | -5.75e-16 |   1.00 |    1.63 |      [-2.06, 2.72] |     0.10 |    -0.76 | 129 |         0
bill_depth_mm     | -1.12e-16 |   1.00 |    1.52 |      [-2.12, 2.30] |    -0.24 |    -0.70 | 129 |         0
flipper_length_mm |  1.06e-16 |   1.00 |    1.76 |      [-1.62, 2.12] |     0.42 |    -1.15 | 129 |         0
year              |  2.18e-15 |   1.00 |    2.51 |      [-1.27, 1.25] |    -0.01 |    -1.42 | 129 |         0
body_mass_g       |   4187.30 | 812.31 | 1200.00 | [2700.00, 6050.00] |     0.55 |    -0.56 | 128 |         1
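The table shows one missing value remaining in body_mass_g: the recipe imputes predictors only, not the outcome. `lm()` drops such rows by default when fitting, so the workflow still runs. If you prefer to make this explicit, one option is to remove rows with a missing outcome from the training data beforehand. A minimal sketch, assuming tidyr is available; the name `penguins_known_y` is hypothetical:

```r
library(tidyr)

data("penguins", package = "palmerpenguins")

# Count missing outcome values in the raw data
sum(is.na(penguins$body_mass_g))

# Keep only rows where the outcome is observed (intended for training data only)
penguins_known_y <- drop_na(penguins, body_mass_g)

nrow(penguins_known_y)
```

For a contest submission, rows of the test set should not be dropped this way, since a prediction is typically expected for every id.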

Define the workflow and the cross-validation scheme:

m1 <- 
  linear_reg()

wf1 <-
  workflow() %>% 
  add_recipe(rec1) %>% 
  add_model(m1)

cv_scheme <- vfold_cv(d_analysis, v = 2)

Fit (no tuning here):

fit1 <-
  wf1 %>% 
  tune_grid(resamples = cv_scheme)

Warning: No tuning parameters have been detected, performance will be evaluated
using the resamples with no tuning. Did you want to [tune()] parameters?
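The warning appears because `tune_grid()` is meant for workflows with `tune()` placeholders. When nothing is tuned, `fit_resamples()` from the tune package does the same resampling evaluation without the warning. The sketch below illustrates this on the built-in mtcars data rather than penguins, to stay self-contained; the formula, the seed, and the names `folds`, `wf_toy`, and `res_toy` are assumptions for illustration.

```r
library(tidymodels)

set.seed(1)  # assumed seed

# Two-fold cross-validation, as in the text, but on mtcars
folds <- vfold_cv(mtcars, v = 2)

# Workflow with a plain formula preprocessor and a linear model
wf_toy <-
  workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg())

# Evaluate the untuned workflow on the resamples
res_toy <- fit_resamples(wf_toy, resamples = folds)

collect_metrics(res_toy)
```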

Finalize:

show_best(fit1)

Warning: No value of `metric` was given; metric 'rmse' will be used.

# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard    316.     2    6.86 Preprocessor1_Model1

wf1_final <-
  wf1 %>% 
  finalize_workflow(show_best(fit1))

Warning: No value of `metric` was given; metric 'rmse' will be used.

wf1_final

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_impute_knn()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

Model performance:

fit1_final <-
  wf1_final %>% 
  last_fit(d_split)


collect_metrics(fit1_final)

# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard     271.    Preprocessor1_Model1
2 rsq     standard       0.896 Preprocessor1_Model1

fit1_train <-
  wf1_final %>% 
  fit(d_train)


fit1_test <-
  fit1_train %>% 
  predict(d_test)

head(fit1_test)

# A tibble: 6 × 1
  .pred
  <dbl>
1 4366.
2 5444.
3 4154.
4 3277.
5 4164.
6 4905.

See https://workflows.tidymodels.org/reference/predict-workflow.html

Submit:
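When the true outcome values of the test set are available, the predictions can be scored with yardstick, for instance via `rmse()`. The following sketch uses a small made-up tibble, not the penguins results; the numbers and the name `toy_preds` are assumptions for illustration only.

```r
library(yardstick)
library(tibble)

# Made-up truth and prediction values, purely illustrative
toy_preds <- tibble(
  truth = c(4000, 5000, 3500),
  .pred = c(4100, 4900, 3500)
)

# Root mean squared error between truth and prediction
toy_rmse <- rmse(toy_preds, truth = truth, estimate = .pred)
toy_rmse
```

With real data, the same call would take the test set with the `.pred` column bound to it, for example after `bind_cols()`.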

subm_df <-
  d_test %>% 
  mutate(id = 173:344) %>% 
  bind_cols(fit1_test) %>% 
  select(id, .pred) %>% 
  rename(pred = .pred)

And save as a CSV file:

#write_csv(subm_df, file = "submission_blabla.csv")
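To check that a submission file has the expected shape (columns id and pred), it can help to write it out and read it back. A minimal sketch, assuming a toy submission table and a temporary file; `subm_toy` and `out_file` are hypothetical names, and the three values are rounded from the head() output above.

```r
library(readr)
library(tibble)

# Toy submission with the same columns as subm_df
subm_toy <- tibble(
  id   = 173:175,
  pred = c(4366, 5444, 4154)
)

# Write to a temporary file instead of the working directory
out_file <- tempfile(fileext = ".csv")
write_csv(subm_toy, file = out_file)

# Read it back to verify column names and row count
subm_check <- read_csv(out_file, show_col_types = FALSE)
subm_check
```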

Categories:

R
ds1
sose22
string