tmdb06

ds1

tidymodels

statlearning

tmdb

random-forest

num

Published

May 17, 2023

Aufgabe

Melden Sie sich an für die Kaggle Competition TMDB Box Office Prediction - Can you predict a movie’s worldwide box office revenue?.

Sie benötigen dazu ein Konto; es ist auch möglich, sich mit seinem Google-Konto anzumelden.

Bei diesem Prognosewettbewerb geht es darum, vorherzusagen, wieviel Umsatz wohl einige Filme machen werden. Als Prädiktoren stehen einige Infos wie Budget, Genre, Titel etc. zur Verfügung. Eine klassische “predictive Competition” also :-) Allerdings können immer ein paar Schwierigkeiten auftreten ;-)

Aufgabe

Erstellen Sie ein Lineares Modell mit Tidymodels!

Hinweise

Verzichten Sie auf Vorverarbeitung.
Verzichten Sie auf Tuning.
Reichen Sie das Modell ein und berichten Sie Ihren Score.
Begrenzen Sie sich auf folgende Prädiktoren.
Verwenden Sie (langweiligerweise) nur ein lineares Modell.

preds_chosen <- 
  c("id", "budget", "popularity", "runtime")

Lösung

Pakete starten

library(tidyverse)
library(tidymodels)
library(tictoc)
library(finetune)  # Anova Race
library(doParallel)  # parallele Verarbeitung

Daten importieren

d_train_path <- "https://raw.githubusercontent.com/sebastiansauer/Lehre/main/data/tmdb-box-office-prediction/train.csv"
d_test_path <- "https://raw.githubusercontent.com/sebastiansauer/Lehre/main/data/tmdb-box-office-prediction/test.csv"

d_train <- read_csv(d_train_path)
d_test <- read_csv(d_test_path)

Werfen wir einen Blick in die Daten:

glimpse(d_train)

Rows: 3,000
Columns: 23
$ id                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ belongs_to_collection <chr> "[{'id': 313576, 'name': 'Hot Tub Time Machine C…
$ budget                <dbl> 1.40e+07, 4.00e+07, 3.30e+06, 1.20e+06, 0.00e+00…
$ genres                <chr> "[{'id': 35, 'name': 'Comedy'}]", "[{'id': 35, '…
$ homepage              <chr> NA, NA, "http://sonyclassics.com/whiplash/", "ht…
$ imdb_id               <chr> "tt2637294", "tt0368933", "tt2582802", "tt182148…
$ original_language     <chr> "en", "en", "en", "hi", "ko", "en", "en", "en", …
$ original_title        <chr> "Hot Tub Time Machine 2", "The Princess Diaries …
$ overview              <chr> "When Lou, who has become the \"father of the In…
$ popularity            <dbl> 6.575393, 8.248895, 64.299990, 3.174936, 1.14807…
$ poster_path           <chr> "/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg", "/w9Z7A0GHEh…
$ production_companies  <chr> "[{'name': 'Paramount Pictures', 'id': 4}, {'nam…
$ production_countries  <chr> "[{'iso_3166_1': 'US', 'name': 'United States of…
$ release_date          <chr> "2/20/15", "8/6/04", "10/10/14", "3/9/12", "2/5/…
$ runtime               <dbl> 93, 113, 105, 122, 118, 83, 92, 84, 100, 91, 119…
$ spoken_languages      <chr> "[{'iso_639_1': 'en', 'name': 'English'}]", "[{'…
$ status                <chr> "Released", "Released", "Released", "Released", …
$ tagline               <chr> "The Laws of Space and Time are About to be Viol…
$ title                 <chr> "Hot Tub Time Machine 2", "The Princess Diaries …
$ Keywords              <chr> "[{'id': 4379, 'name': 'time travel'}, {'id': 96…
$ cast                  <chr> "[{'cast_id': 4, 'character': 'Lou', 'credit_id'…
$ crew                  <chr> "[{'credit_id': '59ac067c92514107af02c8c8', 'dep…
$ revenue               <dbl> 12314651, 95149435, 13092000, 16000000, 3923970,…

glimpse(d_test)

Rows: 4,398
Columns: 22
$ id                    <dbl> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, …
$ belongs_to_collection <chr> "[{'id': 34055, 'name': 'Pokémon Collection', 'p…
$ budget                <dbl> 0.00e+00, 8.80e+04, 0.00e+00, 6.80e+06, 2.00e+06…
$ genres                <chr> "[{'id': 12, 'name': 'Adventure'}, {'id': 16, 'n…
$ homepage              <chr> "http://www.pokemon.com/us/movies/movie-pokemon-…
$ imdb_id               <chr> "tt1226251", "tt0051380", "tt0118556", "tt125595…
$ original_language     <chr> "ja", "en", "en", "fr", "en", "en", "de", "en", …
$ original_title        <chr> "ディアルガVSパルキアVSダークライ", "Attack of t…
$ overview              <chr> "Ash and friends (this time accompanied by newco…
$ popularity            <dbl> 3.851534, 3.559789, 8.085194, 8.596012, 3.217680…
$ poster_path           <chr> "/tnftmLMemPLduW6MRyZE0ZUD19z.jpg", "/9MgBNBqlH1…
$ production_companies  <chr> NA, "[{'name': 'Woolner Brothers Pictures Inc.',…
$ production_countries  <chr> "[{'iso_3166_1': 'JP', 'name': 'Japan'}, {'iso_3…
$ release_date          <chr> "7/14/07", "5/19/58", "5/23/97", "9/4/10", "2/11…
$ runtime               <dbl> 90, 65, 100, 130, 92, 121, 119, 77, 120, 92, 88,…
$ spoken_languages      <chr> "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_…
$ status                <chr> "Released", "Released", "Released", "Released", …
$ tagline               <chr> "Somewhere Between Time & Space... A Legend Is B…
$ title                 <chr> "Pokémon: The Rise of Darkrai", "Attack of the 5…
$ Keywords              <chr> "[{'id': 11451, 'name': 'pok√©mon'}, {'id': 1155…
$ cast                  <chr> "[{'cast_id': 3, 'character': 'Tonio', 'credit_i…
$ crew                  <chr> "[{'credit_id': '52fe44e7c3a368484e03d683', 'dep…

preds_chosen sind alle Prädiktoren im Datensatz, oder nicht? Das prüfen wir mal kurz:

preds_chosen %in% names(d_train) %>% 
  all()

[1] TRUE

Ja, alle Elemente von preds_chosen sind Prädiktoren im (Train-)Datensatz.

CV

Wir brauchen keine CV, da wir keine Tuningparameter haben.

cv_scheme <- vfold_cv(d_train)

Rezept

rec1 <- 
  recipe(revenue ~ budget + popularity + runtime, data = d_train) %>% 
  step_impute_bag(all_predictors()) %>% 
  step_naomit(all_predictors()) 
rec1

Man beachte, dass noch 21 Prädiktoren angezeigt werden, da das Rezept noch nicht auf den Datensatz angewandt (“gebacken”) wurde.

tidy(rec1)

# A tibble: 2 × 6
  number operation type       trained skip  id              
   <int> <chr>     <chr>      <lgl>   <lgl> <chr>           
1      1 step      impute_bag FALSE   FALSE impute_bag_sBkeX
2      2 step      naomit     FALSE   TRUE  naomit_NxiQP

Rezept checken:

prep(rec1)

d_train_baked <-
  rec1 %>% 
  prep() %>% 
  bake(new_data = NULL)

glimpse(d_train_baked)

Rows: 3,000
Columns: 4
$ budget     <dbl> 1.40e+07, 4.00e+07, 3.30e+06, 1.20e+06, 0.00e+00, 8.00e+06,…
$ popularity <dbl> 6.575393, 8.248895, 64.299990, 3.174936, 1.148070, 0.743274…
$ runtime    <dbl> 93, 113, 105, 122, 118, 83, 92, 84, 100, 91, 119, 98, 122, …
$ revenue    <dbl> 12314651, 95149435, 13092000, 16000000, 3923970, 3261638, 8…

Fehlende Werte noch übrig?

library(easystats)
describe_distribution(d_train_baked) %>% 
  select(Variable, n_Missing)

Variable   | n_Missing
----------------------
budget     |         0
popularity |         0
runtime    |         0
revenue    |         0

Modell

model_lm <- linear_reg()

Workflow

wf1 <-
  workflow() %>% 
  add_model(model_lm) %>% 
  add_recipe(rec1)

Modell fitten (und tunen)

#doParallel::registerDoParallel(4)
tic()
lm_fit1 <-
  wf1 %>% 
  fit(d_train)
toc()

0.645 sec elapsed

preds <-
  lm_fit1 %>% 
  predict(d_test)

Submission df

submission_df <-
  d_test %>% 
  select(id) %>% 
  bind_cols(preds) %>% 
  rename(revenue = .pred)

head(submission_df)

# A tibble: 6 × 2
     id   revenue
  <dbl>     <dbl>
1  3001 -4147506.
2  3002 -8808140.
3  3003  8523980.
4  3004 31675099.
5  3005  -504355.
6  3006 13531355.

Abspeichern und einreichen:

#write_csv(submission_df, file = "submission.csv")

Kaggle Score

Diese Submission erzielte einen Score von Score: 6.14787 (RMSLE).

sol <- 6.14787

Categories:

ds1
tidymodels
statlearning
tmdb
random-forest
num