Simple way to separate train and test sample in R

For statistical modeling, it is typical to separate a train sample from a test sample. The training sample is used to build (“train”) the model, whereas the test sample is used to gauge the predictive quality of the model.

There are many ways to split off a test sample from the train sample. One quite simple, tidyverse-oriented way, is the following.

First, load the tidyverse. Next, load some data.

library(tidyverse)
data(Affairs, package = "AER")

Then, create an index vector of the length of your train sample, say 80% of the total sample size.

set.seed(42)
index <- sample(1:601, size = trunc(.8 * 601))

Put bluntly, we draw 480 (.8*601) cases from the dataset, and note their row numbers.

a_train <- Affairs %>%
  filter(row_number() %in% index)

The test set is the complement of the train set, drawn similarly:

a_test <- Affairs %>%
  filter(!(row_number() %in% index))

Simple way to separate train and test sample in R

October 17, 2017

This blog has moved

Wie gut schätzt eine Stichprobe die Grundgesamtheit?

Some thoughts on tidyveal and environments in R