Multiple ways of subsetting data frames in R

October 15, 2016

Subsetting a data frame is an essential and frequently performed task. Here, some basic ideas are presented.

Get some data first.

str(mtcars)

## 'data.frame':	32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

mtcars$mpg

## [1] 21.0 21.0 22.8 21.4 18.7 18.1

Whatever comes after the $ is understood by R as the name (without quotation marks) of the column within that data frame. $ is a shorthand for [[]] (but not exactly the same; see here for an excellent overview).

Note that the outputs from here on are limited to the first six rows of mtcars.

Two: Addressing dataframe as a matrix (2-dim-structure, 2-dim-matrix)

As data frames can also be addressed as rectangular, two-dimensional matrices, we may subset specific elements using an x-y-coordinate scheme. In R, the row is addressed first and the column second; e.g., mtcars[1,2] is the first row of the second column.

mtcars[1,1]

## [1] 21

mtcars[1, c(1,2)]

##           mpg cyl
## Mazda RX4  21   6

mtcars[1, 1:3]

##           mpg cyl disp
## Mazda RX4  21   6  160

mtcars[1, c(1:3)]

##           mpg cyl disp
## Mazda RX4  21   6  160

mtcars[, c(1:3)]

##                    mpg cyl disp
## Mazda RX4         21.0   6  160
## Mazda RX4 Wag     21.0   6  160
## Datsun 710        22.8   4  108
## Hornet 4 Drive    21.4   6  258
## Hornet Sportabout 18.7   8  360
## Valiant           18.1   6  225

mtcars[1, "mpg"]

## [1] 21

mtcars[1, c("mpg", "cyl")]

##           mpg cyl
## Mazda RX4  21   6

Again, the c() function may be used to group several rows or columns. Columns may again be addressed by their names (addressing rows by name is unusual). The colon operator : is allowed, too.

Three: Logical subsetting in dataframes

mtcars[c(T, T, F, F, F, F, F, F, F, F, T)]

##                    mpg cyl carb
## Mazda RX4         21.0   6    4
## Mazda RX4 Wag     21.0   6    4
## Datsun 710        22.8   4    1
## Hornet 4 Drive    21.4   6    1
## Hornet Sportabout 18.7   8    2
## Valiant           18.1   6    1

mtcars[c(T, T, F)]

##                    mpg cyl  hp drat  qsec vs gear carb
## Mazda RX4         21.0   6 110 3.90 16.46  0    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 17.02  0    4    4
## Datsun 710        22.8   4  93 3.85 18.61  1    4    1
## Hornet 4 Drive    21.4   6 110 3.08 19.44  1    3    1
## Hornet Sportabout 18.7   8 175 3.15 17.02  0    3    2
## Valiant           18.1   6 105 2.76 20.22  1    3    1

In the first example above, columns #1, #2, and #11 are selected, because their positions are indexed as TRUE (or T).
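The logical vector in that first example picks the same columns as selecting by position or by name. A minimal sketch of this equivalence, using only base R and the built-in mtcars data (the variable names are made up for illustration):

```r
# Three ways to pick columns 1, 2, and 11 of mtcars
by_logical  <- mtcars[c(TRUE, TRUE, rep(FALSE, 8), TRUE)]
by_position <- mtcars[c(1, 2, 11)]
by_name     <- mtcars[c("mpg", "cyl", "carb")]

# All three return the same data frame
identical(by_logical, by_position)
identical(by_position, by_name)
```

Writing TRUE and FALSE in full is a little safer than T and F, because T and F are ordinary variables that can be reassigned.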
Note that if you supply fewer elements than the length of the object (e.g., here 11 columns/elements), R will recycle your elements until the full length of the object is met (here: TTF-TTF-TTF-TT).

Again, the data frame can be addressed either as a list (1-dim), or as a 2-dim matrix. Here is an example using logical indexing and addressing the data frame as a 2-dim matrix:

mtcars[c(T, T, F), c(T, T, F)]

##                    mpg cyl  hp drat  qsec vs gear carb
## Mazda RX4         21.0   6 110 3.90 16.46  0    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 17.02  0    4    4
## Hornet 4 Drive    21.4   6 110 3.08 19.44  1    3    1
## Hornet Sportabout 18.7   8 175 3.15 17.02  0    3    2

Actually, logical subsetting is quite powerful. We can use a predicate, i.e., an expression that returns a logical value (TRUE or FALSE) for each row, within the subsetting:

mtcars[mtcars$cyl == 6, c(1,2)]

##                 mpg cyl
## Mazda RX4      21.0   6
## Mazda RX4 Wag  21.0   6
## Hornet 4 Drive 21.4   6
## Valiant        18.1   6


Here, we declared that we only want rows for which the following condition is TRUE: mtcars$cyl == 6. (And only columns 1 and 2.)
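To see what happens inside the brackets, it can help to build the logical vector first and then use it. The sketch below assumes we work on head(mtcars), i.e., the first six rows, so that the result matches the four rows shown above; the names first_six, is_six_cyl, and so on are made up for illustration.

```r
# Assumption: restrict to the first six rows, matching the output above
first_six <- head(mtcars)

# The predicate yields one TRUE/FALSE value per row
is_six_cyl <- first_six$cyl == 6
is_six_cyl

# Keep the rows where the predicate is TRUE, and only columns 1 and 2
six_cyl <- first_six[is_six_cyl, c(1, 2)]
six_cyl

# Conditions can be combined with & (and) or | (or)
six_cyl_manual <- first_six[first_six$cyl == 6 & first_six$am == 1, c("mpg", "cyl")]
six_cyl_manual
```

On the full mtcars data, the same predicate would select more rows, since further six-cylinder cars appear after row six.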

Final words

Subsetting in R is an essential task. It is also not so easy, as many slightly different variants exist. Here, only some ideas were presented. A much broader overview is excellently presented by Hadley Wickham here.
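How slightly the variants differ becomes visible when comparing what each one returns for a single column. This is only a sketch with base R and mtcars; the variable names are made up for illustration.

```r
col_dollar <- mtcars$mpg                     # numeric vector
col_double <- mtcars[["mpg"]]                # numeric vector
col_single <- mtcars["mpg"]                  # data frame with one column
col_matrix <- mtcars[, "mpg"]                # numeric vector (drop = TRUE by default)
col_nodrop <- mtcars[, "mpg", drop = FALSE]  # data frame with one column

class(col_dollar)
class(col_single)

# [[ ]] accepts a column name stored in a variable; $ takes the name literally
wanted <- "mpg"
identical(mtcars[[wanted]], mtcars$mpg)
```

So $ and [[ ]] extract the column itself, whereas single brackets used list-style keep the data frame around it.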

While subsetting with base R should be well understood, it may be more convenient to use functions such as select from dplyr.
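For comparison, here is a rough sketch of the six-cylinder example in both styles. It assumes the dplyr package is installed. Besides select, it uses dplyr's filter for the row condition, which is not discussed in this post.

```r
# Assumption: dplyr is installed, e.g. via install.packages("dplyr")
library(dplyr)

# Base R: rows with six cylinders, columns mpg and cyl
base_result <- mtcars[mtcars$cyl == 6, c("mpg", "cyl")]

# dplyr: filter() picks rows, select() picks columns
dplyr_result <- mtcars %>%
  filter(cyl == 6) %>%
  select(mpg, cyl)

# Both approaches return the same values
all.equal(base_result$mpg, dplyr_result$mpg)
```

The dplyr version names the columns without quotation marks and without repeating mtcars$, which many find easier to read.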
