# Multiple ways of subsetting data frames in R

## October 15, 2016

Subsetting a data frame is an essential and frequently performed task. Here, some basic ideas are presented.

Get some data first.

str(mtcars)

## 'data.frame':	32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...

mtcars$mpg

## [1] 21.0 21.0 22.8 21.4 18.7 18.1

Whatever comes after the $ is understood by R as the name (without quotation marks) of a column within that data frame. $ is a shorthand for [[]] (but not exactly the same; see here for an excellent overview).

## Two: Addressing a data frame as a matrix (2-dim structure)

As data frames can also be addressed as rectangular, two-dimensional matrices, we can subset specific elements using an x-y coordinate scheme. In R, the row is addressed first and the column second; e.g., mtcars[1, 2] is the value in the first row of the second column.

mtcars[1,1]

## [1] 21

mtcars[1, c(1,2)]

##           mpg cyl
## Mazda RX4  21   6

mtcars[1, 1:3]

##           mpg cyl disp
## Mazda RX4  21   6  160

mtcars[1, c(1:3)]

##           mpg cyl disp
## Mazda RX4  21   6  160

mtcars[, c(1:3)]

##                    mpg cyl disp
## Mazda RX4         21.0   6  160
## Mazda RX4 Wag     21.0   6  160
## Datsun 710        22.8   4  108
## Hornet 4 Drive    21.4   6  258
## Hornet Sportabout 18.7   8  360
## Valiant           18.1   6  225

mtcars[1, "mpg"]

## [1] 21

mtcars[1, c("mpg", "cyl")]

##           mpg cyl
## Mazda RX4  21   6

The c() function can be used to group several rows or columns. Columns can also be addressed by their names (rows can be addressed by name too, though that is less common). The colon operator : is allowed as well.

## Three: Logical subsetting in data frames

mtcars[c(T, T, F, F, F, F, F, F, F, F, T)]

##                    mpg cyl carb
## Mazda RX4         21.0   6    4
## Mazda RX4 Wag     21.0   6    4
## Datsun 710        22.8   4    1
## Hornet 4 Drive    21.4   6    1
## Hornet Sportabout 18.7   8    2
## Valiant           18.1   6    1

mtcars[c(T, T, F)]

##                    mpg cyl  hp drat  qsec vs gear carb
## Mazda RX4         21.0   6 110 3.90 16.46  0    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 17.02  0    4    4
## Datsun 710        22.8   4  93 3.85 18.61  1    4    1
## Hornet 4 Drive    21.4   6 110 3.08 19.44  1    3    1
## Hornet Sportabout 18.7   8 175 3.15 17.02  0    3    2
## Valiant           18.1   6 105 2.76 20.22  1    3    1

In the first example above, columns 1, 2, and 11 are selected, because their positions are indexed as TRUE (or T).
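The addressing styles shown so far (by name with $, by position as a matrix, and by logical mask) can be checked against each other. The following is a minimal sketch, not part of the original post; it assumes a six-row copy of mtcars (called cars6 here, a name introduced only for illustration) because the outputs in this post show six rows.

```r
# Assumption: a six-row copy of mtcars, matching the six-row outputs in this post.
cars6 <- head(mtcars)

# List-style access: $ and [[ ]] both return the column as a plain vector.
by_dollar  <- cars6$mpg
by_bracket <- cars6[["mpg"]]
identical(by_dollar, by_bracket)

# Matrix-style access: row first, column second.
cars6[1, 2]        # first row, second column (cyl of Mazda RX4)
cars6[1, "cyl"]    # same cell, column addressed by name

# Selecting a single column matrix-style returns a vector by default;
# drop = FALSE keeps a one-column data frame instead.
class(cars6[, "mpg"])
class(cars6[, "mpg", drop = FALSE])

# Logical mask over the 11 columns: TRUE keeps a column, FALSE drops it.
mask <- c(TRUE, TRUE, rep(FALSE, 8), TRUE)
names(cars6)[mask]
```

Writing TRUE and FALSE in full, rather than T and F, is a deliberate choice in this sketch: T and F are ordinary variables that can be reassigned, whereas TRUE and FALSE are reserved words.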
Note that if you supply fewer elements than the object has (here, 11 columns), R recycles your vector until the full length is reached (here: TTF-TTF-TTF-TT).

Again, the data frame can be addressed either as a list (1-dim) or as a 2-dim matrix. The following example uses logical indexing while addressing the data frame as a 2-dim matrix:

mtcars[c(T, T, F), c(T, T, F)]

##                    mpg cyl  hp drat  qsec vs gear carb
## Mazda RX4         21.0   6 110 3.90 16.46  0    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 17.02  0    4    4
## Hornet 4 Drive    21.4   6 110 3.08 19.44  1    3    1
## Hornet Sportabout 18.7   8 175 3.15 17.02  0    3    2

Logical subsetting is quite powerful. We can use a predicate, i.e., an expression that evaluates to a logical state (TRUE or FALSE) for each row, within the subsetting:

mtcars[mtcars$cyl == 6, c(1,2)]

##                 mpg cyl
## Mazda RX4      21.0   6
## Mazda RX4 Wag  21.0   6
## Hornet 4 Drive 21.4   6
## Valiant        18.1   6


Here, we declared that we want only the rows for which the condition mtcars$cyl == 6 is TRUE (and only columns 1 and 2).
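To see what happens inside the brackets, it can help to compute the logical vectors separately before using them as an index. The sketch below is an illustration under assumptions, not the author's code: cars6 is a hypothetical six-row copy of mtcars (matching the outputs above), and the second condition on mpg is added purely as an example.

```r
# Assumption: a six-row copy of mtcars, matching the outputs shown in this post.
cars6 <- head(mtcars)

# Recycling: a short logical vector is repeated until it covers all 11 columns.
short_mask <- c(TRUE, TRUE, FALSE)
full_mask  <- rep(short_mask, length.out = ncol(cars6))
identical(names(cars6[short_mask]), names(cars6)[full_mask])

# A condition yields one TRUE/FALSE value per row ...
is_six <- cars6$cyl == 6
is_six

# ... and that vector can be used directly as the row index.
six_cyl <- cars6[is_six, c("mpg", "cyl")]
six_cyl

# Conditions can be combined with & (and) and | (or).
six_and_frugal <- cars6[cars6$cyl == 6 & cars6$mpg > 20, c("mpg", "cyl")]
rownames(six_and_frugal)
```

Storing the condition in a named vector such as is_six keeps the subsetting line short and makes it easy to inspect which rows will be selected.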

## Final words

Subsetting in R is an essential task. It is also not so easy, as many slightly different variants exist. Here, only some ideas were presented. A much broader overview is excellently presented by Hadley Wickham here.

While subsetting with base R should be well understood, it may be more convenient to use functions such as select from dplyr.
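As a rough comparison, the cyl == 6 example from above might look as follows with dplyr. This is a sketch under assumptions: cars6 is a hypothetical six-row copy of mtcars, dplyr is a third-party package that may not be installed (hence the guard), and pairing select with filter is my choice for the illustration, not something the post prescribes.

```r
# Assumption: a six-row copy of mtcars, matching the outputs shown in this post.
cars6 <- head(mtcars)

# Base R: rows where cyl equals 6, columns mpg and cyl.
base_result <- cars6[cars6$cyl == 6, c("mpg", "cyl")]

# dplyr is a third-party package; this part runs only if it is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  # filter() picks rows by condition, select() picks columns by bare name.
  dplyr_result <- dplyr::select(dplyr::filter(cars6, cyl == 6), mpg, cyl)
  # Compare column values only; row-name handling is not assumed here.
  print(identical(base_result$mpg, dplyr_result$mpg))
}
```

In the dplyr version, column names are written without quotation marks and without repeating the data frame name, which is where most of the convenience comes from.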
