Subsetting a data frame is an essential and frequently performed task. Here, some basic ideas are presented.
Get some data first.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
mtcars <- head(mtcars) # for shorter output
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
One: Addressing dataframe as list (vector, 1-dim structure)
Data frames can be understood/addressed as lists, ie., as some type of vectors. Vectors have one dimension. Thus, we can access/subset with one index only (one dimension). For example mtcars[1]
selects the first element (ie., column) of mtcars
.
mtcars[1]
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
mtcars["mpg"]
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
mtcars[c("mpg", "cyl")]
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
mtcars[1:3]
## mpg cyl disp
## Mazda RX4 21.0 6 160
## Mazda RX4 Wag 21.0 6 160
## Datsun 710 22.8 4 108
## Hornet 4 Drive 21.4 6 258
## Hornet Sportabout 18.7 8 360
## Valiant 18.1 6 225
Note that data frames are vectors (lists) technically, where the vectors are the columns and the columns possess names. Thus we can address the columns by their names. Of course, we can select (address/index/subset) more than one element (column) using the c()
function.
Besides names, we can address the elements by their number: type a positive integer to subset the respective element. c
can be used here too.
Note that the :
(colon) operator is a short hand for c(from_this_column_to_that column)
.
Similarly to addressing the names of the data frames using brackets []
, we can use the Dollar $
operator:
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
Whatever comes after the $
is understood by R as the name (without quotation marks) of the column within that data frame. $
is a shorthand for [[]]
(but not exactly the same; see here for an excellent overview).
Two: Addressing dataframe as a matrix (2-dim-structure, 2-dim-matrix)
As data frames can also be addressed as rectangular, two-dimensional matrices, we may subset specific elements using a x-y-coordinate scheme where in R matrices the row is addressed first, and the column second, eg., mtcars(1,2)
would be first line of second column.
mtcars[1,1]
## [1] 21
mtcars[1, c(1,2)]
## mpg cyl
## Mazda RX4 21 6
mtcars[1, 1:3]
## mpg cyl disp
## Mazda RX4 21 6 160
mtcars[1, c(1:3)]
## mpg cyl disp
## Mazda RX4 21 6 160
mtcars[, c(1:3)]
## mpg cyl disp
## Mazda RX4 21.0 6 160
## Mazda RX4 Wag 21.0 6 160
## Datsun 710 22.8 4 108
## Hornet 4 Drive 21.4 6 258
## Hornet Sportabout 18.7 8 360
## Valiant 18.1 6 225
mtcars[1, "mpg"]
## [1] 21
mtcars[1, c("mpg", "cyl")]
## mpg cyl
## Mazda RX4 21 6
Again, the c()
operator may be used to group several rows or columns. Columns may again addressed by their names (row names are unusual). The :
colon operator is allowed, too.
Three: Logical subsetting in dataframes
mtcars[c(T, T, F, F, F, F, F, F, F, F, T)]
## mpg cyl carb
## Mazda RX4 21.0 6 4
## Mazda RX4 Wag 21.0 6 4
## Datsun 710 22.8 4 1
## Hornet 4 Drive 21.4 6 1
## Hornet Sportabout 18.7 8 2
## Valiant 18.1 6 1
mtcars[c(T, T, F)]
## mpg cyl hp drat qsec vs gear carb
## Mazda RX4 21.0 6 110 3.90 16.46 0 4 4
## Mazda RX4 Wag 21.0 6 110 3.90 17.02 0 4 4
## Datsun 710 22.8 4 93 3.85 18.61 1 4 1
## Hornet 4 Drive 21.4 6 110 3.08 19.44 1 3 1
## Hornet Sportabout 18.7 8 175 3.15 17.02 0 3 2
## Valiant 18.1 6 105 2.76 20.22 1 3 1
In the first example above, the columns #1, #2, #and #11 are selected, because their position is indexed as TRUE
(or T
).
Note that if you supply less elements than the length of the objects (eg., here 11 columns/elements), R will recycle your elements until the full length of the element is met (here: TTF-TTF-TTF-TT).
Again, the data frame can be addressed either as a list (1-dim), or as a 2-dim matrix. See here for an example using logical indexing and addressing the data frame as a 2-dim matrix:
mtcars[c(T, T, F), c(T, T, F)]
## mpg cyl hp drat qsec vs gear carb
## Mazda RX4 21.0 6 110 3.90 16.46 0 4 4
## Mazda RX4 Wag 21.0 6 110 3.90 17.02 0 4 4
## Hornet 4 Drive 21.4 6 110 3.08 19.44 1 3 1
## Hornet Sportabout 18.7 8 175 3.15 17.02 0 3 2
Actually, the logical subsetting is quite powerful. We can use a predicate function, ie., a function delivering a logial state (TRUE
or FALSE
) within the subsetting:
mtcars[mtcars$cyl == 6, c(1,2)]
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Hornet 4 Drive 21.4 6
## Valiant 18.1 6
Here, we declared that we only want rows for which the following condition is TRUE
: mtcars$cyl == 6
. (And only cols 1 and 2.)
Final words
Subsetting in R is an essential task. It is also not so easy, as many slightly different variants exist. Here, only some ideas were presented. A much broader are excellently presented by Hadley Wickham here.
Besides that subsettting using base R should be well understood, it may be more comfortable to use functions such as select
from dplyr
.