Different ways to count NAs over multiple columns

There are a number of ways in R to count NAs (missing values). A common use case is to count the NAs over multiple columns, i.e., a whole dataframe. That’s basically the question “how many NAs are there in each column of my dataframe?” This post demonstrates some ways to answer this question.

Way 1: using sapply

A typical (or classical) way in R to iterate is using apply and friends. sapply iterates over a list and simplifies the result if possible (hence the “s” in sapply).

sapply(mtcars, function(x) sum(is.na(x)))
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
#>    0    0    0    0    0    0    0    0    0    0    0

Pros: Straightforward. No dependencies on other packages. Tried and true.
Cons: Not type-stable: the return type depends on the input, so you cannot be sure whether you get back a vector, a matrix, or a list. That’s no problem in interactive use, but it is a risk when programming.
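
If type stability matters, base R's vapply is one option: it takes a template for the return value and fails loudly if a result does not match. A minimal sketch, using a small made-up dataframe df (not part of the original example) so that the counts are not all zero:

```r
# Hypothetical toy data with a few missing values
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  c = c(TRUE, FALSE, TRUE)
)

# vapply() requires every result to be a single integer,
# so the return type is known in advance
na_counts <- vapply(df, function(x) sum(is.na(x)), FUN.VALUE = integer(1))
na_counts
#> a b c
#> 1 2 0

# With zero columns, sapply() falls back to a list; vapply() keeps the type
class(sapply(df[0], function(x) sum(is.na(x))))
#> [1] "list"
class(vapply(df[0], function(x) sum(is.na(x)), FUN.VALUE = integer(1)))
#> [1] "integer"
```

The zero-column case is admittedly an edge case, but it is the kind of surprise that tends to show up inside functions rather than at the console.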

Way 2: using purrr::map

map maps (applies) a function to each element of a vector/list. Here, the following code reads as “Apply ‘sum(is.na(.))’ to each column of mtcars”. Mind the tilde ~ before the function. The dot . refers to the respective column.

library(tidyverse)
map(mtcars, ~sum(is.na(.)))
#> $mpg
#> [1] 0
#>
#> $cyl
#> [1] 0
#>
#> $disp
#> [1] 0
#>
#> $hp
#> [1] 0
#>
#> $drat
#> [1] 0
#>
#> $wt
#> [1] 0
#>
#> $qsec
#> [1] 0
#>
#> $vs
#> [1] 0
#>
#> $am
#> [1] 0
#>
#> $gear
#> [1] 0
#>
#> $carb
#> [1] 0

Pros: Modern, cool. Straightforward. Type-stable (map always returns a list). Complex function with lots of use cases.
Cons: Learning curve. Depends on the purrr package (part of the tidyverse).
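
If a plain vector is handier than a list, purrr's typed variants may help. A small sketch, assuming purrr is installed and using a made-up dataframe df with some missing values: map_int returns an integer vector, and throws an error if a result is not a single integer.

```r
library(purrr)

# Hypothetical toy data with a few missing values
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 5)
)

# map_int() returns a named integer vector instead of a list
na_counts <- map_int(df, ~ sum(is.na(.)))
na_counts
#> a b
#> 1 2
```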

Way 3: using dplyr

The following code can be translated as something like this:

1. Hey R, take mtcars -and then-    
2. Select all columns (if I'm in a good mood tomorrow, I might select fewer) -and then-  
3. Summarise all selected columns by using the function 'sum(is.na(.))'

The dot . stands for the column currently being summarised, just as in the map call above.

mtcars %>%
  select(everything()) %>%  # adapt to your needs
  summarise_all(~ sum(is.na(.)))  # funs() is deprecated since dplyr 0.8.0
#>   mpg cyl disp hp drat wt qsec vs am gear carb
#> 1   0   0    0  0    0  0    0  0  0    0    0

Pros: Straightforward. Quite simple (simpler than map, to me).
Cons: Depends on the dplyr package (part of the tidyverse).
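
In dplyr 1.0.0 and later, the scoped verbs such as summarise_all are superseded by across(). A hedged sketch of the same count written that way, on a made-up dataframe df with a few NAs (this assumes dplyr >= 1.0.0 is installed):

```r
library(dplyr)

# Hypothetical toy data with a few missing values
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 5)
)

# across() applies the lambda to every selected column;
# .x is the column currently being summarised
na_counts <- df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts
#>   a b
#> 1 1 2
```

Replacing everything() with another selection helper, for example where(is.numeric), restricts the count to a subset of columns.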

Way 4: Counting NAs rowwise using apply

Sometimes it is useful to count the NAs rowwise (case by case). apply can apply a function to each row of a dataframe; that is what MARGIN = 1 specifies.

apply(mtcars, MARGIN = 1, function(x) sum(is.na(x)))
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710
#>                   0                   0                   0
#>      Hornet 4 Drive   Hornet Sportabout             Valiant
#>                   0                   0                   0
#>          Duster 360           Merc 240D            Merc 230
#>                   0                   0                   0
#>            Merc 280           Merc 280C          Merc 450SE
#>                   0                   0                   0
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood
#>                   0                   0                   0
#> Lincoln Continental   Chrysler Imperial            Fiat 128
#>                   0                   0                   0
#>         Honda Civic      Toyota Corolla       Toyota Corona
#>                   0                   0                   0
#>    Dodge Challenger         AMC Javelin          Camaro Z28
#>                   0                   0                   0
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2
#>                   0                   0                   0
#>        Lotus Europa      Ford Pantera L        Ferrari Dino
#>                   0                   0                   0
#>       Maserati Bora          Volvo 142E
#>                   0                   0

Pros: Straightforward.
Cons: apply converts the dataframe to a matrix first, which can be slow for large data.
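
For this particular task, base R also offers a vectorised shortcut: is.na() on a dataframe returns a logical matrix, and rowSums() adds up the TRUEs per row. A minimal sketch on a made-up dataframe df (not from the original example):

```r
# Hypothetical toy data with a few missing values
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 5)
)

# is.na(df) is a logical matrix; rowSums() counts the TRUE values per row
na_per_row <- unname(rowSums(is.na(df)))
na_per_row
#> [1] 1 2 0
```

The column-wise counterpart, colSums(is.na(df)), answers the per-column question from the earlier sections.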

Way 5: Counting NAs rowwise using dplyr

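mtcars %>%
  rowwise() %>%
  # c_across() (dplyr >= 1.0.0) collects the values of the current row;
  # a bare dot would refer to the whole dataframe, not to one row
  summarise(NA_per_row = sum(is.na(c_across(everything()))))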
#> # A tibble: 32 x 1
#>    NA_per_row
#>         <int>
#>  1          0
#>  2          0
#>  3          0
#>  4          0
#>  5          0
#>  6          0
#>  7          0
#>  8          0
#>  9          0
#> 10          0
#> # ... with 22 more rows

Pros: Fits into the pipe thinking.
Cons: Somewhat less common.

Debrief

Counting NAs over all columns of a dataframe is quite a common task. Like always (?) in R, there are multiple ways to achieve it. Some were shown here (more exist). Enjoy!

Published on September 08, 2017