Practical data cleansing in R

What is “data cleansing” about?

In practice, data analysis typically consists of several steps that can be grouped into “preparing data” and “modeling data” (leaving communication aside here):

(Inspired by this)

Often, the first major part, “prepare”, is the most time-consuming. This is a pity, since many analysts prefer the cool modeling aspects (I want to show my math!). In practice, one rather has to get one's hands dirty…

In this post, I want to put together a checklist of frequent steps in data preparation. More precisely, I would like to detail some typical steps in “cleansing” your data. Such steps include:

identify missing values
identify outliers
check for overall plausibility and errors (e.g., typos)
identify highly correlated variables
identify variables with (nearly) no variance
identify variables with strange names or values
check variable classes (e.g., characters vs. factors)
remove/transform some variables (maybe your model does not like categorical variables)
rename some variables or values (especially useful if there are many of them)
check some overall patterns (statistical/numerical summaries)
center/scale variables
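
To make a few of the steps listed above more concrete, here is a minimal sketch in base R. The data frame and its column names (height, constant, group, height_cm) are made up for illustration, and the thresholds (1e-8 for "no variance", 0.9 for "highly correlated") are arbitrary assumptions, not recommendations.

```r
# Hypothetical toy data; all column names are made up for illustration.
set.seed(42)
df <- data.frame(
  height   = c(rnorm(19, mean = 170, sd = 10), 999),  # last value is implausible
  constant = rep(1, 20),                              # column without variance
  group    = rep(c("a", "b"), 10),                    # character column
  stringsAsFactors = FALSE
)
df$height_cm <- df$height + rnorm(20, sd = 0.1)       # nearly a copy of height

# Check variable classes (e.g., characters vs. factors)
classes <- sapply(df, class)

# Identify variables with (nearly) no variance (threshold is an assumption)
num_cols  <- names(df)[sapply(df, is.numeric)]
variances <- sapply(df[num_cols], var)
low_var   <- names(variances)[variances < 1e-8]

# Identify highly correlated variables (threshold of 0.9 is an assumption)
cors <- cor(df[setdiff(num_cols, low_var)])
cors[upper.tri(cors, diag = TRUE)] <- NA              # keep each pair only once
high_cor <- which(abs(cors) > 0.9, arr.ind = TRUE)

# Identify outliers with the boxplot rule (1.5 * IQR beyond the hinges)
outliers <- boxplot.stats(df$height)$out

# Center/scale a numeric variable (mean 0, standard deviation 1)
df$height_z <- as.numeric(scale(df$height))
```

Whether a flagged value is really an error (such as the 999 above) or a legitimate extreme case still needs a human decision; the code only points to candidates.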

You can read the full post, including source code, here (GitHub). Here is an output file (HTML).

Example: Analyse missing values
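
As a rough illustration of this step, the following base R sketch counts missing values per variable and per case. The data frame and its columns (age, income, city) are hypothetical, and dropping incomplete cases at the end is just one of several possible strategies, not necessarily the one used in the full post.

```r
# Hypothetical toy data with some missing values (names are illustrative only).
df <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  income = c(2500, 3100, NA, 4000, 3900),
  city   = c("Berlin", "Munich", NA, "Hamburg", "Cologne"),
  stringsAsFactors = FALSE
)

# Number and share of missing values per variable (column)
na_per_col <- colSums(is.na(df))
na_share   <- colMeans(is.na(df))

# Number of missing values per case (row)
na_per_row <- rowSums(is.na(df))

# Flag cases without any missing value
complete <- complete.cases(df)

# Keep only complete cases (one of several possible strategies)
df_complete <- df[complete, ]
```

Looking at missing values both by column and by row helps to decide whether a whole variable or just a few cases are problematic.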


July 24, 2016
