# Practical data cleansing in R

## July 24, 2016


What is “data cleansing” about?

In practice, data analysis typically consists of several steps, which can be subsumed under “preparing data” and “modeling data” (leaving communication aside here).

(Inspired by this)

Often, the first major part, “prepare”, is the most time-consuming. This can be lamented, since many analysts prefer the cool modeling aspects (after all, I want to show my math!). In practice, one rather has to get one's hands dirt…

In this post, I want to put together some kind of checklist of frequent steps in data preparation. More precisely, I would like to detail some typical steps in “cleansing” your data. Such steps include:

• identify missing values
• identify outliers
• check for overall plausibility and errors (e.g., typos)
• identify highly correlated variables
• identify variables with (nearly) no variance
• identify variables with strange names or values
• check variable classes (e.g., characters vs. factors)
• remove/transform some variables (maybe your model does not like categorical variables)
• rename some variables or values (especially worthwhile if there are many of them)
• check some overall patterns (statistical/numerical summaries)
• center/scale variables
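Several of the checks listed above can be sketched in a few lines of base R. The following is only a minimal illustration and not the code from the full post: the toy data frame `df`, the helper `is_outlier()`, and the thresholds (1.5 times the IQR, a variance below `1e-8`, an absolute correlation above 0.9) are assumptions chosen for demonstration.

```r
# Toy data (hypothetical; not the data set used in the full post)
set.seed(42)
n <- 100
df <- data.frame(
  height   = rnorm(n, mean = 170, sd = 10),
  weight   = rnorm(n, mean = 70, sd = 15),
  constant = rep(1, n),                              # no variance at all
  group    = sample(c("a", "b"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
df$height_copy <- df$height + rnorm(n, sd = 0.1)     # nearly redundant variable
df$weight[1]   <- 700                                # typo-like outlier

# Check variable classes (e.g., characters vs. factors)
var_classes <- sapply(df, class)

# Overall pattern: numerical summaries of all variables
print(summary(df))

# The remaining checks apply to numeric variables only
num_cols <- names(df)[sapply(df, is.numeric)]

# Outliers: values more than 1.5 * IQR below Q1 or above Q3 (assumed rule)
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}
outlier_counts <- sapply(df[num_cols], function(x) sum(is_outlier(x), na.rm = TRUE))

# Variables with (nearly) no variance
variances <- sapply(df[num_cols], var, na.rm = TRUE)
low_var <- names(variances)[variances < 1e-8]

# Highly correlated pairs (constant variables are dropped first,
# because their correlation is undefined)
keep <- setdiff(num_cols, low_var)
cor_mat <- cor(df[keep], use = "pairwise.complete.obs")
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA       # keep each pair only once
high_cor <- which(abs(cor_mat) > 0.9, arr.ind = TRUE)

# Center and scale the remaining numeric variables (mean 0, sd 1)
scaled <- scale(df[keep])

print(outlier_counts)
print(low_var)
print(high_cor)
```

Which thresholds are sensible depends on the data and the model at hand, so the values above should be read as starting points rather than rules.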

You can read the full post, including source code, here (GitHub). Here is an output file (HTML).

Example: Analyse missing values
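As a rough idea of what such an analysis can look like, here is a minimal base R sketch that counts missing values per variable and per case. The data frame `df` and its columns are made up for illustration and are not the data used in the full post.

```r
# Toy data with some missing values (hypothetical example data)
df <- data.frame(
  age    = c(23, NA, 31, 45, NA, 52),
  income = c(2500, 3100, NA, 4000, 3900, 4200),
  city   = c("Berlin", "Munich", NA, "Hamburg", "Berlin", "Cologne"),
  stringsAsFactors = FALSE
)

# Number and share of missing values per variable (column)
na_per_col <- colSums(is.na(df))
na_share   <- colMeans(is.na(df))

# Number of missing values per case (row)
na_per_row <- rowSums(is.na(df))

# Keep only the cases without any missing value
complete <- df[complete.cases(df), ]

print(na_per_col)
print(round(na_share, 2))
print(na_per_row)
```

Looking at both columns and rows helps with deciding whether to drop a variable, drop individual cases, or impute values instead.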
