Some time ago, I posted about how to plot frequencies using ggplot2. One point that remained untouched was how to sort the order of the bars. Let’s look at that issue here.

First, let’s load some data.

data(tips, package = "reshape2")

And the usual culprits.

library(tidyverse)
library(scales)  # for percentage scales

First, let’s plot a standard plot, with bars unsorted.

tips %>% 
  count(day) %>% 
  mutate(perc = n / nrow(tips)) -> tips2

ggplot(tips2, aes(x = day, y = perc)) + geom_bar(stat = "identity")

plot of chunk unnamed-chunk-3

Hang on, what could ‘unsorted’ possibly mean? There must be some rule, by which ggplot2 determines order.

And the rule is:

  • if factor, the order of factor levels is used
  • if character, an alphabetical order ist used

Sorting bars by factor ordering

Albeit it appears common not to like factors, now that’s a situation when they are useful. Factors provide an easy for sorting, see:

tips2$day <- factor(tips2$day,levels = c("Fri", "Sat", "Sun", "Thur"))

Now let’s plot again:

ggplot(tips2, aes(x = day, y = perc)) + geom_bar(stat = "identity")

plot of chunk unnamed-chunk-5

Sorting bars by some numeric variable

Often, we do not want just some ordering, we want to order by frequency, the most frequent bar coming first. This can be achieved in this way.

ggplot(tips2, aes(x = reorder(day, -perc), y = perc)) + geom_bar(stat = "identity")

plot of chunk unnamed-chunk-6

Note the minus sign -, leaving it out will change the ordering (from low to high).

Happy plotting!

Edit: This post was updated, including two errors fixed - thanks to (private) comments from Norman Markgraf.

z-values, aka values coming from an z-transformation are a frequent creature in statistics land. Among their properties are the following:

  1. mean is zero
  2. variance is one (and hence sd is one)

But why is that? How come that this two properties are true? The goal of this post is to shed light on these two properties of z-values.

Mean value of z-value is zero

There are a number of ways to explain this fact.

One is that it is one feature of mean values that the sum of the differences of the mean is zero. z-values tell nothing but the difference to the mean (in relation to the SD of the distribution). Hence, once you realize that the mean z-value is nothing but some mean value, you will see that the mean of a distribution/sample in z-values equals zero.

Intuition

Look at the following codea and diagram as an intuition:

library(tidyverse)

mtcars %>% 
  select(cyl) %>% 
  slice(1:10) %>% 
  mutate(cyl_mean = mean(cyl),
         diff = cyl - cyl_mean) %>% 
  summarise(sum(diff))
##      sum(diff)
## 1 1.776357e-15

So our sum of the diff values is (practically) zero.

mtcars %>% 
  rownames_to_column %>% 
  select(cyl, rowname) %>% 
  slice(1:10) %>% 
  mutate(cyl_mean = mean(cyl),
         diff = cyl - cyl_mean,
         sign_diff = sign(diff)) %>% 
  ggplot() +
  aes(x = rowname, y = diff) +
  geom_col(aes(fill = factor(sign_diff))) +
  coord_flip() +
  guides(fill = FALSE)

plot of chunk unnamed-chunk-2

The diagram above is meant to give an impression that the negative and positive differences to the mean “cancel out”, ie., sum up to zero. Put pluntly: First, concatenate the red bars. Then, concatenate the magenta bars. You will find the total red bar and the total magenta bar are of the same length.

One further visualization:

cyl_mean <- mean(mtcars[1:10, "cyl"])

mtcars %>% 
  rownames_to_column %>% 
  select(cyl, rowname) %>% 
  slice(1:10) %>% 
  mutate(cyl_mean = mean(cyl),
         diff = cyl - cyl_mean,
         sign_diff = sign(diff),
         ID = 1:10) %>% 
  ggplot() +
  aes(x = ID, y = cyl) +
  geom_col(aes(fill = factor(sign_diff))) +
  coord_flip() +
  guides(fill = FALSE) +
  geom_hline(yintercept = cyl_mean) +
  geom_rect(aes(xmin = ID-.45, xmax = ID+.45,
                ymin = cyl_mean,
                ymax = cyl_mean + diff
               ),
            alpha = .7) +
  scale_x_continuous(breaks = 1:10)

plot of chunk unnamed-chunk-3

Saying that the differences of the values to the mean sum up amounts to saying that the “negative bars” (starting at the vertical mean line at about 5.8) are of equal length if concatenated as the “positive bars” (again starting at the vertical line), concatenated.

More formally

More formally, let be some numeric variable, and the respective arithmetic mean, then .

But why does this equation hold? And how does this apply to z-values?

Let’s look at some algebra:

But the sum of the differences to the mean () is zero. Hence, the whole term is zero. That’s why (that’s one reason hot to explain why) the mean of z-values is zero.

But… why is the sum of differences from the mean equal to zero?

SD of z-values is one

Ok, maybe the mean value of z-values is zero. But why is the SD or the variance equal to one?

If the variance is one, we will agree that the sd is one, too, because the root of 1 is 1.

Intuition

Well, suppose you take all the differences from the mean and divide them by . Let’s call the new differences . Not very surprisingly, the of will also change accordingly - multiplied by , ie., will be divided by the factor . And that yields 1.

More formally

Let be the variance (V) of some variable X, the remaining details are as above.

But, , as discussed above. Thus, the equation is shortened to:

Now we replace by its definition.

Rearranging gives:

which can be simplified to

Thus, we see that the variance of z-values equals 1.


Similarly, picking up the idea from the intuition above, note that

In other words, if we multiply our values (the s) by some factor , the resulting variance will increase by . In case that the mean value is zero (as for z-values), then we may say that “if we multiply our differences by some factor , the variance will increase by ”. Taking the root of the variance, we are left with the sd changed by factor .

Plotting a function is often helpful to better understand what’s going on. Plotting curves in R base is simple by virtue of function curve. But how to draw curves using ggplot2?

That’s a little bit more complicated by can still be accomplished by 1-2 lines.

library(ggplot2)

Normal curve

p <- ggplot(data = data.frame(x = c(-3, 3)), aes(x))
  
p + stat_function(fun = dnorm, n = 101) 

plot of chunk unnamed-chunk-2

stat_function is some kind of parallel function to curve. The parameter n tells ggplot how finely granulated the curve should be. Compare to n=10:

p + stat_function(fun = dnorm, n = 10) 

plot of chunk unnamed-chunk-3

Logistic curve

The logitistic curve plays an eniment role in many statistical methods, e.g., regression for binary events, and Rasch model in psychometric. It is sometimes called “s-type” curve (or “ogive”) due to its form vaguely resembling an “S”:

scurve <- function(x){
  y <- exp(x) / (1 + exp(x))
  return(y)
}

p + stat_function(fun = scurve, n = 100) 

plot of chunk unnamed-chunk-4

As our function does not have a prebottled version in base R, we have defined a function beforehand. That function is then passed over to ggplot2.

Alternatively, we could have done that in one step:

p + stat_function(fun = function(x) exp(x)/(1+exp(x)), n = 100) 

plot of chunk unnamed-chunk-5

Which is shorter but somewhat less readable.

Ln-Function

Now the principle is clear and we can readily apply it to whatever function we wish. Let’s take the natural logarithm (log in R) as a final example.

p + stat_function(fun = log, n = 100)
## Warning in .Primitive("log")(x_trans): NaNs produced
## Warning: Removed 50 rows containing missing values (geom_path).

plot of chunk unnamed-chunk-6

Note that the ln-function is not defined for negative values (zero gives -Inf).

An old story is that one of the farmer with a fence of some given length, say 20m. Now this farmer wants to put up his fence so that he claims the largest piece of land possible. What width (w) and height (h) should we pick?

Instead of a formal proof, let’s start with a visualization.

First, we need some packages.

library(tidyverse)
library(gganimate)
library(RColorBrewer)
library(scales)
library(knitr)

Now, let’s make up serveral ways to split up a rectengular piece of land. Note that we only need to define two sides (width and height), as the circumference of a rectangle is .

df <- data_frame(
  w = c(0:10, 9:0),
  h = c(10:0, 1:10),
  area = w * h)

df$row <- 1:nrow(df)

Next, let’s define a palette from Brewer’s palette no 5.

cols <- brewer_pal("seq")(5)
mypal <- gradient_n_pal(cols)(seq(0, 1, length.out = 21))

names(mypal) <- as.character(df$area)

Next, plot an animated diagram:

p <- ggplot(df) +
  aes(x = w, y = h, fill = mypal, frame = row) +
  geom_rect(aes(xmax = w, ymax = h), xmin = 0, ymin = 0) +
  labs(title = paste0("area = ", df$area)) +
  # theme(legend.position = "none") +
  guides(fill = FALSE) +
  theme(plot.title = element_text(hjust = 0.5))

gganimate(p)
detach("package:gganimate", unload=TRUE)

Save output as gif:

#gganimate(p, "output.gif")

We can see (those wich quick eyes) that area is maximized when our piece of land is of quadratic shape. Glimpsing at the numbers confirms that impression:

df %>% 
  select(row, w, h, area) %>% 
  kable
row w h area
1 0 10 0
2 1 9 9
3 2 8 16
4 3 7 21
5 4 6 24
6 5 5 25
7 6 4 24
8 7 3 21
9 8 2 16
10 9 1 9
11 10 0 0
12 9 1 9
13 8 2 16
14 7 3 21
15 6 4 24
16 5 5 25
17 4 6 24
18 3 7 21
19 2 8 16
20 1 9 9
21 0 10 0

Area as a function of w

Maybe a more classical plot of area as a function of w (or h) is in order.

ggplot(df) +
  aes(y = area) +
  geom_line(aes(x = w)) +
  geom_point(aes(x = w), color = "red") 

plot of chunk unnamed-chunk-7

Similarly, area as a function of h:

ggplot(df) +
  aes(y = area) +
  geom_line(aes(x = h)) +
  geom_point(aes(x = h), color = "red") 

plot of chunk unnamed-chunk-8

That’s of coure no formal proof, but we get a “feel” that quadratic forms make best use of the fence (ie maximize the area for a given circumference) for all rectangular forms.

Proof

More formally, note that , or .

The area is given by

.

The derivative of this function is zero at the maximum, so

.

This equation is satisfied when , hence also . So the maximum rectengular area enclosed by a 20 meter fence is provided by a side length of 5 meters. a

A well-known property of regression models is that they capture the unique contribution of a predictor. By “unique” we mean the effect of the predictor (on the criterion) if the other predictor(s) is/are held constant. A typical classroom example goes along the following lines.

All about storks

  1. There’s a correlation between babies and storks. Counties with lots of storks enjoy large number of babies and v.v.

  2. However, I have children, I know the storks are not overly involved in that business, so says the teacher (polite laughters in the audience).

  3. Once we consider a third variable, call it nature-level, we see that the correlation between storks and babies vanishes.

  4. Assume that nature-level and storks are correlated - in nature-rich areas there are a lot of storks. Assume further that nature-level and the amount of babies are correlated - in nature-rich areas people have relatively many babies.

  5. That’s why we find a correlation between babies and storks (“spurious” it is called), which is (fully) attributable to the nature-level.

This is all well-known but it helps to see it in real data. That’s the objective of this post.

Tip, sex, and total_bill - two simple bivariate analyses

Let’s have a look at the tips dataset.

data(tips, package = "reshape2")

And load some datajudo packages.

library(tidyverse)

First, simple bivariate assocations for sex-tip and total_bill-tip.

Is there an assocation between sex and tip?

ggplot(tips) +
  ggplot2::aes(x = sex, y = tip) +
  geom_jitter(color = "grey80") +
  geom_jitter(data = tips, aes(color = sex)) +
  stat_summary(fun.y = mean, color = "red", geom = "line", size = 2, group = 1) +
  stat_summary(fun.y = mean, color = "red", geom = "point", size = 5, group = 1, alpha = .5)

plot of chunk unnamed-chunk-3

tips2 <- select(tips, -sex)

tips %>% 
  ggplot() +
  aes(x = total_bill, y = tip) +
  geom_point(data = tips2, color = "grey80") +
  geom_point(aes(color = sex)) +
  facet_wrap(~sex)

plot of chunk unnamed-chunk-3

It appears that men do tip slightly more than women do.

tips %>% 
  group_by(sex) %>% 
  summarise(tip_sex_mean = mean(tip),
            tip_sex_md = median(tip))
## # A tibble: 2 x 3
##      sex tip_sex_mean tip_sex_md
##   <fctr>        <dbl>      <dbl>
## 1 Female     2.833448       2.75
## 2   Male     3.089618       3.00

Ok, some 25 Cent difference, in favor of men. Some association is present in the data.

Let’s run a regression on that, just to double-check.

lm(tip ~ sex,
   data = tips)
## 
## Call:
## lm(formula = tip ~ sex, data = tips)
## 
## Coefficients:
## (Intercept)      sexMale  
##      2.8334       0.2562

Ok, confirmed.

Now, what about the association of total_bill and tip?

tips %>% 
  ggplot +
  aes(x = total_bill, y = tip) +
  geom_point()

plot of chunk unnamed-chunk-6

Oh, clearly, some correlation is present between total_bill and tip.

tips %>% 
  summarise(cor_tip_bill = cor(tip, total_bill))
##   cor_tip_bill
## 1    0.6757341

And let’s have a look at the slope of the regression.

lm(tip ~ total_bill, data = tips)
## 
## Call:
## lm(formula = tip ~ total_bill, data = tips)
## 
## Coefficients:
## (Intercept)   total_bill  
##      0.9203       0.1050

Multivariate analysis

Now let’s do that in one run:

lm(tip ~ total_bill + sex,
   data = tips)
## 
## Call:
## lm(formula = tip ~ total_bill + sex, data = tips)
## 
## Coefficients:
## (Intercept)   total_bill      sexMale  
##     0.93328      0.10523     -0.02661

Notice that the slope of the predictor total_bill has not changed much. But sex did!

So, the unique contribution of sex is very small, about 2 Cent, and in favor of women when considered in junction with total_bill!

Let’s visualize the multivariate appraoch

First, for easier visualization, let’s bin total_bill in say, 5, bins.

tips$bill_bins <- cut_interval(tips$total_bill, n = 5)

Now let’s plot all three variables in one go.

tips %>% 
  ggplot +
  aes(x = bill_bins, 
      y = tip,
      fill = sex) +
  geom_boxplot() +
  geom_point(alpha = .5)

plot of chunk unnamed-chunk-11

We see that females do tip more than men when the 5 bins of total_bill are considered each in isolation. For 4 of the 5 bins we find that the median of the women is higher than that of the men. Only in the 5th, ie., highest bin, we found that men tip higher. However, this bin is slightly populated, so we may negligate it. In sum, looking at each bin of total_bill - women tip more than men do!

Depicted differently:

tips %>% 
  ggplot() +
  aes(x = total_bill, y = tip) +
  geom_point(data = tips2, color = "grey80") +
  geom_point(aes(color = sex)) +
  facet_wrap(~sex)

plot of chunk unnamed-chunk-12

Dichotomize total_bill

Maybe even more vivid is we we do not bin total_bill in 5 bins, but just in two.

tips$bill_bins <- cut_interval(tips$total_bill, n = 2)

Again, let’s plot the three variables.

tips %>% 
  ggplot +
  aes(x = bill_bins, 
      y = tip,
      fill = sex) +
  geom_boxplot() +
  geom_point(alpha = .5)

plot of chunk unnamed-chunk-14

We find that women seem to tip slightly more than men do; at least not less.

A Simpson case

What happens can be considered a Simpson’s paradox, based on the fact that men spend more than women do (higher total_bill). Even though they tip less or equally to women it appears they tip “in sum” (not considering total_bill) more simply because total_bill and tip are related.

Let’s dichotomize total_bill and see which sex group (females or males) occur more frequently.

tips %>% 
  mutate(bill_di = ntile(total_bill, n = 2),
         bill_di_f = if_else(bill_di == 1, "low", "hi")) %>% 
  count(sex, bill_di_f) %>% 
 # gather(bill_group, n, -n) %>% 
  ggplot() +
  aes(x = sex, fill = bill_di_f, y = n) +
  geom_col(position = "fill")

plot of chunk unnamed-chunk-15

We see that the in the male group, high values occurr more frequently.

Path model

sex_male has a positive effect on total_bill; and total_bill in turn on tip. But sex_male does not exert much of an influence on tip anymore, after controling for total_bill (in fact, the positive effect turned to a weak negative impact).

library(DiagrammeR)
mermaid("
graph LR
  total_bill-->tip
  sex_male-->total_bill
  sex_male---tip
 ")

plot of chunk unnamed-chunk-16

Conclusion

That kind of “swapping” of the regression coefficient is a frequent observation in (multiple) regression. We may conclude that the multivariate analyses with 3 variables is more adequate than the bivariate analyses with one predictor only.

However, a caveat is that we can never be sure that this swapping process is now complete. Maybe we are missing out some more variable that will again make the coefficients swap. Alas, this we cannot know in observational reseach. That’s why observational data should not be overinterpreted.

Of course, even if commonly speak of “effects” this language should never be taken as indicative of causal effects. Causal effects is nothing that can firmly be established by statistics. Causality claims rest of more logical and substantive argument.