10 min read

Lesson 5 | More ggplot

Objectives

  • Practice ggplot basics
  • Learn new geoms to explore patterns within data
    • geom_boxplot
    • geom_violin
    • geom_jitter
  • Clearly plot group means and variance with geom_errorbar and position_dodge
  • Connect repeated measures data through time with geom_line and aes(group=)
  • De-emphasize plot elements with alpha=

Materials

Script

MoreGGplot.R

Walking through the script

Load packages & view data ChickWeight, which comes with \({\bf\textsf{R}}\):

pacman::p_load(tidyverse)
(ChickWeight <- as_tibble(ChickWeight))  
## # A tibble: 578 x 4
##    weight  Time Chick Diet 
##     <dbl> <dbl> <ord> <fct>
##  1     42     0 1     1    
##  2     51     2 1     1    
##  3     59     4 1     1    
##  4     64     6 1     1    
##  5     76     8 1     1    
##  6     93    10 1     1    
##  7    106    12 1     1    
##  8    125    14 1     1    
##  9    149    16 1     1    
## 10    171    18 1     1    
## # … with 568 more rows

Boxplots

Boxplots are ideal for plotting continous data across two or more levels of a categorical variable (<fct> or <chr> in tibble).

To get started, use filter() to trim the dataset down to last round of (Time)

EndWeight <- filter(ChickWeight, Time==max(Time))
EndWeight
## # A tibble: 45 x 4
##    weight  Time Chick Diet 
##     <dbl> <dbl> <ord> <fct>
##  1    205    21 1     1    
##  2    215    21 2     1    
##  3    202    21 3     1    
##  4    157    21 4     1    
##  5    223    21 5     1    
##  6    157    21 6     1    
##  7    305    21 7     1    
##  8     98    21 9     1    
##  9    124    21 10    1    
## 10    175    21 11    1    
## # … with 35 more rows

Let’s use tidyverse verbs to tweak the appearance and make this boring Diet variable more fun:

EndWeight <- mutate(EndWeight, 
                      ActualDiet = 
                        recode(Diet, 
                          "1"="corn", 
                          "2"="oats", 
                          "3"="burgers", 
                          "4"="tacos") )

It shouldn’t be a surprise that geom_boxplot is what we use to get a boxplot out of ggplot:

ggplot(EndWeight) + 
  geom_boxplot(aes(x=ActualDiet, y=weight))

Note that the default order for levels of vectors stored as characters is alphabetical. We can reorder the X axis by applying a function to Y values (in this case, median).

A conventional way to do this would be to add reorder() to aes():

ggplot(EndWeight, 
       aes(x=reorder(ActualDiet, weight, median), 
           y=weight)) + 
  geom_boxplot()

An alternative is to modify the variable with mutate() and pipe the data in. Note the difference in the X axis label between these two graphs.

EndWeight %>%
  mutate(ActualDiet = reorder(ActualDiet, weight, median)) %>%
ggplot(aes(x=ActualDiet, y=weight)) + 
  geom_boxplot()

reorder() allows one to easily reverse the direction by reordering Y variable on the opposite sign:

EW_gg <-  
  EndWeight %>%
    mutate(ActualDiet = reorder(ActualDiet, -weight, median)) %>%
  ggplot(aes(x=ActualDiet, y=weight))
EW_gg + geom_boxplot()

Remember we can use labs() to give clearer axis labels:

EW_gg <- EW_gg + labs(x="Chick diet", 
                      y="End weight (g)") 
EW_gg + geom_boxplot() 

geom_jitter

Boxplots are useful because they provide insight into the distribution of the data. But they can still obscure some trends. ggplot offers a couple tricks to get more information about where data actually are in the distribution.

Firstly, we could add the data points themselves…

EW_gg + geom_boxplot() +
        geom_point() 

…but this clearly doesn’t show all of the data:

EndWeight %>%
  group_by(ActualDiet) %>%
  summarize(Count = n())
## # A tibble: 4 x 2
##   ActualDiet Count
##   <fct>      <int>
## 1 corn          16
## 2 oats          10
## 3 burgers       10
## 4 tacos          9

Notice how in each case, fewer data points appear in the graph than chicks actually fed each diet. Clearly a few points are overlapping.

Adding jitter is the addition of some side-to-side noise and reduce the overlap of points with similar Y values:

EW_gg + geom_boxplot() +
        geom_jitter()

Ok maybe that’s too much noise.

EW_gg + geom_boxplot() +
  geom_jitter(width=0.15)

geom_jitter supports several options. Here we use one of the two-color points, 21 (pch is a base graphical parameter short for something like point character that is synoymous with shape):

EW_gg + geom_boxplot() +
  geom_jitter(width=0.15, 
              pch=21, 
              stroke=1.5,
              color="black", 
              fill="red")

geom_violin

An alternative to boxplots are violin plots, which instead of showing fixed quantiles of the data, plot a smoother pattern of the distribution. Wider parts have a relatively greater number of points at that value:

EW_gg + geom_violin()

The quantiles can be represented, as well…

EW_gg + geom_violin(draw_quantiles= c(0.25, 0.5, 0.75)) 
EW_gg + geom_violin(draw_quantiles= c(0.5))             

… as can the data:

EW_gg + geom_violin(draw_quantiles= c(0.25, 0.5, 0.75)) +
  geom_jitter(width=0.15, 
              pch=21, 
              stroke=1.5,
              color="black", 
              fill="red")

Plotting error bars

Despite the value of boxplots and violin plots, there are some reasons why they won’t work in all cases, especially for publication:

  • They can take up too much space, or get scrunched in small plots with several levels.
  • They don’t show what typical statistical models actually test–differences in means based on symetrical error (e.g., mean ± s.e., mean ± s.d.)
  • Some folks are unfamiliar with them
  • Some folks just don’t like them

Whatever the reason, when you move from the data exploration phase to presentation phase, you’ll want to plot point and whisker graphs with mean and error. “Dynamite plots” are not an option; see Dynamite plots: Unmitigated evil? and links therein.

EWsumm <- EndWeight %>%
            group_by(ActualDiet) %>%
              summarise(
                 mean=mean(weight), 
                 sem=sd(weight)/sqrt(length(weight)) )

Note that \({\bf\textsf{R}}\) doesn’t really have a function for standard error, which is defined as

\(\sigma_x = \frac{\sigma}{\sqrt{n}}\)

where

\(\sigma\) is standard deviation sd(), and

\(n\) is the sample size, calculated as the length() of the vector.1

geom_errorbar

Let’s build up the dot-whisker graph. We’ll start with plotting the means as points…

EW_gg <- 
  EWsumm %>%
    mutate(ActualDiet = reorder(ActualDiet, -mean, max)) %>%
ggplot(aes(x=ActualDiet, y=mean))
EW_gg + geom_point()  

… then add the error bars using geom_errorbar:

EW_gg + geom_point() +
        geom_errorbar(aes(ymin=mean-sem, 
                          ymax=mean+sem))

Let’s rein in those whiskers:

EW_gg + geom_point() +
  geom_errorbar(aes(ymin=mean-sem, 
                    ymax=mean+sem), 
                width=0.25)

A couple of things to note about this call to ggplot():

  • geom_errorbar has unique aesthetics: note how aes() requires arguments for the endpoints, ymin and ymax. This is consistent with the convention of describing error as mean ± s.e. geom_errorbar is also useful for plotting confidence intervals, but these are generally entered as absolute values themselves (ymin = 2.5%, ymax = 97.5%) and not relative to a measure of central tendency.
  • geom_errorbar also requires aes(x=), which in this case is used by both geoms because was given in the original ggplot() call that assigned EW_gg; recall we adjusted the order of the X axis.

We can use other optional arguments to improve the appearance of the plot:

EW_gg + geom_errorbar(aes(ymin=mean-sem, 
                    ymax=mean+sem), 
                width=0.25) +
        geom_point(shape=21, 
                   color="black", 
                   fill="white",
                   size=3, 
                   stroke=1.5) 

Note:

  • I always plot geom_errorbar first so that the points sit on top of the error bars, neither point borders nor fill are broken by the vertical bar.
  • stroke= defines the width of the point border.

Additional variables

Let’s go back to the original boring Diet column and add a couple new (fake) variables:

ChickWeight <- 
    ChickWeight %>%
      mutate(Grain = recode(Diet, 
                      "1"="corn", 
                      "2"="corn", 
                      "3"="oats", 
                      "4"="oats" ), 
            Source = recode(Diet, 
                      "1"="conventional", 
                      "2"="organic", 
                      "3"="conventional", 
                      "4"="organic")) 
 EndWeight <- filter(ChickWeight, Time==max(Time))

Then we re-calculate summary stats for both new levels:

EWsumm <- 
  EndWeight %>%
    group_by(Grain, Source) %>% 
      summarise( 
          mean=mean(weight), 
          sem=sd(weight)/sqrt(length(weight)) ) %>%
    ungroup() 

It is easy to add the second variable to a boxplot by adding aes(fill=)

EndWeight %>%
  mutate(Grain = reorder(Grain, -weight, median)) %>%
    ggplot(aes(x=Grain, y=weight)) + 
      geom_boxplot(aes(fill=Source))

… but things get messy when we try to do the same for the dot-whisker plot (this time with aes(color=), which would only change the outline of the boxes in a boxplot if we didn’t use fill):

EWsumm %>%
  mutate(Grain = reorder(Grain, -mean, max)) %>%
    ggplot(aes(x=Grain,
               y=mean,
               color=Source)) +   
      geom_errorbar(aes(ymin=mean-sem, 
                        ymax=mean+sem)) +
      geom_point()  

Notice again how aesthetics defined in the ggplot() call are applied within geoms, especially x=Grain and color=Source, which are required for both geoms.

position_dodge

The solution to the overlap is the position_dodge option of the position= argument, which kind of like jittering adds side-to-side space, but this time the distance is fixed instead of random noise:

EWsumm %>%
  mutate(Grain = reorder(Grain, -mean, max)) %>%
    ggplot(aes(x=Grain,
               y=mean,
               color=Source)) +   
    geom_errorbar(aes(ymin=mean-sem, 
                      ymax=mean+sem), 
                  width=0.15, 
                  position=position_dodge(width = 0.25)) +
    geom_point(position=position_dodge(width = 0.25))

It is very important that the position= argument be defined the same in both geoms lest they not properly overlap.


  1. One could, of course, write their own function, e.g. StdErr <- function(x) { sd(x)/sqrt(length(x))}.