ggplot2 is another way of doing graphics in R.
Honestly I am a beginner at ggplot too; old dog, new tricks, you know.
For those of you who are starting out, ggplot is a good choice, and Susan and I think you will like it. It does have a slight learning curve and some idiosyncratic terminology, but then again you get to think beautiful aesthetic thoughts while analyzing data.

Two useful resources, both linked from our Canvas site:

  • The major book about ggplot2: ggplot2_2nd-ed_Wickham.pdf
  • RStudio’s ggplot2-cheatsheet-2.1.pdf

The grammar of graphics

We have reason to be proud. Leland Wilkinson, inventor of the grammar of graphics, is a Yale Ph.D. in Psych. In fact he dedicated his book (which introduced the subject) to his daughter and to John Hartigan (Statistics prof at Yale).

Three key components of a graph:

  • Data
  • “Aesthetic mappings” that give a correspondence between variables in the data and visual properties
  • “Layers” and “geoms” that show the data in various ways. Layers are usually made with a geom_… function.

Aesthetic mappings? Wilkinson writes that the word aesthetics derives from a Greek word \[\alpha\iota\sigma\theta\eta\sigma\iota\varsigma,\] which means perception. The additional meanings related to beauty and artistic criteria arose in the 18th century.

Other aspects include “facets” – multiple small plots – more about this below.

The ggplot2 library has a dataset mpg. Note that this is a subset of a full data set that you can find on this page at www.fueleconomy.gov. If you like this kind of thing, it could be a dataset for a project.

mpg # error
## Error in eval(expr, envir, enclos): object 'mpg' not found
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
mpg
## # A tibble: 234 × 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
## 1          audi         a4   1.8  1999     4   auto(l5)     f    18    29
## 2          audi         a4   1.8  1999     4 manual(m5)     f    21    29
## 3          audi         a4   2.0  2008     4 manual(m6)     f    20    31
## 4          audi         a4   2.0  2008     4   auto(av)     f    21    30
## 5          audi         a4   2.8  1999     6   auto(l5)     f    16    26
## 6          audi         a4   2.8  1999     6 manual(m5)     f    18    26
## 7          audi         a4   3.1  2008     6   auto(av)     f    18    27
## 8          audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
## 9          audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
# tibbles have nice informative displays!

Scatterplots

Scatterplots may be the most useful and most used type of graphical display, so let’s start here.

ggplot(data = mpg, mapping = aes(x=displ, y=hwy))

# Plots usually start this way: specify data and some aesthetic mappings.
# The line above sets up axes using x and y variables, but doesn't put anything there.
# To add a layer of points we do this:
ggplot(data = mpg, mapping = aes(x=displ, y=hwy)) + geom_point() 
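
To see why the first call draws only empty axes, it can help to peek inside the plot object. This is a small sketch, assuming the plot object can be inspected with `$layers` and `$mapping` (that is how ggplot2 3.x objects behave; treat the field access as an assumption on other versions). The names `p0` and `p1` are just for illustration.

```r
library(ggplot2)

# Data and aesthetic mappings only: axes get set up, but there is no layer yet.
p0 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
length(p0$layers)   # number of layers so far

# Adding a geom adds one layer that uses the same data and mappings.
p1 <- p0 + geom_point()
length(p1$layers)
names(p1$mapping)   # the aesthetics that were mapped
```

So "data + aesthetic mappings" and "layers" really are separate pieces, and `+` just adds a layer to what is already there.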

Changing “theme”:

ggplot(data = mpg, mapping = aes(x=displ, y=hwy)) + geom_point() + theme_bw()

ggplot(data = mpg, mapping = aes(x=displ, y=hwy)) + geom_point() + theme_classic()

Adding color to the points:

ggplot(data = mpg, mapping = aes(x=displ, y=hwy, color=class)) + geom_point()

ggplot(mpg, aes(displ, hwy, color=class)) + geom_point()

#^^ The previous 2 plots are the same. Last one is nice and concise, taking advantage of the fact that we don't have to give argument names if we are using them in their default positions. 

# Can also do it this way, putting the aes inside the geom_point:
ggplot(data = mpg, mapping = aes(x=displ, y=hwy)) + geom_point(mapping = aes(color=class))

Compared to the base graphics, it’s great that the legend is not inside the plot, and it was made with no pain!
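
Where the aes goes makes no difference with a single layer, but it can matter once there are several layers: mappings given in ggplot() are inherited by every layer, while mappings given inside a geom apply to that layer only. Here is a hedged sketch of the difference, borrowing geom_smooth (discussed under More Geoms) as a second layer. The names `p_global`, `p_local`, `n_global` and `n_local` are made up for this example, and `layer_data()` is used only to count the fitted lines.

```r
library(ggplot2)

# color mapped in ggplot(): inherited by both layers, so one line per class
p_global <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color mapped in geom_point() only: the smooth layer sees a single group
p_local <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the fitted lines in the second layer of each plot
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local  <- length(unique(layer_data(p_local, 2)$group))
c(n_global, n_local)
```

Printing `p_global` and `p_local` side by side shows the same points in both, with one regression line per class in the first and a single line in the second.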

But wait, there are 234 rows and I don’t think I see 234 points. We should jitter.

ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter()

ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter(width=0,height=0)

ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter(width=0,height=1)

ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter(width=.1,height=0)

ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter(width=1,height=0)

How does jittering work?

reps <- 100
d <- d0 <- data.frame(x=rep(c(0,0,1,1),reps),y=rep(c(0,1,0,1),reps))
ggplot(d, aes(x,y)) + geom_point()

ggplot(d, aes(x,y)) + geom_jitter()

ggplot(d, aes(x,y)) + geom_jitter(height=0.1)

ggplot(d, aes(x,y)) + geom_jitter(width=0.1)

ggplot(d, aes(x,y)) + geom_jitter(width=0.1, height = 0.1)

d <- rbind(d0, data.frame(x=.1,y=0))
ggplot(d, aes(x,y)) + geom_point()

ggplot(d, aes(x,y)) + geom_jitter()
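
What seems to be going on in the plots above: jittering adds a little uniform random noise to each coordinate, and when no width or height is given the amount is a fraction (40%, per the position_jitter help page) of the "resolution" of the data, meaning the smallest gap between distinct values. The sketch below imitates the jitter by hand with runif and then uses ggplot2's `resolution()` to show why one extra point at x = 0.1 shrinks the default horizontal jitter so much. The names `dj`, `w`, `h`, `res_before` and `res_after` are just for illustration, and the hand-made version is an approximation of the idea, not ggplot2's own code.

```r
library(ggplot2)
set.seed(230)  # arbitrary seed so the sketch is reproducible

d0 <- data.frame(x = rep(c(0, 0, 1, 1), 100), y = rep(c(0, 1, 0, 1), 100))

# Hand-made jitter: add independent uniform noise in [-w, w] and [-h, h]
w <- 0.1
h <- 0.1
dj <- data.frame(x = d0$x + runif(nrow(d0), -w, w),
                 y = d0$y + runif(nrow(d0), -h, h))
ggplot(dj, aes(x, y)) + geom_point()

# The default amount depends on the smallest gap between distinct values
res_before <- resolution(d0$x)           # x takes the values 0 and 1
res_after  <- resolution(c(d0$x, 0.1))   # adding x = 0.1 shrinks the gap
c(res_before, res_after)
```

With the extra point, the smallest gap in x drops from 1 to 0.1, so the default horizontal jitter becomes ten times narrower, while the vertical jitter is unchanged.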

Facetting

Facets make separate plots of subsets of the data, determined by the values of a variable.

p <- ggplot(mpg, aes(displ, hwy))
p + facet_wrap(~class)

p + facet_wrap(~class) + geom_point()

# compare to using an aesthetic:
p + geom_jitter(aes(color=class))

# Can specify the number of columns:
p + facet_wrap(~cyl, ncol = 3) + geom_point()

Scales free vs constrained: by default the scales on the axes are consistent with each other across the facets. But this can be controlled, and you can “free” the x and/or y scales:

p + facet_wrap(~cyl) + geom_point()

p + facet_wrap(~cyl, scales = "free") + geom_point()

p + facet_wrap(~cyl, scales = "free_y") + geom_point()

More Geoms

Smoothing, including linear and also nonlinear regression using “loess” (for “locally weighted regression”):

p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess'

# Default is "loess" method. Smoothing is controlled by the "span" parameter: the larger the span, the smoother the curve.
p + geom_point() + geom_smooth(span=.2)
## `geom_smooth()` using method = 'loess'

p + geom_point() + geom_smooth(span=1)
## `geom_smooth()` using method = 'loess'

# If you don't want the confidence bands:
p + geom_point() + geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess'

# geom_smooth can do other kinds of regression, including linear:
p + geom_point() + geom_smooth(method = "lm")

p + geom_point() + geom_smooth(method = "lm", se=FALSE)

# An example with facets:
p + facet_wrap(~cyl) + geom_smooth(method = "lm") #oops no points!

p + facet_wrap(~cyl) + geom_smooth(method = "lm") + geom_point()
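
The straight line that geom_smooth(method = "lm") draws on the unfaceted plot is the ordinary least squares fit of y on x, so we can get the same line from lm and draw it ourselves. A minimal sketch, where `fit` and `b` are names chosen for this example:

```r
library(ggplot2)

# Ordinary least squares fit of highway mpg on displacement
fit <- lm(hwy ~ displ, data = mpg)
b <- unname(coef(fit))   # b[1] is the intercept, b[2] is the slope

# Draw that line by hand with geom_abline
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_abline(intercept = b[1], slope = b[2], color = "blue")
```

Doing it this way also gives you the fitted model object, so summary(fit) is available if you want the numbers behind the picture.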

Boxplots

ggplot(mpg, aes(drv,hwy)) + geom_boxplot()

ggplot(mpg, aes(drv,hwy)) + geom_point()

ggplot(mpg, aes(drv,hwy)) + geom_jitter(width = .1)

ggplot(mpg, aes(drv,hwy)) + geom_violin()

Here’s something you can do if you want; we didn’t have time to see this in class. We can reorder a factor, e.g. class, to make boxplots nicer.

ggplot(mpg, aes(class,hwy)) + geom_boxplot()

class1 <- reorder(mpg$class, mpg$hwy)
class1
##   [1] compact    compact    compact    compact    compact    compact   
##   [7] compact    compact    compact    compact    compact    compact   
##  [13] compact    compact    compact    midsize    midsize    midsize   
##  [19] suv        suv        suv        suv        suv        2seater   
##  [25] 2seater    2seater    2seater    2seater    suv        suv       
##  [31] suv        suv        midsize    midsize    midsize    midsize   
##  [37] midsize    minivan    minivan    minivan    minivan    minivan   
##  [43] minivan    minivan    minivan    minivan    minivan    minivan   
##  [49] pickup     pickup     pickup     pickup     pickup     pickup    
##  [55] pickup     pickup     pickup     suv        suv        suv       
##  [61] suv        suv        suv        suv        pickup     pickup    
##  [67] pickup     pickup     pickup     pickup     pickup     pickup    
##  [73] pickup     pickup     suv        suv        suv        suv       
##  [79] suv        suv        suv        suv        suv        pickup    
##  [85] pickup     pickup     pickup     pickup     pickup     pickup    
##  [91] subcompact subcompact subcompact subcompact subcompact subcompact
##  [97] subcompact subcompact subcompact subcompact subcompact subcompact
## [103] subcompact subcompact subcompact subcompact subcompact subcompact
## [109] midsize    midsize    midsize    midsize    midsize    midsize   
## [115] midsize    subcompact subcompact subcompact subcompact subcompact
## [121] subcompact subcompact suv        suv        suv        suv       
## [127] suv        suv        suv        suv        suv        suv       
## [133] suv        suv        suv        suv        suv        suv       
## [139] suv        suv        suv        compact    compact    midsize   
## [145] midsize    midsize    midsize    midsize    midsize    midsize   
## [151] suv        suv        suv        suv        midsize    midsize   
## [157] midsize    midsize    midsize    suv        suv        suv       
## [163] suv        suv        suv        subcompact subcompact subcompact
## [169] subcompact compact    compact    compact    compact    suv       
## [175] suv        suv        suv        suv        suv        midsize   
## [181] midsize    midsize    midsize    midsize    midsize    midsize   
## [187] compact    compact    compact    compact    compact    compact   
## [193] compact    compact    compact    compact    compact    compact   
## [199] suv        suv        pickup     pickup     pickup     pickup    
## [205] pickup     pickup     pickup     compact    compact    compact   
## [211] compact    compact    compact    compact    compact    compact   
## [217] compact    compact    compact    compact    compact    subcompact
## [223] subcompact subcompact subcompact subcompact subcompact midsize   
## [229] midsize    midsize    midsize    midsize    midsize    midsize   
## attr(,"scores")
##    2seater    compact    midsize    minivan     pickup subcompact 
##   24.80000   28.29787   27.29268   22.36364   16.87879   28.14286 
##        suv 
##   18.12903 
## Levels: pickup suv minivan 2seater midsize subcompact compact
ggplot(mpg, aes(class1,hwy)) + geom_boxplot()

class2 <- reorder(mpg$class, mpg$hwy, FUN = median)
class2
##   [1] compact    compact    compact    compact    compact    compact   
##   [7] compact    compact    compact    compact    compact    compact   
##  [13] compact    compact    compact    midsize    midsize    midsize   
##  [19] suv        suv        suv        suv        suv        2seater   
##  [25] 2seater    2seater    2seater    2seater    suv        suv       
##  [31] suv        suv        midsize    midsize    midsize    midsize   
##  [37] midsize    minivan    minivan    minivan    minivan    minivan   
##  [43] minivan    minivan    minivan    minivan    minivan    minivan   
##  [49] pickup     pickup     pickup     pickup     pickup     pickup    
##  [55] pickup     pickup     pickup     suv        suv        suv       
##  [61] suv        suv        suv        suv        pickup     pickup    
##  [67] pickup     pickup     pickup     pickup     pickup     pickup    
##  [73] pickup     pickup     suv        suv        suv        suv       
##  [79] suv        suv        suv        suv        suv        pickup    
##  [85] pickup     pickup     pickup     pickup     pickup     pickup    
##  [91] subcompact subcompact subcompact subcompact subcompact subcompact
##  [97] subcompact subcompact subcompact subcompact subcompact subcompact
## [103] subcompact subcompact subcompact subcompact subcompact subcompact
## [109] midsize    midsize    midsize    midsize    midsize    midsize   
## [115] midsize    subcompact subcompact subcompact subcompact subcompact
## [121] subcompact subcompact suv        suv        suv        suv       
## [127] suv        suv        suv        suv        suv        suv       
## [133] suv        suv        suv        suv        suv        suv       
## [139] suv        suv        suv        compact    compact    midsize   
## [145] midsize    midsize    midsize    midsize    midsize    midsize   
## [151] suv        suv        suv        suv        midsize    midsize   
## [157] midsize    midsize    midsize    suv        suv        suv       
## [163] suv        suv        suv        subcompact subcompact subcompact
## [169] subcompact compact    compact    compact    compact    suv       
## [175] suv        suv        suv        suv        suv        midsize   
## [181] midsize    midsize    midsize    midsize    midsize    midsize   
## [187] compact    compact    compact    compact    compact    compact   
## [193] compact    compact    compact    compact    compact    compact   
## [199] suv        suv        pickup     pickup     pickup     pickup    
## [205] pickup     pickup     pickup     compact    compact    compact   
## [211] compact    compact    compact    compact    compact    compact   
## [217] compact    compact    compact    compact    compact    subcompact
## [223] subcompact subcompact subcompact subcompact subcompact midsize   
## [229] midsize    midsize    midsize    midsize    midsize    midsize   
## attr(,"scores")
##    2seater    compact    midsize    minivan     pickup subcompact 
##       25.0       27.0       27.0       23.0       17.0       26.0 
##        suv 
##       17.5 
## Levels: pickup suv minivan 2seater subcompact compact midsize
ggplot(mpg, aes(class2,hwy)) + geom_boxplot()
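
In case reorder feels like magic, here is a sketch of what it does, in two steps: compute one score per class, then use the class names sorted by score as the factor levels. The names `scores` and `class_by_hand` are made up for this example. The last plot shows that the reordering can also be written directly inside aes.

```r
library(ggplot2)

# Step 1: one score per class (reorder uses the mean unless told otherwise)
scores <- tapply(mpg$hwy, mpg$class, mean)

# Step 2: use the class names, sorted by score, as the factor levels
class_by_hand <- factor(mpg$class, levels = names(sort(scores)))

# Same level order as reorder() gives
identical(levels(class_by_hand), levels(reorder(mpg$class, mpg$hwy)))

# The reordering can also be done inside aes(), without a new variable
ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) +
  geom_boxplot() +
  labs(x = "class")
```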

Histograms

Just use an “x” aesthetic, not x and y:

ggplot(mpg, aes(hwy)) 

ggplot(mpg, aes(hwy)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 5)

ggplot(mpg, aes(hwy)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Subsetting by drv:

p <- ggplot(mpg, aes(x = hwy))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram() + facet_wrap(~drv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram() + facet_wrap(~drv, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram(aes(fill=drv)) + facet_wrap(~drv, ncol = 1) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
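
About that `stat_bin()` message: the histogram is built by cutting the x range into bins and counting the rows in each, and you choose the cut either by number (bins) or by width (binwidth). The sketch below uses `layer_data()` to look at the bars a layer actually draws. The names `bars_default` and `bars_wide` are just for illustration, and the column names (count, xmin, xmax) are the ones I would expect stat_bin to produce.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = hwy))

# layer_data() returns the data frame that a layer actually draws
bars_default <- layer_data(p + geom_histogram(bins = 30))
bars_wide    <- layer_data(p + geom_histogram(binwidth = 5))

# Every car lands in exactly one bar, whatever the binning
sum(bars_default$count)
sum(bars_wide$count)

# With binwidth = 5 each bar spans 5 mpg
unique(round(bars_wide$xmax - bars_wide$xmin, 8))
```

Setting bins or binwidth explicitly also makes the message go away.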

A couple more things thrown in for fun (didn’t get to see these in class):

p + aes(color=drv) + geom_freqpoly(lwd=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + aes(color=drv) + geom_density(lwd=2)

Titles and axis labels

# See cheatsheet, p. 2, "Labels" section.
ggplot(mpg, aes(displ, hwy)) + geom_point() + labs(x="displacement", y="highway mpg", title="Fuel efficiency decreases with engine size")

Back to college data and first-generation percentages

source(file = "http://www.stat.yale.edu/~jtc5/STAT230/data/170309-college-dump.r")
colleges
## [1] "Yale University"                    
## [2] "Harvard University"                 
## [3] "Princeton University"               
## [4] "University of Connecticut"          
## [5] "University of Massachusetts-Amherst"
## [6] "Rutgers University-New Brunswick"
r
## # A tibble: 60 × 3
##     firstgen         college  year
##        <dbl>           <chr> <int>
## 1  0.1430281 Yale University  2005
## 2  0.1476378 Yale University  2006
## 3  0.1610787 Yale University  2007
## 4  0.1844106 Yale University  2008
## 5  0.2078804 Yale University  2009
## 6  0.1866667 Yale University  2010
## 7  0.1857335 Yale University  2011
## 8  0.2054208 Yale University  2012
## 9  0.2009132 Yale University  2013
## 10 0.2246256 Yale University  2014
## # ... with 50 more rows

The plot we did last time:

colors <- c("blue", "red", "black", "blue", "red", "black")
types <- c(1,1,1,2,2,2)
plot(firstgen ~ year, data=r, type="n", ylab="first-generation student proportion")
for(i in seq_along(colleges)){
  rows <- which(r$college == colleges[i])
  lines(r$year[rows], r$firstgen[rows], type="b", lty=types[i], col=colors[i])
}
#
# Looks ok up to here.  Let's add a legend:
#
legend("bottomright", legend=colleges, col=colors, lty=types, pch=19) # cex=0.65
title(main="First-generation students at private and public schools")

Here is a way to make a similar plot using ggplot, again putting the legends outside the coordinate axes where they belong.

# View(r)
r$type <- rep(c("private","public"), each=30)

ggplot(r, aes(x = year, y=firstgen, color=college)) + 
  geom_point()

ggplot(r, aes(x = year, y=firstgen, color=college, linetype=type)) + 
  geom_point() + geom_line()

That looks quite good to me, and it’s easy too!
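
One caution about the `type` column: rep(..., each=30) only works because the rows of r happen to be sorted with the three private schools first. A sturdier way, sketched here on a tiny made-up stand-in for r (the data frame `r_demo` and its numbers are hypothetical, not the real data), is to look up each row's college name:

```r
# Hypothetical stand-in for r: two schools, two years each (made-up numbers)
r_demo <- data.frame(
  college  = rep(c("Yale University", "University of Connecticut"), each = 2),
  year     = rep(2005:2006, times = 2),
  firstgen = c(0.14, 0.15, 0.30, 0.31)
)

# The three private schools named in `colleges`
private <- c("Yale University", "Harvard University", "Princeton University")

# Label each row by looking up its college, not by its position
r_demo$type <- ifelse(r_demo$college %in% private, "private", "public")
r_demo
```

The same ifelse line applied to r should give the same column as the rep version, and it would keep working if the rows were shuffled.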

Oops, I just noticed year labels like 2007.5 on the x axis, which don’t seem so nice. We can modify the x axis scale with scale_x_continuous:

ggplot(r, aes(x = year, y=firstgen, color=college, linetype=type)) + 
  geom_point() + geom_line() + 
  scale_x_continuous(breaks = 2005:2014)

That’s nice. We don’t need the grid lines between the “major” grid lines, so here is a way to remove them:

ggplot(r, aes(x = year, y=firstgen, color=college, linetype=type)) + 
  geom_point() + geom_line() + 
  scale_x_continuous(breaks = 2005:2014, minor_breaks = NULL)

Oh, and I forgot a title:

ggplot(r, aes(x = year, y=firstgen, color=college, linetype=type)) + 
  geom_point() + geom_line() + 
  scale_x_continuous(breaks = 2005:2014, minor_breaks = NULL) +
  labs(title="Closing the gap?",
       y="proportion of first-generation students")