ggplot2 is another way of doing graphics in R.
Honestly I am a beginner at ggplot too; old dog new tricks, you know.
For those of you who are starting out, ggplot is a good choice, and Susan and I think you will like it. It does have a slight learning curve and some idiosyncratic terminology, but then again you get to think beautiful aesthetic thoughts while analyzing data.
Two useful resources for which you can find links on our Canvas site include:
We have reason to be proud. Leland Wilkinson, inventor of the grammar of graphics, is a Yale Ph.D. in Psych. In fact he dedicated his book (which introduced the subject) to his daughter and to John Hartigan (Statistics prof at Yale).
Three key components of a graph:
Aesthetic mappings? Wilkinson writes that the word aesthetics derives from a Greek word \[\alpha\iota\sigma\theta\eta\sigma\iota\varsigma,\] which means perception. The additional meanings related to beauty and artistic criteria arose in the 18th century.
Other aspects include “facets” – multiple small plots – more about this below.
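To make the components concrete, here is a minimal sketch. It assumes the three components are the data, the aesthetic mappings, and at least one layer (a geom), which is how the ggplot2 documentation usually describes a plot. The data frame `toy` and its values are made up purely for illustration.

```r
library(ggplot2)

# Hypothetical toy data, invented for this illustration only.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))

# 1. data, 2. aesthetic mappings (x and y), 3. a layer of points (a geom).
p <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_point()

# The plot object stores each component separately.
names(p$mapping)   # the aesthetics that were mapped
length(p$layers)   # the number of layers added so far
p
```

Nothing is drawn until the object is printed, which is why `p` appears on a line by itself at the end.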
The ggplot2 library has a dataset mpg. Note that this is a subset of a full data set that you can find at www.fueleconomy.gov. If you like this kind of thing, it could be a dataset for a project.
mpg # error
## Error in eval(expr, envir, enclos): object 'mpg' not found
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
# tibbles have nice informative displays!
Scatterplots may be the most useful and most used type of graphical display, so let’s start here.
ggplot(data = mpg, mapping = aes(x=displ, y=hwy))
# Plots usually start this way: specify data and some aesthetic mappings.
# The line above sets up axes using x and y variables, but doesn't put anything there.
# To add a layer of points we do this:
ggplot(data = mpg, mapping = aes(x=displ, y=hwy)) + geom_point()
Changing “theme”:
ggplot(data = mpg, mapping = aes(x=displ, y=hwy)) + geom_point() + theme_bw()
ggplot(data = mpg, mapping = aes(x=displ, y=hwy)) + geom_point() + theme_classic()
Adding color to the points:
ggplot(data = mpg, mapping = aes(x=displ, y=hwy, color=class)) + geom_point()
ggplot(mpg, aes(displ, hwy, color=class)) + geom_point()
#^^ The previous 2 plots are the same. Last one is nice and concise, taking advantage of the fact that we don't have to give argument names if we are using them in their default positions.
# Can also do it this way, putting the aes inside the geom_point:
ggplot(data = mpg, mapping = aes(x=displ, y=hwy)) + geom_point(mapping = aes(color=class))
Compared to the base graphics, it’s great that the legend is not inside the plot, and it was made with no pain!
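Where the aes goes starts to matter once a plot has more than one layer. The sketch below is one way to see it, not something from class: it borrows geom_smooth (covered later in these notes) as a second layer. Aesthetics given in ggplot() are inherited by every layer, while aesthetics given inside a geom apply to that layer only.

```r
library(ggplot2)

# color mapped in ggplot(): every layer inherits it,
# so the smoother fits one line per class.
p_global <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# color mapped only in geom_point(): the smoother sees no grouping,
# so it fits a single line through all the points.
p_local <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE)

p_global
p_local
```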
But wait, there are 234 rows and I don’t think I see 234 points. We should jitter.
ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter()
ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter(width=0,height=0)
ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter(width=0,height=1)
ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter(width=.1,height=0)
ggplot(mpg, aes(displ, hwy, color=class)) + geom_jitter(width=1,height=0)
reps <- 100
d <- d0 <- data.frame(x=rep(c(0,0,1,1),reps),y=rep(c(0,1,0,1),reps))
ggplot(d, aes(x,y)) + geom_point()
ggplot(d, aes(x,y)) + geom_jitter()
ggplot(d, aes(x,y)) + geom_jitter(height=0.1)
ggplot(d, aes(x,y)) + geom_jitter(width=0.1)
ggplot(d, aes(x,y)) + geom_jitter(width=0.1, height = 0.1)
d <- rbind(d0, data.frame(x=.1,y=0))
ggplot(d, aes(x,y)) + geom_point()
ggplot(d, aes(x,y)) + geom_jitter()
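The hunch about missing points can be checked by counting. This is a small sketch: it compares the number of rows with the number of distinct (displ, hwy) positions, then draws a jittered plot with a fixed seed so the picture is the same every time. The particular width, height and seed values are arbitrary choices for illustration. As far as I understand the documentation, when width and height are omitted geom_jitter uses 40% of the resolution of the data, and the noise is added in both directions.

```r
library(ggplot2)

xy <- as.data.frame(mpg[, c("displ", "hwy")])
n_rows <- nrow(xy)              # number of cars
n_distinct <- nrow(unique(xy))  # number of distinct (displ, hwy) positions
c(rows = n_rows, distinct = n_distinct)

# A seed makes the random jitter reproducible from one drawing to the next.
ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point(position = position_jitter(width = 0.05, height = 0.5, seed = 1))
```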
Facets make separate plots of subsets determined by the values of a variable.
p <- ggplot(mpg, aes(displ, hwy))
p + facet_wrap(~class)
p + facet_wrap(~class) + geom_point()
# compare to using an aesthetic:
p + geom_jitter(aes(color=class))
# Can specify the number of columns:
p + facet_wrap(~cyl, ncol = 3) + geom_point()
Scales free vs constrained: by default the scales on the axes are consistent with each other across the facets. But this can be controlled, and you can “free” the x and/or y scales:
p + facet_wrap(~cyl) + geom_point()
p + facet_wrap(~cyl, scales = "free") + geom_point()
p + facet_wrap(~cyl, scales = "free_y") + geom_point()
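For completeness, the scales argument of facet_wrap accepts "fixed" (the default), "free", "free_x" and "free_y". The sketch below shows the one case not tried above, freeing only the x axis; the object name `p_free_x` is just for this example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy))

# Only the x axis is freed; all panels still share one y axis,
# so heights remain directly comparable across panels.
p_free_x <- p + facet_wrap(~cyl, scales = "free_x") + geom_point()
p_free_x
```

Free scales use each panel's space better, but they make comparisons across panels harder, since the same position no longer means the same value.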
Smoothing, including linear and also nonlinear regression using “loess” (for “locally weighted regression”):
p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess'
# Default is "loess" method. Smoothing is controlled by "span" parameter, the larger the smoother.
p + geom_point() + geom_smooth(span=.2)
## `geom_smooth()` using method = 'loess'
p + geom_point() + geom_smooth(span=1)
## `geom_smooth()` using method = 'loess'
# If you don't want the confidence bands:
p + geom_point() + geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess'
# geom_smooth can do other kinds of regression, including linear:
p + geom_point() + geom_smooth(method = "lm")
p + geom_point() + geom_smooth(method = "lm", se=FALSE)
# An example with facets:
p + facet_wrap(~cyl) + geom_smooth(method = "lm") #oops no points!
p + facet_wrap(~cyl) + geom_smooth(method = "lm") + geom_point()
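It may help to see that geom_smooth is not doing anything mysterious. The sketch below fits the same kinds of models by hand with lm() and loess() from base R and draws the loess curve as an ordinary line. The names `fit_lm`, `fit_loess` and `grid` are just for this example, and I would expect the hand-drawn curve to match geom_smooth(span=1) closely rather than promise it is identical to the last digit.

```r
library(ggplot2)

# The straight line from geom_smooth(method = "lm") is an ordinary lm() fit.
fit_lm <- lm(hwy ~ displ, data = mpg)
coef(fit_lm)

# A loess fit can be computed the same way; span plays the same role as above.
fit_loess <- loess(hwy ~ displ, data = mpg, span = 1)
grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 50))
grid$hwy <- predict(fit_loess, newdata = grid)

# Draw the hand-computed curve over the points.
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_line(data = grid, color = "blue")
```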
ggplot(mpg, aes(drv,hwy)) + geom_boxplot()
ggplot(mpg, aes(drv,hwy)) + geom_point()
ggplot(mpg, aes(drv,hwy)) + geom_jitter(width = .1)
ggplot(mpg, aes(drv,hwy)) + geom_violin()
Here’s something you can do if you want; we didn’t have time to see this in class. We can reorder a factor, e.g. class, to make boxplots nicer.
ggplot(mpg, aes(class,hwy)) + geom_boxplot()
class1 <- reorder(mpg$class, mpg$hwy)
class1
## [1] compact compact compact compact compact compact
## [7] compact compact compact compact compact compact
## [13] compact compact compact midsize midsize midsize
## [19] suv suv suv suv suv 2seater
## [25] 2seater 2seater 2seater 2seater suv suv
## [31] suv suv midsize midsize midsize midsize
## [37] midsize minivan minivan minivan minivan minivan
## [43] minivan minivan minivan minivan minivan minivan
## [49] pickup pickup pickup pickup pickup pickup
## [55] pickup pickup pickup suv suv suv
## [61] suv suv suv suv pickup pickup
## [67] pickup pickup pickup pickup pickup pickup
## [73] pickup pickup suv suv suv suv
## [79] suv suv suv suv suv pickup
## [85] pickup pickup pickup pickup pickup pickup
## [91] subcompact subcompact subcompact subcompact subcompact subcompact
## [97] subcompact subcompact subcompact subcompact subcompact subcompact
## [103] subcompact subcompact subcompact subcompact subcompact subcompact
## [109] midsize midsize midsize midsize midsize midsize
## [115] midsize subcompact subcompact subcompact subcompact subcompact
## [121] subcompact subcompact suv suv suv suv
## [127] suv suv suv suv suv suv
## [133] suv suv suv suv suv suv
## [139] suv suv suv compact compact midsize
## [145] midsize midsize midsize midsize midsize midsize
## [151] suv suv suv suv midsize midsize
## [157] midsize midsize midsize suv suv suv
## [163] suv suv suv subcompact subcompact subcompact
## [169] subcompact compact compact compact compact suv
## [175] suv suv suv suv suv midsize
## [181] midsize midsize midsize midsize midsize midsize
## [187] compact compact compact compact compact compact
## [193] compact compact compact compact compact compact
## [199] suv suv pickup pickup pickup pickup
## [205] pickup pickup pickup compact compact compact
## [211] compact compact compact compact compact compact
## [217] compact compact compact compact compact subcompact
## [223] subcompact subcompact subcompact subcompact subcompact midsize
## [229] midsize midsize midsize midsize midsize midsize
## attr(,"scores")
## 2seater compact midsize minivan pickup subcompact
## 24.80000 28.29787 27.29268 22.36364 16.87879 28.14286
## suv
## 18.12903
## Levels: pickup suv minivan 2seater midsize subcompact compact
ggplot(mpg, aes(class1,hwy)) + geom_boxplot()
class2 <- reorder(mpg$class, mpg$hwy, FUN = median)
class2
## [1] compact compact compact compact compact compact
## [7] compact compact compact compact compact compact
## [13] compact compact compact midsize midsize midsize
## [19] suv suv suv suv suv 2seater
## [25] 2seater 2seater 2seater 2seater suv suv
## [31] suv suv midsize midsize midsize midsize
## [37] midsize minivan minivan minivan minivan minivan
## [43] minivan minivan minivan minivan minivan minivan
## [49] pickup pickup pickup pickup pickup pickup
## [55] pickup pickup pickup suv suv suv
## [61] suv suv suv suv pickup pickup
## [67] pickup pickup pickup pickup pickup pickup
## [73] pickup pickup suv suv suv suv
## [79] suv suv suv suv suv pickup
## [85] pickup pickup pickup pickup pickup pickup
## [91] subcompact subcompact subcompact subcompact subcompact subcompact
## [97] subcompact subcompact subcompact subcompact subcompact subcompact
## [103] subcompact subcompact subcompact subcompact subcompact subcompact
## [109] midsize midsize midsize midsize midsize midsize
## [115] midsize subcompact subcompact subcompact subcompact subcompact
## [121] subcompact subcompact suv suv suv suv
## [127] suv suv suv suv suv suv
## [133] suv suv suv suv suv suv
## [139] suv suv suv compact compact midsize
## [145] midsize midsize midsize midsize midsize midsize
## [151] suv suv suv suv midsize midsize
## [157] midsize midsize midsize suv suv suv
## [163] suv suv suv subcompact subcompact subcompact
## [169] subcompact compact compact compact compact suv
## [175] suv suv suv suv suv midsize
## [181] midsize midsize midsize midsize midsize midsize
## [187] compact compact compact compact compact compact
## [193] compact compact compact compact compact compact
## [199] suv suv pickup pickup pickup pickup
## [205] pickup pickup pickup compact compact compact
## [211] compact compact compact compact compact compact
## [217] compact compact compact compact compact subcompact
## [223] subcompact subcompact subcompact subcompact subcompact midsize
## [229] midsize midsize midsize midsize midsize midsize
## attr(,"scores")
## 2seater compact midsize minivan pickup subcompact
## 25.0 27.0 27.0 23.0 17.0 26.0
## suv
## 17.5
## Levels: pickup suv minivan 2seater subcompact compact midsize
ggplot(mpg, aes(class2,hwy)) + geom_boxplot()
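A slightly more compact variant, sketched here as an alternative rather than the way we did it: the call to reorder can go directly inside aes, so no extra variable is needed. Note that reorder uses the mean unless FUN says otherwise. The axis label is my own wording.

```r
library(ggplot2)

# Reorder inside aes(): no extra variable is created in the workspace.
p_box <- ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) +
  geom_boxplot() +
  labs(x = "class (ordered by median hwy)")
p_box

# levels() shows just the ordering, without printing all 234 values.
levels(reorder(mpg$class, mpg$hwy, FUN = median))
```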
For histograms, just use an “x” aesthetic, not x and y:
ggplot(mpg, aes(hwy))
ggplot(mpg, aes(hwy)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 5)
ggplot(mpg, aes(hwy)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
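The message about bins is worth acting on. Here is a small sketch of the two ways to control the binning, either the width of each bin or the number of bins; the values 2 and 15 are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

range(mpg$hwy)  # the span of values the bins have to cover

# Either fix the width of each bin ...
p_width <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2)
# ... or fix how many bins there are.
p_bins <- ggplot(mpg, aes(hwy)) + geom_histogram(bins = 15)

p_width
p_bins

# Whatever the binning, the bar heights add up to the number of rows.
sum(layer_data(p_width)$count)
```

Specifying either argument also makes the `stat_bin()` message go away.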
Subsetting by drv:
p <- ggplot(mpg, aes(x = hwy))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_histogram() + facet_wrap(~drv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_histogram() + facet_wrap(~drv, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_histogram(aes(fill=drv)) + facet_wrap(~drv, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A couple more things thrown in for fun (didn’t get to see these in class):
p + aes(color=drv) + geom_freqpoly(lwd=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + aes(color=drv) + geom_density(lwd=2)
# See cheatsheet, p. 2, "Labels" section.
ggplot(mpg, aes(displ, hwy)) + geom_point() + labs(x="displacement", y="highway mpg", title="Fuel efficiency decreases with engine size")
source(file = "http://www.stat.yale.edu/~jtc5/STAT230/data/170309-college-dump.r")
colleges
## [1] "Yale University"
## [2] "Harvard University"
## [3] "Princeton University"
## [4] "University of Connecticut"
## [5] "University of Massachusetts-Amherst"
## [6] "Rutgers University-New Brunswick"
r
## # A tibble: 60 × 3
## firstgen college year
## <dbl> <chr> <int>
## 1 0.1430281 Yale University 2005
## 2 0.1476378 Yale University 2006
## 3 0.1610787 Yale University 2007
## 4 0.1844106 Yale University 2008
## 5 0.2078804 Yale University 2009
## 6 0.1866667 Yale University 2010
## 7 0.1857335 Yale University 2011
## 8 0.2054208 Yale University 2012
## 9 0.2009132 Yale University 2013
## 10 0.2246256 Yale University 2014
## # ... with 50 more rows
The plot we did last time:
colors <- c("blue", "red", "black", "blue", "red", "black")
types <- c(1,1,1,2,2,2)
plot(firstgen ~ year, data=r, type="n", ylab="first-generation student proportion")
for(i in seq_along(colleges)){
  rows <- which(r$college == colleges[i])
  lines(r$year[rows], r$firstgen[rows], type="b", lty=types[i], col=colors[i])
}
#
# Looks ok up to here. Let's add a legend:
#
legend("bottomright", legend=colleges, col=colors, lty=types, pch=1) # cex=0.65; pch=1 matches the open circles drawn by lines(type="b")
title(main="First-generation students at private and public schools")
Here is a way to make a similar plot using ggplot, again putting the legends outside the coordinate axes where they belong.
# View(r)
r$type <- rep(c("private","public"), each=30)
ggplot(r, aes(x = year, y=firstgen, color=college)) +
  geom_point()
ggplot(r, aes(x = year, y=firstgen, color=college, linetype=type)) +
  geom_point() + geom_line()
That looks quite good to me, and it’s easy too!
Oops, just noticed the years like 2007.5, which doesn’t seem so nice. We can modify the x axis scale with scale_x_continuous:
ggplot(r, aes(x = year, y=firstgen, color=college, linetype=type)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 2005:2014)
That’s nice. We don’t need the grid lines between the “major” grid lines, so here is a way to remove them:
ggplot(r, aes(x = year, y=firstgen, color=college, linetype=type)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 2005:2014, minor_breaks = NULL)
Oh, and I forgot a title:
ggplot(r, aes(x = year, y=firstgen, color=college, linetype=type)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 2005:2014, minor_breaks = NULL) +
  labs(title="Closing the gap?",
       y="proportion of first-generation students")
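One caution about the type column: rep(..., each=30) only gives the right labels if the rows happen to be grouped by college in the right order. A sketch of a lookup that does not depend on row order follows. Everything in `toy` (the college names and the numbers) is invented so the example runs by itself. With the real data, the analogous lookup would presumably be `r$college %in% colleges[1:3]`, assuming the first three entries of colleges are the private schools.

```r
library(ggplot2)

# Hypothetical stand-in for r: the college names and numbers are invented.
toy <- data.frame(
  college  = rep(c("College A", "College B"), each = 3),
  year     = rep(2005:2007, times = 2),
  firstgen = c(0.15, 0.16, 0.18, 0.30, 0.31, 0.29)
)
private <- c("College A")  # assumed list of private schools

# Label each row by looking up its college, independent of row order.
toy$type <- ifelse(toy$college %in% private, "private", "public")

ggplot(toy, aes(x = year, y = firstgen, color = college, linetype = type)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 2005:2007, minor_breaks = NULL)
```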