Good data visualizations can have a strong impact on how you think about a phenomena. Earlier in 2016, a single image made by Ed Hawkins brought home how much our global average temperature has changed in a way that even the famous hockey stick graph didn’t.
There is a lot going on in this image. It’s truly stunning. And a lot of it illustrates all of the best principles of data visualization - and it can be done in ggplot2! So, today, we’re going to use this graph as our point of entry into exploring data visualization and ggplot2. Along the way we’re going to explore many many aspects of ggplot2
that are available to us.
We’ll begin by loading in the data using readr
. We’re doing this so that, later, we can do some ordering by month - but can determine that ordering ourselves. We’ll also load dplyr
as we’ll use its functions in a few places today.
library(readr)
hadcrut_temp_anomoly <- read_csv("./data/hadcrut_temp_anomoly_1850_2015.csv")
## Parsed with column specification:
## cols(
## year = col_integer(),
## month = col_integer(),
## anomaly = col_double(),
## month_name = col_character()
## )
hadcrut_temp_anomoly
## # A tibble: 1,992 x 4
## year month anomaly month_name
## <int> <int> <dbl> <chr>
## 1 1850 1 -0.702 Jan
## 2 1851 1 -0.303 Jan
## 3 1852 1 -0.308 Jan
## 4 1853 1 -0.177 Jan
## 5 1854 1 -0.36 Jan
## 6 1855 1 -0.176 Jan
## 7 1856 1 -0.119 Jan
## 8 1857 1 -0.512 Jan
## 9 1858 1 -0.532 Jan
## 10 1859 1 -0.307 Jan
## # ... with 1,982 more rows
Here we see a data table with year, month, temperature anomoly for that month, and a name for the month. The month name is convenient - but - if we want to use it as an ordered variable later, we’re going to have problems. R by default orders alphabetically.
To impose a different ordering schema on month_name
, we’ll need to turn it into a factor
- a character vector that has a set of ordered levels. If we had used read.csv()
to load the data, this would have already been done - but is generally bad practice to assume.
To create a factor, we can use factor()
. We can see the order of the levels with levels()
.
hadcrut_temp_anomoly$month_name <- factor(hadcrut_temp_anomoly$month_name)
levels(hadcrut_temp_anomoly$month_name)
## [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct"
## [12] "Sep"
Uh oh - alphabetical! Enter forcats
- a wonderful library from the tidyverse
designed to work with factors in just such a scenario. There are a number of useful functions such as fct_recode()
for changing the names of different factor levels, fct_relevel()
for specifying arbitrary factor level orders, fct_rev()
for reversing level order, and more. We’re going to use fct_inorder
which specifies level order in the order we see the first appearance of each level. As the data is sorted by month, this should work out just like we need.
library(forcats)
hadcrut_temp_anomoly$month_name <- fct_inorder(hadcrut_temp_anomoly$month_name)
levels(hadcrut_temp_anomoly$month_name)
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
Notice now that month_name
’s class changes when we look at hadcrut_temp_anomoly
hadcrut_temp_anomoly
## # A tibble: 1,992 x 4
## year month anomaly month_name
## <int> <int> <dbl> <fct>
## 1 1850 1 -0.702 Jan
## 2 1851 1 -0.303 Jan
## 3 1852 1 -0.308 Jan
## 4 1853 1 -0.177 Jan
## 5 1854 1 -0.36 Jan
## 6 1855 1 -0.176 Jan
## 7 1856 1 -0.119 Jan
## 8 1857 1 -0.512 Jan
## 9 1858 1 -0.532 Jan
## 10 1859 1 -0.307 Jan
## # ... with 1,982 more rows
Exercise: Make a new column that contains the same information as month_name
. Now, try fct_rev()
, fct_relevel()
, and last, fct_recode()
on it to futz with the factor levels. After each one, look at the output of levels()
to see what you did.
All right! Let’s try out ggplot2
. ggplot2
works simply by you specifying things in roughly the following order:
1. A ggplot2
object which links to a data set and has information about aesthetics.
2. A geom
or geometry which specifies how the data is to be seen.
3. Other bells and whistles which we will get to.
We add these pieces together with a +
sign. So let’s start with something simple, just visualizing a distribution. First, we have to create a ggplot. We’ll link it to the hadcrut_temp_anomoly
data, and map x to anomaly using the aes
function, as we’re going to look at the distribution of anomaly.
library(ggplot2)
had_dens <- ggplot(data = hadcrut_temp_anomoly,
mapping = aes(x = anomaly))
had_dens
Wait, what? Nothing happened! That actually makes sense, as we haven’t told ggplot2
anything about plotting! There’s no geom
. For starters, let’s try the simple geom_histogram()
had_dens +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Nice! Aside from R yelling at us and giving telling us to use a different bin size, it’s not bad!
Exercise: Now you try geom_density
. How’s it look? If you want, try adding arguments like fill
, color
, and alpha
to geom_density()
if you want to get fancy.
To view multiple distributions, we’ll need aesthetics that specify groupings. We’ll look in a minute at how we put those on an axis themselves, but for now, let’s introduce group
.
had_dens_group <- ggplot(data = hadcrut_temp_anomoly,
mapping = aes(x = anomaly, group=month_name))
There’s a lot we can try here. Let’s revisit one old friend.
had_dens_group +
geom_density()
Well that shows something interesting, doesn’t it! What if we don’t overlap? We can try a few different ‘position’ arguments.
had_dens_group +
geom_density(position="stack")
had_dens_group +
geom_density(position=position_dodge(width=3))
## Warning: position_dodge requires non-overlapping x intervals
Exercise: Now you try geom_histogram
. How’s it look? Bad, right? Try different colors, alphas, and fills to see if you can improve. Maybe a different position?
ggplot2
contains a number of different types of geometries for displaying information in two dimensions However, all of them begin by giving ggplot2 some information, just as before. What data is going to be used? What elements of that data will map onto different aesthetic values and scales in the plot? We’ll start by creating the base of our plot with the x-axis being month and the y-axis being temperature anomoly.
had_plot_base <- ggplot(data = hadcrut_temp_anomoly,
mapping = aes(x = month_name, y = anomaly))
Great! We have the basics of our plot saved as an object.
To take a basic plot and add a geometry choice to it, we use one of the family of geom_
s in ggplot2
. One of the most basic plot types we ever encounter is the scatterplot. y versus x. That’s all. For that, we have geom_point()
had_plot_base +
geom_point()
This is great! But, well, a few things. First, man, lots of points overlap. If only they were kinda transparent. Second, maybe they’re too small? Each geom
has a number of options for different visual elements - color, size, alpha, etc. So, for example, we can modify this plot a bit.
had_plot_base +
geom_point(alpha = 0.5, size=3)
Note - all of these are relative to 1. Well, now we can see the distribution of points a bit better, given overlap. But….
Maybe we want to add a bit of random noise to the points, to better visualize what’s going on here. For that, there’s geom_jitter
.
had_plot_base +
geom_jitter()
Neat! You can see the density of each cluster for each month so much more clearly! Now, you may be wondering - hold on - I only want to add jitter to the x-axis, not the y-axis. For that, there’s the width and height argument of how much jitter room there is to spare. So - let’s say we want the points to vary in the x by \(\pm\) 0.5, but 0 in the y. And let’s throw in some alpha for good measure.
had_plot_base +
geom_jitter(width=0.5, height=0, alpha=0.8)
This is all well and good, but, eyeballing is not the same as some solid information on the distribution of the data. Let’s try some of the 2D ways of visualizing data.
geom_boxplot()
, geom_violin()
, stat_summary()
. Which do you prefer?ggridges
. You’ll have to swap x and y in your mapping, but, try out geom_density_ridges()
OK - but these are timeseries! We need lines, no? There is, of course, a geom_line
that shows lines connecting points, but does not show the points (yes, you can layer both geom_line
and geom_point
).
had_plot_base +
geom_line()
UGH - what happened? Welp, because month_name is a factor, all of the points within a month were connected. Oops! There is a way around it, and that is to specify what your groups are, rather than let ggplot define it for you.
One of the nice things about geoms is that we can add aesthetic elements from the data frame. Heck, if you want, you can add a geom with a whole new data set and new aesthetics, but for now, let’s just redefine the group aesthetic.
had_plot_base +
geom_line(mapping = aes(group=year))
Perfect! We are on our way!
Exericise: Try a line plot, using group = month
. Feel free to play with other elements of the plot, such as alpha.
We’ve got our line plot working the way we want it to. But we have other information.
R provides a number of different scales that can all be tied to a variable inside the aes
function. Most commonly, we’ll use color, fill, alpha, size, and shape. There are other options, such as lty for line type, but those will be specific to the geom.
As color is the most common modification, let’s concentrate on that. The lessons can be used broadly for other scales.
had_lines <- had_plot_base +
geom_line(mapping = aes(group=year, color=year))
had_lines
A few things to note. First, year is treated continuously, so it’s placed on a gradient. Second the default color gradient is from dark to light blue.
This is great, but what if we want different colors? If your colors are continuous, the first go-to to change them is scale_color_continuous
- and we add scales just like we added geoms before. Let’s start with a traditional blue to red gradient.
had_lines +
scale_color_continuous(low = "blue", high = "red")
This works great. We could also have used scale_color_gradient
exactly the same way to achieve the same results. If you want to know more of what colors are available in R, see this post or
This is much better. But it’s still hard to see the middle ranges of years For that, we have the scale_gradient_2
function, which takes an argument for what the midpoint color should be, but then you have to specify the value for that midpoint. Let’s just go with the 1925.
had_lines +
scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
midpoint = 1925)
Groovy! What if we wanted something more arbitrary - say, a 7 colors of the rainbow gradient? scale_color_gradientn
has you covered.
had_lines +
scale_color_gradientn(colors=rainbow(7))
Note the rainbow()
function. R comes with a few different color palatte functions (and see the colors
helpfile for how to view all of the colors in R). For each palatte, we feed it a number of colors, and get a vector back. Using some code from the rainbow
helpfile, here ar ethe default pallates.
Try one, and see what it does to you!
There are of course a ton of packages with other pallates our there. One of the most popular, because it’s color selection is based on research looking at color blindness, and how we see sequential or diverging palattes of color, is RColorBrewer
. You can view a lot more about it at http://colorbrewer2.org/ - for now, let’s take a gander at what it provides.
#install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all(n=10, exact.n=FALSE)
That’s a lot. Which of these do you think is best for seeing differences between years? Why? I admit, I’m often partial to BrBG
had_lines +
scale_color_gradientn(colors=brewer.pal(n = 7,
name = "BrBG"))
In point of fact, there are many different color palette packages out there. I’ll just leave you with two more. The first is my personal favorite - the Wes Anderson package - https://github.com/karthik/wesanderson. Don’t worry, there are install packages on the github page.
The second is the viridis palatte. This is a pretty common yellow–blue-based palette (and the package - https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html - also comes with a ‘magma’ palette as well.) These two palettes are great as they show up well for people with color blindness. Heck - if you really want to have color-blind safe color schemes, you might also want to check out the dichromat
package - it not only contains multiple colorblind safe palettes, but it also allows you to see what your color palette will look like under different forms of color blindness.
So, it’s no accident that viridis is what was used for the original animation. Let’s make it our own. We’ll also add guide = 'none'
to lose the colorbar at ths side.
library(viridis)
## Loading required package: viridisLite
had_lines +
scale_color_viridis(guide="none")
What happens to our histogram if we add a fill argument. Try fill = month
and fill = factor(year)
(note that without the factor, nothing happens!). You might also want to remove the guide with + guides(fill = "none")
.
Create a line plot of year on the x and anomoly on the y. Include month as a group. Try coloring by anomoly as well. First use the default palatte, and then try a blue-white-red.
Now color that line plot by month. Try at least two different palettes either of your own devising or from an additional package. Note, if you use month_name
, you’ll be mucking about with scale_color_manual
. Or you can use RColorBrewer which handles discrete groups easily using scale_color_brewer()
.
Before we get to the circular animation hotness, I wanted to take a brief digression to show another way to visualize multiple dimensions. Figures often have multiple panels, each of which shows a different slice of the data. In ggplot
, these are called facets. Before we jump whole hog into seeing this big timeseries, I want us to really look at each month and see if there are trends there. So, I want us to begin a slightly different plot - let’s look at temperature anomoly over time, but with different lines showing different months, and the x axis is year.
had_months <- ggplot(data = hadcrut_temp_anomoly,
mapping = aes(x= year, y = anomaly, group = month_name)) +
geom_line()
had_months
If you’d like, try adding a color and muck about with scale_color_discrete
and its values
argument. Categorical groupings are really a place also where RColorBrewer
and wesanderson
shine.
But, what if instead of separating by color - as these lines all overlap - we really wanted to see each line on its own in a panel. There are two functions that can help us here. The first is facet_wrap()
had_months +
facet_wrap(~month_name)
The second is a function that allows us to facet with two variables - one with rows and one with columns. We can’t do that with this example, but, in essence, you would use a similar tilda notation as above - facet_grid(rows ~ cols)
It’s a little bit more difficult to facet our monthly plot. Maybe we want to split it up by decade. But, wait, year is continuous. What to do? R and ggplot2
have a number of functions based off of the base function cut()
. Let’s try the first, that can make decadal plots.
had_lines +
facet_wrap(~cut_width(year, 10))
It’s not pretty (that’s a lot of decades!), but you can see the principle.
Exercise: 1. Try out cut_interval
and cut_number
. What do they do?
2. Make a plot where facets and colors reflect the same information.
As we move further into statistical visualization, ggplot2
has a number of statistical summary plots. THe first, for grouped data, is stat_summary
. This function will take grouped data and calculate means and other summary information. fun.data
takes a function as an argument to calculate summary statistics. For example:
had_plot_base +
stat_summary(fun.data=mean_se)
For continuous data, we often look for trends. So going back to our faceted plot before, we can add a smoothed spline to each panel with:
had_months +
facet_wrap(~month_name) +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
If instead we wanted a straight linear model fit or fit from another model, we can use the method
argument.
had_months +
facet_wrap(~month_name) +
stat_smooth(method="lm")
OK, getting back to our initial graph - the grey background, the font type, etc., might not really be doing it for you.
Ugh. Grey background. Weird white lines. Maybe we don’t like the default. Ggplot2 provides some alternatives wtih theme
functions. Now, you can specify what you’d like to your heart’s content, but, there are a few canned differnt themes that can be quite nice. The two I use most commonly are theme_bw()
and theme_void()
had_lines_color + theme_bw()
had_lines_color + theme_void()
The theme
function is highly dynamic - you can specify simple items, such as the base_size
for basic font sizes, etc. Or you can customize to your hearts content, with angle of text on axes, different color schemes, etc. See ?theme
or the theme vignette.
But it doesn’t stop there. There are whole libraries of themes for you to try! And they are constantly being updated! Want to make your figure look like it came from Excel, fivethirtyeight.com, or was made by Tufte himself?
#install.packages("ggthemes")
library(ggthemes)
had_lines_color +
theme_excel()
had_lines_color +
theme_fivethirtyeight()
had_lines_color +
theme_tufte(base_size=17)
See the ggthemes
package vignette for more.
But for us, let’s go with the solarized theme with the option light=FALSE
, as it’s close to the original.
had_lines_color_theme <- had_lines_color +
theme_solarized(light=FALSE)
Exercise: Take the year by anomoly plots and apply three different themes of your choice to them.
OK, now for the fripperies and frills - annotating our plots with information! In the GIF above, we wad demarcations of certain critical threshold as well as a title. There were also years that changed as time went by, but we’ll save that for a moment.
First, there are two critical lines - 1.5 and 2.0 degrees C. To add a vertical or horizontal line to the plot, we use geom_vline
or geom_hline
. These accept a xintercept
or yintercept
, respectively, to indicate where they should cross a plot. So, to add to our current plot.
had_lines_annotated <- had_lines_color_theme +
geom_hline(yintercept=c(1.5, 2.0), color="red")
had_lines_annotated
We also want to have some annotations on the lines that say what each one is. For that, we could use geom_text
, but, that can get messy with facets, etc. Better to use annotate
. This function takes an x and y coordinate, a geom
type, and then all of the arguments needed for that geom. Now, often we use this with just geom_text
, but that gets tricky given that we want to overlay this on a line.
had_lines_annotated +
annotate(x=c(1,1), y=c(1.5,2), geom="text", label=c("1.5C", "2.0C"))
Instead, we want something that is a label
- something that has a background. We need to specify the background color, which I’ll put in in hex code, as I don’t know it’s name (looked this up in some information about solarized color palattes), and a color for the text - white.
had_lines_annotated <- had_lines_annotated+
annotate(x=c(1,1), y=c(1.5,2),
geom="label",
label=c("1.5C", "2.0C"),
fill="#002b36",
color="white", label.size=0)
had_lines_annotated
Gorgeous!
And last, titles are easy. There’s a function for it! ggtitle
had_lines_annotated <- had_lines_annotated +
ggtitle("Global Temperature Change (1850-2015)")
had_lines_annotated
Of course, we’ve done this with simple cartesian axes that we haven’t modified at all. We don’t always want to do that.
First, we want to have the lines extend all the way to the edges of the graph. So, to do that, we want the limits of the x-axis to be set to Jan and Dec. The xlim()
function (and there is a ylim()
function as well) should take care of that. If we had a continuous scale, we’d just feed a vector, with minimum and maximum - e.g. c(1,12)
.
But, here we want to feed the factor vector from which everything is derived. We’ve taken care of ordering, so, we no longer need xlim()
. Instead, we want to use the scale_x_discrete
function that lets us set properties of an x axis with discrete values. In our case, we want to alter how much additional space is placed around the first and last month using the expand
argument. So let’s cut that down to 0!
had_lines_annotated_final <- had_lines_annotated +
coord_cartesian(expand=FALSE)
had_lines_annotated_final
The labels get cut off - but they’ll come back in a second!
There are a number of ways to modify an axis. Using scale_x_continuous()
provides access to a rich number of ways to modify the x axis (or replace with y, and you get the picture).
Often we’re interested in log transforming our axes:
had_lines_annotated_final +
scale_y_log10()
Here, this isn’t a great idea, as we lose the negative numbers. But, you can begin to see the power of how the different axis scales can work. We can invoke an arbitrary scale using the trans
argument of scale_x_continuous()
, although there are many already buit in.
We want a circular plot, however, which implies polar coordinates. And there’s an option for that, too! We’ll also have to deal with that expansion issue again.
had_lines_annotated_final <- had_lines_annotated_final +
coord_polar()+
scale_x_discrete(expand=c(0,0))
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
had_lines_annotated_final
Oh! Fancy! Almost there!
The whole month name, anomaly think on the axes is a little odd. Fortunately, we can replace the string labels on either axis. We could have put in other things, but for now, let’s go blank.
had_lines_annotated_final <- had_lines_annotated_final +
xlab("") +
ylab("")
had_lines_annotated_final
For more great ggplot2 extensions, see https://www.ggplot2-exts.org/
Right now, we’re in the midst of a great re-write of the package that actually performs animations with ggplot. The gganimate package attempts to implement a nice consistent grammer of animation, much like a grammar for graphics. It’s not on CRAN yet, so we’ll have to install it and its companion, transformr
with devtools.
install.packages("devtools") #if you don't have it yet
devtools::install_github("thomasp85/farver")
devtools::install_github("thomasp85/tweenr")
devtools::install_github("thomasp85/transformr")
devtools::install_github("thomasp85/gganimate")
gganimate
introduces a few new aesthetics for animations. The first is the transition_*
series that defines how elements move between each other in a plot. For example, to generate the initial animation, we’d use the transition_reveal()
which gradually adds pieces to the animation. It uses the first argument to say what is being added and the along
argument to say in what order it should be added.
library(gganimate)
had_lines_annotated_final +
transition_reveal(year, along = year)
What happens if you use along=as.numeric(month_name)
? Note, we convert to a numeric as gganimate
needs a number, and does not know how to order characters.
There are many many other elements to gganimate - it’s a growing library - such as leaving trails behind points, and more - but that’s a tutorial for another time (or one you should write!)
had_points <- ggplot(data = hadcrut_temp_anomoly %>% filter(year < 1870),
aes(x = year, y = anomaly, color = month_name)) +
geom_point()
had_points_anim <- had_points +
transition_time(year)
had_points_anim +
shadow_trail(0.05, max_frames=10) +
exit_fade()