Good data visualizations can have a strong impact on how you think about a phenomenon. Earlier in 2016, a single image made by Ed Hawkins brought home how much our global average temperature has changed in a way that even the famous hockey stick graph didn’t.

[Animated GIF: Ed Hawkins’s temperature spiral, from http://www.climatecentral.org/news/see-earths-temperature-spiral-toward-2c-20332]

There is a lot going on in this image. It’s truly stunning. And it illustrates many of the best principles of data visualization - and it can be done in ggplot2! So, today, we’re going to use this graph as our point of entry into exploring data visualization and ggplot2. Along the way we’re going to explore many of the aspects of ggplot2 that are available to us.

1. Load the data

We’ll begin by loading in the data using readr. We’re doing this because readr keeps text columns as plain character vectors, so that, later, we can do some ordering by month - but can determine that ordering ourselves. We’ll also load dplyr as we’ll use its functions in a few places today.

library(readr)
library(dplyr)

hadcrut_temp_anomoly <- read_csv("./data/hadcrut_temp_anomoly_1850_2015.csv")
## Parsed with column specification:
## cols(
##   year = col_integer(),
##   month = col_integer(),
##   anomaly = col_double(),
##   month_name = col_character()
## )
hadcrut_temp_anomoly
## # A tibble: 1,992 x 4
##     year month anomaly month_name
##    <int> <int>   <dbl> <chr>     
##  1  1850     1  -0.702 Jan       
##  2  1851     1  -0.303 Jan       
##  3  1852     1  -0.308 Jan       
##  4  1853     1  -0.177 Jan       
##  5  1854     1  -0.36  Jan       
##  6  1855     1  -0.176 Jan       
##  7  1856     1  -0.119 Jan       
##  8  1857     1  -0.512 Jan       
##  9  1858     1  -0.532 Jan       
## 10  1859     1  -0.307 Jan       
## # ... with 1,982 more rows

Here we see a data table with year, month, temperature anomaly for that month, and a name for the month. The month name is convenient - but - if we want to use it as an ordered variable later, we’re going to have problems. R by default orders alphabetically.

1.1 Factors and Forcats

To impose a different ordering schema on month_name, we’ll need to turn it into a factor - a categorical vector whose values come from a fixed set of levels that have a defined order. In versions of R before 4.0, read.csv() would have done this conversion by default when loading the data - but that is generally bad practice to assume.

To create a factor, we can use factor(). We can see the order of the levels with levels().

hadcrut_temp_anomoly$month_name <- factor(hadcrut_temp_anomoly$month_name)

levels(hadcrut_temp_anomoly$month_name)
##  [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct"
## [12] "Sep"

Uh oh - alphabetical! Enter forcats - a wonderful package from the tidyverse designed to work with factors in just such a scenario. There are a number of useful functions such as fct_recode() for changing the names of different factor levels, fct_relevel() for specifying arbitrary factor level orders, fct_rev() for reversing level order, and more. We’re going to use fct_inorder(), which sets the level order to the order in which each level first appears. As the data is sorted by month, this should work out just like we need.

library(forcats)

hadcrut_temp_anomoly$month_name <- fct_inorder(hadcrut_temp_anomoly$month_name)

levels(hadcrut_temp_anomoly$month_name)
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
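If you already know the order you want, base R can impose it directly. Here is a minimal sketch, using a made-up vector rather than the HadCRUT data, and assuming English month abbreviations so that the labels match base R’s built-in month.abb constant:

```r
# Toy month labels in scrambled order (made-up data, not the HadCRUT file)
months_chr <- c("Mar", "Jan", "Dec", "Feb", "Jan")

# Default factor(): levels are sorted alphabetically
alphabetical <- factor(months_chr)
levels(alphabetical)
## [1] "Dec" "Feb" "Jan" "Mar"

# Explicit levels: month.abb is base R's built-in vector "Jan" ... "Dec"
calendar <- factor(months_chr, levels = month.abb)
levels(calendar)

# The underlying integer codes now follow calendar order
as.integer(calendar)
## [1]  3  1 12  2  1
```

Unlike fct_inorder(), this does not depend on how the rows happen to be sorted, at the cost of having to spell out (or look up) the levels yourself.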

Notice now that month_name’s class has changed from <chr> to <fct> when we look at hadcrut_temp_anomoly.

hadcrut_temp_anomoly
## # A tibble: 1,992 x 4
##     year month anomaly month_name
##    <int> <int>   <dbl> <fct>     
##  1  1850     1  -0.702 Jan       
##  2  1851     1  -0.303 Jan       
##  3  1852     1  -0.308 Jan       
##  4  1853     1  -0.177 Jan       
##  5  1854     1  -0.36  Jan       
##  6  1855     1  -0.176 Jan       
##  7  1856     1  -0.119 Jan       
##  8  1857     1  -0.512 Jan       
##  9  1858     1  -0.532 Jan       
## 10  1859     1  -0.307 Jan       
## # ... with 1,982 more rows

Exercise: Make a new column that contains the same information as month_name. Now, try fct_rev(), fct_relevel(), and last, fct_recode() on it to futz with the factor levels. After each one, look at the output of levels() to see what you did.

2. Your first ggplot

2.1 Visualizing distributions

All right! Let’s try out ggplot2. You build a ggplot2 plot by specifying things in roughly the following order:
1. A ggplot2 object which links to a data set and has information about aesthetics.
2. A geom or geometry which specifies how the data is to be seen.
3. Other bells and whistles which we will get to.

We add these pieces together with a + sign. So let’s start with something simple, just visualizing a distribution. First, we have to create a ggplot. We’ll link it to the hadcrut_temp_anomoly data, and map x to anomaly using the aes function, as we’re going to look at the distribution of anomaly.

library(ggplot2)

had_dens <- ggplot(data = hadcrut_temp_anomoly,
       mapping = aes(x = anomaly))

had_dens

Wait, what? Nothing happened! That actually makes sense, as we haven’t told ggplot2 anything about plotting! There’s no geom. For starters, let’s try the simple geom_histogram()

had_dens +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Nice! Aside from R yelling at us and telling us to use a different bin size, it’s not bad!
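To see why the bare ggplot object drew an empty panel, it can help to peek at the layers it holds. This is a small sketch on made-up numbers (the toy data frame and the binwidth of 0.5 are my own choices, not from the HadCRUT file):

```r
library(ggplot2)

# Small made-up data set standing in for the anomaly column
toy <- data.frame(anomaly = c(-0.7, -0.3, -0.1, 0.2, 0.4, 0.9))

# Data and aesthetics only: nothing to draw yet
base <- ggplot(data = toy, mapping = aes(x = anomaly))
length(base$layers)       # no layers yet

# Adding a geom adds a layer; setting binwidth also silences the message
with_hist <- base + geom_histogram(binwidth = 0.5)
length(with_hist$layers)  # one layer: the histogram

with_hist
```

Every geom (or stat) you add with + becomes another entry in that list of layers, drawn in the order you added them.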

Exercise: Now you try geom_density. How’s it look? If you want to get fancy, try adding arguments like fill, color, and alpha to geom_density().

2.2 Visualizing multiple distributions

To view multiple distributions, we’ll need aesthetics that specify groupings. We’ll look in a minute at how we put those on an axis themselves, but for now, let’s introduce group.

had_dens_group <- ggplot(data = hadcrut_temp_anomoly,
       mapping = aes(x = anomaly, group=month_name))

There’s a lot we can try here. Let’s revisit one old friend.

had_dens_group + 
  geom_density()

Well that shows something interesting, doesn’t it! What if we don’t overlap? We can try a few different ‘position’ arguments.

had_dens_group + 
  geom_density(position="stack")

had_dens_group + 
  geom_density(position=position_dodge(width=3))
## Warning: position_dodge requires non-overlapping x intervals

Exercise: Now you try geom_histogram. How’s it look? Bad, right? Try different colors, alphas, and fills to see if you can improve. Maybe a different position?

3. Two dimensions.

3.1 Starting a plot

ggplot2 contains a number of different types of geometries for displaying information in two dimensions. However, all of them begin by giving ggplot2 some information, just as before. What data is going to be used? What elements of that data will map onto different aesthetic values and scales in the plot? We’ll start by creating the base of our plot with the x-axis being month and the y-axis being temperature anomaly.

had_plot_base <- ggplot(data = hadcrut_temp_anomoly,
                        mapping = aes(x = month_name, y = anomaly))

Great! We have the basics of our plot saved as an object.

3.2 Scatterplots

To take a basic plot and add a geometry choice to it, we use one of the family of geom_s in ggplot2. One of the most basic plot types we ever encounter is the scatterplot. y versus x. That’s all. For that, we have geom_point()

had_plot_base +
  geom_point()

This is great! But, well, a few things. First, man, lots of points overlap. If only they were kinda transparent. Second, maybe they’re too small? Each geom has a number of options for different visual elements - color, size, alpha, etc. So, for example, we can modify this plot a bit.

had_plot_base +
  geom_point(alpha = 0.5, size=3)

Note - alpha runs from 0 (fully transparent) to 1 (fully opaque), and size is relative to the default point size. Well, now we can see the distribution of points a bit better, given overlap. But….

3.3 Jitter

Maybe we want to add a bit of random noise to the points, to better visualize what’s going on here. For that, there’s geom_jitter.

had_plot_base +
  geom_jitter()

Neat! You can see the density of each cluster for each month so much more clearly! Now, you may be wondering - hold on - I only want to add jitter to the x-axis, not the y-axis. For that, there are the width and height arguments, which set how much jitter room there is to spare in each direction. So - let’s say we want the points to vary in the x by \(\pm\) 0.5, but 0 in the y. And let’s throw in some alpha for good measure.

had_plot_base +
  geom_jitter(width=0.5, height=0, alpha=0.8)
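If you want to convince yourself that only x is being nudged, you can ask ggplot2 for the positions it will actually draw. This is a sketch on made-up data; I’ve written the jitter as geom_point() plus position_jitter() so that a seed can be supplied, which is my own addition for reproducibility:

```r
library(ggplot2)

# Made-up data: two months with 50 anomalies each (not the HadCRUT file)
set.seed(42)
toy <- data.frame(
  month_name = factor(rep(c("Jan", "Feb"), each = 50), levels = c("Jan", "Feb")),
  anomaly = rnorm(100, sd = 0.3)
)

# Same idea as geom_jitter(width = 0.5, height = 0), with a fixed seed
jittered <- ggplot(toy, aes(x = month_name, y = anomaly)) +
  geom_point(position = position_jitter(width = 0.5, height = 0, seed = 1),
             alpha = 0.8)

# Pull out the positions that will be drawn
drawn <- layer_data(jittered)

# Jan sits at x = 1 and Feb at x = 2, each spread by at most 0.5 either way
range(as.numeric(drawn$x))

# The y values are untouched
all.equal(sort(as.numeric(drawn$y)), sort(toy$anomaly))
```

Note that a width of 0.5 lets neighboring months touch; something smaller, like 0.2 or 0.3, keeps the clusters visually separate.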

3.4 Exercise: Boxplot, Violins, and stacking.

This is all well and good, but, eyeballing is not the same as some solid information on the distribution of the data. Let’s try some of the 2D ways of visualizing data.

  1. Try out the following geoms - geom_boxplot(), geom_violin(), stat_summary(). Which do you prefer?
  2. Try adding multiple geoms together. Does order matter?
  3. If you want to get saucy, install ggridges. You’ll have to swap x and y in your mapping, but, try out geom_density_ridges()

3.5 Lines

OK - but these are timeseries! We need lines, no? There is, of course, a geom_line that shows lines connecting points, but does not show the points (yes, you can layer both geom_line and geom_point).

had_plot_base +
  geom_line()

UGH - what happened? Welp, because month_name is a factor, all of the points within a month were connected. Oops! There is a way around it, and that is to specify what your groups are, rather than let ggplot define it for you.

One of the nice things about geoms is that we can add aesthetic elements from the data frame. Heck, if you want, you can add a geom with a whole new data set and new aesthetics, but for now, let’s just redefine the group aesthetic.

had_plot_base +
  geom_line(mapping = aes(group=year))

Perfect! We are on our way!
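The aside about giving a geom its own data set is worth a quick illustration. In this sketch both data frames are made up (peak is a hypothetical one-row data set used only to highlight a point):

```r
library(ggplot2)

# Made-up data: two flat "years" of monthly anomalies
toy <- data.frame(
  month = rep(1:12, times = 2),
  year = rep(c(1850, 2015), each = 12),
  anomaly = c(rep(-0.3, 12), rep(0.8, 12))
)

# Hypothetical second data set: a single point to highlight
peak <- data.frame(month = 2, anomaly = 0.8)

layered <- ggplot(toy, aes(x = month, y = anomaly)) +
  geom_line(aes(group = year)) +                     # per-layer aesthetic
  geom_point(data = peak, color = "red", size = 3)   # per-layer data

layered
```

The point layer still inherits x and y from the base plot, so the new data set only needs columns with the same names.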

Exercise: Try a line plot, using group = month. Feel free to play with other elements of the plot, such as alpha.

4. Adding information via aesthetic mapping

We’ve got our line plot working the way we want it to. But we have other information.

ggplot2 provides a number of different aesthetics, each with its own scale, that can all be tied to a variable inside the aes function. Most commonly, we’ll use color, fill, alpha, size, and shape. There are other options, such as linetype (lty for short) for line type, but those will be specific to the geom.

4.1 Color

As color is the most common modification, let’s concentrate on that. The lessons can be used broadly for other scales.

had_lines <- had_plot_base +
  geom_line(mapping = aes(group=year, color=year))

had_lines

A few things to note. First, year is treated continuously, so it’s placed on a gradient. Second, the default color gradient is from dark to light blue.
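One distinction that trips people up: putting color inside aes() maps it to a variable and gives you a scale, while putting it outside aes() just sets a fixed value. A small sketch on made-up data (three invented years, not the HadCRUT file):

```r
library(ggplot2)

# Made-up data: three "years" of twelve monthly values each
toy <- data.frame(
  month = rep(1:12, times = 3),
  year = rep(c(1850, 1930, 2015), each = 12)
)
toy$anomaly <- sin(toy$month / 2) / 5 + (toy$year - 1850) / 200

toy_base <- ggplot(toy, aes(x = month, y = anomaly, group = year))

# Mapped: color depends on year, so we get a gradient and a legend
mapped <- toy_base + geom_line(aes(color = year))

# Set: every line is red, no scale and no legend
fixed <- toy_base + geom_line(color = "red")

# The colors each layer will actually use
unique(layer_data(mapped)$colour)
unique(layer_data(fixed)$colour)
```

Only the mapped version can be restyled with the scale_color_* functions, since the fixed version has no color scale to modify.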

This is great, but what if we want different colors? If your colors are continuous, the first go-to to change them is scale_color_continuous - and we add scales just like we added geoms before. Let’s start with a traditional blue to red gradient.

had_lines +
  scale_color_continuous(low = "blue", high = "red")

This works great. We could also have used scale_color_gradient exactly the same way to achieve the same results. If you want to know more of what colors are available in R, see this post or

4.2 More on Gradients

This is much better. But it’s still hard to see the middle ranges of years. For that, we have the scale_color_gradient2 function, which takes an argument for what the midpoint color should be, but then you have to specify the value for that midpoint. Let’s just go with 1925.

had_lines +
  scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
                        midpoint = 1925)

Groovy! What if we wanted something more arbitrary - say, a seven-color rainbow gradient? scale_color_gradientn has you covered.

had_lines +
  scale_color_gradientn(colors=rainbow(7))

Note the rainbow() function. R comes with a few different color palette functions (and see the colors helpfile for how to view all of the colors in R). For each palette, we feed it a number of colors, and get a vector back. Using some code from the rainbow helpfile, here are the default palettes.
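This is not the helpfile’s own code, but a rough sketch of the same idea using only base R: build each palette, then draw one row of swatches per palette.

```r
# Each base R palette function takes a number of colors and returns
# a character vector of hex codes
n <- 7
palettes <- list(
  rainbow = rainbow(n),
  heat    = heat.colors(n),
  terrain = terrain.colors(n),
  topo    = topo.colors(n),
  cm      = cm.colors(n)
)

palettes$rainbow

# Draw one row of swatches per palette
op <- par(mfrow = c(length(palettes), 1), mar = c(0.5, 5, 0.5, 0.5))
for (nm in names(palettes)) {
  image(matrix(seq_len(n)), col = palettes[[nm]], axes = FALSE)
  mtext(nm, side = 2, las = 1, line = 0.5)
}
par(op)  # restore the previous plotting settings
```

Any of these vectors can be handed straight to scale_color_gradientn() through its colors argument, just like rainbow(7) above.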

Try one, and see what it does to you!

4.3 More Organized Gradients

There are of course a ton of packages with other palettes out there. One of the most popular is RColorBrewer, because its color selection is based on research looking at color blindness and at how we see sequential or diverging palettes of color. You can view a lot more about it at http://colorbrewer2.org/ - for now, let’s take a gander at what it provides.

#install.packages("RColorBrewer")
library(RColorBrewer)

display.brewer.all(n=10, exact.n=FALSE)

That’s a lot. Which of these do you think is best for seeing differences between years? Why? I admit, I’m often partial to BrBG.

had_lines +
  scale_color_gradientn(colors=brewer.pal(n = 7, 
                                          name = "BrBG"))

4.4 More Color Packages

In point of fact, there are many different color palette packages out there. I’ll just leave you with two more. The first is my personal favorite - the Wes Anderson package - https://github.com/karthik/wesanderson. Don’t worry, there are installation instructions on the github page.

The second is the viridis palette. This is a pretty common yellow-blue-based palette (and the package - https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html - also comes with a ‘magma’ palette). These two palettes are great as they show up well for people with color blindness. Heck - if you really want to have color-blind safe color schemes, you might also want to check out the dichromat package - it not only contains multiple colorblind safe palettes, but it also allows you to see what your color palette will look like under different forms of color blindness.

So, it’s no accident that viridis is what was used for the original animation. Let’s make it our own. We’ll also add guide = 'none' to lose the colorbar at the side.

library(viridis)
## Loading required package: viridisLite
# Save the result: later sections build on had_lines_color
had_lines_color <- had_lines +
  scale_color_viridis(guide="none")

had_lines_color

4.5 Exercises

  1. What happens to our histogram if we add a fill argument? Try fill = month and fill = factor(year) (note that without the factor, nothing happens!). You might also want to remove the guide with + guides(fill = "none").

  2. Create a line plot of year on the x and anomaly on the y. Include month as a group. Try coloring by anomaly as well. First use the default palette, and then try a blue-white-red.

  3. Now color that line plot by month. Try at least two different palettes either of your own devising or from an additional package. Note, if you use month_name, you’ll be mucking about with scale_color_manual. Or you can use RColorBrewer which handles discrete groups easily using scale_color_brewer().

5. Facets to see grouped information

Before we get to the circular animation hotness, I wanted to take a brief digression to show another way to visualize multiple dimensions. Figures often have multiple panels, each of which shows a different slice of the data. In ggplot, these are called facets. Before we jump whole hog into seeing this big timeseries, I want us to really look at each month and see if there are trends there. So, I want us to begin a slightly different plot - let’s look at temperature anomoly over time, but with different lines showing different months, and the x axis is year.

had_months <- ggplot(data = hadcrut_temp_anomoly,
       mapping = aes(x= year, y = anomaly, group = month_name)) +
  geom_line()
  
had_months

If you’d like, try adding a color and muck about with scale_color_manual and its values argument. Categorical groupings are also a place where RColorBrewer and wesanderson really shine.

But, what if instead of separating by color - as these lines all overlap - we really wanted to see each line on its own in a panel. There are two functions that can help us here. The first is facet_wrap()

had_months +
  facet_wrap(~month_name)

The second is a function that allows us to facet with two variables - one with rows and one with columns. We can’t do that with this example, but, in essence, you would use a similar tilde notation as above - facet_grid(rows ~ cols)
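Since our data has no second grouping variable, here is a sketch of facet_grid() on made-up data. The hemisphere and season columns are invented purely for illustration:

```r
library(ggplot2)

# Made-up data with two grouping variables
toy <- expand.grid(x = 1:10,
                   hemisphere = c("north", "south"),
                   season = c("summer", "winter"))
toy$y <- toy$x * ifelse(toy$season == "summer", 1, -1)

grid_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_line() +
  facet_grid(hemisphere ~ season)  # rows ~ columns

grid_plot
```

This gives a 2 x 2 grid of panels: one row per hemisphere and one column per season, with shared axes by default.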

5.2 Faceting by continuous values

It’s a little bit more difficult to facet our monthly plot. Maybe we want to split it up by decade. But, wait, year is continuous. What to do? R and ggplot2 have a number of functions based off of the base function cut(). Let’s try the first, that can make decadal plots.

had_lines +
  facet_wrap(~cut_width(year, 10))

It’s not pretty (that’s a lot of decades!), but you can see the principle.
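If the facet labels look odd, it helps to run cut_width() on its own. This sketch uses thirty made-up years, and the boundary, closed, and dig.lab choices are my own suggestions for lining bins up with calendar decades:

```r
library(ggplot2)

years <- 1850:1879  # thirty made-up years, enough to see the bin edges

# Default: bins are 10 wide but centered on multiples of 10,
# so the edges fall at 1845, 1855, ...
table(cut_width(years, width = 10))

# boundary pins a bin edge at 1850, closed = "left" puts 1860 in the
# 1860s bin, and dig.lab is passed on to cut() for plainer labels
decades <- cut_width(years, width = 10, boundary = 1850,
                     closed = "left", dig.lab = 4)
table(decades)
```

With the defaults, thirty years land in four bins (the first and last only partly filled); with the boundary set, they land in three bins of ten years each.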

Exercise:
  1. Try out cut_interval and cut_number. What do they do?
  2. Make a plot where facets and colors reflect the same information.

7. Making your plot theme your own

OK, getting back to our initial graph - the grey background, the font type, etc., might not really be doing it for you.

Ugh. Grey background. Weird white lines. Maybe we don’t like the default. ggplot2 provides some alternatives with theme functions. Now, you can specify what you’d like to your heart’s content, but there are a few different canned themes that can be quite nice. The two I use most commonly are theme_bw() and theme_void()

had_lines_color + theme_bw()

had_lines_color + theme_void()

Themes are highly customizable - the canned theme functions take simple arguments, such as base_size for basic font sizes, etc. Or you can customize to your heart’s content with theme(), setting the angle of text on axes, different color schemes, etc. See ?theme or the theme vignette.
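As a small taste of that kind of customization, here is a sketch on made-up data. The specific tweaks (tilted month labels, no minor grid lines) are just examples I picked:

```r
library(ggplot2)

# Made-up data: one value per month
toy <- data.frame(month_name = factor(month.abb, levels = month.abb),
                  anomaly = seq(-0.5, 0.6, length.out = 12))

themed <- ggplot(toy, aes(x = month_name, y = anomaly)) +
  geom_point() +
  theme_bw(base_size = 14) +    # canned theme with a larger base font
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # tilt month labels
        panel.grid.minor = element_blank())                 # drop minor grid lines

themed
```

Order matters here: theme() tweaks should come after the canned theme, because a complete theme like theme_bw() replaces any settings added before it.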

But it doesn’t stop there. There are whole libraries of themes for you to try! And they are constantly being updated! Want to make your figure look like it came from Excel, fivethirtyeight.com, or was made by Tufte himself?

#install.packages("ggthemes")
library(ggthemes)

had_lines_color +
  theme_excel()

had_lines_color +
  theme_fivethirtyeight()

had_lines_color +
  theme_tufte(base_size=17)

See the ggthemes package vignette for more.

But for us, let’s go with the solarized theme with the option light=FALSE, as it’s close to the original.

had_lines_color_theme <- had_lines_color +
  theme_solarized(light=FALSE)

Exercise: Take the year by anomaly plots and apply three different themes of your choice to them.

8. Plot annotation

OK, now for the fripperies and frills - annotating our plots with information! In the GIF above, we had demarcations of certain critical thresholds as well as a title. There were also years that changed as time went by, but we’ll save that for a moment.

8.1 Adding lines

First, there are two critical lines - 1.5 and 2.0 degrees C. To add a vertical or horizontal line to the plot, we use geom_vline or geom_hline. These accept an xintercept or yintercept, respectively, to indicate where they should cross a plot. So, to add to our current plot:

had_lines_annotated <- had_lines_color_theme +
  geom_hline(yintercept=c(1.5, 2.0), color="red")

had_lines_annotated

8.2 Adding text annotations

We also want to have some annotations on the lines that say what each one is. For that, we could use geom_text, but that can get messy with facets, etc. Better to use annotate. This function takes an x and y coordinate, a geom type, and then all of the arguments needed for that geom. Now, often we use this with just geom="text", but that gets tricky given that we want to overlay this on a line.

had_lines_annotated +
  annotate(x=c(1,1), y=c(1.5,2), geom="text", label=c("1.5C", "2.0C"))

Instead, we want something that is a label - something that has a background. We need to specify the background color, which I’ll put in as a hex code, as I don’t know its name (I looked this up in some information about solarized color palettes), and a color for the text - white.

had_lines_annotated <- had_lines_annotated+
        annotate(x=c(1,1), y=c(1.5,2), 
                 geom="label", 
                 label=c("1.5C", "2.0C"), 
                 fill="#002b36", 
                 color="white", label.size=0)  

had_lines_annotated

Gorgeous!

8.3 Title

And last, titles are easy. There’s a function for it! ggtitle

had_lines_annotated <- had_lines_annotated +
  ggtitle("Global Temperature Change (1850-2015)")

had_lines_annotated

9. Modifying axes

Of course, we’ve done this with simple cartesian axes that we haven’t modified at all. We don’t always want to do that.

9.1 Axis limits

First, we want to have the lines extend all the way to the edges of the graph. So, to do that, we want the limits of the x-axis to be set to Jan and Dec. The xlim() function (and there is a ylim() function as well) should take care of that. If we had a continuous scale, we’d just feed a vector, with minimum and maximum - e.g. c(1,12).

But here the x-axis is discrete, built from the month_name factor, and we’ve already taken care of its ordering, so xlim() isn’t what we need. Instead, we want to alter how much additional space is placed around the first and last month. By default, ggplot2 pads the axes a little, and the expand argument of coord_cartesian() controls that padding on both axes. So let’s turn it off!

had_lines_annotated_final <- had_lines_annotated + 
  coord_cartesian(expand=FALSE)
  
had_lines_annotated_final

The labels get cut off - but they’ll come back in a second!

9.2 Axis modification

There are a number of ways to modify an axis. scale_x_continuous() provides access to a rich set of options for modifying the x axis (or replace x with y, and you get the picture).

Often we’re interested in log transforming our axes:

had_lines_annotated_final + 
  scale_y_log10()

Here, this isn’t a great idea, as we lose the negative numbers. But, you can begin to see the power of how the different axis scales can work. We can invoke an arbitrary scale using the trans argument of scale_x_continuous() (renamed transform in ggplot2 3.5.0), although there are many already built in.
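The lost negatives are not a ggplot2 quirk - the logarithm simply isn’t defined there. A quick base R check on a few made-up anomalies:

```r
# A few made-up anomalies, two of them negative
anomalies <- c(-0.7, -0.3, 0.2, 0.9)

# log10 is undefined for negative numbers, so those become NaN
# (R also warns about this; the warning is suppressed here)
logged <- suppressWarnings(log10(anomalies))
logged

is.nan(logged)
## [1]  TRUE  TRUE FALSE FALSE
```

As far as I’m aware, ggplot2 then treats those NaN values as missing and leaves them out of the plot, which is why so much of the early record disappears on the log scale.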

We want a circular plot, however, which implies polar coordinates. And there’s an option for that, too! We’ll also have to deal with that expansion issue again.

had_lines_annotated_final <- had_lines_annotated_final +
  coord_polar()+ 
  scale_x_discrete(expand=c(0,0))
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
had_lines_annotated_final

Oh! Fancy! Almost there!

9.3 Axis labeling

The whole month name, anomaly thing on the axes is a little odd. Fortunately, we can replace the string labels on either axis. We could have put in other things, but for now, let’s go blank.

had_lines_annotated_final <- had_lines_annotated_final +
  xlab("") +
  ylab("")

had_lines_annotated_final

For more great ggplot2 extensions, see https://www.ggplot2-exts.org/

10. Animation

Right now, we’re in the midst of a great re-write of the package that actually performs animations with ggplot. The gganimate package attempts to implement a nice consistent grammar of animation, much like a grammar for graphics. At the time of writing, it’s not on CRAN yet, so we’ll have to install it and its companion, transformr, with devtools.

install.packages("devtools") #if you don't have it yet
devtools::install_github("thomasp85/farver")
devtools::install_github("thomasp85/tweenr")
devtools::install_github("thomasp85/transformr")
devtools::install_github("thomasp85/gganimate")

gganimate introduces a few new families of functions for animations. The first is the transition_* series that defines how elements move between each other in a plot. For example, to generate the initial animation, we’d use transition_reveal(), which gradually adds pieces to the animation. In the development version used here, it uses the first argument to say what is being added and the along argument to say in what order it should be added. (Later releases of gganimate, as far as I know, dropped that first argument, so there the call would be transition_reveal(year).)

library(gganimate)

had_lines_annotated_final  + 
   transition_reveal(year, along = year) 

What happens if you use along=as.numeric(month_name)? Note, we convert to a numeric as gganimate needs a number, and does not know how to order characters.

There are many other elements to gganimate - it’s a growing library - such as leaving trails behind points, and more - but that’s a tutorial for another time (or one you should write!)

had_points <- ggplot(data = hadcrut_temp_anomoly %>% filter(year < 1870),
                     aes(x = year, y = anomaly, color = month_name)) +
  geom_point() 

had_points_anim <- had_points +
  transition_time(year) 

had_points_anim +
  shadow_trail(0.05, max_frames=10) +
  exit_fade()