First - today’s etherpad, along with the mammal data, is at http://etherpad.wikimedia.org/p/607-tidy
For the past few days, we’ve been grapping with a data set that we’ve almost wrestled down to size. The only problem is, what we’re left with is a big table with one score per row, and a lot of NAs. This doesn’t seem efficient, nor does it seem like a good format for plotting. Clearly, we need to TIDY THE DATA.
As we talked about a few weeks ago, there are both long and wide data formats. Long data formats have one observation and one measurement per row. So, multiple rows constitude a single observation.
These kinds of data sets are great for plotting summary information for each group and each variable aggregated together. Think of a bar plot with averages for different species taken across different sites.
Wide data has a every measurement in a single observation in a single row.
This kind of data is ideal for things like scatterplots of one measurement against another, with each observation as a single data point.
Quite often, data is recorded in a long format for efficiency. Let’s talk about this with respect to a data set of mammals. For every site a researcher visited, they wrote down the species of mammals they say, and the density of those mammals. As this was just a running tally, the data ended up in a long format:
## The mammals data frame:
## site taxon density
## 1 1 Suncus etruscus 6.2
## 2 1 Sorex cinereus 5.2
## 3 2 Myotis nigricans 11.0
## 4 3 Notiosorex crawfordi 1.2
## 5 3 Scuncus etruscus 9.4
## 6 3 Myotis nigricans 9.6
But, what if we wanted to plot the abundance of, say, Myostis nigricans against Scuncus etruscus - we can’t do that from here. We need to spread
the data into a wide format. For that, there is the tidyr
library and it’s spread
function.
spread
works simply in that it take a data set, and you tell it the name of the column which contains values that will be the column names in your new data set - a so called “key” column. You also tell it which column contains the relevant numbers - the “values” column.
library(tidyr)
m_wide <- mammals %>%
spread(taxon, density)
m_wide
## site Myotis nigricans Notiosorex crawfordi Scuncus etruscus
## 1 1 NA NA NA
## 2 2 11.0 NA NA
## 3 3 9.6 1.2 9.4
## Sorex cinereus Suncus etruscus
## 1 5.2 6.2
## 2 NA NA
## 3 NA NA
Great! We have a wide data set. But - uh oh, what are all of those NAs doing there?
We often have data sets like this. Ones where we have many measurements that we have not observed. Sometimes, they are NA - we just didn’t take them. Other times, such as in this data set, we want to fill them in with a sensible value, such as 0.
The simplest approach here would be to use the fill
argument in spread
.
m_wide_0 <- mammals %>%
spread(taxon, density, fill=0)
m_wide_0
## site Myotis nigricans Notiosorex crawfordi Scuncus etruscus
## 1 1 0.0 0.0 0.0
## 2 2 11.0 0.0 0.0
## 3 3 9.6 1.2 9.4
## Sorex cinereus Suncus etruscus
## 1 5.2 6.2
## 2 0.0 0.0
## 3 0.0 0.0
Perfect! Other times, though, we want to fill those in right into the long data format. For that we have the complete
function. In that function, we specify which columns we want all combinations of, and then supply a list of how new values should be filled in for other columns. If we don’t give a column name in that list, it defaults to NA. Oh, complete
function… you complete
me.
m_long_0 <- mammals %>%
complete(site, taxon, fill=list(density=0))
m_long_0
## # A tibble: 15 × 3
## site taxon density
## <dbl> <fctr> <dbl>
## 1 1 Myotis nigricans 0.0
## 2 1 Notiosorex crawfordi 0.0
## 3 1 Scuncus etruscus 0.0
## 4 1 Sorex cinereus 5.2
## 5 1 Suncus etruscus 6.2
## 6 2 Myotis nigricans 11.0
## 7 2 Notiosorex crawfordi 0.0
## 8 2 Scuncus etruscus 0.0
## 9 2 Sorex cinereus 0.0
## 10 2 Suncus etruscus 0.0
## 11 3 Myotis nigricans 9.6
## 12 3 Notiosorex crawfordi 1.2
## 13 3 Scuncus etruscus 9.4
## 14 3 Sorex cinereus 0.0
## 15 3 Suncus etruscus 0.0
We could then take that new complete
long data frame and spread
it with abandon! Either way, now we can plot!
library(ggplot2)
ggplot(m_wide_0, mapping = aes(x=`Myotis nigricans`,
y=`Scuncus etruscus`)) +
geom_point(size=1.4) +
stat_smooth() +
theme_bw()
OK, so, that’s not such an impressive plot, but you see what we’re getting at here…
But what if your intrepid mammal researcher had been wise and recorded their data in a wide format (in truth, this is what I do all the time). And instead of a scatterplot, we wanted to compare the densities of all of the different species in a point-range plot? Well, then we’d have to go from a wide format to long.
To do this, tidyr
offers the gather
function, as in gather up your wide data into one nice long piece. Gather is a little trickier, as you specify what you want the name of the new key column to be, what you want the name of the new values column to be, and then…well, you have some options. You can either specify which columns are to be gathered up (which can take a while) or you can specify which columns you want to exclude. You can do fancy things, like specify all columns between two others with a :, but, honestly, I often specify which to exclude. Let’s see two examples.
m_long <- m_wide_0 %>%
gather(Species_name, Density, -site)
m_long
## site Species_name Density
## 1 1 Myotis nigricans 0.0
## 2 2 Myotis nigricans 11.0
## 3 3 Myotis nigricans 9.6
## 4 1 Notiosorex crawfordi 0.0
## 5 2 Notiosorex crawfordi 0.0
## 6 3 Notiosorex crawfordi 1.2
## 7 1 Scuncus etruscus 0.0
## 8 2 Scuncus etruscus 0.0
## 9 3 Scuncus etruscus 9.4
## 10 1 Sorex cinereus 5.2
## 11 2 Sorex cinereus 0.0
## 12 3 Sorex cinereus 0.0
## 13 1 Suncus etruscus 6.2
## 14 2 Suncus etruscus 0.0
## 15 3 Suncus etruscus 0.0
#Another way of doing the same thing
m_long <- m_wide_0 %>%
gather(Species_name, Density, `Myotis nigricans`:`Suncus etruscus`)
m_long
## site Species_name Density
## 1 1 Myotis nigricans 0.0
## 2 2 Myotis nigricans 11.0
## 3 3 Myotis nigricans 9.6
## 4 1 Notiosorex crawfordi 0.0
## 5 2 Notiosorex crawfordi 0.0
## 6 3 Notiosorex crawfordi 1.2
## 7 1 Scuncus etruscus 0.0
## 8 2 Scuncus etruscus 0.0
## 9 3 Scuncus etruscus 9.4
## 10 1 Sorex cinereus 5.2
## 11 2 Sorex cinereus 0.0
## 12 3 Sorex cinereus 0.0
## 13 1 Suncus etruscus 6.2
## 14 2 Suncus etruscus 0.0
## 15 3 Suncus etruscus 0.0
And now that we have a long format, let’s plot~
ggplot(m_long, mapping=aes(x=Species_name, y=Density)) +
geom_point(size=2) +
stat_summary(fun.data="mean_sdl", color="red")
OK - how would you apply these functions to the Axoltl data?
You may have noticed that we’ve been throwing around just one value column. What if you have two - say, density and average height? Or three? Or more? In other words, a not-long-not-wide-but-in-between data set. These are common.
Well, tidyr isn’t flexible on that count. Instead, you need to unite
those columns into something that’s easy to separate
later down the line. Which we do with… regular expressions! Yes, they are everywhere. Let’s take an example of height in the mammals data set.
mamh <- mammals %>%
mutate(height=rnorm(6,30,3))
So now we want to do everything above, but with density and height! So, we unite them, using a _
as our separator. We could have used anything, but I like _ because it’s used for so few other things with numbers. tidyr
just needs to know the name of the new column, and then which columns will be united
mamh2 <- mamh %>%
unite(measurement, density, height, sep="_")
mamh2
## site taxon measurement
## 1 1 Suncus etruscus 6.2_30.692529240345
## 2 1 Sorex cinereus 5.2_30.0508006127887
## 3 2 Myotis nigricans 11_28.8721744695724
## 4 3 Notiosorex crawfordi 1.2_31.3415094564752
## 5 3 Scuncus etruscus 9.4_28.5882463945002
## 6 3 Myotis nigricans 9.6_35.3923816017375
We can now proceed as normal. Let’s say, for example, that you wanted to fill in all of the empty species-site combos with zeroes. Here’s one workflow to do that.
mamh_long <- mamh2 %>%
spread(taxon, measurement, fill="0_0") %>%
gather(taxon, measurement, -site)
mamh_long
## site taxon measurement
## 1 1 Myotis nigricans 0_0
## 2 2 Myotis nigricans 11_28.8721744695724
## 3 3 Myotis nigricans 9.6_35.3923816017375
## 4 1 Notiosorex crawfordi 0_0
## 5 2 Notiosorex crawfordi 0_0
## 6 3 Notiosorex crawfordi 1.2_31.3415094564752
## 7 1 Scuncus etruscus 0_0
## 8 2 Scuncus etruscus 0_0
## 9 3 Scuncus etruscus 9.4_28.5882463945002
## 10 1 Sorex cinereus 5.2_30.0508006127887
## 11 2 Sorex cinereus 0_0
## 12 3 Sorex cinereus 0_0
## 13 1 Suncus etruscus 6.2_30.692529240345
## 14 2 Suncus etruscus 0_0
## 15 3 Suncus etruscus 0_0
OK, great! We have all of the “0_0” entries (notice what I did there for the fill?), but, now we want to restore our old measurements. For that, we have separate
which takes the relevant column, the new column names in a vector, and the pattern you match to split them.
mamh_long <- mamh_long %>%
separate(measurement, into = c("density", "height"), sep="_")
mamh_long
## site taxon density height
## 1 1 Myotis nigricans 0 0
## 2 2 Myotis nigricans 11 28.8721744695724
## 3 3 Myotis nigricans 9.6 35.3923816017375
## 4 1 Notiosorex crawfordi 0 0
## 5 2 Notiosorex crawfordi 0 0
## 6 3 Notiosorex crawfordi 1.2 31.3415094564752
## 7 1 Scuncus etruscus 0 0
## 8 2 Scuncus etruscus 0 0
## 9 3 Scuncus etruscus 9.4 28.5882463945002
## 10 1 Sorex cinereus 5.2 30.0508006127887
## 11 2 Sorex cinereus 0 0
## 12 3 Sorex cinereus 0 0
## 13 1 Suncus etruscus 6.2 30.692529240345
## 14 2 Suncus etruscus 0 0
## 15 3 Suncus etruscus 0 0
One note - these new columns will be treated as characters (you did split a string after all), so you’ll need to do some cleanup
mamh_long <- mamh_long %>%
mutate(density = as.numeric(density),
height = as.numeric(height))
And there you go! You can now do exciting things like plot the density-height relationship, grouping by taxon, for example.
qplot(density, height, data=mamh_long,
color=taxon, geom="point") +
stat_smooth(method="lm", fill=NA)
OK, to make sure you got separate
and unite
down, first, separate the taxon into two columns - Genus and species (taxon is currently a latin binomial).
Then, add four columsn to the original mammals frame. Now use any technique above (which may or may not involve unite) to fill in 0s for all four columns.
Load the hemlock data.
Note that there are a bunch of different densities. Plot the relationship between Hemlock density and dead Hemlock density.
Well that was uninformative. We want to look at a timeseries of both - heck, maybe even with a timeseries of total tree density as well. Take this wide data set, select out The Hem Den
, Dead Hem Den
, and Tree Den
column sas well as Stand, Year, Lat, and Long as your identifing columns. Then make this wide data set long with those three density columns as your new variable values.
Plot the three variables by year. Color by variable.
Well that was uninformative. Try taking an average by year and variable and plot these timeseries.
We can also use separate
on the stand name to get town and stand number instead of individual stand. Do that, and plot the timeseries of average density of each variable in each town. Use facets, colors, and shapes as you will.
Well, now you have the density per stand. Sometimes averages are better than individual data values. Spread back to a wide format, and plot each pariwise density combination - e.g., Average dead hemlock density per town v average tree density.