Tidy Data Homework

Author

Biol607

Since 2014, a group of us in New England have been surveying kelp forests from Rhode Island to Maine. There’s a LOT of data from several protocols. Let’s muck about with the percent cover data to learn about the sampling program and a bit about kelp forests here in New England.

1. Load me.

The URL of the data is https://github.com/kelpecosystems/observational_data/blob/master/cleaned_data/keen_cover.csv?raw=true - use the readr library to load it in. Show me that you can do it both ways: reading it straight from the URL without downloading it, and reading a copy you have downloaded.
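As a starting point, here is a minimal sketch of the two routes. The helper names `read_keen_remote()` and `read_keen_local()` and the destination file name `keen_cover.csv` are made up for illustration; only the URL comes from the assignment.

```r
library(readr)

keen_url <- "https://github.com/kelpecosystems/observational_data/blob/master/cleaned_data/keen_cover.csv?raw=true"

# Route 1: read_csv() accepts a URL, so nothing is saved to disk.
read_keen_remote <- function(url = keen_url) {
  read_csv(url, show_col_types = FALSE)
}

# Route 2: save a local copy first, then read that file.
# `dest` is a hypothetical path; mode = "wb" keeps the bytes intact on Windows.
read_keen_local <- function(url = keen_url, dest = "keen_cover.csv") {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  read_csv(dest, show_col_types = FALSE)
}
```

Calling `keen <- read_keen_remote()` or `keen <- read_keen_local()` should give you the same tibble either way.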

2. Format

Take a look at the data in any way you see fit so that you can tell me whether the data is in a wide or long format. Justify your answer.

3. Check it out.

Let’s learn a bit about who is doing what using group_by(), summarize(), and n_distinct().

3a. How many sites has each PI done?

3b. How many years of data does each site have? Show it in descending order.
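If the group_by() / summarize() / n_distinct() pattern is new, here is a small sketch on invented data. The toy tibble and its values are made up, and the column name `PI` is an assumption about what the real file calls it.

```r
library(dplyr)

# Invented toy data; the real column names may differ.
toy <- tibble(
  PI   = c("A", "A", "A", "B", "B"),
  SITE = c("s1", "s1", "s2", "s3", "s3"),
  YEAR = c(2014, 2015, 2014, 2014, 2014)
)

# Count distinct sites within each PI.
sites_per_pi <- toy |>
  group_by(PI) |>
  summarize(n_sites = n_distinct(SITE))

# Count distinct years within each site, largest first.
years_per_site <- toy |>
  group_by(SITE) |>
  summarize(n_years = n_distinct(YEAR)) |>
  arrange(desc(n_years))
```

Note that n_distinct() counts unique values, so repeated rows for the same site or year are not double counted.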

3c. Impress yourself - can you make a figure showing which site was sampled when? There are a lot of ways to do this. Sometimes I use slice(), but I’m sure there are more elegant solutions. For data viz, you can use geoms you’ve used before, or new ones, like geom_tile() or whatever you think would be interesting!
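One possible approach, sketched on invented data: reduce the table to one row per site and year with distinct(), then draw a tile for each combination. This is only one of many options, and the toy values are made up.

```r
library(dplyr)
library(ggplot2)

# Invented toy data with repeated site-year rows.
toy <- tibble(
  SITE = c("s1", "s1", "s1", "s2", "s2"),
  YEAR = c(2014, 2014, 2015, 2015, 2016)
)

# One row per site-year combination that was actually sampled.
site_years <- toy |>
  distinct(SITE, YEAR)

# A tile appears wherever a site was sampled in a given year.
p <- ggplot(site_years, aes(x = YEAR, y = SITE)) +
  geom_tile(color = "white")
```

Printing `p` shows gaps wherever a site was skipped in a year.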

4. Let’s look at some kelp!

4a. This is a big unwieldy dataset. Let’s trim it down to the columns YEAR, SITE, TRANSECT, PERCENT_COVER, FAMILY, and SPECIES.

4b. Let’s make it even simpler. Trim the data down so the only species we are looking at are in the family “Laminariaceae”. After that, you can ditch the FAMILY column.
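The select-then-filter pattern for 4a and 4b might look roughly like this on invented data. The toy rows and the extra column `NOTES` are made up to show a column being dropped.

```r
library(dplyr)

# Invented toy data with one column we do not need.
toy <- tibble(
  YEAR = c(2014, 2014, 2014),
  SITE = "s1",
  TRANSECT = "t1",
  PERCENT_COVER = c(15, 20, 5),
  FAMILY = c("Laminariaceae", "Laminariaceae", "Fucaceae"),
  SPECIES = c("Saccharina latissima", "Laminaria digitata", "Fucus vesiculosus"),
  NOTES = c("a", "b", "c")
)

kelp <- toy |>
  select(YEAR, SITE, TRANSECT, PERCENT_COVER, FAMILY, SPECIES) |>  # keep six columns
  filter(FAMILY == "Laminariaceae") |>                             # keep one family
  select(-FAMILY)                                                  # FAMILY is now constant
```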

4c. For each species, is there only one measurement per transect each year? Or do we need to worry… Note, this is a common data check you should be doing if you have a large complex data set!

4d. HAHA that was a trick. I knew there sometimes was more than one. That’s because some of these are measurements of juveniles and some are adults. OK - sum up the cover for each species on each transect in each year so that we only have one measurement per species (adults and juveniles together!)
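A sketch of the check and the fix on invented data: count rows per year, site, transect, and species to spot duplicates, then sum within those same groups. The toy values are made up.

```r
library(dplyr)

# Invented toy data: Saccharina appears twice on the same transect in 2014.
toy <- tibble(
  YEAR = c(2014, 2014, 2014, 2015),
  SITE = "s1",
  TRANSECT = "t1",
  SPECIES = c("Saccharina latissima", "Saccharina latissima",
              "Laminaria digitata", "Saccharina latissima"),
  PERCENT_COVER = c(10, 5, 20, 30)
)

# The check: any group with more than one row is a duplicate.
dupes <- toy |>
  count(YEAR, SITE, TRANSECT, SPECIES) |>
  filter(n > 1)

# The fix: collapse each group to a single summed value.
summed <- toy |>
  group_by(YEAR, SITE, TRANSECT, SPECIES) |>
  summarize(PERCENT_COVER = sum(PERCENT_COVER), .groups = "drop")
```

An empty `dupes` would mean there is nothing to worry about; here it has one row.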

4e. Neat! Make a plot showing the timeseries of kelps at each site. You’ll want stat_summary() here. You might even need it twice because - note - stat_summary() has a geom argument where you can do things like “line”. What might that do? Check it out! Facet this plot by species, so we can see the trajectory of each. Feel free to gussy this plot up however you would like (or not). Do you notice anything? Comment!
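To see what using stat_summary() twice looks like, here is a minimal sketch on invented data. The choice of mean and standard error is an assumption, not the only sensible summary, and the toy values are made up.

```r
library(ggplot2)

# Invented toy data: two transects per site per year.
toy <- data.frame(
  YEAR = rep(c(2014, 2015, 2016), each = 4),
  SITE = rep(c("s1", "s1", "s2", "s2"), times = 3),
  SPECIES = "Saccharina latissima",
  PERCENT_COVER = c(10, 14, 30, 26, 20, 24, 22, 18, 35, 31, 12, 8)
)

p <- ggplot(toy, aes(x = YEAR, y = PERCENT_COVER, color = SITE)) +
  stat_summary(fun = mean, geom = "line") +               # line through yearly means
  stat_summary(fun.data = mean_se, geom = "pointrange") + # mean with standard error bars
  facet_wrap(vars(SPECIES))                               # one panel per species
```

The first layer connects the yearly means for each site, and the second adds the points and error bars on top.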

5. Wide relationships

Let’s look at the relationship between two of the species here. Lexi made me do this, I swear. She made me think about tradeoffs in our weekly meeting last week, so now you all have this problem.

5a. If we want to look at the relationships between species, we need a wide data set. Use pivot_wider() to make species into columns with percent cover as your values. Note - be careful to fill in NAs as 0s.
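Here is a small sketch of pivot_wider() with zero filling, on invented data. The `values_fill = 0` argument is what turns missing species-transect combinations into 0 rather than NA.

```r
library(dplyr)
library(tidyr)

# Invented toy data: Laminaria is absent from the 2015 rows.
long <- tibble(
  YEAR = c(2014, 2014, 2015),
  SITE = "s1",
  TRANSECT = "t1",
  SPECIES = c("Saccharina latissima", "Laminaria digitata", "Saccharina latissima"),
  PERCENT_COVER = c(15, 20, 30)
)

# One column per species; missing combinations become 0.
wide <- long |>
  pivot_wider(
    names_from = SPECIES,
    values_from = PERCENT_COVER,
    values_fill = 0
  )
```

In the result, the 2015 row gets a 0 for Laminaria digitata even though no such row existed in the long data.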

5b. Neat! Is there a relationship between Saccharina latissima and Laminaria digitata? Plot it. As a preview for 2 weeks from now, add a line to your ggplot with stat_smooth(method = "lm"). Also, remember that you will need backticks ` around variable names with spaces in them. What do you think? Feel free to use any other geoms or explore however you like here.
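The backtick syntax can be sketched like this on invented data. The toy values are made up, and `check.names = FALSE` is only there so that data.frame() keeps the spaces in the column names.

```r
library(ggplot2)

# Invented toy data with spaces in the column names.
wide <- data.frame(
  `Saccharina latissima` = c(15, 30, 5, 40, 22),
  `Laminaria digitata`   = c(20, 0, 35, 2, 10),
  check.names = FALSE
)

# Backticks let aes() refer to names that contain spaces.
p <- ggplot(wide, aes(x = `Saccharina latissima`, y = `Laminaria digitata`)) +
  geom_point() +
  stat_smooth(method = "lm")
```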

5c. Hey, so, remember how we filled in a lot of 0s? Yeah, those weren’t in the original long data we plotted… which means many of those lines from question 4e might be wrong! So let’s pivot this corrected wide data back to long and then remake the figure from 4e. Does it look different? Does it tell a different story?
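Going from a zero-filled wide table to a long one is a job for pivot_longer(). A minimal sketch on invented data follows; the toy values are made up.

```r
library(dplyr)
library(tidyr)

# Invented toy data: a wide table that already has its 0s filled in.
wide <- tibble(
  YEAR = c(2014, 2015),
  SITE = "s1",
  TRANSECT = "t1",
  `Saccharina latissima` = c(15, 30),
  `Laminaria digitata` = c(20, 0)
)

# Stack the species columns into SPECIES / PERCENT_COVER pairs.
long_again <- wide |>
  pivot_longer(
    cols = c(`Saccharina latissima`, `Laminaria digitata`),
    names_to = "SPECIES",
    values_to = "PERCENT_COVER"
  )
```

The long result now carries an explicit 0 row for every species that was absent, so averages computed from it include those zeros.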

Meta 1.

So, this was your first time playing with a novel, only mostly clean data set found in the wild. How did you feel working with it? What did you notice as you examined it for the very first time knowing nothing about it?

Meta 2.

Split-Apply-Combine is… a way of life, really. Is this something you have dealt with previously in your life or work? How comfortable are you with this concept?

Meta 3.

When you’ve made datasets in the past, have they been wide, long, or something else? After this week and the Broman and Woo paper, what advice would you give to future you when making data?

Meta 4.

How much time did this take you, roughly? Again, I’m trying to make sure that these assignments aren’t killer, more than anything.

Meta 5.

Please give yourself a weak/sufficient/strong assessment on this assignment. Feel free to comment on why.