Introduction to An Introduction to Computational Data Analysis for Biology
Jarrett Byrnes
UMass Boston
https://biol607.github.io/

First, Some New Technology

https://etherpad.wikimedia.org/p/607-intro

  • This class will use collaborative note-taking

  • Research shows that this enhances learning!

  • It’s also a way to ask me a question during class

Second, Some Old Technology

  • Green: Party on, Wayne
  • Red: I fell off the understanding wagon
  • Blue: Write a question/Other

And Now, A Pop Quiz!



http://tinyurl.com/firstPopQuiz



Outline for Today

  1. What are we doing here?
  2. Who are we?
  3. How will this course work?
  4. A Philosophy of answering scientific questions with data
  5. R!

Computational

library(plyr)  # provides ddply() and .()

#subset consumptionData into mixed diet treatments only (mixed diet BOX #'s are multiples of 5)
mixedData <- consumptionData[consumptionData$BOX %% 5 == 0, ]
mixedData$delta <- mixedData$ADD_WT - mixedData$REM_WT
mixedData$rate <- mixedData$delta / mixedData$days

#reshape to get species-specific consumption rate table
mixed.summary <- ddply(mixedData, .(BOX, SP_CODE), function(x){
  data.frame(CONSUMPTION_RATE = mean(x$rate, na.rm = TRUE))
})

#fit linear models (not including consumption--see below)
LMtestChange <- lm(formula = TEST_CHANGE ~ TREATMENT, data = expData, na.action = na.omit)

Code Forces You to Be Explicit About Theory
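
A minimal sketch of the subsetting and summarizing steps in the code above, runnable on its own. The data frame here is made up purely for illustration (the column names come from the slide code; the values are not real data), and base R's aggregate() stands in for ddply() so that no extra package is needed.

```r
# Hypothetical toy data: values are invented for illustration only
consumptionData <- data.frame(
  BOX     = c(5, 5, 7, 10, 10, 12),
  SP_CODE = c("A", "B", "A", "A", "B", "B"),
  ADD_WT  = c(10, 12, 9, 11, 8, 10),
  REM_WT  = c(6, 9, 7, 5, 6, 9),
  days    = c(2, 2, 2, 3, 2, 2)
)

# Keep only the mixed-diet boxes (BOX numbers that are multiples of 5)
mixedData <- consumptionData[consumptionData$BOX %% 5 == 0, ]

# Weight consumed, and consumption rate per day
mixedData$delta <- mixedData$ADD_WT - mixedData$REM_WT
mixedData$rate  <- mixedData$delta / mixedData$days

# Mean rate for each box-by-species combination
mixed.summary <- aggregate(rate ~ BOX + SP_CODE, data = mixedData, FUN = mean)
mixed.summary
```

Every decision (which boxes count as mixed diet, how a rate is defined) is written down in a line of code, which is the point of the slide.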

Coding is power

Repeatable Research

Data (acquisition)







How do I get good data here?

Data (maintenance)

http://libguides.wits.ac.za/content.php?pid=220705&sid=3862732

Analysis (philosophy)

Analysis (visual)

http://biomedicalcomputationreview.org/content/visualization-space-and-time-seamless-pipelines-now-available

Analysis (model)

## 
## Call:
## lm(formula = shoots ~ treatment.genotypes, data = eelgrass)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.473 -10.723  -1.299   8.955  35.701 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           30.664      5.324   5.760 2.73e-06 ***
## treatment.genotypes    4.635      1.401   3.308  0.00245 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.39 on 30 degrees of freedom
## Multiple R-squared:  0.2672, Adjusted R-squared:  0.2428 
## F-statistic: 10.94 on 1 and 30 DF,  p-value: 0.002449
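
A sketch of how a table like this is produced and how to pull numbers out of it. The eelgrass data are not included here, so the built-in cars data set is used as a stand-in; the model formula is an assumption chosen only for illustration.

```r
# Stand-in example: 'cars' ships with R; it is NOT the eelgrass data
fit <- lm(dist ~ speed, data = cars)

# Full summary table, in the same layout as the output above
summary(fit)

# Extract individual pieces rather than reading them off the printout
coef(fit)                                  # intercept and slope estimates
summary(fit)$r.squared                     # Multiple R-squared
summary(fit)$coefficients[, "Pr(>|t|)"]    # p-value for each coefficient
```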

for Biology

SCIENCE FIRST!

  • What is your model(s)?

  • THEN decide on statistical approach

  • Can you get data to parameterize that model?

  • How does biology inform your modeled results?

Avoiding The Replication Crisis

http://simplystatistics.org/2016/08/24/replication-crisis/


Course Goals

  1. Learn how to think about your research in a systematic way to design efficient observational & experimental studies.
  2. Understand how to get the most bang for your buck from your data.
  3. Make you effective collaborators with statisticians.
  4. Make you comfortable enough to learn and grow beyond this class.

Who are You?

  1. Name

  2. Lab

  3. Brief research description

  4. Why are you here?

Outline for Today

  1. What are we doing here?
  2. Who are we?
  3. How will this course work?
  4. A Philosophy of answering scientific questions with data
  5. R!

Lecture and Lab

  • T/Th Lecture on Concepts
  • Occasional Paper Discussion
  • Th Lab (which will cover some homework problems!)

Yes, Lectures are Coded

R Markdown sometimes with Reveal.js  

http://github.com/biol607/biol607.github.io

Readings for Class: W&S

Whitlock, M.C. and Schluter, D. (2014) The Analysis of Biological Data, 2nd Edition.

http://whitlockschluter.zoology.ubc.ca/

Chapter 1 this week!

Readings for Class: Grolemund & Wickham

Grolemund, G., and Wickham, H. 2016. R for Data Science.
http://r4ds.had.co.nz

Quizzes

  • Before and After Class

  • Measures understanding - and attendance!

  • Will drop lowest two

  • 10% of your grade

Problem Sets

  • 40% of your grade
  • “Adapted” from Whitlock and Schluter
  • Will often require R
  • Complete them using R Markdown

Midterm

  • Advanced problem set

  • Due Nov 4th

  • 20% of your grade

Final Project

  • Topic of your choosing
    • Your data, public data, any data!
    • Make it dissertation relevant!
    • If part of submitted manuscript, I will retroactively raise your grade  
  • Dates
    • Proposal due Oct 7th
    • Presentations on Dec 15th
    • Paper due Dec 16th (but earlier is fine!)

  • 30% of your grade

Extra Credit 1: Use Github

  • This whole class is a GitHub repo
  • Homeworks will be posted as part of the repo
  • If you submit your homework via a pull request, +1
  • There will be a GitHub tutorial outside of class hours

Extra Credit 2: Be Nate Silver

Extra Credit 3: Livin’ La Vida Data Science

Extra Credit 4: Data Science for Social Good

http://www.meetup.com/Data-Science-for-Social-Good/

Outline for Today

  1. What are we doing here?
  2. Who are we?
  3. How will this course work?
  4. A Philosophy of answering scientific questions with data
  5. R!

Our Approach to Data Analysis

 
Data from Reusch et al. 2005 PNAS

Start with a Question

Photo by Jessica Abbot

Does seagrass genetic diversity increase productivity?

Build an Understanding of the System

  1. Literature

  2. Observation

  3. Disciplinary History

Construct a Causal Model of the System

 
 

This is your DATA GENERATING PROCESS

Construct a Causal Model of the System

Big Picture DATA GENERATING PROCESS

Construct a Causal Model of the System

What is your ERROR GENERATING PROCESS?

Construct a Causal Model of the System

What can you isolate?

Collect the Data to Best Estimate & Test the Model

Treatments: 1 genotype, 3 genotypes, 6 genotypes

Look at Your Data

Fit a model(s), chosen to suit data & error generating process!

Analysis!

Build Open Reproducible Research

Many Methods of Sharing Data, Methods, and Results Beyond Publication

  1. GitHub - public code repository

  2. FigShare - share key figures, get a doi

  3. Blog - open ‘notebook’

  4. Dryad or Other Repository - post-publication data sharing

Questions

Outline for Today

  1. What are we doing here?
  2. Who are we?
  3. How will this course work?
  4. A Philosophy of answering scientific questions with data
  5. R!

What is R?

   
A programming language developed specifically for statistical analysis

Why R?

  1. Free
  2. Huge growing community
  3. Packages to do almost anything
  4. Makes reusable research easy
  5. C-based language
  6. Syntax naturally matches analytical thinking

What is RStudio?

  • Cross-Platform Graphical User Interface for R
  • It is not R

Let’s Fire It Up!

Open RStudio.

Don’t have it? Download it from http://rstudio.org

What do you see?

The Console and Math

 

1+1
## [1] 2

 

You try - different mathematical operators
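
A few operators to try at the console, as a starting sketch (the numbers are arbitrary):

```r
5 - 3        # subtraction
2 * 3        # multiplication
7 / 2        # division
2 ^ 3        # exponentiation
7 %% 2       # modulo (remainder after division)
7 %/% 2      # integer division
(1 + 2) * 3  # parentheses control the order of operations
```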

Everything is an Object

 

a.number<-1+1
a.number
## [1] 2

 
 

You try - what can you save as an object?
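
Some possibilities, as a sketch. The object names below are made up for illustration; almost anything R can compute can be saved under a name.

```r
a.word    <- "eelgrass"     # a character string
a.logical <- TRUE           # a logical value
a.vector  <- c(1, 3, 6)     # a numeric vector
a.list    <- list(name = a.word, levels = a.vector)  # a list can mix types

class(a.word)      # what kind of object is it?
class(a.vector)
length(a.vector)   # how many elements does it hold?
ls()               # names of the objects currently saved in your workspace
```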

Note: Comment Your Code as You Write with #

The text after # is not evaluated.

#This is going to be the number two
a.number<-1+1

#####----------

# You can get creative with comments to separate code blocks
# And write a lot, which is good practice

#####----------

Your comments tell readers - including yourself - what you are doing

Functions Work on Objects

 

sin(a.number)
## [1] 0.9092974
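
A short sketch of a few more ways functions and objects interact. The functions are all part of base R; sin.value is a name chosen here for illustration.

```r
a.number <- 1 + 1

sqrt(a.number)                     # a function applied to an object
round(sin(a.number), digits = 2)   # functions can be nested; digits= is a named argument
log(a.number, base = 2)            # many functions take optional arguments

sin.value <- sin(a.number)         # the result of a function is itself an object
sin.value
```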

How to get help for a function

?cos

help(cos)

??'cosine function'
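
A few other base R helpers that can be useful when you only half remember a function, shown as a sketch:

```r
args(round)      # the arguments a function accepts
example(mean)    # run the examples from a function's help page
apropos("cos")   # names of available objects that contain "cos"
```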

Lots of Object Types - like Data!

 

head(cars, n=3) #note the n= argument!
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4

Try looking at all of cars and names(cars)
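
Some ways to look at a data frame, sketched with the same built-in cars data:

```r
names(cars)     # column names
dim(cars)       # number of rows and columns
str(cars)       # structure: column types and the first few values
summary(cars)   # numeric summary of each column
cars$speed      # a single column, pulled out with $
cars[1:3, ]     # the first three rows, all columns
```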

Graphics are a Snap

plot(speed ~ dist, data=cars)

Look at ?plot to see other arguments to change appearance
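
A sketch of the same plot with a few of those arguments set. The label text and color are choices made here for illustration; the units come from ?cars.

```r
plot(speed ~ dist, data = cars,
     xlab = "Stopping distance (ft)",   # x-axis label
     ylab = "Speed (mph)",              # y-axis label
     main = "Speed and stopping distance",
     pch  = 19,                         # solid circles
     col  = "darkblue")                 # point color
```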

Installing Packages

You can also install packages from the command line.

install.packages('ggplot2', repos='http://cran.case.edu/', dependencies=TRUE)

Using one of the above methods, install the package ggplot2 and its dependencies now.
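
If you put the install step in a script, one possible pattern is to install only what is missing. This is a sketch, not a required workflow; wanted, installed, and missing.pkgs are names made up for illustration.

```r
# Packages this session needs (ggplot2 is the one named above)
wanted <- c("ggplot2")

# Which of them are already installed?
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missing.pkgs <- wanted[!installed]

# Install only what is missing, and only in an interactive session.
# With repos left unset, R uses (or asks you to choose) a CRAN mirror.
if (length(missing.pkgs) > 0 && interactive()) {
  install.packages(missing.pkgs, dependencies = TRUE)
}
```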

Using a Package

library(ggplot2)

qplot(dist, speed, data=cars)
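
qplot() is a shortcut, and recent ggplot2 releases mark it as deprecated in favor of the full ggplot() call. A sketch of what should be the equivalent plot; the axis labels are added here for illustration.

```r
library(ggplot2)

# Map columns to axes with aes(), then add a layer of points
p <- ggplot(data = cars, mapping = aes(x = dist, y = speed)) +
  geom_point() +
  labs(x = "Stopping distance (ft)", y = "Speed (mph)")

p   # printing the object draws the plot
```

Because the plot is saved as an object, further layers can be added to p later with +.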

You Try It

  1. Load ggplot2 and look at the mtcars data set

  2. Look at the qplot help file & demos

  3. Make two plots (ggplot or plot)

Questions