Hello! So, today we’re going to begin to code in R. We’re going to cover the basics of using R with a focus on data frame objects.

But let’s begin. To talk about what’s going on and take notes offline, use this week’s etherpad at https://etherpad.wikimedia.org/p/607-intro-2020

Let’s start at the very beginning - directory structure!

Everything we do this semester is going to be scripted. There will be scripts. There will be data. There will be figures. If you’re working on a project, you’ll also have a manuscript, tables, raw data, derived data, and more. Ai ya! That’s a lot of stuff. Putting it all in one directory is a terrible idea, as you will inevitably get lost.

Instead, from now through time immemorial, every project should have

1) Its own directory

2) A standardized directory structure

Why standardization? Once you get used to a standardized directory structure, creating new projects becomes easier, and navigating and finding things becomes EVEN EASIER. Standardized directory structures are a great way of protecting future you from hours of searching for THAT ONE FILE.

I often use something like this:

My Project 
| 
|- data 
|- scripts 
|- figures 
|- reports 

Or, if you want to see how real it gets…

My Project 
| 
|- raw_data 
|- derived_data 
|- model_outputs 
|- scripts 
   |-----data_cleaning 
   |-----modeling 
   |-----visualization 
|- figures 
|- tables 
|- manuscript 

And/or I will prepend 01_, 02_, 03_ etc. to my code files so I remember what order to run them in. But, hey, we’re not there yet.
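If you’d rather build a layout like the simpler one above from code than by hand, here is a minimal sketch. The project name my_project and the use of a temporary directory are assumptions for illustration only; point project_dir at wherever your project really lives.

```r
# Hypothetical project location; tempdir() keeps this sketch safe to run anywhere
project_dir <- file.path(tempdir(), "my_project")

# Folder names taken from the simpler layout above
folders <- c("data", "scripts", "figures", "reports")

for (folder in folders) {
  # recursive = TRUE also creates project_dir itself if it is missing
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# See what was created
list.files(project_dir)
```

Running it a second time is harmless: existing folders are left alone.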

What’s great is that RStudio allows you to keep things a bit more organized by using something called ‘Projects’. If you go to the File menu and select ‘New Project...’, you can create a directory with an .Rproj file in it. Instead of just opening RStudio in the future, you can open the .Rproj file. This has the advantage of letting R know where everything is located.
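One practical payoff: when you open a project through its .Rproj file, RStudio starts R with the working directory set to the project folder, so you can refer to files with short relative paths. A small sketch (the file name my_data.csv is made up for illustration):

```r
# Where does R think it is right now?
getwd()

# Build a path relative to the project folder.
# "my_data.csv" is a hypothetical file name.
data_path <- file.path("data", "my_data.csv")
data_path
## [1] "data/my_data.csv"

# Check whether the file is actually there before trying to use it
file.exists(data_path)
```

Because the path is relative, the same script keeps working if you move the whole project folder or share it with a collaborator.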

EXERCISE: Set up a folder for this class using an R Project. In that folder, create a folder for data, homework, lab, and exam.

R Scripts and Comments

For this and all in class code exercises, I want you to save what you do in a file in your lab folder.

What are the benefits of a script file? Well, 1. You won’t forget what you just did. It will still be there. 2. You can debug code if you run it and it borks on you with ease. 3. You can save the script for future use. 4. You can share it with collaborators. 5. You create a permanent re-usable archive of what you have done. Your science is now repeatable.

How do I create a script file? To do that, open a new file. Awesome. Now, before we do anything, I want to talk about commenting. In R, we can write comments. Comments are never interpreted. So, for example, put the following in your file, and then copy and paste it (or hit cmd-enter or ctrl-enter) into your console. Notice nothing happens.

#Nothing is going to happen

Comments are terribly useful, because you can use them to add information to a file. Often, before I start a script, I’ll lay it out in a few chunks. First, I’ll supply header information. Second, I’ll write out my workflow in plain English comments. For example:

###########################
#' my_first_file.R
#' 
#' This script is my first R file
#' and it doesn't do much, but it's mine.
#'
#' @author Jarrett Byrnes
#'
#' @changelog
#' 2018-02-01 Added a changelog
#' 2018-01-31 finished this file
###########################

## Do some things here ####

## Do some other things here ####

A few more advanced things to notice here. First, in the header, I have #' a lot. This is a convention in R, and in RStudio it means that whenever you add a linebreak, the next line starts off as a comment as well. There is also some @ stuff. There is actually a standard vocabulary here, but for now, know that when you do that in #' comments, it gets highlighted, and is a nice way to help delimit different fields in your headers.

Second, in the blocks of comments after the header, I ended each line with ####. This is a really nice feature of RStudio: whatever comes before the #### is recognized as a section header. It will make your file easier to work with later.

ALWAYS COMMENT THE HECK OUT OF YOUR CODE TO HELP OUT FUTURE YOU! I’ll include some comments in the code today to show you examples.

Commenting in your scripts is particularly helpful, as it will enable you to know what you’ve done, when, and how. Make your comments useful!

EXERCISE: Set up a script file for today! Give it a useful name! Save it in the right place! Make some comments in it to remind you what it’s all about.

OK, that’s comments - now, on to the code!

I have this cursor sitting here. What can I do with it?

Peering into your console, often the first source of fear and confusion is: what the heck do I do with this giant blank space? The easiest way to start thinking about R is as the world’s most advanced calculator. Seriously, try it out!

3+4
## [1] 7

Whoah! You can add! Wonderful. Let’s try a few other operations.

3-4
## [1] -1
4*5
## [1] 20
9/3
## [1] 3
2^7
## [1] 128

Yes, basic arithmetic is right there at your fingertips.

EXERCISE: Try some other basic arithmetic operators. Anything not work the way you think it should?

Arithmetic is great. But I want more

While arithmetic is cool (and necessary) we often want R to do more for us. Perhaps calculating more compelling mathematical functions. One that we commonly use is logarithms. Let’s say you want the natural log of 10.

log(10)
## [1] 2.302585

Great! Notice how that worked. We had log. Then two parentheses. Inside of which we had 10. log is a function. Functions in R are hugely powerful. They are the core of many things we do. Functions consist of a function name, those parentheses, and then inside of the parentheses one or more arguments separated by commas. Often these arguments have names. For example, what if we wanted to get the log of 10 in base 10?

log(10, base = 10)
## [1] 1

Now we’ve supplied two arguments.
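The base argument is optional: leave it out and log() uses base e, which is why log(10) above gave the natural log. A quick sketch to convince yourself, using exp(1) for e, plus the built-in shortcuts log10() and log2():

```r
# exp(1) is e, so its natural log is 1
log(exp(1))
## [1] 1

# Base 10 and base 2 have their own shortcut functions
log10(1000)
## [1] 3
log2(8)
## [1] 3
```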

Help! I don’t know what arguments to give to a function!

We’re going to run into many different functions as we go forward. There are two ways to get help on how to use them and what arguments to supply. Let’s look at the help file for log:

?log

help("log")

Both of those do the same thing. Note that there is an order to arguments in the help file. If you put arguments in order in a function, you don’t need to worry about naming them. This is bad practice (future you will be unhappy), as you may forget what they mean. Whenever possible, use named arguments.
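To see the difference, here are three calls that give the same answer. The argument names x and base are the ones listed in the log help file; the value 100 is just an example.

```r
# Positional: works, but you have to remember that the second slot is the base
log(100, 10)
## [1] 2

# Named: the same call, and still readable six months from now
log(x = 100, base = 10)
## [1] 2

# Named arguments can go in any order
log(base = 10, x = 100)
## [1] 2
```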

But I don’t know what the name of a function I need is!

Sometimes, we don’t know the name of a function we’re looking for. For example, in R, the function for arcsin is asin. But you have no way of knowing that. For that, we use the ?? with quotes.

??"arcsine"

Notice this brings up a list of helpfiles that have the word arcsine in them. Bueno! You can now track down the right function.
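Once the search points you to asin, a quick sanity check never hurts. A small sketch, relying on the fact that R’s trig functions work in radians:

```r
# The arcsine of 1 should be pi/2
asin(1)
## [1] 1.570796

# Compare against pi/2 directly
pi / 2
## [1] 1.570796
```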

EXERCISE: Think of a simple mathematical function. Find its helpfile and then implement it in R.

Variables and You

One of the great things about R is that you can save things as variables and use them later on. Some of them are there already. For example:

#This is Pi
pi
## [1] 3.141593

WHOAH! PI IN R!

What if you wanted to make your own variable? Say, foo. And you wanted foo to always equal the square root of 2.

#let's create a variable foo
foo <- sqrt(2)


#what's inside of foo
foo
## [1] 1.414214

Note that the assignment operator is <- and not an = sign. Now, you can use =, but in R it’s generally considered bad practice: = is also how you name arguments inside a function call, as in log(10, base = 10), so keeping <- for assignment makes it clear which one you mean.

Now you have a variable that you can use from now on that is the square root of two! For example:
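A short sketch of the two symbols side by side. The variable name bar is just an example.

```r
# The arrow creates (or overwrites) a variable
bar <- 100

# An equals sign inside the parentheses of a function names an argument
log(bar, base = 10)
## [1] 2

# Assigning again replaces the old value
bar <- bar + 1
bar
## [1] 101
```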

foo + 5
## [1] 6.414214
log(foo, base=2)
## [1] 0.5
foo^2
## [1] 2

Some notes on variable names - 1. Make your variable name descriptive. If it needs to be long, so be it, although economy is helpful. 2. Do not use . in your variable name, such as my.var. The . can have a special meaning in R. Use _ instead, so my_var. This is called snake case. You could also use camel case - it’s pretty standard as well. In that case, you’d have myVar.

For more, see http://adv-r.had.co.nz/Style.html

More than a number

OK, numbers are great, but there are other types of objects we’ll be dealing with in R. Primarily, we’re going to work with data frames, but let’s build up to a data frame, as it’s big and hairy.

First, are there other object types that have a single element to them beyond numbers? Well, yes! There are strings - words in quotes

"hello"
## [1] "hello"

Also, Booleans, which denote true and false

TRUE
## [1] TRUE
FALSE
## [1] FALSE
0==4
## [1] FALSE
4==4
## [1] TRUE
3 <= 4
## [1] TRUE

Note the different ways we made comparisons. These will become handy as you move on. Booleans are treated as 1s and 0s in arithmetic, so you can even do math with them:

TRUE + TRUE + FALSE
## [1] 2
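This is handy for counting how many comparisons are true. A small sketch reusing the comparisons from above:

```r
# Each comparison becomes 1 (TRUE) or 0 (FALSE) before adding
(3 <= 4) + (4 == 4) + (0 == 4)
## [1] 2

# You can also convert explicitly
as.numeric(TRUE)
## [1] 1
as.numeric(FALSE)
## [1] 0
```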

OK, what about something for a missing value? For that, we have NA. This is quite important if, say, you have a missing value in your data set.

NA
## [1] NA
NA + 1
## [1] NA

No one likes an NA, and often we have to find ways around them!
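Two of the most common ways around them are is.na(), which tests for a missing value, and the na.rm argument that many summary functions accept. A minimal sketch with made-up values:

```r
# Is this value missing?
is.na(NA)
## [1] TRUE
is.na(5)
## [1] FALSE

# Comparing to NA gives NA back, so use is.na() rather than ==
NA == 1
## [1] NA

# Many functions can be told to skip missing values
sum(1, NA, 3)
## [1] NA
sum(1, NA, 3, na.rm = TRUE)
## [1] 4
```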

EXERCISE: Make a variable. Now make a variable out of some math equation. Try adding variables of different classes together - what happens?

Combining values into larger objects

Now that we have a few object types down, what if we have a bunch of them we want to work with together? Let’s start with what we call a vector.

What is a vector? A vector is a bunch of numbers (or other things) all in one single object that we can reference with an index. Think of it as a column in a spreadsheet. For example, let’s say I had a column containing all of the integers from 5 through 10, and I wanted to know the 2nd integer.

my_numbers <- c(5, 6, 7, 8, 9, 10)

my_numbers[2]
## [1] 6

Notice a few things. First, we created our vector with the function c. This function takes a sequence of values and puts them into a vector. These values can be anything - numbers, strings, booleans, etc. Second, notice that to reference the second value of the vector we used []. Specifically, we put [2] in. This means give me the second value in this object.
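Indexing goes a little further than a single position. A short sketch using the same my_numbers vector (recreated here so the block runs on its own):

```r
my_numbers <- c(5, 6, 7, 8, 9, 10)

# How many values are there?
length(my_numbers)
## [1] 6

# Ask for several positions at once by putting a vector inside []
my_numbers[c(1, 3)]
## [1] 5 7

# A negative index drops that position
my_numbers[-1]
## [1]  6  7  8  9 10

# Math is applied to every value
my_numbers * 2
## [1] 10 12 14 16 18 20
```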

EXERCISE: Now you try it. Create a vector - any vector - and try pulling out single values. Do some math with them. Log transform them. Use your vector with an index as you would any other variable. For example

my_vector <- 1:100

my_vector[4] + my_vector[50]
## [1] 54

Oh! Notice the use of : there to get a long vector? Neat trick, no? There are other ways to generate vectors.

EXERCISE: Make two vectors and add them together. First try it with numbers. Then try vectors of different object types. What happens?

Functions to generate vectors for fun and profit

First, a function to get a sequence with non-integer steps between numbers

seq(from = 1, to = 2, by=0.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

Second, 10 random numbers between 0 and 100

runif(10, min = 0, max = 100)
##  [1] 89.18258 23.42808 97.21304 73.41101 95.43178 54.03394 62.91280 84.20774
##  [9] 24.04156 40.77182
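Because runif() draws new numbers every time, your output will not match the numbers printed above. If you need the same draws again, set the random seed first. A minimal sketch; the seed value 607 is arbitrary.

```r
# Same seed, same draws
set.seed(607)
first_draw <- runif(3, min = 0, max = 100)

set.seed(607)
second_draw <- runif(3, min = 0, max = 100)

identical(first_draw, second_draw)
## [1] TRUE
```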

Now, vectors are neat, as they allow us to introduce two more concepts. First, some functions take vectors as input, and return other types of objects. For example, let’s say we wanted to sum everything in my_vector above, and then get the average of a bunch of random numbers between 0 and 100.

sum(my_vector)
## [1] 5050
#a function in a function!
#oh my!
mean(runif(10, min = 0, max = 100))
## [1] 55.20371

OH! Notice I nested a function inside of a function. YES! You can do that. But only when you really need to. To keep track of things, it’s often better practice to create an object with a variable name that has meaning to you, and then feed that as an argument to another function.
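Here is the same calculation written in two steps instead of nested. The variable names random_draws and mean_of_draws are just examples; pick names that mean something to you.

```r
# Step 1: store the random draws under a meaningful name
random_draws <- runif(10, min = 0, max = 100)

# Step 2: feed that object to the next function
mean_of_draws <- mean(random_draws)
mean_of_draws
```

A bonus of this approach is that random_draws is still around afterwards, so you can look at the exact numbers that went into the mean.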

Digging into your variables

Last, often you just want some information about your vector. You’ll want to do this for many more complex objects in the future as well. Fortunately, there are a number of functions for that.

To know what class something is, there’s a class() function for that!

class("A")
## [1] "character"
class(c(1,2,3))
## [1] "numeric"
class(c(TRUE, FALSE))
## [1] "logical"

Note, this doesn’t tell you information about object length or values. The lowest level function that gives you more detailed info is str().

str() is terribly useful - it reveals what is inside of your object. Often, if you come to someone with a problem about a variable misbehaving, the first thing they will ask you to do is to “str your object!”. Heck, this should be one of the first things you do when you have a bug. 90% of the time when I do this, I find a typo, or a misclassed object, or something. And then I fix it, and can move on with my life.

For example,

str(my_numbers)
##  num [1:6] 5 6 7 8 9 10

shows you that my_numbers is a numeric vector of length 6 and shows you the first few values.
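str() is not limited to numeric vectors. A small sketch on two other objects, with length() thrown in for when all you want is the number of elements. The variable name my_word is made up for this example.

```r
my_vector <- 1:100
my_word <- "hello"

# Just the number of elements
length(my_vector)
## [1] 100

# Type, dimensions, and the first few values
str(my_vector)
##  int [1:100] 1 2 3 4 5 6 7 8 9 10 ...

str(my_word)
##  chr "hello"
```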

You can also get summarized information about your objects. This is aggregated a bit more, but can be useful.

summary(my_numbers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    6.25    7.50    7.50    8.75   10.00

Whoah! Summary statistics! One of the great things about summary as well is it gives you information about NAs which can really be tricky.

summary(c(1, 2, NA, 3))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0     1.5     2.0     2.0     2.5     3.0       1

ah HA! Often, NAs are the bane of our existence. We’ll talk about them more as we dig deeper, but for now, summary() is a great way of seeing if they are there.
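If all you want is a quick yes/no or a count, a couple of other base R functions will do it. A minimal sketch using the same values as above, stored under the made-up name values_with_na:

```r
values_with_na <- c(1, 2, NA, 3)

# Is there at least one NA?
anyNA(values_with_na)
## [1] TRUE

# How many?
sum(is.na(values_with_na))
## [1] 1

# Without na.rm, the NA takes over the result
mean(values_with_na)
## [1] NA

# With na.rm = TRUE, the NA is dropped first
mean(values_with_na, na.rm = TRUE)
## [1] 2
```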

EXERCISE: Create a vector of any class. Run str() and summary() on it. Now, create two vectors of different object types. Combine them. What do these two useful functions tell you about what happened?