Understanding R

Using RStudio

RStudio demo

To help us code efficiently and reproducibly we’ll be using an integrated development environment (IDE) called RStudio.

R and RStudio installed and working
Using the file pane and setting your working directory
Running code - in the console, from scripts and from R Markdown documents
Using the Environments pane

R History

What is R?

R is a calculator

2 + 2

NO!!

R is a programming language. Specifically it’s a programming language built for statistics.

And that’s what it’s best at.

R is a dialect of the S language, which was developed by John Chambers at Bell Labs in 1976. S still exists today, although it hasn’t changed much since 1998.
The philosophy behind S (and R) was to allow users to begin in an interactive environment that didn’t explicitly feel like programming.
As their needs and skills grew, they could move into more of the programming aspects. This helps explain some of why R is the way it is.

R began life in New Zealand, developed by Ross Ihaka and Robert Gentleman in 1991.
It was made available to the public in 1993 and in 1995 R was licensed with the GNU General Public License, making it free and open-source.
Version 1.0.0 was released in 2000 and the most recent version at the time of writing, 3.3.1, was released on June 21, 2016.

Some key features of R

R runs on almost all platforms and operating systems.
It’s free
The core is quite lean - most functionality is found in modular packages.
Very powerful graphics and statistics capabilities
Actively developed and a very active user community
Rich and robust package repository (CRAN and Bioconductor)
Excellent interactive capabilities - good for rapid development and data analysis

Packages

Packages are simply bundles of code, external to the core R code, that are designed to perform a specific task.
The vast majority of the usefulness and functionality of R resides in packages.
These packages live in online repositories and can be installed on your own system to be used.

R has a well-defined system of packages, requiring package authors to document well and test installation thoroughly.
This means that most packages will install easily on any system.
For most R packages the central repository is CRAN (the Comprehensive R Archive Network). However, most bioinformatics packages live in another repository called Bioconductor.
Despite differences in content and appearance, these essentially function in the same way.

Installing packages

Packages need only be installed once, although you may have to reinstall when upgrading R or when you want to use a newer version of a package.
To install from CRAN all one needs to do is:
```
install.packages("dplyr")
```
If you’re not using RStudio then you may be asked to select a mirror. Just choose the location geographically closest to you.
Bioconductor is slightly different - we’ll cover that in more detail in a later session.
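
Because a package only needs to be installed once, a script can check whether it is already present before calling install.packages(). The following is a minimal sketch rather than part of the course code: the helper name needs_install is hypothetical, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: TRUE if the package is not yet installed in your library.
# requireNamespace() tries to load the package without attaching it.
needs_install = function(pkg) {
    !requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so it never needs installing
needs_install("stats")

# Uncomment to install dplyr only when it is missing:
# if (needs_install("dplyr")) install.packages("dplyr")
```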

Using packages

Once installed, all the functions in a package are available to be used.

dplyr::glimpse(iris)

## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

Here the name of the package is provided, followed by two colons and then the name of the function you want to use. The :: operator loads the package into memory (without attaching it) and lets you access any of the functions it exports.
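
As a small illustration of the same idea with a package that ships with R, the sketch below calls a function from the tools package. It assumes a default R session, where tools is installed but not attached; the file name is made up.

```r
# The tools package ships with R but is not attached by default
ext = tools::file_ext("results.csv")
ext

# After the call, the package is loaded in memory...
"tools" %in% loadedNamespaces()

# ...but it only appears on the search path if something has attached it
"package:tools" %in% search()
```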

However, typing out the package name every time can get tedious, so R provides a function, library(), used to attach packages.
The library() function first loads and then “attaches” the package.
Basically this means you can now use functions from a package without typing the package name.
Packages are attached in your current session and need to be attached every time you start a new session.
Technically, attaching a package puts it on your search path, the ordered list of places R looks for objects and functions, starting with your global environment.

search()

## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

search()

##  [1] ".GlobalEnv"        "package:dplyr"     "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"

glimpse(iris)

## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

There is some confusion about why they are called packages when you use the library() function to attach them.
The correct terminology here is that individual packages are stored in your ‘library’.
You use the library() function to load and attach a package from your library.
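
To see where your library actually lives on disk, something like the following can help. This is a sketch only; the paths printed will differ from machine to machine.

```r
# Directories that make up your library (where installed packages are stored)
lib_dirs = .libPaths()
lib_dirs

# Where one particular installed package lives; "stats" ships with R
stats_dir = find.package("stats")
stats_dir

# Names of a few of the packages installed in your library
head(rownames(installed.packages()))
```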

Namespaces

An important concept to be aware of when using packages is namespaces.

Given the thousands of packages available it is quite likely that function names will overlap.
If two functions have the same name and both are attached, R will by default use the one attached most recently.
You can see this order by looking at the packages in the search path, which is returned by the search() function.

To avoid problems and bizarre errors you can use the :: notation, as before, to explicitly indicate which function you’d like to use.
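
Here is a rough sketch of masking in action. Rather than relying on two packages that happen to clash, it defines a deliberately unhelpful sd() in the current environment as a stand-in for a clashing function; the global environment comes before every attached package, so the effect is the same.

```r
# Define our own sd() in the current environment; this masks stats::sd()
sd = function(x) {
    "this is not the standard deviation!"
}

masked_result = sd(c(2, 4, 6))           # calls our version
explicit_result = stats::sd(c(2, 4, 6))  # always calls the stats version
masked_result
explicit_result

# Remove our version so that sd() refers to stats::sd() again
rm(sd)
restored_result = sd(c(2, 4, 6))
restored_result
```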

Pro tip

If you’re getting strange errors from a function that previously worked fine, try typing

?function_name

In RStudio if there are multiple functions attached with the same name then the help window will give you links for both functions and the one at the top of the list is the one R uses by default.

Environments

What is an environment? This is a topic for an entire workshop in and of itself, however it is important to have a basic understanding of environments.

Environments are how R knows where to look for things.
The only environment that you usually have to pay attention to is .GlobalEnv, your working environment.
When you define an object, say z = 50, this object, z, now lives in the global environment.
When you ask R to do something with the object z, say print(z), then R begins to look for object z in the global environment.
If it can’t find it there then it searches other places it knows like attached packages.
For example if z happened to be a function in a package called alphabet and that package had been attached (library(alphabet)), then R would find z there.
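
The lookup described above can be checked from the console. The sketch below assumes a fresh session in which you have not defined your own object called median.

```r
z = 50

# z is found directly in the current (global) environment
z_is_here = exists("z", inherits = FALSE)
z_is_here

# median is not defined in the current environment...
median_is_here = exists("median", inherits = FALSE)
median_is_here

# ...but R finds it by searching further along the search path
median_found = exists("median")
median_found

# The function itself belongs to the stats package
median_home = environmentName(environment(median))
median_home
```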

Environments are important to understand even when you are starting out because they can be the source of hard-to-find but devastating mistakes.

Know what is in your environment!

RStudio has a very useful panel called Environment that tells you exactly what is in your global environment. The function ls() also lists the objects in your global environment.

Here’s the kicker - you can define objects of any name in your global environment. Here’s something you should never do. Best not to run this code in your own session.

5 + 5 
## [1] 10

`+` = function(x,y) {
    return(x*y)
}
5 + 5
## [1] 25

ls()
## [1] "+"

rm("+")
5 + 5
## [1] 10

Why does this work?

Hint: Think about where R looks first to find objects.

Some tips:

Always start your analysis in a new environment.
Never save your workspace (R asks you about this when quitting; always say no).
Don’t analyze different projects in the same environment.
If you have an error, try re-running your script in a new session (fresh environment).
On the flip side, make sure your script can run in a new session.
Use unique names for your objects.

Assignment

A quick note here on two different assignment operators used in R. Historically R has used <- for assignment.

x <- 5
x

## [1] 5

However, in this course so far you’ve seen me using = for assignment.

x = 5
x

## [1] 5

Both are equally valid, despite what you may read otherwise. Each has a couple of quirks to be aware of but these are very minor.

Decide which one you prefer and be consistent.

Using `<-`

Longer to type (two key strokes, plus SHIFT)
Can make mistakes like this:
```
x< -5
```
Your code will look more like the majority of what’s out there.

Using `=`

Quicker to type
Similar to modern programming languages
Also used for passing arguments to functions, e.g. rnorm(n = 10)
Difference between = and == can be confusing at first

Decide for yourself, be consistent, and whichever you choose, make sure to surround it with spaces.

# Good
x = 5 
y <- 4

# Bad
x=5
y<-4
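
The quirks mentioned above are easier to see in action. The sketch below uses two throwaway helper functions (equals_in_call and arrow_in_call, both made up for this illustration) so that ls() can report exactly which objects each style creates.

```r
# Quirk of `<-`: a misplaced space turns the assignment into a comparison
x = 10
is_less = x< -5    # parsed as x < -5
is_less            # FALSE
x                  # still 10

# Inside a function call, `=` names an argument; it does not create an object
equals_in_call = function() {
    draws = rnorm(n = 3)
    ls()           # objects created inside this function
}
equals_in_call()   # only "draws"

# `<-` inside a function call assigns the value AND passes it on
arrow_in_call = function() {
    total = sum(v <- 1:4)
    ls()
}
arrow_in_call()    # "total" and "v"
```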

Other key R functions

Generating and manipulating sequences

seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 20, by = 2)
##  [1]  1  3  5  7  9 11 13 15 17 19

seq(from = 2, by = 2.5, length.out = 10)
##  [1]  2.0  4.5  7.0  9.5 12.0 14.5 17.0 19.5 22.0 24.5

rep(2, 3)
## [1] 2 2 2

rep(1:3, 4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

rep(1:3, each = 2)
## [1] 1 1 2 2 3 3

rep(c("A", "B", "C"), each = 6)
##  [1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C"
## [18] "C"

Distributions

normal_dist = rnorm(n = 1e6, mean = 2, sd = 1.2)
hist(normal_dist, main = "A normal distribution")

uniform_dist = runif(n = 1e6, min = 10, max = 20)
hist(uniform_dist, main = "A uniform distribution")

Basic statistics

set.seed(3823)
x = sample(1:1000, size = 50, replace = TRUE)

max(x)
## [1] 982

min(x)
## [1] 4

range(x)
## [1]   4 982

mean(x)
## [1] 511.7

median(x)
## [1] 511

sum(x)
## [1] 25585

sd(x)
## [1] 265.8911

y = rnorm(x, 1, 0.2 * x) + x
plot(x,y)

var(x)
## [1] 70698.09

cor(x,y)
## [1] 0.9351317

Simple linear model

my_model = lm(y ~ x)
print(my_model)

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     -36.125        1.121

summary(my_model)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -278.58  -50.71    9.29   53.68  326.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.1253    35.2760  -1.024    0.311    
## x             1.1210     0.0613  18.286   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.1 on 48 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.8719 
## F-statistic: 334.4 on 1 and 48 DF,  p-value: < 2.2e-16

plot(x, y, main = "Linear model")
abline(my_model, col = "red")
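
Once a model is fitted you will often want the numbers themselves rather than the printed summary. The sketch below uses a small made-up data set (toy) lying exactly on a straight line, so the expected coefficients are known in advance; coef() and predict() work the same way on my_model.

```r
# Made-up data lying exactly on the line y = 1 + 2x (no noise)
toy = data.frame(x = 1:10)
toy$y = 1 + 2 * toy$x

toy_model = lm(y ~ x, data = toy)

# Named vector of fitted coefficients: intercept first, then slope
toy_coef = coef(toy_model)
toy_coef

# Predict y for a new value of x
new_y = predict(toy_model, newdata = data.frame(x = 11))
new_y
```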

Where to get help

Built-in help

R package authors are required to document their functions, although this happens at various levels of usefulness.

Simply type ?function_name to get help on a function.
Type ??search_term to do a fuzzy search across the help pages of your installed packages.
Look carefully at what parameters the function requires and what type they are.
Some are required (listed first, no default) and some are optional (a default value is usually listed).
Most function help will also indicate what the function returns.
Good documentation also has more information on what the function is doing.
Package manuals (required for both CRAN and Bioconductor) are all of these function help pages gathered together in one place.
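
You can also inspect a function’s parameters and their defaults without opening the help page. A brief sketch using rnorm(), which appears earlier in this lesson:

```r
# Show the argument list of rnorm(): n has no default, so it is required;
# mean and sd have defaults, so they are optional
args(rnorm)

arg_names = names(formals(rnorm))
arg_names

# Default values of the optional arguments
default_mean = formals(rnorm)$mean
default_sd = formals(rnorm)$sd
default_mean
default_sd

# Open the full help page (interactive sessions):
# ?rnorm
```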

Vignettes

Vignettes (required for Bioconductor packages, but not CRAN) are longer-form documentation, usually in the form of a tutorial or example usage.
These can be extremely helpful and are usually the best place to start when working with a new package.
For example the limma (linear models for microarrays) vignette is a small book and an excellent resource on learning how to analyze microarrays (and RNA-seq).
Vignettes can be found by typing browseVignettes("packageName") in your console or on the Bioconductor web page for a package.

Elsewhere

Sometimes authors will provide more detailed documentation online.
This is more common for more recent packages where the authors may have a GitHub repository and associated webpage.
Often discussion pages (Google Groups, GitHub) can also be a useful source of help.

Errors

GOOGLE IT!!!

If you email me with an error I haven’t seen, the first thing I will do is Google it. If you post a question on a forum when an easy answer can be found by googling, don’t be surprised by an unpleasant response.

But sometimes an easy answer can’t be found so here’s a quick process to walk through:

Re-read the error and then think about it for a minute. See if you can’t get a grasp on what’s really going wrong.
Check your code for errors. Spelling errors, misplaced commas and forgotten parentheses can all cause problems.
Look it up - I very, very rarely get an error that someone else hasn’t seen before.
If you still can’t find a solution then you can ask for help. Make friends in this class - get an R buddy. I can answer brief questions or you can post questions online at Stack Overflow. Sometimes package developers have specific discussion groups on Google Groups or GitHub. These can be very useful.

To get you started, here are a few of the more common errors you might see:

Think about what is going wrong for each of these.

my_object

## Error in eval(expr, envir, enclos): object 'my_object' not found

Hint: Type ls()

iris[, 6]

## Error in `[.data.frame`(iris, , 6): undefined columns selected

Hint: How many columns does the iris data frame have?

sample[1:10,]

## Error in sample[1:10, ]: object of type 'closure' is not subsettable

Hint: What does typeof(sample) give you? What about sample(10)? Or ?sample

mat[4, 2]

## Error in eval(expr, envir, enclos): object 'mat' not found

Hint: Does an object called mat exist in your environment? If it does, what are its dimensions (dim())?

pet_a_cat()

## Error in eval(expr, envir, enclos): could not find function "pet_a_cat"

Hint: This one is pretty self-explanatory

nothing = NA
if (nothing == NA) {
    print("empty")
}

## Error in if (nothing == NA) {: missing value where TRUE/FALSE needed

Hint: What does nothing == NA give you? How about is.na(nothing)?
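
If you get stuck on this one, the sketch below shows one way to follow the hint. The variable names comparison and status are made up for the illustration.

```r
nothing = NA

# Comparing anything with NA gives NA, not TRUE or FALSE
comparison = nothing == NA
comparison

# is.na() always answers with TRUE or FALSE, so it is safe inside if()
status = "not empty"
if (is.na(nothing)) {
    status = "empty"
}
status
```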

my_data = read.table("mydata.txt")

## Warning in file(file, "rt"): cannot open file 'mydata.txt': No such file or
## directory

## Error in file(file, "rt"): cannot open the connection

Hint: Read the error message carefully.
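
A common cause here is that the file is not where R is looking. As a rough sketch, you can check your working directory and whether the file exists before trying to read it. A temporary file stands in for mydata.txt so that the example can run anywhere.

```r
# A temporary file path stands in for "mydata.txt" (hypothetical example file)
data_file = tempfile(fileext = ".txt")

# Check where R is looking and whether the file is there before reading it
getwd()
existed_before = file.exists(data_file)
existed_before

# Create a small file, then read it back
write.table(data.frame(a = 1:3), data_file)
my_data = read.table(data_file)
my_data
```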

x = data.frame(y = NULL)
x$y = 1:4

## Error in `$<-.data.frame`(`*tmp*`, "y", value = 1:4): replacement has 4 rows, data has 0

Hint: How many rows does x have?
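
One possible way around this error, sketched under the assumption that you know the column values up front, is to create the data frame with the column already in place. The extra column z is made up for the illustration.

```r
# An empty data frame has zero rows, so a 4-element column cannot be added
x = data.frame(y = NULL)
empty_rows = nrow(x)
empty_rows

# One option: create the data frame with the column already in place
x = data.frame(y = 1:4)
x$z = x$y * 2   # new columns must match the existing number of rows
x
```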