To help us code efficiently and reproducibly we’ll be using an integrated development environment (IDE) called RStudio.
R is a calculator
2 + 2
R is a programming language. Specifically, it’s a programming language built for statistics.
And that’s what it’s best at.
Packages need only be installed once, although you may have to reinstall when upgrading R or when you want to use a newer version of a package.
To install from CRAN all one needs to do is:
install.packages("dplyr")
If you’re not using RStudio then you may be asked to select a mirror. Just choose the location geographically closest to you.
Bioconductor is slightly different - we’ll cover that in more detail in a later session.
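Because a package only needs to be installed once, a script can check whether it is already available before installing it. Here is a minimal sketch of that idea. The helper name install_if_missing() is made up for this example and is not part of base R; requireNamespace() and install.packages() are the real base R functions.

```r
# Hypothetical helper (not part of base R): install a CRAN package only
# when it cannot already be loaded.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is installed here.
install_if_missing("stats")

# For a CRAN package you would call, for example:
# install_if_missing("dplyr")
```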
Once installed all the functions in a package are available to be used.
dplyr::glimpse(iris)
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
Here the name of the package is provided, followed by two colons and then the name of the function you want to use. The :: operator loads the package into memory and lets you access any of its exported functions without attaching the package.
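The same notation works for any installed package. As a small sketch, here it is with the tools package, which ships with R but is usually not attached; the file name is only an illustration.

```r
# file_ext() lives in the tools package, which ships with R.
# No library(tools) call is needed when the package name is given.
tools::file_ext("mydata.txt")
## [1] "txt"

# The package is now loaded into memory, even though it was never attached.
"tools" %in% loadedNamespaces()
## [1] TRUE
```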
More commonly, library() is used to attach packages.
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
search()
## [1] ".GlobalEnv" "package:dplyr" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
glimpse(iris)
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
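Installed packages must be attached before their functions can be called by name alone: use the library() function to load and attach a package from your library.
An important concept to be aware of when using packages is namespaces. When two attached packages contain functions with the same name, R uses the one from the package that comes first in the search path, which you can see with the search() function.
To avoid problems and bizarre errors you can specify which function to use with the :: notation as before, explicitly indicating which function you’d like to use.
If you’re getting strange errors from a function that previously worked fine, try typing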
?function_name
In RStudio if there are multiple functions attached with the same name then the help window will give you links for both functions and the one at the top of the list is the one R uses by default.
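You can also ask R directly where a function comes from. This is a minimal sketch using sd() as the example; the outputs shown assume a fresh session in which nothing else called sd has been defined or attached.

```r
# find() reports where on the search path a name is defined.
find("sd")
## [1] "package:stats"

# environmentName(environment(f)) gives the namespace a function belongs to.
environmentName(environment(sd))
## [1] "stats"
```

If find() returns more than one entry, the first one is the version R will use when you type the bare function name.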
What is an environment? This is a topic for an entire workshop in and of itself, however it is important to have a basic understanding of environments.
When you start R you are working in .GlobalEnv, your working environment.
When you create an object, say z = 50, this object, z, now lives in the global environment.
When you use z, say print(z), then R begins to look for object z in the global environment.
If z is not there, R carries on through the attached packages. If z happened to be a function in a package called alphabet and that package had been attached (library(alphabet)), then R would find z there.
Environments are important to understand even when you are starting out because they can be the source of hard-to-find but devastating mistakes.
Know what is in your environment!
RStudio has a very useful panel called Environment that tells you exactly what is in your global environment. The function ls() also lists the objects in your global environment.
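If you want to experiment without cluttering your global environment, the same ideas can be tried in a separate environment. This is a rough sketch; the name sandbox is just an illustration.

```r
# Create a separate, empty environment so the global environment is untouched.
sandbox <- new.env()

# Put an object called z into it and list the contents.
assign("z", 50, envir = sandbox)
ls(sandbox)
## [1] "z"

# Look for z in that environment only (inherits = FALSE stops R from
# searching the enclosing environments as well).
exists("z", envir = sandbox, inherits = FALSE)
## [1] TRUE
get("z", envir = sandbox)
## [1] 50
```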
Here’s the kicker - you can define objects of any name in your global environment. Here’s something you should never do. Best not to run this code in your own session.
5 + 5
## [1] 10
`+` = function(x, y) {
  return(x * y)
}
5 + 5
## [1] 25
ls()
## [1] "+"
rm("+")
5 + 5
## [1] 10
Why does this work?
Hint: Think about where R looks first to find objects.
Some tips:
A quick note here on the two different assignment operators used in R. Historically R has used <- for assignment.
x <- 5
x
## [1] 5
However, in this course so far you’ve seen me using = for assignment.
x = 5
x
## [1] 5
Both are equally valid, despite what you may read otherwise. Each has a couple of quirks to be aware of but these are very minor.
Decide which one you prefer and be consistent.
<-
Can make mistakes like this, which compares x with -5 instead of assigning 5 to x:
x< -5
Your code will look more like the majority of what’s out there.
=
Also used to name arguments in function calls, as in rnorm(n = 10).
= and == can be confusing to start.
Decide for yourself, be consistent, and whichever you choose make sure to surround it with spaces.
# Good
x = 5
y <- 4
# Bad
x=5
y<-4
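To see why the spacing matters with <-, here is a short sketch of the mistake mentioned above. The values are arbitrary.

```r
x <- 10

# The misplaced space turns the intended assignment into a comparison:
# "is x less than -5?"
x< -5
## [1] FALSE

# x has not been changed.
x
## [1] 10
```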
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1, 20, by = 2)
## [1] 1 3 5 7 9 11 13 15 17 19
seq(from = 2, by = 2.5, length.out = 10)
## [1] 2.0 4.5 7.0 9.5 12.0 14.5 17.0 19.5 22.0 24.5
rep(2, 3)
## [1] 2 2 2
rep(1:3, 4)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each = 2)
## [1] 1 1 2 2 3 3
rep(c("A", "B", "C"), each = 6)
## [1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C"
## [18] "C"
normal_dist = rnorm(n = 1e6, mean = 2, sd = 1.2)
hist(normal_dist, main = "A normal distribution")
uniform_dist = runif(n = 1e6, min = 10, max = 20)
hist(uniform_dist, main = "A uniform distribution")
set.seed(3823)
x = sample(1:1000, size = 50, replace = TRUE)
max(x)
## [1] 982
min(x)
## [1] 4
range(x)
## [1] 4 982
mean(x)
## [1] 511.7
median(x)
## [1] 511
sum(x)
## [1] 25585
sd(x)
## [1] 265.8911
y = rnorm(x, 1, 0.2 * x) + x
plot(x,y)
var(x)
## [1] 70698.09
cor(x,y)
## [1] 0.9351317
my_model = lm(y ~ x)
print(my_model)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -36.125 1.121
summary(my_model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -278.58 -50.71 9.29 53.68 326.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.1253 35.2760 -1.024 0.311
## x 1.1210 0.0613 18.286 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.1 on 48 degrees of freedom
## Multiple R-squared: 0.8745, Adjusted R-squared: 0.8719
## F-statistic: 334.4 on 1 and 48 DF, p-value: < 2.2e-16
plot(x, y, main = "Linear model")
abline(my_model, col = "red")
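The numbers printed by print() and summary() can also be pulled out of a model object for further use. This is a minimal sketch on made-up data that lie exactly on a straight line, so the fitted intercept and slope should come out as 1 and 2 (up to rounding); the toy_ names are only for illustration.

```r
# Made-up data that lie exactly on the line y = 1 + 2x.
toy_x <- 1:10
toy_y <- 1 + 2 * toy_x
toy_model <- lm(toy_y ~ toy_x)

# coef() returns the intercept and slope as a named numeric vector.
coef(toy_model)

# predict() evaluates the fitted line at new x values.
predict(toy_model, newdata = data.frame(toy_x = c(11, 12)))
```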
R package authors are required to document their functions, although the documentation varies in usefulness.
Use ?function_name to get help on a function.
Use ??function to do a fuzzy search.
For package vignettes, type browseVignettes("packageName") in your console or look on the Bioconductor web page for a package.
If you email me with an error I haven’t seen, the first thing I will do is Google it. If you go ahead and post a question on a forum when an easy answer can be found by googling, don’t be surprised by an unpleasant response.
But sometimes an easy answer can’t be found so here’s a quick process to walk through:
To get you started here are a few of the more common errors you might see:
Think about what is going wrong for each of these.
my_object
## Error in eval(expr, envir, enclos): object 'my_object' not found
Hint: Type ls()
iris[, 6]
## Error in `[.data.frame`(iris, , 6): undefined columns selected
Hint: How many columns does the iris data frame have?
sample[1:10,]
## Error in sample[1:10, ]: object of type 'closure' is not subsettable
Hint: What does typeof(sample) give you? What about sample(10)? Or ?sample
mat[4, 2]
## Error in eval(expr, envir, enclos): object 'mat' not found
Hint: What are the dimensions (dim()) of mat?
pet_a_cat()
## Error in eval(expr, envir, enclos): could not find function "pet_a_cat"
Hint: This one is pretty self-explanatory
nothing = NA
if (nothing == NA) {
  print("empty")
}
## Error in if (nothing == NA) {: missing value where TRUE/FALSE needed
Hint: What does nothing == NA give you? How about is.na(nothing)?
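Once you have tried the hint yourself, the two checks can be compared side by side. This is a small sketch that reuses the nothing object from the example.

```r
nothing = NA

# Comparing with == gives NA, which if () cannot interpret.
nothing == NA
## [1] NA

# is.na() returns TRUE or FALSE, so it is safe inside if ().
if (is.na(nothing)) {
  print("empty")
}
## [1] "empty"
```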
my_data = read.table("mydata.txt")
## Warning in file(file, "rt"): cannot open file 'mydata.txt': No such file or
## directory
## Error in file(file, "rt"): cannot open the connection
Hint: Read the error message carefully.
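One way to guard against this error is to check that the file exists before reading it. This is a rough sketch, assuming the file is meant to sit in your current working directory; mydata.txt is the file name from the example.

```r
# Relative file names are looked up in the current working directory.
getwd()

if (file.exists("mydata.txt")) {
  my_data = read.table("mydata.txt")
} else {
  # Tell the user where R looked instead of failing with an error.
  message("mydata.txt was not found in ", getwd())
}
```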
x = data.frame(y = NULL)
x$y = 1:4
## Error in `$<-.data.frame`(`*tmp*`, "y", value = 1:4): replacement has 4 rows, data has 0
Hint: How many rows does x have?