This module will be relying on the following material, which you are encouraged to read more extensively beyond the requirement for this course. Becoming comfortable (not only familiar) with a programming language is a like a superpower and you’ll find it enables you to do many more things beyond the initial context (and enhances your employment outlook).
Advanced R by Hadley Wickham. Don’t let the name fool you, it’s simply a more systematic and technical overview of the language. Even if you don’t understand much of it now, it’s a good book to come back to later.
R for Data Science by Hadley Wickham As the title implies, it focuses more on workflows and good practices for data science.
Miscellaneous :) There are many good resource on the web. We’ll point out some when relevant.
For this module, please read the sections we point out.
If you’re curious about some pros and cons of the R programming language, read sections 1.1 and 1.2 from Advanced R: https://adv-r.hadley.nz/introduction.html
Most R functions and libraries have good documentation. In RStduio, select a function and press F1
to have the documentation appear in the Help tab (to the lower-right). In an R console, run ?function
to display the same documentation on screen. Googling is of course an alternative
You can download a copy of this notebook here.
Install the tidyverse packages:
if (!require("tidyverse"))
install.packages("tidyverse")
One usually needs a collection of different R libraries within a workflow. Many functions in these packages have the same name and only the functions from the last imported library are “remembered” by R.
But we can explicitly refer to a function in a package by using the syntax
package::function()
This can be cumbersome but advised when working with multiple packages.
You can get a list of the loaded environment with the search()
function
base::search()
## [1] ".GlobalEnv" "package:forcats" "package:stringr"
## [4] "package:dplyr" "package:purrr" "package:readr"
## [7] "package:tidyr" "package:tibble" "package:ggplot2"
## [10] "package:tidyverse" "package:stats" "package:graphics"
## [13] "package:grDevices" "package:utils" "package:datasets"
## [16] "package:methods" "Autoloads" "package:base"
R uses a number of different data structures (vector, matrices, lists, objects, data frames, etc.) and allows users to create their own. Which is why one should be aware of some general principles, as some package creators get overenthusiastic about they way they package output from their functions.
R data structures can be “decorated” by setting attributes. To set these, use the attr()
function:
my_vector <- 1:10
attr(my_vector, "about") <- "this is just a simple sequence"
my_vector
## [1] 1 2 3 4 5 6 7 8 9 10
## attr(,"about")
## [1] "this is just a simple sequence"
These attributes are printed along with the variable, but can also be fetched as a list with attributes
:
attributes(my_vector)
## $about
## [1] "this is just a simple sequence"
Some attributes are generally recognized by various functions so they have specialized ways to access / set them, like names()
names(my_vector) <- c("one", "two", "three", "four", "five")
my_vector
## one two three four five <NA> <NA> <NA> <NA> <NA>
## 1 2 3 4 5 6 7 8 9 10
## attr(,"about")
## [1] "this is just a simple sequence"
As you can see, the print()
function expects to match each element of the vector with a name.
Besides meta-information, attributes can have a functional role, as they inform various functions how to interpret and process these data structures.
A typical example are matrices, which are simply vectors, with the number of rows and columns set as attributes:
dim(my_vector) <- c(2, 5)
my_vector
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
## attr(,"about")
## [1] "this is just a simple sequence"
This is equivalent to using the built-in matrix()
creation function:
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)
my_matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
Note Attributes are meant to be transient. Many functions remove them.
These are basically data structures with permanent attributes (meaning functions cannot remove them).
The way to create these is simply to set a class
attribute. (An object is a single instance of a generic class of data structures)
attr(my_matrix, "class") <- "named_matrix"
attr(my_matrix, "row_names") <- c("one", "two")
attr(my_matrix, "column_names") <- c("one", "two", "three", "four", "five")
my_matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
## attr(,"class")
## [1] "named_matrix"
## attr(,"row_names")
## [1] "one" "two"
## attr(,"column_names")
## [1] "one" "two" "three" "four" "five"
Or, in one step:
structure(matrix(1:10, nrow = 2, ncol = 5),
class = "named_matrix",
row_names = c("one", "two"),
column_names = c("one", "two", "three", "four", "five")
)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
## attr(,"class")
## [1] "named_matrix"
## attr(,"row_names")
## [1] "one" "two"
## attr(,"column_names")
## [1] "one" "two" "three" "four" "five"
All of this is mostly relevant for certain functions and it’s the “magic” behind some fancy data types like data frames.
S3 objects are extremely common in the R landscape and many functions return output like this, so it’s good to be aware of how they’re structured.
Besides S3, there are S4 and R6 object types (helpful names, huh?). If you’re curious or need to understand some output in these formats, refer to Objected-oriented programming in the Advanced R book.
These are a more flexible structure, as they can contain multiple entries, of any data type:
shopping_list <- list(
items = c("chocolate", "bread", "broccoli"),
prices = c(15, 45, 32),
date = "2021-02-01"
)
Text aka “character” strings can be manipulated in different ways. R has built-in functions for this but take a look at the stringr package that offers convenient functions
message <- "hello world"
nchar(message)
## [1] 11
You can extract parts of a string using numerical positions
substr(message, start = 1, stop = 5)
## [1] "hello"
You can replace certain patterns:
library(stringr)
stringr::str_replace(message, pattern = "o", replacement = "-")
## [1] "hell- world"
stringr::str_replace_all(message, pattern = "o", replacement = "-")
## [1] "hell- w-rld"
str_replace()
and str_replace_all()
can be applied on vectors and accepts more complicated (regex) patterns:
fruits <- c("one apple", "two pears", "three bananas")
stringr::str_replace_all(fruits, pattern = "[aeiou]", replacement = "-")
## [1] "-n- -ppl-" "tw- p--rs" "thr-- b-n-n-s"
To search for a string inside another larger string or for a certain pattern of possible strings inside a larger string, you can use the stringr::str_locate()
and stringr::str_locate_all()
functions.
fruit <- c("apple", "banana", "pear", "pineapple")
stringr::str_locate(fruit, "a")
## start end
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3
## [4,] 5 5
stringr::str_locate_all(fruit, "a")
## [[1]]
## start end
## [1,] 1 1
##
## [[2]]
## start end
## [1,] 2 2
## [2,] 4 4
## [3,] 6 6
##
## [[3]]
## start end
## [1,] 3 3
##
## [[4]]
## start end
## [1,] 5 5
You can see what type of data structure a variable is by using the class()
function.
class(shopping_list)
## [1] "list"
To see what the underlying data type (i.e. nunber, character) a variable has, you can use the typeof()
function.
typeof(my_matrix)
## [1] "integer"
For our toy matrix, this is number (integer, to be exact).
To get elements out of various data structures there are a few access operators.
my_vector[3]
## [1] 3
shopping_list$items
## [1] "chocolate" "bread" "broccoli"
shopping_list[["items"]]
## [1] "chocolate" "bread" "broccoli"
shopping_list$items[1]
## [1] "chocolate"
There many more useful ways you can fetch elements and you are strongly encouraged to at least browse through the Subsetting chapter of the Advanced R book, at least to be aware.
Many datasets can naturally be organized in tables. Base R already has the data.frame
data structure but we’ll be using its “upgraded” version (a so-called “tibble”), along with associated functions from the dplyr
library, part of the tidyverse
collection of packages.
These functions allow for a programming style that combines functional elements with query-like steps.
Here’s a quick reference for the operations we’ll be using: Data Wrangling Cheat Sheet
Let’s use the (R built-in) Iris dataset to illustrate some ways we can work with tables.
library(tidyverse)
iris
iris %>%
dplyr::filter(Species == "setosa")
Note the pipe operator %>%
above, introduced by the tidyverse library. It’s very similar to the Unix pipe |
in that it feeds the value / variable on its left as the first argument to the function on its right. For example, "hello" %>% print
is equivalent to print("hello")
In RStudio, you can type it faster with [Ctrl] + [Shift] + [M]
(Windows) or [Command] + [Shift] + [M] (macOS)
iris %>%
dplyr::filter(Species == "setosa",
Petal.Length < 1.5)
iris %>% dplyr::select(Sepal.Length)
iris %>%
dplyr::mutate(length_ratio = Sepal.Length / Petal.Length)
You can also apply any vectorized function directly to a column
iris %>%
dplyr::mutate(log_petal_length = log(Petal.Length))
Or you can apply ad-hoc functions that work on each row of the column
iris %>%
dplyr::mutate(species_number = sapply(Species,
function (species_name) {
if (species_name == "setosa")
return(1)
else if (species_name == "versicolor")
return(2)
else
return(3)
}
))
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarize(mean(Petal.Length))
A central concept of this way of working with data (found across multiple programming languages and packages, not only R) is to structure the data in the standard statistical way, with variables as columns and rows as observations. This organizational discipline (aka “normal form”) has been popularized under the name tidy data.
It’s highly recommended to read this section from the R for Data Science book. Having tables organized this way makes very many operations straightforward.
The feature-rich Biostrings package has convenient functions for processing sequence files and biological data files of different types. For a quick intro, read: https://web.stanford.edu/class/bios221/labs/biostrings/lab_1_biostrings.html
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Biostrings")
## Bioconductor version 3.11 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
## Installing package(s) 'Biostrings'
##
## The downloaded binary packages are in
## /var/folders/91/c5sshmws4xvfcfpnw5l7gb2545hq76/T//RtmpXnh18t/downloaded_packages
## Old packages: 'fastmap', 'leiden', 'rappdirs', 'reprex', 'shiny', 'withr'
Let’s load ORFs and protein sequences from a reference yeast genome.
ORFS_YEAST_FILE <- "/tmp/orf_trans.fasta.gz"
download.file("http://sgd-archive.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_genomic_all.fasta.gz",
destfile = ORFS_YEAST_FILE)
orfs_yeast <- Biostrings::readDNAStringSet(ORFS_YEAST_FILE)
orfs_yeast
## DNAStringSet object of length 6713:
## width seq names
## [1] 3573 ATGGTACTGACGATTTATCCTG...TCTATATATGAATCTACATAA YAL001C TFC3 SGDI...
## [2] 3825 ATGGAGCAAAATGGCCTTGACC...TCTAACCCAAAAATAGTATAA YAL002W VPS8 SGDI...
## [3] 987 ATGGCATCCACCGATTTCTCCA...GCTGCTATGCAAAAATTATAA YAL003W EFB1 SGDI...
## [4] 648 ATGGGTGTCACCAGCGGTGGCC...ACACGAGTATGTTGTACCTAA YAL004W YAL004W S...
## [5] 1929 ATGTCAAAAGCTGTCGGTATTG...ACCGTTGAAGAAGTTGATTAA YAL005C SSA1 SGDI...
## ... ... ...
## [6709] 156 ATGAAAATAAAACTTAATAAAT...TATATTAATAAAAAAAAGTAA Q0297 Q0297 SGDID...
## [6710] 1272 ATGCCACAATTTGGTATATTAT...TACATAAATAGACGCATATAA R0010W FLP1 SGDID...
## [6711] 1122 ATGAATGGCGAGAGACTGCTTG...GAGCGAAAGGTGGATGGGTAG R0020C REP1 SGDID...
## [6712] 546 ATGCCTTATAAAACAGCTATAG...AGTATATTTGAACCTGTATAA R0030W RAF1 SGDID...
## [6713] 891 ATGGACGACATTGAAACAGCCA...CGTAATACTTCTAGGGTATGA R0040C REP2 SGDID...
Some operations can be applied to the entire set of sequences:
Biostrings::reverseComplement(orfs_yeast)
## DNAStringSet object of length 6713:
## width seq names
## [1] 3573 TTATGTAGATTCATATATAGAA...AGGATAAATCGTCAGTACCAT YAL001C TFC3 SGDI...
## [2] 3825 TTATACTATTTTTGGGTTAGAT...GTCAAGGCCATTTTGCTCCAT YAL002W VPS8 SGDI...
## [3] 987 TTATAATTTTTGCATAGCAGCA...GGAGAAATCGGTGGATGCCAT YAL003W EFB1 SGDI...
## [4] 648 TTAGGTACAACATACTCGTGTG...GCCACCGCTGGTGACACCCAT YAL004W YAL004W S...
## [5] 1929 TTAATCAACTTCTTCAACGGTT...AATACCGACAGCTTTTGACAT YAL005C SSA1 SGDI...
## ... ... ...
## [6709] 156 TTACTTTTTTTTATTAATATAT...TTTATTAAGTTTTATTTTCAT Q0297 Q0297 SGDID...
## [6710] 1272 TTATATGCGTCTATTTATGTAG...TAATATACCAAATTGTGGCAT R0010W FLP1 SGDID...
## [6711] 1122 CTACCCATCCACCTTTCGCTCC...AAGCAGTCTCTCGCCATTCAT R0020C REP1 SGDID...
## [6712] 546 TTATACAGGTTCAAATATACTA...TATAGCTGTTTTATAAGGCAT R0030W RAF1 SGDID...
## [6713] 891 TCATACCCTAGAAGTATTACGT...GGCTGTTTCAATGTCGTCCAT R0040C REP2 SGDID...
PROT_SEQ_YEAST_FILE <- "/tmp/orf_trans.fasta.gz"
download.file("http://sgd-archive.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans.fasta.gz",
destfile = PROT_SEQ_YEAST_FILE)
prot_sequence_yeast <- Biostrings::readAAStringSet(PROT_SEQ_YEAST_FILE)
prot_sequence_yeast
## AAStringSet object of length 5917:
## width seq names
## [1] 1161 MVLTIYPDELVQIVSDKIASNK...TTDFDGYWVNHNWYSIYEST* YAL001C TFC3 SGDI...
## [2] 1275 MEQNGLDHDSRSSIDTTINDTQ...QKPDEYSCLICQTESNPKIV* YAL002W VPS8 SGDI...
## [3] 207 MASTDFSKIETLKQLNASLADK...SIEEDEDHVQSTDIAAMQKL* YAL003W EFB1 SGDI...
## [4] 643 MSKAVGIDLGTTYSCVAHFAND...FPGGAPPAPEAEGPTVEEVD* YAL005C SSA1 SGDI...
## [5] 216 MIKSTIALPSFFIVLILALVNS...SIAQVLLIQFLFTGRQKNYV* YAL007C ERP2 SGDI...
## ... ... ...
## [5913] 270 MTHLERSRHQQHPFHMVMPSPW...THVLDVIWLFLYVVFYWWGV* Q0275 COX3 SGDID:...
## [5914] 424 MPQFGILCKTPPKVLVRQFVER...NGIISQEVLDYLSSYINRRI* R0010W FLP1 SGDID...
## [5915] 374 MNGERLLACIKQCIMQHFQPMV...KPVDVEVEFRCKFKERKVDG* R0020C REP1 SGDID...
## [5916] 182 MPYKTAIDCIEELATQCFLSKL...NFEHPNLGVFPETDSIFEPV* R0030W RAF1 SGDID...
## [5917] 297 MDDIETAKNLTVKARTAYSVWD...KKRRVATRVRGRKSRNTSRV* R0040C REP2 SGDID...
The General feature format contains information about genes and DNA, RNA, and protein sequences. It’s structured as a ([Tab] delimited) table, meaning we can read it directly into a data frame (or tibble).
Read the genome info as GFF from the source website (the file is not copied, just read directly into the data.frame). We could parse the file to extract the column names, but it’s just simpler if we write them.
library(RCurl)
library(tidyverse)
GENOME_INFO_SOURCE <-
"ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.gff"
genome_info <- readr::read_tsv(RCurl::getURL(GENOME_INFO_SOURCE), # stream in the file directly from the web
# skip the (first) lines in the GFF that start with "#"
# (these are comments)
comment = "#",
col_names = c("sequence_id", "source", "feature",
"start", "end", "score", "strand",
"phase", "attributes"))
genome_info
Plot the histogram of gene lengths:
genome_info %>%
dplyr::filter(feature == "gene") %>%
dplyr::mutate(gene_length = end - start) %>%
ggplot() +
geom_histogram(aes(x = gene_length))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s see if the GFF file contains protein sequences. The command below chains several operations:
attributes
columnFor reproducibility, every time one works with random numbers, the initial value (“seed”) of the random number generator has to be set to a known value. (Computers cannot really generate random numbers, just number sequences that look random.)
library(stringr)
set.seed(42)
genome_info %>%
dplyr::filter(feature == "gene") %>%
dplyr::select(attributes) %>%
dplyr::mutate(attributes = lapply(attributes,
function(attr_string) {
stringr::str_split(attr_string, pattern = ";")
})
) %>%
sample_n(4) %>%
pull # extract ("pull") the underlying list from the table so we can print contents
## [[1]]
## [[1]][[1]]
## [1] "ID=gene1176" "Name=snp" "gbkey=Gene"
## [4] "gene=snp" "locus_tag=VNG_1496G" "old_locus_tag=VNG1496G"
##
##
## [[2]]
## [[2]][[1]]
## [1] "ID=gene1097" "Name=htr9" "gbkey=Gene"
## [4] "gene=htr9" "locus_tag=VNG_1395G" "old_locus_tag=VNG1395G"
##
##
## [[3]]
## [[3]][[1]]
## [1] "ID=gene1251" "Name=gdcH" "gbkey=Gene"
## [4] "gene=gdcH" "locus_tag=VNG_1605G" "old_locus_tag=VNG1605G"
##
##
## [[4]]
## [[4]][[1]]
## [1] "ID=gene633" "Name=VNG_0804C" "gbkey=Gene"
## [4] "locus_tag=VNG_0804C" "old_locus_tag=VNG0804C"