1 Metagenomics

This tutorial is taken from the edgeR documentation: 4.2 RNA-Seq of pathogen inoculated arabidopsis with batch effects.

1.1 Experimental setup

Pseudomonas syringae is a bacterium often used to study plant reactions to pathogens. In this experiment, six-week-old Arabidopsis plants were inoculated with the ∆hrcC mutant of P. syringae, after which total RNA was extracted from leaves. Control plants were inoculated with a mock pathogen. Three biological replicates of the experiment were conducted at separate times and using independently grown plants and bacteria. The six RNA samples were sequenced one per lane on an Illumina Genome Analyzer. Reads were aligned and summarized per gene using GENE-counter. The reference genome was derived from the TAIR9 genome release (www.arabidopsis.org).

1.1.1 Where is the data?

The file containing the read counts (arabidopsis.csv) will be in your /storage/student{nr}/folder at the start of the tutorial. Move it into your working directory before starting the analysis.

1.1.2 Learning aims

During this tutorial you will learn the basics of an RNA-Seq analysis. What concepts are important to take into consideration?

  • Compositional data
  • Batch effects
  • Correcting for multiple testing
  • Interpreting dimensionality reduction plots

Since the statistical assumptions for RNA-Seq and metagenomics are very similar, the concepts and tools you learn about today can also be applied to metagenomics data.

1.1.3 Install dependencies

BiocManager::install("edgeR")
BiocManager::install("statmod")

1.1.4 Importing libraries
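
The rest of the tutorial assumes the following libraries are attached. This is a minimal sketch: edgeR provides the differential expression functions, and the tidyverse provides read_csv() and column_to_rownames(); statmod is used internally by edgeR and does not need to be attached explicitly.

library(edgeR)      # DGEList(), filterByExpr(), calcNormFactors(), glmQLFit(), ...
library(tidyverse)  # read_csv(), pipes, column_to_rownames()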

1.1.5 Read and preprocess data

df <- read_csv('arabidopsis.csv')

# This sets the gene column as the row names
df <- df %>% column_to_rownames(var = 'gene')

Let’s take a look at what the data frame looks like!

 head(df)

1.2 Create group variables

We need to tell edgeR which experimental group each column belongs to. First, we divide the samples by experimental condition, Treat (∆hrcC-treated vs. mock control). Second, we assign each sample its sampling time, Time (1, 2 or 3).

# Identifies the different treatments
Treat <- factor(substring(colnames(df),1,4))

# Print the Treat variable before and after this command. What did it do?
Treat <- relevel(Treat, ref="mock")

# Identifies the sampling time point
Time <- factor(substring(colnames(df),5,5))

1.2.1 Create DGEList

edgeR stores data in a simple list-based data object called a DGEList. This type of object is easy to use because it can be manipulated like any list in R. Let’s create one and see what kind of information it contains!

y <- DGEList(counts=df, group=Treat)
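
A few standard ways to peek at what the DGEList holds (these are ordinary list accessors):

names(y)        # the components stored in the DGEList
y$samples       # library sizes, group labels and normalisation factors
head(y$counts)  # the count matrix itself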

1.2.2 Filter lowly expressed genes

Genes with very low counts across all libraries provide little evidence for differential expression. From a biological point of view, a gene must be expressed at some minimal level before it is likely to be translated into a protein or to be biologically important. In addition, the pronounced discreteness of these counts interferes with some of the statistical approximations that are used later in the pipeline. These genes should be filtered out prior to further analysis.

# Identify genes to keep
keep <- filterByExpr(y)
table(keep)

# Filter lowly expressed genes
y <- y[keep, , keep.lib.sizes=FALSE]

1.3 Calculate normalisation factors

If a small proportion of highly expressed genes consume a substantial proportion of the total library size for a particular sample, this will cause the remaining genes to be under-sampled for that sample. Unless this effect is adjusted for, the remaining genes may falsely appear to be down-regulated in that sample.

calcNormFactors() computes a “scaling factor” for each library which compensates for this.

y <- calcNormFactors(y)
y$samples

1.4 Data exploration

Distances on an MDS plot of a DGEList object correspond to leading log-fold-change between each pair of samples. Leading log-fold-change is the root-mean-square average of the largest log2-fold-changes between each pair of samples.

plotMDS(y, col=rep(1:2, each=3))
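
If it is hard to keep track of which colour is which, you can add a simple legend. This sketch assumes the mock columns come first in the count table (as in the edgeR documentation), so that colour 1 corresponds to mock and colour 2 to ∆hrcC:

# Assumes the first three columns are the mock samples and the last three the hrcc samples
legend("topleft", legend = levels(Treat), text.col = 1:2)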

To examine further the consistency of the three replicates, we compute predictive log2-fold-changes (logFC) for the treatment separately for the three times.

# Create model matrix 
design <- model.matrix(~Time+Time:Treat)

# Predict logFC between samples
logFC <- predFC(y,design,prior.count=1,dispersion=0.05)

# Calculate how well samples taken at different times correlate
cor(logFC[,4:6])

1.5 Model matrix

This model matrix will tell edgeR which samples belong to which groups: an additive model in which the sampling time acts as a blocking factor and the treatment is the effect of interest.

design <- model.matrix(~Time+Treat)
rownames(design) <- colnames(y)
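
It can be useful to check which column of the design matrix corresponds to which coefficient, since the coef argument in the tests below refers to these columns. With this parameterisation the columns should be something like (Intercept), Time2, Time3 and Treathrcc:

colnames(design)  # coefficient names, referred to by coef= in the tests below
design            # the full design matrix, one row per sample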

1.6 Estimate dispersion

Since the dispersions control the variances of the gene counts, underestimation may lead to false discovery, while overestimation may lower the rate of true detection.

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0081415

A larger dispersion means higher variance between replicates, which reduces the power to detect differentially expressed genes.

y <- estimateDisp(y, design, robust=TRUE)
y$common.dispersion
plotBCV(y)

1.7 Differential expression

To see which genes show a statistically significant and biologically meaningful change in expression level, we can perform a differential expression analysis.

1.7.1 Test for differences between replicates

First we check whether there is a genuine need to adjust for the experimental times. We do this by testing for differential expression between the three sampling times.

# Fit the model
fit <- glmQLFit(y, design, robust=TRUE)
# Test for differences between sampling times
qlf <- glmQLFTest(fit, coef=2:3)

# Find top genes
topTags(qlf)

How many genes were significantly differentially expressed between the sampling times alone?

# Get nr significant genes FDR corrected
FDR <- p.adjust(qlf$table$PValue, method="fdr")

# Total nr of genes 
length(FDR)

# nr significant genes
sum(FDR < 0.05)
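
An equivalent cross-check is to use the FDR column that topTags() computes when asked for all genes; the count should match the one above:

# Same count, using the FDR column reported by topTags()
sum(topTags(qlf, n = Inf)$table$FDR < 0.05)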

1.7.2 Identify differences between treatments

# Perform test: mock vs hrcc. By default, glmQLFTest() tests the last coefficient
# in the design matrix, which here is the treatment effect (Treathrcc)
qlf <- glmQLFTest(fit)
topTags(qlf)

Here’s a closer look at the individual counts-per-million for the top genes.

top <- rownames(topTags(qlf))
cpm(y)[top,]

The total number of genes significantly up-regulated or down-regulated at 5% FDR is summarized as follows:

summary(decideTests(qlf))

We can plot all the logFCs against average count size, highlighting the DE genes:

plotMD(qlf)
abline(h=c(-1,1), col="blue")

2 Single-Cell

In this exercise, we will analyze single cell data from human cancer cells. Download this notebook and the data file:

mkdir single_cell
cd single_cell
wget https://zelezniak-lab.github.io/MPBIO-BBT045/single_cell/sc_analysis.Rmd
wget https://zelezniak-lab.github.io/MPBIO-BBT045/single_cell/sc_human_brain.RData

The overall workflow is:

  1. Read and preprocess data sets: healthy + cancer
  2. Normalize data
  3. Visualize data

Go through the notebook yourselves and answer the questions (mostly reflections). If time permits, we will address these in class.

A good part of the work involves changing the way data is represented to best match the analysis tasks we’re interested in.

2.1 Single cell data sources

We have data from multiple sources.

For cancer data, we’ll use the GSE84465 dataset, which consists of 3589 GBM (glioblastoma multiforme) cells.

For control data, we will use the GSE67835 dataset, which consists of RNA sequencing of 466 healthy brain cells.

We also have tissue-specific data from GTEx (gene expression counts in a tissue-specific manner), as well as cancer data from TCGA (The Cancer Genome Atlas).

These data were already downloaded for you and stored in the file sc_human_brain.RData.

For reference, the datasets can be viewed on GEO under their accession numbers (GSE84465 and GSE67835).

2.2 Install the required packages (10-15 minutes)

Run the following installation commands in the R console (they produce a lot of output). Watch for any questions R asks you: answer “Yes” when it asks for confirmation and “n” when it asks whether to update packages.

You only need to do this once for your RStudio.

BiocManager::install("SingleCellExperiment")

# We need this specific RcppAnnoy version, otherwise BiocNeighbors fails to compile
install.packages("https://cran.r-project.org/src/contrib/Archive/RcppAnnoy/RcppAnnoy_0.0.16.tar.gz", repos=NULL, type="source")

BiocManager::install("BiocNeighbors")
BiocManager::install("scater")
BiocManager::install("scran")
install.packages("uwot")

2.3 Load and Preprocess Data

Run the chunk below to load the data from the RData file. This will create new variables in the environment, which you can view in RStudio’s Environment pane to the right.

load("sc_human_brain.RData")

sc.counts holds GBM single-cell counts as transcripts per million (TPM), from data set GSE84465. It’s a matrix with genes as rows and single-cell samples as columns:

library("tidyverse")

rownames(sc.counts) %>% head
colnames(sc.counts) %>% head
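
It is also worth checking the dimensions: rows are genes and columns are single-cell samples (the dataset description above mentions 3589 GBM cells):

dim(sc.counts)  # genes x cells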

From the various databases, we have information about which cell type each of these samples comes from. We’ll extract this as a data frame for downstream use.

cell_types <-
    tibble::enframe(
        x = sc.sample.info["cell type:ch1",],
        name = "GEO_sc_sample_id",
        value = "cell_type"
    )

cell_types

To process single cell data, we will use the library SingleCellExperiment. It expects a matrix structured as (rows = genes, columns = samples), the same as our sc.counts matrix. Here we also feed in the previously constructed cell_types data frame.

library("SingleCellExperiment")
gbm_sc_experiment <- SingleCellExperiment(assays = list(counts = sc.counts),
                                          colData = cell_types)
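
Printing the object and tabulating the cell types is a quick way to see what we have to work with (colData() is the standard accessor for the per-cell metadata):

gbm_sc_experiment                            # overview: assays, dimensions, colData columns
table(colData(gbm_sc_experiment)$cell_type)  # number of cells per cell type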

2.4 Normalization

We have two things to take care of:

  • gene counts from multiple single cells must be adjusted to a single scale (often a percent)
  • gene counts spanning multiple orders of magnitude

2.4.1 Question

Why are these two things problematic?

To make our lives easier, we’ll use two helper libraries for SingleCellExperiment to:

  • compute the size factor (weight) of each sample
  • log-transform the counts

For simplicity and to save memory, we’ll be overwriting the original with the normalized version, since it’s all we care about.

library("scater")  # Single-Cell Analysis Toolkit for Gene Expression Data
library("scran")   # Methods for Single-Cell RNA-Seq Data Analysis

# Compute the size factor of each single cell sample using
# the deconvolution strategy by (Lun et al., 2016) for scaling normalization of sparse count data
gbm_sc_experiment <- scran::computeSumFactors(gbm_sc_experiment)

# Compute log-transformed normalized expression values
gbm_sc_experiment <- scater::logNormCounts(gbm_sc_experiment)
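
As a quick sanity check, we can look at the computed size factors and confirm that a “logcounts” assay has been added alongside the original counts:

summary(sizeFactors(gbm_sc_experiment))  # one size factor per cell
assayNames(gbm_sc_experiment)            # should now list "counts" and "logcounts"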

3 Visualization

We’re going to inspect the cells by reducing the number of dimensions (genes) using a linear and a non-linear method, then try to distinguish clusters of cells as a basis for potential biological interpretation.

3.1 PCA

While R has a built-in PCA function, we can skip some data wrangling and use the specialised functions in the SingleCellExperiment helper libraries. These also conveniently colour the points according to the cell_type column we prepared earlier.

Run the chunk below and answer these questions.

3.1.1 Questions

  • How many clusters do you see?
  • Do you see a clear separation between the cell types?

# By default, runPCA() uses the top 500 genes with the highest variances to compute the first PCs.
gbm_sc_experiment <- scater::runPCA(gbm_sc_experiment)

scater::plotReducedDim(gbm_sc_experiment, 
                       dimred = "PCA", 
                       colour_by = "cell_type")   # Note the UK spelling "coloUr"

3.2 UMAP

Let’s see how the data is distributed under a UMAP transformation.

UMAP’s major parameter is the “number of neighbors” to consider around each point in the multidimensional space. It’s rather robust to this value, but one ought to try different values.

Please keep in mind that UMAP initialization might be randomized, so to reproduce the same results, R’s random number generator has to be fixed, making it give out the same sequence of random numbers every time.

set.seed(123)           # Fix random number generation
UMAP_N_NEIGHBORS = 30

gbm_sc_experiment <- scater::runUMAP(gbm_sc_experiment,
                                     n_neighbors = UMAP_N_NEIGHBORS)
scater::plotUMAP(gbm_sc_experiment, 
                 colour_by = "cell_type")

3.3 t-SNE

The major parameter to adjust for t-SNE is the “perplexity” (a statistical concept), roughly the size of the neighbourhood considered around each point. t-SNE can be quite sensitive to it (depending on the data), so you should try out different values.

set.seed(123) 
TSNE_PERPLEXITY = 10

gbm_sc_experiment <- scater::runTSNE(gbm_sc_experiment,
                                     perplexity = TSNE_PERPLEXITY)
scater::plotTSNE(gbm_sc_experiment, 
                 colour_by = "cell_type")

3.3.1 Questions

  • Was this helpful?
  • Which method was most informative?
  • Do the various methods agree on clusters?

4 Find cancer cell clusters

Now let’s focus on only the cancerous cells and look for clusters of cells within them.

Recall: the columns are the single-cell sample IDs. We can use the cell_types data frame to pick out those columns (samples) that come from cancer cells.

# Logical vector: TRUE for the samples whose cell type is "Neoplastic", i.e. the cancer cells
cancer_samples <- sapply(cell_types$cell_type, 
                         function(ctype) {ctype == "Neoplastic"}
                         )

gbm_sc_cancer_experiment <- gbm_sc_experiment[, cancer_samples]
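
A quick check on how many neoplastic cells we selected and kept:

table(cancer_samples)          # TRUE = neoplastic (cancer) cells
dim(gbm_sc_cancer_experiment)  # genes x neoplastic cells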

4.1 Visualize with different methods

Since we’re looking at a single type of cells, it’s not obvious how to colour the points. The aim is rather to find clusters (subpopulations) of cells within this category of cancer cells.

gbm_sc_cancer_experiment <- scater::runPCA(gbm_sc_cancer_experiment)
scater::plotReducedDim(gbm_sc_cancer_experiment, dimred = "PCA")

set.seed(123)
TSNE_PERPLEXITY <- 10
gbm_sc_cancer_experiment <- scater::runTSNE(gbm_sc_cancer_experiment, 
                                            perplexity = TSNE_PERPLEXITY)
scater::plotTSNE(gbm_sc_cancer_experiment)

set.seed(123)       
UMAP_N_NEIGHBORS <- 30

gbm_sc_cancer_experiment <- scater::runUMAP(gbm_sc_cancer_experiment,
                                            n_neighbors = UMAP_N_NEIGHBORS)
scater::plotUMAP(gbm_sc_cancer_experiment)

4.2 K-means clustering

Since we don’t know how to “color” (group) the points, let’s use K-means clustering.

umap_variables <- reducedDims(gbm_sc_cancer_experiment)[["UMAP"]]
tsne_variables <- reducedDims(gbm_sc_cancer_experiment)[["TSNE"]]
pca_results <- reducedDims(gbm_sc_cancer_experiment)[["PCA"]]

Let’s look at the scree plot:

component_variance_explained <- attr(pca_results, "percentVar")

tibble::as_tibble_col(component_variance_explained, column_name = "variance") %>%
    dplyr::mutate(pc = dplyr::row_number()) %>%
    ggplot() +
    geom_point(aes(x = pc, y = variance)) +
    geom_line(aes(x = pc, y = variance))

Though there are more rigorous ways to choose, let’s say the first 3 principal components are most informative, accounting for the following percentages of variation in the data:

head(component_variance_explained, n = 3)

Now for the k-means clustering using the built-in R function. We need to transpose the matrix since kmeans() clusters the rows of its input, so the cells (our observations) must be the rows and the genes (our variables) the columns.

k = 3 

cancer_clusters <- kmeans(t(sc.counts[, cancer_samples]), centers = k)

# Collect the cluster assignment of each cell (the vector names are the sample IDs)
cell_clusters <- 
    tibble::as_tibble_col(cancer_clusters$cluster, column_name = "cluster") %>% 
    dplyr::mutate(sample = names(cancer_clusters$cluster))
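
A quick look at the cluster sizes, plus a sketch of an alternative: since we said the first three principal components are the most informative, one could also run k-means on those (the pca_results matrix extracted above has one row per cell). This is only an illustrative alternative, not the approach used in the rest of the notebook:

table(cancer_clusters$cluster)   # number of cells per k-means cluster

# Alternative sketch: cluster the cells on the first 3 principal components instead of raw TPM
set.seed(123)
pca_clusters <- kmeans(pca_results[, 1:3], centers = k)
table(pca_clusters$cluster)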

Now we add the cluster labels to our cancer SingleCellExperiment object and plot the dimensionality reductions again, this time colouring by the k-means cluster number.

# Add the k-means cluster numbering to the per-cell metadata (colData)
colData(gbm_sc_cancer_experiment)[, "k_cluster"] <- as.factor(cancer_clusters$cluster)

scater::plotUMAP(gbm_sc_cancer_experiment, colour_by = "k_cluster")
scater::plotTSNE(gbm_sc_cancer_experiment, colour_by = "k_cluster")
scater::plotReducedDim(gbm_sc_cancer_experiment, dimred = "PCA", colour_by = "k_cluster") 

4.2.1 Questions

  • Does the number of clusters make sense? Does this vary by dimensionality reduction method?
  • How is the clustering different if we change the number of clusters k?