4 Biological Analysis

4.1 Clustering Introduction

Once we have normalized the data and removed confounders we can carry out analyses that are relevant to the biological questions at hand. The exact nature of the analysis depends on the dataset. Nevertheless, there are a few aspects that are useful in a wide range of contexts and we will be discussing some of them in the next few chapters. We will start with the clustering of scRNA-seq data.

4.1.1 Introduction

One of the most promising applications of scRNA-seq is de novo discovery and annotation of cell-types based on transcription profiles. Computationally, this is a hard problem as it amounts to unsupervised clustering. That is, we need to identify groups of cells based on the similarities of the transcriptomes without any prior knowledge of the labels. Moreover, in most situations we do not even know the number of clusters a priori. The problem is made even more challenging due to the high level of noise (both technical and biological) and the large number of dimensions (i.e. genes).

4.1.2 Dimensionality reductions

When working with large datasets, it can often be beneficial to apply some sort of dimensionality reduction method. By projecting the data onto a lower-dimensional sub-space, one is often able to significantly reduce the amount of noise. An additional benefit is that it is typically much easier to visualize the data in a 2 or 3-dimensional subspace. We have already discussed PCA (chapter 3.10.2) and t-SNE (chapter 3.10.2).

4.1.3 Clustering methods

Unsupervised clustering is useful in many different applications and it has been widely studied in machine learning. Some of the most popular approaches are hierarchical clustering, k-means clustering and graph-based clustering.

4.1.3.1 Hierarchical clustering

In hierarchical clustering, one can use either a bottom-up or a top-down approach. In the former case, each cell is initially assigned to its own cluster and pairs of clusters are subsequently merged to create a hieararchy:

Raw data

Figure 4.1: Raw data

The hierarchical clustering dendrogram

Figure 4.2: The hierarchical clustering dendrogram

With a top-down strategy, one instead starts with all observations in one cluster and then recursively split each cluster to form a hierarchy. One of the advantages of this strategy is that the method is deterministic.

4.1.3.2 k-means

In k-means clustering, the goal is to partition N cells into k different clusters. In an iterative manner, cluster centers are assigned and each cell is assigned to its nearest cluster:

Schematic representation of the k-means clustering

Figure 4.3: Schematic representation of the k-means clustering

Most methods for scRNA-seq analysis includes a k-means step at some point.

4.1.3.3 Graph-based methods

Over the last two decades there has been a lot of interest in analyzing networks in various domains. One goal is to identify groups or modules of nodes in a network.

Schematic representation of the graph network

Figure 4.4: Schematic representation of the graph network

Some of these methods can be applied to scRNA-seq data by building a graph where each node represents a cell. Note that constructing the graph and assigning weights to the edges is not trivial. One advantage of graph-based methods is that some of them are very efficient and can be applied to networks containing millions of nodes.

4.1.4 Challenges in clustering

  • What is the number of clusters k?
  • What is a cell type?
  • Scalability: in the last few years the number of cells in scRNA-seq experiments has grown by several orders of magnitude from ~\(10^2\) to ~\(10^6\)
  • Tools are not user-friendly

4.1.5 Tools for scRNA-seq data

4.1.5.1 SINCERA

  • SINCERA (Guo et al. 2015) is based on hierarchical clustering
  • Data is converted to z-scores before clustering
  • Identify k by finding the first singleton cluster in the hierarchy

4.1.5.2 pcaReduce

pcaReduce (žurauskienė and Yau 2016) combines PCA, k-means and “iterative” hierarchical clustering. Starting from a large number of clusters pcaReduce iteratively merges similar clusters; after each merging event it removes the principle component explaning the least variance in the data.

4.1.5.3 SC3

SC3 pipeline

Figure 4.5: SC3 pipeline

  • SC3 (Kiselev et al. 2017) is based on PCA and spectral dimensionality reductions
  • Utilises k-means
  • Additionally performs the consensus clustering

4.1.5.4 tSNE + k-means

  • Based on tSNE maps
  • Utilises k-means

4.1.5.5 SNN-Cliq

SNN-Cliq (C. Xu and Su 2015) is a graph-based method. First the method identifies the k-nearest-neighbours of each cell according to the distance measure. This is used to calculate the number of Shared Nearest Neighbours (SNN) between each pair of cells. A graph is built by placing an edge between two cells If they have at least one SNN. Clusters are defined as groups of cells with many edges between them using a “clique” method. SNN-Cliq requires several parameters to be defined manually.

4.1.5.6 Seurat clustering

Seurat clustering is based on a community detection approach similar to SNN-Cliq and to one previously proposed for analyzing CyTOF data (Levine et al. 2015). Since Seurat has become more like an all-in-one tool for scRNA-seq data analysis we dedicate a separate chapter to discuss it in more details (chapter 5).

4.1.6 Comparing clustering

To compare two sets of clustering labels we can use adjusted Rand index. The index is a measure of the similarity between two data clusterings. Values of the adjusted Rand index lie in \([0;1]\) interval, where \(1\) means that two clusterings are identical and \(0\) means the level of similarity expected by chance.

4.2 Clustering example

library(pcaMethods)
library(pcaReduce)
library(SC3)
library(scater)
library(SingleCellExperiment)
library(pheatmap)
library(mclust)
set.seed(1234567)

To illustrate clustering of scRNA-seq data, we consider the Deng dataset of cells from developing mouse embryo (Deng et al. 2014). We have preprocessed the dataset and created a SingleCellExperiment object in advance. We have also annotated the cells with the cell types identified in the original publication (it is the cell_type2 column in the colData slot).

4.2.1 Deng dataset

Let’s load the data and look at it:

deng <- readRDS("deng/deng-reads.rds")
deng
## class: SingleCellExperiment 
## dim: 22431 268 
## metadata(0):
## assays(2): counts logcounts
## rownames(22431): Hvcn1 Gbp7 ... Sox5 Alg11
## rowData names(10): feature_symbol is_feature_control ...
##   total_counts log10_total_counts
## colnames(268): 16cell 16cell.1 ... zy.2 zy.3
## colData names(30): cell_type2 cell_type1 ... pct_counts_ERCC
##   is_cell_control
## reducedDimNames(0):
## spikeNames(1): ERCC

Let’s look at the cell type annotation:

table(colData(deng)$cell_type2)
## 
##     16cell      4cell      8cell early2cell earlyblast  late2cell 
##         50         14         37          8         43         10 
##  lateblast   mid2cell   midblast         zy 
##         30         12         60          4

A simple PCA analysis already separates some strong cell types and provides some insights in the data structure:

plotPCA(deng, colour_by = "cell_type2")

As you can see, the early cell types separate quite well, but the three blastocyst timepoints are more difficult to distinguish.

4.2.2 SC3

Let’s run SC3 clustering on the Deng data. The advantage of the SC3 is that it can directly ingest a SingleCellExperiment object.

Now let’s image we do not know the number of clusters k (cell types). SC3 can estimate a number of clusters for you:

deng <- sc3_estimate_k(deng)
## Estimating k...
metadata(deng)$sc3$k_estimation
## [1] 6

Interestingly, the number of cell types predicted by SC3 is smaller than in the original data annotation. However, early, mid and late stages of different cell types together, we will have exactly 6 cell types. We store the merged cell types in cell_type1 column of the colData slot:

plotPCA(deng, colour_by = "cell_type1")

Now we are ready to run SC3 (we also ask it to calculate biological properties of the clusters):

deng <- sc3(deng, ks = 10, biology = TRUE)
## Setting SC3 parameters...
## Calculating distances between the cells...
## Performing transformations and calculating eigenvectors...
## Performing k-means clustering...
## Calculating consensus matrix...
## Calculating biology...

SC3 result consists of several different outputs (please look in (Kiselev et al. 2017) and SC3 vignette for more details). Here we show some of them:

Consensus matrix:

sc3_plot_consensus(deng, k = 10, show_pdata = "cell_type2")

Silhouette plot:

sc3_plot_silhouette(deng, k = 10)

Heatmap of the expression matrix:

sc3_plot_expression(deng, k = 10, show_pdata = "cell_type2")

Identified marker genes:

sc3_plot_markers(deng, k = 10, show_pdata = "cell_type2")

PCA plot with highlighted SC3 clusters:

plotPCA(deng, colour_by = "sc3_10_clusters")

Compare the results of SC3 clustering with the original publication cell type labels:

adjustedRandIndex(colData(deng)$cell_type2, colData(deng)$sc3_10_clusters)
## [1] 0.7705208

Note SC3 can also be run in an interactive Shiny session:

sc3_interactive(deng)

This command will open SC3 in a web browser.

Note Due to direct calculation of distances SC3 becomes very slow when the number of cells is \(>5000\). For large datasets containing up to \(10^5\) cells we recomment using Seurat (see chapter 5).

  • Exercise 1: Run SC3 for \(k\) from 8 to 12 and explore different clustering solutions in your web browser.

  • Exercise 2: Which clusters are the most stable when \(k\) is changed from 8 to 12? (Look at the “Stability” tab)

  • Exercise 3: Check out differentially expressed genes and marker genes for the obtained clusterings. Please use \(k=10\).

  • Exercise 4: Change the marker genes threshold (the default is 0.85). Does SC3 find more marker genes?

4.2.3 pcaReduce

pcaReduce operates directly on the expression matrix. It is recommended to use a gene filter and log transformation before running pcaReduce. We will use the default SC3 gene filter (note that the exprs slot of a scater object is log-transformed by default).

# use the same gene filter as in SC3
input <- logcounts(deng[rowData(deng)$sc3_gene_filter, ])

There are several parameters used by pcaReduce: * nbt defines a number of pcaReduce runs (it is stochastic and may have different solutions after different runs) * q defines number of dimensions to start clustering with. The output will contain partitions for all \(k\) from 2 to q+1. * method defines a method used for clustering. S - to perform sampling based merging, M - to perform merging based on largest probability.

We will run pcaReduce 1 time:

# run pcaReduce 1 time creating hierarchies from 1 to 30 clusters
pca.red <- PCAreduce(t(input), nbt = 1, q = 30, method = 'S')[[1]]
colData(deng)$pcaReduce <- as.character(pca.red[,32 - 10])
plotPCA(deng, colour_by = "pcaReduce")

Exercise 5: Run pcaReduce for \(k=2\) and plot a similar PCA plot. Does it look good?

Hint: When running pcaReduce for different \(k\)s you do not need to rerun PCAreduce function, just use already calculated pca.red object.

Our solution:
Clustering solutions of pcaReduce method for $k=2$.

Figure 4.6: Clustering solutions of pcaReduce method for \(k=2\).

Exercise 6: Compare the results between pcaReduce and the original publication cell types for \(k=10\).

Our solution:

## [1] 0.4216031

4.2.4 tSNE + kmeans

tSNE plots that we saw before (3.10.3) when used the scater package are made by using the Rtsne and ggplot2 packages. Here we will do the same:

deng <- plotTSNE(deng, rand_seed = 1, return_SCE = TRUE)
tSNE map of the patient data

Figure 4.7: tSNE map of the patient data

Note that all points on the plot above are black. This is different from what we saw before, when the cells were coloured based on the annotation. Here we do not have any annotation and all cells come from the same batch, therefore all dots are black.

Now we are going to apply k-means clustering algorithm to the cloud of points on the tSNE map. How many groups do you see in the cloud?

We will start with \(k=8\):

colData(deng)$tSNE_kmeans <- as.character(kmeans(deng@reducedDims$TSNE, centers = 8)$clust)
plotTSNE(deng, rand_seed = 1, colour_by = "tSNE_kmeans")
tSNE map of the patient data with 8 colored clusters, identified by the k-means clustering algorithm

Figure 4.8: tSNE map of the patient data with 8 colored clusters, identified by the k-means clustering algorithm

Exercise 7: Make the same plot for \(k=10\).

Exercise 8: Compare the results between tSNE+kmeans and the original publication cell types. Can the results be improved by changing the perplexity parameter?

Our solution:

## [1] 0.3976772

As you may have noticed, both pcaReduce and tSNE+kmeans are stochastic and give different results every time they are run. To get a better overview of the solutions, we need to run the methods multiple times. SC3 is also stochastic, but thanks to the consensus step, it is more robust and less likely to produce different outcomes.

4.2.5 SNN-Cliq

Here we run SNN-cliq with te default parameters provided in the author’s example:

distan <- "euclidean"
par.k <- 3
par.r <- 0.7
par.m <- 0.5
# construct a graph
scRNA.seq.funcs::SNN(
    data = t(input),
    outfile = "snn-cliq.txt",
    k = par.k,
    distance = distan
)
# find clusters in the graph
snn.res <- 
    system(
        paste0(
            "python utils/Cliq.py ", 
            "-i snn-cliq.txt ",
            "-o res-snn-cliq.txt ",
            "-r ", par.r,
            " -m ", par.m
        ),
        intern = TRUE
    )
cat(paste(snn.res, collapse = "\n"))
## input file snn-cliq.txt
## find 66 quasi-cliques
## merged into 29 clusters
## unique assign done
snn.res <- read.table("res-snn-cliq.txt")
# remove files that were created during the analysis
system("rm snn-cliq.txt res-snn-cliq.txt")

colData(deng)$SNNCliq <- as.character(snn.res[,1])
plotPCA(deng, colour_by = "SNNCliq")

Exercise 9: Compare the results between SNN-Cliq and the original publication cell types.

Our solution:

## [1] 0.2629731

4.2.6 SINCERA

As mentioned in the previous chapter SINCERA is based on hierarchical clustering. One important thing to keep in mind is that it performs a gene-level z-score transformation before doing clustering:

# perform gene-by-gene per-sample z-score transformation
dat <- apply(input, 1, function(y) scRNA.seq.funcs::z.transform.helper(y))
# hierarchical clustering
dd <- as.dist((1 - cor(t(dat), method = "pearson"))/2)
hc <- hclust(dd, method = "average")

If the number of cluster is not known SINCERA can identify k as the minimum height of the hierarchical tree that generates no more than a specified number of singleton clusters (clusters containing only 1 cell)

num.singleton <- 0
kk <- 1
for (i in 2:dim(dat)[2]) {
    clusters <- cutree(hc, k = i)
    clustersizes <- as.data.frame(table(clusters))
    singleton.clusters <- which(clustersizes$Freq < 2)
    if (length(singleton.clusters) <= num.singleton) {
        kk <- i
    } else {
        break;
    }
}
cat(kk)
## 6

Let’s now visualize the SINCERA results as a heatmap:

pheatmap(
    t(dat),
    cluster_cols = hc,
    cutree_cols = kk,
    kmeans_k = 100,
    show_rownames = FALSE
)
Clustering solutions of SINCERA method using found $k$

Figure 4.9: Clustering solutions of SINCERA method using found \(k\)

Exercise 10: Compare the results between SINCERA and the original publication cell types.

Our solution:

## [1] 0.3823537

Exercise 11: Is using the singleton cluster criteria for finding k a good idea?

4.2.7 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] pheatmap_1.0.8             scater_1.6.2              
##  [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [5] DelayedArray_0.4.1         matrixStats_0.52.2        
##  [7] GenomicRanges_1.30.1       GenomeInfoDb_1.14.0       
##  [9] IRanges_2.12.0             S4Vectors_0.16.0          
## [11] ggplot2_2.2.1              SC3_1.7.6                 
## [13] pcaReduce_1.0              mclust_5.4                
## [15] mnormt_1.5-5               pcaMethods_1.70.0         
## [17] Biobase_2.38.0             BiocGenerics_0.24.0       
## [19] knitr_1.18                
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.13             ggbeeswarm_0.6.0       colorspace_1.3-2      
##   [4] rjson_0.2.15           class_7.3-14           rprojroot_1.3-2       
##   [7] XVector_0.18.0         bit64_0.9-7            AnnotationDbi_1.40.0  
##  [10] mvtnorm_1.0-6          scRNA.seq.funcs_0.1.0  codetools_0.2-15      
##  [13] tximport_1.6.0         doParallel_1.0.11      robustbase_0.92-8     
##  [16] cluster_2.0.6          shinydashboard_0.6.1   shiny_1.0.5           
##  [19] rrcov_1.4-3            compiler_3.4.3         httr_1.3.1            
##  [22] backports_1.1.2        assertthat_0.2.0       Matrix_1.2-7.1        
##  [25] lazyeval_0.2.1         limma_3.34.5           htmltools_0.3.6       
##  [28] prettyunits_1.0.2      tools_3.4.3            bindrcpp_0.2          
##  [31] gtable_0.2.0           glue_1.2.0             GenomeInfoDbData_1.0.0
##  [34] reshape2_1.4.3         dplyr_0.7.4            doRNG_1.6.6           
##  [37] Rcpp_0.12.15           gdata_2.18.0           iterators_1.0.9       
##  [40] stringr_1.2.0          mime_0.5               hypergeo_1.2-13       
##  [43] rngtools_1.2.4         gtools_3.5.0           WriteXLS_4.0.0        
##  [46] statmod_1.4.30         XML_3.98-1.9           edgeR_3.20.7          
##  [49] DEoptimR_1.0-8         MASS_7.3-45            zlibbioc_1.24.0       
##  [52] scales_0.5.0           rhdf5_2.22.0           RColorBrewer_1.1-2    
##  [55] yaml_2.1.16            memoise_1.1.0          gridExtra_2.3         
##  [58] pkgmaker_0.22          biomaRt_2.34.2         stringi_1.1.6         
##  [61] RSQLite_2.0            highr_0.6              pcaPP_1.9-73          
##  [64] foreach_1.4.4          orthopolynom_1.0-5     e1071_1.6-8           
##  [67] contfrac_1.1-11        caTools_1.17.1         moments_0.14          
##  [70] rlang_0.1.6            pkgconfig_2.0.1        bitops_1.0-6          
##  [73] evaluate_0.10.1        lattice_0.20-34        ROCR_1.0-7            
##  [76] bindr_0.1              labeling_0.3           cowplot_0.9.2         
##  [79] bit_1.1-12             deSolve_1.20           plyr_1.8.4            
##  [82] magrittr_1.5           bookdown_0.5           R6_2.2.2              
##  [85] gplots_3.0.1           DBI_0.7                pillar_1.1.0          
##  [88] RCurl_1.95-4.10        tibble_1.4.1           KernSmooth_2.23-15    
##  [91] rmarkdown_1.8          viridis_0.4.1          progress_1.1.2        
##  [94] locfit_1.5-9.1         grid_3.4.3             data.table_1.10.4-3   
##  [97] blob_1.1.0             digest_0.6.14          xtable_1.8-2          
## [100] httpuv_1.3.5           elliptic_1.3-7         munsell_0.4.3         
## [103] registry_0.5           beeswarm_0.2.3         viridisLite_0.2.0     
## [106] vipor_0.4.5

4.3 Feature Selection

library(scRNA.seq.funcs)
library(matrixStats)
library(M3Drop)
library(RColorBrewer)
library(SingleCellExperiment)
set.seed(1)

Single-cell RNASeq is capable of measuring the expression of many thousands of genes in every cell. However, in most situations only a portion of those will show a response to the biological condition of interest, e.g. differences in cell-type, drivers of differentiation, respond to an environmental stimulus. Most genes detected in a scRNASeq experiment will only be detected at different levels due to technical noise. One consequence of this is that technical noise and batch effects can obscure the biological signal of interest.

Thus, it is often advantageous to perform feature selection to remove those genes which only exhibit technical noise from downstream analysis. Not only does this generally increase the signal:noise ratio in the data; it also reduces the computational complexity of analyses, by reducing the total amount of data to be processed.

For scRNASeq data, we will be focusing on unsupervised methods of feature selection which don’t require any a priori information, such as cell-type labels or biological group, since they are not available, or may be unreliable, for many experiments. In contrast, differential expression (chapter 4.6) can be considered a form of supervised feature selection since it uses the known biological label of each sample to identify features (i.e. genes) which are expressed at different levels across groups.

For this section we will continue working with the Deng data.

deng <- readRDS("deng/deng-reads.rds")
cellLabels <- colData(deng)$cell_type2

This data can be QCed and normalized for library size using M3Drop, which removes cells with few detected genes, removes undetected genes, and converts raw counts to CPM.

deng_list <- M3DropCleanData(
    counts(deng),
    labels = cellLabels,
    min_detected_genes = 100,
    is.counts = TRUE
)
expr_matrix <- deng_list$data # Normalized & filtered expression matrix
celltype_labs <- factor(deng_list$labels) # filtered cell-type labels
cell_colors <- brewer.pal(max(3,length(unique(celltype_labs))), "Set3")

Exercise 1: How many cells & genes have been removed by this filtering?

4.3.1 Identifying Genes vs a Null Model

There are two main approaches to unsupervised feature selection. The first is to identify genes which behave differently from a null model describing just the technical noise expected in the dataset.

If the dataset contains spike-in RNAs they can be used to directly model technical noise. However, measurements of spike-ins may not experience the same technical noise as endogenous transcripts (Svensson et al., 2017). In addition, scRNASeq experiments often contain only a small number of spike-ins which reduces our confidence in fitted model parameters.

4.3.1.1 Highly Variable Genes

The first method proposed to identify features in scRNASeq datasets was to identify highly variable genes (HVG). HVG assumes that if genes have large differences in expression across cells some of those differences are due to biological difference between the cells rather than technical noise. However, because of the nature of count data, there is a positive relationship between the mean expression of a gene and the variance in the read counts across cells. This relationship must be corrected for to properly identify HVGs.

Exercise 2 Using the functions rowMeans and rowVars to plot the relationship between mean expression and variance for all genes in this dataset. (Hint: use log=“xy” to plot on a log-scale).

A popular method to correct for the relationship between variance and mean expression was proposed by Brennecke et al.. To use the Brennecke method, we first normalize for library size then calculate the mean and the square coefficient of variation (variation divided by the squared mean expression). A quadratic curve is fit to the relationship between these two variables for the ERCC spike-in, and then a chi-square test is used to find genes significantly above the curve. This method is included in the M3Drop package as the Brennecke_getVariableGenes(counts, spikes) function. However, this dataset does not contain spike-ins so we will use the entire dataset to estimate the technical noise.

In the figure below the red curve is the fitted technical noise model and the dashed line is the 95% CI. Pink dots are the genes with significant biological variability after multiple-testing correction.

Brennecke_HVG <- BrenneckeGetVariableGenes(
    expr_matrix,
    fdr = 0.01,
    minBiolDisp = 0.5
)

HVG_genes <- Brennecke_HVG$Gene

4.3.1.2 High Dropout Genes

An alternative to finding HVGs is to identify genes with unexpectedly high numbers of zeros. The frequency of zeros, know as the “dropout rate”, is very closely related to expression level in scRNASeq data. Zeros are the dominant feature of single-cell RNASeq data, typically accounting for over half of the entries in the final expression matrix. These zeros predominantly result from the failure of mRNAs failing to be reversed transcribed (Andrews and Hemberg, 2016). Reverse transcription is an enzyme reaction thus can be modelled using the Michaelis-Menten equation:

\[P_{dropout} = 1 - S/(K + S)\]

where \(S\) is the mRNA concentration in the cell (we will estimate this as average expression) and \(K\) is the Michaelis-Menten constant.

Because the Michaelis-Menten equation is a convex non-linear function, genes which are differentially expression across two or more populations of cells in our dataset will be shifted up/right of the Michaelis-Menten model (see Figure below).

K <- 49
S_sim <- 10^seq(from = -3, to = 4, by = 0.05) # range of expression values
MM <- 1 - S_sim / (K + S_sim)
plot(
    S_sim, 
    MM, 
    type = "l", 
    lwd = 3, 
    xlab = "Expression", 
    ylab = "Dropout Rate", 
    xlim = c(1,1000)
)
S1 <- 10
P1 <- 1 - S1 / (K + S1) # Expression & dropouts for cells in condition 1
S2 <- 750
P2 <- 1 - S2 / (K + S2) # Expression & dropouts for cells in condition 2
points(
    c(S1, S2),
    c(P1, P2), 
    pch = 16, 
    col = "grey85", 
    cex = 3
)
mix <- 0.5 # proportion of cells in condition 1
points(
    S1 * mix + S2 * (1 - mix), 
    P1 * mix + P2 * (1 - mix), 
    pch = 16, 
    col = "grey35", 
    cex = 3
)

Note: add log="x" to the plot call above to see how this looks on the log scale, which is used in M3Drop figures.

Exercise 3: Produce the same plot as above with different expression levels (S1 & S2) and/or mixtures (mix).

We use M3Drop to identify significant outliers to the right of the MM curve. We also apply 1% FDR multiple testing correction:

M3Drop_genes <- M3DropFeatureSelection(
    expr_matrix,
    mt_method = "fdr",
    mt_threshold = 0.01
)

M3Drop_genes <- M3Drop_genes$Gene

An alternative method is contained in the M3Drop package that is tailored specifically for UMI-tagged data which generally contains many zeros resulting from low sequencing coverage in addition to those resulting from insufficient reverse-transcription. This model is the Depth-Adjusted Negative Binomial (DANB). This method describes each expression observation as a negative binomial model with a mean related to both the mean expression of the respective gene and the sequencing depth of the respective cell, and a variance related to the mean-expression of the gene.

Unlike the Michaelis-Menten and HVG methods there isn’t a reliable statistical test for features selected by this model, so we will consider the top 1500 genes instead.

deng_int <- NBumiConvertToInteger(counts(deng))
DANB_fit <- NBumiFitModel(deng_int) # DANB is fit to the raw count matrix
# Perform DANB feature selection
DropFS <- NBumiFeatureSelectionCombinedDrop(DANB_fit)
DANB_genes <- names(DropFS[1:1500])

4.3.2 Correlated Expression

A completely different approach to feature selection is to use gene-gene correlations. This method is based on the idea that multiple genes will be differentially expressed between different cell-types or cell-states. Genes which are expressed in the same cell-population will be positively correlated with each other where as genes expressed in different cell-populations will be negatively correated with each other. Thus important genes can be identified by the magnitude of their correlation with other genes.

The limitation of this method is that it assumes technical noise is random and independent for each cell, thus shouldn’t produce gene-gene correlations, but this assumption is violated by batch effects which are generally systematic between different experimental batches and will produce gene-gene correlations. As a result it is more appropriate to take the top few thousand genes as ranked by gene-gene correlation than consider the significance of the correlations.

cor_mat <- cor(t(expr_matrix), method = "spearman") # Gene-gene correlations
diag(cor_mat) <- rep(0, times = nrow(expr_matrix))
score <- apply(cor_mat, 1, function(x) {max(abs(x))}) #Correlation of highest magnitude
names(score) <- rownames(expr_matrix);
score <- score[order(-score)]
Cor_genes <- names(score[1:1500])

Lastly, another common method for feature selection in scRNASeq data is to use PCA loadings. Genes with high PCA loadings are likely to be highly variable and correlated with many other variable genes, thus may be relevant to the underlying biology. However, as with gene-gene correlations PCA loadings tend to be susceptible to detecting systematic variation due to batch effects; thus it is recommended to plot the PCA results to determine those components corresponding to the biological variation rather than batch effects.

# PCA is typically performed on log-transformed expression data
pca <- prcomp(log(expr_matrix + 1) / log(2))

# plot projection
plot(
    pca$rotation[,1], 
    pca$rotation[,2], 
    pch = 16, 
    col = cell_colors[as.factor(celltype_labs)]
) 

# calculate loadings for components 1 and 2
score <- rowSums(abs(pca$x[,c(1,2)])) 
names(score) <- rownames(expr_matrix)
score <- score[order(-score)]
PCA_genes <- names(score[1:1500])

Exercise 4 Consider the top 5 principal components. Which appear to be most biologically relevant? How does the top 1,500 features change if you consider the loadings for those components?

4.3.3 Comparing Methods

We can check whether the identified features really do represent genes differentially expressed between cell-types in this dataset.

M3DropExpressionHeatmap(
    M3Drop_genes,
    expr_matrix,
    cell_labels = celltype_labs
)

We can also consider how consistent each feature selection method is with the others using the Jaccard Index:

J <- sum(M3Drop_genes %in% HVG_genes)/length(unique(c(M3Drop_genes, HVG_genes)))

Exercise 5

Plot the expression of the features for each of the other methods. Which appear to be differentially expressed? How consistent are the different methods for this dataset?

4.3.4 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [3] DelayedArray_0.4.1         Biobase_2.38.0            
##  [5] GenomicRanges_1.30.1       GenomeInfoDb_1.14.0       
##  [7] IRanges_2.12.0             S4Vectors_0.16.0          
##  [9] BiocGenerics_0.24.0        RColorBrewer_1.1-2        
## [11] M3Drop_3.05.00             numDeriv_2016.8-1         
## [13] matrixStats_0.52.2         scRNA.seq.funcs_0.1.0     
## [15] knitr_1.18                
## 
## loaded via a namespace (and not attached):
##  [1] splines_3.4.3          elliptic_1.3-7         gtools_3.5.0          
##  [4] Formula_1.2-2          moments_0.14           statmod_1.4.30        
##  [7] latticeExtra_0.6-28    GenomeInfoDbData_1.0.0 yaml_2.1.16           
## [10] pillar_1.1.0           backports_1.1.2        lattice_0.20-34       
## [13] bbmle_1.0.20           digest_0.6.14          XVector_0.18.0        
## [16] checkmate_1.8.5        colorspace_1.3-2       htmltools_0.3.6       
## [19] Matrix_1.2-7.1         plyr_1.8.4             bookdown_0.5          
## [22] zlibbioc_1.24.0        scales_0.5.0           gdata_2.18.0          
## [25] Rtsne_0.13             htmlTable_1.11.2       tibble_1.4.1          
## [28] mgcv_1.8-23            ggplot2_2.2.1          nnet_7.3-12           
## [31] lazyeval_0.2.1         survival_2.40-1        magrittr_1.5          
## [34] evaluate_0.10.1        nlme_3.1-129           MASS_7.3-45           
## [37] gplots_3.0.1           foreign_0.8-67         reldist_1.6-6         
## [40] tools_3.4.3            data.table_1.10.4-3    stringr_1.2.0         
## [43] munsell_0.4.3          cluster_2.0.6          irlba_2.3.2           
## [46] orthopolynom_1.0-5     compiler_3.4.3         caTools_1.17.1        
## [49] contfrac_1.1-11        rlang_0.1.6            grid_3.4.3            
## [52] RCurl_1.95-4.10        rstudioapi_0.7         htmlwidgets_1.0       
## [55] bitops_1.0-6           base64enc_0.1-3        rmarkdown_1.8         
## [58] hypergeo_1.2-13        gtable_0.2.0           deSolve_1.20          
## [61] gridExtra_2.3          Hmisc_4.1-1            rprojroot_1.3-2       
## [64] KernSmooth_2.23-15     stringi_1.1.6          Rcpp_0.12.15          
## [67] rpart_4.1-10           acepack_1.4.1

4.4 Pseudotime analysis

library(SingleCellExperiment)
library(TSCAN)
library(M3Drop)
library(monocle)
library(destiny)
library(SLICER)
set.seed(1)

In many situations, one is studying a process where cells change continuously. This includes, for example, many differentiation processes taking place during development: following a stimulus, cells will change from one cell-type to another. Ideally, we would like to monitor the expression levels of an individual cell over time. Unfortunately, such monitoring is not possible with scRNA-seq since the cell is lysed (destroyed) when the RNA is extracted.

Instead, we must sample at multiple time-points and obtain snapshots of the gene expression profiles. Since some of the cells will proceed faster along the differentiation than others, each snapshot may contain cells at varying points along the developmental progression. We use statistical methods to order the cells along one or more trajectories which represent the underlying developmental trajectories, this ordering is referred to as “pseudotime”.

In this chapter we will consider four different tools: Monocle, TSCAN, destiny and SLICER for ordering cells according to their pseudotime development. To illustrate the methods we will be using a dataset on mouse embryonic development (Deng et al. 2014). The dataset consists of 268 cells from 10 different time-points of early mouse development. In this case, there is no need for pseudotime alignment since the cell labels provide information about the development trajectory. Thus, the labels allow us to establish a ground truth so that we can evaluate and compare the different methods.

A recent review by Cannoodt et al provides a detailed summary of the various computational methods for trajectory inference from single-cell transcriptomics (Cannoodt, Saelens, and Saeys 2016). They discuss several tools, but unfortunately for our purposes many of these tools do not have complete or well-maintained implementations, and/or are not implemented in R.

Cannoodt et al cover:

  • SCUBA - Matlab implementation
  • Wanderlust - Matlab (and requires registration to even download)
  • Wishbone - Python
  • SLICER - R, but package only available on Github
  • SCOUP - C++ command line tool
  • Waterfall - R, but one R script in supplement
  • Mpath - R pkg, but available as tar.gz on Github; function documentation but no vignette/workflow
  • Monocle - Bioconductor package
  • TSCAN - Bioconductor package

Unfortunately only two tools discussed (Monocle and TSCAN) meet the gold standard of open-source software hosted in a reputable repository.

The following figures from the paper summarise some of the features of the various tools.

Descriptions of trajectory inference methods for single-cell transcriptomics data (Fig. 2 from Cannoodt et al, 2016).

Figure 4.10: Descriptions of trajectory inference methods for single-cell transcriptomics data (Fig. 2 from Cannoodt et al, 2016).

Characterization of trajectory inference methods for single-cell transcriptomics data (Fig. 3 from Cannoodt et al, 2016).

Figure 4.11: Characterization of trajectory inference methods for single-cell transcriptomics data (Fig. 3 from Cannoodt et al, 2016).

4.4.1 TSCAN

TSCAN combines clustering with pseudotime analysis. First it clusters the cells using mclust, which is based on a mixture of normal distributions. Then it builds a minimum spanning tree to connect the clusters. The branch of this tree that connects the largest number of clusters is the main branch which is used to determine pseudotime.

First we will try to use all genes to order the cells.

deng_SCE <- readRDS("deng/deng-reads.rds")
cellLabels <- colData(deng_SCE)$cell_type2
deng <- counts(deng_SCE)
colnames(deng) <- cellLabels
procdeng <- TSCAN::preprocess(deng)
colnames(procdeng) <- 1:ncol(deng)
dengclust <- TSCAN::exprmclust(procdeng, clusternum = 10)
TSCAN::plotmclust(dengclust)

dengorderTSCAN <- TSCAN::TSCANorder(dengclust, orderonly = F)
pseudotime_order_tscan <- as.character(dengorderTSCAN$sample_name)

We can also examine which timepoints have been assigned to each state:

cellLabels[dengclust$clusterid == 10]
##  [1] late2cell late2cell late2cell late2cell late2cell late2cell late2cell
##  [8] late2cell late2cell late2cell
## 10 Levels: 16cell 4cell 8cell early2cell earlyblast ... zy
colours <- rainbow(n = 10) # red = early, violet = late
tmp <-
    factor(
        cellLabels[as.numeric(pseudotime_order_tscan)],
        levels = c("early2cell", "mid2cell", "late2cell", "4cell", "8cell",
                   "16cell", "earlyblast", "midblast", "lateblast")
    )
plot(
    as.numeric(tmp),
    xlab = "Pseudotime Order",
    ylab = "Timepoint",
    col = colours[tmp],
    pch = 16
)

Exercise 1 Compare results for different numbers of clusters (clusternum).

4.4.2 monocle

Monocle skips the clustering stage of TSCAN and directly builds a minimum spanning tree on a reduced dimension representation of the cells to connect all cells. Monocle then identifies the longest path in this tree to determine pseudotime. If the data contains diverging trajectories (i.e. one cell type differentiates into two different cell-types), monocle can identify these. Each of the resulting forked paths is defined as a separate cell state.

Unfortunately, Monocle does not work when all the genes are used, so we must carry out feature selection. First, we use M3Drop:

m3dGenes <- as.character(
    M3DropFeatureSelection(deng)$Gene
)
## Warning in bg__calc_variables(expr_mat): Warning: Removing 1134 undetected
## genes.

d <- deng[which(rownames(deng) %in% m3dGenes), ]
d <- d[!duplicated(rownames(d)), ]

Now run monocle:

colnames(d) <- 1:ncol(d)
geneNames <- rownames(d)
rownames(d) <- 1:nrow(d)
pd <- data.frame(timepoint = cellLabels)
pd <- new("AnnotatedDataFrame", data=pd)
fd <- data.frame(gene_short_name = geneNames)
fd <- new("AnnotatedDataFrame", data=fd)

dCellData <- newCellDataSet(d, phenoData = pd, featureData = fd, expressionFamily = tobit())
dCellData <- setOrderingFilter(dCellData, which(geneNames %in% m3dGenes))
dCellData <- estimateSizeFactors(dCellData)
dCellDataSet <- reduceDimension(dCellData, pseudo_expr = 1)
dCellDataSet <- orderCells(dCellDataSet, reverse = FALSE)
plot_cell_trajectory(dCellDataSet)

# Store the ordering
pseudotime_monocle <-
    data.frame(
        Timepoint = phenoData(dCellDataSet)$timepoint,
        pseudotime = phenoData(dCellDataSet)$Pseudotime,
        State=phenoData(dCellDataSet)$State
    )
rownames(pseudotime_monocle) <- 1:ncol(d)
pseudotime_order_monocle <-
    rownames(pseudotime_monocle[order(pseudotime_monocle$pseudotime), ])

We can again compare the inferred pseudotime to the known sampling timepoints.

monocle_time_point <- factor(
     pseudotime_monocle$Timepoint,
     levels = c("early2cell", "mid2cell", "late2cell", "4cell", "8cell",
                   "16cell", "earlyblast", "midblast", "lateblast")
)

plot(
    pseudotime_monocle$pseudotime,
    monocle_time_point,
    xlab = "Pseudotime",
    ylab = "Timepoint",
    col = colours[monocle_time_point],
    pch = 16
)

4.4.3 Diffusion maps

Diffusion maps were introduced by Ronald Coifman and Stephane Lafon, and the underlying idea is to assume that the data are samples from a diffusion process. The method infers the low-dimensional manifold by estimating the eigenvalues and eigenvectors for the diffusion operator related to the data.

Haghverdi et al have applied the diffusion maps concept to the analysis of single-cell RNA-seq data to create an R package called destiny.

deng <- logcounts(deng_SCE)
colnames(deng) <- cellLabels
dm <- DiffusionMap(t(deng))
tmp <- factor(
    colnames(deng),
    levels = c(
        "early2cell",
        "mid2cell",
        "late2cell",
        "4cell",
        "8cell",
        "16cell",
        "earlyblast",
        "midblast",
        "lateblast"
    )
)
plot(
    eigenvectors(dm)[,1],
    eigenvectors(dm)[,2],
    xlab = "Diffusion component 1",
    ylab = "Diffusion component 2",
    col = colours[tmp],
    pch = 16
)

Like the other methods, destiny does a good job at ordering the early time-points, but it is unable to distinguish the later ones.

Exercise 2 Do you get a better resolution between the later time points by considering additional eigenvectors?

Exercise 3 How does the ordering change if you only use the genes identified by M3Drop?

4.4.4 SLICER

The SLICER method is an algorithm for constructing trajectories that describe gene expression changes during a sequential biological process, just as Monocle and TSCAN are. SLICER is designed to capture highly nonlinear gene expression changes, automatically select genes related to the process, and detect multiple branch and loop features in the trajectory (Welch, Hartemink, and Prins 2016). The SLICER R package is available from its GitHub repository and can be installed from there using the devtools package.

We use the select_genes function in SLICER to automatically select the genes to use in builing the cell trajectory. The function uses “neighbourhood variance” to identify genes that vary smoothly, rather than fluctuating randomly, across the set of cells. Following this, we determine which value of “k” (number of nearest neighbours) yields an embedding that most resembles a trajectory. Then we estimate the locally linear embedding of the cells.

require("lle")
slicer_genes <- select_genes(t(deng))
k <- select_k(t(deng[slicer_genes,]), kmin = 30, kmax=60)
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
slicer_traj_lle <- lle(t(deng[slicer_genes,]), m = 2, k)$Y
## finding neighbours
## calculating weights
## computing coordinates
plot(slicer_traj_lle, xlab = "LLE Comp 1", ylab = "LLE Comp 2",
     main = "Locally linear embedding of cells from SLICER", 
     col=colours[tmp], pch=16)

With the locally linear embedding computed we can construct a k-nearest neighbour graph that is fully connected. This plot displays a (yellow) circle for each cell, with the cell ID number overlaid in blue. Here we show the graph computed using 10 nearest neighbours. Here, SLICER appears to detect one major trajectory with one branch.

slicer_traj_graph <- conn_knn_graph(slicer_traj_lle, 10)
plot(slicer_traj_graph, main = "Fully connected kNN graph from SLICER")

From this graph we can identify “extreme” cells that are candidates for start/end cells in the trajectory.

ends <- find_extreme_cells(slicer_traj_graph, slicer_traj_lle)

start <- ends[1]

Having defined a start cell we can order the cells in the estimated pseudotime.

pseudotime_order_slicer <- cell_order(slicer_traj_graph, start)
branches <- assign_branches(slicer_traj_graph, start)

pseudotime_slicer <-
    data.frame(
        Timepoint = cellLabels,
        pseudotime = NA,
        State = branches
    )
pseudotime_slicer$pseudotime[pseudotime_order_slicer] <-
    1:length(pseudotime_order_slicer)

We can again compare the inferred pseudotime to the known sampling timepoints. SLICER does not provide a pseudotime value per se, just an ordering of cells.

slicer_time_point <- factor(
     pseudotime_slicer$Timepoint,
     levels = c("early2cell", "mid2cell", "late2cell", "4cell", "8cell",
                   "16cell", "earlyblast", "midblast", "lateblast")
)

plot(
    pseudotime_slicer$pseudotime,
    slicer_time_point,
    xlab = "Pseudotime",
    ylab = "Timepoint",
    col = colours[slicer_time_point],
    pch = 16
)

Like the previous method, SLICER here provides a good ordering for the early time points and struggles for later time points.

Exercise 4 How do the results change for different k? (e.g. k = 5) What about changing the number of nearest neighbours in the call to conn_knn_graph?

Exercise 5 How does the ordering change if you use a different set of genes from those chosen by SLICER (e.g. the genes identified by M3Drop)?

4.4.5 Comparison of the methods

How do the trajectories inferred by TSCAN and Monocle compare?

matched_ordering <-
    match(
        pseudotime_order_tscan,
        pseudotime_order_monocle
    )
timepoint_ordered <-
    monocle_time_point[order(pseudotime_monocle$pseudotime)]
plot(
    matched_ordering,
    xlab = "Monocle Order",
    ylab = "TSCAN Order",
    col = colours[timepoint_ordered],
    pch = 16
)

Exercise 6: Compare destiny and SLICER to TSCAN and Monocle.

4.4.6 Expression of genes through time

Each package also enables the visualization of expression through pseudotime. Following individual genes is very helpful for identifying genes that play an important role in the differentiation process. We illustrate the procedure using the Rhoa gene.

TSCAN

colnames(deng) <- 1:ncol(deng)
TSCAN::singlegeneplot(
    deng[rownames(deng) == "Rhoa", ],
    dengorderTSCAN
)

Monocle

monocle::plot_genes_in_pseudotime(
    dCellDataSet[fData(dCellDataSet)$gene == "Rhoa",],
    color_by = "timepoint"
)

Of course, pseudotime values computed with any method can be added to the colData slot of an SCE object. Having done that, the full plotting capabilities of the scater package can be used to investigate relationships between gene expression, cell populations and pseudotime. This would be particularly useful for the SLICER results, as SLICER does not provide plotting functions.

Exercise 7: Repeat the exercise using a subset of the genes, e.g. the set of highly variable genes that can be obtained using Brennecke_getVariableGenes()

4.4.7 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   parallel  stats4    methods   stats     graphics  grDevices
##  [8] utils     datasets  base     
## 
## other attached packages:
##  [1] lle_1.1                    snowfall_1.84-6.1         
##  [3] snow_0.4-2                 MASS_7.3-45               
##  [5] scatterplot3d_0.3-40       SLICER_0.2.0              
##  [7] destiny_2.6.1              monocle_2.6.1             
##  [9] DDRTree_0.1.5              irlba_2.3.2               
## [11] VGAM_1.0-4                 ggplot2_2.2.1             
## [13] Matrix_1.2-7.1             M3Drop_3.05.00            
## [15] numDeriv_2016.8-1          TSCAN_1.16.0              
## [17] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
## [19] DelayedArray_0.4.1         matrixStats_0.52.2        
## [21] Biobase_2.38.0             GenomicRanges_1.30.1      
## [23] GenomeInfoDb_1.14.0        IRanges_2.12.0            
## [25] S4Vectors_0.16.0           BiocGenerics_0.24.0       
## [27] knitr_1.18                
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.2        Hmisc_4.1-1            RcppEigen_0.3.3.3.1   
##   [4] plyr_1.8.4             igraph_1.1.2           lazyeval_0.2.1        
##   [7] sp_1.2-7               densityClust_0.3       fastICA_1.2-1         
##  [10] digest_0.6.14          htmltools_0.3.6        viridis_0.4.1         
##  [13] gdata_2.18.0           magrittr_1.5           checkmate_1.8.5       
##  [16] tensor_1.5             cluster_2.0.6          limma_3.34.5          
##  [19] tripack_1.3-8          R.utils_2.6.0          xts_0.10-1            
##  [22] colorspace_1.3-2       ggrepel_0.7.0          dplyr_0.7.4           
##  [25] RCurl_1.95-4.10        spatstat.data_1.2-0    spatstat_1.54-0       
##  [28] lme4_1.1-15            bindr_0.1              survival_2.40-1       
##  [31] zoo_1.8-1              glue_1.2.0             polyclip_1.6-1        
##  [34] gtable_0.2.0           zlibbioc_1.24.0        XVector_0.18.0        
##  [37] MatrixModels_0.4-1     car_2.1-6              DEoptimR_1.0-8        
##  [40] abind_1.4-5            sgeostat_1.0-27        reldist_1.6-6         
##  [43] SparseM_1.77           VIM_4.7.0              scales_0.5.0          
##  [46] pheatmap_1.0.8         Rcpp_0.12.15           viridisLite_0.2.0     
##  [49] xtable_1.8-2           laeken_0.4.6           htmlTable_1.11.2      
##  [52] foreign_0.8-67         proxy_0.4-21           mclust_5.4            
##  [55] Formula_1.2-2          vcd_1.4-4              htmlwidgets_1.0       
##  [58] FNN_1.1                gplots_3.0.1           RColorBrewer_1.1-2    
##  [61] acepack_1.4.1          R.methodsS3_1.7.1      pkgconfig_2.0.1       
##  [64] deldir_0.1-14          nnet_7.3-12            alphahull_2.1         
##  [67] labeling_0.3           rlang_0.1.6            reshape2_1.4.3        
##  [70] munsell_0.4.3          tools_3.4.3            splancs_2.01-40       
##  [73] evaluate_0.10.1        stringr_1.2.0          goftest_1.1-1         
##  [76] yaml_2.1.16            robustbase_0.92-8      caTools_1.17.1        
##  [79] RANN_2.5.1             bindrcpp_0.2           nlme_3.1-129          
##  [82] mime_0.5               quantreg_5.34          slam_0.1-42           
##  [85] R.oo_1.21.0            compiler_3.4.3         pbkrtest_0.4-7        
##  [88] rstudioapi_0.7         curl_3.1               e1071_1.6-8           
##  [91] spatstat.utils_1.8-0   smoother_1.1           tibble_1.4.1          
##  [94] statmod_1.4.30         stringi_1.1.6          highr_0.6             
##  [97] lattice_0.20-34        nloptr_1.0.4           HSMMSingleCell_0.112.0
## [100] pillar_1.1.0           combinat_0.0-8         lmtest_0.9-35         
## [103] data.table_1.10.4-3    bitops_1.0-6           httpuv_1.3.5          
## [106] R6_2.2.2               latticeExtra_0.6-28    bookdown_0.5          
## [109] KernSmooth_2.23-15     gridExtra_2.3          boot_1.3-18           
## [112] gtools_3.5.0           assertthat_0.2.0       rprojroot_1.3-2       
## [115] qlcMatrix_0.9.5        GenomeInfoDbData_1.0.0 mgcv_1.8-23           
## [118] grid_3.4.3             rpart_4.1-10           class_7.3-14          
## [121] minqa_1.2.4            rmarkdown_1.8          Rtsne_0.13            
## [124] TTR_0.23-2             bbmle_1.0.20           shiny_1.0.5           
## [127] base64enc_0.1-3

4.5 Imputation

library(scImpute)
library(SC3)
library(scater)
library(SingleCellExperiment)
library(mclust)
set.seed(1234567)

As discussed previously, one of the main challenges when analyzing scRNA-seq data is the presence of zeros, or dropouts. The dropouts are assumed to have arisen for three possible reasons:

  • The gene was not expressed in the cell and hence there are no transcripts to sequence
  • The gene was expressed, but for some reason the transcripts were lost somewhere prior to sequencing
  • The gene was expressed and transcripts were captured and turned into cDNA, but the sequencing depth was not sufficient to produce any reads.

Thus, dropouts could be result of experimental shortcomings, and if this is the case then we would like to provide computational corrections. One possible solution is to impute the dropouts in the expression matrix. To be able to impute gene expression values, one must have an underlying model. However, since we do not know which dropout events are technical artefacts and which correspond to the transcript being truly absent, imputation is a difficult challenge.

To the best of our knowledge, there are currently two different imputation methods available: MAGIC (Dijk et al. 2017) and scImpute (W. V. Li and Li 2017). MAGIC is only available for Python or Matlab, but we will run it from within R.

4.5.1 scImpute

To test scImpute, we use the default parameters and we apply it to the Deng dataset that we have worked with before. scImpute takes a .csv or .txt file as an input:

deng <- readRDS("deng/deng-reads.rds")
write.csv(counts(deng), "deng.csv")
scimpute(
    count_path = "deng.csv",
    infile = "csv",
    outfile = "txt", 
    out_dir = "./",
    Kcluster = 10,
    ncores = 2
)
## [1] "reading in raw count matrix ..."
## [1] "number of genes in raw count matrix 22431"
## [1] "number of cells in raw count matrix 268"
## [1] "inferring cell similarities ..."
## [1] "cluster sizes:"
##  [1] 12  9 26  5  9 57 58 43 17 22
## [1] "estimating dropout probability for type 1 ..."
## [1] "imputing dropout values for type 1 ..."
## [1] "estimating dropout probability for type 2 ..."
## [1] "imputing dropout values for type 2 ..."
## [1] "estimating dropout probability for type 3 ..."
## [1] "imputing dropout values for type 3 ..."
## [1] "estimating dropout probability for type 4 ..."
## [1] "imputing dropout values for type 4 ..."
## [1] "estimating dropout probability for type 5 ..."
## [1] "imputing dropout values for type 5 ..."
## [1] "estimating dropout probability for type 6 ..."
## [1] "imputing dropout values for type 6 ..."
## [1] "estimating dropout probability for type 7 ..."
## [1] "imputing dropout values for type 7 ..."
## [1] "estimating dropout probability for type 8 ..."
## [1] "imputing dropout values for type 8 ..."
## [1] "estimating dropout probability for type 9 ..."
## [1] "imputing dropout values for type 9 ..."
## [1] "estimating dropout probability for type 10 ..."
## [1] "imputing dropout values for type 10 ..."
## [1] "writing imputed count matrix ..."
##  [1]  17  18  88 111 126 177 186 229 244 247

Now we can compare the results with original data by considering a PCA plot

res <- read.table("scimpute_count.txt")
colnames(res) <- NULL
res <- SingleCellExperiment(
    assays = list(logcounts = log2(as.matrix(res) + 1)), 
    colData = colData(deng)
)
rowData(res)$feature_symbol <- rowData(deng)$feature_symbol
plotPCA(
    res, 
    colour_by = "cell_type2"
)

Compare this result to the original data in Chapter 4.2. What are the most significant differences?

To evaluate the impact of the imputation, we use SC3 to cluster the imputed matrix

res <- sc3_estimate_k(res)
metadata(res)$sc3$k_estimation
## [1] 6
res <- sc3(res, ks = 10, gene_filter = FALSE)
adjustedRandIndex(colData(deng)$cell_type2, colData(res)$sc3_10_clusters)
## [1] 0.4687332
plotPCA(
    res, 
    colour_by = "sc3_10_clusters"
)

Exercise: Based on the PCA and the clustering results, do you think that imputation using scImpute is a good idea for the Deng dataset?

4.5.2 MAGIC

system("python3 utils/MAGIC.py -d deng.csv -o MAGIC_count.csv --cell-axis columns -l 1 --pca-non-random csv")
res <- read.csv("MAGIC_count.csv", header = TRUE)
rownames(res) <- res[,1]
res <- res[,-1]
res <- t(res)
res <- SingleCellExperiment(
    assays = list(logcounts = res), 
    colData = colData(deng)
)
rowData(res)$feature_symbol <- rownames(res)
plotPCA(
    res, 
    colour_by = "cell_type2"
)

Compare this result to the original data in Chapter 4.2. What are the most significant differences?

To evaluate the impact of the imputation, we use SC3 to cluster the imputed matrix

res <- sc3_estimate_k(res)
metadata(res)$sc3$k_estimation
## [1] 4
res <- sc3(res, ks = 10, gene_filter = FALSE)
adjustedRandIndex(colData(deng)$cell_type2, colData(res)$sc3_10_clusters)
## [1] 0.3752866
plotPCA(
    res, 
    colour_by = "sc3_10_clusters"
)

Exercise: What is the difference between scImpute and MAGIC based on the PCA and clustering analysis? Which one do you think is best to use?

4.5.3 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    methods   parallel  stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] mclust_5.4                 scater_1.6.2              
##  [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [5] DelayedArray_0.4.1         matrixStats_0.52.2        
##  [7] GenomicRanges_1.30.1       GenomeInfoDb_1.14.0       
##  [9] IRanges_2.12.0             S4Vectors_0.16.0          
## [11] ggplot2_2.2.1              Biobase_2.38.0            
## [13] BiocGenerics_0.24.0        SC3_1.7.6                 
## [15] scImpute_0.0.5             doParallel_1.0.11         
## [17] iterators_1.0.9            foreach_1.4.4             
## [19] penalized_0.9-50           survival_2.40-1           
## [21] kernlab_0.9-25             knitr_1.18                
## 
## loaded via a namespace (and not attached):
##  [1] ggbeeswarm_0.6.0       colorspace_1.3-2       rjson_0.2.15          
##  [4] class_7.3-14           rprojroot_1.3-2        XVector_0.18.0        
##  [7] bit64_0.9-7            AnnotationDbi_1.40.0   mvtnorm_1.0-6         
## [10] codetools_0.2-15       splines_3.4.3          tximport_1.6.0        
## [13] robustbase_0.92-8      cluster_2.0.6          pheatmap_1.0.8        
## [16] shinydashboard_0.6.1   shiny_1.0.5            rrcov_1.4-3           
## [19] compiler_3.4.3         httr_1.3.1             backports_1.1.2       
## [22] assertthat_0.2.0       Matrix_1.2-7.1         lazyeval_0.2.1        
## [25] limma_3.34.5           htmltools_0.3.6        prettyunits_1.0.2     
## [28] tools_3.4.3            bindrcpp_0.2           gtable_0.2.0          
## [31] glue_1.2.0             GenomeInfoDbData_1.0.0 reshape2_1.4.3        
## [34] dplyr_0.7.4            doRNG_1.6.6            Rcpp_0.12.15          
## [37] gdata_2.18.0           stringr_1.2.0          mime_0.5              
## [40] rngtools_1.2.4         gtools_3.5.0           WriteXLS_4.0.0        
## [43] XML_3.98-1.9           edgeR_3.20.7           DEoptimR_1.0-8        
## [46] zlibbioc_1.24.0        scales_0.5.0           rhdf5_2.22.0          
## [49] RColorBrewer_1.1-2     yaml_2.1.16            memoise_1.1.0         
## [52] gridExtra_2.3          pkgmaker_0.22          biomaRt_2.34.2        
## [55] stringi_1.1.6          RSQLite_2.0            pcaPP_1.9-73          
## [58] e1071_1.6-8            caTools_1.17.1         rlang_0.1.6           
## [61] pkgconfig_2.0.1        bitops_1.0-6           evaluate_0.10.1       
## [64] lattice_0.20-34        ROCR_1.0-7             bindr_0.1             
## [67] labeling_0.3           cowplot_0.9.2          bit_1.1-12            
## [70] plyr_1.8.4             magrittr_1.5           bookdown_0.5          
## [73] R6_2.2.2               gplots_3.0.1           DBI_0.7               
## [76] pillar_1.1.0           RCurl_1.95-4.10        tibble_1.4.1          
## [79] KernSmooth_2.23-15     rmarkdown_1.8          viridis_0.4.1         
## [82] progress_1.1.2         locfit_1.5-9.1         grid_3.4.3            
## [85] data.table_1.10.4-3    blob_1.1.0             digest_0.6.14         
## [88] xtable_1.8-2           httpuv_1.3.5           munsell_0.4.3         
## [91] registry_0.5           beeswarm_0.2.3         viridisLite_0.2.0     
## [94] vipor_0.4.5

4.6 Differential Expression (DE) analysis

4.6.1 Bulk RNA-seq

One of the most common types of analyses when working with bulk RNA-seq data is to identify differentially expressed genes. By comparing the genes that change between two conditions, e.g. mutant and wild-type or stimulated and unstimulated, it is possible to characterize the molecular mechanisms underlying the change.

Several different methods, e.g. DESeq2 and edgeR, have been developed for bulk RNA-seq. Moreover, there are also extensive datasets available where the RNA-seq data has been validated using RT-qPCR. These data can be used to benchmark DE finding algorithms and the available evidence suggests that the algorithms are performing quite well.

4.6.2 Single cell RNA-seq

In contrast to bulk RNA-seq, in scRNA-seq we usually do not have a defined set of experimental conditions. Instead, as was shown in a previous chapter (4.2) we can identify the cell groups by using an unsupervised clustering approach. Once the groups have been identified one can find differentially expressed genes either by comparing the differences in variance between the groups (like the Kruskal-Wallis test implemented in SC3), or by comparing gene expression between clusters in a pairwise manner. In the following chapter we will mainly consider tools developed for pairwise comparisons.

4.6.3 Differences in Distribution

Unlike bulk RNA-seq, we generally have a large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups rather than only comparing estimates of mean-expression as is standard for bulk RNASeq.

There are two main approaches to comparing distributions. Firstly, we can use existing statistical models/distributions and fit the same type of model to the expression in each group then test for differences in the parameters for each model, or test whether the model fits better if a particular paramter is allowed to be different according to group. For instance in Chapter 3.17 we used edgeR to test whether allowing mean expression to be different in different batches significantly improved the fit of a negative binomial model of the data.

Alternatively, we can use a non-parametric test which does not assume that expression values follow any particular distribution, e.g. the Kolmogorov-Smirnov test (KS-test). Non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group are signficantly different from the distribution of ranks for the other group. However, some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data. Moreover, if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test.

4.6.4 Models of single-cell RNASeq data

The most common model of RNASeq data is the negative binomial model:

set.seed(1)
hist(
    rnbinom(
        1000, 
        mu = 10, 
        size = 100), 
    col = "grey50", 
    xlab = "Read Counts", 
    main = "Negative Binomial"
)
Negative Binomial distribution of read counts for a single gene across 1000 cells

Figure 4.12: Negative Binomial distribution of read counts for a single gene across 1000 cells

Mean: \(\mu = mu\)

Variance: \(\sigma^2 = mu + mu^2/size\)

It is parameterized by the mean expression (mu) and the dispersion (size), which is inversely related to the variance. The negative binomial model fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been show to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011).

However, a raw negative binomial model does not fit full-length transcript data as well due to the high dropout rates relative to the non-zero read counts. For this type of data a variety of zero-inflated negative binomial models have been proposed (e.g. MAST, SCDE).

d <- 0.5;
counts <- rnbinom(
    1000, 
    mu = 10, 
    size = 100
)
counts[runif(1000) < d] <- 0
hist(
    counts, 
    col = "grey50", 
    xlab = "Read Counts", 
    main = "Zero-inflated NB"
)
Zero-inflated Negative Binomial distribution

Figure 4.13: Zero-inflated Negative Binomial distribution

Mean: \(\mu = mu \cdot (1 - d)\)

Variance: \(\sigma^2 = \mu \cdot (1-d) \cdot (1 + d \cdot \mu + \mu / size)\)

These models introduce a new parameter \(d\), for the dropout rate, to the negative binomial model. As we saw in Chapter 19, the dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d and some may fit \(\mu\) and \(d\) to the expression of each gene independently.

Finally, several methods use a Poisson-Beta distribution which is based on a mechanistic model of transcriptional bursting. There is strong experimental support for this model (Kim and Marioni, 2013) and it provides a good fit to scRNA-seq data but it is less easy to use than the negative-binomial models and much less existing methods upon which to build than the negative binomial model.

a <- 0.1
b <- 0.1
g <- 100
lambdas <- rbeta(1000, a, b)
counts <- sapply(g*lambdas, function(l) {rpois(1, lambda = l)})
hist(
    counts, 
    col = "grey50", 
    xlab = "Read Counts", 
    main = "Poisson-Beta"
)

Mean: \(\mu = g \cdot a / (a + b)\)

Variance: \(\sigma^2 = g^2 \cdot a \cdot b/((a + b + 1) \cdot (a + b)^2)\)

This model uses three parameters: \(a\) the rate of activation of transcription; \(b\) the rate of inhibition of transcription; and \(g\) the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups or only one (often \(g\)).

All of these models may be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or library depth depending on the particular DE algorithm.

Exercise: Vary the parameters of each distribution to explore how they affect the distribution of gene expression. How similar are the Poisson-Beta and Negative Binomial models?

4.7 DE in a real dataset

library(scRNA.seq.funcs)
library(edgeR)
library(monocle)
library(MAST)
library(ROCR)
set.seed(1)

4.7.1 Introduction

To test different single-cell differential expression methods we will be using the Blischak dataset from Chapters 7-17. For this experiment bulk RNA-seq data for each cell-line was generated in addition to single-cell data. We will use the differentially expressed genes identified using standard methods on the respective bulk data as the ground truth for evaluating the accuracy of each single-cell method. To save time we have pre-computed these for you. You can run the commands below to load these data.

DE <- read.table("tung/TPs.txt")
notDE <- read.table("tung/TNs.txt")
GroundTruth <- list(
    DE = as.character(unlist(DE)), 
    notDE = as.character(unlist(notDE))
)

This ground truth has been produce for the comparison of individual NA19101 to NA19239. Now load the respective single-cell data:

molecules <- read.table("tung/molecules.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
keep <- anno[,1] == "NA19101" | anno[,1] == "NA19239"
data <- molecules[,keep]
group <- anno[keep,1]
batch <- anno[keep,4]
# remove genes that aren't expressed in at least 6 cells
gkeep <- rowSums(data > 0) > 5;
counts <- data[gkeep,]
# Library size normalization
lib_size = colSums(counts)
norm <- t(t(counts)/lib_size * median(lib_size)) 
# Variant of CPM for datasets with library sizes of fewer than 1 mil molecules

Now we will compare various single-cell DE methods. Note that we will only be running methods which are available as R-packages and run relatively quickly.

4.7.2 Kolmogorov-Smirnov test

The types of test that are easiest to work with are non-parametric ones. The most commonly used non-parametric test is the Kolmogorov-Smirnov test (KS-test) and we can use it to compare the distributions for each gene in the two individuals.

The KS-test quantifies the distance between the empirical cummulative distributions of the expression of each gene in each of the two populations. It is sensitive to changes in mean experession and changes in variability. However it assumes data is continuous and may perform poorly when data contains a large number of identical values (eg. zeros). Another issue with the KS-test is that it can be very sensitive for large sample sizes and thus it may end up as significant even though the magnitude of the difference is very small.

Illustration of the two-sample Kolmogorov–Smirnov statistic. Red and blue lines each correspond to an empirical distribution function, and the black arrow is the two-sample KS statistic. (taken from [here](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test))

Figure 4.14: Illustration of the two-sample Kolmogorov–Smirnov statistic. Red and blue lines each correspond to an empirical distribution function, and the black arrow is the two-sample KS statistic. (taken from here)

Now run the test:

pVals <- apply(
    norm, 1, function(x) {
        ks.test(
            x[group == "NA19101"], 
            x[group == "NA19239"]
        )$p.value
    }
)
# multiple testing correction
pVals <- p.adjust(pVals, method = "fdr")

This code “applies” the function to each row (specified by 1) of the expression matrix, data. In the function we are returning just the p.value from the ks.test output. We can now consider how many of the ground truth positive and negative DE genes are detected by the KS-test:

4.7.2.1 Evaluating Accuracy

sigDE <- names(pVals)[pVals < 0.05]
length(sigDE) 
## [1] 5095
# Number of KS-DE genes
sum(GroundTruth$DE %in% sigDE) 
## [1] 792
# Number of KS-DE genes that are true DE genes
sum(GroundTruth$notDE %in% sigDE)
## [1] 3190
# Number of KS-DE genes that are truly not-DE

As you can see many more of our ground truth negative genes were identified as DE by the KS-test (false positives) than ground truth positive genes (true positives), however this may be due to the larger number of notDE genes thus we typically normalize these counts as the True positive rate (TPR), TP/(TP + FN), and False positive rate (FPR), FP/(FP+TP).

tp <- sum(GroundTruth$DE %in% sigDE)
fp <- sum(GroundTruth$notDE %in% sigDE)
tn <- sum(GroundTruth$notDE %in% names(pVals)[pVals >= 0.05])
fn <- sum(GroundTruth$DE %in% names(pVals)[pVals >= 0.05])
tpr <- tp/(tp + fn)
fpr <- fp/(fp + tn)
cat(c(tpr, fpr))
## 0.7346939 0.2944706

Now we can see the TPR is much higher than the FPR indicating the KS test is identifying DE genes.

So far we’ve only evaluated the performance at a single significance threshold. Often it is informative to vary the threshold and evaluate performance across a range of values. This is then plotted as a receiver-operating-characteristic curve (ROC) and a general accuracy statistic can be calculated as the area under this curve (AUC). We will use the ROCR package to facilitate this plotting.

# Only consider genes for which we know the ground truth
pVals <- pVals[names(pVals) %in% GroundTruth$DE | 
               names(pVals) %in% GroundTruth$notDE] 
truth <- rep(1, times = length(pVals));
truth[names(pVals) %in% GroundTruth$DE] = 0;
pred <- ROCR::prediction(pVals, truth)
perf <- ROCR::performance(pred, "tpr", "fpr")
ROCR::plot(perf)
ROC curve for KS-test.

Figure 4.15: ROC curve for KS-test.

aucObj <- ROCR::performance(pred, "auc")
aucObj@y.values[[1]] # AUC
## [1] 0.7954796

Finally to facilitate the comparisons of other DE methods let’s put this code into a function so we don’t need to repeat it:

DE_Quality_AUC <- function(pVals) {
    pVals <- pVals[names(pVals) %in% GroundTruth$DE | 
                   names(pVals) %in% GroundTruth$notDE]
    truth <- rep(1, times = length(pVals));
    truth[names(pVals) %in% GroundTruth$DE] = 0;
    pred <- ROCR::prediction(pVals, truth)
    perf <- ROCR::performance(pred, "tpr", "fpr")
    ROCR::plot(perf)
    aucObj <- ROCR::performance(pred, "auc")
    return(aucObj@y.values[[1]])
}

4.7.3 Wilcox/Mann-Whitney-U Test

The Wilcox-rank-sum test is another non-parametric test, but tests specifically if values in one group are greater/less than the values in the other group. Thus it is often considered a test for difference in median expression between two groups; whereas the KS-test is sensitive to any change in distribution of expression values.

pVals <- apply(
    norm, 1, function(x) {
        wilcox.test(
            x[group == "NA19101"], 
            x[group == "NA19239"]
        )$p.value
    }
)
# multiple testing correction
pVals <- p.adjust(pVals, method = "fdr")
DE_Quality_AUC(pVals)
ROC curve for Wilcox test.

Figure 4.16: ROC curve for Wilcox test.

## [1] 0.8320326

4.7.4 edgeR

We’ve already used edgeR for differential expression in Chapter 3.17. edgeR is based on a negative binomial model of gene expression and uses a generalized linear model (GLM) framework, the enables us to include other factors such as batch to the model.

dge <- DGEList(
    counts = counts, 
    norm.factors = rep(1, length(counts[1,])), 
    group = group
)
group_edgeR <- factor(group)
design <- model.matrix(~ group_edgeR)
dge <- estimateDisp(dge, design = design, trend.method = "none")
fit <- glmFit(dge, design)
res <- glmLRT(fit)
pVals <- res$table[,4]
names(pVals) <- rownames(res$table)

pVals <- p.adjust(pVals, method = "fdr")
DE_Quality_AUC(pVals)
ROC curve for edgeR.

Figure 4.17: ROC curve for edgeR.

## [1] 0.8466764

4.7.5 Monocle

Monocle can use several different models for DE. For count data it recommends the Negative Binomial model (negbinomial.size). For normalized data it recommends log-transforming it then using a normal distribution (gaussianff). Similar to edgeR this method uses a GLM framework so in theory can account for batches, however in practice the model fails for this dataset if batches are included.

pd <- data.frame(group = group, batch = batch)
rownames(pd) <- colnames(counts)
pd <- new("AnnotatedDataFrame", data = pd)

Obj <- newCellDataSet(
    as.matrix(counts), 
    phenoData = pd, 
    expressionFamily = negbinomial.size()
)
Obj <- estimateSizeFactors(Obj)
Obj <- estimateDispersions(Obj)
res <- differentialGeneTest(Obj, fullModelFormulaStr = "~group")

pVals <- res[,3]
names(pVals) <- rownames(res)
pVals <- p.adjust(pVals, method = "fdr")
DE_Quality_AUC(pVals)
ROC curve for Monocle.

Figure 4.18: ROC curve for Monocle.

## [1] 0.8252662

Exercise: Compare the results using the negative binomial model on counts and those from using the normal/gaussian model (gaussianff()) on log-transformed normalized counts.

Answer:
ROC curve for Monocle-gaussian.

Figure 4.19: ROC curve for Monocle-gaussian.

## [1] 0.7357829

4.7.6 MAST

MAST is based on a zero-inflated negative binomial model. It tests for differential expression using a hurdle model to combine tests of discrete (0 vs not zero) and continuous (non-zero values) aspects of gene expression. Again this uses a linear modelling framework to enable complex models to be considered.

log_counts <- log(counts + 1) / log(2)
fData <- data.frame(names = rownames(log_counts))
rownames(fData) <- rownames(log_counts);
cData <- data.frame(cond = group)
rownames(cData) <- colnames(log_counts)

obj <- FromMatrix(as.matrix(log_counts), cData, fData)
colData(obj)$cngeneson <- scale(colSums(assay(obj) > 0))
cond <- factor(colData(obj)$cond)

# Model expression as function of condition & number of detected genes
zlmCond <- zlm.SingleCellAssay(~ cond + cngeneson, obj) 
## Warning: 'zlm.SingleCellAssay' is deprecated.
## Use 'zlm' instead.
## See help("Deprecated")
## Warning in .nextMethod(object = object, value = value): Coefficients
## condNA19239 are never estimible and will be dropped.
summaryCond <- summary(zlmCond, doLRT = "condNA19101")
summaryDt <- summaryCond$datatable

summaryDt <- as.data.frame(summaryDt)
pVals <- unlist(summaryDt[summaryDt$component == "H",4]) # H = hurdle model
names(pVals) <- unlist(summaryDt[summaryDt$component == "H",1])
pVals <- p.adjust(pVals, method = "fdr")
DE_Quality_AUC(pVals)
ROC curve for MAST.

Figure 4.20: ROC curve for MAST.

## [1] 0.8284046

4.7.7 Slow Methods (>1h to run)

These methods are too slow to run today but we encourage you to try them out on your own:

4.7.8 BPSC

BPSC uses the Poisson-Beta model of single-cell gene expression, which we discussed in the previous chapter, and combines it with generalized linear models which we’ve already encountered when using edgeR. BPSC performs comparisons of one or more groups to a reference group (“control”) and can include other factors such as batches in the model.

library(BPSC)
bpsc_data <- norm[,batch=="NA19101.r1" | batch=="NA19239.r1"]
bpsc_group = group[batch=="NA19101.r1" | batch=="NA19239.r1"]

control_cells <- which(bpsc_group == "NA19101")
design <- model.matrix(~bpsc_group)
coef=2 # group label
res=BPglm(data=bpsc_data, controlIds=control_cells, design=design, coef=coef, 
                estIntPar=FALSE, useParallel = FALSE)
pVals = res$PVAL
pVals <- p.adjust(pVals, method = "fdr")
DE_Quality_AUC(pVals)

4.7.9 SCDE

SCDE is the first single-cell specific DE method. It fits a zero-inflated negative binomial model to expression data using Bayesian statistics. The usage below tests for differences in mean expression of individual genes across groups but recent versions include methods to test for differences in mean expression or dispersion of groups of genes, usually representing a pathway.

library(scde)
cnts <- apply(
    counts,
    2,
    function(x) {
        storage.mode(x) <- 'integer'
        return(x)
    }
)
names(group) <- 1:length(group)
colnames(cnts) <- 1:length(group)
o.ifm <- scde::scde.error.models(
    counts = cnts,
    groups = group,
    n.cores = 1,
    threshold.segmentation = TRUE,
    save.crossfit.plots = FALSE,
    save.model.plots = FALSE,
    verbose = 0,
    min.size.entries = 2
)
priors <- scde::scde.expression.prior(
    models = o.ifm,
    counts = cnts,
    length.out = 400,
    show.plot = FALSE
)
resSCDE <- scde::scde.expression.difference(
    o.ifm,
    cnts,
    priors,
    groups = group,
    n.randomizations = 100,
    n.cores = 1,
    verbose = 0
)
# Convert Z-scores into 2-tailed p-values
pVals <- pnorm(abs(resSCDE$cZ), lower.tail = FALSE) * 2
pVals <- p.adjust(pVals, method = "fdr")
DE_Quality_AUC(pVals)

4.7.10 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   stats4    parallel  methods   stats     graphics  grDevices
##  [8] utils     datasets  base     
## 
## other attached packages:
##  [1] ROCR_1.0-7                 gplots_3.0.1              
##  [3] MAST_1.4.1                 SummarizedExperiment_1.8.1
##  [5] DelayedArray_0.4.1         matrixStats_0.52.2        
##  [7] GenomicRanges_1.30.1       GenomeInfoDb_1.14.0       
##  [9] IRanges_2.12.0             S4Vectors_0.16.0          
## [11] monocle_2.6.1              DDRTree_0.1.5             
## [13] irlba_2.3.2                VGAM_1.0-4                
## [15] ggplot2_2.2.1              Biobase_2.38.0            
## [17] BiocGenerics_0.24.0        Matrix_1.2-7.1            
## [19] edgeR_3.20.7               limma_3.34.5              
## [21] scRNA.seq.funcs_0.1.0      knitr_1.18                
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.4.1          viridisLite_0.2.0      gtools_3.5.0          
##  [4] elliptic_1.3-7         moments_0.14           assertthat_0.2.0      
##  [7] statmod_1.4.30         highr_0.6              GenomeInfoDbData_1.0.0
## [10] yaml_2.1.16            slam_0.1-42            ggrepel_0.7.0         
## [13] pillar_1.1.0           backports_1.1.2        lattice_0.20-34       
## [16] glue_1.2.0             densityClust_0.3       digest_0.6.14         
## [19] XVector_0.18.0         RColorBrewer_1.1-2     colorspace_1.3-2      
## [22] fastICA_1.2-1          htmltools_0.3.6        plyr_1.8.4            
## [25] pkgconfig_2.0.1        pheatmap_1.0.8         HSMMSingleCell_0.112.0
## [28] qlcMatrix_0.9.5        zlibbioc_1.24.0        bookdown_0.5          
## [31] scales_0.5.0           gdata_2.18.0           RANN_2.5.1            
## [34] Rtsne_0.13             tibble_1.4.1           combinat_0.0-8        
## [37] lazyeval_0.2.1         magrittr_1.5           evaluate_0.10.1       
## [40] MASS_7.3-45            FNN_1.1                data.table_1.10.4-3   
## [43] tools_3.4.3            stringr_1.2.0          munsell_0.4.3         
## [46] locfit_1.5-9.1         cluster_2.0.6          bindrcpp_0.2          
## [49] orthopolynom_1.0-5     compiler_3.4.3         caTools_1.17.1        
## [52] contfrac_1.1-11        rlang_0.1.6            grid_3.4.3            
## [55] RCurl_1.95-4.10        igraph_1.1.2           bitops_1.0-6          
## [58] rmarkdown_1.8          hypergeo_1.2-13        gtable_0.2.0          
## [61] abind_1.4-5            deSolve_1.20           reshape2_1.4.3        
## [64] R6_2.2.2               gridExtra_2.3          dplyr_0.7.4           
## [67] bindr_0.1              rprojroot_1.3-2        KernSmooth_2.23-15    
## [70] stringi_1.1.6          Rcpp_0.12.15

4.8 Projecting scRNA-seq data

library(scmap)
library(scater)
library(SingleCellExperiment)
set.seed(1234567)

As more and more scRNA-seq datasets become available, carrying out comparisons between them is key. A central application is to compare datasets of similar biological origin collected by different labs to ensure that the annotation and the analysis is consistent. Moreover, as very large references, e.g. the Human Cell Atlas (HCA) (Regev et al. 2017), become available, an important application will be to project cells from a new sample (e.g. from a disease tissue) onto the reference to characterize differences in composition, or to detect new cell-types. Conceptually, such projections are similar to the popular BLAST method (Altschul et al. 1990), which makes it possible to quickly find the closest match in a database for a newly identified nucleotide or amino acid sequence.

We recently developed scmap (Kiselev and Hemberg 2017) - a method for projecting cells from a scRNA-seq experiment onto the cell-types identified in other experiments. Additionally, a cloud version of scmap can be run for free, without restrictions, from http://www.hemberg-lab.cloud/scmap.

4.8.1 Datasets

We will run scmap on two human pancreas datasets: (Muraro et al. 2016) and (Segerstolpe et al. 2016). Since the pancreas has been widely studied, these datasets are well annotated. As usual cell type labels are stored in the cell_type1 column of the colData slots.

muraro <- readRDS("pancreas/muraro.rds")
segerstolpe <- readRDS("pancreas/segerstolpe.rds")

scmap can be used to perform both types of projections to either a single dataset or to a reference collection of datasets (Atlas):

scmap can be used to compare the annotations of two different samples by providing a one to one mapping between the cells.

Figure 2.5: scmap can be used to compare the annotations of two different samples by providing a one to one mapping between the cells.

 scmap can also be used to project cells from a new experiment onto an annotated reference.

Figure 2.6: scmap can also be used to project cells from a new experiment onto an annotated reference.

4.8.2 Feature Selection

Once we have a SingleCellExperiment object we can run scmap. Firstly, we need to select the most informative features from our input dataset:

muraro <- selectFeatures(muraro, suppress_plot = FALSE)
## Warning in linearModel(object, n_features): Your object does not contain
## counts() slot. Dropouts were calculated using logcounts() slot...

Features highlighted with the red colour will be used in the futher analysis (projection).

Features are stored in the scmap_features column of the rowData slot of the input object. By default scmap selects 500 features (it can also be controlled by setting n_features parameter):

table(rowData(muraro)$scmap_features)
## 
## FALSE  TRUE 
## 18627   500

4.8.3 scmap-cluster

The scmap-cluster index of a reference dataset is created by finding the median gene expression for each cluster. By default scmap uses the cell_type1 column of the colData slot in the reference to identify clusters. Other columns can be manually selected by adjusting cluster_col parameter:

muraro <- indexCluster(muraro)

The function indexCluster automatically writes the scmap_cluster_index item of the metadata slot of the reference dataset.

head(metadata(muraro)$scmap_cluster_index)
##           alpha   ductal endothelial    delta   acinar     beta   unclear
## A1CF   2.006376 0.000000    0.000000 1.001412 1.001412 1.001412 0.0000000
## ABCC3  0.000000 2.596810    0.000000 0.000000 2.331011 0.000000 3.2707107
## ABCC8  2.596810 0.000000    0.000000 3.017474 0.000000 4.649591 0.0000000
## ABLIM1 2.331011 2.006376    0.000000 1.588734 1.001412 2.006376 3.3451141
## ACTN1  1.588734 2.331011    2.006376 1.588734 3.017474 1.588734 0.7943671
## ADAM9  1.001412 3.190246    1.001412 0.000000 3.345114 1.001412 1.9117070
##           gamma mesenchymal  epsilon
## A1CF   1.001412    0.000000 2.331011
## ABCC3  0.000000    0.000000 0.000000
## ABCC8  3.485498    0.000000 2.006376
## ABLIM1 0.000000    0.000000 1.588734
## ACTN1  1.588734    4.711531 1.588734
## ADAM9  0.000000    2.331011 0.000000

One can also visualise the index:

heatmap(as.matrix(metadata(muraro)$scmap_cluster_index))

4.8.3.1 Projecting

Once the scmap-cluster index has been generated we can use it to project segerstolpe dataset to muraro dataset. This can be done with one index at a time, but scmap also allows for simultaneous projection to multiple indexes if they are provided as a list:

scmapCluster_results <- scmapCluster(
  projection = segerstolpe, 
  index_list = list(
    muraro = metadata(muraro)$scmap_cluster_index
  )
)
## Warning in setFeatures(projection, rownames(index)): Features C19orf77,
## CSDA, LOC100216479 are not present in the 'SCESet' object and therefore
## were not set.

4.8.4 Results

scmap-cluster projects the query dataset to all projections defined in the index_list. The results of cell label assignements are merged into one matrix:

head(scmapCluster_results$scmap_cluster_labs)
##      muraro      
## [1,] "unassigned"
## [2,] "unassigned"
## [3,] "alpha"     
## [4,] "delta"     
## [5,] "gamma"     
## [6,] "unassigned"

Corresponding similarities are stored in the scmap_cluster_siml item:

head(scmapCluster_results$scmap_cluster_siml)
##         muraro
## [1,] 0.4452916
## [2,] 0.6571171
## [3,] 0.8016826
## [4,] 0.7209766
## [5,] 0.7448156
## [6,] 0.4971898

scmap also provides combined results of all reference dataset (choose labels corresponding to the largest similarity across reference datasets):

head(scmapCluster_results$combined_labs)
## [1] "unassigned" "unassigned" "alpha"      "delta"      "gamma"     
## [6] "unassigned"

Clearly the projection is almost perfect. With scmap one can also plot a Sankey diagram (however, cell_type1 columns have to be provided in the colData slots of both the reference and the projection datasets):

plot(
  getSankey(
    colData(segerstolpe)$cell_type1, 
    scmapCluster_results$scmap_cluster_labs[,'muraro'],
    plot_height = 400
  )
)

4.8.5 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] bindrcpp_0.2               scater_1.6.2              
##  [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [5] DelayedArray_0.4.1         matrixStats_0.52.2        
##  [7] GenomicRanges_1.30.1       GenomeInfoDb_1.14.0       
##  [9] IRanges_2.12.0             S4Vectors_0.16.0          
## [11] ggplot2_2.2.1              Biobase_2.38.0            
## [13] BiocGenerics_0.24.0        scmap_1.1.5               
## [15] googleVis_0.6.2            knitr_1.18                
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.4.1          httr_1.3.1             edgeR_3.20.7          
##  [4] bit64_0.9-7            jsonlite_1.5           viridisLite_0.2.0     
##  [7] shiny_1.0.5            assertthat_0.2.0       highr_0.6             
## [10] blob_1.1.0             vipor_0.4.5            GenomeInfoDbData_1.0.0
## [13] yaml_2.1.16            progress_1.1.2         pillar_1.1.0          
## [16] RSQLite_2.0            backports_1.1.2        lattice_0.20-34       
## [19] limma_3.34.5           glue_1.2.0             digest_0.6.14         
## [22] XVector_0.18.0         randomForest_4.6-12    colorspace_1.3-2      
## [25] htmltools_0.3.6        httpuv_1.3.5           Matrix_1.2-7.1        
## [28] plyr_1.8.4             XML_3.98-1.9           pkgconfig_2.0.1       
## [31] biomaRt_2.34.2         bookdown_0.5           zlibbioc_1.24.0       
## [34] xtable_1.8-2           scales_0.5.0           tibble_1.4.1          
## [37] proxy_0.4-21           lazyeval_0.2.1         magrittr_1.5          
## [40] mime_0.5               memoise_1.1.0          evaluate_0.10.1       
## [43] class_7.3-14           beeswarm_0.2.3         shinydashboard_0.6.1  
## [46] data.table_1.10.4-3    tools_3.4.3            prettyunits_1.0.2     
## [49] stringr_1.2.0          locfit_1.5-9.1         munsell_0.4.3         
## [52] AnnotationDbi_1.40.0   compiler_3.4.3         e1071_1.6-8           
## [55] rlang_0.1.6            rhdf5_2.22.0           grid_3.4.3            
## [58] RCurl_1.95-4.10        tximport_1.6.0         rjson_0.2.15          
## [61] labeling_0.3           bitops_1.0-6           rmarkdown_1.8         
## [64] gtable_0.2.0           codetools_0.2-15       DBI_0.7               
## [67] reshape2_1.4.3         R6_2.2.2               gridExtra_2.3         
## [70] dplyr_0.7.4            bit_1.1-12             bindr_0.1             
## [73] rprojroot_1.3-2        ggbeeswarm_0.6.0       stringi_1.1.6         
## [76] Rcpp_0.12.15

4.9 Search scRNA-Seq data

library(scfind)
library(SingleCellExperiment)
set.seed(1234567)

4.9.1 About

scfind is a tool that allows one to search single cell RNA-Seq collections (Atlas) using lists of genes, e.g. searching for cells and cell-types where a specific set of genes are expressed. scfind is a Bioconductor package. Cloud implementation of scfind with a large collection of datasets is available on our website.

scfind can be used to search large collection of scRNA-seq data by a list of gene IDs.

Figure 2.4: scfind can be used to search large collection of scRNA-seq data by a list of gene IDs.

4.9.2 Dataset

We will run scfind on the same human pancreas dataset as in the previous chapter. scfind also operates on SingleCellExperiment class:

muraro <- readRDS("pancreas/muraro.rds")

4.9.3 Gene Index

Now we need to create a gene index using our dataset:

cellIndex <- buildCellIndex(muraro)

The gene index contains for each gene indexes of the cells where it is expressed. This is similar to sparsification of the expression matrix. In addition to this the index is also compressed in a way that it can accessed very quickly. We estimated that one can achieve 5-10 compression factor with this method.

By default the cell_type1 column of the colData slot of the SingleCellExperiment object is used to define cell types, however it can also be defined manually using the cell_type_column argument of the buildCellTypeIndex function (check ?buildCellTypeIndex).

4.9.4 Marker genes

Now let’s define lists of cell type specific marker genes. We will use the marker genes identified in the original publication, namely in Figure 1:

# these genes are taken from fig. 1
muraro_alpha <- c("GCG", "LOXL4", "PLCE1", "IRX2", "GC", "KLHL41", 
                  "CRYBA2", "TTR", "TM4SF4", "RGS4")
muraro_beta <- c("INS", "IAPP", "MAFA", "NPTX2", "DLK1", "ADCYAP1", 
                 "PFKFB2", "PDX1", "TGFBR3", "SYT13")
muraro_gamma <- c("PPY", "SERTM1", "CARTPT", "SLITRK6", "ETV1", 
                  "THSD7A", "AQP3", "ENTPD2", "PTGFR", "CHN2")
muraro_delta <- c("SST", "PRG4", "LEPR", "RBP4", "BCHE", "HHEX", 
                  "FRZB", "PCSK1", "RGS2", "GABRG2")

4.9.5 Search cells by a gene list

findCell function returns a list of p-values corresponding to all cell types in a given dataset. It also outputs a list of cells in which genes from the given gene list are co-expressed. We will run it on all lists of marker genes defined above:

res <- findCell(cellIndex, muraro_alpha)
barplot(-log10(res$p_values), ylab = "-log10(pval)", las = 2)

head(res$common_exprs_cells)
##   cell_id cell_type
## 1       1     alpha
## 2       3     alpha
## 3       7     alpha
## 4       9     alpha
## 5      15     alpha
## 6      20     alpha

Exercise 1

Perform a search by beta, delta and gamma gene lists and explore the results.

##   cell_id cell_type
## 1      71     alpha
## 2      72      beta
## 3      92      beta
## 4      96      beta
## 5      98      beta
## 6     102      beta

##   cell_id cell_type
## 1      40     delta
## 2     212     delta
## 3     225     delta
## 4     253     delta
## 5     330     delta
## 6     400     delta

##   cell_id cell_type
## 1      53     alpha
## 2     102      beta
## 3     255     gamma
## 4     305     gamma
## 5     525     gamma
## 6     662     gamma

Exercise 2

Load the segerstolpe and search it using alpha, beta, delta and gamma gene lists identified in muraro dataset.

##   cell_id cell_type
## 1      18     alpha
## 2      20     alpha
## 3      24     alpha
## 4      32     alpha
## 5      43     alpha
## 6      48     alpha

##   cell_id     cell_type
## 1      15 co-expression
## 2      58          beta
## 3     300          beta
## 4     390 co-expression
## 5     504 co-expression
## 6     506          beta

##   cell_id     cell_type
## 1     170         delta
## 2     715         delta
## 3    1039 co-expression
## 4    1133         delta
## 5    1719         delta
## 6    1721         delta

##   cell_id cell_type
## 1      47     gamma
## 2     458     gamma
## 3     476     gamma
## 4     600     gamma
## 5     606     gamma
## 6     622     gamma

4.9.6 sessionInfo()

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [3] DelayedArray_0.4.1         matrixStats_0.52.2        
##  [5] Biobase_2.38.0             GenomicRanges_1.30.1      
##  [7] GenomeInfoDb_1.14.0        IRanges_2.12.0            
##  [9] S4Vectors_0.16.0           BiocGenerics_0.24.0       
## [11] scfind_1.0.0               knitr_1.18                
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15           highr_0.6              plyr_1.8.4            
##  [4] pillar_1.1.0           compiler_3.4.3         XVector_0.18.0        
##  [7] bindr_0.1              bitops_1.0-6           tools_3.4.3           
## [10] zlibbioc_1.24.0        digest_0.6.14          bit_1.1-12            
## [13] tibble_1.4.1           evaluate_0.10.1        lattice_0.20-34       
## [16] pkgconfig_2.0.1        rlang_0.1.6            Matrix_1.2-7.1        
## [19] yaml_2.1.16            bindrcpp_0.2           GenomeInfoDbData_1.0.0
## [22] stringr_1.2.0          dplyr_0.7.4            rprojroot_1.3-2       
## [25] grid_3.4.3             glue_1.2.0             R6_2.2.2              
## [28] hash_2.2.6             rmarkdown_1.8          bookdown_0.5          
## [31] reshape2_1.4.3         magrittr_1.5           backports_1.1.2       
## [34] htmltools_0.3.6        assertthat_0.2.0       stringi_1.1.6         
## [37] RCurl_1.95-4.10
set.seed(1234567)

References

Guo, Minzhe, Hui Wang, S. Steven Potter, Jeffrey A. Whitsett, and Yan Xu. 2015. “SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis.” PLoS Comput Biol 11 (11). Public Library of Science (PLoS): e1004575. doi:10.1371/journal.pcbi.1004575.

žurauskienė, Justina, and Christopher Yau. 2016. “pcaReduce: Hierarchical Clustering of Single Cell Transcriptional Profiles.” BMC Bioinformatics 17 (1). Springer Nature. doi:10.1186/s12859-016-0984-y.

Kiselev, Vladimir Yu, Kristina Kirschner, Michael T Schaub, Tallulah Andrews, Andrew Yiu, Tamir Chandra, Kedar N Natarajan, et al. 2017. “SC3: Consensus Clustering of Single-Cell RNA-Seq Data.” Nat Meth 14 (5). Springer Nature: 483–86. doi:10.1038/nmeth.4236.

Xu, Chen, and Zhengchang Su. 2015. “Identification of Cell Types from Single-Cell Transcriptomes Using a Novel Clustering Method.” Bioinformatics 31 (12). Oxford University Press (OUP): 1974–80. doi:10.1093/bioinformatics/btv088.

Levine, Jacob H., Erin F. Simonds, Sean C. Bendall, Kara L. Davis, El-ad D. Amir, Michelle D. Tadmor, Oren Litvin, et al. 2015. “Data-Driven Phenotypic Dissection of AML Reveals Progenitor-Like Cells That Correlate with Prognosis.” Cell 162 (1). Elsevier BV: 184–97. doi:10.1016/j.cell.2015.05.047.

Deng, Q., D. Ramskold, B. Reinius, and R. Sandberg. 2014. “Single-Cell RNA-Seq Reveals Dynamic, Random Monoallelic Gene Expression in Mammalian Cells.” Science 343 (6167). American Association for the Advancement of Science (AAAS): 193–96. doi:10.1126/science.1245316.

Cannoodt, Robrecht, Wouter Saelens, and Yvan Saeys. 2016. “Computational Methods for Trajectory Inference from Single-Cell Transcriptomics.” Eur. J. Immunol. 46 (11). Wiley-Blackwell: 2496–2506. doi:10.1002/eji.201646347.

Welch, Joshua D., Alexander J. Hartemink, and Jan F. Prins. 2016. “SLICER: Inferring Branched, Nonlinear Cellular Trajectories from Single Cell RNA-Seq Data.” Genome Biol 17 (1). Springer Nature. doi:10.1186/s13059-016-0975-3.

Dijk, David van, Juozas Nainys, Roshan Sharma, Pooja Kathail, Ambrose J Carr, Kevin R Moon, Linas Mazutis, Guy Wolf, Smita Krishnaswamy, and Dana Pe’er. 2017. “MAGIC: A Diffusion-Based Imputation Method Reveals Gene-Gene Interactions in Single-Cell RNA-sequencing Data.” bioRxiv, February, 111591.

Li, Wei Vivian, and Jingyi Jessica Li. 2017. “scImpute: Accurate and Robust Imputation for Single Cell RNA-Seq Data.” bioRxiv, June, 141598.

Regev, Aviv, Sarah Teichmann, Eric S Lander, Ido Amit, Christophe Benoist, Ewan Birney, Bernd Bodenmiller, et al. 2017. “The Human Cell Atlas.” bioRxiv, May, 121202.

Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (3). Elsevier BV: 403–10. doi:10.1016/s0022-2836(05)80360-2.

Kiselev, Vladimir Yu, and Martin Hemberg. 2017. “Scmap - a Tool for Unsupervised Projection of Single Cell RNA-seq Data.” bioRxiv, July, 150292.

Muraro, Mauro J., Gitanjali Dharmadhikari, Dominic Grün, Nathalie Groen, Tim Dielen, Erik Jansen, Leon van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Systems 3 (4). Elsevier BV: 385–394.e3. doi:10.1016/j.cels.2016.09.002.

Segerstolpe, Åsa, Athanasia Palasantza, Pernilla Eliasson, Eva-Marie Andersson, Anne-Christine Andréasson, Xiaoyan Sun, Simone Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metabolism 24 (4). Elsevier BV: 593–607. doi:10.1016/j.cmet.2016.08.020.