7 Cleaning the Expression Matrix

7.1 Expression QC (UMI)

7.1.1 Introduction

Once gene expression has been quantified it is summarized as an expression matrix where each row corresponds to a gene (or transcript) and each column corresponds to a single cell. This matrix should be examined to remove poor quality cells which were not detected in either read QC or mapping QC steps. Failure to remove low quality cells at this stage may add technical noise which has the potential to obscure the biological signals of interest in the downstream analysis.

Since there is currently no standard method for performing scRNASeq the expected values for the various QC measures that will be presented here can vary substantially from experiment to experiment. Thus, to perform QC we will be looking for cells which are outliers with respect to the rest of the dataset rather than comparing to independent quality standards. Consequently, care should be taken when comparing quality metrics across datasets collected using different protocols.

7.1.2 Tung dataset

To illustrate cell QC, we consider a dataset of induced pluripotent stem cells generated from three different individuals (Tung et al. 2017) in Yoav Gilad’s lab at the University of Chicago. The experiments were carried out on the Fluidigm C1 platform and to facilitate the quantification both unique molecular identifiers (UMIs) and ERCC spike-ins were used. The data files are located in the tung folder in your working directory. These files are the copies of the original files made on the 15/03/16. We will use these copies for reproducibility purposes.

library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)

Load the data and annotations:

molecules <- read.table("tung/molecules.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)

Inspect a small portion of the expression matrix

head(molecules[ , 1:3])
##                 NA19098.r1.A01 NA19098.r1.A02 NA19098.r1.A03
## ENSG00000237683              0              0              0
## ENSG00000187634              0              0              0
## ENSG00000188976              3              6              1
## ENSG00000187961              0              0              0
## ENSG00000187583              0              0              0
## ENSG00000187642              0              0              0
head(anno)
##   individual replicate well      batch      sample_id
## 1    NA19098        r1  A01 NA19098.r1 NA19098.r1.A01
## 2    NA19098        r1  A02 NA19098.r1 NA19098.r1.A02
## 3    NA19098        r1  A03 NA19098.r1 NA19098.r1.A03
## 4    NA19098        r1  A04 NA19098.r1 NA19098.r1.A04
## 5    NA19098        r1  A05 NA19098.r1 NA19098.r1.A05
## 6    NA19098        r1  A06 NA19098.r1 NA19098.r1.A06

The data consists of 3 individuals and 3 replicates and therefore has 9 batches in total.

We standardize the analysis by using both SingleCellExperiment (SCE) and scater packages. First, create the SCE object:

umi <- SingleCellExperiment(
    assays = list(counts = as.matrix(molecules)), 
    colData = anno
)

Remove genes that are not expressed in any cell:

keep_feature <- rowSums(counts(umi) > 0) > 0
umi <- umi[keep_feature, ]

Define control features (genes) - ERCC spike-ins and mitochondrial genes (provided by the authors):

isSpike(umi, "ERCC") <- grepl("^ERCC-", rownames(umi))
isSpike(umi, "MT") <- rownames(umi) %in% 
    c("ENSG00000198899", "ENSG00000198727", "ENSG00000198888",
    "ENSG00000198886", "ENSG00000212907", "ENSG00000198786",
    "ENSG00000198695", "ENSG00000198712", "ENSG00000198804",
    "ENSG00000198763", "ENSG00000228253", "ENSG00000198938",
    "ENSG00000198840")

Calculate the quality metrics:

umi <- calculateQCMetrics(
    umi,
    feature_controls = list(
        ERCC = isSpike(umi, "ERCC"), 
        MT = isSpike(umi, "MT")
    )
)

7.1.3 Cell QC

7.1.3.1 Library size

Next we consider the total number of RNA molecules detected per sample (if we were using read counts rather than UMI counts this would be the total number of reads). Wells with few reads/molecules are likely to have been broken or failed to capture a cell, and should thus be removed.

hist(
    umi$total_counts,
    breaks = 100
)
abline(v = 25000, col = "red")
Histogram of library sizes for all cells

Figure 7.1: Histogram of library sizes for all cells

Exercise 1

  1. How many cells does our filter remove?

  2. What distribution do you expect that the total number of molecules for each cell should follow?

Our answer

## filter_by_total_counts
## FALSE  TRUE 
##    46   818

7.1.3.2 Detected genes

In addition to ensuring sufficient sequencing depth for each sample, we also want to make sure that the reads are distributed across the transcriptome. Thus, we count the total number of unique genes detected in each sample.

hist(
    umi$total_features,
    breaks = 100
)
abline(v = 7000, col = "red")
Histogram of the number of detected genes in all cells

Figure 7.2: Histogram of the number of detected genes in all cells

From the plot we conclude that most cells have between 7,000-10,000 detected genes, which is normal for high-depth scRNA-seq. However, this varies by experimental protocol and sequencing depth. For example, droplet-based methods or samples with lower sequencing-depth typically detect fewer genes per cell. The most notable feature in the above plot is the “heavy tail” on the left hand side of the distribution. If detection rates were equal across the cells then the distribution should be approximately normal. Thus we remove those cells in the tail of the distribution (fewer than 7,000 detected genes).

Exercise 2

How many cells does our filter remove?

Our answer

## filter_by_expr_features
## FALSE  TRUE 
##   116   748

7.1.3.3 ERCCs and MTs

Another measure of cell quality is the ratio between ERCC spike-in RNAs and endogenous RNAs. This ratio can be used to estimate the total amount of RNA in the captured cells. Cells with a high level of spike-in RNAs had low starting amounts of RNA, likely due to the cell being dead or stressed which may result in the RNA being degraded.

plotPhenoData(
    umi,
    aes_string(
        x = "total_features",
        y = "pct_counts_MT",
        colour = "batch"
    )
)
Percentage of counts in MT genes

Figure 7.3: Percentage of counts in MT genes

plotPhenoData(
    umi,
    aes_string(
        x = "total_features",
        y = "pct_counts_ERCC",
        colour = "batch"
    )
)
Percentage of counts in ERCCs

Figure 7.4: Percentage of counts in ERCCs

The above analysis shows that majority of the cells from NA19098.r2 batch have a very high ERCC/Endo ratio. Indeed, it has been shown by the authors that this batch contains cells of smaller size.

Exercise 3

Create filters for removing batch NA19098.r2 and cells with high expression of mitochondrial genes (>10% of total counts in a cell).

Our answer

## filter_by_ERCC
## FALSE  TRUE 
##    96   768
## filter_by_MT
## FALSE  TRUE 
##    31   833

Exercise 4

What would you expect to see in the ERCC vs counts plot if you were examining a dataset containing cells of different sizes (eg. normal & senescent cells)?

Answer

You would expect to see a group corresponding to the smaller cells (normal) with a higher fraction of ERCC reads than a separate group corresponding to the larger cells (senescent).

7.1.4 Cell filtering

7.1.4.1 Manual

Now we can define a cell filter based on our previous analysis:

umi$use <- (
    # sufficient features (genes)
    filter_by_expr_features &
    # sufficient molecules counted
    filter_by_total_counts &
    # sufficient endogenous RNA
    filter_by_ERCC &
    # remove cells with unusual number of reads in MT genes
    filter_by_MT
)
table(umi$use)
## 
## FALSE  TRUE 
##   207   657

7.1.4.2 Automatic

Another option available in scater is to conduct PCA on a set of QC metrics and then use automatic outlier detection to identify potentially problematic cells.

By default, the following metrics are used for PCA-based outlier detection:

  • pct_counts_top_100_features
  • total_features
  • pct_counts_feature_controls
  • n_detected_feature_controls
  • log10_counts_endogenous_features
  • log10_counts_feature_controls

scater first creates a matrix where the rows represent cells and the columns represent the different QC metrics. Here, the PCA plot provides a 2D representation of cells ordered by their quality metrics. The outliers are then detected using methods from the mvoutlier package.

umi <- plotPCA(
    umi,
    size_by = "total_features", 
    shape_by = "use",
    pca_data_input = "pdata",
    detect_outliers = TRUE,
    return_SCE = TRUE
)
PCA plot used for automatic detection of cell outliers

Figure 7.5: PCA plot used for automatic detection of cell outliers

table(umi$outlier)
## 
## FALSE  TRUE 
##   819    45

7.1.5 Compare filterings

Exercise 5

Compare the default, automatic and manual cell filters. Plot a Venn diagram of the outlier cells from these filterings.

Hint: Use vennCounts and vennDiagram functions from the limma package to make a Venn diagram.

Answer

## 
## Attaching package: 'limma'
## The following object is masked from 'package:scater':
## 
##     plotMDS
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
Comparison of the default, automatic and manual cell filters

Figure 7.6: Comparison of the default, automatic and manual cell filters

7.1.6 Gene analysis

7.1.6.1 Gene expression

In addition to removing cells with poor quality, it is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. Moreover, inspection of the gene expression profiles may provide insights about how the experimental procedures could be improved.

It is often instructive to consider the number of reads consumed by the top 50 expressed genes.

plotQC(umi, type = "highest-expression")
Number of total counts consumed by the top 50 expressed genes

Figure 7.7: Number of total counts consumed by the top 50 expressed genes

The distributions are relatively flat indicating (but not guaranteeing!) good coverage of the full transcriptome of these cells. However, there are several spike-ins in the top 15 genes which suggests a greater dilution of the spike-ins may be preferrable if the experiment is to be repeated.

7.1.6.2 Gene filtering

It is typically a good idea to remove genes whose expression level is considered “undetectable”. We define a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering read counts rather than UMI counts a reasonable threshold is to require at least five reads in at least two cells. However, in both cases the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells (note colData(umi)$use filter applied to the umi dataset).

filter_genes <- apply(
    counts(umi[ , colData(umi)$use]), 
    1, 
    function(x) length(x[x > 1]) >= 2
)
rowData(umi)$use <- filter_genes
table(filter_genes)
## filter_genes
## FALSE  TRUE 
##  4660 14066

Depending on the cell-type, protocol and sequencing depth, other cut-offs may be appropriate.

7.1.7 Save the data

Dimensions of the QCed dataset (do not forget about the gene filter we defined above):

dim(umi[rowData(umi)$use, colData(umi)$use])
## [1] 14066   657

Let’s create an additional slot with log-transformed counts (we will need it in the next chapters) and remove saved PCA results from the reducedDim slot:

assay(umi, "logcounts_raw") <- log2(counts(umi) + 1)
reducedDim(umi) <- NULL

Save the data:

saveRDS(umi, file = "tung/umi.rds")

7.1.8 Big Exercise

Perform exactly the same QC analysis with read counts of the same Blischak data. Use tung/reads.txt file to load the reads. Once you have finished please compare your results to ours (next chapter).

7.1.9 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] limma_3.34.9               scater_1.6.3              
##  [3] ggplot2_2.2.1              SingleCellExperiment_1.0.0
##  [5] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
##  [7] matrixStats_0.53.1         Biobase_2.38.0            
##  [9] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
## [11] IRanges_2.12.0             S4Vectors_0.16.0          
## [13] BiocGenerics_0.24.0        knitr_1.20                
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.2        plyr_1.8.4             lazyeval_0.2.1        
##   [4] sp_1.2-7               shinydashboard_0.6.1   splines_3.4.3         
##   [7] digest_0.6.15          htmltools_0.3.6        viridis_0.5.0         
##  [10] magrittr_1.5           memoise_1.1.0          cluster_2.0.6         
##  [13] prettyunits_1.0.2      colorspace_1.3-2       blob_1.1.0            
##  [16] rrcov_1.4-3            xfun_0.1               dplyr_0.7.4           
##  [19] RCurl_1.95-4.10        tximport_1.6.0         lme4_1.1-15           
##  [22] bindr_0.1              zoo_1.8-1              glue_1.2.0            
##  [25] gtable_0.2.0           zlibbioc_1.24.0        XVector_0.18.0        
##  [28] MatrixModels_0.4-1     car_2.1-6              kernlab_0.9-25        
##  [31] prabclus_2.2-6         DEoptimR_1.0-8         SparseM_1.77          
##  [34] VIM_4.7.0              scales_0.5.0           sgeostat_1.0-27       
##  [37] mvtnorm_1.0-7          DBI_0.7                GGally_1.3.2          
##  [40] edgeR_3.20.9           Rcpp_0.12.15           sROC_0.1-2            
##  [43] viridisLite_0.3.0      xtable_1.8-2           progress_1.1.2        
##  [46] laeken_0.4.6           bit_1.1-12             mclust_5.4            
##  [49] vcd_1.4-4              httr_1.3.1             RColorBrewer_1.1-2    
##  [52] fpc_2.1-11             modeltools_0.2-21      pkgconfig_2.0.1       
##  [55] reshape_0.8.7          XML_3.98-1.10          flexmix_2.3-14        
##  [58] nnet_7.3-12            locfit_1.5-9.1         labeling_0.3          
##  [61] rlang_0.2.0            reshape2_1.4.3         AnnotationDbi_1.40.0  
##  [64] munsell_0.4.3          tools_3.4.3            RSQLite_2.0           
##  [67] pls_2.6-0              evaluate_0.10.1        stringr_1.3.0         
##  [70] cvTools_0.3.2          yaml_2.1.17            bit64_0.9-7           
##  [73] robustbase_0.92-8      bindrcpp_0.2           nlme_3.1-129          
##  [76] mime_0.5               quantreg_5.35          biomaRt_2.34.2        
##  [79] compiler_3.4.3         pbkrtest_0.4-7         beeswarm_0.2.3        
##  [82] e1071_1.6-8            tibble_1.4.2           robCompositions_2.0.6 
##  [85] pcaPP_1.9-73           stringi_1.1.6          highr_0.6             
##  [88] lattice_0.20-34        trimcluster_0.1-2      Matrix_1.2-7.1        
##  [91] nloptr_1.0.4           pillar_1.2.1           lmtest_0.9-35         
##  [94] data.table_1.10.4-3    cowplot_0.9.2          bitops_1.0-6          
##  [97] httpuv_1.3.6.1         R6_2.2.2               bookdown_0.7          
## [100] gridExtra_2.3          vipor_0.4.5            boot_1.3-18           
## [103] MASS_7.3-45            assertthat_0.2.0       rhdf5_2.22.0          
## [106] rprojroot_1.3-2        rjson_0.2.15           GenomeInfoDbData_1.0.0
## [109] diptest_0.75-7         mgcv_1.8-23            grid_3.4.3            
## [112] class_7.3-14           minqa_1.2.4            rmarkdown_1.8         
## [115] mvoutlier_2.0.9        shiny_1.0.5            ggbeeswarm_0.6.0

7.2 Expression QC (Reads)

library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
reads <- read.table("tung/reads.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
head(reads[ , 1:3])
##                 NA19098.r1.A01 NA19098.r1.A02 NA19098.r1.A03
## ENSG00000237683              0              0              0
## ENSG00000187634              0              0              0
## ENSG00000188976             57            140              1
## ENSG00000187961              0              0              0
## ENSG00000187583              0              0              0
## ENSG00000187642              0              0              0
head(anno)
##   individual replicate well      batch      sample_id
## 1    NA19098        r1  A01 NA19098.r1 NA19098.r1.A01
## 2    NA19098        r1  A02 NA19098.r1 NA19098.r1.A02
## 3    NA19098        r1  A03 NA19098.r1 NA19098.r1.A03
## 4    NA19098        r1  A04 NA19098.r1 NA19098.r1.A04
## 5    NA19098        r1  A05 NA19098.r1 NA19098.r1.A05
## 6    NA19098        r1  A06 NA19098.r1 NA19098.r1.A06
reads <- SingleCellExperiment(
    assays = list(counts = as.matrix(reads)), 
    colData = anno
)
keep_feature <- rowSums(counts(reads) > 0) > 0
reads <- reads[keep_feature, ]
isSpike(reads, "ERCC") <- grepl("^ERCC-", rownames(reads))
isSpike(reads, "MT") <- rownames(reads) %in% 
    c("ENSG00000198899", "ENSG00000198727", "ENSG00000198888",
    "ENSG00000198886", "ENSG00000212907", "ENSG00000198786",
    "ENSG00000198695", "ENSG00000198712", "ENSG00000198804",
    "ENSG00000198763", "ENSG00000228253", "ENSG00000198938",
    "ENSG00000198840")
reads <- calculateQCMetrics(
    reads,
    feature_controls = list(
        ERCC = isSpike(reads, "ERCC"), 
        MT = isSpike(reads, "MT")
    )
)
hist(
    reads$total_counts,
    breaks = 100
)
abline(v = 1.3e6, col = "red")
Histogram of library sizes for all cells

Figure 7.8: Histogram of library sizes for all cells

filter_by_total_counts <- (reads$total_counts > 1.3e6)
table(filter_by_total_counts)
## filter_by_total_counts
## FALSE  TRUE 
##   180   684
hist(
    reads$total_features,
    breaks = 100
)
abline(v = 7000, col = "red")
Histogram of the number of detected genes in all cells

Figure 7.9: Histogram of the number of detected genes in all cells

filter_by_expr_features <- (reads$total_features > 7000)
table(filter_by_expr_features)
## filter_by_expr_features
## FALSE  TRUE 
##   116   748
plotPhenoData(
    reads,
    aes_string(
        x = "total_features",
        y = "pct_counts_MT",
        colour = "batch"
    )
)
Percentage of counts in MT genes

Figure 7.10: Percentage of counts in MT genes

plotPhenoData(
    reads,
    aes_string(
        x = "total_features",
        y = "pct_counts_ERCC",
        colour = "batch"
    )
)
Percentage of counts in ERCCs

Figure 7.11: Percentage of counts in ERCCs

filter_by_ERCC <- 
    reads$batch != "NA19098.r2" & reads$pct_counts_ERCC < 25
table(filter_by_ERCC)
## filter_by_ERCC
## FALSE  TRUE 
##   103   761
filter_by_MT <- reads$pct_counts_MT < 30
table(filter_by_MT)
## filter_by_MT
## FALSE  TRUE 
##    18   846
reads$use <- (
    # sufficient features (genes)
    filter_by_expr_features &
    # sufficient molecules counted
    filter_by_total_counts &
    # sufficient endogenous RNA
    filter_by_ERCC &
    # remove cells with unusual number of reads in MT genes
    filter_by_MT
)
table(reads$use)
## 
## FALSE  TRUE 
##   258   606
reads <- plotPCA(
    reads,
    size_by = "total_features", 
    shape_by = "use",
    pca_data_input = "pdata",
    detect_outliers = TRUE,
    return_SCE = TRUE
)
PCA plot used for automatic detection of cell outliers

Figure 7.12: PCA plot used for automatic detection of cell outliers

table(reads$outlier)
## 
## FALSE  TRUE 
##   756   108
library(limma)
## 
## Attaching package: 'limma'
## The following object is masked from 'package:scater':
## 
##     plotMDS
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
auto <- colnames(reads)[reads$outlier]
man <- colnames(reads)[!reads$use]
venn.diag <- vennCounts(
    cbind(colnames(reads) %in% auto,
    colnames(reads) %in% man)
)
vennDiagram(
    venn.diag,
    names = c("Automatic", "Manual"),
    circle.col = c("blue", "green")
)
Comparison of the default, automatic and manual cell filters

Figure 7.13: Comparison of the default, automatic and manual cell filters

plotQC(reads, type = "highest-expression")
Number of total counts consumed by the top 50 expressed genes

Figure 7.14: Number of total counts consumed by the top 50 expressed genes

filter_genes <- apply(
    counts(reads[, colData(reads)$use]), 
    1, 
    function(x) length(x[x > 1]) >= 2
)
rowData(reads)$use <- filter_genes
table(filter_genes)
## filter_genes
## FALSE  TRUE 
##  2664 16062
dim(reads[rowData(reads)$use, colData(reads)$use])
## [1] 16062   606
assay(reads, "logcounts_raw") <- log2(counts(reads) + 1)
reducedDim(reads) <- NULL
saveRDS(reads, file = "tung/reads.rds")

By comparing Figure 7.6 and Figure 7.13, it is clear that the reads based filtering removed more cells than the UMI based analysis. If you go back and compare the results you should be able to conclude that the ERCC and MT filters are more strict for the reads-based analysis.

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] limma_3.34.9               scater_1.6.3              
##  [3] ggplot2_2.2.1              SingleCellExperiment_1.0.0
##  [5] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
##  [7] matrixStats_0.53.1         Biobase_2.38.0            
##  [9] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
## [11] IRanges_2.12.0             S4Vectors_0.16.0          
## [13] BiocGenerics_0.24.0        knitr_1.20                
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.2        plyr_1.8.4             lazyeval_0.2.1        
##   [4] sp_1.2-7               shinydashboard_0.6.1   splines_3.4.3         
##   [7] digest_0.6.15          htmltools_0.3.6        viridis_0.5.0         
##  [10] magrittr_1.5           memoise_1.1.0          cluster_2.0.6         
##  [13] prettyunits_1.0.2      colorspace_1.3-2       blob_1.1.0            
##  [16] rrcov_1.4-3            xfun_0.1               dplyr_0.7.4           
##  [19] RCurl_1.95-4.10        tximport_1.6.0         lme4_1.1-15           
##  [22] bindr_0.1              zoo_1.8-1              glue_1.2.0            
##  [25] gtable_0.2.0           zlibbioc_1.24.0        XVector_0.18.0        
##  [28] MatrixModels_0.4-1     car_2.1-6              kernlab_0.9-25        
##  [31] prabclus_2.2-6         DEoptimR_1.0-8         SparseM_1.77          
##  [34] VIM_4.7.0              scales_0.5.0           sgeostat_1.0-27       
##  [37] mvtnorm_1.0-7          DBI_0.7                GGally_1.3.2          
##  [40] edgeR_3.20.9           Rcpp_0.12.15           sROC_0.1-2            
##  [43] viridisLite_0.3.0      xtable_1.8-2           progress_1.1.2        
##  [46] laeken_0.4.6           bit_1.1-12             mclust_5.4            
##  [49] vcd_1.4-4              httr_1.3.1             RColorBrewer_1.1-2    
##  [52] fpc_2.1-11             modeltools_0.2-21      pkgconfig_2.0.1       
##  [55] reshape_0.8.7          XML_3.98-1.10          flexmix_2.3-14        
##  [58] nnet_7.3-12            locfit_1.5-9.1         labeling_0.3          
##  [61] rlang_0.2.0            reshape2_1.4.3         AnnotationDbi_1.40.0  
##  [64] munsell_0.4.3          tools_3.4.3            RSQLite_2.0           
##  [67] pls_2.6-0              evaluate_0.10.1        stringr_1.3.0         
##  [70] cvTools_0.3.2          yaml_2.1.17            bit64_0.9-7           
##  [73] robustbase_0.92-8      bindrcpp_0.2           nlme_3.1-129          
##  [76] mime_0.5               quantreg_5.35          biomaRt_2.34.2        
##  [79] compiler_3.4.3         pbkrtest_0.4-7         beeswarm_0.2.3        
##  [82] e1071_1.6-8            tibble_1.4.2           robCompositions_2.0.6 
##  [85] pcaPP_1.9-73           stringi_1.1.6          highr_0.6             
##  [88] lattice_0.20-34        trimcluster_0.1-2      Matrix_1.2-7.1        
##  [91] nloptr_1.0.4           pillar_1.2.1           lmtest_0.9-35         
##  [94] data.table_1.10.4-3    cowplot_0.9.2          bitops_1.0-6          
##  [97] httpuv_1.3.6.1         R6_2.2.2               bookdown_0.7          
## [100] gridExtra_2.3          vipor_0.4.5            boot_1.3-18           
## [103] MASS_7.3-45            assertthat_0.2.0       rhdf5_2.22.0          
## [106] rprojroot_1.3-2        rjson_0.2.15           GenomeInfoDbData_1.0.0
## [109] diptest_0.75-7         mgcv_1.8-23            grid_3.4.3            
## [112] class_7.3-14           minqa_1.2.4            rmarkdown_1.8         
## [115] mvoutlier_2.0.9        shiny_1.0.5            ggbeeswarm_0.6.0

7.3 Data visualization

7.3.1 Introduction

In this chapter we will continue to work with the filtered Tung dataset produced in the previous chapter. We will explore different ways of visualizing the data to allow you to asses what happened to the expression matrix after the quality control step. scater package provides several very useful functions to simplify visualisation.

One important aspect of single-cell RNA-seq is to control for batch effects. Batch effects are technical artefacts that are added to the samples during handling. For example, if two sets of samples were prepared in different labs or even on different days in the same lab, then we may observe greater similarities between the samples that were handled together. In the worst case scenario, batch effects may be mistaken for true biological variation. The Tung data allows us to explore these issues in a controlled manner since some of the salient aspects of how the samples were handled have been recorded. Ideally, we expect to see batches from the same individual grouping together and distinct groups corresponding to each individual.

library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[rowData(umi)$use, colData(umi)$use]
endog_genes <- !rowData(umi.qc)$is_feature_control

7.3.2 PCA plot

The easiest way to overview the data is by transforming it using the principal component analysis and then visualize the first two principal components.

Principal component analysis (PCA) is a statistical procedure that uses a transformation to convert a set of observations into a set of values of linearly uncorrelated variables called principal components (PCs). The number of principal components is less than or equal to the number of original variables.

Mathematically, the PCs correspond to the eigenvectors of the covariance matrix. The eigenvectors are sorted by eigenvalue so that the first principal component accounts for as much of the variability in the data as possible, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components (the figure below is taken from here).

Schematic representation of PCA dimensionality reduction

Figure 7.15: Schematic representation of PCA dimensionality reduction

7.3.2.1 Before QC

Without log-transformation:

plotPCA(
    umi[endog_genes, ],
    exprs_values = "counts",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data

Figure 7.16: PCA plot of the tung data

With log-transformation:

plotPCA(
    umi[endog_genes, ],
    exprs_values = "logcounts_raw",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data

Figure 7.17: PCA plot of the tung data

Clearly log-transformation is benefitial for our data - it reduces the variance on the first principal component and already separates some biological effects. Moreover, it makes the distribution of the expression values more normal. In the following analysis and chapters we will be using log-transformed raw counts by default.

However, note that just a log-transformation is not enough to account for different technical factors between the cells (e.g. sequencing depth). Therefore, please do not use logcounts_raw for your downstream analysis, instead as a minimum suitable data use the logcounts slot of the SingleCellExperiment object, which not just log-transformed, but also normalised by library size (e.g. CPM normalisation). In the course we use logcounts_raw only for demonstration purposes!

7.3.2.2 After QC

plotPCA(
    umi.qc[endog_genes, ],
    exprs_values = "logcounts_raw",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data

Figure 7.18: PCA plot of the tung data

Comparing Figure 7.17 and Figure 7.18, it is clear that after quality control the NA19098.r2 cells no longer form a group of outliers.

By default only the top 500 most variable genes are used by scater to calculate the PCA. This can be adjusted by changing the ntop argument.

Exercise 1 How do the PCA plots change if when all 14,214 genes are used? Or when only top 50 genes are used? Why does the fraction of variance accounted for by the first PC change so dramatically?

Hint Use ntop argument of the plotPCA function.

Our answer

PCA plot of the tung data (14214 genes)

Figure 7.19: PCA plot of the tung data (14214 genes)

PCA plot of the tung data (50 genes)

Figure 7.20: PCA plot of the tung data (50 genes)

If your answers are different please compare your code with ours (you need to search for this exercise in the opened file).

7.3.3 tSNE map

An alternative to PCA for visualizing scRNASeq data is a tSNE plot. tSNE (t-Distributed Stochastic Neighbor Embedding) combines dimensionality reduction (e.g. PCA) with random walks on the nearest-neighbour network to map high dimensional data (i.e. our 14,214 dimensional expression matrix) to a 2-dimensional space while preserving local distances between cells. In contrast with PCA, tSNE is a stochastic algorithm which means running the method multiple times on the same dataset will result in different plots. Due to the non-linear and stochastic nature of the algorithm, tSNE is more difficult to intuitively interpret tSNE. To ensure reproducibility, we fix the “seed” of the random-number generator in the code below so that we always get the same plot.

7.3.3.1 Before QC

plotTSNE(
    umi[endog_genes, ],
    exprs_values = "logcounts_raw",
    perplexity = 130,
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual",
    rand_seed = 123456
)
tSNE map of the tung data

Figure 7.21: tSNE map of the tung data

7.3.3.2 After QC

plotTSNE(
    umi.qc[endog_genes, ],
    exprs_values = "logcounts_raw",
    perplexity = 130,
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual",
    rand_seed = 123456
)
tSNE map of the tung data

Figure 7.22: tSNE map of the tung data

Interpreting PCA and tSNE plots is often challenging and due to their stochastic and non-linear nature, they are less intuitive. However, in this case it is clear that they provide a similar picture of the data. Comparing Figure 7.21 and 7.22, it is again clear that the samples from NA19098.r2 are no longer outliers after the QC filtering.

Furthermore tSNE requires you to provide a value of perplexity which reflects the number of neighbours used to build the nearest-neighbour network; a high value creates a dense network which clumps cells together while a low value makes the network more sparse allowing groups of cells to separate from each other. scater uses a default perplexity of the total number of cells divided by five (rounded down).

You can read more about the pitfalls of using tSNE here.

Exercise 2 How do the tSNE plots change when a perplexity of 10 or 200 is used? How does the choice of perplexity affect the interpretation of the results?

Our answer

tSNE map of the tung data (perplexity = 10)

Figure 7.23: tSNE map of the tung data (perplexity = 10)

tSNE map of the tung data (perplexity = 200)

Figure 7.24: tSNE map of the tung data (perplexity = 200)

7.3.4 Big Exercise

Perform the same analysis with read counts of the Blischak data. Use tung/reads.rds file to load the reads SCE object. Once you have finished please compare your results to ours (next chapter).

7.3.5 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] scater_1.6.3               ggplot2_2.2.1             
##  [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [5] DelayedArray_0.4.1         matrixStats_0.53.1        
##  [7] Biobase_2.38.0             GenomicRanges_1.30.3      
##  [9] GenomeInfoDb_1.14.0        IRanges_2.12.0            
## [11] S4Vectors_0.16.0           BiocGenerics_0.24.0       
## [13] knitr_1.20                
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.0          httr_1.3.1             edgeR_3.20.9          
##  [4] bit64_0.9-7            viridisLite_0.3.0      shiny_1.0.5           
##  [7] assertthat_0.2.0       highr_0.6              blob_1.1.0            
## [10] vipor_0.4.5            GenomeInfoDbData_1.0.0 yaml_2.1.17           
## [13] progress_1.1.2         pillar_1.2.1           RSQLite_2.0           
## [16] backports_1.1.2        lattice_0.20-34        glue_1.2.0            
## [19] limma_3.34.9           digest_0.6.15          XVector_0.18.0        
## [22] colorspace_1.3-2       cowplot_0.9.2          htmltools_0.3.6       
## [25] httpuv_1.3.6.1         Matrix_1.2-7.1         plyr_1.8.4            
## [28] XML_3.98-1.10          pkgconfig_2.0.1        biomaRt_2.34.2        
## [31] bookdown_0.7           zlibbioc_1.24.0        xtable_1.8-2          
## [34] scales_0.5.0           Rtsne_0.13             tibble_1.4.2          
## [37] lazyeval_0.2.1         magrittr_1.5           mime_0.5              
## [40] memoise_1.1.0          evaluate_0.10.1        beeswarm_0.2.3        
## [43] shinydashboard_0.6.1   tools_3.4.3            data.table_1.10.4-3   
## [46] prettyunits_1.0.2      stringr_1.3.0          munsell_0.4.3         
## [49] locfit_1.5-9.1         AnnotationDbi_1.40.0   bindrcpp_0.2          
## [52] compiler_3.4.3         rlang_0.2.0            rhdf5_2.22.0          
## [55] grid_3.4.3             RCurl_1.95-4.10        tximport_1.6.0        
## [58] rjson_0.2.15           labeling_0.3           bitops_1.0-6          
## [61] rmarkdown_1.8          gtable_0.2.0           DBI_0.7               
## [64] reshape2_1.4.3         R6_2.2.2               gridExtra_2.3         
## [67] dplyr_0.7.4            bit_1.1-12             bindr_0.1             
## [70] rprojroot_1.3-2        ggbeeswarm_0.6.0       stringi_1.1.6         
## [73] Rcpp_0.12.15           xfun_0.1

7.4 Data visualization (Reads)

library(scater)
options(stringsAsFactors = FALSE)
reads <- readRDS("tung/reads.rds")
reads.qc <- reads[rowData(reads)$use, colData(reads)$use]
endog_genes <- !rowData(reads.qc)$is_feature_control
plotPCA(
    reads[endog_genes, ],
    exprs_values = "counts",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data

Figure 7.25: PCA plot of the tung data

plotPCA(
    reads[endog_genes, ],
    exprs_values = "logcounts_raw",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data

Figure 7.26: PCA plot of the tung data

plotPCA(
    reads.qc[endog_genes, ],
    exprs_values = "logcounts_raw",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data

Figure 7.27: PCA plot of the tung data

plotTSNE(
    reads[endog_genes, ],
    exprs_values = "logcounts_raw",
    perplexity = 130,
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual",
    rand_seed = 123456
)
tSNE map of the tung data

Figure 7.28: tSNE map of the tung data

plotTSNE(
    reads.qc[endog_genes, ],
    exprs_values = "logcounts_raw",
    perplexity = 130,
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual",
    rand_seed = 123456
)
tSNE map of the tung data

Figure 7.29: tSNE map of the tung data

tSNE map of the tung data (perplexity = 10)

Figure 7.23: tSNE map of the tung data (perplexity = 10)

tSNE map of the tung data (perplexity = 200)

Figure 7.24: tSNE map of the tung data (perplexity = 200)

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] knitr_1.20                 scater_1.6.3              
##  [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [5] DelayedArray_0.4.1         matrixStats_0.53.1        
##  [7] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
##  [9] IRanges_2.12.0             S4Vectors_0.16.0          
## [11] ggplot2_2.2.1              Biobase_2.38.0            
## [13] BiocGenerics_0.24.0       
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.0          httr_1.3.1             edgeR_3.20.9          
##  [4] bit64_0.9-7            viridisLite_0.3.0      shiny_1.0.5           
##  [7] assertthat_0.2.0       highr_0.6              blob_1.1.0            
## [10] vipor_0.4.5            GenomeInfoDbData_1.0.0 yaml_2.1.17           
## [13] progress_1.1.2         pillar_1.2.1           RSQLite_2.0           
## [16] backports_1.1.2        lattice_0.20-34        glue_1.2.0            
## [19] limma_3.34.9           digest_0.6.15          XVector_0.18.0        
## [22] colorspace_1.3-2       cowplot_0.9.2          htmltools_0.3.6       
## [25] httpuv_1.3.6.1         Matrix_1.2-7.1         plyr_1.8.4            
## [28] XML_3.98-1.10          pkgconfig_2.0.1        biomaRt_2.34.2        
## [31] bookdown_0.7           zlibbioc_1.24.0        xtable_1.8-2          
## [34] scales_0.5.0           Rtsne_0.13             tibble_1.4.2          
## [37] lazyeval_0.2.1         magrittr_1.5           mime_0.5              
## [40] memoise_1.1.0          evaluate_0.10.1        beeswarm_0.2.3        
## [43] shinydashboard_0.6.1   tools_3.4.3            data.table_1.10.4-3   
## [46] prettyunits_1.0.2      stringr_1.3.0          munsell_0.4.3         
## [49] locfit_1.5-9.1         AnnotationDbi_1.40.0   bindrcpp_0.2          
## [52] compiler_3.4.3         rlang_0.2.0            rhdf5_2.22.0          
## [55] grid_3.4.3             RCurl_1.95-4.10        tximport_1.6.0        
## [58] rjson_0.2.15           labeling_0.3           bitops_1.0-6          
## [61] rmarkdown_1.8          gtable_0.2.0           DBI_0.7               
## [64] reshape2_1.4.3         R6_2.2.2               gridExtra_2.3         
## [67] dplyr_0.7.4            bit_1.1-12             bindr_0.1             
## [70] rprojroot_1.3-2        ggbeeswarm_0.6.0       stringi_1.1.6         
## [73] Rcpp_0.12.15           xfun_0.1

7.5 Identifying confounding factors

7.5.1 Introduction

There is a large number of potential confounders, artifacts and biases in sc-RNA-seq data. One of the main challenges in analyzing scRNA-seq data stems from the fact that it is difficult to carry out a true technical replicate (why?) to distinguish biological and technical variability. In the previous chapters we considered batch effects and in this chapter we will continue to explore how experimental artifacts can be identified and removed. We will continue using the scater package since it provides a set of methods specifically for quality control of experimental and explanatory variables. Moreover, we will continue to work with the Blischak data that was used in the previous chapter.

library(scater, quietly = TRUE)
options(stringsAsFactors = FALSE)
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[rowData(umi)$use, colData(umi)$use]
endog_genes <- !rowData(umi.qc)$is_feature_control

The umi.qc dataset contains filtered cells and genes. Our next step is to explore technical drivers of variability in the data to inform data normalisation before downstream analysis.

7.5.2 Correlations with PCs

Let’s first look again at the PCA plot of the QCed dataset:

plotPCA(
    umi.qc[endog_genes, ],
    exprs_values = "logcounts_raw",
    colour_by = "batch",
    size_by = "total_features"
)
PCA plot of the tung data

Figure 7.30: PCA plot of the tung data

scater allows one to identify principal components that correlate with experimental and QC variables of interest (it ranks principle components by \(R^2\) from a linear model regressing PC value against the variable of interest).

Let’s test whether some of the variables correlate with any of the PCs.

7.5.2.1 Detected genes

plotQC(
    umi.qc[endog_genes, ],
    type = "find-pcs",
    exprs_values = "logcounts_raw",
    variable = "total_features"
)
PC correlation with the number of detected genes

Figure 7.31: PC correlation with the number of detected genes

Indeed, we can see that PC1 can be almost completely explained by the number of detected genes. In fact, it was also visible on the PCA plot above. This is a well-known issue in scRNA-seq and was described here.

7.5.3 Explanatory variables

scater can also compute the marginal \(R^2\) for each variable when fitting a linear model regressing expression values for each gene against just that variable, and display a density plot of the gene-wise marginal \(R^2\) values for the variables.

plotQC(
    umi.qc[endog_genes, ],
    type = "expl",
    exprs_values = "logcounts_raw",
    variables = c(
        "total_features",
        "total_counts",
        "batch",
        "individual",
        "pct_counts_ERCC",
        "pct_counts_MT"
    )
)
Explanatory variables

Figure 7.32: Explanatory variables

This analysis indicates that the number of detected genes (again) and also the sequencing depth (number of counts) have substantial explanatory power for many genes, so these variables are good candidates for conditioning out in a normalisation step, or including in downstream statistical models. Expression of ERCCs also appears to be an important explanatory variable and one notable feature of the above plot is that batch explains more than individual. What does that tell us about the technical and biological variability of the data?

7.5.4 Other confounders

In addition to correcting for batch, there are other factors that one may want to compensate for. As with batch correction, these adjustments require extrinsic information. One popular method is scLVM which allows you to identify and subtract the effect from processes such as cell-cycle or apoptosis.

In addition, protocols may differ in terms of their coverage of each transcript, their bias based on the average content of A/T nucleotides, or their ability to capture short transcripts. Ideally, we would like to compensate for all of these differences and biases.

7.5.5 Exercise

Perform the same analysis with read counts of the Blischak data. Use tung/reads.rds file to load the reads SCESet object. Once you have finished please compare your results to ours (next chapter).

7.5.6 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] scater_1.6.3               SingleCellExperiment_1.0.0
##  [3] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
##  [5] matrixStats_0.53.1         GenomicRanges_1.30.3      
##  [7] GenomeInfoDb_1.14.0        IRanges_2.12.0            
##  [9] S4Vectors_0.16.0           ggplot2_2.2.1             
## [11] Biobase_2.38.0             BiocGenerics_0.24.0       
## [13] knitr_1.20                
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.0          httr_1.3.1             edgeR_3.20.9          
##  [4] bit64_0.9-7            viridisLite_0.3.0      shiny_1.0.5           
##  [7] assertthat_0.2.0       highr_0.6              blob_1.1.0            
## [10] vipor_0.4.5            GenomeInfoDbData_1.0.0 yaml_2.1.17           
## [13] progress_1.1.2         pillar_1.2.1           RSQLite_2.0           
## [16] backports_1.1.2        lattice_0.20-34        glue_1.2.0            
## [19] limma_3.34.9           digest_0.6.15          XVector_0.18.0        
## [22] colorspace_1.3-2       cowplot_0.9.2          htmltools_0.3.6       
## [25] httpuv_1.3.6.1         Matrix_1.2-7.1         plyr_1.8.4            
## [28] XML_3.98-1.10          pkgconfig_2.0.1        biomaRt_2.34.2        
## [31] bookdown_0.7           zlibbioc_1.24.0        xtable_1.8-2          
## [34] scales_0.5.0           tibble_1.4.2           lazyeval_0.2.1        
## [37] magrittr_1.5           mime_0.5               memoise_1.1.0         
## [40] evaluate_0.10.1        beeswarm_0.2.3         shinydashboard_0.6.1  
## [43] tools_3.4.3            data.table_1.10.4-3    prettyunits_1.0.2     
## [46] stringr_1.3.0          munsell_0.4.3          locfit_1.5-9.1        
## [49] AnnotationDbi_1.40.0   bindrcpp_0.2           compiler_3.4.3        
## [52] rlang_0.2.0            rhdf5_2.22.0           grid_3.4.3            
## [55] RCurl_1.95-4.10        tximport_1.6.0         rjson_0.2.15          
## [58] labeling_0.3           bitops_1.0-6           rmarkdown_1.8         
## [61] gtable_0.2.0           DBI_0.7                reshape2_1.4.3        
## [64] R6_2.2.2               gridExtra_2.3          dplyr_0.7.4           
## [67] bit_1.1-12             bindr_0.1              rprojroot_1.3-2       
## [70] ggbeeswarm_0.6.0       stringi_1.1.6          Rcpp_0.12.15          
## [73] xfun_0.1

7.6 Identifying confounding factors (Reads)

PCA plot of the tung data

Figure 7.33: PCA plot of the tung data

PC correlation with the number of detected genes

Figure 7.34: PC correlation with the number of detected genes

Explanatory variables

Figure 7.35: Explanatory variables

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] knitr_1.20                 scater_1.6.3              
##  [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [5] DelayedArray_0.4.1         matrixStats_0.53.1        
##  [7] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
##  [9] IRanges_2.12.0             S4Vectors_0.16.0          
## [11] ggplot2_2.2.1              Biobase_2.38.0            
## [13] BiocGenerics_0.24.0       
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.0          httr_1.3.1             edgeR_3.20.9          
##  [4] bit64_0.9-7            viridisLite_0.3.0      shiny_1.0.5           
##  [7] assertthat_0.2.0       highr_0.6              blob_1.1.0            
## [10] vipor_0.4.5            GenomeInfoDbData_1.0.0 yaml_2.1.17           
## [13] progress_1.1.2         pillar_1.2.1           RSQLite_2.0           
## [16] backports_1.1.2        lattice_0.20-34        glue_1.2.0            
## [19] limma_3.34.9           digest_0.6.15          XVector_0.18.0        
## [22] colorspace_1.3-2       cowplot_0.9.2          htmltools_0.3.6       
## [25] httpuv_1.3.6.1         Matrix_1.2-7.1         plyr_1.8.4            
## [28] XML_3.98-1.10          pkgconfig_2.0.1        biomaRt_2.34.2        
## [31] bookdown_0.7           zlibbioc_1.24.0        xtable_1.8-2          
## [34] scales_0.5.0           tibble_1.4.2           lazyeval_0.2.1        
## [37] magrittr_1.5           mime_0.5               memoise_1.1.0         
## [40] evaluate_0.10.1        beeswarm_0.2.3         shinydashboard_0.6.1  
## [43] tools_3.4.3            data.table_1.10.4-3    prettyunits_1.0.2     
## [46] stringr_1.3.0          munsell_0.4.3          locfit_1.5-9.1        
## [49] AnnotationDbi_1.40.0   bindrcpp_0.2           compiler_3.4.3        
## [52] rlang_0.2.0            rhdf5_2.22.0           grid_3.4.3            
## [55] RCurl_1.95-4.10        tximport_1.6.0         rjson_0.2.15          
## [58] labeling_0.3           bitops_1.0-6           rmarkdown_1.8         
## [61] gtable_0.2.0           DBI_0.7                reshape2_1.4.3        
## [64] R6_2.2.2               gridExtra_2.3          dplyr_0.7.4           
## [67] bit_1.1-12             bindr_0.1              rprojroot_1.3-2       
## [70] ggbeeswarm_0.6.0       stringi_1.1.6          Rcpp_0.12.15          
## [73] xfun_0.1

7.7 Normalization theory

7.7.1 Introduction

In the previous chapter we identified important confounding factors and explanatory variables. scater allows one to account for these variables in subsequent statistical models or to condition them out using normaliseExprs(), if so desired. This can be done by providing a design matrix to normaliseExprs(). We are not covering this topic here, but you can try to do it yourself as an exercise.

Instead we will explore how simple size-factor normalisations correcting for library size can remove the effects of some of the confounders and explanatory variables.

7.7.2 Library size

Library sizes vary because scRNA-seq data is often sequenced on highly multiplexed platforms the total reads which are derived from each cell may differ substantially. Some quantification methods (eg. Cufflinks, RSEM) incorporated library size when determining gene expression estimates thus do not require this normalization.

However, if another quantification method was used then library size must be corrected for by multiplying or dividing each column of the expression matrix by a “normalization factor” which is an estimate of the library size relative to the other cells. Many methods to correct for library size have been developped for bulk RNA-seq and can be equally applied to scRNA-seq (eg. UQ, SF, CPM, RPKM, FPKM, TPM).

7.7.3 Normalisations

7.7.3.1 CPM

The simplest way to normalize this data is to convert it to counts per million (CPM) by dividing each column by its total then multiplying by 1,000,000. Note that spike-ins should be excluded from the calculation of total expression in order to correct for total cell RNA content, therefore we will only use endogenous genes. Example of a CPM function in R:

calc_cpm <-
function (expr_mat, spikes = NULL) 
{
    norm_factor <- colSums(expr_mat[-spikes, ])
    return(t(t(expr_mat)/norm_factor)) * 10^6
}

One potential drawback of CPM is if your sample contains genes that are both very highly expressed and differentially expressed across the cells. In this case, the total molecules in the cell may depend of whether such genes are on/off in the cell and normalizing by total molecules may hide the differential expression of those genes and/or falsely create differential expression for the remaining genes.

Note RPKM, FPKM and TPM are variants on CPM which further adjust counts by the length of the respective gene/transcript.

To deal with this potentiality several other measures were devised.

7.7.3.2 RLE (SF)

The size factor (SF) was proposed and popularized by DESeq (Anders and Huber 2010). First the geometric mean of each gene across all cells is calculated. The size factor for each cell is the median across genes of the ratio of the expression to the gene’s geometric mean. A drawback to this method is that since it uses the geometric mean only genes with non-zero expression values across all cells can be used in its calculation, making it unadvisable for large low-depth scRNASeq experiments. edgeR & scater call this method RLE for “relative log expression”. Example of a SF function in R:

calc_sf <-
function (expr_mat, spikes = NULL) 
{
    geomeans <- exp(rowMeans(log(expr_mat[-spikes, ])))
    SF <- function(cnts) {
        median((cnts/geomeans)[(is.finite(geomeans) & geomeans > 
            0)])
    }
    norm_factor <- apply(expr_mat[-spikes, ], 2, SF)
    return(t(t(expr_mat)/norm_factor))
}

7.7.3.3 UQ

The upperquartile (UQ) was proposed by (Bullard et al. 2010). Here each column is divided by the 75% quantile of the counts for each library. Often the calculated quantile is scaled by the median across cells to keep the absolute level of expression relatively consistent. A drawback to this method is that for low-depth scRNASeq experiments the large number of undetected genes may result in the 75% quantile being zero (or close to it). This limitation can be overcome by generalizing the idea and using a higher quantile (eg. the 99% quantile is the default in scater) or by excluding zeros prior to calculating the 75% quantile. Example of a UQ function in R:

calc_uq <-
function (expr_mat, spikes = NULL) 
{
    UQ <- function(x) {
        quantile(x[x > 0], 0.75)
    }
    uq <- unlist(apply(expr_mat[-spikes, ], 2, UQ))
    norm_factor <- uq/median(uq)
    return(t(t(expr_mat)/norm_factor))
}

7.7.3.4 TMM

Another method is called TMM is the weighted trimmed mean of M-values (to the reference) proposed by (Robinson and Oshlack 2010). The M-values in question are the gene-wise log2 fold changes between individual cells. One cell is used as the reference then the M-values for each other cell is calculated compared to this reference. These values are then trimmed by removing the top and bottom ~30%, and the average of the remaining values is calculated by weighting them to account for the effect of the log scale on variance. Each non-reference cell is multiplied by the calculated factor. Two potential issues with this method are insufficient non-zero genes left after trimming, and the assumption that most genes are not differentially expressed.

7.7.3.5 scran

scran package implements a variant on CPM specialized for single-cell data (L. Lun, Bach, and Marioni 2016). Briefly this method deals with the problem of vary large numbers of zero values per cell by pooling cells together calculating a normalization factor (similar to CPM) for the sum of each pool. Since each cell is found in many different pools, cell-specific factors can be deconvoluted from the collection of pool-specific factors using linear algebra.

7.7.3.6 Downsampling

A final way to correct for library size is to downsample the expression matrix so that each cell has approximately the same total number of molecules. The benefit of this method is that zero values will be introduced by the down sampling thus eliminating any biases due to differing numbers of detected genes. However, the major drawback is that the process is not deterministic so each time the downsampling is run the resulting expression matrix is slightly different. Thus, often analyses must be run on multiple downsamplings to ensure results are robust. Example of a downsampling function in R:

Down_Sample_Matrix <-
function (expr_mat) 
{
    min_lib_size <- min(colSums(expr_mat))
    down_sample <- function(x) {
        prob <- min_lib_size/sum(x)
        return(unlist(lapply(x, function(y) {
            rbinom(1, y, prob)
        })))
    }
    down_sampled_mat <- apply(expr_mat, 2, down_sample)
    return(down_sampled_mat)
}

7.7.4 Effectiveness

to compare the efficiency of different normalization methods we will use visual inspection of PCA plots and calculation of cell-wise relative log expression via scater’s plotRLE() function. Namely, cells with many (few) reads have higher (lower) than median expression for most genes resulting in a positive (negative) RLE across the cell, whereas normalized cells have an RLE close to zero. Example of a RLE function in R:

calc_cell_RLE <-
function (expr_mat, spikes = NULL) 
{
    RLE_gene <- function(x) {
        if (median(unlist(x)) > 0) {
            log((x + 1)/(median(unlist(x)) + 1))/log(2)
        }
        else {
            rep(NA, times = length(x))
        }
    }
    if (!is.null(spikes)) {
        RLE_matrix <- t(apply(expr_mat[-spikes, ], 1, RLE_gene))
    }
    else {
        RLE_matrix <- t(apply(expr_mat, 1, RLE_gene))
    }
    cell_RLE <- apply(RLE_matrix, 2, median, na.rm = T)
    return(cell_RLE)
}

Note The RLE, TMM, and UQ size-factor methods were developed for bulk RNA-seq data and, depending on the experimental context, may not be appropriate for single-cell RNA-seq data, as their underlying assumptions may be problematically violated.

Note scater acts as a wrapper for the calcNormFactors function from edgeR which implements several library size normalization methods making it easy to apply any of these methods to our data.

Note edgeR makes extra adjustments to some of the normalization methods which may result in somewhat different results than if the original methods are followed exactly, e.g. edgeR’s and scater’s “RLE” method which is based on the “size factor” used by DESeq may give different results to the estimateSizeFactorsForMatrix method in the DESeq/DESeq2 packages. In addition, some versions of edgeR will not calculate the normalization factors correctly unless lib.size is set at 1 for all cells.

Note For CPM normalisation we use scater’s calculateCPM() function. For RLE, UQ and TMM we use scater’s normaliseExprs() function. For scran we use scran package to calculate size factors (it also operates on SingleCellExperiment class) and scater’s normalize() to normalise the data. All these normalization functions save the results to the logcounts slot of the SCE object. For downsampling we use our own functions shown above.

7.8 Normalization practice (UMI)

We will continue to work with the tung data that was used in the previous chapter.

library(scRNA.seq.funcs)
library(scater)
library(scran)
options(stringsAsFactors = FALSE)
set.seed(1234567)
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[rowData(umi)$use, colData(umi)$use]
endog_genes <- !rowData(umi.qc)$is_feature_control

7.8.1 Raw

plotPCA(
    umi.qc[endog_genes, ],
    exprs_values = "logcounts_raw",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data

Figure 7.36: PCA plot of the tung data

7.8.2 CPM

logcounts(umi.qc) <- log2(calculateCPM(umi.qc, use.size.factors = FALSE) + 1)
plotPCA(
    umi.qc[endog_genes, ],
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data after CPM normalisation

Figure 7.37: PCA plot of the tung data after CPM normalisation

plotRLE(
    umi.qc[endog_genes, ], 
    exprs_mats = list(Raw = "logcounts_raw", CPM = "logcounts"),
    exprs_logged = c(TRUE, TRUE),
    colour_by = "batch"
)
Cell-wise RLE of the tung data

Figure 7.38: Cell-wise RLE of the tung data

7.8.3 Size-factor (RLE)

umi.qc <- normaliseExprs(
    umi.qc,
    method = "RLE", 
    feature_set = endog_genes,
    return_log = TRUE,
    return_norm_as_exprs = TRUE
)
## Warning in normalizeSCE(object, exprs_values = exprs_values, return_log
## = return_log, : spike-in transcripts in 'ERCC' should have their own size
## factors
plotPCA(
    umi.qc[endog_genes, ],
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data after RLE normalisation

Figure 7.39: PCA plot of the tung data after RLE normalisation

plotRLE(
    umi.qc[endog_genes, ], 
    exprs_mats = list(Raw = "logcounts_raw", RLE = "logcounts"),
    exprs_logged = c(TRUE, TRUE),
    colour_by = "batch"
)
Cell-wise RLE of the tung data

Figure 7.40: Cell-wise RLE of the tung data

7.8.4 Upperquantile

umi.qc <- normaliseExprs(
    umi.qc,
    method = "upperquartile", 
    feature_set = endog_genes,
    p = 0.99,
    return_log = TRUE,
    return_norm_as_exprs = TRUE
)
## Warning in normalizeSCE(object, exprs_values = exprs_values, return_log
## = return_log, : spike-in transcripts in 'ERCC' should have their own size
## factors
plotPCA(
    umi.qc[endog_genes, ],
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data after UQ normalisation

Figure 7.41: PCA plot of the tung data after UQ normalisation

plotRLE(
    umi.qc[endog_genes, ], 
    exprs_mats = list(Raw = "logcounts_raw", UQ = "logcounts"),
    exprs_logged = c(TRUE, TRUE),
    colour_by = "batch"
)
Cell-wise RLE of the tung data

Figure 7.42: Cell-wise RLE of the tung data

7.8.5 TMM

umi.qc <- normaliseExprs(
    umi.qc,
    method = "TMM",
    feature_set = endog_genes,
    return_log = TRUE,
    return_norm_as_exprs = TRUE
)
## Warning in normalizeSCE(object, exprs_values = exprs_values, return_log
## = return_log, : spike-in transcripts in 'ERCC' should have their own size
## factors
plotPCA(
    umi.qc[endog_genes, ],
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data after TMM normalisation

Figure 7.43: PCA plot of the tung data after TMM normalisation

plotRLE(
    umi.qc[endog_genes, ], 
    exprs_mats = list(Raw = "logcounts_raw", TMM = "logcounts"),
    exprs_logged = c(TRUE, TRUE),
    colour_by = "batch"
)
Cell-wise RLE of the tung data

Figure 7.44: Cell-wise RLE of the tung data

7.8.6 scran

qclust <- quickCluster(umi.qc, min.size = 30)
umi.qc <- computeSumFactors(umi.qc, sizes = 15, clusters = qclust)
umi.qc <- normalize(umi.qc)
## Warning in .local(object, ...): spike-in transcripts in 'ERCC' should have
## their own size factors
plotPCA(
    umi.qc[endog_genes, ],
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data after LSF normalisation

Figure 7.45: PCA plot of the tung data after LSF normalisation

plotRLE(
    umi.qc[endog_genes, ], 
    exprs_mats = list(Raw = "logcounts_raw", scran = "logcounts"),
    exprs_logged = c(TRUE, TRUE),
    colour_by = "batch"
)
Cell-wise RLE of the tung data

Figure 7.46: Cell-wise RLE of the tung data

scran sometimes calculates negative or zero size factors. These will completely distort the normalized expression matrix. We can check the size factors scran has computed like so:

summary(sizeFactors(umi.qc))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4646  0.7768  0.9562  1.0000  1.1444  3.4348

For this dataset all the size factors are reasonable so we are done. If you find scran has calculated negative size factors try increasing the cluster and pool sizes until they are all positive.

7.8.7 Downsampling

logcounts(umi.qc) <- log2(Down_Sample_Matrix(counts(umi.qc)) + 1)
plotPCA(
    umi.qc[endog_genes, ],
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
PCA plot of the tung data after downsampling

Figure 7.47: PCA plot of the tung data after downsampling

plotRLE(
    umi.qc[endog_genes, ], 
    exprs_mats = list(Raw = "logcounts_raw", DownSample = "logcounts"),
    exprs_logged = c(TRUE, TRUE),
    colour_by = "batch"
)
Cell-wise RLE of the tung data

Figure 7.48: Cell-wise RLE of the tung data

7.8.8 Normalisation for gene/transcript length

Some methods combine library size and fragment/gene length normalization such as:

  • RPKM - Reads Per Kilobase Million (for single-end sequencing)
  • FPKM - Fragments Per Kilobase Million (same as RPKM but for paired-end sequencing, makes sure that paired ends mapped to the same fragment are not counted twice)
  • TPM - Transcripts Per Kilobase Million (same as RPKM, but the order of normalizations is reversed - length first and sequencing depth second)

These methods are not applicable to our dataset since the end of the transcript which contains the UMI was preferentially sequenced. Furthermore in general these should only be calculated using appropriate quantification software from aligned BAM files not from read counts since often only a portion of the entire gene/transcript is sequenced, not the entire length. If in doubt check for a relationship between gene/transcript length and expression level.

However, here we show how these normalisations can be calculated using scater. First, we need to find the effective transcript length in Kilobases. However, our dataset containes only gene IDs, therefore we will be using the gene lengths instead of transcripts. scater uses the biomaRt package, which allows one to annotate genes by other attributes:

umi.qc <- getBMFeatureAnnos(
    umi.qc,
    filters = "ensembl_gene_id", 
    attributes = c(
        "ensembl_gene_id",
        "hgnc_symbol",
        "chromosome_name",
        "start_position",
        "end_position"
    ), 
    feature_symbol = "hgnc_symbol",
    feature_id = "ensembl_gene_id",
    biomart = "ENSEMBL_MART_ENSEMBL", 
    dataset = "hsapiens_gene_ensembl",
    host = "www.ensembl.org"
)

# If you have mouse data, change the arguments based on this example:
# getBMFeatureAnnos(
#     object,
#     filters = "ensembl_transcript_id",
#     attributes = c(
#         "ensembl_transcript_id",
#         "ensembl_gene_id", 
#         "mgi_symbol",
#         "chromosome_name",
#         "transcript_biotype",
#         "transcript_start",
#         "transcript_end",
#         "transcript_count"
#     ),
#     feature_symbol = "mgi_symbol",
#     feature_id = "ensembl_gene_id",
#     biomart = "ENSEMBL_MART_ENSEMBL",
#     dataset = "mmusculus_gene_ensembl",
#     host = "www.ensembl.org"
# )

Some of the genes were not annotated, therefore we filter them out:

umi.qc.ann <- umi.qc[!is.na(rowData(umi.qc)$ensembl_gene_id), ]

Now we compute the total gene length in Kilobases by using the end_position and start_position fields:

eff_length <- 
    abs(rowData(umi.qc.ann)$end_position - rowData(umi.qc.ann)$start_position) / 1000
plot(eff_length, rowMeans(counts(umi.qc.ann)))

There is no relationship between gene length and mean expression so __FPKM__s & __TPM__s are inappropriate for this dataset. But we will demonstrate them anyway.

Note Here calculate the total gene length instead of the total exon length. Many genes will contain lots of introns so their eff_length will be very different from what we have calculated. Please consider our calculation as approximation. If you want to use the total exon lengths, please refer to this page.

Now we are ready to perform the normalisations:

tpm(umi.qc.ann) <- log2(calculateTPM(umi.qc.ann, eff_length) + 1)

Plot the results as a PCA plot:

plotPCA(
    umi.qc.ann,
    exprs_values = "tpm",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
tpm(umi.qc.ann) <- log2(calculateFPKM(umi.qc.ann, eff_length) + 1)
plotPCA(
    umi.qc.ann,
    exprs_values = "tpm",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)

Note The PCA looks for differences between cells. Gene length is the same across cells for each gene thus FPKM is almost identical to the CPM plot (it is just rotated) since it performs CPM first then normalizes gene length. Whereas, TPM is different because it weights genes by their length before performing CPM.

7.8.9 Exercise

Perform the same analysis with read counts of the tung data. Use tung/reads.rds file to load the reads SCE object. Once you have finished please compare your results to ours (next chapter).

7.8.10 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] scran_1.6.8                BiocParallel_1.12.0       
##  [3] scater_1.6.3               SingleCellExperiment_1.0.0
##  [5] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
##  [7] matrixStats_0.53.1         GenomicRanges_1.30.3      
##  [9] GenomeInfoDb_1.14.0        IRanges_2.12.0            
## [11] S4Vectors_0.16.0           ggplot2_2.2.1             
## [13] Biobase_2.38.0             BiocGenerics_0.24.0       
## [15] knitr_1.20                 scRNA.seq.funcs_0.1.0     
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6           bit64_0.9-7            progress_1.1.2        
##  [4] httr_1.3.1             rprojroot_1.3-2        dynamicTreeCut_1.63-1 
##  [7] tools_3.4.3            backports_1.1.2        DT_0.4                
## [10] R6_2.2.2               hypergeo_1.2-13        vipor_0.4.5           
## [13] DBI_0.7                lazyeval_0.2.1         colorspace_1.3-2      
## [16] gridExtra_2.3          prettyunits_1.0.2      moments_0.14          
## [19] bit_1.1-12             compiler_3.4.3         orthopolynom_1.0-5    
## [22] labeling_0.3           bookdown_0.7           scales_0.5.0          
## [25] stringr_1.3.0          digest_0.6.15          rmarkdown_1.8         
## [28] XVector_0.18.0         pkgconfig_2.0.1        htmltools_0.3.6       
## [31] highr_0.6              limma_3.34.9           htmlwidgets_1.0       
## [34] rlang_0.2.0            RSQLite_2.0            FNN_1.1               
## [37] shiny_1.0.5            bindr_0.1              zoo_1.8-1             
## [40] dplyr_0.7.4            RCurl_1.95-4.10        magrittr_1.5          
## [43] GenomeInfoDbData_1.0.0 Matrix_1.2-7.1         Rcpp_0.12.15          
## [46] ggbeeswarm_0.6.0       munsell_0.4.3          viridis_0.5.0         
## [49] stringi_1.1.6          yaml_2.1.17            edgeR_3.20.9          
## [52] MASS_7.3-45            zlibbioc_1.24.0        rhdf5_2.22.0          
## [55] Rtsne_0.13             plyr_1.8.4             grid_3.4.3            
## [58] blob_1.1.0             shinydashboard_0.6.1   contfrac_1.1-11       
## [61] lattice_0.20-34        cowplot_0.9.2          locfit_1.5-9.1        
## [64] pillar_1.2.1           igraph_1.1.2           rjson_0.2.15          
## [67] reshape2_1.4.3         biomaRt_2.34.2         XML_3.98-1.10         
## [70] glue_1.2.0             evaluate_0.10.1        data.table_1.10.4-3   
## [73] deSolve_1.20           httpuv_1.3.6.1         gtable_0.2.0          
## [76] assertthat_0.2.0       xfun_0.1               mime_0.5              
## [79] xtable_1.8-2           viridisLite_0.3.0      tibble_1.4.2          
## [82] elliptic_1.3-7         AnnotationDbi_1.40.0   beeswarm_0.2.3        
## [85] memoise_1.1.0          tximport_1.6.0         bindrcpp_0.2          
## [88] statmod_1.4.30

7.9 Normalization practice (Reads)

PCA plot of the tung data

Figure 7.49: PCA plot of the tung data

PCA plot of the tung data after CPM normalisation

Figure 7.50: PCA plot of the tung data after CPM normalisation

Cell-wise RLE of the tung data

Figure 7.51: Cell-wise RLE of the tung data

## Warning in normalizeSCE(object, exprs_values = exprs_values, return_log
## = return_log, : spike-in transcripts in 'ERCC' should have their own size
## factors
PCA plot of the tung data after RLE normalisation

Figure 7.52: PCA plot of the tung data after RLE normalisation

Cell-wise RLE of the tung data

Figure 7.53: Cell-wise RLE of the tung data

## Warning in normalizeSCE(object, exprs_values = exprs_values, return_log
## = return_log, : spike-in transcripts in 'ERCC' should have their own size
## factors
PCA plot of the tung data after UQ normalisation

Figure 7.54: PCA plot of the tung data after UQ normalisation

Cell-wise RLE of the tung data

Figure 7.55: Cell-wise RLE of the tung data

## Warning in normalizeSCE(object, exprs_values = exprs_values, return_log
## = return_log, : spike-in transcripts in 'ERCC' should have their own size
## factors
PCA plot of the tung data after TMM normalisation

Figure 7.56: PCA plot of the tung data after TMM normalisation

Cell-wise RLE of the tung data

Figure 7.57: Cell-wise RLE of the tung data

## Warning in .local(object, ...): spike-in transcripts in 'ERCC' should have
## their own size factors
PCA plot of the tung data after LSF normalisation

Figure 7.58: PCA plot of the tung data after LSF normalisation

Cell-wise RLE of the tung data

Figure 7.59: Cell-wise RLE of the tung data

PCA plot of the tung data after downsampling

Figure 7.60: PCA plot of the tung data after downsampling

Cell-wise RLE of the tung data

Figure 7.61: Cell-wise RLE of the tung data

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] knitr_1.20                 scran_1.6.8               
##  [3] BiocParallel_1.12.0        scater_1.6.3              
##  [5] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
##  [7] DelayedArray_0.4.1         matrixStats_0.53.1        
##  [9] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
## [11] IRanges_2.12.0             S4Vectors_0.16.0          
## [13] ggplot2_2.2.1              Biobase_2.38.0            
## [15] BiocGenerics_0.24.0        scRNA.seq.funcs_0.1.0     
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6           bit64_0.9-7            progress_1.1.2        
##  [4] httr_1.3.1             rprojroot_1.3-2        dynamicTreeCut_1.63-1 
##  [7] tools_3.4.3            backports_1.1.2        DT_0.4                
## [10] R6_2.2.2               hypergeo_1.2-13        vipor_0.4.5           
## [13] DBI_0.7                lazyeval_0.2.1         colorspace_1.3-2      
## [16] gridExtra_2.3          prettyunits_1.0.2      moments_0.14          
## [19] bit_1.1-12             compiler_3.4.3         orthopolynom_1.0-5    
## [22] labeling_0.3           bookdown_0.7           scales_0.5.0          
## [25] stringr_1.3.0          digest_0.6.15          rmarkdown_1.8         
## [28] XVector_0.18.0         pkgconfig_2.0.1        htmltools_0.3.6       
## [31] highr_0.6              limma_3.34.9           htmlwidgets_1.0       
## [34] rlang_0.2.0            RSQLite_2.0            FNN_1.1               
## [37] shiny_1.0.5            bindr_0.1              zoo_1.8-1             
## [40] dplyr_0.7.4            RCurl_1.95-4.10        magrittr_1.5          
## [43] GenomeInfoDbData_1.0.0 Matrix_1.2-7.1         Rcpp_0.12.15          
## [46] ggbeeswarm_0.6.0       munsell_0.4.3          viridis_0.5.0         
## [49] stringi_1.1.6          yaml_2.1.17            edgeR_3.20.9          
## [52] MASS_7.3-45            zlibbioc_1.24.0        rhdf5_2.22.0          
## [55] Rtsne_0.13             plyr_1.8.4             grid_3.4.3            
## [58] blob_1.1.0             shinydashboard_0.6.1   contfrac_1.1-11       
## [61] lattice_0.20-34        cowplot_0.9.2          locfit_1.5-9.1        
## [64] pillar_1.2.1           igraph_1.1.2           rjson_0.2.15          
## [67] reshape2_1.4.3         biomaRt_2.34.2         XML_3.98-1.10         
## [70] glue_1.2.0             evaluate_0.10.1        data.table_1.10.4-3   
## [73] deSolve_1.20           httpuv_1.3.6.1         gtable_0.2.0          
## [76] assertthat_0.2.0       xfun_0.1               mime_0.5              
## [79] xtable_1.8-2           viridisLite_0.3.0      tibble_1.4.2          
## [82] elliptic_1.3-7         AnnotationDbi_1.40.0   beeswarm_0.2.3        
## [85] memoise_1.1.0          tximport_1.6.0         bindrcpp_0.2          
## [88] statmod_1.4.30

7.10 Dealing with confounders

7.10.1 Introduction

In the previous chapter we normalized for library size, effectively removing it as a confounder. Now we will consider removing other less well defined confounders from our data. Technical confounders (aka batch effects) can arise from difference in reagents, isolation methods, the lab/experimenter who performed the experiment, even which day/time the experiment was performed. Accounting for technical confounders, and batch effects particularly, is a large topic that also involves principles of experimental design. Here we address approaches that can be taken to account for confounders when the experimental design is appropriate.

Fundamentally, accounting for technical confounders involves identifying and, ideally, removing sources of variation in the expression data that are not related to (i.e. are confounding) the biological signal of interest. Various approaches exist, some of which use spike-in or housekeeping genes, and some of which use endogenous genes.

7.10.1.1 Advantages and disadvantages of using spike-ins to remove confounders

The use of spike-ins as control genes is appealing, since the same amount of ERCC (or other) spike-in was added to each cell in our experiment. In principle, all the variablity we observe for these genes is due to technical noise; whereas endogenous genes are affected by both technical noise and biological variability. Technical noise can be removed by fitting a model to the spike-ins and “substracting” this from the endogenous genes. There are several methods available based on this premise (eg. BASiCS, scLVM, RUVg); each using different noise models and different fitting procedures. Alternatively, one can identify genes which exhibit significant variation beyond technical noise (eg. Distance to median, Highly variable genes). However, there are issues with the use of spike-ins for normalisation (particularly ERCCs, derived from bacterial sequences), including that their variability can, for various reasons, actually be higher than that of endogenous genes.

Given the issues with using spike-ins, better results can often be obtained by using endogenous genes instead. Where we have a large number of endogenous genes that, on average, do not vary systematically between cells and where we expect technical effects to affect a large number of genes (a very common and reasonable assumption), then such methods (for example, the RUVs method) can perform well.

We explore both general approaches below.

library(scRNA.seq.funcs)
library(RUVSeq)
library(scater)
library(SingleCellExperiment)
library(scran)
library(kBET)
library(sva) # Combat
library(edgeR)
set.seed(1234567)
options(stringsAsFactors = FALSE)
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[rowData(umi)$use, colData(umi)$use]
endog_genes <- !rowData(umi.qc)$is_feature_control
erccs <- rowData(umi.qc)$is_feature_control

qclust <- quickCluster(umi.qc, min.size = 30)
umi.qc <- computeSumFactors(umi.qc, sizes = 15, clusters = qclust)
umi.qc <- normalize(umi.qc)

7.10.2 Remove Unwanted Variation

Factors contributing to technical noise frequently appear as “batch effects” where cells processed on different days or by different technicians systematically vary from one another. Removing technical noise and correcting for batch effects can frequently be performed using the same tool or slight variants on it. We will be considering the Remove Unwanted Variation (RUVSeq). Briefly, RUVSeq works as follows. For \(n\) samples and \(J\) genes, consider the following generalized linear model (GLM), where the RNA-Seq read counts are regressed on both the known covariates of interest and unknown factors of unwanted variation: \[\log E[Y|W,X,O] = W\alpha + X\beta + O\] Here, \(Y\) is the \(n \times J\) matrix of observed gene-level read counts, \(W\) is an \(n \times k\) matrix corresponding to the factors of “unwanted variation” and \(O\) is an \(n \times J\) matrix of offsets that can either be set to zero or estimated with some other normalization procedure (such as upper-quartile normalization). The simultaneous estimation of \(W\), \(\alpha\), \(\beta\), and \(k\) is infeasible. For a given \(k\), instead the following three approaches to estimate the factors of unwanted variation \(W\) are used:

  • RUVg uses negative control genes (e.g. ERCCs), assumed to have constant expression across samples;
  • RUVs uses centered (technical) replicate/negative control samples for which the covariates of interest are constant;
  • RUVr uses residuals, e.g., from a first-pass GLM regression of the counts on the covariates of interest.

We will concentrate on the first two approaches.

7.10.2.1 RUVg

ruvg <- RUVg(counts(umi.qc), erccs, k = 1)
assay(umi.qc, "ruvg1") <- log2(
    t(t(ruvg$normalizedCounts) / colSums(ruvg$normalizedCounts) * 1e6) + 1
)
ruvg <- RUVg(counts(umi.qc), erccs, k = 10)
assay(umi.qc, "ruvg10") <- log2(
    t(t(ruvg$normalizedCounts) / colSums(ruvg$normalizedCounts) * 1e6) + 1
)

7.10.2.2 RUVs

scIdx <- matrix(-1, ncol = max(table(umi.qc$individual)), nrow = 3)
tmp <- which(umi.qc$individual == "NA19098")
scIdx[1, 1:length(tmp)] <- tmp
tmp <- which(umi.qc$individual == "NA19101")
scIdx[2, 1:length(tmp)] <- tmp
tmp <- which(umi.qc$individual == "NA19239")
scIdx[3, 1:length(tmp)] <- tmp
cIdx <- rownames(umi.qc)
ruvs <- RUVs(counts(umi.qc), cIdx, k = 1, scIdx = scIdx, isLog = FALSE)
assay(umi.qc, "ruvs1") <- log2(
    t(t(ruvs$normalizedCounts) / colSums(ruvs$normalizedCounts) * 1e6) + 1
)
ruvs <- RUVs(counts(umi.qc), cIdx, k = 10, scIdx = scIdx, isLog = FALSE)
assay(umi.qc, "ruvs10") <- log2(
    t(t(ruvs$normalizedCounts) / colSums(ruvs$normalizedCounts) * 1e6) + 1
)

7.10.3 Combat

If you have an experiment with a balanced design, Combat can be used to eliminate batch effects while preserving biological effects by specifying the biological effects using the mod parameter. However the Tung data contains multiple experimental replicates rather than a balanced design so using mod1 to preserve biological variability will result in an error.

combat_data <- logcounts(umi.qc)
mod_data <- as.data.frame(t(combat_data))
# Basic batch removal
mod0 = model.matrix(~ 1, data = mod_data) 
# Preserve biological variability
mod1 = model.matrix(~ umi.qc$individual, data = mod_data) 
# adjust for total genes detected
mod2 = model.matrix(~ umi.qc$total_features, data = mod_data)
assay(umi.qc, "combat") <- ComBat(
    dat = t(mod_data), 
    batch = factor(umi.qc$batch), 
    mod = mod0,
    par.prior = TRUE,
    prior.plots = FALSE
)
## Standardizing Data across genes

Exercise 1

Perform ComBat correction accounting for total features as a co-variate. Store the corrected matrix in the combat_tf slot.

## Standardizing Data across genes

7.10.4 mnnCorrect

mnnCorrect (Haghverdi et al. 2017) assumes that each batch shares at least one biological condition with each other batch. Thus it works well for a variety of balanced experimental designs. However, the Tung data contains multiple replicates for each invidividual rather than balanced batches, thus we will normalized each individual separately. Note that this will remove batch effects between batches within the same individual but not the batch effects between batches in different individuals, due to the confounded experimental design.

Thus we will merge a replicate from each individual to form three batches.

do_mnn <- function(data.qc) {
    batch1 <- logcounts(data.qc[, data.qc$replicate == "r1"])
    batch2 <- logcounts(data.qc[, data.qc$replicate == "r2"])
    batch3 <- logcounts(data.qc[, data.qc$replicate == "r3"])
    
    if (ncol(batch2) > 0) {
        x = mnnCorrect(
          batch1, batch2, batch3,  
          k = 20,
          sigma = 0.1,
          cos.norm.in = TRUE,
          svd.dim = 2
        )
        res1 <- x$corrected[[1]]
        res2 <- x$corrected[[2]]
        res3 <- x$corrected[[3]]
        dimnames(res1) <- dimnames(batch1)
        dimnames(res2) <- dimnames(batch2)
        dimnames(res3) <- dimnames(batch3)
        return(cbind(res1, res2, res3))
    } else {
        x = mnnCorrect(
          batch1, batch3,  
          k = 20,
          sigma = 0.1,
          cos.norm.in = TRUE,
          svd.dim = 2
        )
        res1 <- x$corrected[[1]]
        res3 <- x$corrected[[2]]
        dimnames(res1) <- dimnames(batch1)
        dimnames(res3) <- dimnames(batch3)
        return(cbind(res1, res3))
    }
}

indi1 <- do_mnn(umi.qc[, umi.qc$individual == "NA19098"])
indi2 <- do_mnn(umi.qc[, umi.qc$individual == "NA19101"])
indi3 <- do_mnn(umi.qc[, umi.qc$individual == "NA19239"])

assay(umi.qc, "mnn") <- cbind(indi1, indi2, indi3)

# For a balanced design: 
#assay(umi.qc, "mnn") <- mnnCorrect(
#    list(B1 = logcounts(batch1), B2 = logcounts(batch2), B3 = logcounts(batch3)),  
#    k = 20,
#    sigma = 0.1,
#    cos.norm = TRUE,
#    svd.dim = 2
#)

7.10.5 GLM

A general linear model is a simpler version of Combat. It can correct for batches while preserving biological effects if you have a balanced design. In a confounded/replicate design biological effects will not be fit/preserved. Similar to mnnCorrect we could remove batch effects from each individual separately in order to preserve biological (and technical) variance between individuals. For demonstation purposes we will naively correct all cofounded batch effects:

glm_fun <- function(g, batch, indi) {
  model <- glm(g ~ batch + indi)
  model$coef[1] <- 0 # replace intercept with 0 to preserve reference batch.
  return(model$coef)
}
effects <- apply(
    logcounts(umi.qc), 
    1, 
    glm_fun, 
    batch = umi.qc$batch, 
    indi = umi.qc$individual
)
corrected <- logcounts(umi.qc) - t(effects[as.numeric(factor(umi.qc$batch)), ])
assay(umi.qc, "glm") <- corrected

Exercise 2

Perform GLM correction for each individual separately. Store the final corrected matrix in the glm_indi slot.

7.10.6 How to evaluate and compare confounder removal strategies

A key question when considering the different methods for removing confounders is how to quantitatively determine which one is the most effective. The main reason why comparisons are challenging is because it is often difficult to know what corresponds to technical counfounders and what is interesting biological variability. Here, we consider three different metrics which are all reasonable based on our knowledge of the experimental design. Depending on the biological question that you wish to address, it is important to choose a metric that allows you to evaluate the confounders that are likely to be the biggest concern for the given situation.

7.10.6.1 Effectiveness 1

We evaluate the effectiveness of the normalization by inspecting the PCA plot where colour corresponds the technical replicates and shape corresponds to different biological samples (individuals). Separation of biological samples and interspersed batches indicates that technical variation has been removed. We always use log2-cpm normalized data to match the assumptions of PCA.

for(n in assayNames(umi.qc)) {
    print(
        plotPCA(
            umi.qc[endog_genes, ],
            colour_by = "batch",
            size_by = "total_features",
            shape_by = "individual",
            exprs_values = n
        ) +
        ggtitle(n)
    )
}

Exercise 3

Consider different ks for RUV normalizations. Which gives the best results?

7.10.6.2 Effectiveness 2

We can also examine the effectiveness of correction using the relative log expression (RLE) across cells to confirm technical noise has been removed from the dataset. Note RLE only evaluates whether the number of genes higher and lower than average are equal for each cell - i.e. systemic technical effects. Random technical noise between batches may not be detected by RLE.

res <- list()
for(n in assayNames(umi.qc)) {
    res[[n]] <- suppressWarnings(calc_cell_RLE(assay(umi.qc, n), erccs))
}
par(mar=c(6,4,1,1))
boxplot(res, las=2)

7.10.6.3 Effectiveness 3

We can repeat the analysis from Chapter 12 to check whether batch effects have been removed.

for(n in assayNames(umi.qc)) {
    print(
        plotQC(
            umi.qc[endog_genes, ],
            type = "expl",
            exprs_values = n,
            variables = c(
                "total_features",
                "total_counts",
                "batch",
                "individual",
                "pct_counts_ERCC",
                "pct_counts_MT"
            )
        ) +
        ggtitle(n)
    )
}
Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Explanatory variables (mnn)

Figure 7.62: Explanatory variables (mnn)

Exercise 4

Perform the above analysis for each normalization/batch correction method. Which method(s) are most/least effective? Why is the variance accounted for by batch never lower than the variance accounted for by individual?

7.10.6.4 Effectiveness 4

Another method to check the efficacy of batch-effect correction is to consider the intermingling of points from different batches in local subsamples of the data. If there are no batch-effects then proportion of cells from each batch in any local region should be equal to the global proportion of cells in each batch.

kBET (Buttner et al. 2017) takes kNN networks around random cells and tests the number of cells from each batch against a binomial distribution. The rejection rate of these tests indicates the severity of batch-effects still present in the data (high rejection rate = strong batch effects). kBET assumes each batch contains the same complement of biological groups, thus it can only be applied to the entire dataset if a perfectly balanced design has been used. However, kBET can also be applied to replicate-data if it is applied to each biological group separately. In the case of the Tung data, we will apply kBET to each individual independently to check for residual batch effects. However, this method will not identify residual batch-effects which are confounded with biological conditions. In addition, kBET does not determine if biological signal has been preserved.

compare_kBET_results <- function(sce){
    indiv <- unique(sce$individual)
    norms <- assayNames(sce) # Get all normalizations
    results <- list()
    for (i in indiv){ 
        for (j in norms){
            tmp <- kBET(
                df = t(assay(sce[,sce$individual== i], j)), 
                batch = sce$batch[sce$individual==i], 
                heuristic = TRUE, 
                verbose = FALSE, 
                addTest = FALSE, 
                plot = FALSE)
            results[[i]][[j]] <- tmp$summary$kBET.observed[1]
        }
    }
    return(as.data.frame(results))
}

eff_debatching <- compare_kBET_results(umi.qc)
require("reshape2")
require("RColorBrewer")
# Plot results
dod <- melt(as.matrix(eff_debatching),  value.name = "kBET")
colnames(dod)[1:2] <- c("Normalisation", "Individual")

colorset <- c('gray', brewer.pal(n = 9, "RdYlBu"))

ggplot(dod, aes(Normalisation, Individual, fill=kBET)) +  
    geom_tile() +
    scale_fill_gradient2(
        na.value = "gray",
        low = colorset[2],
        mid=colorset[6],
        high = colorset[10],
        midpoint = 0.5, limit = c(0,1)) +
    scale_x_discrete(expand = c(0, 0)) +
    scale_y_discrete(expand = c(0, 0)) + 
    theme(
        axis.text.x = element_text(
            angle = 45, 
            vjust = 1, 
            size = 12, 
            hjust = 1
        )
    ) + 
    ggtitle("Effect of batch regression methods per individual")

Exercise 5

Why do the raw counts appear to have little batch effects?

7.10.7 Big Exercise

Perform the same analysis with read counts of the tung data. Use tung/reads.rds file to load the reads SCE object. Once you have finished please compare your results to ours (next chapter). Additionally, experiment with other combinations of normalizations and compare the results.

7.10.8 sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2         reshape2_1.4.3            
##  [3] sva_3.26.0                 genefilter_1.60.0         
##  [5] mgcv_1.8-23                nlme_3.1-129              
##  [7] kBET_0.99.5                scran_1.6.8               
##  [9] scater_1.6.3               SingleCellExperiment_1.0.0
## [11] ggplot2_2.2.1              RUVSeq_1.12.0             
## [13] edgeR_3.20.9               limma_3.34.9              
## [15] EDASeq_2.12.0              ShortRead_1.36.1          
## [17] GenomicAlignments_1.14.1   SummarizedExperiment_1.8.1
## [19] DelayedArray_0.4.1         matrixStats_0.53.1        
## [21] Rsamtools_1.30.0           GenomicRanges_1.30.3      
## [23] GenomeInfoDb_1.14.0        Biostrings_2.46.0         
## [25] XVector_0.18.0             IRanges_2.12.0            
## [27] S4Vectors_0.16.0           BiocParallel_1.12.0       
## [29] Biobase_2.38.0             BiocGenerics_0.24.0       
## [31] scRNA.seq.funcs_0.1.0      knitr_1.20                
## 
## loaded via a namespace (and not attached):
##  [1] Rtsne_0.13             ggbeeswarm_0.6.0       colorspace_1.3-2      
##  [4] rjson_0.2.15           hwriter_1.3.2          dynamicTreeCut_1.63-1 
##  [7] rprojroot_1.3-2        DT_0.4                 bit64_0.9-7           
## [10] AnnotationDbi_1.40.0   splines_3.4.3          R.methodsS3_1.7.1     
## [13] tximport_1.6.0         DESeq_1.30.0           geneplotter_1.56.0    
## [16] annotate_1.56.1        cluster_2.0.6          R.oo_1.21.0           
## [19] shinydashboard_0.6.1   shiny_1.0.5            compiler_3.4.3        
## [22] httr_1.3.1             backports_1.1.2        assertthat_0.2.0      
## [25] Matrix_1.2-7.1         lazyeval_0.2.1         htmltools_0.3.6       
## [28] prettyunits_1.0.2      tools_3.4.3            igraph_1.1.2          
## [31] bindrcpp_0.2           gtable_0.2.0           glue_1.2.0            
## [34] GenomeInfoDbData_1.0.0 dplyr_0.7.4            Rcpp_0.12.15          
## [37] rtracklayer_1.38.3     xfun_0.1               stringr_1.3.0         
## [40] mime_0.5               hypergeo_1.2-13        statmod_1.4.30        
## [43] XML_3.98-1.10          zoo_1.8-1              zlibbioc_1.24.0       
## [46] MASS_7.3-45            scales_0.5.0           aroma.light_3.8.0     
## [49] rhdf5_2.22.0           yaml_2.1.17            memoise_1.1.0         
## [52] gridExtra_2.3          biomaRt_2.34.2         latticeExtra_0.6-28   
## [55] stringi_1.1.6          RSQLite_2.0            highr_0.6             
## [58] RMySQL_0.10.14         orthopolynom_1.0-5     GenomicFeatures_1.30.3
## [61] contfrac_1.1-11        rlang_0.2.0            pkgconfig_2.0.1       
## [64] moments_0.14           bitops_1.0-6           evaluate_0.10.1       
## [67] lattice_0.20-34        bindr_0.1              labeling_0.3          
## [70] htmlwidgets_1.0        cowplot_0.9.2          bit_1.1-12            
## [73] deSolve_1.20           plyr_1.8.4             magrittr_1.5          
## [76] bookdown_0.7           R6_2.2.2               DBI_0.7               
## [79] pillar_1.2.1           survival_2.40-1        RCurl_1.95-4.10       
## [82] tibble_1.4.2           rmarkdown_1.8          viridis_0.5.0         
## [85] progress_1.1.2         locfit_1.5-9.1         grid_3.4.3            
## [88] data.table_1.10.4-3    FNN_1.1                blob_1.1.0            
## [91] digest_0.6.15          xtable_1.8-2           httpuv_1.3.6.1        
## [94] elliptic_1.3-7         R.utils_2.6.0          munsell_0.4.3         
## [97] beeswarm_0.2.3         viridisLite_0.3.0      vipor_0.4.5

7.11 Dealing with confounders (Reads)

library(scRNA.seq.funcs)
library(RUVSeq)
library(scater)
library(SingleCellExperiment)
library(scran)
library(kBET)
library(sva) # Combat
library(edgeR)
set.seed(1234567)
options(stringsAsFactors = FALSE)
reads <- readRDS("tung/reads.rds")
reads.qc <- reads[rowData(reads)$use, colData(reads)$use]
endog_genes <- !rowData(reads.qc)$is_feature_control
erccs <- rowData(reads.qc)$is_feature_control

qclust <- quickCluster(reads.qc, min.size = 30)
reads.qc <- computeSumFactors(reads.qc, sizes = 15, clusters = qclust)
reads.qc <- normalize(reads.qc)
ruvg <- RUVg(counts(reads.qc), erccs, k = 1)
assay(reads.qc, "ruvg1") <- log2(
    t(t(ruvg$normalizedCounts) / colSums(ruvg$normalizedCounts) * 1e6) + 1
)
ruvg <- RUVg(counts(reads.qc), erccs, k = 10)
assay(reads.qc, "ruvg10") <- log2(
    t(t(ruvg$normalizedCounts) / colSums(ruvg$normalizedCounts) * 1e6) + 1
)
scIdx <- matrix(-1, ncol = max(table(reads.qc$individual)), nrow = 3)
tmp <- which(reads.qc$individual == "NA19098")
scIdx[1, 1:length(tmp)] <- tmp
tmp <- which(reads.qc$individual == "NA19101")
scIdx[2, 1:length(tmp)] <- tmp
tmp <- which(reads.qc$individual == "NA19239")
scIdx[3, 1:length(tmp)] <- tmp
cIdx <- rownames(reads.qc)
ruvs <- RUVs(counts(reads.qc), cIdx, k = 1, scIdx = scIdx, isLog = FALSE)
assay(reads.qc, "ruvs1") <- log2(
    t(t(ruvs$normalizedCounts) / colSums(ruvs$normalizedCounts) * 1e6) + 1
)
ruvs <- RUVs(counts(reads.qc), cIdx, k = 10, scIdx = scIdx, isLog = FALSE)
assay(reads.qc, "ruvs10") <- log2(
    t(t(ruvs$normalizedCounts) / colSums(ruvs$normalizedCounts) * 1e6) + 1
)
combat_data <- logcounts(reads.qc)
mod_data <- as.data.frame(t(combat_data))
# Basic batch removal
mod0 = model.matrix(~ 1, data = mod_data) 
# Preserve biological variability
mod1 = model.matrix(~ reads.qc$individual, data = mod_data) 
# adjust for total genes detected
mod2 = model.matrix(~ reads.qc$total_features, data = mod_data)
assay(reads.qc, "combat") <- ComBat(
    dat = t(mod_data), 
    batch = factor(reads.qc$batch), 
    mod = mod0,
    par.prior = TRUE,
    prior.plots = FALSE
)
## Standardizing Data across genes

Exercise 1

## Standardizing Data across genes
do_mnn <- function(data.qc) {
    batch1 <- logcounts(data.qc[, data.qc$replicate == "r1"])
    batch2 <- logcounts(data.qc[, data.qc$replicate == "r2"])
    batch3 <- logcounts(data.qc[, data.qc$replicate == "r3"])
    
    if (ncol(batch2) > 0) {
        x = mnnCorrect(
          batch1, batch2, batch3,  
          k = 20,
          sigma = 0.1,
          cos.norm.in = TRUE,
          svd.dim = 2
        )
        res1 <- x$corrected[[1]]
        res2 <- x$corrected[[2]]
        res3 <- x$corrected[[3]]
        dimnames(res1) <- dimnames(batch1)
        dimnames(res2) <- dimnames(batch2)
        dimnames(res3) <- dimnames(batch3)
        return(cbind(res1, res2, res3))
    } else {
        x = mnnCorrect(
          batch1, batch3,  
          k = 20,
          sigma = 0.1,
          cos.norm.in = TRUE,
          svd.dim = 2
        )
        res1 <- x$corrected[[1]]
        res3 <- x$corrected[[2]]
        dimnames(res1) <- dimnames(batch1)
        dimnames(res3) <- dimnames(batch3)
        return(cbind(res1, res3))
    }
}

indi1 <- do_mnn(reads.qc[, reads.qc$individual == "NA19098"])
indi2 <- do_mnn(reads.qc[, reads.qc$individual == "NA19101"])
indi3 <- do_mnn(reads.qc[, reads.qc$individual == "NA19239"])

assay(reads.qc, "mnn") <- cbind(indi1, indi2, indi3)

# For a balanced design: 
#assay(reads.qc, "mnn") <- mnnCorrect(
#    list(B1 = logcounts(batch1), B2 = logcounts(batch2), B3 = logcounts(batch3)),  
#    k = 20,
#    sigma = 0.1,
#    cos.norm = TRUE,
#    svd.dim = 2
#)
glm_fun <- function(g, batch, indi) {
  model <- glm(g ~ batch + indi)
  model$coef[1] <- 0 # replace intercept with 0 to preserve reference batch.
  return(model$coef)
}
effects <- apply(
    logcounts(reads.qc), 
    1, 
    glm_fun, 
    batch = reads.qc$batch, 
    indi = reads.qc$individual
)
corrected <- logcounts(reads.qc) - t(effects[as.numeric(factor(reads.qc$batch)), ])
assay(reads.qc, "glm") <- corrected

Exercise 2

for(n in assayNames(reads.qc)) {
    print(
        plotPCA(
            reads.qc[endog_genes, ],
            colour_by = "batch",
            size_by = "total_features",
            shape_by = "individual",
            exprs_values = n
        ) +
        ggtitle(n)
    )
}

res <- list()
for(n in assayNames(reads.qc)) {
    res[[n]] <- suppressWarnings(calc_cell_RLE(assay(reads.qc, n), erccs))
}
par(mar=c(6,4,1,1))
boxplot(res, las=2)

for(n in assayNames(reads.qc)) {
    print(
        plotQC(
            reads.qc[endog_genes, ],
            type = "expl",
            exprs_values = n,
            variables = c(
                "total_features",
                "total_counts",
                "batch",
                "individual",
                "pct_counts_ERCC",
                "pct_counts_MT"
            )
        ) +
        ggtitle(n)
    )
}

compare_kBET_results <- function(sce){
    indiv <- unique(sce$individual)
    norms <- assayNames(sce) # Get all normalizations
    results <- list()
    for (i in indiv){ 
        for (j in norms){
            tmp <- kBET(
                df = t(assay(sce[,sce$individual== i], j)), 
                batch = sce$batch[sce$individual==i], 
                heuristic = TRUE, 
                verbose = FALSE, 
                addTest = FALSE, 
                plot = FALSE)
            results[[i]][[j]] <- tmp$summary$kBET.observed[1]
        }
    }
    return(as.data.frame(results))
}

eff_debatching <- compare_kBET_results(reads.qc)
require("reshape2")
require("RColorBrewer")
# Plot results
dod <- melt(as.matrix(eff_debatching),  value.name = "kBET")
colnames(dod)[1:2] <- c("Normalisation", "Individual")

colorset <- c('gray', brewer.pal(n = 9, "RdYlBu"))

ggplot(dod, aes(Normalisation, Individual, fill=kBET)) +  
    geom_tile() +
    scale_fill_gradient2(
        na.value = "gray",
        low = colorset[2],
        mid=colorset[6],
        high = colorset[10],
        midpoint = 0.5, limit = c(0,1)) +
    scale_x_discrete(expand = c(0, 0)) +
    scale_y_discrete(expand = c(0, 0)) + 
    theme(
        axis.text.x = element_text(
            angle = 45, 
            vjust = 1, 
            size = 12, 
            hjust = 1
        )
    ) + 
    ggtitle("Effect of batch regression methods per individual")

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2         reshape2_1.4.3            
##  [3] sva_3.26.0                 genefilter_1.60.0         
##  [5] mgcv_1.8-23                nlme_3.1-129              
##  [7] kBET_0.99.5                scran_1.6.8               
##  [9] scater_1.6.3               SingleCellExperiment_1.0.0
## [11] ggplot2_2.2.1              RUVSeq_1.12.0             
## [13] edgeR_3.20.9               limma_3.34.9              
## [15] EDASeq_2.12.0              ShortRead_1.36.1          
## [17] GenomicAlignments_1.14.1   SummarizedExperiment_1.8.1
## [19] DelayedArray_0.4.1         matrixStats_0.53.1        
## [21] Rsamtools_1.30.0           GenomicRanges_1.30.3      
## [23] GenomeInfoDb_1.14.0        Biostrings_2.46.0         
## [25] XVector_0.18.0             IRanges_2.12.0            
## [27] S4Vectors_0.16.0           BiocParallel_1.12.0       
## [29] Biobase_2.38.0             BiocGenerics_0.24.0       
## [31] scRNA.seq.funcs_0.1.0      knitr_1.20                
## 
## loaded via a namespace (and not attached):
##  [1] Rtsne_0.13             ggbeeswarm_0.6.0       colorspace_1.3-2      
##  [4] rjson_0.2.15           hwriter_1.3.2          dynamicTreeCut_1.63-1 
##  [7] rprojroot_1.3-2        DT_0.4                 bit64_0.9-7           
## [10] AnnotationDbi_1.40.0   splines_3.4.3          R.methodsS3_1.7.1     
## [13] tximport_1.6.0         DESeq_1.30.0           geneplotter_1.56.0    
## [16] annotate_1.56.1        cluster_2.0.6          R.oo_1.21.0           
## [19] shinydashboard_0.6.1   shiny_1.0.5            compiler_3.4.3        
## [22] httr_1.3.1             backports_1.1.2        assertthat_0.2.0      
## [25] Matrix_1.2-7.1         lazyeval_0.2.1         htmltools_0.3.6       
## [28] prettyunits_1.0.2      tools_3.4.3            igraph_1.1.2          
## [31] bindrcpp_0.2           gtable_0.2.0           glue_1.2.0            
## [34] GenomeInfoDbData_1.0.0 dplyr_0.7.4            Rcpp_0.12.15          
## [37] rtracklayer_1.38.3     xfun_0.1               stringr_1.3.0         
## [40] mime_0.5               hypergeo_1.2-13        statmod_1.4.30        
## [43] XML_3.98-1.10          zoo_1.8-1              zlibbioc_1.24.0       
## [46] MASS_7.3-45            scales_0.5.0           aroma.light_3.8.0     
## [49] rhdf5_2.22.0           yaml_2.1.17            memoise_1.1.0         
## [52] gridExtra_2.3          biomaRt_2.34.2         latticeExtra_0.6-28   
## [55] stringi_1.1.6          RSQLite_2.0            RMySQL_0.10.14        
## [58] orthopolynom_1.0-5     GenomicFeatures_1.30.3 contfrac_1.1-11       
## [61] rlang_0.2.0            pkgconfig_2.0.1        moments_0.14          
## [64] bitops_1.0-6           evaluate_0.10.1        lattice_0.20-34       
## [67] bindr_0.1              labeling_0.3           htmlwidgets_1.0       
## [70] cowplot_0.9.2          bit_1.1-12             deSolve_1.20          
## [73] plyr_1.8.4             magrittr_1.5           bookdown_0.7          
## [76] R6_2.2.2               DBI_0.7                pillar_1.2.1          
## [79] survival_2.40-1        RCurl_1.95-4.10        tibble_1.4.2          
## [82] rmarkdown_1.8          viridis_0.5.0          progress_1.1.2        
## [85] locfit_1.5-9.1         grid_3.4.3             data.table_1.10.4-3   
## [88] FNN_1.1                blob_1.1.0             digest_0.6.15         
## [91] xtable_1.8-2           httpuv_1.3.6.1         elliptic_1.3-7        
## [94] R.utils_2.6.0          munsell_0.4.3          beeswarm_0.2.3        
## [97] viridisLite_0.3.0      vipor_0.4.5

References

Tung, Po-Yuan, John D. Blischak, Chiaowen Joyce Hsiao, David A. Knowles, Jonathan E. Burnett, Jonathan K. Pritchard, and Yoav Gilad. 2017. “Batch Effects and the Effective Design of Single-Cell Gene Expression Studies.” Sci. Rep. 7 (January). Springer Nature: 39921. doi:10.1038/srep39921.

Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biol 11 (10). Springer Nature: R106. doi:10.1186/gb-2010-11-10-r106.

Bullard, James H, Elizabeth Purdom, Kasper D Hansen, and Sandrine Dudoit. 2010. “Evaluation of Statistical Methods for Normalization and Differential Expression in mRNA-Seq Experiments.” BMC Bioinformatics 11 (1). Springer Nature: 94. doi:10.1186/1471-2105-11-94.

Robinson, Mark D, and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-Seq Data.” Genome Biol 11 (3). Springer Nature: R25. doi:10.1186/gb-2010-11-3-r25.

L. Lun, Aaron T., Karsten Bach, and John C. Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts.” Genome Biol 17 (1). Springer Nature. doi:10.1186/s13059-016-0947-7.

Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2017. “Correcting Batch Effects in Single-Cell RNA Sequencing Data by Matching Mutual Nearest Neighbours.” bioRxiv, July, 165118.

Buttner, Maren, Zhichao Miao, Alexander Wolf, Sarah A Teichmann, and Fabian J Theis. 2017. “Assessment of Batch-Correction Methods for scRNA-seq Data with a New Test Metric.” bioRxiv, October, 200345.