Dropbead
provides two functions to readily extract useful statistics from the data. The first one, extractReadStatistics
, takes the BAM file as an argument and computes the total number of (uniquely mapped) reads, the percentages of intronic/intergenic etc. The output is a data.frame
as follows:
## total_reads uniq_mapped intronic intergenic coding UTR
## Drop-seq 77.6 69 4.5 0.3 4.7 3.9
## total_barcodes
## Drop-seq 1.7
The read numbers are in million. The intronic, intergenic, coding and UTR reads do not get a gene annotation tag (GE:
) and are discarded from the Drop-seq pipeline. The last column counts the total number of barcodes seen in the data, most of them being artifacts from sequencing errors.
Dropbead
provides a function to computationally estimate the number of cells present in the sample, by calculating the inflection point of the cumulative fraction of reads against the cell barcodes. Assuming that reads.by.cell
is the data.frame
with the reads per cell (the output of BAMTagHistogram
from the Drop-seq toolkit),
plotCumulativeFractionOfReads(reads.by.cell,
cutoff = 10000,
draw.infl.point = TRUE)
The following function creates a table with the basic downstream statistics, such as the cell number and the median number of reads, genes and UMIs per cell.
extractDownstreamStatistics('~/dropseq-data/', min.umi=250, read.stats=NULL)
## cells reads genes umis sum.umi PCR
## 1 3995 3965 351 734 4.3 5.4
The total number of UMIs in the DGE is given in millions. The PCR column provides the reads to UMIs ratio, so that the smaller, the better. Passing the data.frame
output of the extractReadStatistics
function into the read.stats
argument provides a more precise calculation of the PCR over-amplification bias.
Dropbead offers two main classes to store samples, the SingleSpeciesSample
class and the MixedSpeciesSample
class, the latter containing the former.
Mixed species experiments are important to estimate the number of doublets in the samples. Dropbead assumes that a digital gene expression matrix (DGE) has already been generated and the genes for the two species are separated by a prefix. For mixed human/mouse samples the prefixes are hg_
and mm_
respectively. Assuming that the DGE has been loaded in dge.matrix
,
# The object containing the sample
mms <- new("MixedSpeciesSample",
species1="human",
species2="mouse",
dge=dge.matrix)
The number of cells and genes are stored and automatically updated while using dropbead
’s functions
length(mms@cells) # number of cells in the sample
## [1] 1261
length(mms@genes) # number of genes detected
## [1] 42194
Some of the barcodes might correspond to the same cells (essentially after using the DetectBeadSynthesisErrors
Drop-seq tool) and have to be collapsed. Such cells are identified by
listCellsToCollapse(mms)
## [[1]]
## [1] "ACCAGGAGCAAA" "ACCAGGAGCAAN"
##
## [[2]]
## [1] "GCGGTTATCAGC" "GCGGTTATCAGN"
##
## [[3]]
## [1] "TCAAGCATAATN" "TCAAGCATAATT"
##
## [[4]]
## [1] "TCACTGACAAAA" "TCACTGACAAAN"
##
## [[5]]
## [1] "CTTCCCCGCAAA" "CTTCCCCGCAAN"
In the example above there are 5 pairs of cells which need to be collapsed. This is straightforward with using
mms <- collapseCellsByBarcode(mms)
We can verify that the number of cells is now the correct one
length(mms@cells)
## [1] 1256
Calling classifyCellsAndDoublets
separates the species and returns a data.frame
with number of transcripts per cell per species. The threshold
controls the purity of the resulting cells, while min.trans
is the minimum number of UMIs required to keep a cell.
head(classifyCellsAndDoublets(mms,
threshold = 0.9,
min.trans = 1000))
## cell human mouse species
## 1 TTGAGACGGATA 33587 486 human
## 2 TTGCAGCCACCT 32838 458 human
## 3 GCTTCAGACAGA 31838 420 human
## 4 CTCTTCATCGAG 28476 296 human
## 5 TAACACACCGTG 511 32516 mouse
## 6 AGCTGCCGTTTG 546 31010 mouse
The output of classifyCellsAndDoublets
can be directly send for plotting
plotCellTypes(classifyCellsAndDoublets(mms, threshold = 0.9, min.trans = 5000))
The number of genes per cell is computed via
head(computeGenesPerCell(mms))
## cells counts species
## 1 TTGAGACGGATA 7600 human
## 2 TTGCAGCCACCT 7582 human
## 3 GCTTCAGACAGA 7554 human
## 4 CTCTTCATCGAG 7263 human
## 5 GCTTACTATCCG 6830 human
## 6 AACTGGTTCGTC 6685 human
and similarly for transcripts (UMIs) with the computeTranscriptsPerCell
function. Their output can be send directly for plotting and visualization
plotViolin(computeTranscriptsPerCell(mms),
attribute = "UMIs")
plotHistogram(computeTranscriptsPerCell(mms),
attribute = 'UMIs')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The above functions are polymorphic and can be also used for SingleSpeciesSample
objects. Splitting the samples is always performed by internal functions automatically. If the user wants to restrict to only one species, this is done by the splitMixedSpeciesSampleToSingleSpecies
function. which returns a list of the two SingleSpeciesSample
objects.
# Extracting the human cells as a separate sample for further analysis
h <- splitMixedSpeciesSampleToSingleSpecies(mms,
threshold = 0.9)[[1]]
class(h)
## [1] "SingleSpeciesSample"
## attr(,"package")
## [1] "dropbead"
There are a couple of functions to remove low quality cells and genes, such as
h.f1 <- keepBestCells(h, num.cells = 100) # keep only the top 100 cells
h.f2 <- keepBestCells(h, min.num.trans = 1000) # keep cells with at least 1000 UMIs
h.f3 <- removeLowQualityCells(h, min.genes = 2000) # remove cells which don't express at least 2000 genes
h.f4 <- removeLowQualityGenes(h, min.cells = 3) # remove genes which are not expressed in at least 3 cells
with obvious usage.
Reproducibility and correlations of different samples are easily assessed via the compareGeneExpressionLevels
function, for instance,
compareGeneExpressionLevels(h.f2, h.f1,
name1 = 'Drop-seq with >= 1000 UMIs per cell',
name2 = 'Drop-seq only 100 cells',
method = 'pearson')
If bulk data is available then its correlation pwith the Drop-seq sample is directly assessed with the function compareSingleCellsAgainstBulk
. Note that bulk has to be in a 1-column data.frame
format with genes as the rownames
and RPKM values (as default, raw counts are also accepted)
compareSingleCellsAgainstBulk(h.f1,
log2(h.f2@dge[, 1, drop=F]+1))
It is instructive to assess the mitochondrial content of the single cell samples. This is done as below and the percentages can be sent for plotting
head(computeMitochondrialPercentage(h), 5)
## TTGAGACGGATA TTGCAGCCACCT GCTTCAGACAGA CTCTTCATCGAG GCTTACTATCCG
## 3.227439 2.512333 3.813054 3.827785 3.051626
plotMitochondrialContent(list(100-computeMitochondrialPercentage(h.f1),
100-computeMitochondrialPercentage(h.f3)),
log_scale = FALSE)
## No id variables; using all as measure variables
Dropbead
offers also an implementation of the algorithm described in Macosko et. al. 2015 for classification of the cell cycle phases. The following R-packages xlsx
, rJava
and xlsxjars
are required for this
phases <- assignCellCyclePhases(h)