Perform overlap queries between reads and genomic features
summarizeOverlaps
extends findOverlaps
by providing
options to resolve reads that overlap multiple features.
## S4 method for signature 'GRanges,GAlignments' summarizeOverlaps( features, reads, mode=Union, ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...) ## S4 method for signature 'GRangesList,GAlignments' summarizeOverlaps( features, reads, mode=Union, ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...) ## S4 method for signature 'GRanges,GRanges' summarizeOverlaps( features, reads, mode=Union, ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...) ## S4 method for signature 'GRangesList,GRanges' summarizeOverlaps( features, reads, mode=Union, ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...) ## S4 method for signature 'GRanges,GAlignmentPairs' summarizeOverlaps( features, reads, mode=Union, ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...) ## S4 method for signature 'GRangesList,GAlignmentPairs' summarizeOverlaps( features, reads, mode=Union, ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...) ## mode funtions Union(features, reads, ignore.strand=FALSE, inter.feature=TRUE) IntersectionStrict(features, reads, ignore.strand=FALSE, inter.feature=TRUE) IntersectionNotEmpty(features, reads, ignore.strand=FALSE, inter.feature=TRUE) ## S4 method for signature 'GRanges,BamFile' summarizeOverlaps( features, reads, mode=Union, ignore.strand=FALSE, inter.feature=TRUE, singleEnd=TRUE, fragments=FALSE, param=ScanBamParam(), preprocess.reads=NULL, ...) ## S4 method for signature 'BamViews,missing' summarizeOverlaps( features, reads, mode=Union, ignore.strand=FALSE, inter.feature=TRUE, singleEnd=TRUE, fragments=FALSE, param=ScanBamParam(), preprocess.reads=NULL, ...)
features |
A GRanges or a GRangesList object of genomic regions of interest. When a GRanges is supplied, each row is considered a feature. When a GRangesList is supplied, each higher list-level is considered a feature. This distinction is important when defining overlaps. When |
reads |
A GRanges, GRangesList
GAlignments, GAlignmentsList,
GAlignmentPairs, BamViews or
BamFileList object that represents the data to be
counted by
|
mode |
The pre-defined options are designed after the counting modes available in the HTSeq package by Simon Anders (see references).
|
ignore.strand |
A logical indicating if strand should be considered when matching. |
inter.feature |
(Default TRUE) A logical indicating if the counting There are 6 possible combinations of the |
preprocess.reads |
A function applied to the reads before counting. The first argument
should be The distinction between a user-defined 'mode' and user-defined 'preprocess.reads' function is that in the first case the user defines how to count; in the second case the reads are preprocessed before counting with a pre-defined mode. See examples. |
... |
Additional arguments passed to functions or methods called
from within A |
singleEnd |
(Default TRUE) A logical indicating if reads are single or
paired-end. In Bioconductor > 2.12 it is not necessary to sort
paired-end BAM files by |
fragments |
(Default FALSE) A logical; applied to paired-end data only.
When When The term ‘mated pairs’ refers to records paired with the algorithm
described on the |
param |
An optional See |
summarizeOverlaps
offers counting modes to resolve reads that
overlap multiple features. The mode
argument defines a set of rules
to resolve the read to a single feature such that each read is counted a
maximum of once. New to GenomicRanges >= 1.13.9 is the
inter.feature
argument which allows reads to be counted for each
feature they overlap. When inter.feature=TRUE
the counting modes
are aware of feature overlap; reads that overlap multiple features are
dropped and not counted. When inter.feature=FALSE
multiple feature
overlap is ignored and reads are counted once for each feature they map
to. This essentially reduces modes ‘Union’ and
‘IntersectionStrict’ to countOverlaps
with
type="any"
, and type="within"
, respectively.
‘IntersectionNotEmpty’ is not reduced to a derivative of
countOverlaps
because the shared regions are removed before
counting.
The BamViews
, BamFile
and BamFileList
methods
summarize overlaps across one or several files. The latter uses
bplapply
; control parallel evaluation using the
register
interface in the BiocParallel package.
A ‘feature’ can be any portion of a genomic region such as a gene,
transcript, exon etc. When the features
argument is a
GRanges the rows define the features. The result
will be the same length as the GRanges. When
features
is a GRangesList the highest
list-level defines the features and the result will be the same length
as the GRangesList.
When inter.feature=TRUE
, each count mode
attempts to
assign a read that overlaps multiple features to a single feature. If
there are ranges that should be considered together (e.g., exons by
transcript or cds regions by gene) the GRangesList
would be appropriate. If there is no grouping in the data then a
GRanges would be appropriate.
Paired-end reads are counted as a single hit if one or both parts of the pair are overlapped. Paired-end records can be counted in a GAlignmentPairs container or BAM file.
Counting pairs in BAM files:
The singleEnd
argument should be FALSE.
When reads
are supplied as a BamFile or BamFileList,
the asMates
argument to the BamFile should be TRUE.
When fragments
is FALSE, a GAlignmentPairs
object is used in counting (pairs only).
When fragments
is TRUE, a GAlignmentsList
object is used in counting (pairs, singletons, unmapped
mates, etc.)
A RangedSummarizedExperiment object. The
assays
slot holds the counts, rowRanges
holds the annotation
from features
.
When reads
is a BamFile
or BamFileList
colData
is an empty DataFrame with a single row named ‘counts’. If
count.mapped.reads=TRUE
, colData
holds the output of
countBam
in 3 columns named ‘records’ (total records),
‘nucleotides’ and ‘mapped’ (mapped records).
When features
is a BamViews
colData
includes
2 columns named bamSamples
and bamIndices
.
In all other cases, colData
has columns of ‘object’
(class of reads) and ‘records’ (length of reads
).
Valerie Obenchain
The DESeq2, DEXSeq and edgeR packages.
The RangedSummarizedExperiment class defined in the SummarizedExperiment package.
The GAlignments and GAlignmentPairs classes.
The BamFileList and BamViews classes in the Rsamtools package.
The readGAlignments and readGAlignmentPairs functions.
reads <- GAlignments( names = c("a","b","c","d","e","f","g"), seqnames = Rle(c(rep(c("chr1", "chr2"), 3), "chr1")), pos = as.integer(c(1400, 2700, 3400, 7100, 4000, 3100, 5200)), cigar = c("500M", "100M", "300M", "500M", "300M", "50M200N50M", "50M150N50M"), strand = strand(rep("+", 7))) gr <- GRanges( seqnames = c(rep("chr1", 7), rep("chr2", 4)), strand = "+", ranges = IRanges(c(1000, 3000, 3600, 4000, 4000, 5000, 5400, 2000, 3000, 7000, 7500), width = c(500, 500, 300, 500, 900, 500, 500, 900, 500, 600, 300), names=c("A", "B", "C1", "C2", "D1", "D2", "E", "F", "G", "H1", "H2"))) groups <- factor(c(1,2,3,3,4,4,5,6,7,8,8)) grl <- splitAsList(gr, groups) names(grl) <- LETTERS[seq_along(grl)] ## --------------------------------------------------------------------- ## Counting modes. ## --------------------------------------------------------------------- ## First count with a GRanges as the 'features'. 'Union' is the ## most conservative counting mode followed by 'IntersectionStrict' ## then 'IntersectionNotEmpty'. counts1 <- data.frame(union=assays(summarizeOverlaps(gr, reads))$counts, intStrict=assays(summarizeOverlaps(gr, reads, mode="IntersectionStrict"))$counts, intNotEmpty=assays(summarizeOverlaps(gr, reads, mode="IntersectionNotEmpty"))$counts) colSums(counts1) ## Split the 'features' into a GRangesList and count again. counts2 <- data.frame(union=assays(summarizeOverlaps(grl, reads))$counts, intStrict=assays(summarizeOverlaps(grl, reads, mode="IntersectionStrict"))$counts, intNotEmpty=assays(summarizeOverlaps(grl, reads, mode="IntersectionNotEmpty"))$counts) colSums(counts2) ## The GRangesList ('grl' object) has 8 features whereas the GRanges ## ('gr' object) has 11. The affect on counting can be seen by looking ## at feature 'H' with mode 'Union'. In the GRanges this feature is ## represented by ranges 'H1' and 'H2', gr[c("H1", "H2")] ## and by list element 'H' in the GRangesList, grl["H"] ## Read "d" hits both 'H1' and 'H2'. This is considered a multi-hit when ## using a GRanges (each range is a separate feature) so the read was ## dropped and not counted. counts1[c("H1", "H2"), ] ## When using a GRangesList, each list element is considered a feature. ## The read hits multiple ranges within list element 'H' but only one ## list element. This is not considered a multi-hit so the read is counted. counts2["H", ] ## --------------------------------------------------------------------- ## Counting multi-hit reads. ## --------------------------------------------------------------------- ## The goal of the counting modes is to provide a set of rules that ## resolve reads hitting multiple features so each read is counted ## a maximum of once. However, sometimes it may be desirable to count ## a read for each feature it overlaps. This can be accomplished by ## setting 'inter.feature' to FALSE. ## When 'inter.feature=FALSE', modes 'Union' and 'IntersectionStrict' ## essentially reduce to countOverlaps() with type="any" and ## type="within", respectively. ## When 'inter.feature=TRUE' only features "A", "F" and "G" have counts. se1 <- summarizeOverlaps(gr, reads, mode="Union", inter.feature=TRUE) assays(se1)$counts ## When 'inter.feature=FALSE' all 11 features have a count. There are ## 7 total reads so one or more reads were counted more than once. se2 <- summarizeOverlaps(gr, reads, mode="Union", inter.feature=FALSE) assays(se2)$counts ## --------------------------------------------------------------------- ## Counting BAM files. ## --------------------------------------------------------------------- library(pasillaBamSubset) library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene") ## (i) Single-end : ## Large files can be iterated over in chunks by setting a ## 'yieldSize' on the BamFile. bf_s <- BamFile(untreated1_chr4(), yieldSize=50000) se_s <- summarizeOverlaps(exbygene, bf_s, singleEnd=TRUE) table(assays(se_s)$counts > 0) ## When a character (file name) is provided as 'reads' instead ## of a BamFile object summarizeOverlaps() will create a BamFile ## and set a reasonable default 'yieldSize'. ## (ii) Paired-end : ## A paired-end file may contain singletons, reads with unmapped ## pairs or reads with more than two fragments. When 'fragments=FALSE' ## only reads paired by the algorithm are included in the counting. nofrag <- summarizeOverlaps(exbygene, untreated3_chr4(), singleEnd=FALSE, fragments=FALSE) table(assays(nofrag)$counts > 0) ## When 'fragments=TRUE' all singletons, reads with unmapped pairs ## and other fragments will be included in the counting. bf <- BamFile(untreated3_chr4(), asMates=TRUE) frag <- summarizeOverlaps(exbygene, bf, singleEnd=FALSE, fragments=TRUE) table(assays(frag)$counts > 0) ## As expected, using 'fragments=TRUE' results in a larger number ## of total counts because singletons, unmapped pairs etc. are ## included in the counting. ## Total reads in the file: countBam(untreated3_chr4()) ## Reads counted with 'fragments=FALSE': sum(assays(nofrag)$counts) ## Reads counted with 'fragments=TRUE': sum(assays(frag)$counts) ## --------------------------------------------------------------------- ## Use ouput of summarizeOverlaps() for differential expression analysis ## with DESeq2 or edgeR. ## --------------------------------------------------------------------- fls <- list.files(system.file("extdata", package="GenomicAlignments"), recursive=TRUE, pattern="*bam$", full=TRUE) names(fls) <- basename(fls) bf <- BamFileList(fls, index=character(), yieldSize=1000) genes <- GRanges( seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)), ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 300, 500, 500))) se <- summarizeOverlaps(genes, bf) ## When the reads are BAM files, the 'colData' contains summary ## information from a call to countBam(). colData(se) ## Start differential expression analysis with the DESeq2 or edgeR ## package: library(DESeq2) deseq <- DESeqDataSet(se, design= ~ 1) library(edgeR) edger <- DGEList(assays(se)$counts, group=rownames(colData(se))) ## --------------------------------------------------------------------- ## Filter records by map quality before counting. ## (user-supplied 'mode' function) ## --------------------------------------------------------------------- ## The 'mode' argument can take a custom count function whose ## arguments are the same as those in the current counting modes ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). ## In this example records are filtered by map quality before counting. mapq_filter <- function(features, reads, ignore.strand, inter.feature) { require(GenomicAlignments) # needed for parallel evaluation Union(features, reads[mcols(reads)$mapq >= 20], ignore.strand, inter.feature) } genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) param <- ScanBamParam(what="mapq") fl <- system.file("extdata", "ex1.bam", package="Rsamtools") se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) assays(se)$counts ## The count function can be completely custom (i.e., not use the ## pre-defined count functions at all). Requirements are that ## the input arguments match the pre-defined modes and the output ## is a vector of counts the same length as 'features'. my_count <- function(features, reads, ignore.strand, inter.feature) { ## perform filtering, or subsetting etc. require(GenomicAlignments) # needed for parallel evaluation countOverlaps(features, reads) } ## --------------------------------------------------------------------- ## Preprocessing reads before counting with a standard count mode. ## (user-supplied 'preprocess.reads' function) ## --------------------------------------------------------------------- ## The 'preprocess.reads' argument takes a function that is ## applied to the reads before counting with a pre-defined mode. ResizeReads <- function(reads, width=1, fix="start", ...) { reads <- as(reads, "GRanges") stopifnot(all(strand(reads) != "*")) resize(reads, width=width, fix=fix, ...) } ## By default ResizeReads() counts reads that overlap on the 5' end: summarizeOverlaps(grl, reads, mode=Union, preprocess.reads=ResizeReads) ## Count reads that overlap on the 3' end by passing new values ## for 'width' and 'fix': summarizeOverlaps(grl, reads, mode=Union, preprocess.reads=ResizeReads, width=1, fix="end") ## --------------------------------------------------------------------- ## summarizeOverlaps() with BamViews. ## --------------------------------------------------------------------- ## bamSamples and bamPaths metadata are included in the colData. ## bamExperiment metadata is put into the metadata slot. fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) rngs <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584))) samp <- DataFrame(info="test", row.names="ex1") view <- BamViews(fl, bamSamples=samp, bamRanges=rngs) se <- summarizeOverlaps(view, mode=Union, ignore.strand=TRUE) colData(se) metadata(se)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.