Reading genomic alignments from a file
Read genomic alignments from a file (typically a BAM file) into a GAlignments, GAlignmentPairs, GAlignmentsList, or GappedReads object.
readGAlignments(file, index=file, use.names=FALSE, param=NULL, with.which_label=FALSE) readGAlignmentPairs(file, index=file, use.names=FALSE, param=NULL, with.which_label=FALSE, strandMode=1) readGAlignmentsList(file, index=file, use.names=FALSE, param=ScanBamParam(), with.which_label=FALSE) readGappedReads(file, index=file, use.names=FALSE, param=NULL, with.which_label=FALSE)
file |
The path to the file to read or a BamFile object.
Can also be a BamViews object for
|
index |
The path to the index file of the BAM file to read.
Must be given without the '.bai' extension.
See |
use.names |
|
param |
By default (i.e. |
with.which_label |
The purpose of this metadata column is to unambiguously identify
the range in |
strandMode |
Strand mode to set on the returned GAlignmentPairs object.
See |
readGAlignments
reads a file containing aligned reads as a
GAlignments object. See ?GAlignments
for a
description of GAlignments objects.
When file
is a BamViews object,
readGAlignments
visits each path in bamPaths(file)
,
returning the result of readGAlignments
applied to the
specified path. When index
is missing, it is set equal to
bamIndicies(file)
. Only reads in bamRanges(file)
are
returned (if param
is supplied, bamRanges(file)
takes
precedence over bamWhich(param)
). The return value is a
SimpleList object, with elements of the list
corresponding to each path. bamSamples(file)
is available
as metadata columns (accessed with mcols
) of the returned
SimpleList object.
readGAlignmentPairs
reads a file containing aligned paired-end
reads as a GAlignmentPairs object.
See ?GAlignmentPairs
for a description of
GAlignmentPairs objects.
readGAlignmentsList
reads a file containing aligned reads as
a GAlignmentsList object. See ?GAlignmentsList
for a description of GAlignmentsList objects.
readGAlignmentsList
pairs records into mates according to the
pairing criteria described below. The 1st mate will always be 1st in
the GAlignmentsList list elements that have mate_status set to
"mated"
, and the 2nd mate will always be 2nd.
A GAlignmentsList
is returned with a ‘mate_status’
metadata column on the outer list elements. mate_status
is a
factor with 3 levels indicating mate status, ‘mated’,
‘ambiguous’ or ‘unmated’:
mated: primary or non-primary pairs
ambiguous: multiple segments matching to the same location (indistinguishable)
unmated: mate does not exist or is unmapped
When the ‘file’ argument is a BamFile, ‘asMates=TRUE’
must be set, otherwise the data are treated as single-end reads.
See the ‘asMates’ section of ?BamFile
in the Rsamtools package for details.
readGappedReads
reads a file containing aligned reads as a
GappedReads object. See ?GappedReads
for a
description of GappedReads objects.
For all these functions, flags, tags and ranges may be specified in the supplied ScanBamParam object for fine tuning of results.
A GAlignments object for readGAlignments
.
A GAlignmentPairs object for readGAlignmentPairs
.
Note that a BAM (or SAM) file can in theory contain a mix of single-end
and paired-end reads, but in practise it seems that single-end and
paired-end are not mixed. In other words, the value of flag bit 0x1
(isPaired
) is the same for all the records in a file.
So if readGAlignmentPairs
returns a GAlignmentPairs object
of length zero, this almost always means that the BAM (or SAM) file
contains alignments for single-end reads (although it could also mean that
the user-supplied ScanBamParam
is filtering out
everything, or that the file is empty, or that all the records in the file
correspond to unmapped reads).
A GAlignmentsList object for readGAlignmentsList
.
When the list contains paired-end reads a metadata data column of
mate_status
is added to the object. See details in the
‘Bam specific back-ends’ section on this man page.
A GappedReads object for readGappedReads
.
This section describes the pairing criteria used by
readGAlignmentsList
and readGAlignmentPairs
.
First, only records with flag bit 0x1 (multiple segments) set to 1, flag bit 0x4 (segment unmapped) set to 0, and flag bit 0x8 (next segment in the template unmapped) set to 0, are candidates for pairing (see the SAM Spec for a description of flag bits and fields). Records that correspond to single-end reads, or records that correspond to paired-end reads where one or both ends are unmapped, will remain unmated.
Then the following fields and flag bits are considered:
(A) QNAME
(B) RNAME, RNEXT
(C) POS, PNEXT
(D) Flag bits Ox10 (segment aligned to minus strand) and 0x20 (next segment aligned to minus strand)
(E) Flag bits 0x40 (first segment in template) and 0x80 (last segment in template)
(F) Flag bit 0x2 (proper pair)
(G) Flag bit 0x100 (secondary alignment)
2 records rec1 and rec2 are considered mates iff all the following conditions are satisfied:
(A) QNAME(rec1) == QNAME(rec2)
(B) RNEXT(rec1) == RNAME(rec2) and RNEXT(rec2) == RNAME(rec1)
(C) PNEXT(rec1) == POS(rec2) and PNEXT(rec2) == POS(rec1)
(D) Flag bit 0x20 of rec1 == Flag bit 0x10 of rec2 and Flag bit 0x20 of rec2 == Flag bit 0x10 of rec1
(E) rec1 corresponds to the first segment in the template and rec2 corresponds to the last segment in the template, OR, rec2 corresponds to the first segment in the template and rec1 corresponds to the last segment in the template
(F) rec1 and rec2 have same flag bit 0x2
(G) rec1 and rec2 have same flag bit 0x100
Note that this is actually the pairing criteria used by
scanBam
(when the BamFile
passed to it has the asMates
toggle set to TRUE
), which
readGAlignmentsList
and readGAlignmentPairs
call behind
the scene. It is also the pairing criteria used by
findMateAlignment
.
BAM records corresponding to unmapped reads are always ignored.
Starting with Rsamtools 1.7.1 (BioC 2.10), PCR or optical duplicates
are loaded by default (use scanBamFlag(isDuplicate=FALSE)
to
drop them).
Hervé Pagès and Valerie Obenchain
scanBam
and
ScanBamParam
in the Rsamtools
package.
GAlignments, GAlignmentPairs, GAlignmentsList, and GappedReads objects.
IRangesList objects (used in the examples
below to specify the which
regions) in the IRanges
package.
## --------------------------------------------------------------------- ## A. readGAlignments() ## --------------------------------------------------------------------- ## Simple use: bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) gal1 <- readGAlignments(bamfile) gal1 names(gal1) ## Using the 'use.names' arg: gal2 <- readGAlignments(bamfile, use.names=TRUE) gal2 head(names(gal2)) ## Using the 'param' arg to drop PCR or optical duplicates as well as ## secondary alignments, and to load additional BAM fields: param <- ScanBamParam(flag=scanBamFlag(isDuplicate=FALSE, isSecondaryAlignment=FALSE), what=c("qual", "flag")) gal3 <- readGAlignments(bamfile, param=param) gal3 mcols(gal3) ## Using the 'param' arg to load alignments from particular regions. which <- IRangesList(seq1=IRanges(1000, 1100), seq2=IRanges(c(1546, 1555, 1567), width=10)) param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, use.names=TRUE, param=param) gal4 ## IMPORTANT NOTE: A given record is loaded one time for each region ## it overlaps with. We call this "duplicated record selection" (this ## is a scanBam() feature, readGAlignments() is based on scanBam()): which <- IRangesList(seq2=IRanges(c(1555, 1567), width=10)) param <- ScanBamParam(which=which) gal5 <- readGAlignments(bamfile, use.names=TRUE, param=param) gal5 # record EAS114_26:7:37:79:581 was loaded twice ## This becomes clearer if we use 'with.which_label=TRUE' to identify ## the region in 'which' where each element in 'gal5' originates from. gal5 <- readGAlignments(bamfile, use.names=TRUE, param=param, with.which_label=TRUE) gal5 ## Not surprisingly, we also get "duplicated record selection" when ## 'which' contains repeated or overlapping regions. Using the same ## regions as we did for 'gal4' above, except that now we're ## repeating the region on seq1: which <- IRangesList(seq1=rep(IRanges(1000, 1100), 2), seq2=IRanges(c(1546, 1555, 1567), width=10)) param <- ScanBamParam(which=which) gal4b <- readGAlignments(bamfile, use.names=TRUE, param=param) length(gal4b) # > length(gal4), because all the records overlapping # with bases 1000 to 1100 on seq1 are now duplicated ## The "duplicated record selection" will artificially increase the ## coverage or affect other downstream results. It can be mitigated ## (but not completely eliminated) by first "reducing" the set of ## regions: which <- reduce(which) which param <- ScanBamParam(which=which) gal4c <- readGAlignments(bamfile, use.names=TRUE, param=param) length(gal4c) # < length(gal4), because the 2 first original regions # on seq2 were merged into a single one ## Note that reducing the set of regions didn't completely eliminate ## "duplicated record selection". Records that overlap the 2 reduced ## regions on seq2 (which$seq2) are loaded twice (like for 'gal5' ## above). See example D. below for how to completely eliminate ## "duplicated record selection". ## Using the 'param' arg to load tags. Except for MF and Aq, the tags ## specified below are predefined tags (see the SAM Spec for the list ## of predefined tags and their meaning). param <- ScanBamParam(tag=c("MF", "Aq", "NM", "UQ", "H0", "H1"), what="isize") gal6 <- readGAlignments(bamfile, param=param) mcols(gal6) # "tag" cols always after "what" cols ## With a BamViews object: fls <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) bv <- BamViews(fls, bamSamples=DataFrame(info="test", row.names="ex1"), auto.range=TRUE) ## Note that the "readGAlignments" method for BamViews objects ## requires the ShortRead package to be installed. aln <- readGAlignments(bv) aln aln[[1]] aln[colnames(bv)] mcols(aln) ## --------------------------------------------------------------------- ## B. readGAlignmentPairs() ## --------------------------------------------------------------------- galp1 <- readGAlignmentPairs(bamfile) head(galp1) names(galp1) ## Here we use the 'param' arg to filter by proper pair, drop PCR / ## optical duplicates, and drop secondary alignments. Filtering by ## proper pair and dropping secondary alignments can help make the ## pairing algorithm run significantly faster: param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE, isDuplicate=FALSE, isSecondaryAlignment=FALSE)) galp2 <- readGAlignmentPairs(bamfile, use.names=TRUE, param=param) galp2 head(galp2) head(names(galp2)) ## --------------------------------------------------------------------- ## C. readGAlignmentsList() ## --------------------------------------------------------------------- library(pasillaBamSubset) ## 'file' as character. bam <- untreated3_chr4() galist1 <- readGAlignmentsList(bam) galist1[1:3] length(galist1) table(elementNROWS(galist1)) ## When 'file' is a BamFile, 'asMates' must be TRUE. If FALSE, ## the data are treated as single-end and each list element of the ## GAlignmentsList will be of length 1. For single-end data ## use readGAlignments(). bamfile <- BamFile(bam, yieldSize=3, asMates=TRUE) readGAlignmentsList(bamfile) ## Use a 'param' to fine tune the results. param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE)) galist2 <- readGAlignmentsList(bam, param=param) length(galist2) ## --------------------------------------------------------------------- ## D. COMPARING 4 STRATEGIES FOR LOADING THE ALIGNMENTS THAT OVERLAP ## WITH THE EXONIC REGIONS ON FLY CHROMMOSOME 4 ## --------------------------------------------------------------------- library(pasillaBamSubset) bam <- untreated1_chr4() library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene ex <- exons(txdb) seqlevels(ex, pruning.mode="coarse") <- "chr4" length(ex) ## Some of the exons overlap with each other: isDisjoint(ex) # FALSE exonic_regions <- reduce(ex) isDisjoint(exonic_regions) # no more overlaps length(exonic_regions) ## Strategy #1: slow and loads a lot of records more than once (see ## "duplicated record selection" in example A. above). param1 <- ScanBamParam(which=ex) gal1 <- readGAlignments(bam, param=param1) length(gal1) # many "duplicated records" ## Strategy #2: faster and generates less duplicated records but ## doesn't eliminate them. param2 <- ScanBamParam(which=exonic_regions) gal2 <- readGAlignments(bam, param=param2) length(gal2) # less "duplicated records" ## Strategy #3: fast and completely eliminates duplicated records. gal0 <- readGAlignments(bam) gal3 <- subsetByOverlaps(gal0, exonic_regions, ignore.strand=TRUE) length(gal3) # no "duplicated records" ## Note that, in this case using 'exonic_regions' or 'ex' makes no ## difference: gal3b <- subsetByOverlaps(gal0, ex, ignore.strand=TRUE) stopifnot(identical(gal3, gal3b)) ## Strategy #4: strategy #3 however can require a lot of memory if the ## file is big because we load all the alignments into memory before we ## select those that overlap with the exonic regions. Strategy #4 ## addresses this by loading the file by chunks. bamfile <- BamFile(bam, yieldSize=50000) open(bamfile) while (length(chunk0 <- readGAlignments(bamfile))) { chunk <- subsetByOverlaps(chunk0, ex, ignore.strand=TRUE) cat("chunk0:", length(chunk0), "- chunk:", length(chunk), "\n") ## ... do something with 'chunk' ... } close(bamfile) ## --------------------------------------------------------------------- ## E. readGappedReads() ## --------------------------------------------------------------------- greads1 <- readGappedReads(bamfile) greads1 names(greads1) qseq(greads1) greads2 <- readGappedReads(bamfile, use.names=TRUE) head(greads2) head(names(greads2))
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.