CIGAR utility functions
Utility functions for low-level CIGAR manipulation.
## -=-= Supported CIGAR operations =-=- CIGAR_OPS ## -=-= Transform CIGARs into other useful representations =-=- explodeCigarOps(cigar, ops=CIGAR_OPS) explodeCigarOpLengths(cigar, ops=CIGAR_OPS) cigarToRleList(cigar) ## -=-= Summarize CIGARs =-=- cigarOpTable(cigar) ## -=-= From CIGARs to ranges =-=- cigarRangesAlongReferenceSpace(cigar, flag=NULL, N.regions.removed=FALSE, pos=1L, f=NULL, ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE, with.ops=FALSE) cigarRangesAlongQuerySpace(cigar, flag=NULL, before.hard.clipping=FALSE, after.soft.clipping=FALSE, ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE, with.ops=FALSE) cigarRangesAlongPairwiseSpace(cigar, flag=NULL, N.regions.removed=FALSE, dense=FALSE, ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE, with.ops=FALSE) extractAlignmentRangesOnReference(cigar, pos=1L, drop.D.ranges=FALSE, f=NULL) ## -=-= From CIGARs to sequence lengths =-=- cigarWidthAlongReferenceSpace(cigar, flag=NULL, N.regions.removed=FALSE) cigarWidthAlongQuerySpace(cigar, flag=NULL, before.hard.clipping=FALSE, after.soft.clipping=FALSE) cigarWidthAlongPairwiseSpace(cigar, flag=NULL, N.regions.removed=FALSE, dense=FALSE) ## -=-= Narrow CIGARs =-=- cigarNarrow(cigar, start=NA, end=NA, width=NA) cigarQNarrow(cigar, start=NA, end=NA, width=NA) ## -=-= Translate coordinates between query and reference spaces =-=- queryLoc2refLoc(qloc, cigar, pos=1L) queryLocs2refLocs(qlocs, cigar, pos=1L, flag=NULL)
cigar |
A character vector or factor containing the extended CIGAR strings.
It can be of arbitrary length except for |
ops |
Character vector containing the extended CIGAR operations to actually
consider. Zero-length operations or operations not listed |
flag |
According to the SAM Spec v1.4, flag bit 0x4 is the only reliable place
to tell whether a segment (or read) is mapped (bit is 0) or not (bit
is 1). If |
N.regions.removed |
|
pos |
An integer vector containing the 1-based leftmost
position/coordinate for each (eventually clipped) read
sequence. Must have length 1 (in which case it's recycled to the
length of |
f |
For example, if |
drop.empty.ranges |
Should empty ranges be dropped? |
reduce.ranges |
Should adjacent ranges coming from the same cigar be merged or not?
Using |
with.ops |
|
before.hard.clipping |
|
after.soft.clipping |
|
dense |
|
drop.D.ranges |
Should the ranges corresponding to a deletion from the
reference (encoded with a D in the CIGAR) be dropped?
By default we keep them to be consistent with the pileup tool
from SAMtools.
Note that, when |
start,end,width |
Vectors of integers. NAs and negative values are accepted and
"solved" according to the rules of the SEW (Start/End/Width)
interface (see |
qloc |
An integer vector containing "query-based locations" i.e. 1-based locations relative to the query sequence stored in the SAM/BAM file. |
qlocs |
A list of the same length as |
CIGAR_OPS
is a predefined character vector containing the supported
extended CIGAR operations: M, I, D, N, S, H, P, =, X. See p. 4 of the
SAM Spec v1.4 at http://samtools.sourceforge.net/ for the list of
extended CIGAR operations and their meanings.
For explodeCigarOps
and explodeCigarOpLengths
:
Both functions return a list of the same length as cigar
where each
list element is a character vector (for explodeCigarOps
) or an integer
vector (for explodeCigarOpLengths
). The 2 lists have the same shape,
that is, same length()
and same elementNROWS()
. The i-th
character vector in the list returned by explodeCigarOps
contains one
single-letter string per CIGAR operation in cigar[i]
. The i-th integer
vector in the list returned by explodeCigarOpLengths
contains the
corresponding CIGAR operation lengths. Zero-length operations or operations
not listed in ops
are ignored.
For cigarToRleList
: A CompressedRleList object.
For cigarOpTable
: An integer matrix with number of rows equal
to the length of cigar
and nine columns, one for each extended
CIGAR operation.
For cigarRangesAlongReferenceSpace
, cigarRangesAlongQuerySpace
,
cigarRangesAlongPairwiseSpace
, and
extractAlignmentRangesOnReference
: An IRangesList
object (more precisely a CompressedIRangesList object)
with 1 list element per element in cigar
.
However, if f
is a factor, then the returned
IRangesList object can be a SimpleIRangesList
object (instead of CompressedIRangesList), and in that case,
has 1 list element per level in f
and is named with those levels.
For cigarWidthAlongReferenceSpace
and
cigarWidthAlongPairwiseSpace
: An integer vector of the same
length as cigar
where each element is the width of the alignment
with respect to the "reference" and "pairwise" space, respectively.
More precisely, for cigarWidthAlongReferenceSpace
, the returned
widths are the lengths of the alignments on the reference,
N gaps included (except if N.regions.removed
is TRUE
).
NAs or "*"
in cigar
will produce NAs in the returned vector.
For cigarWidthAlongQuerySpace
: An integer vector of the same
length as cigar
where each element is the length of the corresponding
query sequence as inferred from the CIGAR string. Note that, by default
(i.e. if before.hard.clipping
and after.soft.clipping
are
FALSE
), this is the length of the query sequence stored in the
SAM/BAM file. If before.hard.clipping
or after.soft.clipping
is TRUE
, the returned widths are the lengths of the query sequences
before hard clipping or after soft clipping.
NAs or "*"
in cigar
will produce NAs in the returned vector.
For cigarNarrow
and cigarQNarrow
: A character vector
of the same length as cigar
containing the narrowed cigars.
In addition the vector has an "rshift" attribute which is an integer
vector of the same length as cigar
. It contains the values that
would need to be added to the POS field of a SAM/BAM file as a
consequence of this cigar narrowing.
For queryLoc2refLoc
: An integer vector of the same length as
qloc
containing the "reference-based locations" (i.e. the
1-based locations relative to the reference sequence) corresponding
to the "query-based locations" passed in qloc
.
For queryLocs2refLocs
: A list of the same length as
qlocs
where each element is an integer vector containing
the "reference-based locations" corresponding to the "query-based
locations" passed in the corresponding element in qlocs
.
Hervé Pagès & P. Aboyoun
The sequenceLayer function in the GenomicAlignments package for laying the query sequences alongside the "reference" or "pairwise" spaces.
The GAlignments container for storing a set of genomic alignments.
The IRanges, IRangesList, and RleList classes in the IRanges package.
The coverage
generic and methods for
computing the coverage across a set of ranges or genomic ranges.
## --------------------------------------------------------------------- ## A. CIGAR_OPS, explodeCigarOps(), explodeCigarOpLengths(), ## cigarToRleList(), and cigarOpTable() ## --------------------------------------------------------------------- ## Supported CIGAR operations: CIGAR_OPS ## Transform CIGARs into other useful representations: cigar1 <- "3H15M55N4M2I6M2D5M6S" cigar2 <- c("40M2I9M", cigar1, "2S10M2000N15M", "3H33M5H") explodeCigarOps(cigar2) explodeCigarOpLengths(cigar2) explodeCigarOpLengths(cigar2, ops=c("I", "S")) cigarToRleList(cigar2) ## Summarize CIGARs: cigarOpTable(cigar2) ## --------------------------------------------------------------------- ## B. From CIGARs to ranges and to sequence lengths ## --------------------------------------------------------------------- ## CIGAR ranges along the "reference" space: cigarRangesAlongReferenceSpace(cigar1, with.ops=TRUE)[[1]] cigarRangesAlongReferenceSpace(cigar1, reduce.ranges=TRUE, with.ops=TRUE)[[1]] ops <- setdiff(CIGAR_OPS, "N") cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]] cigarRangesAlongReferenceSpace(cigar1, ops=ops, reduce.ranges=TRUE, with.ops=TRUE)[[1]] ops <- setdiff(CIGAR_OPS, c("D", "N")) cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]] cigarWidthAlongReferenceSpace(cigar1) pos2 <- c(1, 1001, 1, 351) cigarRangesAlongReferenceSpace(cigar2, pos=pos2, with.ops=TRUE) res1a <- extractAlignmentRangesOnReference(cigar2, pos=pos2) res1b <- cigarRangesAlongReferenceSpace(cigar2, pos=pos2, ops=setdiff(CIGAR_OPS, "N"), reduce.ranges=TRUE) stopifnot(identical(res1a, res1b)) res2a <- extractAlignmentRangesOnReference(cigar2, pos=pos2, drop.D.ranges=TRUE) res2b <- cigarRangesAlongReferenceSpace(cigar2, pos=pos2, ops=setdiff(CIGAR_OPS, c("D", "N")), reduce.ranges=TRUE) stopifnot(identical(res2a, res2b)) seqnames <- factor(c("chr6", "chr6", "chr2", "chr6"), levels=c("chr2", "chr6")) extractAlignmentRangesOnReference(cigar2, pos=pos2, f=seqnames) ## CIGAR ranges along the "query" space: cigarRangesAlongQuerySpace(cigar2, with.ops=TRUE) cigarWidthAlongQuerySpace(cigar1) cigarWidthAlongQuerySpace(cigar1, before.hard.clipping=TRUE) ## CIGAR ranges along the "pairwise" space: cigarRangesAlongPairwiseSpace(cigar2, with.ops=TRUE) cigarRangesAlongPairwiseSpace(cigar2, dense=TRUE, with.ops=TRUE) ## --------------------------------------------------------------------- ## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE ## --------------------------------------------------------------------- ## The information stored in a BAM file can be used to compute the ## "coverage" of the mapped reads i.e. the number of reads that hit any ## given position in the reference genome. ## The following function takes the path to a BAM file and returns an ## object representing the coverage of the mapped reads that are stored ## in the file. The returned object is an RleList object named with the ## names of the reference sequences that actually receive some coverage. flag0 <- scanBamFlag(isUnmappedQuery=FALSE, isDuplicate=FALSE) extractCoverageFromBAM <- function(bamfile) { stopifnot(is(bamfile, "BamFile")) ## This ScanBamParam object allows us to load only the necessary ## information from the file. param <- ScanBamParam(flag=flag0, what=c("rname", "pos", "cigar")) bam <- scanBam(bamfile, param=param)[[1]] ## Note that unmapped reads and reads that are PCR/optical duplicates ## have already been filtered out by using the ScanBamParam object ## above. f <- factor(bam$rname, levels=seqlevels(bamfile)) irl <- extractAlignmentRangesOnReference(bam$cigar, pos=bam$pos, f=f) coverage(irl, width=seqlengths(bamfile)) } library(Rsamtools) f1 <- system.file("extdata", "ex1.bam", package="Rsamtools") cvg <- extractCoverageFromBAM(BamFile(f1)) ## extractCoverageFromBAM() is equivalent but slightly more efficient ## than loading a GAlignments object and computing its coverage: cvg2 <- coverage(readGAlignments(f1, param=ScanBamParam(flag=flag0))) stopifnot(identical(cvg, cvg2)) ## --------------------------------------------------------------------- ## D. cigarNarrow() and cigarQNarrow() ## --------------------------------------------------------------------- ## cigarNarrow(): cigarNarrow(cigar1) # only drops the soft/hard clipping cigarNarrow(cigar1, start=10) cigarNarrow(cigar1, start=15) cigarNarrow(cigar1, start=15, width=57) cigarNarrow(cigar1, start=16) #cigarNarrow(cigar1, start=16, width=55) # ERROR! (empty cigar) cigarNarrow(cigar1, start=71) cigarNarrow(cigar1, start=72) cigarNarrow(cigar1, start=75) ## cigarQNarrow(): cigarQNarrow(cigar1, start=4, end=-3) cigarQNarrow(cigar1, start=10) cigarQNarrow(cigar1, start=19) cigarQNarrow(cigar1, start=24) ## --------------------------------------------------------------------- ## E. PERFORMANCE ## --------------------------------------------------------------------- if (interactive()) { ## We simulate 20 millions aligned reads, all 40-mers. 95% of them ## align with no indels. 5% align with a big deletion in the ## reference. In the context of an RNAseq experiment, those 5% would ## be suspected to be "junction reads". set.seed(123) nreads <- 20000000L njunctionreads <- nreads * 5L / 100L cigar3 <- character(nreads) cigar3[] <- "40M" junctioncigars <- paste( paste(10:30, "M", sep=""), paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""), paste(30:10, "M", sep=""), sep="") cigar3[sample(nreads, njunctionreads)] <- junctioncigars some_fake_rnames <- paste("chr", c(1:6, "X"), sep="") rname <- factor(sample(some_fake_rnames, nreads, replace=TRUE), levels=some_fake_rnames) pos <- sample(80000000L, nreads, replace=TRUE) ## The following takes < 3 sec. to complete: system.time(irl1 <- extractAlignmentRangesOnReference(cigar3, pos=pos)) ## The following takes < 4 sec. to complete: system.time(irl2 <- extractAlignmentRangesOnReference(cigar3, pos=pos, f=rname)) ## The sizes of the resulting objects are about 240M and 160M, ## respectively: object.size(irl1) object.size(irl2) }
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.