Stack the read sequences stored in a GAlignments object or a BAM file
stackStringsFromGAlignments
stacks the read sequences (or
their quality strings) stored in a GAlignments object over
a user-specified region.
stackStringsFromBam
stacks the read sequences (or their quality
strings) stored in a BAM file over a user-specified region.
alphabetFrequencyFromBam
computes the alphabet frequency of the
reads over a user-specified region.
All these functions take into account the CIGAR of each read to lay the read sequence (or its quality string) alongside the reference space. This step ensures that each nucleotide in a read is associated with the correct position on the reference sequence.
stackStringsFromGAlignments(x, region, what="seq", D.letter="-", N.letter=".", Lpadding.letter="+", Rpadding.letter="+") stackStringsFromBam(file, index=file, param, what="seq", use.names=FALSE, D.letter="-", N.letter=".", Lpadding.letter="+", Rpadding.letter="+") alphabetFrequencyFromBam(file, index=file, param, what="seq", ...)
x |
A GAlignments object with the read sequences in the |
region |
A GRanges object with exactly 1 genomic range. The read sequences (or read quality strings) will be stacked over that region. |
what |
A single string. Either |
D.letter, N.letter |
A single letter used as a filler for injections. The 2 arguments are
passed down to the |
Lpadding.letter, Rpadding.letter |
A single letter to use for padding the sequences on the left, and another
one to use for padding on the right. The 2 arguments are passed down to
the |
file, index |
The path to the BAM file containing the reads, and to its index file,
respectively. The latter is given without the '.bai'
extension. See |
param |
A ScanBamParam object containing exactly 1 genomic region
(i.e. |
use.names |
Use the query template names (QNAME field) as the names of the returned object? If not (the default), then the returned object has no names. |
... |
Further arguments to be passed to alphabetFrequency. |
stackStringsFromGAlignments
performs the 3 following steps:
Subset GAlignments object x
to keep only the
alignments that overlap with the specified region.
Lay the sequences in x
alongside the reference space,
using their CIGARs. This is done with the
sequenceLayer
function.
Stack them on the specified region. This is done with the
stackStrings
function defined in the
Biostrings package.
stackStringsFromBam
performs the 3 following steps:
Load the read sequences (or their quality strings) from the BAM
file. Only the read sequences that overlap with the specified region
are loaded. This is done with the readGAlignments
function. Note that if the file contains paired-end reads, the
pairing is ignored.
Same as stackStringsFromGAlignments
.
Same as stackStringsFromGAlignments
.
alphabetFrequencyFromBam
also performs steps 1. and 2. but,
instead of stacking the sequences at step 3., it computes the nucleotide
frequencies for each genomic position in the specified region.
For stackStringsFromBam
: A rectangular (i.e. constant-width)
DNAStringSet object (if what
is "seq"
)
or BStringSet object (if what
is "qual"
).
For alphabetFrequencyFromBam
: By default a matrix like one returned
by alphabetFrequency
. The matrix has 1 row per
nucleotide position in the specified region.
TWO IMPORTANT CAVEATS ABOUT stackStringsFromGAlignments
AND
stackStringsFromBam
:
Specifying a big genomic region, say >= 100000 bp, can require a lot of
memory (especially with high coverage reads) so is not recommended.
See the pileLettersAt
function for piling the read letters
on top of a set of genomic positions, which is more flexible and more
memory efficient.
Paired-end reads are treated as single-end reads (i.e. they're not paired).
Hervé Pagès
The pileLettersAt
function for piling the letters
of a set of aligned reads on top of a set of genomic positions.
The readGAlignments
function for loading read
sequences (or their quality strings) from a BAM file (via a
GAlignments object).
The sequenceLayer
function for laying read sequences
alongside the reference space, using their CIGARs.
The stackStrings
function in the
Biostrings package for stacking an arbitrary
XStringSet object.
The alphabetFrequency
function in the
Biostrings package.
The SAMtools mpileup command available at http://samtools.sourceforge.net/ as part of the SAMtools project.
## --------------------------------------------------------------------- ## A. EXAMPLE WITH TOY DATA ## --------------------------------------------------------------------- bamfile1 <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools")) region1 <- GRanges("seq1", IRanges(1, 60)) # region of interest ## Stack the read sequences directly from the BAM file: stackStringsFromBam(bamfile1, param=region1, use.names=TRUE) ## or, alternatively, from a GAlignments object: gal1 <- readGAlignments(bamfile1, param=ScanBamParam(what="seq"), use.names=TRUE) stackStringsFromGAlignments(gal1, region1) ## Compute the "consensus matrix" (1 column per nucleotide position ## in the region of interest): af <- alphabetFrequencyFromBam(bamfile1, param=region1, baseOnly=TRUE) cm1a <- t(af[ , DNA_BASES]) cm1a ## Stack their quality strings: stackStringsFromBam(bamfile1, param=region1, what="qual") ## Control the number of reads to display: options(showHeadLines=18) options(showTailLines=6) stackStringsFromBam(bamfile1, param=GRanges("seq1", IRanges(61, 120))) stacked_qseq <- stackStringsFromBam(bamfile1, param="seq2:1509-1519") stacked_qseq # deletion in read 13 af <- alphabetFrequencyFromBam(bamfile1, param="seq2:1509-1519", baseOnly=TRUE) cm1b <- t(af[ , DNA_BASES]) # consensus matrix cm1b ## Sanity check: stopifnot(identical(consensusMatrix(stacked_qseq)[DNA_BASES, ], cm1b)) stackStringsFromBam(bamfile1, param="seq2:1509-1519", what="qual") ## --------------------------------------------------------------------- ## B. EXAMPLE WITH REAL DATA ## --------------------------------------------------------------------- library(RNAseqData.HNRNPC.bam.chr14) bamfile2 <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]) ## Region of interest: region2 <- GRanges("chr14", IRanges(19650095, 19650159)) ## Stack the read sequences directly from the BAM file: stackStringsFromBam(bamfile2, param=region2) ## or, alternatively, from a GAlignments object: gal2 <- readGAlignments(bamfile2, param=ScanBamParam(what="seq")) stackStringsFromGAlignments(gal2, region2) af <- alphabetFrequencyFromBam(bamfile2, param=region2, baseOnly=TRUE) cm2 <- t(af[ , DNA_BASES]) # consensus matrix cm2 ## --------------------------------------------------------------------- ## C. COMPUTE READ CONSENSUS SEQUENCE FOR REGION OF INTEREST ## --------------------------------------------------------------------- ## Let's write our own little naive function to go from consensus matrix ## to consensus sequence. For each nucleotide position in the region of ## interest (i.e. each column in the matrix), we select the letter with ## highest frequency. We also use special letter "*" at positions where ## there is a tie, and special letter "." at positions where all the ## frequencies are 0 (a particular type of tie): cm_to_cs <- function(cm) { stopifnot(is.matrix(cm)) nr <- nrow(cm) rnames <- rownames(cm) stopifnot(!is.null(rnames) && all(nchar(rnames) == 1L)) selection <- apply(cm, 2, function(x) { i <- which.max(x) if (x[i] == 0L) return(nr + 1L) if (sum(x == x[i]) != 1L) return(nr + 2L) i }) paste0(c(rnames, ".", "*")[selection], collapse="") } cm_to_cs(cm1a) cm_to_cs(cm1b) cm_to_cs(cm2) ## Note that the consensus sequences we obtain are relative to the ## plus strand of the reference sequence.
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.