getSeq methods for BSgenome and XStringSet objects
getSeq
methods for extracting a set of
sequences (or subsequences) from a BSgenome or
XStringSet object. For XStringSets, there are also
convenience methods on [
that delegate to getSeq
.
## S4 method for signature 'BSgenome' getSeq(x, names, start=NA, end=NA, width=NA, strand="+", as.character=FALSE) ## S4 method for signature 'XStringSet' getSeq(x, names)
x |
A BSgenome or XStringSet object.
See the |
names |
When If See When |
start, end, width |
Vector of integers (eventually with NAs) specifying the locations
of the subsequences to extract.
These are not needed (and it's an error to supply them)
when |
strand |
A vector containing |
as.character |
|
L, the number of sequences to extract, is determined as follow:
If names
is a GRanges or
IntegerRanges object then L = length(names)
.
If names
is a GRangesList or
IntegerRangesList object then
L = length(unlist(names))
.
Otherwise, L is the length of the longest of names
,
start
, end
and width
and all these
arguments are recycled to this length.
NA
s and negative values in these 3 arguments are
solved according to the rules of the SEW (Start/End/Width)
interface (see ?solveUserSEW
for
the details).
If names
is neither a GRanges or
GRangesList object, then the strand
argument is also recycled to length L.
Here is how the names passed to the names
argument are matched
to the names of the sequences in BSgenome object x
.
For each name
in names
:
(1): If x
contains a single sequence with that name
then this sequence is used for extraction;
(2): Otherwise the names of all the elements in all the
multiple sequences are searched. If the names
argument
is a character vector then name
is treated as a regular
expression and grep
is used for this search,
otherwise (i.e. when the names are supplied via a higher level
object like GRanges or
GRangesList) then name
must match
exactly the name of the sequence. If exactly 1 sequence is found,
then it is used for extraction, otherwise (i.e. if no sequence or
more than 1 sequence is found) then an error is raised.
There are convenience methods for extracting sequences from
XStringSet objects using a
GenomicRanges or GRangesList
subscript (character subscripts are implicitly supported). Both methods
are simple wrappers around getSeq
, although the GRangesList method
differs from the getSeq
behavior in that the within-element results
are concatenated and returned as an XStringSet, rather than an
XStringSetList. See the examples.
Normally a DNAStringSet object (or character vector
if as.character=TRUE
).
With the 2 following exceptions:
A DNAStringSetList object (or
CharacterList object if as.character=TRUE
)
of the same shape as names
if names
is a
GRangesList object.
A DNAString object (or single character string
if as.character=TRUE
) if L = 1 and names
is not a GRanges,
GRangesList, IntegerRangesList,
or IntegerRanges object.
Be aware that using as.character=TRUE
can be very inefficient
when extracting a "big" amount of DNA sequences (e.g. millions of
short sequences or a small number of very long sequences).
Note that the masks in x
, if any, are always ignored. In other
words, masked regions in the genome are extracted in the same way as
unmasked regions (this is achieved by dropping the masks before extraction).
See ?`MaskedDNAString-class`
for more
information about masked DNA sequences.
H. Pagès; improvements suggested by Matt Settles and others
## --------------------------------------------------------------------- ## A. SIMPLE EXAMPLES ## --------------------------------------------------------------------- ## Load the Caenorhabditis elegans genome (UCSC Release ce2): library(BSgenome.Celegans.UCSC.ce2) ## Look at the index of sequences: Celegans ## Get chromosome V as a DNAString object: getSeq(Celegans, "chrV") ## which is in fact the same as doing: Celegans$chrV ## Not run: ## Never try this: getSeq(Celegans, "chrV", as.character=TRUE) ## or this (even worse): getSeq(Celegans, as.character=TRUE) ## End(Not run) ## Get the first 20 bases of each chromosome: getSeq(Celegans, end=20) ## Get the last 20 bases of each chromosome: getSeq(Celegans, start=-20) ## --------------------------------------------------------------------- ## B. EXTRACTING SMALL SEQUENCES FROM DIFFERENT CHROMOSOMES ## --------------------------------------------------------------------- myseqs <- data.frame( chr=c("chrI", "chrX", "chrM", "chrM", "chrX", "chrI", "chrM", "chrI"), start=c(NA, -40, 8510, 301, 30001, 9220500, -2804, -30), end=c(50, NA, 8522, 324, 30011, 9220555, -2801, -11), strand=c("+", "-", "+", "+", "-", "-", "+", "-") ) getSeq(Celegans, myseqs$chr, start=myseqs$start, end=myseqs$end) getSeq(Celegans, myseqs$chr, start=myseqs$start, end=myseqs$end, strand=myseqs$strand) ## --------------------------------------------------------------------- ## C. USING A GRanges OBJECT ## --------------------------------------------------------------------- gr1 <- GRanges(seqnames=c("chrI", "chrI", "chrM"), ranges=IRanges(start=101:103, width=9)) gr1 # all strand values are "*" getSeq(Celegans, gr1) # treats strand values as if they were "+" strand(gr1)[] <- "-" getSeq(Celegans, gr1) strand(gr1)[1] <- "+" getSeq(Celegans, gr1) strand(gr1)[2] <- "*" if (interactive()) getSeq(Celegans, gr1) # Error: cannot mix "*" with other strand values gr2 <- GRanges(seqnames=c("chrM", "NM_058280_up_1000"), ranges=IRanges(start=103:102, width=9)) gr2 if (interactive()) { ## Because the sequence names are supplied via a GRanges object, they ## are not treated as regular expressions: getSeq(Celegans, gr2) # Error: sequence NM_058280_up_1000 not found } ## --------------------------------------------------------------------- ## D. USING A GRangesList OBJECT ## --------------------------------------------------------------------- gr1 <- GRanges(seqnames=c("chrI", "chrII", "chrM", "chrII"), ranges=IRanges(start=101:104, width=12), strand="+") gr2 <- shift(gr1, 5) gr3 <- gr2 strand(gr3) <- "-" grl <- GRangesList(gr1, gr2, gr3) getSeq(Celegans, grl) ## --------------------------------------------------------------------- ## E. EXTRACTING A HIGH NUMBER OF RANDOM 40-MERS FROM A GENOME ## --------------------------------------------------------------------- extractRandomReads <- function(x, density, readlength) { if (!is.integer(readlength)) readlength <- as.integer(readlength) start <- lapply(seqnames(x), function(name) { seqlength <- seqlengths(x)[name] sample(seqlength - readlength + 1L, seqlength * density, replace=TRUE) }) names <- rep.int(seqnames(x), elementNROWS(start)) ranges <- IRanges(start=unlist(start), width=readlength) strand <- strand(sample(c("+", "-"), length(names), replace=TRUE)) gr <- GRanges(seqnames=names, ranges=ranges, strand=strand) getSeq(x, gr) } ## With a density of 1 read every 100 genome bases, the total number of ## extracted 40-mers is about 1 million: rndreads <- extractRandomReads(Celegans, 0.01, 40) ## Notes: ## - The short sequences in 'rndreads' can be seen as the result of a ## simulated high-throughput sequencing experiment. A non-realistic ## one though because: ## (a) It assumes that the underlying technology is perfect (the ## generated reads have no technology induced errors). ## (b) It assumes that the sequenced genome is exactly the same as ## the reference genome. ## (c) The simulated reads can contain IUPAC ambiguity letters only ## because the reference genome contains them. In a real ## high-throughput sequencing experiment, the sequenced genome ## of course doesn't contain those letters, but the sequencer ## can introduce them in the generated reads to indicate ## ambiguous base-calling. ## - Those reads are coming from the plus and minus strands of the ## chromosomes. ## - With a density of 0.01 and the reads being only 40-base long, the ## average coverage of the genome is only 0.4 which is low. The total ## number of reads is about 1 million and it takes less than 10 sec. ## to generate them. ## - A higher coverage can be achieved by using a higher density and/or ## longer reads. For example, with a density of 0.1 and 100-base reads ## the average coverage is 10. The total number of reads is about 10 ## millions and it takes less than 1 minute to generate them. ## - Those reads could easily be mapped back to the reference by using ## an efficient matching tool like matchPDict() for performing exact ## matching (see ?matchPDict for more information). Typically, a ## small percentage of the reads (4 to 5% in our case) will hit the ## reference at multiple locations. This is especially true for such ## short reads, and, in a lower proportion, is still true for longer ## reads, even for reads as long as 300 bases. ## --------------------------------------------------------------------- ## F. SEE THE BSgenome CACHE IN ACTION ## --------------------------------------------------------------------- options(verbose=TRUE) first20 <- getSeq(Celegans, end=20) first20 gc() stopifnot(length(ls(Celegans@.seqs_cache)) == 0L) ## One more gc() call is needed in order to see the amount of memory in ## used after all the chromosomes have been removed from the cache: gc() ## --------------------------------------------------------------------- ## G. USING '[' FOR CONVENIENT EXTRACTION ## --------------------------------------------------------------------- seqs <- getSeq(Celegans) seqs[gr1] seqs[grl]
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.