Calculate the frequency of letters in a biological sequence, or the consensus matrix of a set of sequences
Given a biological sequence (or a set of biological sequences),
the alphabetFrequency
function computes the frequency of
each letter of the relevant alphabet.
letterFrequency
is similar, but more compact if one is only
interested in certain letters.
It can also tabulate letters "in common".
letterFrequencyInSlidingView
is a more specialized version
of letterFrequency
for (non-masked) XString objects.
It tallys the requested letter frequencies for a fixed-width view,
or window, that is conceptually slid along the entire input sequence.
The consensusMatrix
function computes the consensus matrix
of a set of sequences, and the consensusString
function creates
the consensus sequence from the consensus matrix based upon specified
criteria.
In this man page we call "DNA input" (or "RNA input") an XString, XStringSet, XStringViews or MaskedXString object of base type DNA (or RNA).
alphabetFrequency(x, as.prob=FALSE, ...) hasOnlyBaseLetters(x) uniqueLetters(x) letterFrequency(x, letters, OR="|", as.prob=FALSE, ...) letterFrequencyInSlidingView(x, view.width, letters, OR="|", as.prob=FALSE) consensusMatrix(x, as.prob=FALSE, shift=0L, width=NULL, ...) ## S4 method for signature 'matrix' consensusString(x, ambiguityMap="?", threshold=0.5) ## S4 method for signature 'DNAStringSet' consensusString(x, ambiguityMap=IUPAC_CODE_MAP, threshold=0.25, shift=0L, width=NULL) ## S4 method for signature 'RNAStringSet' consensusString(x, ambiguityMap= structure(as.character(RNAStringSet(DNAStringSet(IUPAC_CODE_MAP))), names= as.character(RNAStringSet(DNAStringSet(names(IUPAC_CODE_MAP))))), threshold=0.25, shift=0L, width=NULL)
x |
An XString, XStringSet, XStringViews
or MaskedXString object for DNA or RNA input for An XString object for A character vector, or an XStringSet or XStringViews
object for A consensus matrix (as returned by |
as.prob |
If |
view.width |
For |
letters |
For |
OR |
For |
ambiguityMap |
Either a single character to use when agreement is not reached or
a named character vector where the names are the ambiguity characters
and the values are the combinations of letters that comprise the
ambiguity (e.g. |
threshold |
The minimum probability threshold for an agreement to be declared.
When |
... |
Further arguments to be passed to or from other methods. For the XStringViews and XStringSet methods,
the Except for |
shift |
An integer vector (recycled to the length of |
width |
The number of columns of the returned matrix for the The length of the returned sequence for the |
alphabetFrequency
, letterFrequency
, and
letterFrequencyInSlidingView
are
generic functions defined in the Biostrings package.
letterFrequency
is similar to alphabetFrequency
but
specific to the letters of interest, hence more compact, especially
with OR
non-zero.
letterFrequencyInSlidingView
yields the same result, on the
sequence x
, that letterFrequency
would, if applied to the
hypothetical (and possibly huge) XStringViews
object
consisting of all the intervals of length view.width
on x
.
Taking advantage of the knowledge that successive "views" are nearly
identical, for letter counting purposes, it is both lighter and faster.
For letterFrequencyInSlidingView
, a masked (MaskedXString)
object x
is only supported through a cast to an (ordinary)
XString such as unmasked
(which includes its masked
regions).
When consensusString
is executed with a named character
ambiguityMap
argument, it weights each input string equally and
assigns an equal probability to each of the base letters represented by
an ambiguity letter. So for DNA and a threshold
of 0.25,
a "G" and an "R" would result in an "R" since
1/2 "G" + 1/2 "R" = 3/4 "G" + 1/4 "A" => "R";
two "G"'s and one "R" would result in a "G" since
2/3 "G" + 1/3 "R" = 5/6 "G" + 1/6 "A" => "G"; and
one "A" and one "N" would result in an "N" since
1/2 "A" + 1/2 "N" = 5/8 "A" + 1/8 "C" + 1/8 "G" + 1/8 "T" => "N".
alphabetFrequency
returns an integer vector when x
is an
XString or MaskedXString object. When x
is an
XStringSet or XStringViews object, then it returns
an integer matrix with length(x)
rows where the
i
-th row contains the frequencies for x[[i]]
.
If x
is a DNA or RNA input, then the returned vector is named
with the letters in the alphabet. If the baseOnly
argument is
TRUE
, then the returned vector has only 5 elements: 4 elements
corresponding to the 4 nucleotides + the 'other' element.
letterFrequency
returns, similarly, an integer vector or matrix,
but restricted and/or collated according to letters
and OR
.
hasOnlyBaseLetters
returns TRUE
or FALSE
indicating
whether or not x
contains only base letters (i.e. As, Cs, Gs and Ts
for DNA input and As, Cs, Gs and Us for RNA input).
uniqueLetters
returns a vector of 1-letter or empty strings. The empty
string is used to represent the nul character if x
happens to contain
any. Note that this can only happen if the base class of x
is BString.
An integer matrix with letters as row names for consensusMatrix
.
A standard character string for consensusString
.
H. Pagès and P. Aboyoun; H. Jaffee for letterFrequency and letterFrequencyInSlidingView
## --------------------------------------------------------------------- ## alphabetFrequency() ## --------------------------------------------------------------------- data(yeastSEQCHR1) yeast1 <- DNAString(yeastSEQCHR1) alphabetFrequency(yeast1) alphabetFrequency(yeast1, baseOnly=TRUE) hasOnlyBaseLetters(yeast1) uniqueLetters(yeast1) ## With input made of multiple sequences: library(drosophila2probe) probes <- DNAStringSet(drosophila2probe) alphabetFrequency(probes[1:50], baseOnly=TRUE) alphabetFrequency(probes, baseOnly=TRUE, collapse=TRUE) ## --------------------------------------------------------------------- ## letterFrequency() ## --------------------------------------------------------------------- letterFrequency(probes[[1]], letters="ACGT", OR=0) base_letters <- alphabet(probes, baseOnly=TRUE) base_letters letterFrequency(probes[[1]], letters=base_letters, OR=0) base_letter_freqs <- letterFrequency(probes, letters=base_letters, OR=0) head(base_letter_freqs) GC_content <- letterFrequency(probes, letters="CG") head(GC_content) letterFrequency(probes, letters="CG", collapse=TRUE) ## --------------------------------------------------------------------- ## letterFrequencyInSlidingView() ## --------------------------------------------------------------------- data(yeastSEQCHR1) x <- DNAString(yeastSEQCHR1) view.width <- 48 letters <- c("A", "CG") two_columns <- letterFrequencyInSlidingView(x, view.width, letters) head(two_columns) tail(two_columns) three_columns <- letterFrequencyInSlidingView(x, view.width, letters, OR=0) head(three_columns) tail(three_columns) stopifnot(identical(two_columns[ , "C|G"], three_columns[ , "C"] + three_columns[ , "G"])) ## Note that, alternatively, 'three_columns' can also be obtained by ## creating the views on 'x' (as a Views object) and by calling ## alphabetFrequency() on it. But, of course, that is be *much* less ## efficient (both, in terms of memory and speed) than using ## letterFrequencyInSlidingView(): v <- Views(x, start=seq_len(length(x) - view.width + 1), width=view.width) v three_columns2 <- alphabetFrequency(v, baseOnly=TRUE)[ , c("A", "C", "G")] stopifnot(identical(three_columns2, three_columns)) ## Set the width of the view to length(x) to get the global frequencies: letterFrequencyInSlidingView(x, letters="ACGTN", view.width=length(x), OR=0) ## --------------------------------------------------------------------- ## consensus*() ## --------------------------------------------------------------------- ## Read in ORF data: file <- system.file("extdata", "someORF.fa", package="Biostrings") orf <- readDNAStringSet(file) ## To illustrate, the following example assumes the ORF data ## to be aligned for the first 10 positions (patently false): orf10 <- DNAStringSet(orf, end=10) consensusMatrix(orf10, baseOnly=TRUE) ## The following example assumes the first 10 positions to be aligned ## after some incremental shifting to the right (patently false): consensusMatrix(orf10, baseOnly=TRUE, shift=0:6) consensusMatrix(orf10, baseOnly=TRUE, shift=0:6, width=10) ## For the character matrix containing the "exploded" representation ## of the strings, do: as.matrix(orf10, use.names=FALSE) ## consensusMatrix() can be used to just compute the alphabet frequency ## for each position in the input sequences: consensusMatrix(probes, baseOnly=TRUE) ## After sorting, the first 5 probes might look similar (at least on ## their first bases): consensusString(sort(probes)[1:5]) consensusString(sort(probes)[1:5], ambiguityMap = "N", threshold = 0.5) ## Consensus involving ambiguity letters in the input strings consensusString(DNAStringSet(c("NNNN","ACTG"))) consensusString(DNAStringSet(c("AANN","ACTG"))) consensusString(DNAStringSet(c("ACAG","ACAR"))) consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG"))) ## --------------------------------------------------------------------- ## C. RELATIONSHIP BETWEEN consensusMatrix() AND coverage() ## --------------------------------------------------------------------- ## Applying colSums() on a consensus matrix gives the coverage that ## would be obtained by piling up (after shifting) the input sequences ## on top of an (imaginary) reference sequence: cm <- consensusMatrix(orf10, shift=0:6, width=10) colSums(cm) ## Note that this coverage can also be obtained with: as.integer(coverage(IRanges(rep(1, length(orf)), width(orf)), shift=0:6, width=10))
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.