Low-level matching functions
In this man page we define precisely and illustrate what a "match" of a pattern P in a subject S is in the context of the Biostrings package. This definition of a "match" is central to most pattern matching functions available in this package: unless specified otherwise, most of them will adhere to the definition provided here.
hasLetterAt
checks whether a sequence or set of sequences has the
specified letters at the specified positions.
neditAt
, isMatchingAt
and which.isMatchingAt
are
low-level matching functions that only look for matches at the specified
positions in the subject.
hasLetterAt(x, letter, at, fixed=TRUE) ## neditAt() and related utils: neditAt(pattern, subject, at=1, with.indels=FALSE, fixed=TRUE) neditStartingAt(pattern, subject, starting.at=1, with.indels=FALSE, fixed=TRUE) neditEndingAt(pattern, subject, ending.at=1, with.indels=FALSE, fixed=TRUE) ## isMatchingAt() and related utils: isMatchingAt(pattern, subject, at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE) isMatchingStartingAt(pattern, subject, starting.at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE) isMatchingEndingAt(pattern, subject, ending.at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE) ## which.isMatchingAt() and related utils: which.isMatchingAt(pattern, subject, at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE, follow.index=FALSE, auto.reduce.pattern=FALSE) which.isMatchingStartingAt(pattern, subject, starting.at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE, follow.index=FALSE, auto.reduce.pattern=FALSE) which.isMatchingEndingAt(pattern, subject, ending.at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE, follow.index=FALSE, auto.reduce.pattern=FALSE)
x |
A character vector, or an XString or XStringSet object. |
letter |
A character string or an XString object containing the letters to check. |
at, starting.at, ending.at |
An integer vector specifying the starting (for For the |
pattern |
The pattern string (but see |
subject |
A character vector, or an XString or XStringSet object containing the subject sequence(s). |
max.mismatch, min.mismatch |
Integer vectors of length >= 1 recycled to the length of the
|
with.indels |
See details below. |
fixed |
Only with a DNAString or RNAString-based subject can a
If
|
follow.index |
Whether the single integer returned by |
auto.reduce.pattern |
Whether |
A "match" of pattern P in subject S is a substring S' of S that is considered similar enough to P according to some distance (or metric) specified by the user. 2 distances are supported by most pattern matching functions in the Biostrings package. The first (and simplest) one is the "number of mismatching letters". It is defined only when the 2 strings to compare have the same length, so when this distance is used, only matches that have the same number of letters as P are considered. The second one is the "edit distance" (aka Levenshtein distance): it's the minimum number of operations needed to transform P into S', where an operation is an insertion, deletion, or substitution of a single letter. When this metric is used, matches can have a different number of letters than P.
The neditAt
function implements these 2 distances.
If with.indels
is FALSE
(the default), then the first distance
is used i.e. neditAt
returns the "number of mismatching letters"
between the pattern P and the substring S' of S starting at the
positions specified in at
(note that neditAt
is vectorized
so a long vector of integers can be passed thru the at
argument).
If with.indels
is TRUE
, then the "edit distance" is
used: for each position specified in at
, P is compared to
all the substrings S' of S starting at this position and the smallest
distance is returned. Note that this distance is guaranteed to be reached
for a substring of length < 2*length(P) so, of course, in practice,
P only needs to be compared to a small number of substrings for every
starting position.
hasLetterAt
: A logical matrix with one row per element in x
and one column per letter/position to check. When a specified position
is invalid with respect to an element in x
then the corresponding
matrix element is set to NA.
neditAt
: If subject
is an XString object, then
return an integer vector of the same length as at
.
If subject
is an XStringSet object, then return the
integer matrix with length(at)
rows and length(subject)
columns defined by:
sapply(unname(subject), function(x) neditAt(pattern, x, ...))
neditStartingAt
is identical to neditAt
except
that the at
argument is now called starting.at
.
neditEndingAt
is similar to neditAt
except that
the at
argument is now called ending.at
and must contain
the ending positions of the pattern relatively to the subject.
isMatchingAt
: If subject
is an XString object,
then return the logical vector defined by:
min.mismatch <= neditAt(...) <= max.mismatch
If subject
is an XStringSet object, then return the
logical matrix with length(at)
rows and length(subject)
columns defined by:
sapply(unname(subject), function(x) isMatchingAt(pattern, x, ...))
isMatchingStartingAt
is identical to isMatchingAt
except
that the at
argument is now called starting.at
.
isMatchingEndingAt
is similar to isMatchingAt
except that
the at
argument is now called ending.at
and must contain
the ending positions of the pattern relatively to the subject.
which.isMatchingAt
: The default behavior (follow.index=FALSE
)
is as follow. If subject
is an XString object,
then return the single integer defined by:
which(isMatchingAt(...))[1]
If subject
is an XStringSet object, then return
the integer vector defined by:
sapply(unname(subject), function(x) which.isMatchingAt(pattern, x, ...))
If follow.index=TRUE
, then the returned value is defined by:
at[which.isMatchingAt(..., follow.index=FALSE)]
which.isMatchingStartingAt
is identical to which.isMatchingAt
except that the at
argument is now called starting.at
.
which.isMatchingEndingAt
is similar to which.isMatchingAt
except that the at
argument is now called ending.at
and must
contain the ending positions of the pattern relatively to the subject.
## --------------------------------------------------------------------- ## hasLetterAt() ## --------------------------------------------------------------------- x <- DNAStringSet(c("AAACGT", "AACGT", "ACGT", "TAGGA")) hasLetterAt(x, "AAAAAA", 1:6) ## hasLetterAt() can be used to answer questions like: "which elements ## in 'x' have an A at position 2 and a G at position 4?" q1 <- hasLetterAt(x, "AG", c(2, 4)) which(rowSums(q1) == 2) ## or "how many probes in the drosophila2 chip have T, G, T, A at ## position 2, 4, 13 and 20, respectively?" library(drosophila2probe) probes <- DNAStringSet(drosophila2probe) q2 <- hasLetterAt(probes, "TGTA", c(2, 4, 13, 20)) sum(rowSums(q2) == 4) ## or "what's the probability to have an A at position 25 if there is ## one at position 13?" q3 <- hasLetterAt(probes, "AACGT", c(13, 25, 25, 25, 25)) sum(q3[ , 1] & q3[ , 2]) / sum(q3[ , 1]) ## Probabilities to have other bases at position 25 if there is an A ## at position 13: sum(q3[ , 1] & q3[ , 3]) / sum(q3[ , 1]) # C sum(q3[ , 1] & q3[ , 4]) / sum(q3[ , 1]) # G sum(q3[ , 1] & q3[ , 5]) / sum(q3[ , 1]) # T ## See ?nucleotideFrequencyAt for another way to get those results. ## --------------------------------------------------------------------- ## neditAt() / isMatchingAt() / which.isMatchingAt() ## --------------------------------------------------------------------- subject <- DNAString("GTATA") ## Pattern "AT" matches subject "GTATA" at position 3 (exact match) neditAt("AT", subject, at=3) isMatchingAt("AT", subject, at=3) ## ... but not at position 1 neditAt("AT", subject) isMatchingAt("AT", subject) ## ... unless we allow 1 mismatching letter (inexact match) isMatchingAt("AT", subject, max.mismatch=1) ## Here we look at 6 different starting positions and find 3 matches if ## we allow 1 mismatching letter isMatchingAt("AT", subject, at=0:5, max.mismatch=1) ## No match neditAt("NT", subject, at=1:4) isMatchingAt("NT", subject, at=1:4) ## 2 matches if N is interpreted as an ambiguity (fixed=FALSE) neditAt("NT", subject, at=1:4, fixed=FALSE) isMatchingAt("NT", subject, at=1:4, fixed=FALSE) ## max.mismatch != 0 and fixed=FALSE can be used together neditAt("NCA", subject, at=0:5, fixed=FALSE) isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE) some_starts <- c(10:-10, NA, 6) subject <- DNAString("ACGTGCA") is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1) some_starts[is_matching] which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1) which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1, follow.index=TRUE) ## --------------------------------------------------------------------- ## WITH INDELS ## --------------------------------------------------------------------- subject <- BString("ABCDEFxxxCDEFxxxABBCDE") neditAt("ABCDEF", subject, at=9) neditAt("ABCDEF", subject, at=9, with.indels=TRUE) isMatchingAt("ABCDEF", subject, at=9, max.mismatch=1, with.indels=TRUE) isMatchingAt("ABCDEF", subject, at=9, max.mismatch=2, with.indels=TRUE) neditAt("ABCDEF", subject, at=17) neditAt("ABCDEF", subject, at=17, with.indels=TRUE) neditEndingAt("ABCDEF", subject, ending.at=22) neditEndingAt("ABCDEF", subject, ending.at=22, with.indels=TRUE)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.