PDict objects
The PDict class is a container for storing a preprocessed dictionary of DNA patterns that can later be passed to the matchPDict function for fast matching against a reference sequence (the subject).
PDict is the constructor function for creating new PDict objects.
PDict(x, max.mismatch=NA, tb.start=NA, tb.end=NA, tb.width=NA, algorithm="ACtree2", skip.invalid.patterns=FALSE)
x | A character vector, a DNAStringSet object or an XStringViews object with a DNAString subject. |
max.mismatch | A single non-negative integer or NA (the default). |
tb.start,tb.end,tb.width | A single integer or NA (the default). |
algorithm | "ACtree2" (the default) or "Twobit". |
skip.invalid.patterns | This argument is not supported yet (and might in fact be replaced by the |
THIS IS STILL WORK IN PROGRESS!
If the original dictionary x is a character vector or an XStringViews object with a DNAString subject, then the PDict constructor will first try to turn it into a DNAStringSet object.
By default (i.e. if PDict is called with max.mismatch=NA, tb.start=NA, tb.end=NA and tb.width=NA) the following limitations apply: (1) the original dictionary can only contain base letters (i.e. only As, Cs, Gs and Ts), therefore IUPAC ambiguity codes are not allowed; (2) all the patterns in the dictionary must have the same length ("constant width" dictionary); and (3) later matchPDict can only be used with max.mismatch=0.
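Limitations (1) and (2) can be checked on a plain character vector before calling the constructor. The following is a minimal base-R sketch; the helper name check_default_dict is hypothetical and is not part of Biostrings.

```r
# Hypothetical helper (not part of Biostrings): checks limitations (1) and (2)
# on a plain character vector of patterns.
check_default_dict <- function(dict) {
  # (1) Only base letters: no IUPAC ambiguity codes such as N.
  only_bases <- grepl("^[ACGT]+$", dict)
  # (2) Constant width: every pattern has the same number of letters.
  constant_width <- length(unique(nchar(dict))) == 1L
  list(only_bases     = all(only_bases),
       constant_width = constant_width,
       offending      = dict[!only_bases])
}

ok  <- check_default_dict(c("ACGT", "GGTA", "TTAC"))
bad <- check_default_dict(c("ACNG", "GT", "CGT", "AC"))
bad$offending   # the patterns that contain non-base letters
```

A dictionary that fails either check needs a Trusted Band narrower than the full patterns in order to be preprocessed.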
A Trusted Band can be used in order to relax these limitations (see the "Trusted Band" section below).
If you are planning to use the resulting PDict object in order to do inexact matching where valid hits are allowed to have a small number of mismatching letters, then see the "Allowing a small number of mismatching letters" section below.
Two preprocessing algorithms are currently supported: algorithm="ACtree2" (the default) and algorithm="Twobit".
With the "ACtree2" algorithm, all the oligonucleotides in the Trusted Band are stored in a 4-ary Aho-Corasick tree.
With the "Twobit" algorithm, the 2-bit-per-letter signatures of all the oligonucleotides in the Trusted Band are computed and the mapping from these signatures to the 1-based position of the corresponding oligonucleotide in the Trusted Band is stored in a way that allows very fast lookup.
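The idea behind a 2-bit-per-letter signature can be sketched in base R as follows. This is only an illustration: the letter-to-code mapping (A=0, C=1, G=2, T=3), the helper name twobit_signature and the use of match() for the lookup are assumptions, not the actual Biostrings implementation.

```r
# Illustrative only. The mapping A=0, C=1, G=2, T=3 is an assumption and
# twobit_signature is a hypothetical helper, not a Biostrings function.
twobit_signature <- function(pattern) {
  codes <- c(A = 0, C = 1, G = 2, T = 3)
  chars <- strsplit(pattern, "", fixed = TRUE)[[1]]
  vals <- codes[chars]
  if (anyNA(vals)) stop("pattern contains non-base letters: ", pattern)
  # Read the pattern as a base-4 number, i.e. 2 bits per letter.
  # Doubles represent this exactly for patterns of up to 26 letters.
  sum(vals * 4^rev(seq_along(vals) - 1))
}

tb   <- c("ACGT", "GGTA", "TTAC")          # a toy Trusted Band
sigs <- vapply(tb, twobit_signature, numeric(1), USE.NAMES = FALSE)

# Map the signature of a subject window back to the 1-based position
# of the corresponding oligonucleotide in the Trusted Band.
hit  <- match(twobit_signature("GGTA"), sigs)   # 2
miss <- match(twobit_signature("AAAA"), sigs)   # NA: not in the dictionary
```

Because a signature is computed from base letters only, this kind of encoding has no room for IUPAC ambiguity codes.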
Only PDict objects preprocessed with the "ACtree2" algorithm can then be used with matchPDict (and family) and with fixed="pattern" (instead of fixed=TRUE, the default), so that IUPAC ambiguity codes in the subject are treated as ambiguities. PDict objects obtained with the "Twobit" algorithm don't allow this.
See ?`matchPDict-inexact` for more information about support of IUPAC ambiguity codes in the subject.
What's a Trusted Band?
A Trusted Band is a region defined in the original dictionary where the limitations described above will apply.
Why use a Trusted Band?
Because the limitations described above will apply to the Trusted Band only!
For example the Trusted Band cannot contain IUPAC ambiguity codes but the
"head" and the "tail" can (see below for what those are).
Also with a Trusted Band, if matchPDict is called with a nonzero max.mismatch value then mismatching letters will be allowed in the head and the tail. Or, if matchPDict is called with fixed="subject", then IUPAC ambiguity codes in the head and the tail will be treated as ambiguities.
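The effect of a Trusted Band on inexact matching can be illustrated with a naive base-R scan for a single pattern. This is a sketch under stated assumptions, not how matchPDict is implemented: the helpers count_mismatches and match_with_tb are hypothetical, the Trusted Band is required to match exactly, and the head and the tail are assumed to share one max.mismatch budget.

```r
# Hypothetical helpers, not part of Biostrings.
# Number of positions where two equal-length strings differ
# (plain letter comparison, no IUPAC handling).
count_mismatches <- function(a, b) sum(charToRaw(a) != charToRaw(b))

# Start positions in 'subject' where head+tb+tail matches, with the
# Trusted Band matched exactly and mismatches allowed in head and tail.
match_with_tb <- function(head, tb, tail, subject, max.mismatch = 0L) {
  width <- nchar(head) + nchar(tb) + nchar(tail)
  n_windows <- nchar(subject) - width + 1L
  if (n_windows < 1L) return(integer(0))
  starts <- seq_len(n_windows)
  keep <- vapply(starts, function(s) {
    tb_start <- s + nchar(head)
    tail_start <- tb_start + nchar(tb)
    # No mismatch is tolerated inside the Trusted Band.
    if (substr(subject, tb_start, tail_start - 1L) != tb) return(FALSE)
    # Assumption: head and tail share the mismatch budget.
    mm <- count_mismatches(head, substr(subject, s, tb_start - 1L)) +
      count_mismatches(tail, substr(subject, tail_start, s + width - 1L))
    mm <= max.mismatch
  }, logical(1))
  starts[keep]
}

subject <- "TTACGAACGT"
exact   <- match_with_tb("", "AC", "GT", subject, max.mismatch = 0L)  # 7
inexact <- match_with_tb("", "AC", "GT", subject, max.mismatch = 1L)  # 3 7
```

Here the pattern ACGT has no head, the Trusted Band AC and the tail GT. The hit at position 3 (ACGA) is only reported when one mismatch is allowed, and that mismatch falls in the tail.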
How to specify a Trusted Band?
Use the tb.start, tb.end and tb.width arguments of the PDict constructor in order to specify a Trusted Band.
This will divide each pattern in the original dictionary into three parts:
a left part, a middle part and a right part.
The middle part is defined by its starting and ending nucleotide positions given relative to each pattern through the tb.start, tb.end and tb.width arguments. It must have the same length for all patterns (this common length is called the width of the Trusted Band).
The left and right parts are defined implicitly: they are the parts that remain before (prefix) and after (suffix) the middle part, respectively.
Therefore three DNAStringSet objects result from this division:
the first one is made of all the left parts and forms the head of the PDict
object, the second one is made of all the middle parts and forms the Trusted
Band of the PDict object, and the third one is made of all the right parts
and forms the tail of the PDict object.
In other words, you can think of the process of specifying a Trusted Band as drawing 2 vertical lines on the original dictionary (note that these 2 lines are not necessarily straight lines but the horizontal space between them must be constant). When doing this, you are dividing the dictionary into three regions (from left to right): the head, the Trusted Band and the tail. Each of them is a DNAStringSet object with the same number of elements as the original dictionary, and the original dictionary could easily be reconstructed from those three regions.
The width of the Trusted Band must be >= 1 because Trusted Bands of width 0 are not supported.
Finally note that calling PDict with tb.start=NA, tb.end=NA and tb.width=NA (the default) is equivalent to calling it with tb.start=1, tb.end=-1 and tb.width=NA, which results in a full-width Trusted Band i.e. a Trusted Band that covers the entire dictionary (no head and no tail).
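The division into head, Trusted Band and tail can be mimicked on a plain character vector with base R. The sketch below is only an illustration: split_trusted_band is a hypothetical helper, not part of Biostrings, it supports tb.start and tb.end only (not tb.width), and it treats negative positions as counted from the end of each pattern, as in the tb.end=-1 default described above.

```r
# Hypothetical helper, not part of Biostrings. Splits each pattern into
# head / Trusted Band / tail. Negative positions are relative to the end
# of each pattern (-1 is the last letter).
split_trusted_band <- function(dict, tb.start = 1L, tb.end = -1L) {
  w <- nchar(dict)
  start <- if (tb.start < 0) w + tb.start + 1L else rep(tb.start, length(w))
  end   <- if (tb.end < 0)   w + tb.end + 1L   else rep(tb.end, length(w))
  tb_width <- unique(end - start + 1L)
  if (length(tb_width) != 1L || tb_width < 1L)
    stop("the Trusted Band must have a constant width >= 1")
  if (any(start < 1L | end > w))
    stop("the Trusted Band must lie within every pattern")
  list(head = substr(dict, 1L, start - 1L),
       tb   = substr(dict, start, end),
       tail = substr(dict, end + 1L, w))
}

dict1 <- c("ACNG", "GT", "CGT", "AC")
parts <- split_trusted_band(dict1, tb.end = 2)
parts$tb     # "AC" "GT" "CG" "AC"
parts$tail   # "NG" ""   "T"  ""

# The original dictionary can be rebuilt from the three regions.
rebuilt <- paste0(parts$head, parts$tb, parts$tail)
```

With the default arguments the band spans each whole pattern, so the same call on dict1 stops with an error: the patterns have different lengths and a full-width band would not have a constant width.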
Allowing a small number of mismatching letters
[TODO]
In the code snippets below, x is a PDict object.
length(x): The number of patterns in x.
width(x): A vector of non-negative integers containing the number of letters for each pattern in x.
names(x): The names of the patterns in x.
head(x): The head of x or NULL if x has no head.
tb(x): The Trusted Band defined on x.
tb.width(x): The width of the Trusted Band defined on x. Note that, unlike width(tb(x)), this is a single integer. And because the Trusted Band has a constant width, tb.width(x) is in fact equivalent to unique(width(tb(x))), or to width(tb(x))[1].
tail(x): The tail of x or NULL if x has no tail.
In the code snippets below, x is a PDict object.
x[[i]]: Extract the i-th pattern from x as a DNAString object.
In the code snippets below, x is a PDict object.
duplicated(x): [TODO]
patternFrequency(x): [TODO]
H. Pagès
Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333-340.
## ---------------------------------------------------------------------
## A. NO HEAD AND NO TAIL (THE DEFAULT)
## ---------------------------------------------------------------------
library(Biostrings)
library(drosophila2probe)
dict0 <- DNAStringSet(drosophila2probe)
dict0                                # The original dictionary.
length(dict0)                        # Hundreds of thousands of patterns.
unique(nchar(dict0))                 # Patterns are 25-mers.

pdict0 <- PDict(dict0)               # Store the original dictionary in
                                     # a PDict object (preprocessing).
pdict0
class(pdict0)
length(pdict0)                       # Same as length(dict0).
tb.width(pdict0)                     # The width of the (implicit)
                                     # Trusted Band.
sum(duplicated(pdict0))
table(patternFrequency(pdict0))      # 9 patterns are repeated 3 times.
pdict0[[1]]
pdict0[[5]]

## ---------------------------------------------------------------------
## B. NO HEAD AND A TAIL
## ---------------------------------------------------------------------
dict1 <- c("ACNG", "GT", "CGT", "AC")
pdict1 <- PDict(dict1, tb.end=2)
pdict1
class(pdict1)
length(pdict1)
width(pdict1)
head(pdict1)
tb(pdict1)
tb.width(pdict1)
width(tb(pdict1))
tail(pdict1)
pdict1[[3]]