Scoring matrices
Predefined substitution matrices for nucleotide and amino acid alignments.
data(BLOSUM45) data(BLOSUM50) data(BLOSUM62) data(BLOSUM80) data(BLOSUM100) data(PAM30) data(PAM40) data(PAM70) data(PAM120) data(PAM250) nucleotideSubstitutionMatrix(match = 1, mismatch = 0, baseOnly = FALSE, type = "DNA") qualitySubstitutionMatrices(fuzzyMatch = c(0, 1), alphabetLength = 4L, qualityClass = "PhredQuality", bitScale = 1) errorSubstitutionMatrices(errorProbability, fuzzyMatch = c(0, 1), alphabetLength = 4L, bitScale = 1)
match |
the scoring for a nucleotide match. |
mismatch |
the scoring for a nucleotide mismatch. |
baseOnly |
|
type |
either "DNA" or "RNA". |
fuzzyMatch |
a named or unnamed numeric vector representing the base match probability. |
errorProbability |
a named or unnamed numeric vector representing the error probability. |
alphabetLength |
an integer representing the number of letters in the underlying string alphabet. For DNA and RNA, this would be 4L. For Amino Acids, this could be 20L. |
qualityClass |
a character string of |
bitScale |
a numeric value to scale the quality-based substitution matrices. By default, this is 1, representing bit-scale scoring. |
The BLOSUM and PAM matrices are square symmetric matrices with integer coefficients, whose row and column names are identical and unique: each name is a single letter representing a nucleotide or an amino acid.
nucleotideSubstitutionMatrix
produces a substitution matrix for all IUPAC
nucleic acid codes based upon match and mismatch parameters.
errorSubstitutionMatrices
produces a two element list of numeric
square symmetric matrices, one for matches and one for mismatches.
qualitySubstitutionMatrices
produces the substitution matrices
for Phred or Solexa quality-based reads.
The BLOSUM and PAM matrices are not unique. For example, the definition of the widely used BLOSUM62 matrix varies depending on the source, and even a given source can provide different versions of "BLOSUM62" without keeping track of the changes over time. NCBI provides many matrices here ftp://ftp.ncbi.nih.gov/blast/matrices/ but their definitions don't match those of the matrices bundled with their stand-alone BLAST software available here ftp://ftp.ncbi.nih.gov/blast/
The BLOSUM45, BLOSUM62, BLOSUM80, PAM30 and PAM70 matrices were taken from NCBI stand-alone BLAST software.
The BLOSUM50, BLOSUM100, PAM40, PAM120 and PAM250 matrices were taken from ftp://ftp.ncbi.nih.gov/blast/matrices/
The quality matrices computed in qualitySubstitutionMatrices
are
based on the paper by Ketil Malde. Let ε_i be the probability
of an error in the base read. For "Phred"
quality measures Q
in [0, 99], these error probabilities are given by
ε_i = 10^{-Q/10}. For "Solexa"
quality measures
Q in [-5, 99], they are given by
ε_i = 1 - 1/(1 + 10^{-Q/10}).
Assuming independence within and between base reads, the combined error
probability of a mismatch when the underlying bases do match is
ε_c = ε_1 + ε_2 - (n/(n-1)) * ε_1 * ε_2,
where n is the number of letters in the underlying alphabet. Using
ε_c, the substitution score is given by when two bases match is given by
b * \log_2(γ_{x,y} * (1 - ε_c) * n + (1 - γ_{x,y}) * ε_c * (n/(n-1))),
where b is the bit-scaling for the scoring and γ_{x,y} is the
probability that characters x and y represents the same underlying
information (e.g. using IUPAC, γ_{A,A} = 1 and γ_{A,N} = 1/4.
In the arguments listed above fuzzyMatch
represents γ_{x,y}
and errorProbability
represents ε_i.
H. Pagès and P. Aboyoun
K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics, Feb 23, 2008.
s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC") ## Fit a global pairwise alignment using edit distance scoring pairwiseAlignment(s1, s2, substitutionMatrix = nucleotideSubstitutionMatrix(0, -1, TRUE), gapOpening = 0, gapExtension = 1) ## Examine quality-based match and mismatch bit scores for DNA/RNA ## strings in pairwiseAlignment. ## By default patternQuality and subjectQuality are PhredQuality(22L). qualityMatrices <- qualitySubstitutionMatrices() qualityMatrices["22", "22", "1"] qualityMatrices["22", "22", "0"] pairwiseAlignment(s1, s2) ## Get the substitution scores when the error probability is 0.1 subscores <- errorSubstitutionMatrices(errorProbability = 0.1) submat <- matrix(subscores[,,"0"], 4, 4) diag(submat) <- subscores[,,"1"] dimnames(submat) <- list(DNA_ALPHABET[1:4], DNA_ALPHABET[1:4]) submat pairwiseAlignment(s1, s2, substitutionMatrix = submat) ## Align two amino acid sequences with the BLOSUM62 matrix aa1 <- AAString("HXBLVYMGCHFDCXVBEHIKQZ") aa2 <- AAString("QRNYMYCFQCISGNEYKQN") pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM62", gapOpening = 3, gapExtension = 1) ## See how the gap penalty influences the alignment pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM62", gapOpening = 6, gapExtension = 2) ## See how the substitution matrix influences the alignment pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM50", gapOpening = 3, gapExtension = 1) if (interactive()) { ## Compare our BLOSUM62 with BLOSUM62 from ftp://ftp.ncbi.nih.gov/blast/matrices/ data(BLOSUM62) BLOSUM62["Q", "Z"] file <- "ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62" b62 <- as.matrix(read.table(file, check.names=FALSE)) b62["Q", "Z"] }
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.