Biostrings: substitution_matrices – R documentation

Pricing

Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!

Get Started for Free

Documentation

Biostrings

substitution_matrices

Scoring matrices

Description

Predefined substitution matrices for nucleotide and amino acid alignments.

Usage

data(BLOSUM45)
data(BLOSUM50)
data(BLOSUM62)
data(BLOSUM80)
data(BLOSUM100)
data(PAM30)
data(PAM40)
data(PAM70)
data(PAM120)
data(PAM250)
nucleotideSubstitutionMatrix(match = 1, mismatch = 0, baseOnly = FALSE, type = "DNA")
qualitySubstitutionMatrices(fuzzyMatch = c(0, 1), alphabetLength = 4L, qualityClass = "PhredQuality", bitScale = 1)
errorSubstitutionMatrices(errorProbability, fuzzyMatch = c(0, 1), alphabetLength = 4L, bitScale = 1)

Arguments

`match`	the scoring for a nucleotide match.
`mismatch`	the scoring for a nucleotide mismatch.
`baseOnly`	`TRUE` or `FALSE`. If `TRUE`, only uses the letters in the "base" alphabet i.e. "A", "C", "G", "T".
`type`	either "DNA" or "RNA".
`fuzzyMatch`	a named or unnamed numeric vector representing the base match probability.
`errorProbability`	a named or unnamed numeric vector representing the error probability.
`alphabetLength`	an integer representing the number of letters in the underlying string alphabet. For DNA and RNA, this would be 4L. For Amino Acids, this could be 20L.
`qualityClass`	a character string of `"PhredQuality"`, `"SolexaQuality"`, or `"IlluminaQuality"`.
`bitScale`	a numeric value to scale the quality-based substitution matrices. By default, this is 1, representing bit-scale scoring.

Format

The BLOSUM and PAM matrices are square symmetric matrices with integer coefficients, whose row and column names are identical and unique: each name is a single letter representing a nucleotide or an amino acid.

nucleotideSubstitutionMatrix produces a substitution matrix for all IUPAC nucleic acid codes based upon match and mismatch parameters.

errorSubstitutionMatrices produces a two element list of numeric square symmetric matrices, one for matches and one for mismatches.

qualitySubstitutionMatrices produces the substitution matrices for Phred or Solexa quality-based reads.

Details

The BLOSUM and PAM matrices are not unique. For example, the definition of the widely used BLOSUM62 matrix varies depending on the source, and even a given source can provide different versions of "BLOSUM62" without keeping track of the changes over time. NCBI provides many matrices here ftp://ftp.ncbi.nih.gov/blast/matrices/ but their definitions don't match those of the matrices bundled with their stand-alone BLAST software available here ftp://ftp.ncbi.nih.gov/blast/

The BLOSUM45, BLOSUM62, BLOSUM80, PAM30 and PAM70 matrices were taken from NCBI stand-alone BLAST software.

The BLOSUM50, BLOSUM100, PAM40, PAM120 and PAM250 matrices were taken from ftp://ftp.ncbi.nih.gov/blast/matrices/

The quality matrices computed in qualitySubstitutionMatrices are based on the paper by Ketil Malde. Let ε_i be the probability of an error in the base read. For "Phred" quality measures Q in [0, 99], these error probabilities are given by ε_i = 10^{-Q/10}. For "Solexa" quality measures Q in [-5, 99], they are given by ε_i = 1 - 1/(1 + 10^{-Q/10}). Assuming independence within and between base reads, the combined error probability of a mismatch when the underlying bases do match is ε_c = ε_1 + ε_2 - (n/(n-1)) * ε_1 * ε_2, where n is the number of letters in the underlying alphabet. Using ε_c, the substitution score is given by when two bases match is given by b * \log_2(γ_{x,y} * (1 - ε_c) * n + (1 - γ_{x,y}) * ε_c * (n/(n-1))), where b is the bit-scaling for the scoring and γ_{x,y} is the probability that characters x and y represents the same underlying information (e.g. using IUPAC, γ_{A,A} = 1 and γ_{A,N} = 1/4. In the arguments listed above fuzzyMatch represents γ_{x,y} and errorProbability represents ε_i.

Author(s)

H. Pagès and P. Aboyoun

References

K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics, Feb 23, 2008.

Examples

s1 <- 
    DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
  s2 <-
    DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")

  ## Fit a global pairwise alignment using edit distance scoring
  pairwiseAlignment(s1, s2,
                    substitutionMatrix = nucleotideSubstitutionMatrix(0, -1, TRUE),
                    gapOpening = 0, gapExtension = 1)

  ## Examine quality-based match and mismatch bit scores for DNA/RNA
  ## strings in pairwiseAlignment.
  ## By default patternQuality and subjectQuality are PhredQuality(22L).
  qualityMatrices <- qualitySubstitutionMatrices()
  qualityMatrices["22", "22", "1"]
  qualityMatrices["22", "22", "0"]

  pairwiseAlignment(s1, s2)

  ## Get the substitution scores when the error probability is 0.1
  subscores <- errorSubstitutionMatrices(errorProbability = 0.1)
  submat <- matrix(subscores[,,"0"], 4, 4)
  diag(submat) <- subscores[,,"1"]
  dimnames(submat) <- list(DNA_ALPHABET[1:4], DNA_ALPHABET[1:4])
  submat
  pairwiseAlignment(s1, s2, substitutionMatrix = submat)

  ## Align two amino acid sequences with the BLOSUM62 matrix
  aa1 <- AAString("HXBLVYMGCHFDCXVBEHIKQZ")
  aa2 <- AAString("QRNYMYCFQCISGNEYKQN")
  pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM62", gapOpening = 3, gapExtension = 1)

  ## See how the gap penalty influences the alignment
  pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM62", gapOpening = 6, gapExtension = 2)

  ## See how the substitution matrix influences the alignment
  pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM50", gapOpening = 3, gapExtension = 1)

  if (interactive()) {
    ## Compare our BLOSUM62 with BLOSUM62 from ftp://ftp.ncbi.nih.gov/blast/matrices/
    data(BLOSUM62)
    BLOSUM62["Q", "Z"]
    file <- "ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62"
    b62 <- as.matrix(read.table(file, check.names=FALSE))
    b62["Q", "Z"]
  }

Biostrings

Efficient manipulation of biological strings

v2.58.0

Artistic-2.0

Authors

H. Pagès, P. Aboyoun, R. Gentleman, and S. DebRoy

Initial release

substitution_matrices

Description

Usage

Arguments

Format

Details

Author(s)

References

See Also

Examples

Biostrings

We don't support your browser anymore