seqinr: zscore – R documentation

zscore

Statistical over- and under- representation of dinucleotides in a sequence

Description

These two functions compute two different statistics that measure dinucleotide over- and under-representation: the rho statistic and the z-score, each computed for all 16 dinucleotides.

Usage

rho(sequence, wordsize = 2, alphabet = s2c("acgt"))
zscore(sequence, simulations = NULL, modele, exact = FALSE, alphabet = s2c("acgt"), ... )

Arguments

`sequence`	a vector of single characters.
`wordsize`	an integer giving the size of word (n-mer) to consider.
`simulations`	If `NULL`, the analytical solution is computed when available (models `base` and `codon`). Otherwise, it should be the number of permutations used for the z-score computation.
`modele`	A string of characters describing the model chosen for the random generation (see Details for the `base`, `position`, `codon` and `syncodon` models).
`exact`	Whether exact analytical calculation or an approximation should be used
`alphabet`	A vector of single characters.
`...`	Optional parameters for specific model permutations are passed on to `permutation` function.

Details

The rho statistic, as presented in Karlin S., Cardon LR. (1994), can be computed for each of the 16 dinucleotides. It is the frequency of dinucleotide xy divided by the product of the frequencies of nucleotide x and nucleotide y. It is equal to 1.00 when dinucleotide xy occurs as often as expected by pure chance, and it is greater (respectively less) than 1.00 when dinucleotide xy is over- (respectively under-) represented. Note that if you want to reproduce Karlin's results you have to compute the statistic from the sequence concatenated with its reverse complement, that is, with something like rho(c(myseq, rev(comp(myseq)))).
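The definition above can be sketched in base R without seqinr. This is only an illustration of the ratio, assuming overlapping dinucleotide windows; `rho_sketch` is a hypothetical helper, not the package's implementation, and its output is not guaranteed to match `rho()` to the last digit.

```r
# Hypothetical helper (not part of seqinr): rho for all 16 dinucleotides.
rho_sketch <- function(sequence, alphabet = c("a", "c", "g", "t")) {
  n <- length(sequence)
  # Single-base frequencies f(x).
  base_freq <- as.vector(table(factor(sequence, levels = alphabet))) / n
  # Overlapping dinucleotides: positions 1-2, 2-3, ..., (n-1)-n.
  first <- factor(sequence[-n], levels = alphabet)
  second <- factor(sequence[-1], levels = alphabet)
  di_freq <- unclass(table(first, second)) / (n - 1)
  # Frequency expected under independence: f(x) * f(y).
  expected <- outer(base_freq, base_freq)
  # Flatten row by row so the order is aa, ac, ag, at, ca, ...
  rho <- as.vector(t(di_freq / expected))
  names(rho) <- as.vector(t(outer(alphabet, alphabet, paste0)))
  rho
}

myseq <- strsplit("acgtacgt", "")[[1]]
rho_sketch(myseq)

# Sequence concatenated with its reverse complement, here built with
# base R's chartr() instead of seqinr's comp().
both_strands <- c(myseq, rev(chartr("acgt", "tgca", myseq)))
rho_sketch(both_strands)
```

If a base from the alphabet is absent from the sequence, the corresponding ratios are undefined and come out as `NaN` or `Inf` in this sketch.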

The zscore statistic is presented in Palmeira, L., Guéguen, L. and Lobry JR. (2006). The statistic is the normalization of the rho statistic by its expectation and variance according to a given random sequence generation model, and follows the standard normal distribution. This statistic can be computed with several models (cf. permutation for the description of each of the models). We provide analytical calculations for two of them: the base permutations model and the codon permutations model.

The base model allows for random sequence generation by shuffling (with/without replacement) of all bases in the sequence. Analytical computations are available for this model: either as an approximation for large sequences (cf. Palmeira, L., Guéguen, L. and Lobry JR. (2006)), or as the exact analytical formulae (cf. Schbath, S. (1995)).
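When no analytical solution is used, the z-score can be estimated by permutation. The following base R sketch illustrates the idea for the base model without replacement and for a single dinucleotide: the observed rho is centred and scaled by the mean and standard deviation of rho over shuffled sequences. `rho_xy` and `zscore_perm_sketch` are hypothetical names, and this is an assumption about the general procedure, not a copy of what `zscore()` does internally.

```r
# Hypothetical helper: rho for one dinucleotide xy, overlapping windows.
rho_xy <- function(sequence, x, y) {
  n <- length(sequence)
  f_xy <- sum(sequence[-n] == x & sequence[-1] == y) / (n - 1)
  f_xy / (mean(sequence == x) * mean(sequence == y))
}

# Hypothetical helper: permutation z-score under the base model.
zscore_perm_sketch <- function(sequence, x, y, simulations = 500) {
  observed <- rho_xy(sequence, x, y)
  # Base model, without replacement: shuffle all bases of the sequence.
  simulated <- replicate(simulations, rho_xy(sample(sequence), x, y))
  (observed - mean(simulated)) / sd(simulated)
}

set.seed(42)
# Toy sequence in which "cg" is frequent and "cc" never occurs.
biased <- c(rep(c("c", "g"), 100), rep(c("a", "t"), 100))
zscore_perm_sketch(biased, "c", "g")  # strongly positive
zscore_perm_sketch(biased, "c", "c")  # strongly negative
```

Because base frequencies are unchanged by shuffling without replacement, normalizing rho here amounts to normalizing the dinucleotide count.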

The position model allows for random sequence generation by shuffling (with/without replacement) of bases within their position in the codon (bases in position I, II or III stay in position I, II or III in the new sequence).
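A minimal base R sketch of this shuffling, without replacement, is given below. `shuffle_position_sketch` is a hypothetical helper written for illustration; it assumes the sequence is in frame and that its length is a multiple of 3.

```r
# Hypothetical helper: shuffle bases within each codon position.
shuffle_position_sketch <- function(sequence) {
  stopifnot(length(sequence) %% 3 == 0)
  shuffled <- sequence
  for (pos in 1:3) {
    # Indices of all bases at codon position pos (I, II or III).
    idx <- seq(pos, length(sequence), by = 3)
    # Permute those bases among themselves only.
    shuffled[idx] <- sequence[idx[sample.int(length(idx))]]
  }
  shuffled
}

set.seed(1)
cds <- sample(c("a", "c", "g", "t"), size = 30, replace = TRUE)
shuffle_position_sketch(cds)
```

The base composition at each of the three codon positions is preserved, while the order of bases within a position is randomized.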

The codon model allows for random sequence generation by shuffling (with/without replacement) of codons. Analytical computation is available for this model (Gautier, C., Gouy, M. and Louail, S. (1985)).
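Codon shuffling without replacement can be sketched in base R as follows. `shuffle_codon_sketch` is a hypothetical helper for illustration; it assumes an in-frame sequence whose length is a multiple of 3.

```r
# Hypothetical helper: shuffle whole codons, keeping each codon intact.
shuffle_codon_sketch <- function(sequence) {
  stopifnot(length(sequence) %% 3 == 0)
  # One codon per column: rows are codon positions I, II and III.
  codons <- matrix(sequence, nrow = 3)
  # Permute the columns, then flatten back to a vector of single bases.
  as.vector(codons[, sample.int(ncol(codons)), drop = FALSE])
}

set.seed(1)
cds <- sample(c("a", "c", "g", "t"), size = 30, replace = TRUE)
shuffle_codon_sketch(cds)
```

Codon usage is unchanged by this shuffle; only the dinucleotides that span two adjacent codons (position III followed by position I) are affected.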

The syncodon model allows for random sequence generation by shuffling (with/without replacement) of synonymous codons.

Value

a table containing the computed statistic for each dinucleotide

Author(s)

L. Palmeira, J.R. Lobry with suggestions from A. Coghlan.

References

Gautier, C., Gouy, M. and Louail, S. (1985) Non-parametric statistics for nucleic acid sequence study. Biochimie, 67:449-453.

Karlin S. and Cardon LR. (1994) Computational DNA sequence analysis. Annu Rev Microbiol, 48:619-654.

Schbath, S. (1995) Étude asymptotique du nombre d'occurrences d'un mot dans une chaîne de Markov et application à la recherche de mots de fréquence exceptionnelle dans les séquences d'ADN. Thèse de l'Université René Descartes, Paris V

Palmeira, L., Guéguen, L. and Lobry, J.R. (2006) UV-targeted dinucleotides are not depleted in light-exposed Prokaryotic genomes. Molecular Biology and Evolution, 23:2214-2219. https://academic.oup.com/mbe/article/23/11/2214/1335460

citation("seqinr")

Examples

## Not run: 
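library(seqinr)
# Random DNA sequence of 6000 bases
sequence <- sample(x = s2c("acgt"), size = 6000, replace = TRUE)
# rho statistic for the 16 dinucleotides
rho(sequence)
# z-score, base model: analytical approximation, then exact formulae
zscore(sequence, modele = "base")
zscore(sequence, modele = "base", exact = TRUE)
# z-score, codon model (analytical)
zscore(sequence, modele = "codon")
# z-score, syncodon model, estimated from 1000 permutations
zscore(sequence, simulations = 1000, modele = "syncodon")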

## End(Not run)

seqinr

Biological Sequences Retrieval and Analysis

v4.2-16

GPL (>= 2)

Authors

Delphine Charif [aut], Olivier Clerc [ctb], Carolin Frank [ctb], Jean R. Lobry [aut, cph], Anamaria Necşulea [ctb], Leonor Palmeira [ctb], Simon Penel [cre], Guy Perrière [ctb]

Initial release

2022-05-19