
recstat

Prediction of Coding DNA Sequences.


Description

This function predicts the positions of Coding DNA Sequences (CDS) using a Correspondence Analysis (CA) computed on codon composition, carried out for each of the three reading frames of a DNA strand.

Usage

recstat(seq, sizewin = 90, shift = 30, seqname = "no name")

Arguments

seq

a nucleic acid sequence as a vector of characters

sizewin

an integer, multiple of 3, giving the length of the sliding window

shift

an integer, multiple of 3, giving the step size between two consecutive windows

seqname

the name of the sequence
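
As a rough illustration of how sizewin and shift interact, the sketch below computes window start positions for one reading frame in base R. The helper window_starts is hypothetical and not part of seqinr; recstat may handle the end of the sequence differently, so the exact number of windows it produces can differ.

```r
# Illustrative only: `window_starts` is a hypothetical helper, not a seqinr function.
window_starts <- function(seqsize, frame = 1, sizewin = 90, shift = 30) {
  # Both parameters must be multiples of 3 so that every window
  # of a given frame stays in that frame.
  stopifnot(sizewin %% 3 == 0, shift %% 3 == 0, frame %in% 1:3)
  last <- seqsize - sizewin + 1   # last start that still fits a full window
  if (last < frame) return(integer(0))
  seq(from = frame, to = last, by = shift)
}

starts <- window_starts(500)   # frame 1 of a 500 bp sequence
length(starts)                 # number of full windows in this frame
head(starts)
```

Because shift is a multiple of 3, all starts returned for a given frame share the same position modulo 3.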

Details

The method is built on the hypothesis that the codon composition of a CDS is biased, whereas that of non-coding regions is not. To detect such bias, a CA on codon frequencies is computed on the six possible reading frames of a DNA sequence (three from the direct strand and three from the reverse strand). When there is a CDS in one of the reading frames, the CA factor scores observed in this frame (for both rows and columns) are expected to differ significantly from those in the other two.
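
To make the idea of per-frame codon composition concrete, here is a minimal base R sketch that counts codons in each of the three frames of a short window and of its reverse complement. The helpers codon_counts and rev_comp are hypothetical names introduced for illustration; this is not the code recstat uses internally, and the CA step itself is not shown.

```r
# Hypothetical helpers for illustration; not functions from seqinr.
codon_counts <- function(s, frame = 1) {
  bases <- c("a", "c", "g", "t")
  # All 64 codons, so that every count vector has the same layout
  codons <- as.vector(outer(outer(bases, bases, paste0), bases, paste0))
  stopifnot(frame %in% 1:3, length(s) >= frame + 2)
  s <- tolower(s[frame:length(s)])
  n <- length(s) %/% 3                      # number of complete codons
  # Split into non-overlapping triplets and paste each one together
  trip <- apply(matrix(s[seq_len(3 * n)], nrow = 3), 2, paste, collapse = "")
  counts <- table(factor(trip, levels = codons))
  setNames(as.integer(counts), codons)
}

rev_comp <- function(s) {
  # Only a, c, g, t are handled; any other symbol becomes NA
  comp <- c(a = "t", c = "g", g = "c", t = "a")
  unname(comp[rev(tolower(s))])
}

window <- c("a", "t", "g", "g", "c", "a", "t", "a", "a")
direct  <- sapply(1:3, function(f) codon_counts(window, f))
reverse <- sapply(1:3, function(f) codon_counts(rev_comp(window), f))
dim(direct)   # one row per codon, one column per frame
```

In a real analysis one such count vector would be built per window and per frame, giving the kind of table on which a CA can be computed.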

Value

This function returns a list containing the following components:

seq

a single DNA sequence as a vector of characters

sizewin

length of the sliding window

shift

length of the steps between windows

seqsize

length of the sequence

seqname

name of the sequence

vdep

a vector containing the start positions of the windows

vind

a vector containing the reading frame of each window

vstopd

a vector of stop codon positions in the direct strand

vstopr

a vector of stop codon positions in the reverse strand

vinitd

a vector of start codon positions in the direct strand

vinitr

a vector of start codon positions in the reverse strand

resd

a matrix containing codon frequencies for all the windows in the three frames of the direct strand

resr

a matrix containing codon frequencies for all the windows in the three frames of the reverse strand

resd.coa

a list of class coa and dudi containing the result of the CA computed on the codon frequencies in the direct strand

resr.coa

a list of class coa and dudi containing the result of the CA computed on the codon frequencies in the reverse strand
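
The components listed above can be inspected directly on the returned list. The following is a minimal sketch, assuming seqinr is installed and using the example sequence shipped with the package; the eig and li elements are standard components of dudi objects, and the variable name ecounc is chosen here for illustration.

```r
library(seqinr)

# Example sequence shipped with seqinr
ff <- system.file("sequences/ECOUNC.fsa", package = "seqinr")
ecounc <- read.fasta(ff)
rec <- recstat(ecounc[[1]], seqname = getName(ecounc))

names(rec)              # component names described in this section
table(rec$vind)         # number of windows in each reading frame
dim(rec$resd)           # codon frequency matrix for the direct strand
rec$resd.coa$eig        # eigenvalues of the CA on the direct strand
head(rec$resd.coa$li)   # row factor scores of the CA
```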

Note

This method only works with DNA sequences long enough to yield a sufficient number of windows. As the optimal window length has been estimated to be 90 bp by Fichant and Gautier (1987), the minimal sequence length is around 500 bp. The method can be used on both prokaryotic and eukaryotic sequences. Only the first four factors of the CA are kept; most of the time, only the first factor is relevant for detecting CDS.

Author(s)

O. Clerc, G. Perrière

References

The original paper describing recstat is:

Fichant, G., Gautier, C. (1987) Statistical method for predicting protein coding regions in nucleic acid sequences. Comput. Appl. Biosci., 3, 287–295.
https://academic.oup.com/bioinformatics/article-abstract/3/4/287/218186

See Also

Examples

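library(seqinr)
# Locate and read the example E. coli sequence shipped with seqinr
ff <- system.file("sequences/ECOUNC.fsa", package = "seqinr")
seq <- read.fasta(ff)
# Run recstat with the default window length (90) and shift (30)
rec <- recstat(seq[[1]], seqname = getName(seq))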

seqinr

Biological Sequences Retrieval and Analysis

v4.2-16
GPL (>= 2)
Authors
Delphine Charif [aut], Olivier Clerc [ctb], Carolin Frank [ctb], Jean R. Lobry [aut, cph], Anamaria Necşulea [ctb], Leonor Palmeira [ctb], Simon Penel [cre], Guy Perrière [ctb]
Initial release
2022-05-19
