mlbench: DNA – R documentation

Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!

DNA

Primate splice-junction gene sequences (DNA)

Description

It consists of 3,186 data points (splice junctions). The data points are described by 180 indicator binary variables and the problem is to recognize the 3 classes (ei, ie, neither), i.e., the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).

The StaLog dna dataset is a processed version of the Irvine database described below. The main difference is that the symbolic variables representing the nucleotides (only A,G,T,C) were replaced by 3 binary indicator variables. Thus the original 60 symbolic attributes were changed into 180 binary attributes. The names of the examples were removed. The examples with ambiguities were removed (there was very few of them, 4). The StatLog version of this dataset was produced by Ross King at Strathclyde University. For original details see the Irvine database documentation.

The nucleotides A,C,G,T were given indicator values as follows:

	A -> 1 0 0
	C -> 0 1 0
	G -> 0 0 1
	T -> 0 0 0

Hint. Much better performance is generally observed if attributes closest to the junction are used. In the StatLog version, this means using attributes A61 to A120 only.

Usage

data(DNA)

Format

A data frame with 3,186 observations on 180 variables, all nominal and a target class.

Source

Source:
- all examples taken from Genbank 64.1 (ftp site: genbank.bio.net)
- categories "ei" and "ie" include every "split-gene" for primates in Genbank 64.1
- non-splice examples taken from sequences known not to include a splicing site
Donor: G. Towell, M. Noordewier, and J. Shavlik, towell,shavlik@cs.wisc.edu, noordewi@cs.rutgers.edu

These data have been taken from:

ftp.stams.strath.ac.uk/pub/Statlog

and were converted to R format by Evgenia Dimitriadou.

References

machine learning:
– M. O. Noordewier and G. G. Towell and J. W. Shavlik, 1991; "Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences". Advances in Neural Information Processing Systems, volume 3, Morgan Kaufmann.

– G. G. Towell and J. W. Shavlik and M. W. Craven, 1991; "Constructive Induction in Knowledge-Based Neural Networks", In Proceedings of the Eighth International Machine Learning Workshop, Morgan Kaufmann.

– G. G. Towell, 1991; "Symbolic Knowledge and Neural Networks: Insertion, Refinement, and Extraction", PhD Thesis, University of Wisconsin - Madison.

– G. G. Towell and J. W. Shavlik, 1992; "Interpretation of Artificial Neural Networks: Mapping Knowledge-based Neural Networks into Rules", In Advances in Neural Information Processing Systems, volume 4, Morgan Kaufmann.

Examples

data(DNA)
summary(DNA)

mlbench

Machine Learning Benchmark Problems

v2.1-3

GPL-2

Authors

Friedrich Leisch and Evgenia Dimitriadou.

Initial release

2021-01-21

DNA