Read DNA Sequences from GenBank via Internet
This function connects to the GenBank database, and reads nucleotide sequences using accession numbers given as arguments.
read.GenBank(access.nb, seq.names = access.nb, species.names = TRUE, as.character = FALSE, chunk.size = 400, quiet = TRUE)
access.nb |
a vector of mode character giving the accession numbers. |
seq.names |
the names to give to each sequence; by default the accession numbers are used. |
species.names |
a logical indicating whether to attribute the species names to the returned object. |
as.character |
a logical controlling whether to return the
sequences as an object of class |
chunk.size |
the number of sequences downloaded together (see details). |
quiet |
a logical value indicating whether to show the progress
of the downloads. If |
The function uses the site https://www.ncbi.nlm.nih.gov/ from where the sequences are retrieved.
If species.names = TRUE, the returned list has an attribute "species" containing the names of the species taken from the field “ORGANISM” in GenBank.
Since ape 3.6, this function retrieves the sequences in FASTA format: this is more efficient and more flexible (scaffolds and contigs can be read) than what was done in previous versions. The option gene.names has been removed in ape 5.4; this information is also present in the description.
Setting species.names = FALSE is much faster, which can be useful if you read a series of scaffolds or contigs, or if you already have the species names.
The argument chunk.size is set by default to 400, which is likely to work in many cases. If an error such as “Cannot open file ...” occurs, showing the list of the accession numbers, you may try decreasing chunk.size to 200 or 300.
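The retry advice can be automated. The following is a minimal sketch, not part of ape: the helper name read_genbank_retry and its arguments chunk.sizes and reader are hypothetical. It assumes that a failed download raises an ordinary R error, and simply calls the reader again with the next smaller chunk size.

```r
# Hypothetical helper (not part of ape): retry a download with
# progressively smaller chunk sizes.
# `reader` defaults to ape::read.GenBank, but any function accepting a
# `chunk.size` argument can be supplied (handy for offline testing).
read_genbank_retry <- function(access.nb, chunk.sizes = c(400, 300, 200),
                               reader = ape::read.GenBank, ...) {
  last_error <- NULL
  for (size in chunk.sizes) {
    result <- tryCatch(
      reader(access.nb, chunk.size = size, ...),
      error = function(e) {
        last_error <<- e  # remember the failure for the final message
        NULL
      }
    )
    if (!is.null(result)) return(result)
    message("Download failed with chunk.size = ", size)
  }
  stop("All chunk sizes failed. Last error: ", conditionMessage(last_error))
}
```

With an Internet connection and ape installed, read_genbank_retry(ref) would be used in place of read.GenBank(ref); extra arguments such as species.names = FALSE are passed through to the reader.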
If quiet = FALSE, the display is done chunk by chunk, so the message “Downloading sequences: 400 / 400 ...” means that the download from sequence 1 to sequence 400 is in progress (it is not possible to display a more accurate message because the download method depends on the platform).
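To see which sequences each chunk covers, the chunk boundaries can be computed by hand. This is an illustrative sketch only: chunk_bounds is a hypothetical helper, not an ape function. The printed message assumes that the first number is the index of the last sequence in the current chunk and the second is the total number of accession numbers, which is an interpretation of the message quoted in the text rather than a documented format.

```r
# Hypothetical helper (not part of ape): first and last sequence index
# covered by each chunk when N accession numbers are requested (N >= 1).
chunk_bounds <- function(N, chunk.size = 400) {
  first <- seq(1, N, by = chunk.size)
  last <- pmin(first + chunk.size - 1, N)
  data.frame(first = first, last = last)
}

N <- 950
bounds <- chunk_bounds(N)
# One progress line per chunk (assumed message format, for illustration).
for (i in seq_len(nrow(bounds))) {
  cat("Downloading sequences:", bounds$last[i], "/", N, "...\n")
}
```

With 950 accession numbers and the default chunk size, this gives three chunks, so a progress line refers to a whole block of up to 400 sequences and not to a single sequence.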
A list of DNA sequences made of vectors of class "DNAbin", or of single characters (if as.character = TRUE), with two attributes (species and description).
Emmanuel Paradis
## This won't work if your computer is not connected
## to the Internet

## Get the 8 sequences of tanagers (Ramphocelus)
## as used in Paradis (1997)
ref <- c("U15717", "U15718", "U15719", "U15720",
         "U15721", "U15722", "U15723", "U15724")
## Copy/paste or type the following commands if you
## want to try them.
## Not run:
Rampho <- read.GenBank(ref)
## get the species names:
attr(Rampho, "species")
## build a matrix with the species names and the accession numbers:
cbind(attr(Rampho, "species"), names(Rampho))
## print the first sequence
## (can be done with `Rampho$U15717' as well)
Rampho[[1]]
## the description from each FASTA sequence:
attr(Rampho, "description")
## End(Not run)