getChromInfoFromNCBI

Get chromosome information for an NCBI assembly


Description

getChromInfoFromNCBI returns chromosome information (sequence names, lengths, and circularity flags) for a given NCBI assembly, such as GRCh38, ARS-UCD1.2, or R64.

Note that getChromInfoFromNCBI behaves slightly differently depending on whether the assembly is registered in the GenomeInfoDb package or not. See below for the details.

Use registered_NCBI_assemblies to list all the NCBI assemblies currently registered in the GenomeInfoDb package.

Usage

getChromInfoFromNCBI(assembly,
                     assembled.molecules.only=FALSE,
                     assembly.units=NULL,
                     recache=FALSE,
                     as.Seqinfo=FALSE)

registered_NCBI_assemblies()

Arguments

assembly

A single string specifying the name of an NCBI assembly (e.g. "GRCh38"). Alternatively, an assembly accession (GenBank or RefSeq) can be supplied (e.g. "GCF_000001405.12").

assembled.molecules.only

If FALSE (the default) then chromosome information is returned for all the sequences in the assembly (unless assembly.units is specified, see below), that is, for all the chromosomes, plasmids, and scaffolds.

If TRUE then chromosome information is returned only for the assembled molecules, that is, for the chromosomes (including the mitochondrial chromosome) and plasmids. Scaffolds are excluded.

assembly.units

If NULL (the default) then chromosome information is returned for all the sequences in the assembly (unless assembled.molecules.only is set to TRUE, see above), that is, for all the chromosomes, plasmids, and scaffolds.

assembly.units can be set to a character vector containing the names of Assembly Units (e.g. "non-nuclear") in which case chromosome information is returned only for the sequences that belong to these Assembly Units.

recache

getChromInfoFromNCBI caches the chromosome information in memory, so a given assembly is downloaded only once during the current R session. The cached information does NOT persist across sessions. Setting recache to TRUE forces a new download (and recaching) of the chromosome information for the specified assembly.

as.Seqinfo

TRUE or FALSE (the default). If TRUE then a Seqinfo object is returned instead of a data frame. Note that only the SequenceName, SequenceLength, and circular columns of the data frame are used to make the Seqinfo object. All the other columns are ignored (and lost).
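
To make the column selection concrete, here is a minimal sketch that uses a small hand-made data frame in place of a real download (the rows and the reduced set of columns are illustrative only). It separates the three columns that are kept from those that are lost, then, only if GenomeInfoDb is installed, builds a Seqinfo object by hand with the Seqinfo() constructor. This is a rough equivalent for illustration, not necessarily how as.Seqinfo=TRUE is implemented internally.

```r
## Hand-made stand-in for a getChromInfoFromNCBI() result (illustrative rows,
## with only a subset of the real columns).
chrominfo <- data.frame(
  SequenceName   = c("1", "MT"),
  SequenceRole   = factor(c("assembled-molecule", "assembled-molecule")),
  SequenceLength = c(248956422L, 16569L),
  circular       = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

## Only these three columns are carried over; the others are dropped.
kept <- chrominfo[ , c("SequenceName", "SequenceLength", "circular")]
dropped <- setdiff(colnames(chrominfo), colnames(kept))

## Optional step: build the Seqinfo object by hand if GenomeInfoDb is available.
if (requireNamespace("GenomeInfoDb", quietly = TRUE)) {
  si <- GenomeInfoDb::Seqinfo(
    seqnames   = kept$SequenceName,
    seqlengths = kept$SequenceLength,
    isCircular = kept$circular
  )
}
```

If columns such as SequenceRole or AssemblyUnit are needed later, keep the data frame around as well, since they cannot be recovered from the Seqinfo object.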

Details

Registered vs. unregistered NCBI assemblies:

  • All NCBI assemblies can be looked up by assembly accession (GenBank or RefSeq) but only registered assemblies can also be looked up by assembly name.

  • For registered assemblies, the returned circularity flags are guaranteed to be accurate. For unregistered assemblies, a heuristic is used to determine the circular sequences.

Please contact the maintainer of the GenomeInfoDb package to request registration of additional assemblies.
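
Since only registered assemblies can be looked up by name, a script that handles arbitrary assemblies may want to fall back to the accession. The sketch below shows one possible pattern. The helper lookup_chrom_info and its fetch argument are hypothetical (not part of GenomeInfoDb), and the sketch assumes that a lookup by an unregistered name signals an R error.

```r
## Hypothetical helper (not part of GenomeInfoDb): try the assembly name
## first, and fall back to the accession if the name lookup signals an error.
## 'fetch' defaults to getChromInfoFromNCBI but can be replaced by a stub,
## which makes the helper usable without internet access (e.g. in tests).
lookup_chrom_info <- function(name, accession,
                              fetch = GenomeInfoDb::getChromInfoFromNCBI) {
  tryCatch(
    fetch(name),
    error = function(e) {
      message("Lookup by name failed; retrying with accession ", accession)
      fetch(accession)
    }
  )
}
```

With the default fetch, calling the helper requires internet access. Keep in mind that the circularity flags obtained for an unregistered assembly come from a heuristic.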

Value

For getChromInfoFromNCBI: By default, a 10-column data frame with columns:

  1. SequenceName: character.

  2. SequenceRole: factor.

  3. AssignedMolecule: factor.

  4. GenBankAccn: character.

  5. Relationship: factor.

  6. RefSeqAccn: character.

  7. AssemblyUnit: factor.

  8. SequenceLength: integer. Note that this column **can** contain NAs! For example, this is the case in assembly Amel_HAv3.1, where the length of sequence MT is missing, and in assembly Release 5, where the length of sequence Un is missing.

  9. UCSCStyleName: character.

  10. circular: logical.

For registered_NCBI_assemblies: A data frame summarizing all the NCBI assemblies currently registered in the GenomeInfoDb package.
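
Because SequenceLength can contain NAs, downstream code should not assume that every length is known. A minimal sketch of a defensive check follows, using a hand-made data frame with made-up sequence names and lengths rather than a real download:

```r
## Hand-made stand-in for a result with one missing length
## (names and lengths are made up for illustration).
chrominfo <- data.frame(
  SequenceName   = c("LG1", "LG2", "MT"),
  SequenceLength = c(1000L, 2500L, NA),
  stringsAsFactors = FALSE
)

## Names of the sequences whose length is unknown.
missing_len <- chrominfo$SequenceName[is.na(chrominfo$SequenceLength)]

## Total length over the sequences with a known length only.
total_known <- sum(chrominfo$SequenceLength, na.rm = TRUE)
```

Without na.rm = TRUE, the sum would itself be NA as soon as a single length is missing.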

Author(s)

H. Pagès

Examples

## Internet access required!

getChromInfoFromNCBI("GRCh37")

getChromInfoFromNCBI("GRCh37", as.Seqinfo=TRUE)

getChromInfoFromNCBI("GRCh37", assembled.molecules.only=TRUE)

getChromInfoFromNCBI("TAIR10.1")

getChromInfoFromNCBI("TAIR10.1", assembly.units="non-nuclear")

## List of NCBI assemblies currently registered in the package:
registered_NCBI_assemblies()

## The GRCh38.p12 assembly only adds "patch sequences" to the GRCh38
## assembly:
GRCh38 <- getChromInfoFromNCBI("GRCh38")
table(GRCh38$SequenceRole)
GRCh38.p12 <- getChromInfoFromNCBI("GRCh38.p12")
table(GRCh38.p12$SequenceRole)  # 140 patch sequences (70 fix + 70 novel)

## Sanity checks:
idx <- match(GRCh38$SequenceName, GRCh38.p12$SequenceName)
stopifnot(!anyNA(idx))
tmp1 <- GRCh38.p12[idx, ]
rownames(tmp1) <- NULL
tmp2 <- GRCh38.p12[-idx, ]
stopifnot(
  identical(tmp1[ , -(5:7)], GRCh38[ , -(5:7)]),
  identical(tmp2, GRCh38.p12[GRCh38.p12$AssemblyUnit == "PATCHES", ])
)

GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style

v1.26.7
Artistic-2.0
Authors
Sonali Arora, Martin Morgan, Marc Carlson, H. Pagès
