Import and export
The functions import and export load and save objects from and to
particular file formats. The rtracklayer package implements support
for a number of annotation and sequence formats.
export(object, con, format, ...)
import(con, format, text, ...)
object |
The object to export. |
con |
The connection from which data is loaded or to which data
is saved. If this is a character vector, it is assumed to be a
filename and a corresponding file connection is created and then
closed after exporting the object. If a |
format |
The format of the output. If missing and |
text |
If |
... |
Parameters to pass to the format-specific method. |
The rtracklayer package supports a number of file formats for
representing annotated genomic intervals. These are
each represented as a subclass of RTLFile. Below,
we list the major supported formats, with some advice for when a
particular file format is appropriate:
The General Feature Format is
meant to represent any set of genomic features, with
application-specific columns represented as
“attributes”. There are three principal versions (1, 2, and
3). This is a good format for interoperating with other genomic
tools and is the most flexible format, in that a feature may have
any number of attributes (in version 2 and above). Version 3
(GFF3) is the preferred version. Its specification lays out
conventions for representing various types of data, including gene
models, for which it is the format of choice. For variants,
rtracklayer has rudimentary support for an extension of GFF3
called GVF. UCSC supports GFF1, but it needs to be encapsulated in
the UCSC metaformat, i.e. export.ucsc(subformat = "gff1"). The BED
format is typically preferred over GFF for
interaction with UCSC. GFF files can be indexed with the tabix
utility for fast range-based queries via rtracklayer and
Rsamtools.
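As a minimal sketch of the GFF3 workflow described above, the following builds a small GRanges object, exports it as GFF3 text by omitting con, and reads the text back with import. The object name exons and its coordinates and attribute values are made up for illustration; only import, export, and the format and text arguments come from this page.

```r
library(rtracklayer)
library(GenomicRanges)

## Hypothetical two-exon track; "type" and "ID" are illustrative
## metadata columns, not values taken from this page.
exons <- GRanges("chr1",
                 IRanges(start = c(101, 301), width = 100),
                 strand = "+",
                 type = "exon",
                 ID = c("exon1", "exon2"))

## With 'con' missing, export() returns the output as a character vector.
gff_lines <- export(exons, format = "gff3")

## The same text can be parsed again through the 'text' argument.
roundtrip <- import(text = gff_lines, format = "gff3")
```

Going through a string is convenient for inspecting what would be written before committing it to a file.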
The Browser Extended Display
format is for displaying qualitative tracks in a genome browser,
in particular UCSC. It finds a good balance between simplicity and
expressiveness. It is much simpler than GFF and yet can still
represent multi-exon gene structures. It is somewhat limited by
its lack of the attribute support of GFF. To circumvent this, many
tools and organizations have extended BED with additional
columns. Use the extraCols argument on import to
read those columns into R. The rtracklayer package supports two official
extensions of BED, Bed15 and bedGraph, as well as the unofficial BEDPE
format (see below). BED files can be indexed with the tabix utility
for fast range-based queries via rtracklayer and Rsamtools.
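The extraCols argument mentioned above can be sketched as follows. The example assumes a BED file with the six standard columns plus one additional numeric column; the file contents and the column name signal are invented for illustration. extraCols is given as a named character vector mapping each extra column name to its class.

```r
library(rtracklayer)
library(GenomicRanges)

## Write a small, made-up BED file: six standard columns followed by
## one non-standard numeric column.
bed_path <- tempfile(fileext = ".bed")
writeLines(c("chr1\t0\t100\tfeatA\t500\t+\t1.5",
             "chr1\t200\t300\tfeatB\t900\t-\t0.25"),
           bed_path)

## The format is derived from the ".bed" extension; 'signal' is a
## hypothetical name chosen here for the extra column.
bed <- import(bed_path, extraCols = c(signal = "numeric"))

## BED starts are 0-based in the file and 1-based in the GRanges.
start(bed)
bed$signal
```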
An extension of BED with 15 columns, Bed15 is meant to represent data from microarray experiments. Multiple samples/columns are supported, and the data is displayed in UCSC as a compact heatmap. Few other tools support this format. With 15 columns per feature, this format is probably too verbose for e.g. ChIP-seq coverage (use multiple BigWig tracks instead).
bedGraph is a variant of BED that
represents a score column more compactly than BED and
especially Bed15, although only one sample is
supported. The data is displayed in UCSC as a bar or line
graph. For large data (the typical case), BigWig is
preferred.
BEDPE is a variant of BED that represents pairs of genomic regions, such as interaction data or chromosomal rearrangements. The data cannot be displayed in UCSC directly but can be represented using the BED12 format.
The Wiggle format is meant for
storing dense numerical data, such as window-based GC and
conservation scores. The data is displayed in UCSC as a bar or
line graph. The WIG format only works for intervals with a uniform
width. For non-uniform widths, consider bedGraph. For large
data, consider BigWig.
The BigWig format is a
binary version of both bedGraph and WIG (which are
now somewhat obsolete). A BigWig file contains a spatial index for
fast range-based queries and also embeds summary statistics of the
scores at several zoom levels. Thus, it is ideal for visualization
of and parallel computing on genome-scale vectors, like the
coverage from a high-throughput sequencing experiment.
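The range-based querying described above might look like the following sketch. It assumes that BigWig export is available on the platform in use, and that the exported object carries a numeric score column and known sequence lengths; the scores, coordinates, and the chromosome length of 1000 are made up for illustration.

```r
library(rtracklayer)
library(GenomicRanges)

## Hypothetical score track; sequence lengths are supplied because the
## BigWig writer needs to know the chromosome sizes.
cov <- GRanges("chr1",
               IRanges(start = c(1, 101, 201), width = 100),
               score = c(0.5, 2, 1.25),
               seqlengths = c(chr1 = 1000))

bw_path <- tempfile(fileext = ".bw")
export(cov, bw_path)

## Restrict the import to a query region; only data overlapping the
## region is read, using the file's spatial index.
region <- GRanges("chr1", IRanges(90, 150))
hits <- import(bw_path, which = region)
```

Restricting the import with which avoids loading a genome-scale vector when only a window is needed.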
In summary, for the typical use case of combining gene models with
experimental data, GFF is preferred for gene models and
BigWig is preferred for quantitative score vectors. Note that
the Rsamtools package provides support for the
BAM file format (for representing
read alignments), among others. Based on this, the rtracklayer package
provides an export method for writing GAlignments
and GappedReads objects as BAM. For variants, consider
VCF, supported by the VariantAnnotation package.
If con is missing, a character vector containing the string
output. Otherwise, nothing is returned.
Michael Lawrence
track <- import(system.file("tests", "v1.gff", package = "rtracklayer"))

## Not run: export(track, "my.gff", version = "3")
## equivalently,
## Not run: export(track, "my.gff3")
## or
## Not run: 
con <- file("my.gff3")
export(track, con, "gff3")
## End(Not run)
## or as a string
export(track, format = "gff3")