Querying UCSC Tables
The UCSC genome browser is backed by a large database,
which is exposed by the Table Browser web interface. Tracks are
stored as tables, so this is also the mechanism for retrieving tracks. The
UCSCTableQuery class represents a query against the Table
Browser. Storing the query fields in a formal class facilitates
incremental construction and adjustment of a query.
There are five supported fields for a table query:
The UCSCSession instance from which the tables are retrieved. Although
all sessions are based on the same database, the set of user-uploaded
tracks, which are represented as tables, generally differs between
sessions.
The name of a track from which to retrieve a table. Each track can have
multiple tables. Often there is a primary table that is used to display
the track, while the other tables are supplemental. Sometimes, tracks
are displayed by aggregating multiple tables. If NULL, search for a
primary table across all of the tracks (will not find secondary tables).
The name of the specific table to retrieve. May be NULL, in which case
the behavior depends on how the query is executed, see below.
A genome identifier, a GRanges or an IntegerRangesList indicating the
portion of the table to retrieve, in genome coordinates. Simply
specifying the genome string is the easiest way to download data for
the entire genome, and GRangesForUCSCGenome facilitates downloading
data for e.g. an entire chromosome.
Names/accessions of the desired features.
A common workflow for querying the UCSC database is to create an
instance of UCSCTableQuery using the ucscTableQuery constructor, invoke
tableNames to list the available tables for a track, and finally
retrieve the desired table either as a data.frame via getTable or as a
track via track. See the examples.
The reason for a formal query class is to facilitate multiple queries
when the differences between the queries are small. For example, one
might want to query multiple tables within the same track and/or
genomic region, or query the same table for multiple regions. The
UCSCTableQuery instance can be incrementally adjusted for each
new query. Some caching is also performed, which enhances performance.
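The incremental adjustment described above can be sketched as a small helper that reuses one query for several tables. This is a minimal sketch, not part of the package: fetch_tables is a hypothetical name, and it assumes that every requested table can be returned by getTable for the query's track and range.

```r
library(rtracklayer)

# Hypothetical helper (not part of rtracklayer): retrieve several tables
# from the same track and range by changing only the table field.
# `query` is a UCSCTableQuery; `tables` is a character vector of table
# names, for example a subset of tableNames(query).
fetch_tables <- function(query, tables) {
  results <- lapply(tables, function(tbl) {
    tableName(query) <- tbl  # adjust a local copy; other fields are unchanged
    getTable(query)          # data.frame for the selected table
  })
  setNames(results, tables)
}
```

With the query built in the examples, a call such as fetch_tables(query, c("phastCons30way", "multiz30waySummary")) would return a named list of data.frames. Each call contacts the UCSC Table Browser, so it needs network access.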
ucscTableQuery(x, track, range = seqinfo(x), table = NULL, names = NULL):
Creates a UCSCTableQuery with the UCSCSession or genome identifier given
as x and the track name given by the single string track. range should
be a genome string identifier, a GRanges instance or IntegerRangesList
instance, and it effectively defaults to genome(x). If the genome is
missing, it is taken from the session. The table name is given by table,
which may be a single string or NULL. Feature names, such as gene
identifiers, may be passed via names as a character vector.
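The forms that range can take might look as follows. This is a sketch that assumes the constructor signature documented on this page (newer releases of the package may differ), reuses the track and table names from the examples, and assumes that GRangesForUCSCGenome returns the whole chromosome when no ranges are supplied. It needs network access to UCSC.

```r
library(rtracklayer)

session <- browserSession()
genome(session) <- "mm9"

# Range given as a genome string: the query covers the entire genome.
q_genome <- ucscTableQuery(session, "Conservation", "mm9")

# Range given as a GRanges covering one chromosome (assumption: omitting
# the ranges argument selects the full length of chr12).
q_chrom <- ucscTableQuery(session, "Conservation",
                          GRangesForUCSCGenome("mm9", "chr12"))

# Range given as a GRanges window, with the table fixed up front.
region <- GRangesForUCSCGenome("mm9", "chr12",
                               IRanges(57795963, 57815592))
q_region <- ucscTableQuery(session, "Conservation", region,
                           table = "phastCons30way")
```

Constructing a query does not download the table; data is only transferred when track or getTable is called.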
Below, object is a UCSCTableQuery instance.
track(object): Retrieves the indicated table as a track, i.e. a GRanges
object. Note that not all tables are available as tracks.
getTable(object): Retrieves the indicated table as a data.frame. Note
that not all tables are output in parseable form, and that UCSC will
truncate responses if they exceed certain limits (usually around
100,000 records). The safest (and most efficient) bet for large queries
is to download the file via FTP and query it locally.
tableNames(object): Gets the names of the tables available for the
session, track and range specified by the query.
In the code snippets below, x/object is a UCSCTableQuery object.
browserSession(object), browserSession(object) <- value: Get or set the
UCSCSession to query.
trackName(x), trackName(x) <- value: Get or set the single string
indicating the track containing the table of interest.
trackNames(x): List the names of the tracks available for retrieval for
the assigned genome.
tableName(x), tableName(x) <- value: Get or set the single string
indicating the name of the table to retrieve. May be NULL, in which case
the table is automatically determined.
range(x), range(x) <- value: Get or set the GRanges indicating the
portion of the table to retrieve in genomic coordinates. Any missing
information, such as the genome identifier, is filled in using
range(browserSession(x)). It is also possible to set the genome
identifier string or an IntegerRangesList.
names(x), names(x) <- value: Get or set the names of the features to
retrieve. If NULL, this filter is disabled.
ucscSchema(x): Get the UCSCSchema object describing the selected table.
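The accessors above might be combined as in the following sketch, which inspects a query and then swaps the feature-name filter for a region filter. The track and feature names come from the examples; the chr4 window is a hypothetical region chosen only for illustration. It assumes network access to UCSC.

```r
library(rtracklayer)

session <- browserSession()
genome(session) <- "hg18"
query <- ucscTableQuery(session, "snp129",
                        names = c("rs10003974", "rs10087355"))

# Read back the fields currently stored in the query.
trackName(query)  # the track passed to the constructor
tableName(query)  # may be NULL until a table is chosen
names(query)      # the feature names used as a filter

# Disable the name filter and restrict the query to a region instead.
# The window below is an arbitrary illustration, not taken from the source.
names(query) <- NULL
range(query) <- GRangesForUCSCGenome("hg18", "chr4", IRanges(1, 1000000))
```

Because the setters return a modified query object, the original session is left untouched and the same session can back several queries.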
Michael Lawrence
## Not run:
session <- browserSession()
genome(session) <- "mm9"
trackNames(session) ## list the track names
## choose the Conservation track for a portion of mm9 chr12
query <- ucscTableQuery(session, "Conservation",
                        GRangesForUCSCGenome("mm9", "chr12",
                                             IRanges(57795963, 57815592)))
## list the table names
tableNames(query)
## get the phastCons30way track
tableName(query) <- "phastCons30way"
## retrieve the track data
track(query) # a GRanges object
## get a data.frame summarizing the multiple alignment
tableName(query) <- "multiz30waySummary"
getTable(query)

genome(session) <- "hg18"
query <- ucscTableQuery(session, "snp129",
                        names = c("rs10003974", "rs10087355", "rs10075230"))
ucscSchema(query)
getTable(query)

## End(Not run)