UCSCTableQuery-class

Querying UCSC Tables


Description

The UCSC genome browser is backed by a large database, which is exposed by the Table Browser web interface. Tracks are stored as tables, so this is also the mechanism for retrieving tracks. The UCSCTableQuery class represents a query against the Table Browser. Storing the query fields in a formal class facilitates incremental construction and adjustment of a query.

Details

There are five supported fields for a table query:

session

The UCSCSession instance from which the tables are retrieved. Although all sessions are based on the same database, the set of user-uploaded tracks, which are represented as tables, generally differs between sessions.

trackName

The name of a track from which to retrieve a table. Each track can have multiple tables. Often there is a primary table that is used to display the track, while the other tables are supplemental. Sometimes, tracks are displayed by aggregating multiple tables. If NULL, the query searches for a primary table across all of the tracks (secondary tables will not be found).

tableName

The name of the specific table to retrieve. May be NULL, in which case the behavior depends on how the query is executed, see below.

range

A genome identifier, a GRanges or an IntegerRangesList indicating the portion of the table to retrieve, in genome coordinates. Simply specifying the genome string is the easiest way to download data for the entire genome, and GRangesForUCSCGenome facilitates downloading data for, e.g., an entire chromosome.

names

Names or accessions of the desired features. If NULL, no name filter is applied.

A common workflow for querying the UCSC database is to create an instance of UCSCTableQuery using the ucscTableQuery constructor, invoke tableNames to list the available tables for a track, and finally to retrieve the desired table either as a data.frame via getTable or as a track via track. See the examples.

The reason for a formal query class is to facilitate multiple queries when the differences between the queries are small. For example, one might want to query multiple tables within the same track and/or the same genomic region, or query the same table for multiple regions. The UCSCTableQuery instance can be incrementally adjusted for each new query. Some caching is also performed, which enhances performance.
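As a rough sketch of this incremental style, the query below is built once and then re-pointed at a second region through the range replacement method. It uses only functions documented on this page and requires network access to UCSC. The second interval is a hypothetical region chosen purely for illustration, and the list name regions is likewise not part of the package.

```r
library(rtracklayer)

## Not run:
session <- browserSession()
genome(session) <- "mm9"

## build the query once, fixing the track and table
query <- ucscTableQuery(session, "Conservation",
                        GRangesForUCSCGenome("mm9", "chr12",
                                             IRanges(57795963, 57815592)),
                        table = "multiz30waySummary")

## regions to visit; the second interval is hypothetical
regions <- list(
  GRangesForUCSCGenome("mm9", "chr12", IRanges(57795963, 57815592)),
  GRangesForUCSCGenome("mm9", "chr12", IRanges(57900000, 57920000)))

## only the range changes between queries; track and table are reused
results <- lapply(regions, function(r) {
  range(query) <- r
  getTable(query)
})
## End(Not run)
```

Each element of results should be the data.frame returned by getTable for the corresponding region.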

Constructor

ucscTableQuery(x, track, range = seqinfo(x), table = NULL, names = NULL): Creates a UCSCTableQuery with the UCSCSession or genome identifier given as x and the track name given by the single string track. range should be a genome string identifier, a GRanges instance or IntegerRangesList instance, and it effectively defaults to genome(x). If the genome is missing, it is taken from the session. The table name is given by table, which may be a single string or NULL. Feature names, such as gene identifiers, may be passed via names as a character vector.

Executing Queries

Below, object is a UCSCTableQuery instance.

track(object): Retrieves the indicated table as a track, i.e. a GRanges object. Note that not all tables are available as tracks.

getTable(object): Retrieves the indicated table as a data.frame. Note that not all tables are output in parseable form, and that UCSC will truncate responses if they exceed certain limits (usually around 100,000 records). The safest (and most efficient) bet for large queries is to download the file via FTP and query it locally.

tableNames(object): Gets the names of the tables available for the session, track and range specified by the query.
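The following sketch strings the three calls together and adds a simple guard against truncated responses. The 100000 threshold is an assumption taken from the "around 100,000 records" note above, not a documented constant, and the variable names are illustrative. Network access to UCSC is required.

```r
library(rtracklayer)

## Not run:
session <- browserSession()
genome(session) <- "mm9"
query <- ucscTableQuery(session, "Conservation",
                        GRangesForUCSCGenome("mm9", "chr12",
                                             IRanges(57795963, 57815592)))

## check that the table is offered for this track and range before asking
wanted <- "multiz30waySummary"
if (wanted %in% tableNames(query)) {
  tableName(query) <- wanted
  df <- getTable(query)
  ## assumed limit; a result this large may have been cut off by UCSC
  if (nrow(df) >= 100000)
    warning("result may be truncated; consider querying a smaller range")
}
## End(Not run)
```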

Accessor methods

In the code snippets below, x/object is a UCSCTableQuery object.

browserSession(object), browserSession(object) <- value: Get or set the UCSCSession to query.

trackName(x), trackName(x) <- value: Get or set the single string indicating the track containing the table of interest.

trackNames(x): List the names of the tracks available for retrieval for the assigned genome.

tableName(x), tableName(x) <- value: Get or set the single string indicating the name of the table to retrieve. May be NULL, in which case the table is automatically determined.

range(x), range(x) <- value: Get or set the GRanges indicating the portion of the table to retrieve in genomic coordinates. Any missing information, such as the genome identifier, is filled in using range(browserSession(x)). It is also possible to set the genome identifier string or an IntegerRangesList.

names(x), names(x) <- value: Get or set the names of the features to retrieve. If NULL, this filter is disabled.

ucscSchema(x): Get the UCSCSchema object describing the selected table.
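A brief sketch of the getters and setters on a single query follows, assuming network access to UCSC and the hg18 snp129 track used in the Examples. Return values are not shown because they depend on the live database.

```r
library(rtracklayer)

## Not run:
session <- browserSession()
genome(session) <- "hg18"
query <- ucscTableQuery(session, "snp129",
                        names = c("rs10003974", "rs10087355"))

trackName(query)       # the track given to the constructor
tableName(query)       # NULL unless a table has been set
range(query)           # region the query covers
names(query)           # feature names used as a filter
browserSession(query)  # the UCSCSession being queried

## setting names to NULL disables the name filter
names(query) <- NULL

## describe the table that the query currently selects
ucscSchema(query)
## End(Not run)
```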

Author(s)

Michael Lawrence

Examples

## Not run: 
session <- browserSession()
genome(session) <- "mm9"
trackNames(session) ## list the track names
## choose the Conservation track for a portion of mm9 chr12
query <- ucscTableQuery(session, "Conservation",
                        GRangesForUCSCGenome("mm9", "chr12",
                                             IRanges(57795963, 57815592)))
## list the table names
tableNames(query)
## get the phastCons30way track
tableName(query) <- "phastCons30way"
## retrieve the track data
track(query)  # a GRanges object
## get a data.frame summarizing the multiple alignment
tableName(query) <- "multiz30waySummary"
getTable(query)

genome(session) <- "hg18"
query <- ucscTableQuery(session, "snp129",
                        names = c("rs10003974", "rs10087355", "rs10075230"))
ucscSchema(query)
getTable(query)

## End(Not run)

rtracklayer

R interface to genome annotation files and the UCSC genome browser

v1.50.0
Artistic-2.0 + file LICENSE
Authors
Michael Lawrence, Vince Carey, Robert Gentleman
Initial release
