Extract sequences upstream of a set of genes or transcripts
extractUpstreamSeqs is a generic function for extracting sequences upstream of a supplied set of genes or transcripts.
extractUpstreamSeqs(x, genes, width=1000, ...)

## Dispatch is on the 2nd argument!

## S4 method for signature 'GenomicRanges'
extractUpstreamSeqs(x, genes, width=1000)

## S4 method for signature 'TxDb'
extractUpstreamSeqs(x, genes, width=1000, exclude.seqlevels=NULL)
x |
An object containing the chromosome sequences from which to extract the
upstream sequences. It can be a BSgenome,
TwoBitFile, or FaFile object,
or any genome sequence container.
More formally, |
genes |
An object containing the locations (i.e. chromosome name, start, end, and
strand) of the genes or transcripts with respect to the reference genome.
Only GenomicRanges and TxDb objects
are supported at the moment. If the latter, the gene locations are obtained
by calling the |
width |
How many bases to extract upstream of each TSS (transcription start site). |
... |
Additional arguments, for use in specific methods. |
exclude.seqlevels |
A character vector containing the chromosome names (a.k.a. sequence levels) to exclude when the genes are obtained from a TxDb object. |
A DNAStringSet object containing one upstream sequence per gene (or per transcript if genes is a GenomicRanges object containing transcript ranges).
More precisely, if genes is a GenomicRanges object, the returned object is parallel to it, that is, the i-th element in the returned object is the upstream sequence corresponding to the i-th gene (or transcript) in genes. Also the names on the GenomicRanges object are propagated to the returned object.
In addition, the returned object has the following metadata columns (accessible with mcols) that provide some information about the gene (or transcript) corresponding to each upstream sequence:
gene_seqnames: the chromosome name of the gene (or transcript);
gene_strand: the strand of the gene (or transcript);
gene_TSS: the transcription start site of the gene (or transcript).
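To make the notions of "upstream" and "TSS" concrete, here is a minimal sketch that works on ranges only, with no genome sequence involved. It is not the internal implementation of extractUpstreamSeqs; it assumes the strand-aware flank() function from the GenomicRanges package is an adequate stand-in for the upstream region, and the gene names and coordinates are made up for illustration.

```r
library(GenomicRanges)

## Two toy genes on opposite strands (made-up coordinates).
genes <- GRanges(
    seqnames = c("chr1", "chr1"),
    ranges = IRanges(start = c(5001, 20001), end = c(6000, 21000)),
    strand = c("+", "-")
)
names(genes) <- c("geneA", "geneB")

## TSS: the start of a plus-strand gene, the end of a minus-strand gene.
tss <- ifelse(as.character(strand(genes)) == "-", end(genes), start(genes))

## flank() is strand-aware: by default it returns the region located
## immediately upstream (5') of each range.
upstream <- flank(genes, width = 1000)

## Attach metadata columns named like the ones described above.
mcols(upstream) <- DataFrame(
    gene_seqnames = seqnames(genes),
    gene_strand = strand(genes),
    gene_TSS = tss
)
upstream
```

In this sketch the upstream region of the minus-strand gene lies to the right of the gene on the reference, which is why gene_strand and gene_TSS are worth keeping next to each sequence.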
IMPORTANT: Always make sure to use a TxDb package (or TxDb object) that contains a gene model compatible with the genome sequence container x, that is, a gene model based on the exact same reference genome as x.
See
http://bioconductor.org/packages/release/BiocViews.html#___TxDb
for the list of TxDb packages available in the current release of
Bioconductor.
Note that you can make your own custom TxDb object from
various annotation resources by using one of the makeTxDbFrom*()
functions listed in the "See also" section below.
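One possible way to sanity-check the compatibility requirement above is to compare the sequence information carried by the two objects. The sketch below is an assumption about how such a check could be written, not something extractUpstreamSeqs does for you: same_reference() is a hypothetical helper, and the Seqinfo objects are built by hand with toy lengths so that the example does not depend on any annotation package.

```r
library(GenomeInfoDb)

## Hypothetical helper (not part of any package): TRUE when the two Seqinfo
## objects agree on genome build and sequence length for every sequence
## they have in common.
same_reference <- function(si1, si2) {
    common <- intersect(seqnames(si1), seqnames(si2))
    if (length(common) == 0L)
        return(FALSE)
    same_genome <- identical(unname(genome(si1)[common]),
                             unname(genome(si2)[common]))
    same_length <- identical(as.integer(seqlengths(si1)[common]),
                             as.integer(seqlengths(si2)[common]))
    same_genome && same_length
}

## Toy sequence information (lengths are made up for illustration).
genome_si <- Seqinfo(c("chr2L", "chrM"), seqlengths = c(1000, 200),
                     isCircular = c(FALSE, TRUE), genome = "dm3")
txdb_si   <- Seqinfo(c("chr2L", "chrM"), seqlengths = c(1000, 200),
                     isCircular = c(FALSE, TRUE), genome = "dm3")
other_si  <- Seqinfo(c("chr2L", "chrM"), seqlengths = c(1200, 200),
                     isCircular = c(FALSE, TRUE), genome = "dm6")

same_reference(genome_si, txdb_si)
same_reference(genome_si, other_si)
```

With real objects, the two arguments would typically be obtained with seqinfo(x) and seqinfo(genes). Matching genome names and lengths is only a heuristic and does not prove that two assemblies are identical.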
Hervé Pagès
makeTxDbFromUCSC, makeTxDbFromBiomart, and makeTxDbFromEnsembl, for making a TxDb object from online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
The available.genomes function in the BSgenome package for checking availability of BSgenome data packages (and installing the desired one).
The BSgenome, TwoBitFile, and FaFile classes, defined and documented in the BSgenome, rtracklayer, and Rsamtools packages, respectively.
The TxDb class.
The genes function for extracting gene ranges from a TxDb object.
The GenomicRanges class defined and documented in the GenomicRanges package.
The DNAStringSet class defined and documented in the Biostrings package.
The seqinfo getter defined and documented in the GenomeInfoDb package.
The getSeq function for extracting subsequences from a sequence container.
## Load a genome:
library(BSgenome.Dmelanogaster.UCSC.dm3)
genome <- BSgenome.Dmelanogaster.UCSC.dm3
genome

## Use a TxDb object:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
txdb  # contains Ensembl gene IDs

## Because the chrU and chrUextra sequences are made of concatenated
## scaffolds (see http://genome.ucsc.edu/cgi-bin/hgGateway?db=dm3),
## extracting the upstream sequences for genes located on these
## scaffolds is not reliable. So we exclude them:
exclude <- c("chrU", "chrUextra")
up1000seqs <- extractUpstreamSeqs(genome, txdb, width=1000,
                                  exclude.seqlevels=exclude)
up1000seqs  # the names are Ensembl gene IDs
mcols(up1000seqs)

## Upstream sequences for genes close to the chromosome bounds can be
## shorter than 1000 (note that this does not happen for circular
## chromosomes like chrM):
table(width(up1000seqs))
mcols(up1000seqs)[width(up1000seqs) != 1000, ]