Memory-efficient representation of genomic positions
The GPos class is a container for storing a set of genomic positions (a.k.a. genomic loci). It exists in 2 flavors: UnstitchedGPos and StitchedGPos. Each flavor uses a particular internal representation:
In an UnstitchedGPos instance the positions are stored as an integer vector.
In a StitchedGPos instance the positions are stored as an IRanges object where each range represents a run of consecutive positions (i.e. a run of positions that are adjacent and in ascending order). This storage is particularly memory-efficient when the vector of positions contains long runs of consecutive positions.
Because genomic positions can be seen as genomic ranges of width 1, the GPos class extends the GenomicRanges virtual class (via the GRanges class).
## Constructor function GPos(seqnames=NULL, pos=NULL, strand=NULL, ..., seqinfo=NULL, seqlengths=NULL, stitch=NA)
seqnames, strand, ..., seqinfo, seqlengths |
See documentation of the |
pos |
When |
stitch |
Controls which internal representation should be used: StitchedGPos
(when When |
OTOH the memory footprint of a StitchedGPos object can vary a lot but will never be worse than that of a GRanges object. However it will reduce dramatically if the vector of positions contains long runs of consecutive positions. In the worst case scenario (i.e. when the object contains no consecutive positions) its memory footprint will be the same as that of a GRanges object.
Like for any Vector derivative, the length of a
GPos object cannot exceed .Machine$integer.max
(i.e. 2^31 on
most platforms). GPos()
will return an error if pos
contains too many positions.
An UnstitchedGPos or StitchedGPos object.
GPos objects support the same set of getters as other GenomicRanges
derivatives (i.e. seqnames()
, ranges()
, start()
,
end()
, strand()
, mcols()
, seqinfo()
,
etc...), plus the pos()
getter which is equivalent to
start()
or end()
. See ?GenomicRanges
for the
list of getters supported by GenomicRanges derivatives.
Note that ranges()
returns an IPos derivative
instead of the IRanges object that one gets with other
GenomicRanges derivatives. To get an IRanges
object, you need to call ranges()
again on this
IPos derivative i.e. do ranges(ranges(x))
on GPos object x
.
Like GRanges objects, GPos derivatives support the following setters:
The seqnames()
and strand()
setters.
The names()
, mcols()
and metadata()
setters.
The family of setters that operate on the seqinfo
component of an object:
seqlevels()
,
seqlevelsStyle()
,
seqlengths()
,
isCircular()
,
genome()
,
and seqinfo()
.
These setters are defined and documented in the GenomeInfoDb
package.
However, there is no pos()
setter for GPos derivatives at the
moment (although one might be added in the future).
From UnstitchedGPos to StitchedGPos and vice-versa: coercion back and
forth between UnstitchedGPos and StitchedGPos is supported via
as(x, "StitchedGPos")
and as(x, "UnstitchedGPos")
.
This is the most efficient and recommended way to switch between the
2 internal representations. Note that this switch can have dramatic
consequences on memory usage so is for advanced users only.
End users should almost never need to do this switch when following
a typical workflow.
From GenomicRanges to UnstitchedGPos, StitchedGPos, or GPos:
A GenomicRanges derivative x
in which all the ranges have
a width of 1 can be coerced to an UnstitchedGPos or StitchedGPos object
with as(x, "UnstitchedGPos")
or as(x, "StitchedGPos")
,
respectively.
For convenience as(x, "GPos")
is supported and is equivalent to
as(x, "UnstitchedGPos")
.
From GPos to ordinary R objects:
Like with any other GenomicRanges derivative, as.character()
,
as.factor()
, and as.data.frame()
work on a GPos derivative
x
. Note however that as.data.frame(x)
returns a data frame
with a pos
column (containing pos(x)
) instead of the
start
, end
, and width
columns that one gets with other
GenomicRanges derivatives.
A GPos derivative can be subsetted exactly like a GRanges object.
GPos derivatives can be concatenated with c()
or append()
.
See ?c
in the S4Vectors package for
more information about concatenating Vector derivatives.
Like with any other GRanges object, split()
and relist()
work on a GPos derivative.
Internal representation of GPos objects has changed in GenomicRanges
1.29.10 (Bioc 3.6). Update any old object x
with:
x <- updateObject(x, verbose=TRUE)
.
Hervé Pagès; based on ideas borrowed from Georg Stricker georg.stricker@in.tum.de and Julien Gagneur gagneur@in.tum.de
The IPos class in the IRanges package for storing a set of integer positions (i.e. integer ranges of width 1).
The GRanges class for storing a set of genomic ranges of arbitrary width.
Seqinfo objects and the
seqinfo
accessor and family in the
GenomeInfoDb package for accessing/modifying information
about the underlying sequences of a GenomicRanges derivative.
GenomicRanges-comparison for comparing and ordering genomic ranges and/or positions.
findOverlaps-methods for finding overlapping genomic ranges and/or positions.
intra-range-methods and inter-range-methods for intra range and inter range transformations of GenomicRanges derivatives.
coverage-methods for computing the coverage of a set of genomic ranges and/or positions.
nearest-methods for finding the nearest genomic range/position neighbor.
The snpsBySeqname
,
snpsByOverlaps
, and
snpsById
methods for
SNPlocs objects defined in the BSgenome
package for extractors that return a GPos derivative.
SummarizedExperiment objects and derivatives in the SummarizedExperiment package.
showClass("GPos") # shows the known subclasses ## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- ## Example 1: gpos1a <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)), pos=c(44:53, 5:10, 2:5)) gpos1a # unstitched length(gpos1a) seqnames(gpos1a) pos(gpos1a) # same as 'start(gpos1a)' and 'end(gpos1a)' strand(gpos1a) as.character(gpos1a) as.data.frame(gpos1a) as(gpos1a, "GRanges") as.data.frame(as(gpos1a, "GRanges")) gpos1a[9:17] gpos1b <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)), pos=c(44:53, 5:10, 2:5), stitch=TRUE) gpos1b # stitched ## 'gpos1a' and 'gpos1b' are semantically equivalent, only their ## internal representations differ: all(gpos1a == gpos1b) gpos1c <- GPos(c("chr1:44-53", "chr2:5-10", "chr1:2-5")) gpos1c # stitched identical(gpos1b, gpos1c) ## Example 2: pos_runs <- GRanges("chrI", IRanges(c(1, 6, 12, 17), c(5, 10, 16, 20)), strand=c("*", "-", "-", "+")) gpos2 <- GPos(pos_runs) gpos2 # stitched strand(gpos2) ## Example 3: gpos3A <- gpos3B <- GPos(c("chrI:1-1000", "chrI:1005-2000")) npos <- length(gpos3A) mcols(gpos3A)$sample <- Rle("sA") sA_counts <- sample(10, npos, replace=TRUE) mcols(gpos3A)$counts <- sA_counts mcols(gpos3B)$sample <- Rle("sB") sB_counts <- sample(10, npos, replace=TRUE) mcols(gpos3B)$counts <- sB_counts gpos3 <- c(gpos3A, gpos3B) gpos3 ## Example 4: library(BSgenome.Scerevisiae.UCSC.sacCer2) genome <- BSgenome.Scerevisiae.UCSC.sacCer2 gpos4 <- GPos(seqinfo(genome)) gpos4 # all the positions along the genome are represented mcols(gpos4)$dna <- do.call("c", unname(as.list(genome))) gpos4 ## Note however that, like for any Vector derivative, the length of a ## GPos derivative cannot exceed '.Machine$integer.max' (i.e. 2^31 on ## most platforms) so the above only works with a "small" genome. ## For example it doesn't work with the Human genome: library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## Not run: GPos(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)) # error! ## End(Not run) ## You can use isSmallGenome() to check upfront whether the genome is ## "small" or not. isSmallGenome(genome) # TRUE isSmallGenome(TxDb.Hsapiens.UCSC.hg19.knownGene) # FALSE ## --------------------------------------------------------------------- ## MEMORY USAGE ## --------------------------------------------------------------------- ## Coercion to GRanges works... gr4 <- as(gpos4, "GRanges") gr4 ## ... but is generally not a good idea: object.size(gpos4) object.size(gr4) # 8 times bigger than the StitchedGPos object! ## Shuffling the order of the positions impacts memory usage: gpos4r <- rev(gpos4) object.size(gpos4r) # significantly gpos4s <- sample(gpos4) object.size(gpos4s) # even worse! ## If one anticipates a lot of shuffling of the genomic positions, ## then an UnstitchedGPos object should be used instead: gpos4b <- as(gpos4, "UnstitchedGPos") object.size(gpos4b) # initial size is bigger than stitched version object.size(rev(gpos4b)) # size didn't change object.size(sample(gpos4b)) # size increased, but is still < stitched # version ## AN IMPORTANT NOTE: In the worst situations, GPos still performs as ## good as a GRanges object. object.size(as(gpos4r, "GRanges")) # same size as 'gpos4r' object.size(as(gpos4s, "GRanges")) # same size as 'gpos4s' ## Best case scenario is when the object is strictly sorted (i.e. ## positions are in strict ascending order). ## This can be checked with: is.unsorted(gpos4, strict=TRUE) # 'gpos4' is strictly sorted ## --------------------------------------------------------------------- ## USING MEMORY-EFFICIENT METADATA COLUMNS ## --------------------------------------------------------------------- ## In order to keep memory usage as low as possible, it is recommended ## to use a memory-efficient representation of the metadata columns that ## we want to set on the object. Rle's are particularly well suited for ## this, especially if the metadata columns contain long runs of ## identical values. This is the case for example if we want to use a ## GPos object to represent the coverage of sequencing reads along a ## genome. ## Example 5: library(pasillaBamSubset) library(Rsamtools) # for the BamFile() constructor function bamfile1 <- BamFile(untreated1_chr4()) bamfile2 <- BamFile(untreated3_chr4()) gpos5 <- GPos(seqinfo(bamfile1)) library(GenomicAlignments) # for "coverage" method for BamFile objects cvg1 <- unlist(coverage(bamfile1), use.names=FALSE) cvg2 <- unlist(coverage(bamfile2), use.names=FALSE) mcols(gpos5) <- DataFrame(cvg1, cvg2) gpos5 object.size(gpos5) # lightweight ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: gpos5[mcols(gpos5)$cvg1 >= 10 | mcols(gpos5)$cvg2 >= 10] ## --------------------------------------------------------------------- ## USING A GPos OBJECT IN A RangedSummarizedExperiment OBJECT ## --------------------------------------------------------------------- ## Because the GPos class extends the GenomicRanges virtual class, a ## GPos derivative can be used as the rowRanges component of a ## RangedSummarizedExperiment object. ## As a 1st example, we show how the counts for samples sA and sB in ## 'gpos3' can be stored in a SummarizedExperiment object where the rows ## correspond to unique genomic positions and the columns to samples: library(SummarizedExperiment) counts <- cbind(sA=sA_counts, sB=sB_counts) mcols(gpos3A) <- NULL rse3 <- SummarizedExperiment(list(counts=counts), rowRanges=gpos3A) rse3 rowRanges(rse3) head(assay(rse3)) ## Finally we show how the coverage data from Example 5 can be easily ## stored in a lightweight SummarizedExperiment derivative: cvg <- mcols(gpos5) mcols(gpos5) <- NULL rse5 <- SummarizedExperiment(list(cvg=cvg), rowRanges=gpos5) rse5 rowRanges(rse5) assay(rse5) ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: rse5[assay(rse5)$cvg1 >= 10 | assay(rse5)$cvg2 >= 10]
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.