Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!
Get Started for Free

SNPlocs-class

SNPlocs objects


Description

The SNPlocs class is a container for storing known SNP locations (of class snp) for a given organism.

SNPlocs objects are usually made in advance by a volunteer and made available to the Bioconductor community as SNPlocs data packages. See ?available.SNPs for how to get the list of SNPlocs and XtraSNPlocs data packages curently available.

The main focus of this man page is on how to extract SNPs from an SNPlocs object.

Usage

snpcount(x)

snpsBySeqname(x, seqnames, ...)
## S4 method for signature 'SNPlocs'
snpsBySeqname(x, seqnames, drop.rs.prefix=FALSE, genome=NULL)

snpsByOverlaps(x, ranges, ...)
## S4 method for signature 'SNPlocs'
snpsByOverlaps(x, ranges, drop.rs.prefix=FALSE, ..., genome=NULL)

snpsById(x, ids, ...)
## S4 method for signature 'SNPlocs'
snpsById(x, ids, ifnotfound=c("error", "warning", "drop"), genome=NULL)

inferRefAndAltAlleles(gpos, genome)

Arguments

x

A SNPlocs object.

seqnames

The names of the sequences for which to get SNPs. Must be a subset of seqlevels(x). NAs and duplicates are not allowed.

...

Additional arguments, for use in specific methods.

Arguments passed to the snpsByOverlaps method for SNPlocs objects thru ... are used internally in the call to subsetByOverlaps(). See ?IRanges::subsetByOverlaps in the IRanges package and ?GenomicRanges::subsetByOverlaps in the GenomicRanges package for more information about the subsetByOverlaps() generic and its method for GenomicRanges objects.

drop.rs.prefix

Should the rs prefix be dropped from the returned RefSNP ids? (RefSNP ids are stored in the RefSNP_id metadata column of the returned object.)

genome

For snpsBySeqname, snpsByOverlaps, and snpsById:

NULL (the default), or a BSgenome object containing the sequences of the reference genome that corresponds to the SNP positions. See inferRefAndAltAlleles below for an alternative way to specify genome.

If genome is supplied, then inferRefAndAltAlleles is called internally by snpsBySeqname, snpsByOverlaps, or snpsById to infer the reference allele (a.k.a. ref allele) and alternate allele(s) (a.k.a. alt allele(s)) for each SNP in the returned GPos object. The inferred ref allele and alt allele(s) are returned in additional metadata columns ref_allele (character) and alt_alleles (CharacterList).

For inferRefAndAltAlleles:

A BSgenome object containing the sequences of the reference genome that corresponds to the SNP positions in gpos. Alternatively genome can be a single string containing the name of the reference genome, in which case it must be specified in a way that is accepted by the getBSgenome function (e.g. "GRCh38") and the corresponding BSgenome data package needs to be already installed (see ?getBSgenome for the details).

ranges

One or more genomic regions of interest specified as a GRanges or GPos object. A single region of interest can be specified as a character string of the form "ch14:5201-5300".

ids

The RefSNP ids to look up (a.k.a. rs ids). Can be integer or character vector, with or without the "rs" prefix. NAs are not allowed.

ifnotfound

What to do if SNP ids are not found.

gpos

A GPos object containing SNPs. It must have a metadata column alleles_as_ambig like obtained when using any of the SNP extractor snpsBySeqname, snpsByOverlaps, or snpsById on a SNPlocs object.

Details

When the reference genome is specified via the genome argument, SNP extractors snpsBySeqname, snpsByOverlaps, and snpsById call inferRefAndAltAlleles internally to infer the reference allele (a.k.a. ref allele) and alternate allele(s) (a.k.a. alt allele(s)) for each SNP.

For each SNP the ref allele is inferred from the actual nucleotide found in the reference genome at the SNP position. The alt alleles are inferred from metadata column alleles_as_ambig and the ref allele. More precisely for each SNP the alt alleles are considered to be the alleles in alleles_as_ambig minus the ref allele.

Value

snpcount returns a named integer vector containing the number of SNPs for each sequence in the reference genome.

snpsBySeqname, snpsByOverlaps, and snpsById return an unstranded GPos object with one element (genomic position) per SNP and the following metadata columns:

  • RefSNP_id: RefSNP ID (aka "rs id"). Character vector with no NAs and no duplicates.

  • alleles_as_ambig: A character vector with no NAs containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code. See ?IUPAC_CODE_MAP in the Biostrings package for more information.

If the reference genome was specified (via the genome argument), the additional metadata columns are returned:

  • genome_compat: A logical vector indicating whether the alleles in alleles_as_ambig are consistent with the reference genome.

  • ref_allele: A character vector containing the inferred reference allele for each SNP.

  • alt_alleles: A CharacterList object where each list element is a character vector containing the inferred alternate allele(s) for the corresponding SNP.

Note that this GPos object is unstranded i.e. all the SNPs in it have their strand set to "*". Alleles are always reported with respect to the positive strand.

If ifnotfound="error", the object returned by snpsById is guaranteed to be parallel to ids, that is, the i-th element in the GPos object corresponds to the i-th element in ids.

inferRefAndAltAlleles returns a DataFrame with one row per SNP in gpos and with columns genome_compat (logical), ref_allele (character), and alt_alleles (CharacterList).

Author(s)

H. Pagès

See Also

Examples

library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
snpcount(snps)

## ---------------------------------------------------------------------
## snpsBySeqname()
## ---------------------------------------------------------------------

## Get all SNPs located on chromosome 22 or MT:
snpsBySeqname(snps, c("22", "MT"))

## ---------------------------------------------------------------------
## snpsByOverlaps()
## ---------------------------------------------------------------------

## Get all SNPs overlapping some genomic region of interest:
snpsByOverlaps(snps, "X:3e6-33e6")

## With the regions of interest being all the known CDS for hg38
## located on chromosome 22 or MT (except for the chromosome naming
## convention, hg38 is the same as GRCh38):
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
my_cds <- cds(txdb)
seqlevels(my_cds, pruning.mode="coarse") <- c("chr22", "chrM")
seqlevelsStyle(my_cds)  # UCSC
seqlevelsStyle(snps)    # NCBI
seqlevelsStyle(my_cds) <- seqlevelsStyle(snps)
genome(my_cds) <- genome(snps)
my_snps <- snpsByOverlaps(snps, my_cds)
my_snps
table(my_snps %within% my_cds)

## ---------------------------------------------------------------------
## snpsById()
## ---------------------------------------------------------------------

## Lookup some RefSNP ids:
my_rsids <- c("rs10458597", "rs12565286", "rs7553394")
## Not run: 
  snpsById(snps, my_rsids)  # error, rs7553394 not found

## End(Not run)
## The following example uses more than 2GB of memory, which is more
## than what 32-bit Windows can handle:
is_32bit_windows <- .Platform$OS.type == "windows" &&
                    .Platform$r_arch == "i386"
if (!is_32bit_windows) {
    snpsById(snps, my_rsids, ifnotfound="drop")
}

## ---------------------------------------------------------------------
## Obtaining the ref allele and alt allele(s)
## ---------------------------------------------------------------------

## When the reference genome is specified (via the 'genome' argument),
## SNP extractors snpsBySeqname(), snpsByOverlaps(), and snpsById()
## call inferRefAndAltAlleles() internally to **infer** the ref allele
## and alt allele(s) for each SNP.
my_snps <- snpsByOverlaps(snps, "X:3e6-8e6", genome="GRCh38")
my_snps

## Most SNPs have only 1 alternate allele:
table(lengths(mcols(my_snps)$alt_alleles))

## SNPs with 2 alternate alleles:
my_snps[lengths(mcols(my_snps)$alt_alleles) == 2]

## SNPs with 3 alternate alleles:
my_snps[lengths(mcols(my_snps)$alt_alleles) == 3]

## Note that a small percentage of SNPs in dbSNP have alleles that
## are inconsistent with the reference genome (don't ask me why):
table(mcols(my_snps)$genome_compat)

## For the inconsistent SNPs, all the alleles reported by dbSNP
## are considered alternate alleles i.e. for each inconsistent SNP
## metadata columns "alleles_as_ambig" and "alt_alleles" represent
## the same set of nucleotides (the latter being just an expanded
## representation of the IUPAC ambiguity letter in the former):
my_snps[!mcols(my_snps)$genome_compat]

BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs

v1.58.0
Artistic-2.0
Authors
Hervé Pagès
Initial release

We don't support your browser anymore

Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.