SNPlocs objects
The SNPlocs class is a container for storing known SNP locations (of class snp) for a given organism.
SNPlocs objects are usually prepared in advance by a volunteer and made
available to the Bioconductor community as SNPlocs data packages.
See ?available.SNPs for how to get the list of SNPlocs and XtraSNPlocs
data packages currently available.
The main focus of this man page is on how to extract SNPs from an SNPlocs object.
snpcount(x)

snpsBySeqname(x, seqnames, ...)
## S4 method for signature 'SNPlocs'
snpsBySeqname(x, seqnames, drop.rs.prefix=FALSE, genome=NULL)

snpsByOverlaps(x, ranges, ...)
## S4 method for signature 'SNPlocs'
snpsByOverlaps(x, ranges, drop.rs.prefix=FALSE, ..., genome=NULL)

snpsById(x, ids, ...)
## S4 method for signature 'SNPlocs'
snpsById(x, ids, ifnotfound=c("error", "warning", "drop"), genome=NULL)

inferRefAndAltAlleles(gpos, genome)
x | A SNPlocs object.
seqnames | The names of the sequences for which to get SNPs. Must be a subset of
... | Additional arguments, for use in specific methods. Arguments passed to the
drop.rs.prefix | Should the
genome | For If For A BSgenome object containing the sequences of the reference genome that corresponds to the SNP positions in
ranges | One or more genomic regions of interest specified as a GRanges or GPos object. A single region of interest can be specified as a character string of the form
ids | The RefSNP ids to look up (a.k.a. rs ids). Can be integer or character vector, with or without the
ifnotfound | What to do if SNP ids are not found.
gpos | A GPos object containing SNPs. It must have a metadata column
When the reference genome is specified via the genome argument, the SNP
extractors snpsBySeqname, snpsByOverlaps, and snpsById call
inferRefAndAltAlleles internally to infer the reference allele (a.k.a.
ref allele) and alternate allele(s) (a.k.a. alt allele(s)) for each SNP.
For each SNP the ref allele is inferred from the actual nucleotide found
in the reference genome at the SNP position. The alt alleles are inferred
from metadata column alleles_as_ambig and the ref allele. More precisely,
for each SNP the alt alleles are considered to be the alleles in
alleles_as_ambig minus the ref allele.
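The "alleles minus the ref allele" rule can be sketched in base R as follows. This is only an illustration of the rule as described here, not the code used by inferRefAndAltAlleles: the helper name infer_alleles_sketch and the hand-copied iupac_map vector are assumptions made for the example (the mapping mirrors IUPAC_CODE_MAP from the Biostrings package so that the sketch runs without any package), and the ref alleles are passed in directly instead of being read from a BSgenome object.

```r
## Hand-copied IUPAC nucleotide ambiguity codes (same mapping as
## IUPAC_CODE_MAP in Biostrings), so that no package is needed.
iupac_map <- c(A="A", C="C", G="G", T="T",
               M="AC", R="AG", W="AT", S="CG", Y="CT", K="GT",
               V="ACG", H="ACT", D="AGT", B="CGT", N="ACGT")

## Hypothetical helper (NOT part of BSgenome). 'ref_allele' stands in for
## the nucleotides that would be looked up in the reference genome;
## 'alleles_as_ambig' is a parallel vector of IUPAC codes.
infer_alleles_sketch <- function(ref_allele, alleles_as_ambig) {
    stopifnot(length(ref_allele) == length(alleles_as_ambig),
              all(alleles_as_ambig %in% names(iupac_map)))
    ## Expand each ambiguity code into its individual nucleotides.
    alleles <- strsplit(unname(iupac_map[alleles_as_ambig]), "", fixed=TRUE)
    ## A SNP is consistent with the genome when the ref nucleotide is
    ## one of its reported alleles.
    genome_compat <- mapply(function(ref, a) ref %in% a,
                            ref_allele, alleles, USE.NAMES=FALSE)
    ## Alt alleles = reported alleles minus the ref allele. When the ref
    ## nucleotide is not among them, all reported alleles are kept.
    alt_alleles <- mapply(setdiff, alleles, ref_allele,
                          SIMPLIFY=FALSE, USE.NAMES=FALSE)
    list(genome_compat=genome_compat,
         ref_allele=ref_allele,
         alt_alleles=alt_alleles)
}

res <- infer_alleles_sketch(ref_allele=c("A", "C", "G"),
                            alleles_as_ambig=c("R", "B", "Y"))
res$genome_compat  # TRUE TRUE FALSE
res$alt_alleles    # "G"; "G" "T"; "C" "T"
```

The third toy SNP has ref nucleotide G but reported alleles C and T (code Y), so it is flagged as inconsistent and both reported alleles are treated as alternate alleles.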
snpcount returns a named integer vector containing the number
of SNPs for each sequence in the reference genome.
snpsBySeqname, snpsByOverlaps, and snpsById return an unstranded GPos
object with one element (genomic position) per SNP and the following
metadata columns:
RefSNP_id: RefSNP ID (aka "rs id"). Character vector with no NAs and no duplicates.
alleles_as_ambig: A character vector with no NAs containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code. See ?IUPAC_CODE_MAP in the Biostrings package for more information.
If the reference genome was specified (via the genome argument),
the following additional metadata columns are returned:
genome_compat: A logical vector indicating whether the alleles in alleles_as_ambig are consistent with the reference genome.
ref_allele: A character vector containing the inferred reference allele for each SNP.
alt_alleles: A CharacterList object where each list element is a character vector containing the inferred alternate allele(s) for the corresponding SNP.
Note that this GPos object is unstranded, i.e. all the SNPs in it have
their strand set to "*".
Alleles are always reported with respect to the positive strand.
If ifnotfound="error", the object returned by snpsById is guaranteed
to be parallel to ids, that is, the i-th element in the GPos object
corresponds to the i-th element in ids.
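To illustrate what "parallel to ids" means, here is a minimal base R sketch on a toy table. It is not the snpsById implementation: the data frame toy_snps, its made-up positions, and the helper lookup_by_id_sketch are all hypothetical, and only the "error" and "drop" behaviours are mimicked.

```r
## Toy lookup table standing in for the SNPs stored in an SNPlocs object
## (ids and positions are made up for illustration only).
toy_snps <- data.frame(RefSNP_id=c("rs3", "rs1", "rs2"),
                       pos=c(300L, 100L, 200L),
                       stringsAsFactors=FALSE)

## Hypothetical helper (NOT the BSgenome implementation).
lookup_by_id_sketch <- function(snps, ids, ifnotfound=c("error", "drop")) {
    ifnotfound <- match.arg(ifnotfound)
    ## match() returns one index per requested id, in the order of 'ids'.
    idx <- match(ids, snps$RefSNP_id)
    notfound <- is.na(idx)
    if (any(notfound)) {
        if (ifnotfound == "error")
            stop("SNP ids not found: ",
                 paste(ids[notfound], collapse=", "))
        ## "drop": discard unknown ids, so the result can be shorter
        ## than 'ids' and is no longer parallel to it.
        idx <- idx[!notfound]
    }
    snps[idx, , drop=FALSE]
}

ids <- c("rs1", "rs2", "rs3")
hits <- lookup_by_id_sketch(toy_snps, ids)
## Parallel to 'ids': the i-th row corresponds to the i-th id.
stopifnot(identical(hits$RefSNP_id, ids))

## With "drop", the unknown id "rs9" is silently removed.
dropped <- lookup_by_id_sketch(toy_snps, c("rs1", "rs9"), ifnotfound="drop")
nrow(dropped)  # 1
```

With the default "error" setting every requested id is either found or the call fails, which is why the result can be indexed by position in ids.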
inferRefAndAltAlleles returns a DataFrame with one row per SNP in gpos
and with columns genome_compat (logical), ref_allele (character), and
alt_alleles (CharacterList).
H. Pagès
XtraSNPlocs packages and objects for molecular variations of class other than snp, e.g. of class in-del, heterozygous, microsatellite, etc.
IRanges::subsetByOverlaps in the IRanges package and
GenomicRanges::subsetByOverlaps in the GenomicRanges package for more
information about the subsetByOverlaps() generic and its method for
GenomicRanges objects.
IUPAC_CODE_MAP in the Biostrings package.
library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
snpcount(snps)

## ---------------------------------------------------------------------
## snpsBySeqname()
## ---------------------------------------------------------------------

## Get all SNPs located on chromosome 22 or MT:
snpsBySeqname(snps, c("22", "MT"))

## ---------------------------------------------------------------------
## snpsByOverlaps()
## ---------------------------------------------------------------------

## Get all SNPs overlapping some genomic region of interest:
snpsByOverlaps(snps, "X:3e6-33e6")

## With the regions of interest being all the known CDS for hg38
## located on chromosome 22 or MT (except for the chromosome naming
## convention, hg38 is the same as GRCh38):
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
my_cds <- cds(txdb)
seqlevels(my_cds, pruning.mode="coarse") <- c("chr22", "chrM")
seqlevelsStyle(my_cds)  # UCSC
seqlevelsStyle(snps)    # NCBI
seqlevelsStyle(my_cds) <- seqlevelsStyle(snps)
genome(my_cds) <- genome(snps)
my_snps <- snpsByOverlaps(snps, my_cds)
my_snps
table(my_snps %within% my_cds)

## ---------------------------------------------------------------------
## snpsById()
## ---------------------------------------------------------------------

## Lookup some RefSNP ids:
my_rsids <- c("rs10458597", "rs12565286", "rs7553394")
## Not run: 
snpsById(snps, my_rsids)  # error, rs7553394 not found

## End(Not run)

## The following example uses more than 2GB of memory, which is more
## than what 32-bit Windows can handle:
is_32bit_windows <- .Platform$OS.type == "windows" &&
                    .Platform$r_arch == "i386"
if (!is_32bit_windows) {
    snpsById(snps, my_rsids, ifnotfound="drop")
}

## ---------------------------------------------------------------------
## Obtaining the ref allele and alt allele(s)
## ---------------------------------------------------------------------

## When the reference genome is specified (via the 'genome' argument),
## SNP extractors snpsBySeqname(), snpsByOverlaps(), and snpsById()
## call inferRefAndAltAlleles() internally to **infer** the ref allele
## and alt allele(s) for each SNP.
my_snps <- snpsByOverlaps(snps, "X:3e6-8e6", genome="GRCh38")
my_snps

## Most SNPs have only 1 alternate allele:
table(lengths(mcols(my_snps)$alt_alleles))

## SNPs with 2 alternate alleles:
my_snps[lengths(mcols(my_snps)$alt_alleles) == 2]

## SNPs with 3 alternate alleles:
my_snps[lengths(mcols(my_snps)$alt_alleles) == 3]

## Note that a small percentage of SNPs in dbSNP have alleles that
## are inconsistent with the reference genome (don't ask me why):
table(mcols(my_snps)$genome_compat)

## For the inconsistent SNPs, all the alleles reported by dbSNP
## are considered alternate alleles i.e. for each inconsistent SNP
## metadata columns "alleles_as_ambig" and "alt_alleles" represent
## the same set of nucleotides (the latter being just an expanded
## representation of the IUPAC ambiguity letter in the former):
my_snps[!mcols(my_snps)$genome_compat]