edgeR: processAmplicons – R documentation

processAmplicons

Process raw fastq data from pooled genetic sequencing screens

Description

Given a list of sample-specific index (barcode) sequences and hairpin/sgRNA-specific sequences from an amplicon sequencing screen, generate a DGEList of counts from the raw fastq file(s) containing the sequence reads. The positions of the index sequences and hairpin/sgRNA sequences are considered variable, with the hairpin/sgRNA sequences assumed to be located after the index sequences in the read.

Usage

processAmplicons(readfile, readfile2=NULL, barcodefile, hairpinfile,
                    allowMismatch=FALSE, barcodeMismatchBase=1,
                    hairpinMismatchBase=2, dualIndexForwardRead=FALSE,
                    verbose=FALSE, barcodesInHeader=FALSE,
                    plotPositions=FALSE)

Arguments

`readfile`	character vector giving one or more fastq filenames
`readfile2`	character vector giving one or more fastq filenames for the reverse read, defaults to `NULL`
`barcodefile`	filename containing sample-specific barcode ids and sequences
`hairpinfile`	filename containing hairpin/sgRNA-specific ids and sequences
`allowMismatch`	logical, indicates whether sequence mismatch is allowed
`barcodeMismatchBase`	numeric value of maximum number of base sequence mismatches allowed in a barcode sequence when `allowMismatch` is `TRUE`
`hairpinMismatchBase`	numeric value of maximum number of base sequence mismatches allowed in a hairpin/sgRNA sequence when `allowMismatch` is `TRUE`
`dualIndexForwardRead`	logical, indicates if forward reads contain a second barcode sequence (must be present in `barcodefile`) which should be matched
`verbose`	if `TRUE`, output program progress
`barcodesInHeader`	logical, indicates if barcode sequences should be matched in the header (sequence identifier) of each read (i.e. the first of every group of four lines in the fastq files)
`plotPositions`	logical, indicates if a density plot displaying the position of each barcode and hairpin/sgRNA sequence in the reads should be created. If `dualIndexForwardRead` is `TRUE` or `readfile2` is not `NULL`, plotPositions will generate two density plots, side by side, indicating the positions of the first barcodes and hairpins in the first plot, and second barcodes in the second.
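
To make the `barcodesInHeader` description concrete, the following base R sketch shows where header lines sit in a fastq file. The two records, and the idea of the index sequence being appended to the header, are made up for illustration; real header layouts depend on the sequencer and demultiplexing software.

```r
# Two made-up fastq records (four lines each). The index sequence is
# appended to the header here purely for illustration.
fastq_lines <- c(
  "@read1 1:N:0:ACGT",
  "TTGACCTGAAGGCTAGCTAA",
  "+",
  strrep("I", 20),
  "@read2 1:N:0:TGCA",
  "CCGATTAGGCTAACCTTAGC",
  "+",
  strrep("I", 20)
)

# Header lines are the first of every group of four lines.
headers <- fastq_lines[seq(1, length(fastq_lines), by = 4)]
# Sequence lines are the second of every group of four lines.
reads <- fastq_lines[seq(2, length(fastq_lines), by = 4)]

# With barcodesInHeader=TRUE the barcode is looked for in the header
# rather than in the read; a fixed-string search mimics that idea.
in_header <- grepl("ACGT", headers, fixed = TRUE)
in_header
```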

Details

The processAmplicons function allows for hairpins/sgRNAs/sample index sequences to be in variable positions within each read.

The input barcode and hairpin/sgRNA files are tab-separated text files with at least two columns, named 'ID' and 'Sequences': the first contains the sample or hairpin/sgRNA ids and the second contains the sample index or hairpin/sgRNA sequences to be matched. If dualIndexForwardRead is TRUE, a third column 'Sequences2' is expected in the barcode file. If readfile2 is specified, another column 'SequencesReverse' is expected in the barcode file. The barcode file may also contain a 'group' column that indicates which experimental group a sample belongs to. Additional columns in each file will be included in the respective $samples or $genes data.frames of the final DGEList object. These files, along with the fastq file(s), are assumed to be in the current working directory.
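
As a rough illustration of the file layout described above, the sketch below writes a small barcode file and hairpin/sgRNA file with base R. All IDs, sequences, the 'Gene' column and the file names are invented for the example; only the 'ID', 'Sequences' and 'group' column names come from the description.

```r
# Hypothetical sample sheet: IDs, index sequences and group labels are
# invented for illustration only.
barcodes <- data.frame(
  ID        = c("Sample1", "Sample2"),
  Sequences = c("ACGTAC", "TGCATG"),
  group     = c("control", "treated")
)

# Hypothetical hairpin/sgRNA library with one extra annotation column.
hairpins <- data.frame(
  ID        = c("sgRNA_1", "sgRNA_2"),
  Sequences = c("GATTACAGATTACAGATTAC", "CCGGAATTCCGGAATTCCGG"),
  Gene      = c("GeneA", "GeneB")
)

# Written to a temporary directory here; the file names are arbitrary.
barcodefile <- file.path(tempdir(), "Barcodes.txt")
hairpinfile <- file.path(tempdir(), "Hairpins.txt")
write.table(barcodes, barcodefile, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(hairpins, hairpinfile, sep = "\t", quote = FALSE, row.names = FALSE)

# Read one file back to confirm the header row and tab separation.
read.table(barcodefile, header = TRUE, sep = "\t")
```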

To compute the count matrix, matching to the given barcodes and hairpins/sgRNAs is conducted in two rounds. The first round looks for an exact sequence match for the given barcode sequences and hairpin/sgRNA sequences through the entire read, returning the first match found. If no exact match is found and allowMismatch is set to TRUE, the program performs a second round of matching which allows for sequence mismatches. The maximum number of mismatched bases allowed in barcode and hairpin/sgRNA sequences is specified by the parameters barcodeMismatchBase and hairpinMismatchBase respectively.
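
The two-round idea can be sketched in base R as follows. This is a conceptual illustration only: it is not edgeR's implementation, it handles a single pattern and substitutions only, and every function name in it is hypothetical.

```r
# Conceptual illustration only; all names here are hypothetical and the
# real matching in edgeR may differ in detail.

# Round 1: position of the first exact occurrence of `pattern` in `read`,
# or NA if there is none.
exact_match <- function(pattern, read) {
  pos <- regexpr(pattern, read, fixed = TRUE)
  if (pos == -1) NA_integer_ else as.integer(pos)
}

# Round 2: start of the first window of `read` that differs from `pattern`
# at no more than `max_mismatch` bases (substitutions only), or NA.
mismatch_match <- function(pattern, read, max_mismatch) {
  p <- strsplit(pattern, "")[[1]]
  r <- strsplit(read, "")[[1]]
  n_windows <- length(r) - length(p) + 1
  if (n_windows < 1) return(NA_integer_)
  for (start in seq_len(n_windows)) {
    window <- r[start:(start + length(p) - 1)]
    if (sum(window != p) <= max_mismatch) return(start)
  }
  NA_integer_
}

# Try the exact round first; fall back to the mismatch round only if allowed.
two_round_match <- function(pattern, read, allow_mismatch = FALSE,
                            max_mismatch = 2) {
  pos <- exact_match(pattern, read)
  if (is.na(pos) && allow_mismatch) {
    pos <- mismatch_match(pattern, read, max_mismatch)
  }
  pos
}

# The read carries the hairpin with a single substituted base.
read <- "ACGTACTTGATTACAGTTTACAGATTACGG"
strict  <- two_round_match("GATTACAGATTACAGATTAC", read)   # NA: no exact match
relaxed <- two_round_match("GATTACAGATTACAGATTAC", read,
                           allow_mismatch = TRUE)          # 9
```

Here `allow_mismatch` and `max_mismatch` play the roles of allowMismatch and hairpinMismatchBase; the same logic would apply to barcodes with barcodeMismatchBase.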

The program outputs a DGEList object, with a count matrix indicating the number of times each barcode and hairpin/sgRNA combination could be matched in reads from input fastq file(s).
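
A minimal end-to-end sketch, assuming edgeR is installed, is given here. The barcodes, hairpins, reads and file names are toy data invented for the example, so the resulting counts carry no biological meaning; the call uses only the arguments listed under Usage.

```r
# Toy data invented for illustration; file names are arbitrary.
tmp <- tempdir()
old <- setwd(tmp)  # input files are assumed to be in the working directory

writeLines(c("ID\tSequences\tgroup",
             "Sample1\tACGTAC\tcontrol",
             "Sample2\tTGCATG\ttreated"), "Barcodes.txt")
writeLines(c("ID\tSequences\tGene",
             "sgRNA_1\tGATTACAGATTACAGATTAC\tGeneA",
             "sgRNA_2\tCCGGAATTCCGGAATTCCGG\tGeneB"), "Hairpins.txt")

# Each toy read: barcode, a short spacer, then a hairpin/sgRNA sequence.
reads <- c("ACGTACTTGATTACAGATTACAGATTACGG",
           "TGCATGTTCCGGAATTCCGGAATTCCGGGG",
           "ACGTACTTCCGGAATTCCGGAATTCCGGGG")

# Interleave header, sequence, separator and quality lines (4 per read).
fastq <- as.vector(rbind(paste0("@read", seq_along(reads)),
                         reads, "+", strrep("I", nchar(reads))))
writeLines(fastq, "reads.fastq")

if (requireNamespace("edgeR", quietly = TRUE)) {
  x <- edgeR::processAmplicons("reads.fastq",
                               barcodefile = "Barcodes.txt",
                               hairpinfile = "Hairpins.txt",
                               verbose = TRUE)
  print(x$counts)   # rows are hairpins/sgRNAs, columns are samples
  print(x$samples)  # sample information, including library sizes
  print(x$genes)    # hairpin/sgRNA annotation
}
setwd(old)
```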

For further examples and data, refer to the case studies available from http://bioinf.wehi.edu.au/shRNAseq.

Value

Returns a DGEList object with the following components:

`counts`	read count matrix tallying up the number of reads with particular barcode and hairpin/sgRNA matches. Each row is a hairpin/sgRNA and each column is a sample
`genes`	In this case, hairpin/sgRNA-specific information (ID, sequences, corresponding target gene) may be recorded in this data.frame
`lib.size`	auto-calculated column sum of the counts matrix

Note

This function replaced the earlier function processHairpinReads in edgeR 3.7.17.

This function replaces the previous processAmplicons function, which expected the sequences in the fastq files to have a fixed structure (as per Figure 1A of Dai et al., 2014). It is intended for reads where hairpin/sgRNA and sample index sequences can be in variable positions within each read. When plotPositions=TRUE, a density plot of the match positions is created to allow the user to assess whether they occur in the expected positions.

Author(s)

Oliver Voogd, Zhiyin Dai, Shian Su and Matthew Ritchie

References

Dai Z, Sheridan JM, Gearing LJ, Moore DL, Su S, Wormald S, Wilcox S, O'Connor L, Dickins RA, Blewitt ME, Ritchie ME (2014). edgeR: a versatile tool for the analysis of shRNA-seq and CRISPR-Cas9 genetic screens. F1000Research 3, 95. http://f1000research.com/articles/3-95

edgeR

Empirical Analysis of Digital Gene Expression Data in R

v3.32.1

GPL (>=2)

Authors

Yunshun Chen, Aaron TL Lun, Davis J McCarthy, Matthew E Ritchie, Belinda Phipson, Yifang Hu, Xiaobei Zhou, Mark D Robinson, Gordon K Smyth

Initial release

2021-01-14