Gene Expression Quantification

To enable the quantification module, set the --enable-rna-quantification option to true in your RNA-seq command line. Quantification also requires a gene annotation file (GTF/GFF) that provides the genomic positions of all transcripts to quantify. Specify this file with the -a (or --annotation-file) option.

Quantification Outputs

Transcript quantification results are reported in the <outputPrefix>.quant.sf file. This is a text file that lists results for each transcript. For example:

Name                 Length  EffectiveLength  TPM      NumReads
ENST00000364415.1    116     12.3238          5.2328   1
ENST00000564138.1    2775    2105.58          1.28293  41.8885

• Name is the transcript ID.
• Length is the length of the spliced transcript in base pairs.
• EffectiveLength is the transcript length accessible to RNA-seq fragments, accounting for insert size and edge effects.
• TPM is transcripts per million, the expression of the transcript normalized for transcript length and sequencing depth.
• NumReads is the estimated number of reads from the transcript (not normalized).
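
As a rough illustration of how these columns relate, the following Python sketch parses rows in the quant.sf layout and compares the NumReads/EffectiveLength rate with the reported TPM. It assumes the conventional TPM definition (each transcript's reads-per-effective-base rate divided by the sum of all rates, times one million), which this guide does not spell out. The function names are illustrative and are not part of DRAGEN.

```python
import io

NUMERIC_COLUMNS = ("Length", "EffectiveLength", "TPM", "NumReads")


def read_quant_sf(handle):
    """Parse whitespace-delimited quant.sf text into a list of dicts."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        for column in NUMERIC_COLUMNS:
            row[column] = float(row[column])
        rows.append(row)
    return rows


def tpm_from_counts(rows):
    """Recompute TPM under the conventional definition (an assumption here)."""
    rates = [
        r["NumReads"] / r["EffectiveLength"] if r["EffectiveLength"] > 0 else 0.0
        for r in rows
    ]
    total = sum(rates)
    return [rate / total * 1e6 if total > 0 else 0.0 for rate in rates]


# The two example rows shown above.
EXAMPLE = """Name Length EffectiveLength TPM NumReads
ENST00000364415.1 116 12.3238 5.2328 1
ENST00000564138.1 2775 2105.58 1.28293 41.8885
"""

rows = read_quant_sf(io.StringIO(EXAMPLE))
first, second = rows

# Ratios between transcripts do not depend on the file-wide normalizer,
# so they can be compared even with only two rows available.
rate_ratio = (first["NumReads"] / first["EffectiveLength"]) / (
    second["NumReads"] / second["EffectiveLength"]
)
tpm_ratio = first["TPM"] / second["TPM"]
print(f"rate ratio: {rate_ratio:.3f}, TPM ratio: {tpm_ratio:.3f}")
```

For the two example rows, both ratios come out at about 4.08, which is consistent with the assumed definition. Calling tpm_from_counts on a partial file does not reproduce the reported TPM values, because the normalizer must be summed over every transcript in the file.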

This file can be used as input for differential gene expression analysis with tools such as tximport and DESeq2.

Similarly, the <outputPrefix>.quant.genes.sf file contains quantification results at the gene level. Gene-level results are produced by summing over all transcripts that share the same gene ID in the annotation file (GTF). Length and EffectiveLength are the expression-weighted means of the individual transcripts in the gene.
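
The gene-level roll-up described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not DRAGEN's implementation: it assumes that the expression weight is each transcript's TPM, that TPM and NumReads are the summed quantities, and that a gene with zero total TPM falls back to an unweighted mean. The transcript names, gene ID, and values are made up for illustration.

```python
from collections import defaultdict


def summarize_genes(transcripts, tx_to_gene):
    """Aggregate transcript rows (dicts with quant.sf columns) by gene ID."""
    groups = defaultdict(list)
    for row in transcripts:
        groups[tx_to_gene[row["Name"]]].append(row)

    genes = {}
    for gene_id, members in groups.items():
        total_tpm = sum(m["TPM"] for m in members)
        if total_tpm > 0:
            # Assumption: weight each transcript by its share of the gene's TPM.
            weights = [m["TPM"] / total_tpm for m in members]
        else:
            # Assumption: unexpressed genes use a plain (unweighted) mean.
            weights = [1.0 / len(members)] * len(members)
        genes[gene_id] = {
            "Length": sum(w * m["Length"] for w, m in zip(weights, members)),
            "EffectiveLength": sum(
                w * m["EffectiveLength"] for w, m in zip(weights, members)
            ),
            "TPM": total_tpm,
            "NumReads": sum(m["NumReads"] for m in members),
        }
    return genes


# Hypothetical input: two isoforms of one gene.
transcripts = [
    {"Name": "txA", "Length": 1000.0, "EffectiveLength": 800.0, "TPM": 30.0, "NumReads": 60.0},
    {"Name": "txB", "Length": 2000.0, "EffectiveLength": 1800.0, "TPM": 10.0, "NumReads": 45.0},
]
tx_to_gene = {"txA": "geneX", "txB": "geneX"}

gene_rows = summarize_genes(transcripts, tx_to_gene)
print(gene_rows["geneX"])
```

In this made-up case txA carries three quarters of the gene's expression, so the gene-level Length (1250) and EffectiveLength (1050) sit closer to the txA values than a plain average would.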

DRAGEN reports several metrics relevant to RNA transcripts and quantification in the following files.

• <outputPrefix>.quant.metrics.csv—Summary statistics relevant to RNA transcripts and quantification, including Transcript fragments, Intron fragments, Intergenic fragments, Reverse transcript fragments, Equivalence classes, and Median CV coverage.
• <outputPrefix>.quant.transcript_fragment_lengths.txt—Full fragment length distribution of reads mapped to transcripts.
• <outputPrefix>.quant.transcript_coverage.txt—Average 5' to 3' transcript coverage pattern.
• <outputPrefix>.SJ.saturation.txt—Saturation of splice junctions observed or discovered as a function of reads processed.

Quantification Options

• --enable-rna-quantification
  If set to true, enables RNA quantification. Requires --enable-rna to be set to true as well.

• --rna-quantification-library-type
  Specifies the type of RNA-seq library. The following values are available:
  – IU—Paired-end unstranded library.
  – ISR—Paired-end stranded library in which read2 matches the transcript strand (for example, TruSeq RNA).
  – ISF—Paired-end stranded library in which read1 matches the transcript strand.
  – U—Single-end unstranded library.
  – SR—Single-end stranded library in which reads are in reverse orientation to the transcript strand (for example, TruSeq RNA).
  – SF—Single-end stranded library in which reads match the transcript strand.
  – A—Autodetect (default). DRAGEN examines the first reads or read pairs in the data set to determine the library type automatically.

• --rna-quantification-gc-bias
  GC bias correction estimates the effect of transcript %GC on sequencing coverage and accounts for it when estimating expression. Setting this option to false disables GC bias correction.

• --rna-quantification-fld-max, --rna-quantification-fld-mean, --rna-quantification-fld-sd
  These options specify the insert size distribution (maximum, mean, and standard deviation) of the RNA-seq library for single-end runs. The distribution is relevant for GC bias correction. The defaults are a mean of 250 and a standard deviation of 25, with a maximum of 1000. Setting values that match the actual library can improve accuracy.
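
To show how the options above fit together, here is a small Python sketch that picks a library-type code and assembles the quantification-related arguments. The helper names, the "forward"/"reverse" labels, and the genes.gtf path are hypothetical. The sketch passes each value as a separate token after its flag, which is an assumption about the command-line form, and it leaves out everything a full DRAGEN command needs beyond the options named in this section.

```python
# Library-type codes from the list above, keyed by (paired_end, strand).
# "forward": read1 (or the single read) matches the transcript strand.
# "reverse": read2 matches the transcript strand, or single reads are reversed.
LIBRARY_TYPES = {
    (True, None): "IU",
    (True, "reverse"): "ISR",
    (True, "forward"): "ISF",
    (False, None): "U",
    (False, "reverse"): "SR",
    (False, "forward"): "SF",
}


def library_type(paired_end, strand=None):
    """Return the library-type code for a layout, or raise ValueError."""
    try:
        return LIBRARY_TYPES[(paired_end, strand)]
    except KeyError:
        raise ValueError(f"unknown strand setting: {strand!r}") from None


def quantification_args(annotation_file, lib_type="A", gc_bias=True,
                        fld_mean=None, fld_sd=None, fld_max=None):
    """Build only the quantification-related arguments (hypothetical helper)."""
    args = [
        "--enable-rna", "true",
        "--enable-rna-quantification", "true",
        "--annotation-file", annotation_file,
        "--rna-quantification-library-type", lib_type,
    ]
    if not gc_bias:
        args += ["--rna-quantification-gc-bias", "false"]
    # The fld options apply to single-end runs; emit only those that were set.
    for flag, value in (
        ("--rna-quantification-fld-mean", fld_mean),
        ("--rna-quantification-fld-sd", fld_sd),
        ("--rna-quantification-fld-max", fld_max),
    ):
        if value is not None:
            args += [flag, str(value)]
    return args


# Example: single-end, reverse-stranded library with a known insert size.
single_end_args = quantification_args(
    "genes.gtf",  # hypothetical annotation path
    lib_type=library_type(paired_end=False, strand="reverse"),
    fld_mean=180,
    fld_sd=30,
)
print(" ".join(single_end_args))
```

Leaving lib_type at "A" keeps the default autodetection. Setting an explicit code is mainly useful when the library preparation is known in advance.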