Gene Expression Quantification
The DRAGEN RNA pipeline contains a gene expression quantification module that estimates the expression of each transcript and gene in an RNA-seq dataset. First, it internally translates the genomic mapping of each read (or read pair) to the corresponding transcript mappings. Then, it uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observed reads. The EM algorithm can also model GC bias and correct for it in the reported quantification results.
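
To make the EM step more concrete, the following is a minimal sketch of the general technique, not DRAGEN's implementation. It assumes each read has already been reduced to the list of transcripts it is compatible with, and it ignores GC bias. The function name, the input layout, and the toy data are hypothetical.

```python
from collections import defaultdict


def em_transcript_counts(read_compat, eff_lengths, n_iter=200, tol=1e-9):
    """Estimate per-transcript read counts with a basic EM loop.

    read_compat: list of lists; each inner list names the transcripts that
                 one read (or read pair) is compatible with.
    eff_lengths: dict mapping transcript name -> effective length.
    """
    transcripts = list(eff_lengths)
    # Start by spreading all reads evenly over the transcripts.
    counts = {t: len(read_compat) / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        new_counts = defaultdict(float)
        for compat in read_compat:
            # E-step: weight each candidate transcript by its current
            # abundance per effective base.
            weights = [counts[t] / eff_lengths[t] for t in compat]
            total = sum(weights)
            if total == 0:
                continue
            for t, w in zip(compat, weights):
                new_counts[t] += w / total
        # M-step: the fractional assignments become the new estimates.
        delta = max(abs(new_counts[t] - counts[t]) for t in transcripts)
        counts = {t: new_counts[t] for t in transcripts}
        if delta < tol:
            break
    return counts


# Toy data (hypothetical): two reads unique to A, one unique to B,
# and one read that is ambiguous between A and B.
toy_reads = [["A"], ["A"], ["A", "B"], ["B"]]
toy_counts = em_transcript_counts(toy_reads, {"A": 1000.0, "B": 1000.0})
print(toy_counts)
```

In this toy case the ambiguous read is split in proportion to the evidence from the uniquely assigned reads, so transcript A receives the larger share.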

To enable the quantification module, set the --enable-rna-quantification option to true in your current RNA-seq command-line scripts. Quantification also requires a gene annotation file (GTF/GFF), which provides the genomic position of all transcripts to quantify. Specify this file with the -a (or --annotation-file) option.
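
As an illustration, the sketch below assembles such a command line in Python. Only --enable-rna, --enable-rna-quantification, and -a come from this section. The remaining options (-r, -1, -2, --output-directory, --output-file-prefix) and all paths are assumptions standing in for whatever your existing RNA-seq script already uses, and a real run will likely need further options that are not shown here.

```python
import shlex


def build_rna_quant_command(ref_dir, fastq1, fastq2, annotation, out_dir, prefix):
    """Return a DRAGEN argument list with RNA quantification enabled.

    All arguments are hypothetical placeholders supplied by the caller.
    """
    return [
        "dragen",
        "-r", ref_dir,                 # assumed: reference hash table directory
        "-1", fastq1,                  # assumed: first FASTQ file
        "-2", fastq2,                  # assumed: second FASTQ file
        "--enable-rna", "true",
        "--enable-rna-quantification", "true",
        "-a", annotation,              # GTF/GFF gene annotation file
        "--output-directory", out_dir,
        "--output-file-prefix", prefix,
    ]


cmd = build_rna_quant_command(
    "/path/to/reference",
    "sample_R1.fastq.gz",
    "sample_R2.fastq.gz",
    "/path/to/annotation.gtf",
    "/path/to/output",
    "sample",
)
# Print the command for review instead of running it.
print(shlex.join(cmd))
```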

Transcript quantification results are reported in the <outputPrefix>.quant.sf file. This is a text file that lists results for each transcript. For example:
Name Length EffectiveLength TPM NumReads
ENST00000364415.1 116 12.3238 5.2328 1
ENST00000564138.1 2775 2105.58 1.28293 41.8885
• Name lists the transcript ID of the transcript.
• Length is the length of the (spliced) transcript in base pairs.
• EffectiveLength is the length as accessible to RNA-seq, accounting for insert-size and edge effects.
• TPM is Transcripts per Million, which represents the expression of the transcript, normalized for transcript length and sequencing depth.
• NumReads is the estimated number of reads from the transcript (not normalized).
This file can be used as input for differential gene expression using tools such as tximport and DESeq2.
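
For a quick look at the file outside of those tools, the following sketch parses the example rows shown above and relates the columns to each other. It assumes the file is whitespace-delimited with a single header line, and it uses the conventional TPM definition (NumReads divided by EffectiveLength, rescaled to sum to one million). Whether DRAGEN computes TPM in exactly this way is an assumption. The function names are hypothetical.

```python
import io

# The two example rows from this section; a real file lists every transcript.
EXAMPLE = """Name Length EffectiveLength TPM NumReads
ENST00000364415.1 116 12.3238 5.2328 1
ENST00000564138.1 2775 2105.58 1.28293 41.8885
"""

NUMERIC_COLUMNS = ("Length", "EffectiveLength", "TPM", "NumReads")


def read_quant_sf(handle):
    """Parse quant.sf-style text into a list of dicts, one per transcript."""
    rows = []
    header = None
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if header is None:
            header = fields
            continue
        row = dict(zip(header, fields))
        for key in NUMERIC_COLUMNS:
            row[key] = float(row[key])
        rows.append(row)
    return rows


def tpm_from_counts(rows):
    """Recompute TPM from NumReads and EffectiveLength over the given rows."""
    rates = [r["NumReads"] / r["EffectiveLength"] for r in rows]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


rows = read_quant_sf(io.StringIO(EXAMPLE))
for r in rows:
    # Ratio of reported TPM to reads per effective base.
    print(r["Name"], r["TPM"] / (r["NumReads"] / r["EffectiveLength"]))
```

For both example rows the printed ratio is about 64.49, which is consistent with TPM being proportional to NumReads divided by EffectiveLength. Recomputing TPM with tpm_from_counts only reproduces the reported values when all transcripts in the file are included.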
Similarly, the <outputPrefix>.quant.genes.sf file contains quantification results at the gene level. These are produced by summing together all transcripts with the same geneID in the annotation (GTF). Length and EffectiveLength are the (expression-)weighted means of the individual transcripts in the gene.
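
The gene-level summary described above can be sketched as follows. This is an illustration under stated assumptions, not DRAGEN's code: it assumes TPM and NumReads are the values that are summed, that TPM is the expression weight for the length averages, and that an unweighted mean is used when a gene has zero expression. The function name, the transcript-to-gene mapping, and the toy values are hypothetical.

```python
from collections import defaultdict


def summarize_genes(transcript_rows, tx_to_gene):
    """Collapse transcript-level rows into one row per gene.

    transcript_rows: list of dicts with Name, Length, EffectiveLength,
                     TPM, and NumReads (numeric values as floats).
    tx_to_gene:      dict mapping transcript ID -> gene ID, as it would be
                     derived from the annotation file.
    """
    grouped = defaultdict(list)
    for row in transcript_rows:
        grouped[tx_to_gene[row["Name"]]].append(row)

    genes = []
    for gene_id, rows in grouped.items():
        tpm = sum(r["TPM"] for r in rows)
        if tpm > 0:
            # Expression-weighted means (assumed weight: TPM).
            length = sum(r["Length"] * r["TPM"] for r in rows) / tpm
            eff_length = sum(r["EffectiveLength"] * r["TPM"] for r in rows) / tpm
        else:
            # Assumed fallback for unexpressed genes: plain mean.
            length = sum(r["Length"] for r in rows) / len(rows)
            eff_length = sum(r["EffectiveLength"] for r in rows) / len(rows)
        genes.append({
            "Name": gene_id,
            "Length": length,
            "EffectiveLength": eff_length,
            "TPM": tpm,
            "NumReads": sum(r["NumReads"] for r in rows),
        })
    return genes


# Toy input (hypothetical values): two transcripts of one gene.
toy_rows = [
    {"Name": "tx1", "Length": 1000.0, "EffectiveLength": 800.0, "TPM": 30.0, "NumReads": 60.0},
    {"Name": "tx2", "Length": 2000.0, "EffectiveLength": 1800.0, "TPM": 10.0, "NumReads": 45.0},
]
toy_genes = summarize_genes(toy_rows, {"tx1": "geneA", "tx2": "geneA"})
print(toy_genes)
```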
DRAGEN reports several metrics relevant to RNA transcripts and quantification in the following files:
• <outputPrefix>.quant.metrics.csv: Summary statistics relevant to RNA transcripts and quantification, including Transcript fragments, Intron fragments, Intergenic fragments, Reverse transcript fragments, Equivalence classes, and Median CV coverage, among other metrics.
• <outputPrefix>.quant.transcript_fragment_lengths.txt: Full fragment length distribution of reads mapped to transcripts.
• <outputPrefix>.quant.transcript_coverage.txt: Average 5' to 3' transcript coverage pattern.
• <outputPrefix>.SJ.saturation.txt: Saturation of splice junctions observed or discovered as a function of reads processed.

• --enable-rna-quantification
If set to true, enables RNA quantification. Requires --enable-rna to be set to true as well.
• --rna-quantification-library-type
Specifies the type of RNA-seq library.
– IU: Paired-end unstranded library.
– ISR: Paired-end stranded library in which read2 matches the transcript strand (e.g., TruSeq RNA).
– ISF: Paired-end stranded library in which read1 matches the transcript strand.
– U: Single-end unstranded library.
– SR: Single-end stranded library in which reads are in reverse orientation to the transcript strand (e.g., TruSeq RNA).
– SF: Single-end stranded library in which reads match the transcript strand.
– A: Autodetect (the default). DRAGEN examines the first reads or read pairs in the dataset to determine the library type automatically.
• --rna-quantification-gc-bias
GC bias correction estimates the effect of transcript %GC on sequencing coverage and accounts for it when estimating expression. Setting this option to false disables GC bias correction.
• --rna-quantification-fld-max, --rna-quantification-fld-mean, --rna-quantification-fld-sd
These options specify the insert size distribution of the RNA-seq library for single-end runs, which is relevant for GC bias correction. The defaults are a mean of 250, a standard deviation of 25, and a maximum of 1000. Setting them to values that match the specific library can improve accuracy.
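
To illustrate what a mean, standard deviation, and maximum describe, the sketch below builds a discretized bell-shaped fragment length distribution from the default values and uses it to compute an effective transcript length, which is one common use of such a distribution in quantification tools. This section does not state how DRAGEN applies these parameters beyond GC bias correction, so the normal shape, the effective length formula, and the function names are all assumptions.

```python
import math


def fragment_length_pmf(mean=250.0, sd=25.0, max_len=1000):
    """Discretized normal over fragment lengths 1..max_len, normalized to 1.

    Element i of the returned list is the probability of length i + 1.
    """
    weights = [
        math.exp(-0.5 * ((f - mean) / sd) ** 2) for f in range(1, max_len + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def effective_length(transcript_len, pmf):
    """Expected number of start positions for a fragment that fits.

    Fragments longer than the transcript are excluded, and the remaining
    probabilities are renormalized before averaging (L - f + 1).
    """
    usable = pmf[:transcript_len]
    mass = sum(usable)
    if mass == 0:
        return 0.0
    expected = sum(
        p * (transcript_len - f + 1) for f, p in enumerate(usable, start=1)
    )
    return expected / mass


pmf = fragment_length_pmf()  # defaults from this section: 250, 25, 1000
print(effective_length(2775, pmf))
print(effective_length(116, pmf))
```

Under these assumptions, a long transcript loses roughly one mean fragment length, while a transcript shorter than most fragments keeps only a small effective length.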