Gene Annotation File

In addition to the standard input files (reads from FASTQ or BAM, reference genome, etc.), DRAGEN can also take a gene annotation file as input. A gene annotation file aids in the alignment of reads to known splice junctions and is required for gene expression quantification and gene fusion calling.

To specify a gene annotation file, use the -a (--annotation-file) command line option. The input file must conform to the GTF/GFF specification (http://uswest.ensembl.org/info/website/upload/gff.html). The file must contain features of type exon, and each exon record must contain gene_id and transcript_id attributes. An example of a valid GTF file is shown below.

chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …
chr1    HAVANA  exon        11869   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …
chr1    HAVANA  exon        12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …
chr1    HAVANA  exon        13221   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …
chr1    ENSEMBL transcript  11872   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …
chr1    ENSEMBL exon        11872   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …
chr1    ENSEMBL exon        12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …
chr1    ENSEMBL exon        13225   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …
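
The requirements above can be checked before a run with a short script. The following Python sketch is illustrative only and is not part of DRAGEN. The function name check_gtf is hypothetical, and the sketch assumes the file is tab-delimited with nine columns, as the GTF format prescribes, and that the gene_id and transcript_id values are quoted as in the example.

```python
import re

# Matches quoted GTF attribute pairs such as: gene_id "ENSG00000223972.4";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def check_gtf(lines):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: verifies that at least one exon feature exists and
    that every exon record carries gene_id and transcript_id attributes.
    """
    problems = []
    exon_count = 0
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(f"line {number}: expected 9 tab-separated fields")
            continue
        if fields[2] != "exon":
            continue  # only exon records are checked here
        exon_count += 1
        attrs = dict(ATTR_RE.findall(fields[8]))
        for key in ("gene_id", "transcript_id"):
            if key not in attrs:
                problems.append(f"line {number}: exon record lacks {key}")
    if exon_count == 0:
        problems.append("no exon features found")
    return problems
```

To check a file on disk, pass an open file handle, for example check_gtf(open(path)), where path is the annotation file you intend to supply with -a.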


Similarly, a GFF file can be used. Each exon feature must have a Parent attribute whose value is a transcript identifier, which is used to group exons into transcripts. An example of a valid GFF file is shown below.

1   ensembl_havana  processed_transcript    11869   14409       .   +   .   ID=transcript:ENST00000456328;
1   havana          exon                    11869   12227       .   +   .   Parent=transcript:ENST00000456328; …
1   havana          exon                    12613   12721       .   +   .   Parent=transcript:ENST00000456328; …
1   havana          exon                    13221   14409       .   +   .   Parent=transcript:ENST00000456328; …
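
The grouping of exons by their Parent attribute can be sketched as follows. This is a minimal Python illustration, not DRAGEN's parser. The function names are hypothetical, and the sketch assumes tab-delimited GFF records with key=value attributes separated by semicolons. A comma-separated Parent value is treated as a list of parents.

```python
from collections import defaultdict


def parse_gff_attributes(column9):
    """Turn 'ID=a;Parent=b' into {'ID': 'a', 'Parent': 'b'}."""
    attrs = {}
    for item in column9.strip().split(";"):
        item = item.strip()
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def group_exons_by_parent(lines):
    """Map each Parent identifier to a sorted list of (start, end) exons."""
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives/comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # only well-formed exon records are grouped
        parent = parse_gff_attributes(fields[8]).get("Parent")
        if parent is None:
            raise ValueError(
                f"exon at {fields[0]}:{fields[3]}-{fields[4]} has no Parent"
            )
        for transcript in parent.split(","):
            exons[transcript].append((int(fields[3]), int(fields[4])))
    return {transcript: sorted(coords) for transcript, coords in exons.items()}
```

An exon record without a Parent attribute raises an error in this sketch, because such an exon cannot be assigned to any transcript.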

The DRAGEN host software parses the file for exons within the transcripts and produces splice junctions. The following output displays the number of splice junctions detected.

==================================================================
Generating annotated splice junctions
==================================================================
Input annotations file: ./gencode.v19.annotation.gtf
Splice junctions database file: output/rna.sjdb.annotations.out.tab

Number of genes: 27459
Number of transcripts: 196520
Number of exons: 1196293
Number of splice junctions: 343856
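
To illustrate how splice junctions follow from exon coordinates, the Python sketch below derives one junction from each pair of consecutive exons in a transcript and removes duplicates shared between transcripts. It is an assumption-laden illustration, not DRAGEN's implementation: the function name is hypothetical, and the junction is represented here by the first and last base of the gap between two exons, which may differ from the convention used in the sjdb file.

```python
def annotated_junctions(transcripts):
    """Derive unique splice junctions from per-transcript exon lists.

    transcripts maps a transcript id to (chrom, strand, [(start, end), ...])
    with 1-based inclusive exon coordinates. Returns a sorted list of
    (chrom, junction_start, junction_end, strand) tuples.
    """
    junctions = set()
    for chrom, strand, exons in transcripts.values():
        ordered = sorted(exons)
        # Pair each exon with the next one along the chromosome.
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
            if right_start > left_end + 1:  # a gap exists between the exons
                junctions.add((chrom, left_end + 1, right_start - 1, strand))
    return sorted(junctions)


# Exons of the two transcripts from the GTF example in this section.
example = {
    "ENST00000456328.2": ("chr1", "+", [(11869, 12227), (12613, 12721), (13221, 14409)]),
    "ENST00000515242.2": ("chr1", "+", [(11872, 12227), (12613, 12721), (13225, 14412)]),
}
for junction in annotated_junctions(example):
    print(junction)
```

Under these assumptions the two example transcripts share their first junction, so the six exons yield three unique junctions rather than four. The same kind of sharing is consistent with the junction count in the output above being much smaller than the exon count.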

The splice junctions detected from the annotation file are also written to *.sjdb.annotations.out.tab. Annotated splice junctions shorter than a minimum length are excluded, which lowers the false detection rate caused by incorrectly annotated junctions. The minimum length is set with the --rna-ann-sj-min-len option, which has a default value of 6.
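
The effect of the minimum length can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, junctions are given as (chrom, start, end, strand) tuples with 1-based inclusive coordinates, and the length of a junction is taken to be the number of bases it spans. DRAGEN's exact definition of junction length is not stated here and may differ.

```python
def filter_short_junctions(junctions, min_len=6):
    """Keep junctions whose span is at least min_len bases.

    min_len mirrors the documented default of --rna-ann-sj-min-len (6).
    The span is computed as end - start + 1, which is an assumption of
    this sketch.
    """
    kept = []
    for chrom, start, end, strand in junctions:
        if end - start + 1 >= min_len:
            kept.append((chrom, start, end, strand))
    return kept


candidates = [
    ("chr1", 100, 104, "+"),      # spans 5 bases: dropped at the default
    ("chr1", 200, 205, "+"),      # spans 6 bases: kept
    ("chr1", 12228, 12612, "+"),  # spans 385 bases: kept
]
print(filter_short_junctions(candidates))
```

Raising min_len removes more short annotated junctions, and lowering it retains them.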