Gene Annotation File
In addition to the standard input files (reads from FASTQ or BAM, a reference genome, and so on), DRAGEN can also take a gene annotation file as input. A gene annotation file aids in the alignment of reads to known splice junctions, and is required for gene expression quantification and gene fusion calling.
To specify a gene annotation file, use the -a (--annotation-file) command line option. The input file must conform to the GTF/GFF specification (http://uswest.ensembl.org/info/website/upload/gff.html). The file must contain features of type exon, and each such record must contain the gene_id and transcript_id attributes. An example of a valid GTF file is shown below.
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; …
chr1 ENSEMBL transcript 11872 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …
chr1 ENSEMBL exon 11872 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …
chr1 ENSEMBL exon 12613 12721 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …
chr1 ENSEMBL exon 13225 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; …
…
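The GTF requirements above can be checked before a run with a short script. The following is a minimal Python sketch, not part of DRAGEN: the function names are hypothetical, and it assumes the file is tab-delimited with nine columns, as the GTF format defines (the listing above shows the columns separated by spaces).

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_attributes(field):
    """Turn 'gene_id "X"; transcript_id "Y";' into a dict."""
    return dict(ATTR_RE.findall(field))


def check_gtf_exons(lines):
    """Return (exon_count, problems) for an iterable of GTF lines.

    problems is a list of (line_number, message) tuples.
    """
    exon_count = 0
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append((number, "expected 9 tab-separated columns"))
            continue
        if columns[2] != "exon":
            continue  # only exon records are checked here
        exon_count += 1
        attributes = parse_gtf_attributes(columns[8])
        for key in ("gene_id", "transcript_id"):
            if key not in attributes:
                problems.append((number, "exon record lacks " + key))
    return exon_count, problems


# Records taken from the example above, rebuilt with tab separators.
ATTRS = 'gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2";'
sample = [
    "\t".join(["chr1", "HAVANA", "transcript", "11869", "14409", ".", "+", ".", ATTRS]),
    "\t".join(["chr1", "HAVANA", "exon", "11869", "12227", ".", "+", ".", ATTRS]),
    "\t".join(["chr1", "HAVANA", "exon", "12613", "12721", ".", "+", ".", ATTRS]),
    "\t".join(["chr1", "HAVANA", "exon", "13221", "14409", ".", "+", ".", ATTRS]),
]

exon_count, problems = check_gtf_exons(sample)
print(exon_count, problems)
```

A file that yields an exon count of zero, or a non-empty problem list, would not meet the requirements stated above.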
Similarly, a GFF file can be used. Each exon feature must have a Parent attribute that names its transcript; this transcript identifier is used to group exons. An example of a valid GFF file is shown below.
1 ensembl_havana processed_transcript 11869 14409 . + . ID=transcript:ENST00000456328;
1 havana exon 11869 12227 . + . Parent=transcript:ENST00000456328; …
1 havana exon 12613 12721 . + . Parent=transcript:ENST00000456328; …
1 havana exon 13221 14409 . + . Parent=transcript:ENST00000456328; …
…
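The grouping of exons by their Parent attribute can be illustrated with a short Python sketch. It is an illustration only and does not reflect how the DRAGEN host software is implemented: the function names are hypothetical, the input is assumed to be tab-delimited, and a comma-separated Parent value is treated as multiple parents, as GFF3 allows.

```python
from collections import defaultdict


def parse_gff_attributes(field):
    """Turn 'ID=a;Parent=b;' into a dict."""
    attributes = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = value
    return attributes


def group_exons_by_parent(lines):
    """Map each Parent transcript identifier to its sorted exon spans."""
    exons = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.split("\t")
        if len(columns) != 9 or columns[2] != "exon":
            continue  # only well-formed exon records are grouped
        parent = parse_gff_attributes(columns[8]).get("Parent")
        if parent is None:
            raise ValueError("exon without Parent attribute: " + line)
        span = (int(columns[3]), int(columns[4]))
        for transcript in parent.split(","):
            exons[transcript].append(span)
    return {transcript: sorted(spans) for transcript, spans in exons.items()}


# Records taken from the example above, rebuilt with tab separators.
PARENT = "Parent=transcript:ENST00000456328;"
sample_gff = [
    "\t".join(["1", "ensembl_havana", "processed_transcript", "11869", "14409",
               ".", "+", ".", "ID=transcript:ENST00000456328;"]),
    "\t".join(["1", "havana", "exon", "11869", "12227", ".", "+", ".", PARENT]),
    "\t".join(["1", "havana", "exon", "12613", "12721", ".", "+", ".", PARENT]),
    "\t".join(["1", "havana", "exon", "13221", "14409", ".", "+", ".", PARENT]),
]

print(group_exons_by_parent(sample_gff))
```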
The DRAGEN host software parses the file for exons within the transcripts and produces splice junctions. The following output displays the number of splice junctions detected.
==================================================================
Generating annotated splice junctions
==================================================================
Input annotations file: ./gencode.v19.annotation.gtf
Splice junctions database file: output/rna.sjdb.annotations.out.tab
Number of genes: 27459
Number of transcripts: 196520
Number of exons: 1196293
Number of splice junctions: 343856
The splice junctions detected from the annotation file are also written to *.sjdb.annotations.out.tab. Splice junctions shorter than a minimum length are excluded, which filters out annotation artifacts and lowers the false detection rate caused by falsely annotated junctions. The minimum length is controlled by the --rna-ann-sj-min-len option, which has a default value of 6.
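As a rough illustration of how junctions follow from annotated exons, the Python sketch below derives one junction from each pair of consecutive exons in a transcript and drops those shorter than a minimum length. Several points are assumptions rather than documented DRAGEN behavior: a junction is taken to be the intron span between two consecutive exons, its length is counted in bases, and a junction whose length equals the minimum is kept. The function name and the input layout are hypothetical, and strand is ignored.

```python
def annotated_splice_junctions(transcripts, min_len=6):
    """Derive unique splice junctions from exon spans.

    transcripts maps a transcript id to (chromosome, [(start, end), ...])
    with 1-based inclusive exon coordinates. Returns a sorted list of
    (chromosome, junction_start, junction_end) tuples.
    """
    junctions = set()
    for chromosome, spans in transcripts.values():
        ordered = sorted(spans)
        # Pair each exon with the next one along the transcript.
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
            start, end = prev_end + 1, next_start - 1
            length = end - start + 1
            if length >= min_len:  # assumed inclusive threshold
                junctions.add((chromosome, start, end))
    return sorted(junctions)


# Exon coordinates from the GTF example above, plus one made-up
# transcript ("tiny_gap") whose 3-base gap falls below the default minimum.
transcripts = {
    "ENST00000456328.2": ("chr1", [(11869, 12227), (12613, 12721), (13221, 14409)]),
    "ENST00000515242.2": ("chr1", [(11872, 12227), (12613, 12721), (13225, 14412)]),
    "tiny_gap": ("chr1", [(100, 200), (204, 300)]),
}

for junction in annotated_splice_junctions(transcripts):
    print(junction)
```

The two example transcripts share the junction between their first and second exons, so it appears once in the result. This is consistent with a junction count that is much lower than the exon count, as in the output shown above.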