DRAGEN Germline Small Variant Caller

The DRAGEN Germline Small Variant Caller takes mapped and aligned DNA reads as input and calls SNPs and indels through a combination of column-wise detection and local de novo assembly of haplotypes.

Reference regions with sufficient alignment coverage are first identified as callable. Within these callable regions, a fast scan of the sorted reads identifies active regions, each centered on pileup columns that show evidence of a variant. Each active region is padded with enough context to cover significant nonreference content nearby, and padded further where there is evidence of indels.
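The active-region scan can be pictured with a minimal sketch. This is an illustration only, not the DRAGEN implementation: the function name find_active_regions, the per-column inputs, the 0.2 threshold, and the padding sizes are all assumptions chosen for the example.

```python
from typing import List, Tuple


def find_active_regions(
    nonref_fraction: List[float],
    has_indel: List[bool],
    threshold: float = 0.2,   # assumed cutoff for "evidence of a variant"
    pad: int = 2,             # assumed padding around a SNP-like column
    indel_pad: int = 4,       # assumed larger padding around indel evidence
) -> List[Tuple[int, int]]:
    """Return merged half-open [start, end) intervals around variant-like columns."""
    n = len(nonref_fraction)
    intervals = []
    for pos, frac in enumerate(nonref_fraction):
        if frac >= threshold or has_indel[pos]:
            p = indel_pad if has_indel[pos] else pad
            intervals.append((max(0, pos - p), min(n, pos + p + 1)))

    # Padding sizes differ, so sort before merging overlapping intervals.
    intervals.sort()
    merged: List[Tuple[int, int]] = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    fractions = [0.0] * 20
    fractions[5] = 0.5            # SNP-like column
    indels = [False] * 20
    indels[15] = True             # column with indel evidence
    print(find_active_regions(fractions, indels))
```

In this toy input, the column with indel evidence receives the wider padding, and intervals that touch or overlap are merged into a single active region.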

Aligned reads are clipped within each active region and assembled into a De Bruijn graph. Edges in the graph are weighted by how often they are observed in the clipped reads, with the reference sequence as a backbone. After some graph cleanup and simplification, all source-to-sink paths are extracted as candidate haplotypes. Each haplotype is Smith-Waterman aligned to the reference genome to identify the variants it represents. This set of events may be augmented by position-based detection. For each read-haplotype pair, the probability P(r|H) of observing the read, assuming the haplotype is the true starting sample, is estimated using a pair hidden Markov model (HMM).
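The assembly step can be sketched on a toy scale as follows. This is a simplified illustration, not the DRAGEN algorithm: the function names, the k-mer size, and the min_weight pruning rule (standing in for "graph cleanup") are assumptions, and the sketch assumes every k-mer occurs only once in the reference so that the backbone has no cycles.

```python
from typing import Dict, List, Set, Tuple


def build_graph(reference: str, reads: List[str], k: int):
    """Build a k-mer graph; edge weights count observations in reads."""
    graph: Dict[str, Dict[str, int]] = {}
    ref_edges: Set[Tuple[str, str]] = set()

    def add_edges(seq: str, weight: int, is_ref: bool) -> None:
        for i in range(len(seq) - k):
            src, dst = seq[i:i + k], seq[i + 1:i + k + 1]
            edges = graph.setdefault(src, {})
            edges[dst] = edges.get(dst, 0) + weight
            if is_ref:
                ref_edges.add((src, dst))

    add_edges(reference, 0, True)      # backbone edges, zero read weight
    for read in reads:
        add_edges(read, 1, False)      # each read observation adds 1
    return graph, ref_edges


def enumerate_haplotypes(reference: str, reads: List[str],
                         k: int = 3, min_weight: int = 2) -> List[str]:
    """List source-to-sink paths after dropping weak non-reference edges."""
    graph, ref_edges = build_graph(reference, reads, k)
    source, sink = reference[:k], reference[-k:]
    haplotypes: List[str] = []

    def walk(node: str, seq: str, visited: frozenset) -> None:
        if node == sink:
            haplotypes.append(seq)
            return
        for nxt, weight in graph.get(node, {}).items():
            keep = weight >= min_weight or (node, nxt) in ref_edges
            if keep and nxt not in visited:   # never revisit a k-mer
                walk(nxt, seq + nxt[-1], visited | {nxt})

    walk(source, source, frozenset({source}))
    return sorted(haplotypes)


if __name__ == "__main__":
    ref = "ACGTTGCA"
    reads = ["ACGTTGCA"] * 2 + ["ACGTAGCA"] * 2 + ["ACGTCGCA"]
    print(enumerate_haplotypes(ref, reads))
```

Here the path seen in only one read is pruned as a likely error, while the reference path and the well-supported alternate path survive as candidate haplotypes. Scoring each read against each haplotype with a pair HMM is outside the scope of this sketch.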

Scanning by reference position over the active region, candidate genotypes are formed from diploid combinations of variant events (SNPs or indels). For each event (including reference), the conditional probability P(r|e) of observing each overlapping read is estimated as the maximum P(r|H) over the haplotypes supporting the event. These are combined into the conditional probability P(r|e1e2) for a genotype (event pair), and the per-read values are multiplied to yield the conditional probability P(R|e1e2) of observing the whole read pileup. Using Bayes’ formula, the posterior probability P(e1e2|R) of each diploid genotype is calculated, and the genotype with the highest posterior probability is called.
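The genotyping arithmetic can be illustrated with a short sketch. The text does not specify how P(r|e1) and P(r|e2) are combined, so the equal-weight mixture 0.5*P(r|e1) + 0.5*P(r|e2) used here is an assumption (a common diploid model), as are the function name, the flat default prior, and the example likelihood values. All likelihoods are assumed to be strictly positive.

```python
import math
from itertools import combinations_with_replacement
from typing import Dict, List, Optional, Tuple

Genotype = Tuple[str, str]


def genotype_posteriors(
    read_likelihoods: Dict[str, List[float]],
    priors: Optional[Dict[Genotype, float]] = None,
) -> Dict[Genotype, float]:
    """Return P(e1e2|R) for every unordered pair of events."""
    events = sorted(read_likelihoods)
    genotypes = list(combinations_with_replacement(events, 2))
    n_reads = len(read_likelihoods[events[0]])

    log_joint: Dict[Genotype, float] = {}
    for e1, e2 in genotypes:
        # Assumed mixture: each read comes from either allele with equal chance.
        log_lik = sum(
            math.log(0.5 * read_likelihoods[e1][i] + 0.5 * read_likelihoods[e2][i])
            for i in range(n_reads)
        )
        # Flat prior unless the caller supplies one (assumption).
        prior = priors[(e1, e2)] if priors else 1.0 / len(genotypes)
        log_joint[(e1, e2)] = log_lik + math.log(prior)

    # Bayes' formula: normalize the joint probabilities, using log-sum-exp.
    peak = max(log_joint.values())
    total = sum(math.exp(v - peak) for v in log_joint.values())
    return {g: math.exp(v - peak) / total for g, v in log_joint.items()}


if __name__ == "__main__":
    # Three reads favor the reference event, three favor the alternate event.
    likelihoods = {
        "ref": [0.99, 0.99, 0.99, 0.01, 0.01, 0.01],
        "alt": [0.01, 0.01, 0.01, 0.99, 0.99, 0.99],
    }
    posteriors = genotype_posteriors(likelihoods)
    print(max(posteriors, key=posteriors.get))
```

With reads split evenly between the two events, the heterozygous genotype receives nearly all of the posterior probability. Working in log space avoids underflow when many per-read probabilities are multiplied.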

In the gVCF mode used for scalable multisample variant calling, the DRAGEN Germline Small Variant Caller can be run per sample to generate an intermediate genomic VCF (gVCF). The gVCF can then be used for efficient joint genotyping of multiple samples, which allows for rapid incremental processing of samples and scaling to large cohort sizes.

Because the DRAGEN Germline Small Variant Caller can efficiently distinguish correlated errors from true variants, its filtering rules are very simple.