The Variant Caller Algorithm
The DRAGEN Haplotype Caller performs the following steps:
• Active Region Identification: Identifies areas where multiple reads disagree with the reference and selects windows around them (active regions) for processing.
• Localized Haplotype Assembly: For each active region, assembles all overlapping reads into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length-K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. Values of K = 10 and 25 are tried by default. If those values produce an invalid graph, additional values of K = 35, 45, 55, and 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, every possible path is extracted to produce a complete list of candidate haplotypes, that is, hypotheses for what the true DNA sequence may be on at least one strand.
• Haplotype Alignment: Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome to determine what variations from the reference it implies.
• Read Likelihood Calculation: Tests each read against each haplotype to estimate the probability of observing the read, assuming the haplotype was the true original DNA sampled. This calculation evaluates a pair hidden Markov model (HMM), which accounts for the various ways PCR or sequencing errors might have modified the haplotype into the observed read. The HMM evaluation uses dynamic programming to calculate the total probability over all series of Markov state transitions that arrive at the observed read.
• Genotyping: Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. These calculations use the constituent probabilities of observing each read given each haplotype, as produced by the pair HMM evaluation. The results feed into Bayes' formula to calculate the likelihood that each genotype is the truth, given the entire read pileup observed. The genotypes with maximum likelihood are reported.
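The localized haplotype assembly step can be illustrated with a minimal Python sketch. It is not the DRAGEN implementation. The function names (`build_dbg`, `has_cycle`, `enumerate_haplotypes`, `assemble`) are hypothetical, the reference is threaded through the graph alongside the reads as an assumption, and the K values are simply tried in ascending order until a cycle-free graph appears, which simplifies the default behavior described above.

```python
from collections import defaultdict


def build_dbg(sequences, k):
    """Map each k-mer to the set of k-mers that directly follow it in some sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            graph[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return graph


def has_cycle(graph):
    """Depth-first search; reaching a node that is still on the current path means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = GREY
        for nxt in graph.get(node, ()):
            s = state.get(nxt, WHITE)
            if s == GREY or (s == WHITE and visit(nxt)):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


def enumerate_haplotypes(graph, source, sink):
    """Spell out every path from the source k-mer to the sink k-mer (graph must be acyclic)."""
    paths = []
    stack = [(source, source)]
    while stack:
        node, seq = stack.pop()
        if node == sink:
            paths.append(seq)
            continue
        for nxt in graph.get(node, ()):
            stack.append((nxt, seq + nxt[-1]))
    return paths


def assemble(reference, reads, k_values=(10, 25, 35, 45, 55, 65)):
    """Try each K in turn (assumed order) and return the first cycle-free result."""
    for k in k_values:
        if len(reference) <= k:
            continue
        graph = build_dbg([reference] + list(reads), k)
        if not has_cycle(graph):
            haps = enumerate_haplotypes(graph, reference[:k], reference[-k:])
            return k, sorted(haps)
    return None, []
```

With a read that carries one SNP, the graph forms a single bubble and `assemble` returns two candidate haplotypes: the reference path and the alternate path. A repetitive reference such as `ACACACGT` produces a cycle at K = 2 and only becomes linear at a larger K.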
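The haplotype alignment step relies on Smith-Waterman local alignment, which can be sketched as follows. This is a textbook version with a linear gap penalty. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name `smith_waterman` are illustrative assumptions, not the parameters DRAGEN uses.

```python
def smith_waterman(hap, ref, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned haplotype, aligned reference) for a local alignment."""
    rows, cols = len(hap) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; scores are floored at zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if hap[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # match or mismatch
                score[i - 1][j] + gap,      # base present only in the haplotype
                score[i][j - 1] + gap,      # base present only in the reference
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    aligned_hap, aligned_ref = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if hap[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_hap.append(hap[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_hap.append(hap[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_hap.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_hap)), "".join(reversed(aligned_ref))
```

Columns where the two aligned strings differ correspond to the variations the haplotype implies: a differing base pair is a SNP, and a `-` in either string marks an insertion or deletion.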
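The read likelihood calculation can be sketched as a pair HMM forward pass. The sketch below is a simplified model under stated assumptions: a single constant base error rate instead of per-base qualities, constant gap-open and gap-extension probabilities, a uniform start position along the haplotype, and a uniform 0.25 emission for inserted bases. The function name `read_likelihood` and all parameter defaults are hypothetical.

```python
def read_likelihood(read, hap, base_error=0.01, gap_open=0.001, gap_ext=0.1):
    """Forward algorithm over match (M), insertion (I), and deletion (D) states.

    Returns an estimate of P(read | haplotype), summed over all alignments.
    """
    n, m = len(read), len(hap)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    I = [[0.0] * (m + 1) for _ in range(n + 1)]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Assumption: the read may start at any haplotype position with equal weight.
    for j in range(m + 1):
        D[0][j] = 1.0 / m

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Emission: the read base agrees with the haplotype base, or is an error.
            emit = 1.0 - base_error if read[i - 1] == hap[j - 1] else base_error / 3.0
            M[i][j] = emit * (
                (1.0 - 2.0 * gap_open) * M[i - 1][j - 1]
                + (1.0 - gap_ext) * (I[i - 1][j - 1] + D[i - 1][j - 1])
            )
            # Insertion: the read has a base that the haplotype lacks.
            I[i][j] = 0.25 * (gap_open * M[i - 1][j] + gap_ext * I[i - 1][j])
            # Deletion: the haplotype has a base that the read lacks.
            D[i][j] = gap_open * M[i][j - 1] + gap_ext * D[i][j - 1]

    # The read may end anywhere along the haplotype.
    return sum(M[n][j] + I[n][j] for j in range(1, m + 1))
```

A read that matches the haplotype exactly scores far higher than the same read with one mismatched base, because the mismatch must be explained as a sequencing error. Production implementations typically work in log space or with scaling to avoid underflow on long reads.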
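The genotyping step can be sketched as a small Bayesian calculation over the per-read likelihoods. As a simplification, this sketch forms genotypes directly from pairs of haplotypes rather than from per-site variant events, assumes each read is equally likely to come from either chromosome copy, and uses a flat prior unless one is supplied. The function name `genotype_posteriors` is hypothetical.

```python
import math
from itertools import combinations_with_replacement


def genotype_posteriors(read_liks, priors=None):
    """read_liks[r][h] is P(read r | haplotype h); returns {(h1, h2): posterior}."""
    n_hap = len(read_liks[0])
    genotypes = list(combinations_with_replacement(range(n_hap), 2))
    if priors is None:
        # Assumption: every diploid genotype is equally likely a priori.
        priors = [1.0 / len(genotypes)] * len(genotypes)

    log_post = []
    for (a, b), prior in zip(genotypes, priors):
        # Each read is drawn from one of the two haplotypes with probability 1/2.
        log_lik = sum(math.log(0.5 * r[a] + 0.5 * r[b]) for r in read_liks)
        log_post.append(log_lik + math.log(prior))

    # Normalize in log space for numerical stability (Bayes' formula denominator).
    top = max(log_post)
    weights = [math.exp(x - top) for x in log_post]
    total = sum(weights)
    return {g: w / total for g, w in zip(genotypes, weights)}
```

When half of the reads fit haplotype 0 and half fit haplotype 1, the heterozygous genotype `(0, 1)` receives the highest posterior. When all reads fit haplotype 0, the homozygous genotype `(0, 0)` wins.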