Somatic Caller

Strelka uses a Bayesian probability model similar to the one used for germline variant calling in the Starling Small Variant Caller or in external tools such as GATK. Using this model, the objective is to compute the posterior probability P(θ|D), which is the probability of the model state θ conditioned on the observed sequencing data D.

In a germline variant caller, the state space of the model is conventionally a discrete set of diploid genotypes. For SNVs, the set of possible states is G={AA, CC, GG, TT, AC, AG, AT, CG, CT, GT}.
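
As an illustration of the germline state space described above, the ten diploid SNV genotypes are the unordered pairs of the four bases. The following is a minimal sketch; the function name diploid_genotypes is hypothetical and is not part of Strelka or Starling.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"

def diploid_genotypes(bases=BASES):
    """Return every unordered pair of bases (homozygous and heterozygous)."""
    return ["".join(pair) for pair in combinations_with_replacement(bases, 2)]

genotypes = diploid_genotypes()
print(genotypes)
print(len(genotypes))  # 4 homozygous + 6 heterozygous states
```

The enumeration order differs from the listing above (homozygous states are not grouped first), but the set of states is the same.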

The Strelka model instead approximates continuous allele frequencies for each allele:

f={f_A, f_C, f_G, f_T}

The allele frequencies are restricted so that at most two of them are nonzero. Any additional alleles observed in the data are treated as noise.
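
The restriction on allele frequencies can be expressed as a simple validity check on a candidate state. This is a sketch only: the function name is_allowed_state, the tolerance parameter, and the requirement that the frequencies sum to 1 are assumptions made for illustration, not details taken from the Strelka implementation.

```python
import math

BASES = ("A", "C", "G", "T")

def is_allowed_state(f, tol=1e-9):
    """Check that f (a dict of base -> frequency) has at most two nonzero entries.

    Assumption: frequencies are non-negative and sum to 1.
    Bases missing from the dict are taken to have frequency 0.
    """
    values = [f.get(base, 0.0) for base in BASES]
    if any(v < 0.0 for v in values):
        return False
    if not math.isclose(sum(values), 1.0, abs_tol=tol):
        return False
    nonzero = sum(1 for v in values if v > tol)
    return nonzero <= 2

print(is_allowed_state({"A": 0.7, "G": 0.3}))            # two alleles
print(is_allowed_state({"A": 1.0}))                      # one allele
print(is_allowed_state({"A": 0.6, "C": 0.3, "G": 0.1}))  # three alleles
```

Under this check, the first two states are allowed and the third is rejected, because a third allele would have to be explained as noise rather than as part of the model state.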

Another departure from typical germline calling methods is that the model state comprises the allele frequencies of both the tumor and the normal sample. In the following equation, f_t and f_n represent the allele frequencies of the tumor and normal samples, respectively.

θ=(f_t, f_n)

The final somatic variant quality value reported by the model is computed from the probability that the allele frequencies are unequal (i.e., f_t≠f_n) given the observed sequencing data.
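
To make the idea of P(f_t≠f_n│D) concrete, the following toy sketch evaluates a joint posterior over a discretized grid of tumor and normal frequencies for a single alternate allele. Everything specific here is an assumption for illustration and does not describe the actual Strelka likelihood or prior: the binomial read model, the grid size, the prior_equal parameter (prior mass placed on f_t = f_n), and the Phred-style conversion of the result into a quality value.

```python
import math

def binom_loglik(alt, depth, freq):
    """Log-likelihood of `alt` alternate reads out of `depth` at frequency `freq`.

    The binomial coefficient is omitted because it cancels on normalization.
    """
    if freq <= 0.0:
        return 0.0 if alt == 0 else -math.inf
    if freq >= 1.0:
        return 0.0 if alt == depth else -math.inf
    return alt * math.log(freq) + (depth - alt) * math.log(1.0 - freq)

def prob_unequal(alt_t, depth_t, alt_n, depth_n, grid_size=21, prior_equal=0.999):
    """Toy posterior probability that f_t != f_n on a discretized grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    n_equal = grid_size
    n_unequal = grid_size * grid_size - grid_size
    mass_equal = 0.0
    mass_unequal = 0.0
    for f_t in grid:
        ll_t = binom_loglik(alt_t, depth_t, f_t)
        for f_n in grid:
            lik = math.exp(ll_t + binom_loglik(alt_n, depth_n, f_n))
            if f_t == f_n:
                mass_equal += lik * prior_equal / n_equal
            else:
                mass_unequal += lik * (1.0 - prior_equal) / n_unequal
    return mass_unequal / (mass_equal + mass_unequal)

def phred(p_error):
    """Assumed Phred-style scaling: Q = -10 * log10(probability of error)."""
    return math.inf if p_error <= 0.0 else -10.0 * math.log10(p_error)

# Alternate allele seen in the tumor only, versus in neither sample.
p_somatic = prob_unequal(alt_t=15, depth_t=50, alt_n=0, depth_n=50)
p_shared = prob_unequal(alt_t=0, depth_t=50, alt_n=0, depth_n=50)
print(p_somatic, phred(1.0 - p_somatic))
print(p_shared)
```

In this toy setting, an alternate allele supported by tumor reads but absent from the normal sample yields a high probability of unequal frequencies, while identical evidence in both samples leaves almost all posterior mass on f_t = f_n.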