Normalizing Small Tandem Duplications

Tandem duplications may also be represented as insertions. Having two possible representations for the same variant makes the VCF output ambiguous, particularly for small tandem duplications, and can lead to complications such as duplicate calls that go unrecognized.

To normalize the SV caller output so that the same variant type is not represented in two different ways, small tandem duplications (< 1000 bases) are converted to insertions in the VCF output. Insertions converted from such tandem duplications are formatted like incomplete insertions, using the symbolic allele <INS> in the ALT field. An example of such an insertion is:

chr2 2520057 MantaDUP:TANDEM:53645:0:1:0:0:0 T <INS> 813 PASS END=2520057;SVTYPE=INS;SVLEN=52;DUPSVLEN=52 GT:FT:GQ:PL:PR:SR 0/1:PASS:393:863,0,390:25,0:19,25
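As a rough illustration of how downstream code might recognize such a record, the following Python sketch splits a VCF data line into columns and flags it as a converted insertion when ALT is <INS>, SVTYPE is INS, and INFO contains a DUPSVLEN key. The function name is_converted_insertion is hypothetical, and treating DUPSVLEN as the marker is an assumption drawn from the example above rather than a documented rule.

```python
def is_converted_insertion(vcf_line: str) -> bool:
    """Heuristic check for an insertion converted from a tandem duplication.

    Assumption (not a documented rule): a converted record has ALT == <INS>,
    SVTYPE=INS, and a DUPSVLEN key in INFO, as in the example record.
    """
    # VCF columns are tab-delimited; split() with no argument also copes with
    # the space-separated rendering used in this document.
    cols = vcf_line.split()
    if len(cols) < 8:
        return False  # not a complete VCF data line
    alt, info = cols[4], cols[7]
    entries = info.split(";")
    info_keys = {entry.split("=", 1)[0] for entry in entries}
    return alt == "<INS>" and "SVTYPE=INS" in entries and "DUPSVLEN" in info_keys


record = (
    "chr2 2520057 MantaDUP:TANDEM:53645:0:1:0:0:0 T <INS> 813 PASS "
    "END=2520057;SVTYPE=INS;SVLEN=52;DUPSVLEN=52 "
    "GT:FT:GQ:PL:PR:SR 0/1:PASS:393:863,0,390:25,0:19,25"
)
print(is_converted_insertion(record))
```

Checking the INFO keys rather than the record ID keeps the sketch independent of how variant IDs happen to be formatted.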

Converted insertions carry copies of certain output fields as they would have appeared in the tandem duplication record. For example, INFO/DUPSVINSSEQ holds a copy of the breakpoint insertion sequence computed for the duplication, which a duplication record would normally report in INFO/SVINSSEQ. An example of a converted insertion with such a value is:

chr2 2645730 MantaDUP:TANDEM:53649:0:1:0:0:0 C <INS> 367 PASS END=2645730;SVTYPE=INS;SVLEN=97;DUPSVLEN=86;DUPSVINSLEN=11;DUPSVINSSEQ=CTCACCTTCAT GT:FT:GQ:PL:PR:SR 0/1:PASS:367:417,0,386:19,0:20,15
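To make the copied fields easier to inspect, here is a minimal Python sketch that parses the INFO column of the record above and collects the DUP-prefixed keys. The helper name parse_info is hypothetical. The two consistency checks (DUPSVINSLEN matching the length of DUPSVINSSEQ, and SVLEN equalling DUPSVLEN plus DUPSVINSLEN) reflect what can be observed in this one example and are assumptions, not stated guarantees of the caller.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


info = parse_info(
    "END=2645730;SVTYPE=INS;SVLEN=97;DUPSVLEN=86;"
    "DUPSVINSLEN=11;DUPSVINSSEQ=CTCACCTTCAT"
)

# Fields copied from the original tandem duplication record.
dup_fields = {k: v for k, v in info.items() if k.startswith("DUP")}
print(dup_fields)

# Observed in this example only (assumption, not a documented invariant):
# the inserted-sequence length matches DUPSVINSLEN, and SVLEN is the
# duplication length plus the breakpoint insertion length (86 + 11 = 97).
assert int(info["DUPSVINSLEN"]) == len(info["DUPSVINSSEQ"])
assert int(info["SVLEN"]) == int(info["DUPSVLEN"]) + int(info["DUPSVINSLEN"])
```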

For more information about copied INFO fields, see VCF INFO Fields.