Read Collapsing

The Read Collapsing analysis step groups reads that have very similar genomic locations and unique molecular identifier (UMI) tags into sets (known as families) and collapses each family into a representative sequence. This process removes duplicate reads accurately without losing the signal of very low frequency sequence variations.

Read Collapsing processes aligned reads from libraries that contain UMIs. These UMIs, along with the positional information in the alignment, are used to group duplicate reads and collapse them into a single consensus aligned read. The resulting reads have higher per-base quality and lower noise from various sources. Read Collapsing also provides various metrics that can be useful for tuning assay development and for quality control when UMIs are involved.
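The grouping and collapsing idea can be illustrated with a minimal Python sketch. It is not the algorithm used by the Read Collapsing step: it assumes that a family is defined by an exact match on chromosome, start position, and UMI (the actual step tolerates very similar locations and UMIs), that all reads in a family have the same length, and that the consensus base is chosen by a quality-weighted vote. The `Read` type and both function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified view of one aligned read."""
    chrom: str
    start: int     # leftmost aligned position
    umi: str
    seq: str       # assumed to have equal length within a family
    quals: tuple   # per-base Phred quality scores


def group_families(reads):
    """Group reads into families by exact (chrom, start, UMI) match.

    Assumption: exact matching only. The real step also groups reads
    whose locations and UMIs are very similar but not identical.
    """
    families = defaultdict(list)
    for read in reads:
        families[(read.chrom, read.start, read.umi)].append(read)
    return families


def collapse(family):
    """Return a consensus sequence using a quality-weighted vote per base."""
    consensus = []
    for i in range(len(family[0].seq)):
        votes = Counter()
        for read in family:
            votes[read.seq[i]] += read.quals[i]
        consensus.append(votes.most_common(1)[0][0])
    return "".join(consensus)
```

In this sketch, a family of three reads in which two high-quality reads agree and one low-quality read differs at a single position collapses to the sequence of the two agreeing reads, which is how collapsing suppresses isolated sequencing errors.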

The Read Collapsing step outputs only reads from successfully collapsed families, each represented by a single consensus aligned read. The following types of reads are currently discarded from the read collapsed BAM.

• Reads outside of manifest regions
• Split reads
• Large fragment reads
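The three discard categories above could be checked with a predicate like the following Python sketch. The exact criteria used by the Read Collapsing step are not documented here, so everything in it is an assumption: split reads are detected through the SAM supplementary flag (0x800) or the presence of an SA tag, a large fragment is a pair mapped to different chromosomes or with a template length above a hypothetical `max_fragment` threshold, and "outside of manifest regions" means no overlap with any region. The `Alignment` type and `discard_reason` function are hypothetical names.

```python
from dataclasses import dataclass

SUPPLEMENTARY = 0x800  # SAM flag bit for a supplementary alignment


@dataclass
class Alignment:
    """Hypothetical, simplified view of one aligned read."""
    chrom: str
    pos: int               # 0-based start
    end: int               # 0-based exclusive end
    flag: int              # SAM bitwise flag
    mate_chrom: str
    template_length: int   # SAM TLEN field
    has_sa_tag: bool = False


def discard_reason(aln, regions, max_fragment=1000):
    """Return why a read would be discarded, or None if it is kept.

    regions: iterable of (chrom, start, end) manifest intervals.
    max_fragment: hypothetical threshold; the real value is not documented here.
    """
    if aln.flag & SUPPLEMENTARY or aln.has_sa_tag:
        return "split read"
    if aln.mate_chrom != aln.chrom or abs(aln.template_length) > max_fragment:
        return "large fragment"
    overlaps = any(
        chrom == aln.chrom and aln.pos < end and start < aln.end
        for chrom, start, end in regions
    )
    if not overlaps:
        return "outside manifest"
    return None
```

A predicate of this kind can also be run against the raw BAM to estimate how many reads a given manifest would exclude before collapsing.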

The input files to the Read Collapsing step are an aligned BAM and a BAM index file. The output files of the Read Collapsing step are as follows.

• Read collapsed BAM
• BAM index file
• Read collapsing metrics JSON file

Note that split reads and large fragment reads (including read pairs mapped to different chromosomes) are discarded during the read collapsing step, even though they may be true signals for structural variants (fusions, inversions, etc.). Therefore, for fusion detection, using the raw BWA BAM (and a duplicate-marking program) is recommended instead of the collapsed BAM. For more information on duplex sequencing, refer to the Indexed Sequencing Overview Guide on the Illumina Support website.

The Read Collapsing step adds the following BAM tags.

• XU: UMI
• XV: Number of reads in the family
• XW: Number of reads in the duplex family, or 0 if not a duplex family
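As an illustration of how these tags might be consumed downstream, the following Python sketch extracts them from one SAM-formatted text line (for example, a line of `samtools view` output for the read collapsed BAM). It relies only on the standard SAM layout, in which optional fields follow the 11 mandatory columns as `TAG:TYPE:VALUE`. The function name is hypothetical, and the tag types (string for XU, integer for XV and XW) are assumptions.

```python
def read_collapsing_tags(sam_line):
    """Extract the XU, XV, and XW tags from one SAM text line.

    Optional SAM fields start at column 12 and have the form TAG:TYPE:VALUE.
    Values with type "i" are converted to int; all others stay as strings.
    """
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:
        tag, typ, value = field.split(":", 2)
        if tag in ("XU", "XV", "XW"):
            tags[tag] = int(value) if typ == "i" else value
    return tags


# Made-up record for illustration only; real tag values may be formatted differently.
example = "\t".join([
    "read1", "99", "chr1", "101", "60", "5M", "=", "201", "105",
    "ACGTA", "IIIII", "XU:Z:ACGTAC", "XV:i:4", "XW:i:0",
])
tags = read_collapsing_tags(example)
is_duplex = tags["XW"] > 0  # XW is 0 when the family is not a duplex family
```

Filtering on XV (family size) or XW (duplex support) is one way to restrict downstream analysis to consensus reads backed by a minimum number of raw reads.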
