CYP2D6 Caller
The CYP2D6 caller is capable of genotyping the CYP2D6 gene from whole-genome sequencing (WGS) data and is derived from the method implemented in Cyrius¹. Due to high sequence similarity with its pseudogene paralog CYP2D7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.
The CYP2D6 Caller performs the following steps:
1. | Determines total CYP2D6 and CYP2D7 copy number from read depth. |
2. | Determines CYP2D6-derived copy number at CYP2D6/CYP2D7 differentiating sites. |
3. | Detects SV breakpoints by calculating the changes in CYP2D6-derived copy number along the CYP2D6 gene. |
4. | Calls small variants in CYP2D6 copies. |
5. | Identifies star alleles from the detected SV breakpoints and small variants. |
6. | Identifies the most likely genotype for the called star alleles. |
The CYP2D6 Caller requires whole-genome sequencing (WGS) data aligned to a human reference genome with at least 30x coverage.

The first step of CYP2D6 calling is to determine the combined copy number of CYP2D6 and CYP2D7. Reads aligned to regions in either CYP2D6 or CYP2D7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. The combined CYP2D6 and CYP2D7 copy number is then calculated from the average sequencing depth across the CYP2D6 and CYP2D7 regions.

The CYP2D6-derived copy number is calculated at 117 predefined differentiating sites across the CYP2D6 gene. The differentiating sites are selected at positions with sequence differences in CYP2D6 and CYP2D7 where calling the CYP2D6-derived copy number was most likely to be correct based on sequencing data from the 1000 Genomes Project. The figure below shows the frequency of correctly calling the CYP2D6-derived copy number as a function of the position having a sequence difference between CYP2D6 and CYP2D7. The sequence differences showing calling accuracy > 98% are used as differentiating sites.
For each differentiating site, CYP2D6-specific and CYP2D7-specific alleles are counted in reads mapping to either CYP2D6 or the homologous region in CYP2D7. The CYP2D6-derived copy number is then calculated from this two gene-specific allele counts using the total CYP2D6 and CYP2D7 copy number calculated from the previous step.

The CYP2D6-derived copy number along the CYP2D6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:
Fusion Breakpoint |
Hybrid Gene Structure |
Associated Star Alleles |
---|---|---|
exon 9 |
2D6-2D7 |
4.013, 36, 57, 83 |
exon 9 |
2D7-2D6 |
13 |
intron 4 |
2D7-2D6 |
13 |
intron 1 |
2D7-2D6 |
13 |
intron 1 |
2D6-2D7 |
68 |
In addition to the exon 9 fusion breakpoints, exon 9 can participate in CYP2D7 gene conversion resulting in an embedded CYP2D7 sequence instead of a true hybrid. The exon 9 gene conversions are also detected by the structural variant caller. Because only changes in CYP2D6-derived copy number yield structural variant calls, there might be rare cases where two hybrid copies result in no structural variant calls. For example, when both *36 and *13 with fusion breakpoint in exon 9 are present. However, the structural variant caller is capable of detecting multiple copies of the same fusion type (eg, *36x2) or cases where both an exon 9 gene conversion copy and an exon 9 2D6-2D7 hybrid are present.

118 small variants that define various star alleles are detected from the read alignments. 96 of these variants are in unique (nonhomologous) regions of CYP2D6 with high mapping quality. Only reads mapping to CYP2D6 are used for calling variants in nonhomologous regions. The other 22 variants occur in homologous regions of CYP2D6 where reads mapping to either CYP2D6 or CYP2D7 are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant). A strand bias filter is applied for certain noisy variants that tend to have false positive calls.
In some cases, when the allele counts are noisy or the total CYP2D6 copy number is greater than five, the most likely variant copy number can be wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.

The called SVs and small variant genotypes are matched against the definitions of 124 different star alleles. In some cases, different sets of star alleles might match the called variant genotypes. The matching sets are emitted. When the small variant caller indicates alternate variant copy numbers with significant likelihood, then these alternate sets of variant genotypes are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2D6-derived gene copies determined from previous steps. When there are fewer than two CYP2D6-derived gene copies, then one or more *5 deletion haplotypes are included in the output set of star alleles. If all variant genotypes cannot be matched to a set of star alleles, the CYP2D6 caller returns a no call during the genotyping step with filter value No_call.

Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The deletion haplotype\(*5) is considered as a possible haplotype during this process. The likelihood of any given genotype is determined from a table of population frequencies determined from the 1000 Genomes Project and the most likely genotype is selected. When two or more possible genotypes are identified with similar likelihoods, then all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype.

The CYP2D6 Caller generates a <output-file-prefix>.cyp2d6.tsv file in the output directory. The output file contains the following tab-delimited fields:
• | Sample name. |
• | One or more semicolon-delimited CYP2D6 genotypes or None for no call. |
• | The filter status. The value can include: PASS, No_call, or More_than_one_possible_genotype. |
Each CYP2D6 genotype contains two haplotypes separated by a slash (eg *1/*2). Each haplotype consists of one or more star alleles separated by a plus sign (eg *10+*36). When a haplotype contains more than one copy of the same star allele, that star allele only appears once and is followed by a multiplication sign, and then the number of copies (eg *1x2 for two copies of *1).

The following are example output files:
NA18632 *10+*36x2/*52 PASS
HG01190 *4+*68/*5 PASS
NA17244 *2/*4x2+*13+*83;*2/*4+*4.013+*13+*39 More_than_one_possible_genotype
NA19908 *1/*46;*43/*45 More_than_one_possible_genotype
NA18611 None No_call

To enable the CYP2D6 Caller, use --enable-cyp2d6=true. The CYP2D6 Caller is disabled by default. The CYP2D6 Caller can run directly with the mapper or from prealigned BAM/CRAM input. You can also enable the CYP2D6 Caller in parallel with other components as part of a germline analysis workflow.

The following is an example command line with FASTQ input:
dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/8 \
--fastq-file1 /staging/test/data/NA12878_R1.fastq \
--fastq-file2 /staging/test/data/NA12878_R2.fastq \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--RGID DRAGEN_RGID \
--RGSM NA12878 \
--enable-map-align=true \
--enable-cyp2d6=true

The following is an example command line with prealigned BAM input:
dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/8 \
--bam-input /staging/test/data/NA12878.bam \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-map-align=false \
--enable-cyp2d6=true