HaplotypeCaller reference bias in a nematode hybrid zone
I am analysing genetic variation across a hybrid zone between two strains of a parasitic nematode and find that my results are affected by the reference bias that GATK shows in low coverage regions. In the following I am describing the situation in some detail as it might be instructive for other GATK users as well.
For 12 of 50 samples, either total read count is low or library complexity is low, such that coverage is very uneven despite a respectable mean (up to 20x coverage for a haploid genome; males are haploid and females are diploid in this species). Here is how the striking bias in these samples becomes apparent: a more or less linear relationship exists between the frequency of strain2 alleles at diagnostic loci in worm samples and the frequency of strain2 alleles in the host taxon, which also hybridises. However, the low coverage/lib.complexity worm samples are drastic outliers with a very low strain2 allele frequency. Note that reads were mapped to a reference genome from strain1.
The bias arises at variant positions with low read coverage and/or base quality. HaplotypeCaller reports a REF allele with genotype quality = 0 in cases where all reads support an ALT allele. GenotypeGVCFs then turns these zero quality REF call into NO-CALLS. Inspection of ~200 of these NO-CALL positions in two outlier samples (expected to be almost pure strain2) shows that the suppressed ALT allele is invariably the one that is typical for strain2. In other words, even at this low coverage there is information in the data.
Figure 1 illustrates the bias for these two samples: a haploid male (left) and a diploid female (right). Each data point is a variant position without support for the reference allele in the bam file of the sample. Data were collected from HaplotypeCaller -bamout files or, when information there was missing, from the input bam file. The bias is particularly strong after base quality score recalibration (BQSR, base qualities < 20 in all samples and alleles). In the absence of a 'truth set', recalibration was done using a set of high quality variant calls, the recommended approach but perhaps not suitable for this dataset.
Running HaplotypeCaller on the dedupped rather than the recalibrated bam files for the two samples, followed by another round of GenotypeGVCFs, increases the strain2 allele frequency from 0.08 to 0.75 (haploid male) and from 0.16 to 0.76 (diploid female). Even so, ALT alleles are still missed: for the diploid female, my analysis suggests a remaining bias of 15%. This reduced bias is illustrated in Figure 1 (data from 'dedup' bam files).
Rerunning HaplotypeCaller for the female sample using the dedupped bam file and the -newQual and flat input prior options leaves the strain2 allele frequency unchanged.
From other posts I understand that HaplotypeCaller is optimised for coverages around 30x. But for most organisms/datasets, there will be some 'difficult' samples or low coverage genome regions. To make full use of the available information in these cases, it seems that HaplotypeCaller would have to report in the g.vcf file what is actually observed, with an appropriate measure of support, so that GenotypeGVCFs can arrive at the best possible, unbiased genotype calls. Is there any way I can tweak HaplotypeCaller to do this? My understanding is that reducing the -minPruning and -minDanglingBranchLength parameters will greatly complicate the computations, because then variants supported by a single read will be incorporated into the assembly graph even in high coverage regions. So, I haven't tried this yet.
Additional info: I used GATK version 3.7. Whole-genome sequencing (Illumina 150bp, pe) was carried out on 50 nematode samples. Raw reads were mapped with Bowtie2 (local-sensitive mode). Per sample, bam files were deduplicated and base qualities recalibrated following GATK Best Practice recommendations. Median dedupped coverage is 16.6x. HaplotypeCaller was run with the -mmq 10 option to accommodate the mapping quality scale of Bowtie2 (different from BWA). However, mapping qualities were around 40 (i.e. at the top end of that scale) in essentially all of the test cases reported above. Variants in the raw vcf file (GenotypeGVCFs output) were hard-filtered following GATK recommendations. Only PASS variants were used to investigate the reference bias.