To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at

HaplotypeCaller reference bias in a nematode hybrid zone

beadorobeadoro Edinburgh, UKMember

I am analysing genetic variation across a hybrid zone between two strains of a parasitic nematode and find that my results are affected by the reference bias that GATK shows in low coverage regions. In the following I am describing the situation in some detail as it might be instructive for other GATK users as well.

For 12 of 50 samples, either total read count is low or library complexity is low, such that coverage is very uneven despite a respectable mean (up to 20x coverage for a haploid genome; males are haploid and females are diploid in this species). Here is how the striking bias in these samples becomes apparent: a more or less linear relationship exists between the frequency of strain2 alleles at diagnostic loci in worm samples and the frequency of strain2 alleles in the host taxon, which also hybridises. However, the low coverage/lib.complexity worm samples are drastic outliers with a very low strain2 allele frequency. Note that reads were mapped to a reference genome from strain1.

The bias arises at variant positions with low read coverage and/or base quality. HaplotypeCaller reports a REF allele with genotype quality = 0 in cases where all reads support an ALT allele. GenotypeGVCFs then turns these zero quality REF call into NO-CALLS. Inspection of ~200 of these NO-CALL positions in two outlier samples (expected to be almost pure strain2) shows that the suppressed ALT allele is invariably the one that is typical for strain2. In other words, even at this low coverage there is information in the data.

Figure 1 illustrates the bias for these two samples: a haploid male (left) and a diploid female (right). Each data point is a variant position without support for the reference allele in the bam file of the sample. Data were collected from HaplotypeCaller -bamout files or, when information there was missing, from the input bam file. The bias is particularly strong after base quality score recalibration (BQSR, base qualities < 20 in all samples and alleles). In the absence of a 'truth set', recalibration was done using a set of high quality variant calls, the recommended approach but perhaps not suitable for this dataset.

Running HaplotypeCaller on the dedupped rather than the recalibrated bam files for the two samples, followed by another round of GenotypeGVCFs, increases the strain2 allele frequency from 0.08 to 0.75 (haploid male) and from 0.16 to 0.76 (diploid female). Even so, ALT alleles are still missed: for the diploid female, my analysis suggests a remaining bias of 15%. This reduced bias is illustrated in Figure 1 (data from 'dedup' bam files).
Rerunning HaplotypeCaller for the female sample using the dedupped bam file and the -newQual and flat input prior options leaves the strain2 allele frequency unchanged.

From other posts I understand that HaplotypeCaller is optimised for coverages around 30x. But for most organisms/datasets, there will be some 'difficult' samples or low coverage genome regions. To make full use of the available information in these cases, it seems that HaplotypeCaller would have to report in the g.vcf file what is actually observed, with an appropriate measure of support, so that GenotypeGVCFs can arrive at the best possible, unbiased genotype calls. Is there any way I can tweak HaplotypeCaller to do this? My understanding is that reducing the -minPruning and -minDanglingBranchLength parameters will greatly complicate the computations, because then variants supported by a single read will be incorporated into the assembly graph even in high coverage regions. So, I haven't tried this yet.

Additional info: I used GATK version 3.7. Whole-genome sequencing (Illumina 150bp, pe) was carried out on 50 nematode samples. Raw reads were mapped with Bowtie2 (local-sensitive mode). Per sample, bam files were deduplicated and base qualities recalibrated following GATK Best Practice recommendations. Median dedupped coverage is 16.6x. HaplotypeCaller was run with the -mmq 10 option to accommodate the mapping quality scale of Bowtie2 (different from BWA). However, mapping qualities were around 40 (i.e. at the top end of that scale) in essentially all of the test cases reported above. Variants in the raw vcf file (GenotypeGVCFs output) were hard-filtered following GATK recommendations. Only PASS variants were used to investigate the reference bias.


  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    I will get back to you soon. Thanks for the detailed post.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @beadoro, thanks one the very thorough writeup. I’m not certain but this problem of GQ0 ref calls feels familiar — I think it’s something we might have fixed in GATK4. Any chance you could run a few of your problem regions through the latest version and see if the issue persists?

    Would also be interesting to see if using BWA makes a difference, but I would understand if you didn’t want to go all the way back to mapping.

  • beadorobeadoro Edinburgh, UKMember

    @Sheila, @Geraldine_VdAuwera

    Dear Sheila and Geraldine,

    Many thanks for commenting on my post. GATK 4.0 has just been installed on our computing grid. So, I'll give that a try and will report back whether the reference bias still occurs.

    Re BWA, I've tried both mappers on a subset of samples with good coverage and library complexity. Of the variants called by both mappers, the GenotypeGVCFs genotype calls agree in 99.5% of the cases (GATK GenotypeConcordance). Further analyses showed that variants called by only one mapper are less reliable than those called by both (I'd be happy to elaborate but it would be similarly long post as the previous one). I chose Bowtie2 because it produces decidedly fewer of this type of variants and most of them are easy to spot as errors, presumably due to overmerging in the assembly of the reference (het genotypes when HaplotypeCaller is run in diploid mode on bam files from haploid males).


Sign In or Register to comment.