HaplotypeCaller reference bias in a nematode hybrid zone

beadorobeadoro Edinburgh, UKMember

I am analysing genetic variation across a hybrid zone between two strains of a parasitic nematode and find that my results are affected by the reference bias that GATK shows in low coverage regions. In the following I am describing the situation in some detail as it might be instructive for other GATK users as well.

For 12 of 50 samples, either total read count is low or library complexity is low, such that coverage is very uneven despite a respectable mean (up to 20x coverage for a haploid genome; males are haploid and females are diploid in this species). Here is how the striking bias in these samples becomes apparent: a more or less linear relationship exists between the frequency of strain2 alleles at diagnostic loci in worm samples and the frequency of strain2 alleles in the host taxon, which also hybridises. However, the low coverage/lib.complexity worm samples are drastic outliers with a very low strain2 allele frequency. Note that reads were mapped to a reference genome from strain1.

The bias arises at variant positions with low read coverage and/or base quality. HaplotypeCaller reports a REF allele with genotype quality = 0 in cases where all reads support an ALT allele. GenotypeGVCFs then turns these zero quality REF call into NO-CALLS. Inspection of ~200 of these NO-CALL positions in two outlier samples (expected to be almost pure strain2) shows that the suppressed ALT allele is invariably the one that is typical for strain2. In other words, even at this low coverage there is information in the data.

Figure 1 illustrates the bias for these two samples: a haploid male (left) and a diploid female (right). Each data point is a variant position without support for the reference allele in the bam file of the sample. Data were collected from HaplotypeCaller -bamout files or, when information there was missing, from the input bam file. The bias is particularly strong after base quality score recalibration (BQSR, base qualities < 20 in all samples and alleles). In the absence of a 'truth set', recalibration was done using a set of high quality variant calls, the recommended approach but perhaps not suitable for this dataset.

Running HaplotypeCaller on the dedupped rather than the recalibrated bam files for the two samples, followed by another round of GenotypeGVCFs, increases the strain2 allele frequency from 0.08 to 0.75 (haploid male) and from 0.16 to 0.76 (diploid female). Even so, ALT alleles are still missed: for the diploid female, my analysis suggests a remaining bias of 15%. This reduced bias is illustrated in Figure 1 (data from 'dedup' bam files).
Rerunning HaplotypeCaller for the female sample using the dedupped bam file and the -newQual and flat input prior options leaves the strain2 allele frequency unchanged.

From other posts I understand that HaplotypeCaller is optimised for coverages around 30x. But for most organisms/datasets, there will be some 'difficult' samples or low coverage genome regions. To make full use of the available information in these cases, it seems that HaplotypeCaller would have to report in the g.vcf file what is actually observed, with an appropriate measure of support, so that GenotypeGVCFs can arrive at the best possible, unbiased genotype calls. Is there any way I can tweak HaplotypeCaller to do this? My understanding is that reducing the -minPruning and -minDanglingBranchLength parameters will greatly complicate the computations, because then variants supported by a single read will be incorporated into the assembly graph even in high coverage regions. So, I haven't tried this yet.

Additional info: I used GATK version 3.7. Whole-genome sequencing (Illumina 150bp, pe) was carried out on 50 nematode samples. Raw reads were mapped with Bowtie2 (local-sensitive mode). Per sample, bam files were deduplicated and base qualities recalibrated following GATK Best Practice recommendations. Median dedupped coverage is 16.6x. HaplotypeCaller was run with the -mmq 10 option to accommodate the mapping quality scale of Bowtie2 (different from BWA). However, mapping qualities were around 40 (i.e. at the top end of that scale) in essentially all of the test cases reported above. Variants in the raw vcf file (GenotypeGVCFs output) were hard-filtered following GATK recommendations. Only PASS variants were used to investigate the reference bias.

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @beadoro
    Hi,

    I will get back to you soon. Thanks for the detailed post.

    -Sheila

  • beadorobeadoro Edinburgh, UKMember

    @Sheila, @Geraldine_VdAuwera

    Dear Sheila and Geraldine,

    Many thanks for commenting on my post. GATK 4.0 has just been installed on our computing grid. So, I'll give that a try and will report back whether the reference bias still occurs.

    Re BWA, I've tried both mappers on a subset of samples with good coverage and library complexity. Of the variants called by both mappers, the GenotypeGVCFs genotype calls agree in 99.5% of the cases (GATK GenotypeConcordance). Further analyses showed that variants called by only one mapper are less reliable than those called by both (I'd be happy to elaborate but it would be similarly long post as the previous one). I chose Bowtie2 because it produces decidedly fewer of this type of variants and most of them are easy to spot as errors, presumably due to overmerging in the assembly of the reference (het genotypes when HaplotypeCaller is run in diploid mode on bam files from haploid males).

    Beate

  • beadorobeadoro Edinburgh, UKMember

    Sorry for the long delay in getting back to this. I had to work on a different project since my last post...

    To check whether the reference bias has been eliminated in GATK 4.0, I have now re-analysed the entire 50-sample dataset with v. 4.0.1.0 HaplotypeCaller and GenotypeGVCFs. There is a dramatic reduction in bias in the two samples that I looked at previously and that are expected to be almost pure strain2 (recall that the reference was derived from strain1). Compared to GATK v.3.7, the frequency of strain2 alleles at diagnostic loci increased from 0.08 to 0.73 (haploid male) and from 0.16 to 0.75 (diploid female). BAM files after base quality score recalibration (BQSR) were used as input.

    To see whether some HaplotypeCaller reference bias remains, I revisited the same set of variant sites without read support for the REF allele in the two samples that I had looked at earlier. In many cases, the ALT allele is now recorded, whereas the v3.7 g.vcf file contained a zero-quality REF allele that was converted to a NO_CALL by GenotypesGVCFs. However, the bias still exists at sites with very low coverage. The following figure illustrates this. The top two panels (version 3.7, identical to ones I posted earlier) are included for ease of comparison. Each data point shows a HaplotypeCaller variant call as a function of the number of reads supporting the ALT allele and the mean base quality at the site. Separate analyses were run using either BQSR or deduplicated BAM files as input. Note that BQSR led to a marked reduction in mean base quality. GQ = genotype quality. In this v. 4.0 analysis, the reference bias is independent of base quality and essentially restricted to sites with fewer than four reads.

    In other words, it will be non-existent for samples with high and even coverage across the genome. However, in this study with limiting amounts of tissue from which to extract DNA and prepare libraries, there were some samples with low read cover overall and some with adequate mean coverage but large variation across the genome. I have therefore recomputed the mean strain2 allele frequency for all samples after manually setting all variant/sample cells with read cover < 4 to NO_CALL. This has brought several additional outlier samples in line with expectation. It has also reduced the data set by 10%. This is a loss of information, because in every case that I looked at of an ALT allele being supported by fewer than 4 reads (resulting in a NO_CALL in the vcf file) the suppressed allele was the one typical for strain2 (rather than a third and possibly erroneous allele).

    I think I will end this rather long story here. I hope some of you will find it useful.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @beadoro
    Hi Beate,

    Thank you for the thorough insight. This should definitely help other users. We are glad to hear the bias is pretty much gone except for in very low coverage sites. I don't think we will be able to do much for the low coverage sites, as less than 4X coverage is really not ideal for variant calling.

    Thanks,
    Sheila

Sign In or Register to comment.