To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at

Inconsistent results with HaplotypeCaller on haploid organism

Hello GATK team,

I would appreciate some help in understanding how GATK works in GVCF mode on my data.
Here is my data example I'm usign GATK v3.8:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 328-16 983-16 NC_018661.1 859953 . C T > 31035.30 SnpCluster AC=15;AF=0.536;AN=28;BaseQRankSum=-1.134e+00;ClippingRankSum=0.00;DP=7961;FS=3.292;MLEAC=15;MLEAF=0.536;MQ=56.52;MQRankSum=0.00;QD=7.76;ReadPosRankSum=-5.480e-01;SOR=0.486 GT:AD:DP:GQ:PL 0:354,0:354:99:0,106 1:157,157:314:99:361,0

  • First thing weird is that the variant is in heterocygosis with the highest GQ (99) when we are analyzing an haploid sample, this is different from the explanation given in this post

  • Second issue appears when we observe this position using IGV in our aligned reads using bwa mem(bam format). Here we see that both samples seem to have this site as AD 50%, but HaplotypeCaller calls it totally different.

This are the parameters we use for one sample:
java -Xmx10g -jar /opt/g atk/gatk-3.8.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECO LI07_SS_S/REFERENCES/GCF_000299475.1_ASM29947v1_genomic_NoPlasmid.fna -I /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM0 62_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/Alignment/BAM/328-16/328-16.woduplicates.bam -o /processing_Data/bioinformatics/services_and_colaboratio ns/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/variants/328-16.g.vcf -stand_call_conf 30 --em itRefConfidence GVCF -ploidy 1 -S LENIENT -log /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/A NALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/snp_indels.vcf-HaplotypeCaller.log

How is this even possible? (I have infinite-checked that bam files are the same used in IGV and passed to GATK, you never know...)

Could be the effect referred in this thread be somehow affecting the variant calling? Should we use BP_Resolution? Which is the main difference between GVCF and BP_RESOLUTION mode?

Our first idea is select by AD our GVCFs using JEXL expressions but as GVCF has reference blocks with no AD the command fails:

ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![35,47]: 'vc.getGenotype('328-16').getAD().1.floatValue() / vc.getGenotype('328-16').getDP() > 0.90;' attempting to call method on null

I could filter them manually before GenotypeGVCFs but, it is a good practice? As I read in this thread this is not recommended, obviously because we override GATK model which takes a lot more of variables into account...
Any ideas? We are kind of struggling, maybe is something trivial but we can't see it, any help will be much appreciate.

Thanks very much in advance,
Best Regards


  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    Hi Sara,

    Unfortunately, that is another "quirk" of the haploid model. I suspect even though the ADs are the same, the variant allele reads have higher base qualities, which causes the model to output a higher likelihood of the site being variant.

    You can read more about this here.


Sign In or Register to comment.