Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

HaplotypeCaller DP reports low values

sam24sam24 UKPosts: 2Member

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,



  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    Hi Sam,

    I believe this is due to an issue with downsampling in HaplotypeCaller, which we are aware of but haven't been able to fix yet. In a nutshell, both callers normally downsample the data to a max coverage of 250 by default, but it seems that downsampling is currently not working for HC. So when you have ~200X coverage, there's no downsampling anywhere so you don't see any differences, but when you have much higher coverage, downsampling kicks in for UG but not for HC, hence the differences you observe.

    We're working to fix the bug; in the meantime you can either hard-downsample your data using PrintReads with -dcov 250 before running HC, or just ignore the differences. The results are correct in both cases. Though the HC runs are bound to take much longer without downsampling...

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    Ack, and now my colleague points out that you are using -dt NONE in both cases so downsampling can't be the issue. It was such a nice, tidy explanation... but ignore what I said earlier, sorry.

    Do you get the same discrepancies if you run single-threaded? If not it might be the multi-threading that is malfunctioning.

    Geraldine Van der Auwera, PhD

  • sam24sam24 UKPosts: 2Member

    Hi Geraldine,

    Thank you for your comments about downsampling and HC. This would have been a lovely explanation! But yes, I am switching off downsampling. I did try (amongst many iterations) setting -dcov 5000 and this did not change the reported DP values.

    It is good to know that the low numbers should not worry me. I can definitely report to you that all the variants in my data set are being reported, just with a lower numbers.

    I have just re-run the same command using a single thread and unfortunately got the same result.


  • KathKath Posts: 36Member


    Has anyone had any further thoughts about this? I seem to be having the same issue. At first I assumed it was that IGV included duplicate reads (as I have marked but not removed duplicates from my bam files) but I have set preferences such that IGV should not display duplicates...



  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    Hi @Kath,

    Unfortunately we haven't had time to look into this in detail, so we're not sure yet what's going on. Since it doesn't seem to have adverse consequences on variant discovery, we're not currently able to treat this as a priority issue. However if you have data that suggests it is a big problem, we would look into it more closely.

    Geraldine Van der Auwera, PhD

  • MaikeMaike FrankfurtPosts: 1Member

    Dear Geraldine,

    are there any news regarding this?
    I think I 'm having this issue too. I have amplicon data and used the HaplotypeCaller for a sample conaining one individual.
    I should get a depth of about DP=370 but I get values between 19 and 199. The data is not downsampled and only one of the reads was filtered out (by the NotPrimaryAlignmentFilter).
    Unfortunately I can't really say how much of a problem this ist because I don't really know why it happens and what it does to the
    results, but I'm worried that most of my data might not be used for the computation, either in a random way or because
    of some parameter that might introduce a bias.
    Can you say anything about that? I would really appreciate your help.

    Thank you,

  • ebanksebanks Posts: 678GATK Developer mod

    You can test whether reads are being used by the tool with the -bamOut option. You will see which reads align to which haplotypes - and which are uninformative to the calculation.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • RyanRyan Posts: 3Member

    Hi Geraldine,

    We have a similar situation as Maike. We are using a targeted capture method and are expecting a coverage of about 1000X in our areas of interest. As above, the issue happens with HaplotypeCaller even though no Downsampling has been selected. We have tried it with the latest release of GATK 3.2-2 and the issue is still present.

    I can't make the case that we are missing calls because of this, but the fact that the read depth reported by HC is erroneous could have some impact on the results we report.

    If people are using GATK to analyze their amplicon or targeted capture panels, more and more and datasets will have this issue.

    Thanks for looking into this!


  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    Hi @Ryan,

    It occurs to me that we haven't yet discussed the mapping quality in this thread, as a potential culprit. The HC applies a fairly stringent MQ-based filtering, automatically discarding any read that has MQ lower than 20. Have you checked what is the MQ of the data in the affected regions? What does the filtering summary (last bit of console output) say about HCMappingQualityFilter counts?

    Geraldine Van der Auwera, PhD

  • RyanRyan Posts: 3Member

    Hi @Geraldine,

    The mapping quality seems to be pretty good in those areas. Here is an output of HC. In this case we had removed our duplicate reads from the input:

    INFO 13:22:29,057 MicroScheduler - 220625 reads were filtered out during the traversal out of approximately 13691080 total reads (1.61%)
    INFO 13:22:29,057 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
    INFO 13:22:29,057 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 13:22:29,058 MicroScheduler - -> 220625 reads (1.61% of total) failing HCMappingQualityFilter
    INFO 13:22:29,058 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 13:22:29,058 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 13:22:29,059 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO 13:22:29,059 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
    INFO 13:22:30,138 GATKRunReport - Uploaded run statistics report to AWS S3



  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    Hmm, so much for that explanation. Have you checked with -bamOut how many reads are actually left in that region after the HC's reassembly process?

    Geraldine Van der Auwera, PhD

  • RyanRyan Posts: 3Member

    We just checked with -bamOut and only a small subset of reads are used (~200-350) after HC's assembly out of the 1000 or plus reads available in those areas. I am willing to share some data if that is helpful. Thanks for the help!


  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    I see -- can you just please check whether this still happens with the latest version (3.2-2)? If so, please upload a snippet of data to our FTP. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.