Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

HaplotypeCaller DP reports low values

sam24sam24 UKPosts: 2Member

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,

Sam

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi Sam,

    I believe this is due to an issue with downsampling in HaplotypeCaller, which we are aware of but haven't been able to fix yet. In a nutshell, both callers normally downsample the data to a max coverage of 250 by default, but it seems that downsampling is currently not working for HC. So when you have ~200X coverage, there's no downsampling anywhere so you don't see any differences, but when you have much higher coverage, downsampling kicks in for UG but not for HC, hence the differences you observe.

    We're working to fix the bug; in the meantime you can either hard-downsample your data using PrintReads with -dcov 250 before running HC, or just ignore the differences. The results are correct in both cases. Though the HC runs are bound to take much longer without downsampling...

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Ack, and now my colleague points out that you are using -dt NONE in both cases so downsampling can't be the issue. It was such a nice, tidy explanation... but ignore what I said earlier, sorry.

    Do you get the same discrepancies if you run single-threaded? If not it might be the multi-threading that is malfunctioning.

    Geraldine Van der Auwera, PhD

  • sam24sam24 UKPosts: 2Member

    Hi Geraldine,

    Thank you for your comments about downsampling and HC. This would have been a lovely explanation! But yes, I am switching off downsampling. I did try (amongst many iterations) setting -dcov 5000 and this did not change the reported DP values.

    It is good to know that the low numbers should not worry me. I can definitely report to you that all the variants in my data set are being reported, just with a lower numbers.

    I have just re-run the same command using a single thread and unfortunately got the same result.

    Best, Sam

  • KathKath Posts: 36Member

    Hi,

    Has anyone had any further thoughts about this? I seem to be having the same issue. At first I assumed it was that IGV included duplicate reads (as I have marked but not removed duplicates from my bam files) but I have set preferences such that IGV should not display duplicates...

    Thanks

    Kath

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi @Kath,

    Unfortunately we haven't had time to look into this in detail, so we're not sure yet what's going on. Since it doesn't seem to have adverse consequences on variant discovery, we're not currently able to treat this as a priority issue. However if you have data that suggests it is a big problem, we would look into it more closely.

    Geraldine Van der Auwera, PhD

  • MaikeMaike FrankfurtPosts: 1Member

    Dear Geraldine,

    are there any news regarding this? I think I 'm having this issue too. I have amplicon data and used the HaplotypeCaller for a sample conaining one individual. I should get a depth of about DP=370 but I get values between 19 and 199. The data is not downsampled and only one of the reads was filtered out (by the NotPrimaryAlignmentFilter). Unfortunately I can't really say how much of a problem this ist because I don't really know why it happens and what it does to the results, but I'm worried that most of my data might not be used for the computation, either in a random way or because of some parameter that might introduce a bias. Can you say anything about that? I would really appreciate your help.

    Thank you, Maike

  • ebanksebanks Posts: 671GSA Member mod

    You can test whether reads are being used by the tool with the -bamOut option. You will see which reads align to which haplotypes - and which are uninformative to the calculation.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.