HaplotypeCaller DP reports low values

sam24 (UK; Posts: 2; Member)

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it reports ~10x lower DP and AD values than the read counts visible in the IGV browser and reported by UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq (150 bp paired-end). The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

For comparison, using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping qualities and the number of alignments in a 200 nt window around the example position above, and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab that has ~200x coverage and the same assay principle (just more amplicons), the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and IGV.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,

Sam

Answers

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Hi Sam,

    I believe this is due to an issue with downsampling in HaplotypeCaller, which we are aware of but haven't been able to fix yet. In a nutshell, both callers normally downsample the data to a maximum coverage of 250 by default, but it seems that downsampling is currently not working properly for HC. So at ~200x coverage there's no downsampling anywhere and you see no differences; at much higher coverage, downsampling kicks in for UG but not for HC, hence the differences you observe.

    We're working to fix the bug; in the meantime you can either hard-downsample your data using PrintReads with -dcov 250 before running HC, or just ignore the differences. The results are correct in either case, though the HC runs are bound to take much longer without downsampling...
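
    For example, a hard-downsampling pass along those lines might look like this (a sketch based on your command above; the output BAM name is illustrative):

    java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
    -T PrintReads \
    -R /gatk_bundle/ucsc.hg19.fasta \
    -I sample1.rg.bam \
    -dcov 250 \
    -o sample1.dcov250.bam

    You would then point your HaplotypeCaller command at sample1.dcov250.bam instead of sample1.rg.bam.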

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Ack, and now my colleague points out that you are using -dt NONE in both cases, so downsampling can't be the issue. It was such a nice, tidy explanation... but please ignore what I said earlier, sorry.

    Do you get the same discrepancies if you run single-threaded? If not, it might be the multi-threading that is malfunctioning.

    Geraldine Van der Auwera, PhD

  • sam24 (UK; Posts: 2; Member)

    Hi Geraldine,

    Thank you for your comments about downsampling and HC. This would have been a lovely explanation! But yes, I am switching off downsampling. I did try (amongst many iterations) setting -dcov 5000, and this did not change the reported DP values.

    It is good to know that the low numbers should not worry me. I can confirm that all the variants in my data set are being reported, just with lower numbers.

    I have just re-run the same command using a single thread and unfortunately got the same result.

    Best, Sam

  • Kath (Posts: 36; Member)

    Hi,

    Has anyone had any further thoughts about this? I seem to be having the same issue. At first I assumed it was because IGV includes duplicate reads (I have marked, but not removed, duplicates in my BAM files), but I have set my preferences such that IGV should not display duplicates...

    Thanks

    Kath

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Hi @Kath,

    Unfortunately we haven't had time to look into this in detail, so we're not sure yet what's going on. Since it doesn't seem to have adverse consequences for variant discovery, we're not currently able to treat it as a priority issue. However, if you have data suggesting it is a big problem, we will look into it more closely.

    Geraldine Van der Auwera, PhD

  • Maike (Frankfurt; Posts: 1; Member)

    Dear Geraldine,

    is there any news regarding this? I think I'm having this issue too. I have amplicon data and used HaplotypeCaller on a sample containing one individual. I should get a depth of about DP=370, but I get values between 19 and 199. The data is not downsampled, and only one of the reads was filtered out (by the NotPrimaryAlignmentFilter). Unfortunately I can't really say how much of a problem this is, because I don't really know why it happens or what it does to the results, but I'm worried that most of my data might not be used for the computation, either in a random way or because of some parameter that might introduce a bias. Can you say anything about that? I would really appreciate your help.

    Thank you, Maike

  • ebanks (Posts: 683; GATK Developer)

    You can test whether reads are being used by the tool with the -bamOut option. You will see which reads align to which haplotypes - and which are uninformative to the calculation.
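
    For example (a sketch; file names and the interval are illustrative, and this assumes the GATK 3.x spelling of the argument, -bamout):

    java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R /gatk_bundle/ucsc.hg19.fasta \
    -I sample1.rg.bam \
    -L chr17:41245366-41245566 \
    -bamout sample1.HC.bamout.bam \
    -o sample1.HC.region.vcf

    Loading the -bamout BAM in IGV next to the original BAM shows which reads survived reassembly and which haplotypes they support.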

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Ryan (Posts: 4; Member)

    Hi Geraldine,

    We have a similar situation to Maike's. We are using a targeted capture method and expect coverage of about 1000x in our areas of interest. As above, the issue happens with HaplotypeCaller even though no downsampling has been selected. We have tried the latest release, GATK 3.2-2, and the issue is still present.

    I can't make the case that we are missing calls because of this, but the fact that the read depth reported by HC is erroneous could have some impact on the results we report.

    As more people use GATK to analyze their amplicon or targeted capture panels, more and more datasets will have this issue.

    Thanks for looking into this!

    Ryan

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Hi @Ryan,

    It occurs to me that we haven't yet discussed mapping quality in this thread as a potential culprit. The HC applies fairly stringent MQ-based filtering, automatically discarding any read with MQ lower than 20. Have you checked the MQ of the data in the affected regions? What does the filtering summary (the last bit of console output) say about the HCMappingQualityFilter counts?
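
    For a quick check, something like this tallies the MAPQ distribution in a region with samtools (a sketch; adjust the BAM name and interval to your data):

    samtools view sample1.rg.bam chr17:41245366-41245566 \
    | awk '{count[$5]++} END {for (mq in count) print count[mq], mq}' \
    | sort -k2,2nr

    Each output line is a read count followed by a MAPQ value; reads with MAPQ below 20 are the ones HCMappingQualityFilter would discard.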

    Geraldine Van der Auwera, PhD

  • Ryan (Posts: 4; Member)

    Hi @Geraldine,

    The mapping quality seems to be pretty good in those areas. Here is the filtering output from HC; in this case we had removed duplicate reads from the input:

    INFO 13:22:29,057 MicroScheduler - 220625 reads were filtered out during the traversal out of approximately 13691080 total reads (1.61%)
    INFO 13:22:29,057 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
    INFO 13:22:29,057 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 13:22:29,058 MicroScheduler - -> 220625 reads (1.61% of total) failing HCMappingQualityFilter
    INFO 13:22:29,058 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 13:22:29,058 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 13:22:29,059 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO 13:22:29,059 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
    INFO 13:22:30,138 GATKRunReport - Uploaded run statistics report to AWS S3

    Thanks,

    Ryan

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Hmm, so much for that explanation. Have you checked with -bamOut how many reads are actually left in that region after the HC's reassembly process?

    Geraldine Van der Auwera, PhD

  • Ryan (Posts: 4; Member)

    We just checked with -bamOut, and only a small subset of reads (~200-350) is used after HC's assembly, out of the 1000+ reads available in those areas. I am willing to share some data if that would be helpful. Thanks for the help!

    Ryan

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    I see -- can you just please check whether this still happens with the latest version (3.2-2)? If so, please upload a snippet of data to our FTP. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • Ryan (Posts: 4; Member)

    Hi Geraldine,

    I have uploaded a snippet of our data to your FTP server. The compressed file is named HC_low_DP_issue_with_high_coverage.tar.gz. This snippet was generated with the latest version of GATK, 3.2-2. Let me know if you need a larger segment of data, as I limited it to a small region on MTOR with a variant call.

    Thanks!

    Ryan

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Thanks Ryan. I ran some tests on your data and I think I can explain what's happening here.

    The first couple of steps in the HC process involve determining which regions may contain variants ("determine ActiveRegions") and reassembling the reads to work out the possible haplotypes.

    In your case, the HC initially identifies an active region with coordinates chr1:11205021-11205095 containing 3000 reads, then applies trimming and processing that leaves 1070 reads in the active region, which is what it proceeds with to the next step (assembly). The reason for that reduction is a hardcoded downsampling operation that reduces the overall content of the active region to approximately 1000 reads in total (while respecting a minimum of 5 reads starting per position).

    I believe that is why your particular site of interest ends up with only a fraction of the expected depth. Currently this 1000-read cap per active region cannot be overridden from the command line, but I'm going to put in a feature request to make it an adjustable parameter in future versions.

    Geraldine Van der Auwera, PhD

  • mmaj (Posts: 9; Member)

    Hi Geraldine (and others),

    We're experiencing similar problems with HaplotypeCaller (both v3.1-1 and v3.2-2). In fact, our problem is almost identical to the one Ryan reported a few posts back in this thread. We use amplicon sequencing and consequently have very high coverage (150 bp PE data). In specific cases, HC reports DP=30 in regions with approximately 1000x good-quality coverage. In contrast, UnifiedGenotyper calls the same variants and reports the "correct" DP of ~1000x.

    This shouldn't be a result of HCMappingQualityFilter, as only ~2% of reads are discarded by that filter. However, we do observe a rather low fraction of reads in these regions carrying the "proper pair" flag. Could this discrepancy in DP reporting between HC and UG be due to some inherent difference in how the two callers treat reads that do not have the proper-pair flag set? The discrepancy persists for all the aligners we frequently use (BWA, MOSAIK, STAMPY). Thanks in advance!
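
    For reference, the proper-pair fraction in a region can be tallied like this (a sketch; the BAM name and interval are illustrative):

    # total reads in the region
    samtools view -c sample.bam chr1:11205021-11205095
    # reads with the proper-pair flag (0x2) set
    samtools view -c -f 0x2 sample.bam chr1:11205021-11205095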

    Mads

  • Sheila (Broad Institute; Posts: 561; Member, GATK Developer, Broadie, Moderator)

    @mmaj

    Hi Mads,

    We are aware of this issue, but it is not a super-high-priority fix, as the output results are correct; only the DP is being reported incorrectly. If the results were incorrect, it would be a higher priority.

    This seems like a quirk rather than a bug, but we will certainly post here when the issue is resolved.

    -Sheila

  • Sheila (Broad Institute; Posts: 561; Member, GATK Developer, Broadie, Moderator)

    @mmaj

    Hi again,

    A correction to my previous comment: the DP is not actually being reported incorrectly; rather, HaplotypeCaller is using fewer reads than it could/should.

    -Sheila

  • mmaj (Posts: 9; Member)

    Hi Sheila and Geraldine,

    While it may not seem like a bug, I must admit I have a hard time understanding how HC selects a tiny subset of (what appear to be) perfectly good reads in our amplicon sequencing setup. In a very recent sample, we have 600x+ coverage at a clearly heterozygous site, and UnifiedGenotyper correctly reports the DP. However, HC reports only 16 reads for this region, ultimately resulting in a very low quality score for that particular variant. You may argue that this is not important, but it makes it very difficult to apply hard filters to our raw callsets, since we now have to loosen the filters to account for the misleading DP and quality scores reported by HC. Do you know whether this issue is being, or will be, addressed anytime soon?

    Anyways, thanks for a really great toolkit!!

    -Mads

  • Sheila (Broad Institute; Posts: 561; Member, GATK Developer, Broadie, Moderator)

    @mmaj

    Hi Mads,

    Which version of GATK are you using? Can you post the commands you used? Have you checked the mapping qualities of the reads in the area? If they are low, that could be a potential reason they are getting filtered out.

    -Sheila

  • mmaj (Posts: 9; Member)

    Hi Sheila,

    I'm using GATK v3.2-2. The command is:

    java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    --dbsnp $dbSNP \
    -I sample.bam \
    -rf MappingQualityZero \
    -pairHMM VECTOR_LOGLESS_CACHING \
    -o sample.raw.HTC.vcf \
    -R $fastaref \
    -L $ROI \
    -minPruning 2 \
    -nct $threads \
    -dt NONE

    From this run, we lose ~4% of reads to the read filters, primarily HCMappingQualityFilter. The mapping quality is generally (always) rather low for this particular setup, as we're sequencing degraded DNA from FFPE tissue. However, for the variant of interest we get MQ=23.7 from the above HC run, and similarly MQ=23.2 for the same variant using UnifiedGenotyper; but UG correctly emits the DP (600+ reads) and AD for this variant. Consequently, the QUAL emitted by HC and UG for this variant is ~150 and ~4700, respectively...

    While I'm aware that the mapping qualities are definitely lower for this type of setup, I assumed that reads passing the automatically applied read filters (e.g. HCMappingQualityFilter) would be included in the variant-calling and reporting process. And although HC's de novo assembly step may affect the emitted DP and AD to some degree, I still don't understand how it could reduce DP from 600+ to 16 :)

    -Mads
  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Hi @mmaj,

    I notice you are using -dt NONE. We have had reports that modifying the default downsampling settings somehow adversely affects HC's downsampling process; in particular, it seems that disabling downsampling leads to decreased reported depth. Unintuitive and vexing, it is. This is a problem specific to HC; we haven't had the bandwidth yet to assign someone to figure it out, so unfortunately our only recommendation at this time is to leave the downsampling settings alone.

    Geraldine Van der Auwera, PhD

  • mmaj (Posts: 9; Member)

    Hi Geraldine, thanks for the suggestion! I've re-run the HC command without disabling downsampling (simply removing -dt NONE from the command above), and it emits a DP of around 200 reads, which (albeit only about a third of the total reads at this position) is a huge difference from the DP=16 emitted with -dt NONE... I'll definitely keep the effect of (no) downsampling in mind! Thanks!

    -Mads

  • metheuse (Ann Arbor; Posts: 5; Member)

    I ran HaplotypeCaller on some exome-seq samples with both dedupped reads (>200x coverage) and non-dedupped reads (>2000x coverage; yes, there is a lot of duplication). I was expecting the depth from the non-dedupped reads to be much higher than from the dedupped reads. However, it's remarkably the opposite. Here is the command I was using:

    java -jar ${GATK} \
    -T HaplotypeCaller \
    -R ${genome_fasta} \
    -I Sample_${id}_recal.bam \
    -stand_call_conf 30.0 \
    -stand_emit_conf 10.0 \
    --dbsnp ${dbsnp} \
    -L ${target_bed} \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000 \
    -o Sample_${id}.vcf

    By the way, after running HaplotypeCaller and GenotypeGVCFs there are 2300 variants from the dedupped reads and 2270 variants from the non-dedupped reads, which is, again, against my expectation. This is a targeted exome-seq experiment in which only the exons of a hundred genes were targeted, which is why there is very high duplication. I was unsure whether I should remove duplicates, so I tried both.

    I kind of understand that this is because HC randomly downsamples the reads: when there are so many duplicates, the unique reads have a much lower chance of being included. I was thinking I should run HC without downsampling by adding -dt NONE; however, after reading this thread, especially the message above, it seems that adding -dt NONE is not a good choice. Any advice? Thanks.

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Hi @metheuse,

    There is actually no expectation of getting either more or fewer variants from dedupped vs. non-dedupped reads, since you don't know ahead of time whether the duplicates support reads that are all ref or reads that carry evidence of variation.

    Regarding your observations about depth, can you give a little more detail and post a few records showing the difference?

    Geraldine Van der Auwera, PhD

  • metheuse (Ann Arbor; Posts: 5; Member)

    @Geraldine_VdAuwera said: [quoted in full above]

    Thanks for replying. I think I'll remove duplicates anyway. What I really wanted to know is: what is the benefit of letting HC do downsampling? It doesn't make sense to me to sequence DNA ultra-deeply and then not use all of it in variant calling. But from the discussion above on this page, it seems that you don't recommend using -dt NONE. Is that still true? Thanks.
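
    In case it helps others, a minimal duplicate-removal pass with Picard might look like this (a sketch; file names are illustrative, and REMOVE_DUPLICATES=true drops duplicates rather than just flagging them):

    java -jar picard.jar MarkDuplicates \
    INPUT=Sample_${id}_recal.bam \
    OUTPUT=Sample_${id}_dedup.bam \
    METRICS_FILE=Sample_${id}_dup_metrics.txt \
    REMOVE_DUPLICATES=true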

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    The downsampling system is tuned to provide the analysis tools with as much data as they need to be empowered to make the right decisions, and no more. Providing the tools with more data than they need just slows them down and increases your compute costs without benefit.

    Generally speaking, sequencing DNA ultra-deeply rarely makes sense. The way the Broad's production facility (and many others) works is that the sequencing process is optimized to yield the right depth of data per experimental design. This can vary between studies, and when a new design is introduced, pilot studies are carried out to determine what the depth of sequencing should be. This allows you to get the discovery power you need without wasting money and computing resources.

    Geraldine Van der Auwera, PhD
