HC listing depth one read less

nitinCelmatix · NYC · Member
edited August 2017 in Ask the GATK team

Hi,
I ran HaplotypeCaller on a bunch of samples using the following commands:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -drf DuplicateRead -R hg19.fa -I SAMPLE.bam -o SAMPLE.g.vcf -L target_region.bed -ERC GVCF
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg19.fa -V SAMPLE.g.vcf -o SAMPLE.hc.vcf

For many variants, DP is listed as one read less than it actually is. I loaded the BAM file in IGV and counted the reads manually (the same count also appears when I hover over the coverage bar plot). Moreover, the correct depth is reported by the DepthOfCoverage tool:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -drf DuplicateRead -R hg19.fa -I SAMPLE.bam --omitDepthOutputAtEachBase -o SAMPLE.coverage

Here is an example from hc.vcf:

chrX 49119876 . T C 531.77 . AC=2;AF=1.00;AN=2;DP=19;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.99;SOR=1.022 GT:AD:DP:GQ:PL 1/1:0,19:19:57:560,57,0

and from DepthOfCoverage:
chrX:49119876 20 20.00 20 20.00 21 21 21 100.0

So, the depth matches between DepthOfCoverage and the BAM viewed in IGV (depth 20), but it is one less in hc.vcf (DP=19).
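As a sanity check outside GATK, I believe samtools can report the raw depth at the site as well (a sketch only; the samtools command is shown as a comment since it needs the BAM on hand, and the numeric values are the ones from this post):

```shell
# Sketch: cross-check raw depth with samtools, e.g.
#   samtools depth -a -r chrX:49119876-49119876 SAMPLE.bam
# The third column of its output is the per-base depth. Comparing that
# against the VCF's DP field shows the off-by-one:
raw_depth=20   # samtools / DepthOfCoverage / IGV
vcf_dp=19      # DP in SAMPLE.hc.vcf
echo "difference=$((raw_depth - vcf_dp))"   # difference=1
```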

Has anybody else seen this issue or know how to fix it? This is giving me problems for the variants that are right at my threshold.

Thanks a lot in advance!

Answers

  • HaplotypeCaller by default filters out reads with a mapping quality below 20 (--min_mapping_quality_score). Have you checked whether your missing read falls below this threshold? That could be the reason.

    Also, HaplotypeCaller does not take into account, for genotype calling and the DP statistic, bases below quality 10 (--min_base_quality_score). Check whether the base at chrX:49119876 is below this threshold in one of the reads, too!
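    As an illustration only (this sketches the filtering effect, not GATK's actual code), a few hypothetical SAM-style records show how a single sub-threshold read shifts DP down by one:

```shell
# Three hypothetical reads; field 5 of a SAM record is MAPQ.
# read3 sits below the default --min_mapping_quality_score of 20,
# so a caller applying that filter reports DP one lower than IGV.
sam='read1 0 chrX 49119850 60
read2 0 chrX 49119860 60
read3 0 chrX 49119870 15'

total=$(printf '%s\n' "$sam" | grep -c '')
passing=$(printf '%s\n' "$sam" | awk '$5 >= 20' | grep -c '')
echo "total=$total passing=$passing"   # total=3 passing=2
```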

  • Thanks for the reply. However, neither of those applies in my case. All of my reads have a mapping quality of 60 and high base quality scores.
    What else could be causing HaplotypeCaller to drop a few reads?
    Thanks!
  • nitinCelmatix · NYC · Member
    edited August 2017

    I did a bit more research, but to no avail. I ran HC on a very small portion of the BAM file: I first extracted the reads overlapping a specific variant position where HC reported fewer reads in the VCF, then ran HC only on that small BAM. The idea was to see how many reads failed each built-in filter, but the log shows that no reads failed any HC filter:

    INFO 18:12:21,870 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 14 total reads (0.00%)
    INFO 18:12:21,871 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 18:12:21,871 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 18:12:21,872 MicroScheduler - -> 0 reads (0.00% of total) failing HCMappingQualityFilter
    INFO 18:12:21,872 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 18:12:21,872 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 18:12:21,873 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO 18:12:21,873 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

    Now, in fact, only 11 reads (instead of 14) were reported in the GVCF:

    chrX 49119876 . T C,<NON_REF> 52.77 . BaseQRankSum=0.278;ClippingRankSum=-0.896;DP=11;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-0.597;ReadPosRankSum=-1.663 GT:AD:DP:GQ:PL:SB 0/1:7,4,0:11:81:81,0,155,102,167,269:3,4,2,2

    I am at my wit's end trying to figure out why there is a discrepancy between HC and DepthOfCoverage. My main aim is to get the same depth reported by the two tools.

    Any idea about this will help.
    Thanks!

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @nitinCelmatix Frankly this could simply be caused by an off-by-one counting error, which is one of the most common errors in software development. If you have a simple, uncontroversial case (where there is no ambiguity about what should be counted) that reproduces the issue, we can have the dev team look at it and determine what's going on here.

  • Thanks a lot Geraldine, for the help. There is sometimes a difference of 3 reads (not just 1). These are some confidential files. How can I send them to you in a secure way?

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @nitinCelmatix
    Hi,

    To confirm, are you looking at the bamout files and discounting the Artificial Haplotypes? If so, details for submitting a bug report are here.
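    (If you have not generated one yet: a bamout can be written by adding -bamout to your HaplotypeCaller command. A sketch in GATK3 syntax, with the interval around your site chosen purely for illustration; this is not runnable as-is without your files:)

```shell
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R hg19.fa -I SAMPLE.bam \
    -L chrX:49119376-49120376 \
    -ERC GVCF \
    -bamout SAMPLE.bamout.bam \
    -o SAMPLE.g.vcf
```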

    -Sheila

  • Thanks, Sheila. I prepared the files but I could not upload them (I am assuming I need to upload them to 'incoming') using scp:

    $ scp files.zip gsapubftp@ftp.broadinstitute.org:/incoming/.
    ssh: connect to host ftp.broadinstitute.org port 22: Connection refused
    lost connection

    Am I missing something? Thanks again for the help.

  • Hi Sheila,
    Can you please let me know exactly how I can send you the files? Please see my error above. Thanks!

  • Hi Sheila,
    I have managed to upload the zip file that contains the commands, the bam/bai snippet files and other supporting files needed to diagnose the issue (it is a small file). The file is called "HC_depth.zip".
    Please let me know if you need anything else or when you have an update.
    Thanks for all your help in advance!

    Issue · GitHub
    Filed by Sheila: issue #2424, state closed, closed by chandrans.
  • Sheila · Broad Institute · Member, Broadie, Moderator

    @nitinCelmatix
    Hi,

    Thanks. I will have a look soon. The error message usually arises from too many people trying to access the ftp site at once. The best recommendation we give people is to keep trying.

    -Sheila

  • nitinCelmatix · NYC · Member
    edited August 2017

    Hi,
    Please let me know if there are any updates on the issue above, or if you have had any trouble reproducing it.
    Thanks!

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @nitinCelmatix
    Hi,

    I have not gotten to it yet. I hope to do so by the beginning of next week.

    -Sheila

  • Hi Sheila,
    Please let me know if you have any updates.
    Thanks for your efforts in advance.

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @nitinCelmatix
    Hi,

    It looks like the hg19 reference we have is not the one you used. Can you upload the reference you used?

    Thanks,
    Sheila

  • Hi Sheila,
    Sorry, I was traveling out of the country for a while and so could not respond.
    I have just uploaded the hg19.fa file which I used as reference. Hope this will help you resolve the matter. Please let me know if you need anything else.
    Thanks so much for your help.

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @nitinCelmatix
    Hi,

    Thanks. I just ran it again and the reference works. I did see the issue in GATK3 (version 3.8), but it does not occur in GATK4 :smiley: There will be no more bug fixing in GATK3, so you will need to upgrade to GATK4.

    Side note: there really are only two reads in the original BAM file that show no evidence for the variant. I am happy to report that not only are the read depths more accurate in GATK4, but the variant call is also more accurate (i.e., there is no variant call).
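    For reference, the equivalent calls in GATK4 use the single gatk wrapper rather than java -jar; roughly the following (a sketch only; verify the flag spellings against your version's --help output):

```shell
gatk HaplotypeCaller \
    -R hg19.fa -I SAMPLE.bam -O SAMPLE.g.vcf.gz \
    -L target_region.bed -ERC GVCF
gatk GenotypeGVCFs \
    -R hg19.fa -V SAMPLE.g.vcf.gz -O SAMPLE.hc.vcf.gz
```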

    -Sheila

  • Thanks, Sheila. I see that GATK 4.0 is currently in beta. When can we expect a stable release?
    Thanks!

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @nitinCelmatix
    Hi,

    We are going to announce the release date soon, so keep an eye out for a new announcement :smiley:

    -Sheila
