Our documentation websites are currently offline due to a data center fire. We do not yet have an ETA for restoring service; we’ll update this message when we know more.

Strang alternating GQ=0, GQ=99 calls in g.vcf

I was looking at a g.vcf file, and I noticed that there are many alternating calls, with one having GQ of 99, and the next having a GQ of 0. Looking at the other fields, I can find no reason why the GQ of these calls should alternate like that.

For example, these two calls are adjacent, and cover only one position each. The only difference appears to be that the GQ=0 call has a slightly higher read depth
gi|194447306|ref|NC_011083.1| 25644 0 A <NON_REF> 0 0 END=25644 GT:DP:GQ:MIN_DP:PL 0:34:99:34:0,1391 gi|194447306|ref|NC_011083.1| 25645 0 G <NON_REF> 0 0 END=25645 GT:DP:GQ:MIN_DP:PL 0:35:0:35:0,0

Here, the GQ=0 call actually spans more positions, and has a higher depth then the GQ=99 call
gi|194447306|ref|NC_011083.1| 25646 0 T <NON_REF> 0 0 END=25646 GT:DP:GQ:MIN_DP:PL 0:35:99:35:0,1480 gi|194447306|ref|NC_011083.1| 25647 0 C <NON_REF> 0 0 END=25655 GT:DP:GQ:MIN_DP:PL 0:39:0:35:0,0

This pattern repeats throughout the file, can someone point me to an explanation as to why this is?

Tagged:

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Redmar_van_den_Berg
    Hi,

    Can you tell me the exact command and version of HaplotypeCaller you ran?

    Please also post a screenshot of the original bam file and bamout file of those regions above.

    Thanks,
    Sheila

  • edited January 2016

    Hi,

    The exact HaplotypeCaller command
    gatk -T HaplotypeCaller \ --sample_ploidy 1 \ -R $reference \ -I $bamfile \ -o $output \ -ERC GVCF \ -bamout $bamout

    The exact version number
    $ gatk --version 3.5-0-g36282e4

    Region of the g.vcf file shown in the screenshot below
    gi|194447306|ref|NC_011083.1| 25630 0 A <NON_REF> 0 0 END=25630 GT:DP:GQ:MIN_DP:PL 0:7:99:7:0,294 gi|194447306|ref|NC_011083.1| 25631 0 A <NON_REF> 0 0 END=25632 GT:DP:GQ:MIN_DP:PL 0:9:0:7:0,0 gi|194447306|ref|NC_011083.1| 25633 0 T <NON_REF> 0 0 END=25635 GT:DP:GQ:MIN_DP:PL 0:15:99:9:0,381 gi|194447306|ref|NC_011083.1| 25636 0 A <NON_REF> 0 0 END=25640 GT:DP:GQ:MIN_DP:PL 0:22:0:20:0,0 gi|194447306|ref|NC_011083.1| 25641 0 C <NON_REF> 0 0 END=25642 GT:DP:GQ:MIN_DP:PL 0:30:99:28:0,1160 gi|194447306|ref|NC_011083.1| 25643 0 G <NON_REF> 0 0 END=25643 GT:DP:GQ:MIN_DP:PL 0:32:0:32:0,0 gi|194447306|ref|NC_011083.1| 25644 0 A <NON_REF> 0 0 END=25644 GT:DP:GQ:MIN_DP:PL 0:34:99:34:0,1391 gi|194447306|ref|NC_011083.1| 25645 0 G <NON_REF> 0 0 END=25645 GT:DP:GQ:MIN_DP:PL 0:35:0:35:0,0 gi|194447306|ref|NC_011083.1| 25646 0 T <NON_REF> 0 0 END=25646 GT:DP:GQ:MIN_DP:PL 0:35:99:35:0,1480 gi|194447306|ref|NC_011083.1| 25647 0 C <NON_REF> 0 0 END=25655 GT:DP:GQ:MIN_DP:PL 0:39:0:35:0,0 gi|194447306|ref|NC_011083.1| 25656 0 A <NON_REF> 0 0 END=25656 GT:DP:GQ:MIN_DP:PL 0:47:99:47:0,1800 gi|194447306|ref|NC_011083.1| 25657 0 G <NON_REF> 0 0 END=25657 GT:DP:GQ:MIN_DP:PL 0:47:0:47:0,0 gi|194447306|ref|NC_011083.1| 25658 0 A <NON_REF> 0 0 END=25659 GT:DP:GQ:MIN_DP:PL 0:52:99:51:0,1800 gi|194447306|ref|NC_011083.1| 25660 0 T <NON_REF> 0 0 END=25663 GT:DP:GQ:MIN_DP:PL 0:55:0:54:0,0 gi|194447306|ref|NC_011083.1| 25664 0 C <NON_REF> 0 0 END=25664 GT:DP:GQ:MIN_DP:PL 0:62:99:62:0,1800 gi|194447306|ref|NC_011083.1| 25665 0 C <NON_REF> 0 0 END=25666 GT:DP:GQ:MIN_DP:PL 0:63:0:62:0,0 gi|194447306|ref|NC_011083.1| 25667 0 G <NON_REF> 0 0 END=25667 GT:DP:GQ:MIN_DP:PL 0:66:99:66:0,1800 gi|194447306|ref|NC_011083.1| 25668 0 C <NON_REF> 0 0 END=25669 GT:DP:GQ:MIN_DP:PL 0:67:0:66:0,0 gi|194447306|ref|NC_011083.1| 25670 0 T <NON_REF> 0 0 END=25670 GT:DP:GQ:MIN_DP:PL 0:67:99:67:0,1800 gi|194447306|ref|NC_011083.1| 25671 0 A <NON_REF> 0 0 END=25677 GT:DP:GQ:MIN_DP:PL 0:73:0:70:0,0 gi|194447306|ref|NC_011083.1| 25678 0 C <NON_REF> 0 0 END=25680 GT:DP:GQ:MIN_DP:PL 0:75:99:74:0,1800 gi|194447306|ref|NC_011083.1| 25681 0 T <NON_REF> 0 0 END=25683 GT:DP:GQ:MIN_DP:PL 0:77:0:76:0,0 gi|194447306|ref|NC_011083.1| 25684 0 C <NON_REF> 0 0 END=63392 GT:DP:GQ:MIN_DP:PL 0:164:99:85:0,1800

    IGV screenshot. The top sample is the original bam file, the bottom row is the bamout file. I do not understand why a lot of the positions in the g.vcf file posted above show up with 0 coverage in the igv view.

    Edit to add: I've uploaded the image here since it doesn't seem to embed properly http://imgur.com/S5hMpHJ

    Issue · Github
    by Sheila

    Issue Number
    498
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Redmar_van_den_Berg
    Hi,

    I am asking Geraldine to have a look at this. We will get back to you asap.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Redmar_van_den_Berg
    Hi,

    Can you try running the bamout command with -forceActive, -disableOptimizations and -dontTrimActiveRegions so we can see the data in the hom-ref blocks?

    Please also zoom out and show that there is actually data in the original bam file at those positions.

    I see some blue and red reads in the screenshot you posted; what IGV preferences did you set to display?

    Thanks,
    Sheila

  • Hi,

    I've ran the analysis again, with the following command
    gatk -T HaplotypeCaller \ --sample_ploidy 1 \ -R $reference \ -I $bamfile \ -o $output \ -ERC GVCF \ -bamout $bamout \ -forceActive \ -disableOptimizations \ -dontTrimActiveRegions

    I've zoomed out the igv view this time, the view now covers the following records from the g.vcf file, excluding the first and the last
    gi|194447306|ref|NC_011083.1| 1 0 A <NON_REF> 0 0 END=24481 GT:DP:GQ:MIN_DP:PL 0:167:99:89:0,225 gi|194447306|ref|NC_011083.1| 24482 0 G <NON_REF> 0 0 END=24482 GT:DP:GQ:MIN_DP:PL 0:109:45:109:0,45 gi|194447306|ref|NC_011083.1| 24483 0 T <NON_REF> 0 0 END=25626 GT:DP:GQ:MIN_DP:PL 0:0:0:0:0,0 gi|194447306|ref|NC_011083.1| 25627 0 A <NON_REF> 0 0 END=25628 GT:DP:GQ:MIN_DP:PL 0:5:99:5:0,199 gi|194447306|ref|NC_011083.1| 25629 0 G <NON_REF> 0 0 END=25629 GT:DP:GQ:MIN_DP:PL 0:6:0:6:0,0 gi|194447306|ref|NC_011083.1| 25630 0 A <NON_REF> 0 0 END=25630 GT:DP:GQ:MIN_DP:PL 0:7:99:7:0,294 gi|194447306|ref|NC_011083.1| 25631 0 A <NON_REF> 0 0 END=25632 GT:DP:GQ:MIN_DP:PL 0:9:0:7:0,0 gi|194447306|ref|NC_011083.1| 25633 0 T <NON_REF> 0 0 END=25635 GT:DP:GQ:MIN_DP:PL 0:15:99:9:0,381 gi|194447306|ref|NC_011083.1| 25636 0 A <NON_REF> 0 0 END=25640 GT:DP:GQ:MIN_DP:PL 0:22:0:20:0,0 gi|194447306|ref|NC_011083.1| 25641 0 C <NON_REF> 0 0 END=25642 GT:DP:GQ:MIN_DP:PL 0:30:99:28:0,1160 gi|194447306|ref|NC_011083.1| 25643 0 G <NON_REF> 0 0 END=25643 GT:DP:GQ:MIN_DP:PL 0:32:0:32:0,0 gi|194447306|ref|NC_011083.1| 25644 0 A <NON_REF> 0 0 END=25644 GT:DP:GQ:MIN_DP:PL 0:34:99:34:0,1391 gi|194447306|ref|NC_011083.1| 25645 0 G <NON_REF> 0 0 END=25645 GT:DP:GQ:MIN_DP:PL 0:35:0:35:0,0 gi|194447306|ref|NC_011083.1| 25646 0 T <NON_REF> 0 0 END=25646 GT:DP:GQ:MIN_DP:PL 0:35:99:35:0,1480 gi|194447306|ref|NC_011083.1| 25647 0 C <NON_REF> 0 0 END=25655 GT:DP:GQ:MIN_DP:PL 0:39:0:35:0,0 gi|194447306|ref|NC_011083.1| 25656 0 A <NON_REF> 0 0 END=25656 GT:DP:GQ:MIN_DP:PL 0:47:99:47:0,1800 gi|194447306|ref|NC_011083.1| 25657 0 G <NON_REF> 0 0 END=25657 GT:DP:GQ:MIN_DP:PL 0:47:0:47:0,0 gi|194447306|ref|NC_011083.1| 25658 0 A <NON_REF> 0 0 END=25659 GT:DP:GQ:MIN_DP:PL 0:52:99:51:0,1800 gi|194447306|ref|NC_011083.1| 25660 0 T <NON_REF> 0 0 END=25663 GT:DP:GQ:MIN_DP:PL 0:55:0:54:0,0 gi|194447306|ref|NC_011083.1| 25664 0 C <NON_REF> 0 0 END=25664 GT:DP:GQ:MIN_DP:PL 0:62:99:62:0,1800 gi|194447306|ref|NC_011083.1| 25665 0 C <NON_REF> 0 0 END=25666 GT:DP:GQ:MIN_DP:PL 0:63:0:62:0,0 gi|194447306|ref|NC_011083.1| 25667 0 G <NON_REF> 0 0 END=25667 GT:DP:GQ:MIN_DP:PL 0:66:99:66:0,1800 gi|194447306|ref|NC_011083.1| 25668 0 C <NON_REF> 0 0 END=25669 GT:DP:GQ:MIN_DP:PL 0:67:0:66:0,0 gi|194447306|ref|NC_011083.1| 25670 0 T <NON_REF> 0 0 END=25670 GT:DP:GQ:MIN_DP:PL 0:67:99:67:0,1800 gi|194447306|ref|NC_011083.1| 25671 0 A <NON_REF> 0 0 END=25677 GT:DP:GQ:MIN_DP:PL 0:73:0:70:0,0 gi|194447306|ref|NC_011083.1| 25678 0 C <NON_REF> 0 0 END=25680 GT:DP:GQ:MIN_DP:PL 0:75:99:74:0,1800 gi|194447306|ref|NC_011083.1| 25681 0 T <NON_REF> 0 0 END=25683 GT:DP:GQ:MIN_DP:PL 0:77:0:76:0,0 gi|194447306|ref|NC_011083.1| 25684 0 C <NON_REF> 0 0 END=63392 GT:DP:GQ:MIN_DP:PL 0:164:99:85:0,1800 gi|194447306|ref|NC_011083.1| 63393 0 C A,<NON_REF> 6792 0 DP=168;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=604800.00 GT:AD:DP:GQ:PL:SB 1:0,168,0:168:99:6822,0,6822:0,0,60,108

    I've uploaded the picture here, and I'll try embedding it again

    I haven't changed any of the igv settings, so the blue/red reads should be what they are by default.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Is this microbial data?

    Does this pattern also occur if you run with the previous version, 3.4-46? Some changes were made to the haploid math in 3.5 and I/m wondering if they might be causing this.

  • Yes, this is microbial (haploid) data. I'll run the analysis again using 3.4-46 and upload the results, but it will take several hours with the additional flags Sheila requested.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    You could do a pass without the extra diagnostics flags -- the final result will tell us whether the same thing is happening or not.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Redmar_van_den_Berg
    Hi,

    After discussing with the team, I think it would be best for you to submit a bug report, if you can. Instructions are here.

    Thanks,
    Sheila

  • edited January 2016

    I have uploaded the bug report as discussion_6835.tar.gz, thanks for looking into this.

    Edit: typo

    Issue · Github
    by Sheila

    Issue Number
    527
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @Redmar_van_den_Berg,

    Based on the latest findings in your case as well as @chlangley's (in the other thread where you have commented as well) we think that the evidence points to the presence of a large insertion / structural variant in the problem region. I'm not sure why in your case this manifests as alternating GQ 0/ QG 99 but in any case it seems this is beyond HC's ability to resolve, so we have to classify this as a known limitation for now. The devs are going to think of ways to put in some logic to detect this sort of event, but unfortunately at this time there's no way to improve those calls.

    I'd like to write up a short case study article for the docs to document how to diagnose this sort of problem. Do you mind if we use IGV screenshots that were generated during the debugging process in the document? We can redact file names from the screenshot if it helps.

  • hi @Geraldine_VdAuwera,
    No problem, as long as filenames/sample names are redacted, just to be safe ;)

  • @Sheila

    Thank you for looking into this, I'll use -dontUseSoftClippedBases for now

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Thanks!

    Regarding -dontUseSoftClippedBases, be aware that using it over all of your data would reduce HC's ability to discover indels. You may want to consider either using the results as they are, knowing why some of the GQs are zero (which do correspond to something going on in the region that we're unsure about) or applying that argument only to regions like this.

  • buddejbuddej St. LouisMember

    @Sheila said:
    Hi,

    The entire region is full of soft-clips. I'm not sure why the GQ would alternate between 0 and 99, but if you add -dontUseSoftClippedBases to your command, the alternating GQ's don't occur (they all become GQ=0).

    Please note, those sites do not show up in the final VCF, they just show up in the GVCF. I will make a note of this and see what the team says.

    -Sheila

    I've noticed a similar phenomenon in some recent .g.vcfs we created. Mostly the GQ=0 is in / near an indel, which I understand is a known issue ( http://gatkforums.broadinstitute.org/gatk/discussion/6451/curious-zero-quality-reference-calls-around-indels ). However, these calls (all 0/0) are ending up in my final VCF after running GGVCFs. The site is included due to some samples having good data in this region. Any idea why the calls with GQ=0 are being included? Is this a normal behavior? We're not sure whether we can trust these calls or not, since the quality appears to be very low (usually 0, sometimes 7, 8, 10, etc.)

    -John

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @buddej, could you post a few example records so we can get a more concrete idea of what these sites look like?

  • buddejbuddej St. LouisMember

    Of course. Here's 1 particular site from a single .g.vcf from an HC run (using --emitRefConfidence GVCF:

    #CHROM  POS      ID  REF  ALT        QUAL  FILTER     INFO        FORMAT              Sample1
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:30:0:30:0,0,503
    

    And there's the same position from an additional 12 .g.vcfs, all different samples (headerless)

    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:37:0:37:0,0,658
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:22:11:22:0,11,456
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:37:0:37:0,0,550
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:22:0:22:0,0,377
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:36:0:36:0,0,514
    4       962081   .   A    T,<NON_REF>  166.77 .       BaseQRankSum=-0.527;ClippingRankSum=-1.663;DP=32;MLEAC=1,0;MLEAF=0.500,0.00;MQ=59.29;MQRankSum=0.041;ReadPosRankSum=-2.798 GT:AD:DP:GQ:PL:SB  0/1:19,4,0:23:99:195,0,3688,252,3706,3958:4,15,0,4
    4       962081   .   A    C,<NON_REF>  145.77 .       BaseQRankSum=1.083;ClippingRankSum=-0.938;DP=19;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-0.505;ReadPosRankSum=-2.526 GT:AD:DP:GQ:PL:SB  0/1:12,3,0:15:99:174,0,2328,210,2343,2553:4,8,0,3
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:31:0:31:0,0,633
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:39:0:39:0,0,472
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:23:0:23:0,0,301
    4       962081   .   A    G,<NON_REF>  0      .       BaseQRankSum=1.123;ClippingRankSum=-1.327;DP=19;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQRankSum=-1.327;ReadPosRankSum=-1.327 GT:AD:DP:GQ:PL:SB 0/0:16,1,0:17:6:0,6,2167,48,2170,2212:3,13,0,1
    4       962081   .   A    <NON_REF>    .      .       END=962081  GT:DP:GQ:MIN_DP:PL  0/0:18:0:18:0,0,97
    

    As you can see, a lot of 0/0 ref calls with GQ=0, though one (line 3) has GQ=11. The samples which carry the variant look fine, GQ=99.
    Here is the .vcf that results when I run GGVCFs (transposed for clarity):

    #CHROM             4
    POS                962081
    ID                 .
    REF                A
    ALT                C
    QUAL               69.19
    FILTER             PASS
    INFO               AC=1;AF=0.038;AN=26;BaseQRankSum=1.16;ClippingRankSum=-1.448e+00;DP=364;FS=0.000;InbreedingCoeff=-0.3384;MLEAC=4;MLEAF=0.154;MQ=60.00;MQRankSum=-2.900e-01;QD=5.77;ReadPosRankSum=-1.738e+00;SOR=1.371;VQSLOD=1.43;culprit=QD
    FORMAT             GT:AD:DP:GQ:PGT:PID:PL
    Sample1            0/0:30,0:30:0:.:.:0,0,503
    Sample2            0/0:37,0:37:0:.:.:0,0,658
    Sample3            0/0:22,0:22:11:.:.:0,11,456
    Sample4            0/0:37,0:37:0:.:.:0,0,550
    Sample5            0/0:22,0:22:0:.:.:0,0,374
    Sample6            0/0:33,0:33:0:.:.:0,0,514
    Sample7            0/0:35,0:35:0:.:.:0,0,394
    Sample8            0/1:11,1:12:93:0|1:962053_G_C:93,0,453
    Sample9            0/0:30,0:30:0:.:.:0,0,633
    Sample10           0/0:39,0:39:0:.:.:0,0,472
    Sample11           0/0:22,0:22:0:.:.:0,0,301
    Sample12           0/0:23,0:23:0:.:.:0,0,473
    Sample13           0/0:18,0:18:0:.:.:0,0,97
    

    There is an indel nearby in this case. I'm not sure why some of the DP counts change for some samples. It's not as if GGVCFs can do further realignment without the .bam.

    I guess what my question boils down to is, can I trust these 0/0 calls? Is the GQ=0 an indication of a genotype that should definitely not be used, or more of a flag that this particular site should be checked closely, since the indel is potentially causing alignment issues. Happy to generate .bamout if that would be helpful. Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Oh I see, thanks for posting details. These indicate that the caller saw something going on at those sites but wasn't able to decide either way -- so you shouldn't trust those genotypes. You can decide to look more closely at the sites if they are of interest, but you can't treat them as hom ref based on those calls alone.
  • buddejbuddej St. LouisMember

    Thanks @Geraldine_VdAuwera, that's good to know. But there's no way to get GGVCFs to not include these genotypes in the final VCF? I'd rather have these marked as ./. if the genotypes can't be trusted.

    Is there a tool in GATK that will remove individual genotypes based on GQ? SelectVariants only filters whole variant sites, as I understand. Something like what vcftools --minGQ or bcftools filter -i 'GQ > x' --set-GTs '.' does?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @buddej
    Hi,

    If you use the latest version of GATK, the GQ0 genotypes will be set to ./.

    -Sheila

  • buddejbuddej St. LouisMember
    edited July 2016

    @Sheila

    By latest vesion, do you mean the nightly build? I'm using the main release 3.6-0-g89b7209 (for HC and GGVCFs), and these 0/0 calls are appearing in the final .vcf. Commands are below -- am I missing something obvious here?

    ${JAVA} ${JAVAOPTS} -jar ${GATK360} -T HaplotypeCaller \
      -R /data/GATK_pipeline_files/human_g1k_v37_decoy.fasta \
      -I Sample1.bam \
      --emitRefConfidence GVCF \
      --dbsnp /data/GATK_pipeline_files/dbsnp_138.b37.vcf \
      -L 4:961500-962500 \
      -o Sample1.g.vcf \
      -bamout Sample1-bamout.bam
    
    ${JAVA} ${JAVAOPTS} -jar ${GATK360} -T GenotypeGVCFs \
      -R /data/GATK_pipeline_files/human_g1k_v37_decoy.fasta \
      -V samples.list \
      -L 4:961500-962500 \
      -o 13lines.vcf
    

    I tried a recent nightly build as well, and it produces the same results.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @buddej
    Hi John,

    Ah, I see. It looks like a rounding issue. Can you submit a bug report? Instructions are here. Please include some BAM file snippets that I can run HaplotypeCaller and GenotypeGVCFs on and get GQ0 sites that have a proper genotype. Those should be no-call sites.

    Thanks,
    Sheila

Sign In or Register to comment.