Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

I get very different MQ values when using GVCF vs BP_RESOLUTION

slamentslament SwedenMember

Hello! I had a question about the difference between using HaplotypeCaller's --emitRefConfidence GVCF vs BP_RESOLUTION. Maybe the answer is obvious or in the forum somewhere already but I couldn't spot it...

First, some context: I'm working with GATK v. 3.5.0 in a haploid organism. I have 34 samples, from which 5 are very similar to the reference (they are backcrosses) while the rest are strains from a wild population. Originally I used --emitRefConfidence GVCF followed by GenotypeGVCF. While checking the output VCF file, I realized that the five backcrosses had a much lower DP in average than the other samples (but this doesn't make sense due to difference in reads numbers or anything like that, since they were run in the same lane, etc). I assume this happened because there are long tracks without any variant compare to the reference in those samples, and the GVCF blocks end up assigning a lower depth for a great amount of sites in those samples compare to the much more polymorphic ones. In any case, I figured I could just get all sites using BP_RESOLUTION so to obtain the "true" DP values per site. However, when I tried to do that, the resulting VCF file had very low MQ values! Can you explain why this happened?

This is the original file with --emitRefConfidence GVCF:

$ bcftools view -H 34snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   309.4   .   AC=4;AF=0.235;AN=17;DP=582;FS=0;MLEAC=4;MLEAF=0.235;MQ=40;QD=34.24;SOR=2.303
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=603;FS=0;MLEAC=2;MLEAF=0.065;MQ=44.44;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-0.762;ClippingRankSum=0.762;DP=660;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=29.53;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

And this is with --emitRefConfidence BP_RESOLUTION:

$ bcftools view -H 34allgenome_snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   307.28  .   AC=4;AF=0.211;AN=19;DP=602;FS=0;MLEAC=4;MLEAF=0.211;MQ=8.23;QD=34.24;SOR=2.204
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=750;FS=0;MLEAC=2;MLEAF=0.065;MQ=5.53;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-1.179;ClippingRankSum=0.762;DP=796;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=4.8;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

I find it particularly strange since the mapping quality of the backcrosses should in fact be slightly better in average (around 59 for the original BAM file) than the other more polymorphic samples (around 58)...

Thank you very much!

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @slament
    Hi,

    We are assuming the examples above are from the final VCFs. Can you please post the intermediate records from the GVCFs?

    Thanks,
    Sheila

  • slamentslament SwedenMember

    Hi Sheila,

    Indeed I posted the entries in the final VCF files. Here I put a few entries (different individuals) for the positions above in g.vcf files (I tried to pick variable and non variable individuals):

    Some sites don't appear in the GVCF mode, but I put the GVCFblock just in case...(I don't know if that is useful).

    GVCF

    Sample 130p
    -   (##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive))
    Sample 1m 
    chromosome_1    57  .   A   <NON_REF>   .   .   END=57  GT:DP:GQ:MIN_DP:PL  0:6:0:6:0,0
    Sample 21m 
    chromosome_1    57  .   A   G,<NON_REF> 13.22   .   DP=1;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=1600.00   GT:AD:DP:GQ:PL:SB   1:0,1,0:1:43:43,0,43:0,0,1,0
    Sample 87m
    chromosome_1    57  .   A   G,<NON_REF> 51  .   DP=2;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=3200.00   GT:AD:DP:GQ:PL:SB   1:0,2,0:2:81:81,0,81:0,0,2,0
    Sample 63p
    chromosome_1    57  .   A   G,<NON_REF> 0   .   DP=5;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=9873.00   GT:AD:DP:GQ:PL:SB   0:2,0,0:2:90:0,90,90:0,2,0,0
    Sample 63m
    chromosome_1    57  .   A   G,<NON_REF> 0   .   DP=7;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=13104.00  GT:AD:DP:GQ:PL:SB   0:1,0,0:1:45:0,45,45:0,1,0,0
    
    ----
    
    Sample 130p
    -   (##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive))
    Sample 21m
    -   (##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive))
    Sample 87m
    -   (##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive))
    Sample 63p
    chromosome_1    81  .   G   A,<NON_REF> 60  .   DP=5;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=9873.00   GT:AD:DP:GQ:PL:SB   1:0,2,0:2:90:90,0,90:0,0,0,2
    Sample 63m
    chromosome_1    81  .   G   A,<NON_REF> 15.14   .   DP=7;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=13104.00  GT:AD:DP:GQ:PL:SB   1:0,1,0:1:45:45,0,45:0,0,0,1
    
    ----
    
    Sample 130p
    -   (##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive))
    Sample 21m
    chromosome_1    87  .   A   ACGG,<NON_REF>  140.97  .   DP=4;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=6728.00   GT:AD:DP:GQ:PL:SB   1:0,4,0:4:99:180,0,180:0,0,4,0
    Sample 87m
    chromosome_1    87  .   A   ACGG,<NON_REF>  50.97   .   DP=5;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=5875.00   GT:AD:DP:GQ:PL:SB   1:0,2,0:2:90:90,0,90:0,0,2,0
    Sample 63p
    -   (##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive))
    Sample 100p
    chromosome_1    87  .   A   ACGG,<NON_REF>  50.97   .   DP=4;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=4441.00   GT:AD:DP:GQ:PL:SB   1:0,2,0:2:90:90,0,90:0,0,2,0
    

    BP_RESOLUTION

    Sample 130p
    chromosome_1    57  .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0:48,4:52:99:0,1800
    Sample 1m
    chromosome_1    57  .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0:0,6:6:0:0,0
    Sample 21m
    chromosome_1    57  .   A   G,<NON_REF> 13.22   .   DP=1;MLEAC=1,0;MLEAF=1,0;RAW_MQ=1600    GT:AD:DP:GQ:PL:SB   1:0,1,0:1:43:43,0,43:0,0,1,
    Sample 87m
    chromosome_1    57  .   A   G,<NON_REF> 51  .   DP=2;MLEAC=1,0;MLEAF=1,0;RAW_MQ=3200    GT:AD:DP:GQ:PL:SB   1:0,2,0:2:81:81,0,81:0,0,2,0
    Sample 63p
    chromosome_1    57  .   A   G,<NON_REF> 0   .   DP=5;MLEAC=0,0;MLEAF=0,0;RAW_MQ=9873    GT:AD:DP:GQ:PL:SB   0:2,0,0:2:90:0,90,90:0,2,0,0
    Sample 63m
    chromosome_1    57  .   A   G,<NON_REF> 0   .   DP=7;MLEAC=0,0;MLEAF=0,0;RAW_MQ=13104   GT:AD:DP:GQ:PL:SB   0:1,0,0:1:45:0,45,45:0,1,0,0
    
    ----
    
    Sample 130p
    chromosome_1    81  .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0:64,0:64:99:0,1800
    Sample 21m
    chromosome_1    81  .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0:5,0:5:99:0,204
    Sample 87m
    chromosome_1    81  .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0:5,2:7:99:0,135
    Sample 63p
    chromosome_1    81  .   G   A,<NON_REF> 60  .   DP=5;MLEAC=1,0;MLEAF=1,0;RAW_MQ=9873    GT:AD:DP:GQ:PL:SB   1:0,2,0:2:90:90,0,90:0,0,0,2
    Sample 63m
    chromosome_1    81  .   G   A,<NON_REF> 15.14   .   DP=7;MLEAC=1,0;MLEAF=1,0;RAW_MQ=13104   GT:AD:DP:GQ:PL:SB   1:0,1,0:1:45:45,0,45:0,0,0,1
    
    ------
    
    Sample 130p
    chromosome_1    87  .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0:65,0:65:99:0,1800
    Sample 21m
    chromosome_1    87  .   A   ACGG,<NON_REF>  140.97  .   DP=4;MLEAC=1,0;MLEAF=1,0;RAW_MQ=6728    GT:AD:DP:GQ:PL:SB   1:0,4,0:4:99:180,0,180:,0,4,0
    Sample 87m
    chromosome_1    87  .   A   ACGG,<NON_REF>  50.97   .   DP=5;MLEAC=1,0;MLEAF=1,0;RAW_MQ=5875    GT:AD:DP:GQ:PL:SB   1:0,2,0:2:90:90,0,90:00,2,0
    Sample 63p
    chromosome_1    87  .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0:17,0:17:99:0,630
    Sample 100p
    chromosome_1    87  .   A   ACGG,<NON_REF>  50.97   .   DP=4;MLEAC=1,0;MLEAF=1,0;RAW_MQ=4441    GT:AD:DP:GQ:PL:SB   1:0,2,0:2:90:90,0,90:0,,2,0
    
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @slament
    Hi,

    Thanks. It looks like something might be going on in GenotypeGVCFs, because the MQ values seem to match in the two different GVCFs. Can you submit some test files so we can debug locally? Instructions are here.

    Thanks,
    Sheila

  • slamentslament SwedenMember

    Hi there, sorry for the super late response!

    I uploaded a zip file called GATKbug_report_slament.zip to the server. The zip file contains a README where I try to explain a bit the content.

    At this point, I feel that the difference in MQ is largely due to the block system used in GVCF mode and a somewhat hidden assumption that the levels of heterozygosity (and sites) are similar among samples (i.e. this is maybe not a bug, really). In such case, could you say that the true MQ value is retrieved with BP_RESOLUTION mode?

    Thank you!

    Issue · Github
    by Sheila

    Issue Number
    811
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The GVCF block system relies on the Genotype Quality (GQ) to group together sites where we have the same level of confidence in the genotype assignment. Generally speaking we expect that consecutive sites where GQ is the same have similar properties in terms of the quality and quantity of read data. I suppose it's possible that in some cases there is more heterogeneity but that it ends up balancing out (e.g. more reads but with lower quality), which could explain more variance observed in BP_RESOLUTION than is captured in the GVCF block annotations.

    In principle the BP_RESOLUTION file should indeed give you the most accurate picture of the data. To satisfy yourself that this is indeed the case, you can look at the actual data in the bamouts at a few of these specific sites. This should tell you fairly quickly whether the annotations are consistent or not with the data.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @slament
    Hi!

    I am happy you managed to upload the files. I will look at the issue soon. In the meantime, you can try as Geraldine suggested.

    -Sheila

  • slamentslament SwedenMember

    Thank you for your responses! @Geraldine_VdAuwera I looked at a few sites. The bamouts are identical if I use --emitRefConfidence GVCF or if I use BP_RESOLUTION modes (as they should, I guess). The g.vcf files are the same between modes when the site is variable in all samples compared to the reference. When they are not, then there are differences in the quality metrics of the final vcf file. From what I see, using BP_RESOLUTION retrieves the correct DP values for each individual site, but the MQ value drops terribly. So like @Sheila mentioned some comments above, I think something happens during the genotyping.

    For example, I'm attaching the distributions (produce with the vcfR package) of DP and MQ for GVCF and BP_RESOLUTION for one of the chromosomes.

    GVCF mode
    image

    BP_RESOLUTION mode
    image

    They are dramatically different ... what could be happening? help :(

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The MQ distribution in the GVCF mode looks exactly as expected. I agree it looks like there's something wrong with the calculation when it's done in BP_RESOLUTION mode. Would you be able to share a snippet of data that reproduces the problem so we can debug locally?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Whoops, I just realized you already sent us test data, sorry. We'll try to look at this in the near future.

  • slamentslament SwedenMember

    Thank you @Geraldine_VdAuwera. The snippet I sent is actually very small so perhaps you cannot so obviously see the differences. I could upload a bigger snippet. Let me know if that would be useful!

  • slamentslament SwedenMember

    Hi again! I just tried the same commands with data from another organism, and the distributions recovered are almost identical between modes... There are two main differences between species: my species is haploid and a selfer; the new species is diploid and outcrosser. So I wonder if ploidy level, or long tracks of no variation are causing this MQ drop somehow...

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @slament
    Hi,

    I just tested out your files, and unfortunately, I could not find any difference between the RAW_MQ values from the GVCF and BP_RESOLUTION mode. Also, I was not able to find any difference between the MQ values in the final VCFs. I tested both the two different BAM files between positions 1-1000. I could not find a difference. If I looked at the wrong positions, please let me know.

    -Sheila

  • slamentslament SwedenMember

    Hi @Sheila,

    Thank you for checking the files! The RAW_MQ values are indeed the same between modes. However, I do see a difference in the final VCF file... I am a bit confused: one of the BAMs goes from position 100-400 in the provided reference fasta sequence and the other goes from 83500-84100. Can you clarify what do you mean by 1-1000? In the BAM from 100-400, the only site position that I see as different is 126. In the BAM 83500-84100 all variants have different MQ values.

    For example, the site 126 (BAM 100-400) in GVCF:

    chromosome_4_Slice_0_100000 126 .   A   T   3942.76 .   AC=1;AF=0.500;AN=2;BaseQRankSum=2.97;ClippingRankSum=-7.000e-03;DP=1778;FS=7.601;MLEAC=1;MLEAF=0.500;MQ=56.66;MQRankSum=-1.540e-01;QD=34.24;ReadPosRankSum=3.84;SOR=0.076   GT:AD:DP:GQ:PL  0:1611,0:1611:99:0,1800 1:6,102:108:99:3974,0
    

    And in BP_RESOLUTION:

    chromosome_4_Slice_0_100000 126 .   A   T   3942.76 .   AC=1;AF=0.500;AN=2;BaseQRankSum=2.97;ClippingRankSum=-7.000e-03;DP=1978;FS=7.601;MLEAC=1;MLEAF=0.500;MQ=16.46;MQRankSum=-1.540e-01;QD=34.24;ReadPosRankSum=3.84;SOR=0.076   GT:AD:DP:GQ:PL  0:1665,146:1811:99:0,1800   1:6,102:108:99:3974,0
    

    In one case I got MQ=56.66 and the other MQ=16.46.

    Did you re-run the commands? Could it be something wrong with they way I called GATK?

    Thank you again!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @slament
    Hi,

    Ah I see. I was running GenotypeGVCFs on the single samples. Now, I see the issue when running GenotypeGVCFs on the two samples together! I am about to put in a bug report. You can keep track of it here.

    -Sheila

  • slamentslament SwedenMember

    Good to know, thank you @Sheila !

  • slamentslament SwedenMember

    Hi @Sheila, I'm terribly sorry if I'm being a pain, but has there been any progress on this? The bug tracker seems a bit quite. Thank you in advance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @slament
    Hi,

    The developer knew exactly what happened when she saw the report! She has the fix in, but needs to add some tests. I will let you know soon when the fix is in the nightly build :)

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    No harm at all in pinging us about bugs. We realize that the bug tracker is more a static list at the moment, and there's not really any visibility into status or progress. We are thinking about some ways to increase the usefulness of the bug tracker by showing when progress is being made (i.e. when there is active discussion in the issue thread); the difficulty is that we need to do this without exposing the actual content of the discussion. This is because sometimes we need to be able to refer internally to other data or cases with confidentiality restrictions, so we treat all discussions as confidential to avoid any unintentional leaks of protected information.

    As for the time it's taking, to be honest we're currently using some of the bugs to train new members of the development team. This does lead to a slower pace of resolution in the short term, but in the long run it leads to a bigger pool of devs who understand those parts of the code. Something to look forward to :)

  • slamentslament SwedenMember

    Ohhh, that's great @Sheila and @Geraldine_VdAuwera! Thank you very much! I'll be waiting :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @slament The fix for this just sailed through code review without a hitch; we just need to update some test expectations and run final checks before merging it in. So it will probably be merged within a day or two, and will definitely be in the upcoming 3.6 release.

  • slamentslament SwedenMember

    Great news @Geraldine_VdAuwera, really looking forward! Thanks!

  • slamentslament SwedenMember

    It worked!!!!!

    Thank you so much @Sheila and @Geraldine_VdAuwera and the developers!

    Here is the MQ distribution before the fix:
    image

    And after the fix:
    image

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Glad to hear it, @slament; we'll convey your happiness to the dev who fixed it :)

    (that is a really big difference...)

  • vcaropre1vcaropre1 Member

    I saw the same type of differences in the MQ scores in my vcf file when the gvcfs are made using GVCF vs. BP_RESOLUTION in HaplotypeCaller and when calling three samples together. With the fix in the nightly build, they seem to have more similar outputs in MQ scores, but some of the scores seem to have extremely high values > 70. The histograms below are shown as:
    1. MQ scores in the vcf from GenotypeGVCFs when using HaplotypeCaller in GVCF mode in the 5/24 nightly build
    2. MQ scores in the vcf from GenotypeGVCFs when using HaplotypeCaller in BP_RESOLUTION mode in the 5/24 nightly build
    3. MQ scores in the vcf from GenotypeGVCFs when using HaplotypeCaller in GVCF mode in the stable release of GATK3.5

    <img src="http://i1168.photobucket.com/albums/r500/vitosays/5-26-2016%2011-47-38%20AM_zpsgypuwdoh.png">

    It does not seem a large portion of scores are past 80, but I just wasn’t sure what is going on with the scores since the fix.

    The following scatter plots also show the same information just another way:
    1. MQ scores from the vcf from the --emitRefConfidence BP_RESOLUTION mode vs. MQ from the --emitRefConfidence GVCF mode in the 5/24 nightly build
    2. MQ scores from the vcf from the --emitRefConfidence GVCF mode in the nightly build compared to the MQ scores from the stable release of GATK3.5 in the --emitRefConfidence GVCF mode

    <img src="http://i1168.photobucket.com/albums/r500/vitosays/5-26-2016%2011-44-53%20AM_zpshwmvswqo.png">

    An example of a few records with varying scores:
    The top two records are pulled from the vcf generated from 3.5 in --emitRefConfidence GVCF mode
    The bottom two records are pulled from the vcf generated from the nightly build in --emitRefConfidence GVCF mode

    <img src="http://i1168.photobucket.com/albums/r500/vitosays/5-26-2016%2012-00-51%20PM_zpsyjqqx6ry.png">

    Is there a reason why these scores have been inflated due to the fix? I was assuming the MQ scores generated from the stable build of 3.5 in GVCF mode should be the exact (or very close) to the scores generated by the nightly build in GVCF mode too. I also the post fix graph that @slament posted above has a very large shift on the X-axis towards the max and was curious if you had a few mapping qualities that fell into the >70 score.

    Just as another note the aligner used for this process was bwa mem 0.7.8.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @vcaropre1,

    That's definitely unexpected. Would you be able to share with us the gVCFs for the 3 samples at those positions (22:40060700 and 22:40060711), generated with the 5/24 nightly build? Instructions are here.

    A workaround for now would be to parse the AS_MQ (allele specific version of the annotation) because that looks like it's still behaving very well.

  • vcaropre1vcaropre1 Member

    I thought the AS_MQ seemed fine, but I have uploaded both sets of gvcfs (BP_RESOLUTION and GVCF mode) for the three samples in a file called "mq_gatk_bug.tar"

    Issue · Github
    by Sheila

    Issue Number
    945
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @vcaropre1
    Hi,

    I just had a look. I am about to submit a bug report. You can keep track of it here.

    -Sheila

    P.S. I should have it in by the end of the day :smile:

  • slamentslament SwedenMember

    Hi @Sheila and @vcaropre1, has there been any progress on this? :) Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @slament
    Hi,

    Sorry no progress yet. However, there is a developer assigned.

    I will update here when I learn more.

    -Sheila

  • KurtKurt Member ✭✭✭

    I noticed that it says that it has been closed above and the bug tracker entry has been cleared out even if the bug is still listed as open in the issue tracker

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Kurt, apologies for the confusion -- it's because we use two different trackers for forum tickets and for code change tickets. So the question ticket above is considered resolved because we have determined it is a bug that needs to be fixed, and a ticket has been entered in the development repo. We realize it's confusing, so we're looking into some options to make this more readable/trackable for the external world.

  • KurtKurt Member ✭✭✭

    @Geraldine_VdAuwera . No worries, thanks for the explanation :smiley:

  • slamentslament SwedenMember

    Hello @Geraldine_VdAuwera and @Sheila. I was wondering if you found a solution for the problem mentioned by @vcaropre1? Thank you!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @slament
    Hi,

    No, not yet. I will bring this up at our meeting tomorrow.

    -Sheila

  • KatieKatie United StatesMember

    Hi, I am also finding strange MQ scores using version 3.5-0-g36282e4 when I use the GenotypeGVCFs tool. Specifically, I find that for variants with low allele frequency, but high RAW_MQ on the per-sample gVCFs, this corresponds to very low MQ in the combined gVCF file.

    Does the latest version of GATK fix this?

    Thank you!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Katie
    Hi,

    Yes, the latest version contains a fix :smile:

    -Sheila

  • KurtKurt Member ✭✭✭

    @Sheila

    Hi,

    Do you mean that the latest version contains a fix for @Katie issue

    but @slament and @vcaropre1 issue is a different one that has not been fixed yet?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Kurt
    Hi,

    Yes, that is what I meant.

    @Katie
    To avoid confusion, I assumed you meant the original issue discussed in this thread (from @slament). If you note later in this thread, @vcaropre1 brings up a new issue which has not been fixed yet.

    -Sheila

  • KatieKatie United StatesMember

    Got it, thank you!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Katie @Kurt @slament @vcaropre1
    Hi all,

    All the issues in this thread should be fixed in the latest nightly build :smile:

    -Sheila

  • vcaropre1vcaropre1 Member

    Great @Sheila ! I'll give it a shot!

  • vcaropre1vcaropre1 Member
    edited October 2016

    Hi @Sheila ,

    So it looks like there is still a small subset of positions that have an oddly high MQ score (> 70), but overall a lot better than before (No more >200). Most of the oddly high MQ values tend to be the same between both versions of GATK.

    Fig.1 MQ Scores from the previous release that @slament and I used to test the original fix.
    image

    Fig. 2 MQ scores from the most recent nightly build
    image

    Fig. 3 The comparison of MQ scores from the vcf by gvcfs created in BP_RESOLUTION vs. GVCF. This was ran with the new nightly build to see if there was any difference in the mapping quality mechanism between the two modes.
    image

    Issue · Github
    by Sheila

    Issue Number
    1364
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @vcaropre1
    Hi,

    Hmm. I will check with the team and get back to you.

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @vcaropre1, sorry for the delayed response -- you caught us in the last hectic part of the workshop season. We've now been able to get back to this and take a good look and what's happening here. In a nutshell, this is a corner case due to a problematic interaction between how MQ is calculated and what information we retain in GVCF blocks. @Sheila will be in touch soon to request a bug report (with test data) if needed, and we'll try to get this fixed. However I don't think the fix will make it into the next release, which I'm hoping to cut by the end of the month. We'll keep you posted of any new developments. Thanks for reporting this!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @vcaropre1
    Hi,

    I don't think we need any test data. The data you provided earlier is enough. There may actually be a fix in the next release (fingers crossed!).

    -Sheila

  • slamentslament SwedenMember
    edited February 2017

    Hi everyone @Sheila @vcaropre1 @Geraldine_VdAuwera! So things look like they've improved, but they are not completely fixed, uh? I now have version 3.7.0 and I was wondering if this issue remains unsolved.

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    As our colleagues on a related project like to say, things are not as broken as they were before, so we must have improved :smiley:

    Sadly the ticket for the bug mentioned above has not yet been addressed. Because of the low perceived impact, we haven't been able to prioritize it.
  • slamentslament SwedenMember

    I see, thank you for your quick answer @Geraldine_VdAuwera!

  • ronlevineronlevine Lexington, MAMember

    @slament I ran the script you uploaded April, 2016 for the 100-400 interval using the 2017-04-05-g34bd8a3 nightly build and found there is no longer a difference between the GVCF and BP_RESOLUTION MQ values:
    GVCF

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Pa130p  PaWa53p
    chromosome_4_Slice_0_100000     126     .       A       T       3798.76 .       AC=1;AF=0.500;AN=2;BaseQRankSum=6.22;ClippingRankSum=0.00;DP=1781;FS=17.704;MLEAC=1;MLEAF=0.500;MQ=56.55;MQRankSum=4.16;QD=33.32;ReadPosRankSum=6.33;SOR=0.023  GT:AD:DP:GQ:PL  0:1609,0:1609:99:0,1800 1:12,102:114:99:3830,0
    chromosome_4_Slice_0_100000     326     .       A       G       353.77  .       AC=1;AF=1.00;AN=1;BaseQRankSum=-6.820e-01;ClippingRankSum=0.00;DP=1573;FS=10.122;MLEAC=1;MLEAF=1.00;MQ=23.32;MQRankSum=-1.001e+01;QD=1.43;ReadPosRankSum=-2.121e+00;SOR=1.141   GT:AD:DP:GQ:PL  .:0,0   1:123,125:248:99:382,0
    chromosome_4_Slice_0_100000     334     .       T       C       3817.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-3.180e-01;ClippingRankSum=0.00;DP=589;FS=4.071;MLEAC=2;MLEAF=1.00;MQ=55.17;MQRankSum=-7.448e+00;QD=6.79;ReadPosRankSum=-4.770e-01;SOR=0.987     GT:AD:DP:GQ:PL  1:114,189:303:99:3213,0 1:123,136:259:99:633,0
    chromosome_4_Slice_0_100000     335     .       A       C       2221.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-1.800e+00;ClippingRankSum=0.00;DP=594;FS=3.266;MLEAC=2;MLEAF=1.00;MQ=55.18;MQRankSum=-6.511e+00;QD=4.11;ReadPosRankSum=-3.080e-01;SOR=0.961     GT:AD:DP:GQ:PL  1:114,162:303:99:2040,0 1:129,135:264:99:210,0
    chromosome_4_Slice_0_100000     339     .       T       C       6721.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=0.724;ClippingRankSum=0.00;DP=599;FS=6.191;MLEAC=2;MLEAF=1.00;MQ=55.06;MQRankSum=-8.432e+00;QD=11.75;ReadPosRankSum=0.100;SOR=1.054      GT:AD:DP:GQ:PL  1:111,189:300:99:3735,0 1:109,163:272:99:3015,0
    chromosome_4_Slice_0_100000     365     .       G       A       595.77  .       AC=2;AF=1.00;AN=2;BaseQRankSum=-8.140e-01;ClippingRankSum=0.00;DP=633;FS=1.131;MLEAC=2;MLEAF=1.00;MQ=55.68;MQRankSum=-7.463e+00;QD=1.11;ReadPosRankSum=3.27;SOR=0.810   GT:AD:DP:GQ:PL  1:117,128:297:99:472,0  1:140,152:335:99:152,0
    chromosome_4_Slice_0_100000     376     .       A       G       4773.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-2.560e+00;ClippingRankSum=0.00;DP=646;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=55.87;MQRankSum=-7.996e+00;QD=7.39;ReadPosRankSum=3.79;SOR=0.689   GT:AD:DP:GQ:PL  1:117,173:290:99:2471,0 1:152,204:356:99:2331,0
    

    BP_RESOLUTION

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Pa130p  PaWa53p
    chromosome_4_Slice_0_100000     126     .       A       T       3798.76 .       AC=1;AF=0.500;AN=2;BaseQRankSum=6.22;ClippingRankSum=0.00;DP=1980;FS=17.704;MLEAC=1;MLEAF=0.500;MQ=56.55;MQRankSum=4.16;QD=33.32;ReadPosRankSum=6.33;SOR=0.023  GT:AD:DP:GQ:PL  0:1662,146:1808:99:0,1800       1:12,102:114:99:3830,0
    chromosome_4_Slice_0_100000     326     .       A       G       353.77  .       AC=1;AF=1.00;AN=1;BaseQRankSum=-6.820e-01;ClippingRankSum=0.00;DP=1573;FS=10.122;MLEAC=1;MLEAF=1.00;MQ=23.32;MQRankSum=-1.001e+01;QD=1.43;ReadPosRankSum=-2.121e+00;SOR=1.141   GT:AD:DP:GQ:PL  .:0,0   1:123,125:248:99:382,0
    chromosome_4_Slice_0_100000     334     .       T       C       3817.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-3.180e-01;ClippingRankSum=0.00;DP=589;FS=4.071;MLEAC=2;MLEAF=1.00;MQ=55.17;MQRankSum=-7.448e+00;QD=6.79;ReadPosRankSum=-4.770e-01;SOR=0.987     GT:AD:DP:GQ:PL  1:114,189:303:99:3213,0 1:123,136:259:99:633,0
    chromosome_4_Slice_0_100000     335     .       A       C       2221.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-1.800e+00;ClippingRankSum=0.00;DP=594;FS=3.266;MLEAC=2;MLEAF=1.00;MQ=55.18;MQRankSum=-6.511e+00;QD=4.11;ReadPosRankSum=-3.080e-01;SOR=0.961     GT:AD:DP:GQ:PL  1:114,162:303:99:2040,0 1:129,135:264:99:210,0
    chromosome_4_Slice_0_100000     339     .       T       C       6721.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=0.724;ClippingRankSum=0.00;DP=599;FS=6.191;MLEAC=2;MLEAF=1.00;MQ=55.06;MQRankSum=-8.432e+00;QD=11.75;ReadPosRankSum=0.100;SOR=1.054      GT:AD:DP:GQ:PL  1:111,189:300:99:3735,0 1:109,163:272:99:3015,0
    chromosome_4_Slice_0_100000     365     .       G       A       595.77  .       AC=2;AF=1.00;AN=2;BaseQRankSum=-8.140e-01;ClippingRankSum=0.00;DP=633;FS=1.131;MLEAC=2;MLEAF=1.00;MQ=55.68;MQRankSum=-7.463e+00;QD=1.11;ReadPosRankSum=3.27;SOR=0.810   GT:AD:DP:GQ:PL  1:117,128:297:99:472,0  1:140,152:335:99:152,0
    chromosome_4_Slice_0_100000     376     .       A       G       4773.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-2.560e+00;ClippingRankSum=0.00;DP=646;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=55.87;MQRankSum=-7.996e+00;QD=7.39;ReadPosRankSum=3.79;SOR=0.689   GT:AD:DP:GQ:PL  1:117,173:290:99:2471,0 1:152,204:356:99:2331,0
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Since it looks like this got magically fixed, we're going to close the issue ticket, but if anyone encounters this problem again let us know in this thread.

Sign In or Register to comment.