MQ and Multisample calling from GVCFs

Dear GATK team,

I'm puzzled by the MQ distribution coming out of our multisample calling.
Our procedure is:

  • We start from a set of GVCF files created with GATK 3.5 HaplotypeCaller in BP_RESOLUTION mode for ~70 samples
  • We combine them with CombineGVCFs (GATK 3.5)
  • We call them with GenotypeGVCFs (GATK 3.5 first, GATK 3.8 now)

With GATK 3.5 we had an odd MQ distribution (heavily underestimated values), but this was apparently reported as a known bug.
Then we updated to GATK 3.8. The MQ distribution for MQ<60 now looks normal, but ~10% of the positions have MQ>60 (with values up to ~700).
If it helps, I noticed that some of these ultra-high scores come from positions at which RAW_MQ is not specified in any of the samples' GVCFs. In general, though, they correspond to variants with high MQ (~60).
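
For reference, here is a minimal sketch of how the final MQ relates to RAW_MQ. It assumes, as I understand the RMS mapping quality annotation, that RAW_MQ stores the sum of squared mapping qualities and that MQ is the square root of RAW_MQ divided by a read count. The function name and the numbers are hypothetical, and the depth mismatch is only one possible way values above 60 could arise, not a confirmed cause.

```python
import math


def rms_mapping_quality(raw_mq: float, depth: int) -> float:
    """Root-mean-square MQ from a sum of squared MQs and a read count."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return math.sqrt(raw_mq / depth)


# Ten reads, all with mapping quality 60 (hypothetical data).
read_mqs = [60] * 10
raw_mq = sum(q * q for q in read_mqs)

# Denominator matches the reads that contributed to raw_mq.
consistent = rms_mapping_quality(raw_mq, depth=len(read_mqs))

# Hypothetical mismatch: the same raw_mq divided by a smaller read count.
mismatched = rms_mapping_quality(raw_mq, depth=2)

print(f"consistent depth: MQ = {consistent:.2f}")
print(f"mismatched depth: MQ = {mismatched:.2f}")
```

If the sum and the read count come from the same reads, the result cannot exceed the largest per-read mapping quality. A value above that maximum would suggest the numerator and denominator are out of step.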

Any explanation? How should I treat these MQ>60 values?

Thanks a lot for your support!
Riccardo

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @berutti
    Hi Riccardo,

    Can you post some example GVCF records where the RAW_MQ is not present? Can you also post the final VCF records for those sites?

    Thanks,
    Sheila

    P.S. Because development has stopped on GATK3, can you test this with the latest GATK4 beta as well?

  • kw8_klaudia_walter, Sanger (Member)

    Hi GATK team,
    We have also observed MQ values up to 600 with GATK 4.0.
    Thanks,
    Klaudia

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @kw8_klaudia_walter
    Hi Klaudia,

    Can you post some example records?

    Does this happen in a lot of sites or just a few?

    Thanks,
    Sheila

  • kw8_klaudia_walter, Sanger (Member)

    Hi Sheila,

    It happens in about 9% of bi-allelic SNP sites on chr20 after excluding the centromeric region, 13% of sites when including the centromere. Really high MQ seems to be linked to singletons or doubletons and to the centromeric region.

    Here are three examples.

    chr20 3752524 . T G 152.47 . AC=15;AF=0.051;AN=292;BaseQRankSum=-8.420e-01;ClippingRankSum=0.00;DP=2188;ExcessHet=22.9958;FS=45.542;InbreedingCoeff=-0.1568;MLEAC=12;MLEAF=0.041;MQ=131.78;MQRankSum=0.00;QD=1.18;ReadPosRankSum=-7.240e-01;SOR=0.717

    chr20 10477923 . A C 24.64 . AC=2;AF=6.897e-03;AN=290;DP=2397;ExcessHet=13.5204;FS=0.000;InbreedingCoeff=-0.0880;MLEAC=2;MLEAF=6.897e-03;MQ=406.94;QD=24.64;SOR=1.440

    chr20 51675523 . A C 41.44 . AC=1;AF=3.472e-03;AN=288;BaseQRankSum=-2.110e+00;ClippingRankSum=0.00;DP=2373;ExcessHet=3.0254;FS=122.296;InbreedingCoeff=-0.0093;MLEAC=1;MLEAF=3.472e-03;MQ=322.66;MQRankSum=0.00;QD=4.60;ReadPosRankSum=0.404;SOR=7.658
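
    A minimal sketch for flagging such sites, using the three records above as input. It assumes whitespace-separated VCF columns with INFO in the eighth column, and it takes 60 as the expected ceiling, which is an assumption based on the values discussed in this thread. The helper names are hypothetical.

```python
RECORDS = [
    "chr20 3752524 . T G 152.47 . AC=15;AF=0.051;AN=292;BaseQRankSum=-8.420e-01;ClippingRankSum=0.00;DP=2188;ExcessHet=22.9958;FS=45.542;InbreedingCoeff=-0.1568;MLEAC=12;MLEAF=0.041;MQ=131.78;MQRankSum=0.00;QD=1.18;ReadPosRankSum=-7.240e-01;SOR=0.717",
    "chr20 10477923 . A C 24.64 . AC=2;AF=6.897e-03;AN=290;DP=2397;ExcessHet=13.5204;FS=0.000;InbreedingCoeff=-0.0880;MLEAC=2;MLEAF=6.897e-03;MQ=406.94;QD=24.64;SOR=1.440",
    "chr20 51675523 . A C 41.44 . AC=1;AF=3.472e-03;AN=288;BaseQRankSum=-2.110e+00;ClippingRankSum=0.00;DP=2373;ExcessHet=3.0254;FS=122.296;InbreedingCoeff=-0.0093;MLEAC=1;MLEAF=3.472e-03;MQ=322.66;MQRankSum=0.00;QD=4.60;ReadPosRankSum=0.404;SOR=7.658",
]

MQ_CEILING = 60.0  # assumed maximum plausible MQ, per this thread


def parse_info(record: str) -> dict:
    """Split the INFO column (eighth field) into a key/value dict."""
    info_column = record.split()[7]
    info = {}
    for entry in info_column.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value  # flag-style entries get an empty string
    return info


def inflated_mq_sites(records, ceiling=MQ_CEILING):
    """Return (chrom, pos, MQ) for records whose MQ exceeds the ceiling."""
    flagged = []
    for record in records:
        chrom, pos = record.split()[:2]
        mq = parse_info(record).get("MQ")
        if mq is not None and float(mq) > ceiling:
            flagged.append((chrom, int(pos), float(mq)))
    return flagged


for chrom, pos, mq in inflated_mq_sites(RECORDS):
    print(f"{chrom}:{pos} MQ={mq}")
```

    For a full VCF file, header lines starting with "#" would need to be skipped first. A dedicated VCF parser is probably a better choice than string splitting.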

    Thanks,
    Klaudia

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @kw8_klaudia_walter
    Hi Klaudia,

    Interesting that the sites seem to be singletons or doubletons. Can you submit a bug report? Instructions are here.

    Thanks,
    Sheila

  • kw8_klaudia_walter, Sanger (Member)

    Hi Sheila,

    Sorry this has taken me so long. I uploaded the file snps_MQ_60_GATK4_GRCh38.tar.gz that contains a VCF file with 49,558 variants on chr20 with MQ>60 and BAM files with slices around two of those variants for 146 samples. We used GRCh38 for the mapping and GATK 4.0 for the calling. Please let me know if you need any more information.

    Thanks,
    Klaudia

    Issue · Github, by Sheila. Issue Number: 3088. State: open.

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @kw8_klaudia_walter
    Hi Klaudia,

    Thanks. I will have a look soon.

    -Sheila
