We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

How to diagnose missing MQRankSum annotations (when BaseQRankSum is available)

init_jsinit_js Member
edited October 2019 in Ask the GATK team

We wish to discover short variants in a cohort of 60 plant whole-genome-samples. We're blocked on VariantRecalibrator.

We have a VCF truth set (aka resource) of SNPs which has been computed beforehand and hard-filtered. And we have a raw VCF for the 60 samples under study. This input VCF has been joint-called with HaplotypeCaller (GVCF) + GenomicsDBImport + GenotypeGVCFs over the whole genome. We computed a sites-only version of that input VCF and fed it to VariantRecalibrator. We configured HaplotypeCaller to produce allele-specific annotations (-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation) and GenotypeGVCFs as well (-G StandardAnnotation -G AS_StandardAnnotation).

We've configured VariantRecalibrator to build its SNP model based on the set of annotations: -an AS_QD -an MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_MQ -an AS_SOR -an DP. This was based on this allele-specific annotation and filtering article.

Unfortunately, both AS_MQRankSum, and MQRankSum annotations have variance 0 over our data, and prevent the model from being produced. Dropping the annotation is one option, but it's ill-advised afaik (see reference #1).

How do we diagnose this?

/gatk/gatk VariantRecalibrator --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx183296m' --tmp-dir <redacted> <redacted_list_of_input_vcf_files> --resource:GOLD,known=false,training=true,truth=true,prior=10.0 <redacted>/gold.snps.vcf.gz --mode SNP -an AS_QD -an MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_MQ -an AS_SOR -an DP --trust-all-polymorphic --truth-sensitivity-tranche 100.0 --truth-sensitivity-tranche 99.0 --truth-sensitivity-tranche 90.0 --truth-sensitivity-tranche 70.0 --truth-sensitivity-tranche 50.0 --max-gaussians 6 --rscript-file <redacted>  --tranches-file <redacted> -AS --output <redacted>/snp.recal.vcf.gz
22:16:29.941 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
...
A USER ERROR has occurred: Bad input: Found annotations with zero variance. They must be excluded before proceeding.
...
19:52:31.846 INFO  ProgressMeter - Traversal complete. Processed 2786933 total variants in 0.8 minutes.
19:52:31.959 INFO  VariantDataManager - AS_QD:   mean = 17.44    standard deviation = 8.75
19:52:32.173 INFO  VariantDataManager - MQRankSum:       mean = 0.00     standard deviation = 0.00
19:52:32.487 INFO  VariantDataManager - AS_ReadPosRankSum:       mean = 0.01     standard deviation = 0.84
19:52:32.797 INFO  VariantDataManager - AS_FS:   mean = 1.79     standard deviation = 3.03
19:52:32.962 INFO  VariantDataManager - AS_MQ:   mean = 60.00    standard deviation = 0.12
19:52:33.139 INFO  VariantDataManager - AS_SOR:          mean = 0.68     standard deviation = 0.25
19:52:33.335 INFO  VariantDataManager - DP:      mean = 1323.99  standard deviation = 945.30

(We initially tried with AS_MQRankSum, instead of MQRankSum, but it also had 0.00 variance)

Question 1: As we understand it, MQRankSum can only be computed on sites which are heterozygous reference (see reference #2). Generally speaking, and not for our particular dataset, do we need a good representation of such sites both in the truth "resource" sets and the input raw variant vcfs, or just in the input vcfs ?

Question 2: We've confirmed that our data (both the truth set and the input set) does have many het-ref sites with good read support for all alleles. One evidence of this (we think), is the fact that AS_ReadPosRankSum was calculated to have non-zero variance, as shown in the VariantRecalibrator output above. MQRankSum's variance couldn't be calculated, but both it and AS_MQRankSum have the same caveats in the documentation. What are cases where one annotation can be calculated, but not the other? e.g. Does this indicate that my mapping qualities are too uniform (in which case, the variance would be exactly 0.000)?

Question 3: If we dig in the dataset (i.e. in the sites-only VCF inputs), we see a lot of sites whose relevant annotations are a mix of "nul", and "0.000". At certain sites, AS_MQRankSum is there but not MQRankSum. Sometimes both of them are there. How should we interpret the different values? Is there anything "wrong" with that?

Ex: biallelic heterozygous-ref site. AS_MQRankSum and AS_ReadPosRankSum are "nul". MQRankSum and ReadPosRankSum are omitted altogether.

HanXRQChr01     16169   .       G       T       114.60  PASS    AC=2;AF=0.250;AN=8;AS_BaseQRankSum=nul;AS_FS=0.000;AS_MQ=
60.00;AS_MQRankSum=nul;AS_QD=30.82;AS_ReadPosRankSum=nul;AS_SOR=0.693;DP=6;ExcessHet=0.3218;FS=0.000;MLEAC=10;MLEAF=1.00;MQ=60.00;QD=29.27;SOR=0.693

Ex: biallelic het-ref site. ReadPosRankSum is non-zero. AS_MQRankSum is zero.

HanXRQChr01     17137   .       G       A       53.23   PASS AC=1;AF=0.010;AN=96;AS_BaseQRankSum=0.600;AS_FS=0.000;AS_InbreedingCoeff=-0.0476;AS_MQ=60.00;AS_MQRankSum=0.000;AS_QD=8.87;AS_ReadPosRankSum=0.800;AS_SOR=1.179;BaseQRankSum=0.623;DP=243;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.0476;MLEAC=1;MLEAF=0.010;MQ=60.00;MQRankSum=0.00;QD=8.87;ReadPosRankSum=0.842;SOR=1.179

Ex: site where one allele has nuls, and the other one has floats

HanXRQChr01     18154   .       G       C,A     8466.08 PASS    AC=25,1;AF=0.240,9.615e-03;AN=104;AS_BaseQRankSum=-0.550,nul;AS_FS=1.536,2.158;AS_InbreedingCoeff=0.8253,-0.0126;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=29.59,8.09;AS_ReadPosRankSum=0.900,nul;AS_SOR=0.400,0.223;BaseQRankSum=0.494;DP=813;ExcessHet=0.0000;FS=1.538;InbreedingCoeff=0.8833;MLEAC=26,1;MLEAF=0.250,9.615e-03;MQ=60.00;MQRankSum=0.00;QD=32.25;ReadPosRankSum=1.52;SOR=0.391

Relevant pages and comments I've found on the subject:
1. "MQRankSum is one of the core annotations that we recommend using, so I would recommend going to the trouble of finding out why it's not working." (https://gatkforums.broadinstitute.org/gatk/discussion/comment/9737/#Comment_9737 )
2. "The Rank Sum Tests require at least one individual to be heterozygous and have a mix of ref and alt reads" (https://gatkforums.broadinstitute.org/gatk/discussion/comment/33174/#Comment_33174 )
3. AS_ReadPosRankSum annotation documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_AS_ReadPosRankSumTest.php)
4. AS_MQRankSum annotation documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_AS_MappingQualityRankSumTest.php )
5. Hard-filtering recommendations (which talks about how the tests work, in particular MQRankSum and ReadPosRankSum): (https://software.broadinstitute.org/gatk/documentation/article.php?id=6925 )
6. Allele-specific annotation and filtering article. (https://gatkforums.broadinstitute.org/gatk/discussion/9622/allele-specific-annotation-and-filtering/)

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
  • init_jsinit_js Member
    edited October 2019

    Thank you for looking into this. That's too bad. Not supported because it's not human? We do have a curated training set in this instance. And we have more than 30 exomes in this particular run. It's also worked well for other runs in the past (they were not using allele-specific annotations however).

    We could skip the VQSR on this one, and follow the recommendations of the article you cite: do hard-filtering instead. The article recommends using this for SNPs:

    QD < 2.0
    MQ < 40.0
    FS > 60.0
    SOR > 3.0 
    MQRankSum < -12.5
    ReadPosRankSum < -8.0
    

    In the end we'd be in the same situation vis-a-vis MQRankSum, where a lot of the sites in the vcf to filter have "nul" there. In fact our vcfs have only three distinct values for MQRankSum and AS_MQRankSum: nul, 0.00, and 0.000. Not sure there's something significant about having 0.00 vs 0.000 -- perhaps numerical rounding of small values?

    Is there a test we could perform on the GVCFs to make sure no data was dropped when we performed the joint-call with GenotypeGVCFs. One hypothesis could be that some tool's not dealing well with the allele-specific annotations. This wasn't an issue in the past.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @init_js

    Can you please post a few vcf records with MQRankSum = null?

  • init_jsinit_js Member

    Clarification, MQRankSum itself is never nul, but it can be 0.00 or 0.000, or missing altogether. AS_MQRankSum, AS_BaseQRankSum, and AS_ReadPosRankSum can be nul, 0.00, or 0.000.

    A few records from our training resource vcf (AS_MQRankSum/AS_BaseQRankSum with nul):

    HanXRQChr02 27126396    .   A   C   75.76   PASS    AC=2;AF=0.071;AN=28;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.3992;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=33.73;AS_ReadPosRankSum=nul;AS_SOR=0.693;BaseQRankSum=-9.670e-01;DP=106;ExcessHet=0.0877;FS=0.000;InbreedingCoeff=0.3992;MLEAC=1;MLEAF=0.036;MQ=60.00;MQRankSum=0.00;QD=34.55;ReadPosRankSum=-2.820e-01;SOR=0.693
    
    HanXRQChr02 27815707    .   A   C,* 464.81  PASS    AC=3,2;AF=0.107,0.071;AN=28;AS_BaseQRankSum=0.000,nul;AS_FS=0.000,0.000;AS_InbreedingCoeff=0.3702,0.4776;AS_MQ=60.00,0.00;AS_MQRankSum=0.000,nul;AS_QD=33.15,29.07;AS_ReadPosRankSum=0.900,nul;AS_SOR=0.442,0.223;BaseQRankSum=0.00;DP=62;ExcessHet=0.0680;FS=0.000;InbreedingCoeff=0.5485;MLEAC=4,1;MLEAF=0.143,0.036;MQ=60.00;MQRankSum=0.00;QD=27.34;ReadPosRankSum=0.967;SOR=0.330
    
    HanXRQChr08 90900151    .   G   A,T 5884.79 PASS    AC=22,1;AF=0.786,0.036;AN=28;AS_BaseQRankSum=0.600,nul;AS_FS=2.144,1.249;AS_InbreedingCoeff=0.1024,-0.0460;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=29.77,16.33;AS_ReadPosRankSum=-1.000,nul;AS_SOR=0.543,0.970;BaseQRankSum=0.682;DP=207;ExcessHet=0.8894;FS=0.990;InbreedingCoeff=0.2499;MLEAC=23,1;MLEAF=0.821,0.036;MQ=60.00;MQRankSum=0.00;QD=30.18;ReadPosRankSum=-9.820e-01;SOR=0.521
    
    HanXRQChr08 90911842    .   G   *,A 719.81  PASS    AC=17,1;AF=0.567,0.033;AN=30;AS_BaseQRankSum=nul,-0.300;AS_FS=0.000,0.000;AS_InbreedingCoeff=0.3210,-0.0346;AS_MQ=0.00,60.00;AS_MQRankSum=nul,0.000;AS_QD=27.17,1.83;AS_ReadPosRankSum=nul,0.500;AS_SOR=0.283,0.646;BaseQRankSum=-2.820e-01;DP=486;ExcessHet=1.2216;FS=2.338;InbreedingCoeff=0.1663;MLEAC=17,1;MLEAF=0.567,0.033;MQ=60.00;MQRankSum=0.00;QD=1.80;ReadPosRankSum=0.516;SOR=0.571
    
    HanXRQChr08 90912151    .   C   G,A 9061.27 PASS    AC=16,1;AF=0.533,0.033;AN=30;AS_BaseQRankSum=0.700,nul;AS_FS=3.292,0.000;AS_InbreedingCoeff=-0.0715,-0.0345;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=17.92,1.25;AS_ReadPosRankSum=-0.800,nul;AS_SOR=0.687,0.684;BaseQRankSum=-3.810e-01;DP=507;ExcessHet=2.1177;FS=3.312;InbreedingCoeff=0.0497;MLEAC=16,1;MLEAF=0.533,0.033;MQ=60.00;MQRankSum=0.00;QD=19.91;ReadPosRankSum=-6.200e-01;SOR=0.672
    
    HanXRQChr08 90912513    .   G   T,A 747.76  PASS    AC=4,1;AF=0.133,0.033;AN=30;AS_BaseQRankSum=-0.100,nul;AS_FS=0.000,0.000;AS_InbreedingCoeff=0.2867,-0.0685;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=18.77,9.78;AS_ReadPosRankSum=0.300,nul;AS_SOR=0.784,0.602;BaseQRankSum=-7.200e-02;DP=138;ExcessHet=0.0485;FS=0.000;InbreedingCoeff=0.6127;MLEAC=4,1;MLEAF=0.133,0.033;MQ=60.00;MQRankSum=0.00;QD=20.77;ReadPosRankSum=0.316;SOR=0.797
    

    A few records with nuls from the sites-only VCFs being calibrated:

    HanXRQChr01     1031643 .       T       A,C     142.46  PASS    AC=2,2;AF=0.100,0.100;AN=20;AS_BaseQRankSum=nul,nul;AS_FS=0.000,0.000;AS_InbreedingCoeff=0.2598,0.3437;AS_MQ=60.00,60.00;AS_MQRankSum=nul,nul;AS_QD=31.26,27.76;AS_ReadPosRankSum=nul,nul;AS_SOR=0.693,0.693;DP=35;ExcessHet=0.1296;FS=0.000;InbreedingCoeff=0.5685;MLEAC=5,5;MLEAF=0.250,0.250;MQ=60.00;QD=29.09;SOR=1.179
    
    HanXRQChr01     1031648 .       C       T       493.75  PASS    AC=8;AF=0.308;AN=26;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.6084;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=26.08;AS_ReadPosRankSum=nul;AS_SOR=1.179;DP=36;ExcessHet=0.0010;FS=0.000;InbreedingCoeff=0.6084;MLEAC=22;MLEAF=0.846;MQ=60.00;QD=33.85;SOR=0.859
    
    HanXRQChr01     1031653 .       C       A       803.53  PASS    AC=14;AF=0.636;AN=22;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.5505;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=27.08;AS_ReadPosRankSum=nul;AS_SOR=1.329;DP=31;ExcessHet=0.0067;FS=0.000;InbreedingCoeff=0.5505;MLEAC=47;MLEAF=1.00;MQ=60.00;QD=25.42;SOR=1.112
    
    HanXRQChr01     1031697 .       T       C       43.41   PASS    AC=2;AF=0.056;AN=36;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.1534;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=21.70;AS_ReadPosRankSum=nul;AS_SOR=0.693;DP=43;ExcessHet=0.0625;FS=0.000;InbreedingCoeff=0.1534;MLEAC=4;MLEAF=0.111;MQ=60.00;QD=21.70;SOR=0.693
    
    HanXRQChr01     1031714 .       C       T       1106.82 PASS    AC=12;AF=0.375;AN=32;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.6960;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=30.55;AS_ReadPosRankSum=nul;AS_SOR=1.022;DP=57;ExcessHet=0.0001;FS=0.000;InbreedingCoeff=0.6960;MLEAC=32;MLEAF=1.00;MQ=60.00;QD=32.62;SOR=1.179
    
    HanXRQChr01     1031717 .       C       T       80.60   PASS    AC=2;AF=0.048;AN=42;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.1986;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=29.33;AS_ReadPosRankSum=nul;AS_SOR=0.693;DP=58;ExcessHet=0.0533;FS=0.000;InbreedingCoeff=0.1986;MLEAC=3;MLEAF=0.071;MQ=60.00;QD=35.36;SOR=0.693
    
    HanXRQChr01     1031762 .       C       A,T     1923.07 PASS    AC=18,1;AF=0.346,0.019;AN=52;AS_BaseQRankSum=0.000,nul;AS_FS=6.821,0.000;AS_InbreedingCoeff=0.4323,-0.0111;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=28.76,5.55;AS_ReadPosRankSum=0.900,nul;AS_SOR=2.200,0.527;BaseQRankSum=0.00;DP=173;ExcessHet=0.0014;FS=6.540;InbreedingCoeff=0.6045;MLEAC=32,2;MLEAF=0.615,0.038;MQ=60.00;MQRankSum=0.00;QD=29.94;ReadPosRankSum=-5.450e-01;SOR=1.767
    
    HanXRQChr01     1031874 .       A       T       255.06  PASS    AC=2;AF=0.027;AN=74;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.2967;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=28.34;AS_ReadPosRankSum=nul;AS_SOR=0.892;DP=239;ExcessHet=0.0298;FS=0.000;InbreedingCoeff=0.2967;MLEAC=3;MLEAF=0.041;MQ=60.00;QD=28.34;SOR=0.892
    
    HanXRQChr01     1031900 .       A       G       196.79  PASS    AC=2;AF=0.026;AN=76;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.3614;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=32.80;AS_ReadPosRankSum=nul;AS_SOR=0.693;DP=225;ExcessHet=0.0290;FS=0.000;InbreedingCoeff=0.3614;MLEAC=2;MLEAF=0.026;MQ=60.00;QD=32.80;SOR=0.693
    
    HanXRQChr01     1031901 .       G       C       182.34  PASS    AC=2;AF=0.026;AN=76;AS_BaseQRankSum=nul;AS_FS=0.000;AS_InbreedingCoeff=0.3319;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=29.41;AS_ReadPosRankSum=nul;AS_SOR=0.223;BaseQRankSum=-9.670e-01;DP=228;ExcessHet=0.0290;FS=0.000;InbreedingCoeff=0.3319;MLEAC=2;MLEAF=0.026;MQ=60.00;MQRankSum=0.00;QD=31.11;ReadPosRankSum=0.967;SOR=0.223
    
    HanXRQChr07     8527273 .       C       *,T     167.14  PASS    AC=2,3;AF=0.059,0.088;AN=34;AS_BaseQRankSum=nul,0.000;AS_FS=0.000,0.000;AS_InbreedingCoeff=0.2137,0.0804;AS_MQ=0.00,60.00;AS_MQRankSum=nul,0.000;AS_QD=18.50,20.78;AS_ReadPosRankSum=nul,-0.300;AS_SOR=0.693,0.693;BaseQRankSum=0.00;DP=38;ExcessHet=0.1296;FS=0.000;InbreedingCoeff=0.3564;MLEAC=4,8;MLEAF=0.118,0.235;MQ=60.00;MQRankSum=0.00;QD=15.19;ReadPosRankSum=-2.530e-01;SOR=0.941
    
    HanXRQChr07     8528291 .       G       C,T     3727.60 PASS    AC=48,3;AF=0.889,0.056;AN=54;AS_BaseQRankSum=0.000,nul;AS_FS=2.117,0.000;AS_InbreedingCoeff=0.1333,0.1586;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=27.81,29.68;AS_ReadPosRankSum=nul,nul;AS_SOR=0.448,0.368;BaseQRankSum=0.00;DP=103;ExcessHet=3.0103;FS=2.115;InbreedingCoeff=0.2239;MLEAC=88,5;MLEAF=1.00,0.093;MQ=60.00;MQRankSum=0.00;QD=30.21;SOR=0.494
    
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @init_js

    This issue might be fixed in GATK v4.1.4.0. Can you please upgrade and see if that fixes the issue.
    FYI this issue is related to this pull request: https://github.com/broadinstitute/gatk/pull/6079

Sign In or Register to comment.