Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Is new MQ jittering functionality recommended?

The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this:
1) Is the above correct, i.e. the new MQ jittering functionality is only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero?
2) Is the use of MQCapForLogitJitterTransform recommended?
3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that it seems that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.

Best Answer


  • Hi Bertrand, many thanks for the very clear and full answers! I have a few less pressing, somewhat more philosopical follow-on questions:
    4) The new jitter feature is only applied to MQ, and not any other annotation variables, is that right? Do you think there might be value in adding this to other annotation variables? For example, I am seeing lots of variants with FS==0.0 and lots with missing values for MQRankSum and ReadPosRankSum (I have haploid data, some mixed genomes, some clonal). Perhaps setting the missing values to 0.0 then adding random jitter might make these annotations useable for me?
    5) Likewise, do you think there might be value in logit transforming other annotation variables, e.g. FS doesn't look very Gaussian in my data.
    6) Given your answers to 2 and 3, should we expect these to be included in future versions of the best practices?

  • bertrandbertrand 301 BinneyMember

    Hi Richard,

    4.a) The jitter to MQ is new but jitter to SOR, InbreedingCoeff, FS, and HaplotypeScore have been set in previous release and remain at the same value (0.01 * Normal(0,1)). This alleviates some of the problem having a lot of variants with one specific value, but doe snot do too much to the whole distribution.
    4.b) Rank-sum stats like MQRankSum and ReadPosRankSum should already have, from their definition, a Gaussian looking distribution, so there is no point in adding jitter. However, they need a minimum amount of samples (10 I believe) in order to be output, otherwise you get a NA. So if you have mixed data as you mention, you might have a few NA for those rank-sums. There are many different ways of dealing with missing values. The easiest way is to ignore them (when possible). The theoretically correct way would be to impute them but that becomes a bit more involved (see wikipedia on "missing data" for example). What you are suggesting seems a practical hack somewhat in-between: Compute mean and variance on non-missing data (which should be about 0 and 1 for rank-sum statistics--normalized U-statistic from the Mann-Whitney rank-sum test), and fill in missing values with random values from a Normal(0,1). It might work but it depends on your data and your analysis: (a) There might be some bias in the missing/non-missing parts of your data. (b) Your analysis might or might not be sensitive to that.
    5) We are working now on a better VQSR algorithm altogether involving appropriate transforms that make these annotations look more Gaussian, just as you mention. So stay tuned! :-)
    6) Absolutely.

Sign In or Register to comment.