To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Is new MQ jittering functionality recommended?

The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this:
1) Is the above correct, i.e. the new MQ jittering functionality is only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero?
2) Is the use of MQCapForLogitJitterTransform recommended?
3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that it seems that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.

Best Answer

Answers

  • Hi Bertrand, many thanks for the very clear and full answers! I have a few less pressing, somewhat more philosopical follow-on questions:
    4) The new jitter feature is only applied to MQ, and not any other annotation variables, is that right? Do you think there might be value in adding this to other annotation variables? For example, I am seeing lots of variants with FS==0.0 and lots with missing values for MQRankSum and ReadPosRankSum (I have haploid data, some mixed genomes, some clonal). Perhaps setting the missing values to 0.0 then adding random jitter might make these annotations useable for me?
    5) Likewise, do you think there might be value in logit transforming other annotation variables, e.g. FS doesn't look very Gaussian in my data.
    6) Given your answers to 2 and 3, should we expect these to be included in future versions of the best practices?

  • bertrandbertrand 301 BinneyMember

    Hi Richard,

    4.a) The jitter to MQ is new but jitter to SOR, InbreedingCoeff, FS, and HaplotypeScore have been set in previous release and remain at the same value (0.01 * Normal(0,1)). This alleviates some of the problem having a lot of variants with one specific value, but doe snot do too much to the whole distribution.
    4.b) Rank-sum stats like MQRankSum and ReadPosRankSum should already have, from their definition, a Gaussian looking distribution, so there is no point in adding jitter. However, they need a minimum amount of samples (10 I believe) in order to be output, otherwise you get a NA. So if you have mixed data as you mention, you might have a few NA for those rank-sums. There are many different ways of dealing with missing values. The easiest way is to ignore them (when possible). The theoretically correct way would be to impute them but that becomes a bit more involved (see wikipedia on "missing data" for example). What you are suggesting seems a practical hack somewhat in-between: Compute mean and variance on non-missing data (which should be about 0 and 1 for rank-sum statistics--normalized U-statistic from the Mann-Whitney rank-sum test), and fill in missing values with random values from a Normal(0,1). It might work but it depends on your data and your analysis: (a) There might be some bias in the missing/non-missing parts of your data. (b) Your analysis might or might not be sensitive to that.
    5) We are working now on a better VQSR algorithm altogether involving appropriate transforms that make these annotations look more Gaussian, just as you mention. So stay tuned! :-)
    6) Absolutely.

Sign In or Register to comment.