Strange VQSR plots after version 3.5 release

I ran GATK's cohort genotyping pipeline on 5000 human samples with Illumina WGS ~1.3x data, up through GenotypeGVCFs (and CatVariants to combine chunks) using v3.4-46. Next I ran VariantRecalibrator (initially just chr1) using recommended settings with both v3.4-46 and v3.5. Here is my command for both versions:

java -Xmx40g \
-jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R hs37m.fa \
-input gatk.hc.combined.genotyped.chr1-22.vcf.gz \
-recalFile snps.recal \
-tranchesFile snps.tranches \
-rscriptFile recalibrate_SNP_plots.R \
--target_titv 2.15 \
-nt 24 \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
-mode SNP \
-L 1 \
-tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 98.5 -tranche 90.0 \
--maxGaussians 6 \
-log VariantRecalibrator.snps.log

The attached tranches plots (snps.tranches.v3.5.pdf) generated w/ v3.5 look strange because:

1) The tranches are out of order on the bar plot (e.g., 99.5 is before 99)
2) The fill coloring doesn't make sense for tranches 99 and 98.5 - there are orange stripes over the blue bar
3) The scatter plot's connecting lines go in both directions

The plots for v3.4-46 look more normal (snps.tranches.v3.4-46.pdf), though I'm still trying to figure out how to get closer to the expected 2.15 Ti/Tv ratio. Oddly, the Ti/Tv ratios differ slightly between v3.4-46 and v3.5 even though the same data and settings were used.

I suspected the behavior w/ v3.5 may be a possible bug in VariantRecalibrator, which is why I'm posting here. Please let me know if you need any more information.

My best,

Chris

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cedlund
    Hi Chris,

    Have a look here at Geraldine's comment from April 17th.

    I think using dbsnp_138.b37.excluding_sites_after_129.vcf will help. Also, VQSR is not meant to be run on one chromosome. Can you try running on all the chromosomes and let us know how that goes?

    As for the differences in versions, there were some changes made to VQSR in the latest version. I am about to make a document about it. Basically, there were some changes to how MQ values are handled in VQSR. Originally, they were not Gaussianly distributed, and VQSR assumes Gaussianly distributed annotations. The change allows the MQ values to be more Gaussian.

    -Sheila

  • cedlundcedlund Member

    Hi Sheila,

    I tried what you suggested (running the whole genome and using the older dbSNP file) but I'm still seeing the same problem with the tranches plot when using v3.5. Please see attached.

    Notice how the tranches are out of order and the coloring of the bars is off.

    My best,
    Chris

    Issue · Github
    by Sheila

    Issue Number
    419
    State
    closed
    Last Updated
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cedlund
    Hi Chris,

    Sorry for the late response. You are right the plots look weird. Can I ask you to try running on the entire genome rather than chromosome 1 only? Also, please try running without -nt 24.

    Thanks,
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cedlund
    Hi again Chris,

    I just realized I already asked you to try on the whole genome. Let us know how the results look without -nt. I put in a ticket to get this checked out. I may have to ask you for a bug report.

    -Sheila

  • cedlundcedlund Member

    Hi Sheila,

    I tried without the -nt option, but still get the same problem. Please see attached.

    -Chris

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The tranches are ordered by TiTv instead of by truth sensitivity on purpose. On these results it would look weird with either ordering but we find this ordering is more appropriate most of the time.

    We occasionally see effects like this where the TiTv content of tranches doesn't quite follow expectations; over time I've come to the conclusion that it's not necessarily a sign of anything wrong, so much as some variability in datasets that causes the TiTv proxy we use to estimate false positive rates to be more or less appropriate (i.e. work more or less well, empirically). I would say that overall your data looks pretty good, and that you've just got a case where the TiTv proxy method for generating the plots is not doing the best job ever.

    If you wanted to investigate this a bit further, you could do a comparison of which variants get filtered with each tranche between the 3.4 and 3.5 callsets. My bet is that you wouldn't see much difference, and that those variants being treated differently would probably look questionable on some dimensions.

    You could also plot the distribution of each set of annotation values for the filtered vs unfiltered variants. for each of the tranche cutoffs. Again, my bet is you wouldn't see anything majorly different, except possibly on MQ since that's the only part of VQSR that was changed in 3.5. I'd be interested to know if the MQ plots between 3.4 and 3.5 look like they might explain the differences you see.

    If you don't know how to go about generating these kinds of plots, let us know. @Sheila is working on some documentation to explain how to do it and could probably use a test subject to evaluate how helpful her docs are.

  • cedlundcedlund Member

    Hi Geraldine,

    Thank you for the detailed explanation. I will try what you suggested and look at which variants get filtered using v3.4-46 and v3.5. I'll also take a look at the MQ plots (if you have any scripts that would help generate these it would be much appreciated). Just a couple more follow-up questions:

    1) Is it okay to use VariantRecalibrator v3.5 when I've run HaplotypeCaller and GenotypeGVCFs with v3.4-46, should I be using the -MQCap option?

    2) In the last tranches plot I attached, what does the color blue with orange stripes overlaid represent?

    My best,
    Chris

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    We don't have scripts as such but we have a procedure that consists of exporting the variant data to tables and running some ggplot commands in R. It's all described in the hands on tutorial materials that were presented a few weeks ago; there's a blog post with links to the tutorial bundle on the GATK blog (pending @Sheila's writeup of this as a proper doc article).

    1) it should be fine, yes. I don't think you need to use MQCap but feel free to experiment.

    2) it's a plotting anomaly because the plotting script is naive and does not handle this sort of case properly.

Sign In or Register to comment.