VQSR on ~500 genomes


I am working on VQSR step (using GATK 2.8.1) on variants which have been called by UG from ~500 whole genomes of cattle .
I run VariantRecalibrator as following:

${JAVA} ${GATK}/GenomeAnalysisTK.jar -T VariantRecalibrator \
-R ${REF} -input ${OUTPUT}/GATK-502-sorted.full.vcf.gz \
-resource:HD,known=false,training=true,truth=true,prior=15.0  HD_bosTau6.vcf \
-resource:JH_F1,known=false,training=true,truth=false,prior=10.0  F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0  BosTau6_dbSNP138_NCBI.vcf \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP -an HaplotypeScore \
-mode SNP \
-recalFile ${OUTPUT}/gatk_502_sorted_fixed.recal \
-tranchesFile ${OUTPUT}/gatk_502_sorted_fixed.tranches \
-rscriptFile ${OUTPUT}/gatk_502_sorted_fixed.plots.R

HD_bosTau6.vcf : ~770k markers on Illumina bovine high-density chip array

F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf : ~5.4M SNPs

The tranches pdf I got looks really weird, please check the attached file.

Then I tried to vary the 'prior' score of trainning VCF, and also supply additional VCF file from another project as training datasets. And I still got the similar tranches graph as above. e.g.:

-resource:HD,known=false,training=true,truth=true,prior=15.0  HD_bosTau6.vcf 
-resource:JH_F1,known=false,training=true,truth=false,prior=12.0  F1_uni_idra_pp_trusted_only_LMQFS_bosTau6.vcf 
-resource:DN,known=false,training=true,truth=false,prior=12.0  HC-Plat-FB.3in3.vcf.gz 
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0  BosTau6_dbSNP138_NCBI.vcf 

HC-Plat-FB.3in3.vcf.gz : ~ 14M markers

It is worthy to mention that I have done VariantRecalibrator step with the same parameters and training sets on another 50 whole genomes very recently, and it worked fine.
Actually I have done VariantRecalibrator on the 500 animals before when I accidentally took a unfiltered VCF called by UG as training set. Surprisingly, I got good tranches graph that time, similar to the graph posted on GATK best practice.
Do you have any suggestion for me?




  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    We've seen this issue before though I'm still not sure what is the underlying cause. The reason the plots look funky is that there is an assumption that as the truth sensitivity percentage increases, the tranche Ti/Tv ratio will increase. You can see this exemplified in the "normal" plots, in the ROC curve panel, where the data points for the tranches are plotted and linked together in order of descending Ti/Tv ratio. In the weird-looking plots, it appears that this assumption is not verified. What I don't know is whether this represents an underlying issue in the data, a bug in the program, or whether this assumption itself is wrong and the data is fine. I'll need to discuss this with the team.

    In the meantime, can you tell me what were the differences between the two calling runs done on the same 500 animals? Was it just that you used a different caller (UG vs HC) or did you use a different version of GATK, or different command lines? If we identify the differences we may be able to narrow down the problem.

  • WanboWanbo Member

    Hi Geraldine,

    The VCF of 500 animals used for variants recalibration in both times is the same one, which had been called by UG. I did use different version of GATK ( 2.5-2 and 2.8-1), and slightly different prior values. To make things easier, I will rerun the recalibration step with GATK 2.8-1 with the unfiltered training set, and I will post my results here tomorrow.


  • WanboWanbo Member

    I have tested variants recalibration step on the same VCF file with unfiltered training set using GATK 2.8-1. Still, the tranches plots looks abnormal. Then I checked carefully what I have exactly used in previous commands which gave me "normal" recalibration plots. I realized I was using a subset of the VCF file that time (I am sorry, because it's been a while), namely exome regions. Once again, I reran recalibration with the first commands I posted above ( high-quality training set, GATK 2.81) on the exome regions, then I got fairly good-looking plots (the attached files).

    It makes me think whether the amount of variants or the ratio of variants/training variants affects GATK recalibration performance? It worked on VCF of 50 WGS, and exome regions of 500 animals, but didn't work well on the whole dataset.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @Wanbo, sorry for the late reply. Are you saying you are combining exome and WGS datasets here and recalibrating them together? First thing is you want to make sure that you are only recalibrating together calls from samples that were called together (unless you use the new HaplotypeCaller GVCF workflow - but even then you can only recalibrate together calls from samples that were joint-genotyped together). Second thing is if you do that you can only recalibrate on the exome intervals. Trying to recalibrate on the whole genome when most of your data is exome will definitely lead to funky results.

  • WanboWanbo Member

    Hi Geraldine,

    No, I didn't combine exomes and WGS together. I recalibrated on exonic regions of WGS datasets by specifying intervals with -L .

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh good -- I'm glad I just misunderstood that. So to make sure I understand correctly now: the weird results are when you did VQSR on the full genome of the 500 animal WGS, but if you do just exonic regions on those same animal WGS, the plots look good?

    I'm doubtful that the issue results from having too much data -- typically VQSR performs better with more data, not worse. But this may indicate that there are some specific issues in the non-exonic regions. Could you try running VQSR on just the non-exonic regions to see what the plots look like for just them? The way to do that is to use -XL instead of -L to specify the capture intervals.

  • WanboWanbo Member

    @Geraldine_VdAuwera said:
    So to make sure I understand correctly now: the weird results are when you did VQSR on the full genome of the 500 animal WGS, but if you do just exonic regions on those same animal WGS, the plots look good?


    Yes, you are right.

    I followed your suggestion to make VQSR on the non-exonic regions, and I got the weird plots again. Then what would be the explanation? Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I think that means that the TiTv in those regions is very different from the expectation. You can change the expected TiTv in your command line (see tool documentation for the exact argument name) and see if that fixes the plots. This doesn't have any impact on the calculations, only on plotting, so you can safely experiment with this parameter.

Sign In or Register to comment.