Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Seeking help debugging a VQSR tranche plot

VivekVivek Member ✭✭
edited February 18 in Ask the GATK team

Dear GATK Team,

I have received data from ~ 35k whole exome sequences from a collaborator and while proceeding with variant filtering I noticed that the VQSR tranche plot looked abnormal.

Contrary to the tranche plots I usually see with the ti/tv declining with added false positives at tranche levels, I see the ti/tv rising sharply at more permissive filtering levels (plot attached).

I thought this might have to do with some degraded samples with high singletons, so I performed a singleton QC and removed any samples with singletons exceeding a median absolute deviation exceeding 3, which set the threshold at ~ 62 singletons. This removed about 4,500 samples.

I re-did VQSR on the remaining 30.5k samples but the problem seems to persist
and I'm out of ideas on how to resolve this. I'd appreciate any tips on how to debug the issue.

Cheers,
Vivek.


Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited February 25

    Hi @Vivek

    Would you please share the version of gatk and the exact command you are using.
    If you are not using the latest GATK4.1 version then I suggest yo try it with that and see if the problem persists.

  • VivekVivek Member ✭✭

    Hi @bhanuGandham

    I re-ran VQSR with GATK 4.1, the problem persists. I have attached the new tranche plot.

    This is the command I'm using:

    gatk VariantRecalibrator -R hg19.fa -V Exomes_35k.concat.vcf --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_hg19.vcf --resource:omni,known=false,training=true,truth=true,prior=12.0 omni_2.5.vcf --resource:Genomes1k,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.vcf -an DP -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.95 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 -O SNPs_GATK4.1.recal --tranches SNPs_GATK4.1.tranches --rscript-file SNPs_GATK4.1_plots.R

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @Vivek

    I asked the dev team to weigh in and this is what they had to say:

    the sudden sharp increase in TiTv at more permissive filtering levels may be due to a transition biased artifact in the data. For example, if the samples are somewhat old, deamination events could pile up over time. Since deamination causes transition mutations, this would skew the TiTv ratio higher. The user seems to be considering this possibility, as they tested removing samples with high numbers of singletons from the cohort. The fact that the effect seems to decrease when singleton enriched samples are removed seems to support this idea as well (assuming I am interpreting the plots correctly and the effect did decrease somewhat with the removal of the singleton enriched samples). It is not surprising in this case that other samples which are not as enriched with singletons would still have some contamination due to these events, and so this increase in TiTv with more permissive filtering would still be observed.

  • VivekVivek Member ✭✭
    edited March 6

    Hi @bhanuGandham

    I appreciate the response. So with the next steps in the process, what is the best way to go about QC'ing this dataset?

    1. Censure the samples for singletons even further? What would be a way to find a more satisfactory threshold?
    2. Completely eliminate SNPs in tranches > 97.5 or some even stricter threshold?
    3. A combination of 1 & 2?

    I appreciate any guidance on how to proceed from here.

    We want to use the dataset for two purposes:

    1. Association analysis with gene or region based burden tests for complex psychiatric traits
    2. Evaluate the different ways we are imputing our chip data using an overlap of samples between the chip and exome sequencing.

    Thanks!
    Vivek.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Vivek

    Try eliminating SNPs in tranches > 97.5 or some even stricter threshold first and if that doesn't work then maybe try to use a combination of the two methods.

Sign In or Register to comment.