
Strange tranche plot after GATK4 germline SNPs pipeline

Dear GATK team,

Context: 504 WES human samples
Objective: call SNPs using the GATK4 Best Practices

After applying the GATK4 germline SNPs and indels Best Practices pipeline (we used your .wdl file with Cromwell), VariantRecalibrator gave us this very strange (and bad) tranche plot.

We decided to go further and hard-filter using the parameters suggested on your site:

For SNPs:
--filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet" \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \

And for indels:
--filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet" \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
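
For readers, here is a minimal sketch (plain Python, not GATK itself) of what these expressions do: each named filter flags a site when its INFO annotation crosses the threshold, and a missing annotation simply leaves that filter un-applied, which mirrors GATK's default JEXL behavior. The helper name and dict layout are hypothetical, for illustration only.

```python
# Hypothetical illustration (not GATK): apply the SNP hard-filter
# thresholds above to a dict of INFO annotations. A missing annotation
# leaves that filter un-applied, mirroring GATK's default JEXL behavior.
SNP_FILTERS = {
    "ExcessHet":        ("ExcessHet",      ">",  54.69),
    "QD2":              ("QD",             "<",   2.0),
    "QUAL30":           ("QUAL",           "<",  30.0),
    "SOR3":             ("SOR",            ">",   3.0),
    "FS60":             ("FS",             ">",  60.0),
    "MQ40":             ("MQ",             "<",  40.0),
    "MQRankSum-12.5":   ("MQRankSum",      "<", -12.5),
    "ReadPosRankSum-8": ("ReadPosRankSum", "<",  -8.0),
}

def failed_filters(info):
    """Return the names of the filters a site fails; [] means PASS."""
    fails = []
    for name, (key, op, threshold) in SNP_FILTERS.items():
        value = info.get(key)
        if value is None:
            continue  # annotation absent: this filter is not applied
        if (op == ">" and value > threshold) or (op == "<" and value < threshold):
            fails.append(name)
    return fails

print(failed_filters({"QD": 1.2, "FS": 75.0, "MQ": 55.0}))  # ['QD2', 'FS60']
```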

We had already analyzed these samples using GWAS. We decided to compare the concordance of the hard-filtered WES and the GWAS calls using SnpSift concordance. Result: on average 97% concordance between the WES and GWAS samples.

So we are kind of lost now regarding the strange tranche plot that VariantRecalibrator gave us, and we would like to solve this issue. Even if the hard filtering seems to work OK, we are afraid of losing some important information, or of having too many false positive calls.

In this context, could you help us and give us some advice to debug this?

Thank you
N.

The tranche plot:

Answers

  • bhanuGandham (Cambridge, MA) Member, Administrator, Broadie, Moderator, admin

    Hi @NicoBxl

    Our dev team is looking into this. I will get back to you shortly.

  • emeryj Member, Broadie

    Hello, @NicoBxl.

    I have a few questions about how you generated your data.

    First off, which WDL did you use for processing your exome data? Do you mean this workflow? https://github.com/gatk-workflows/gatk4-germline-snps-indels. If you used that workflow on exome data, we recommend that you do not run VQSR with DP as an analysis annotation.

    Could you clarify what you mean by running GWAS to generate your truth data? Do you mean whole-genome sequences for the same files?

    Can you send us the command you used to run VQSR? What were you using as known sites?

    What do the tranche results look like after applying hard filtering?

  • NicoBxl Member

    Hello @emeryj

    Thanks for the answer.

    I used haplotypecaller-gvcf-gatk4.wdl and joint-discovery-gatk4.wdl.

    VQSR was not run with DP. It was run with
    -an FS -an MQ -an ReadPosRankSum -an MQRankSum -an QD -an SOR

    For the GWAS, we used a 650,000-SNP array on the same samples. Comparing with SnpSift concordance, we have 97% concordance for each WES-GWAS pair for the same sample.
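
    The concordance figure here is, in essence, the fraction of matching genotypes at sites called by both platforms. A minimal sketch (plain Python, not SnpSift; the data layout is made up for illustration):

```python
# Minimal sketch (not SnpSift): per-sample genotype concordance between
# an array callset and a WES callset at sites present in both.
def concordance(array_gts, wes_gts):
    """Both args map (chrom, pos) -> genotype string, e.g. '0/1'."""
    shared = set(array_gts) & set(wes_gts)
    if not shared:
        return 0.0
    matches = sum(array_gts[site] == wes_gts[site] for site in shared)
    return matches / len(shared)

array = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr2", 50): "0/0"}
wes   = {("chr1", 100): "0/1", ("chr1", 200): "0/1", ("chr2", 50): "0/0"}
print(round(concordance(array, wes), 3))  # 2 of 3 shared sites agree -> 0.667
```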

    Also, I forgot to tell you that we also did some samples in WGS (in addition to the WES). We compared WES and WGS for the same samples and the concordance was quite good (more analysis is required, though, to have an exhaustive view).

    Here is the command:

        ${gatk_path} --java-options "-Xmx100g -Xms100g" \
          VariantRecalibrator \
          -V ${sites_only_variant_filtered_vcf} \
          -O ${recalibration_filename} \
          --tranches-file ${tranches_filename} \
          --trust-all-polymorphic \
          -tranche ${sep=' -tranche ' recalibration_tranche_values} \
          -an ${sep=' -an ' recalibration_annotation_values} \
          -mode SNP \
          --sample-every-Nth-variant ${downsampleFactor} \
          --output-model ${model_report_filename} \
          --max-gaussians 6 \
          --target-titv 3.0 \
          --resource:hapmap,known=false,training=true,truth=true,prior=15 ${hapmap_resource_vcf} \
          --resource:omni,known=false,training=true,truth=true,prior=12 ${omni_resource_vcf} \
          --resource:1000G,known=false,training=true,truth=false,prior=10 ${one_thousand_genomes_resource_vcf} \
          --resource:dbsnp,known=true,training=false,truth=false,prior=2 ${dbsnp_resource_vcf}
    

    Resources used (found in the GATK bundle, all hg38):

    • hapmap_3.3
    • 1000G_omni2.5
    • 1000G_phase1.snps.high_confidence
    • dbsnp_146

    I ran VQSR on the hard-filtered VCF. Here is the plot:

    Thank you for your help.

  • emeryj Member, Broadie
    edited June 27

    @NicoBxl Thank you. Those results are still significantly outside of our expectations even after hard filtering. I suspect something has gone awry, possibly at an earlier stage in the pipeline.

    Can we see the variant calling metrics output files from this run (both the detail and summary files from the workflow output)? It may also help to rerun VariantRecalibrator with --rscript-file plot_file_basename set. This will produce some nice summary plots of the truth and test data that might illuminate what is going on.

    As an addendum, what json parameters did you use when you ran haplotypecaller-gvcf-gatk4.wdl on this sample?

  • NicoBxl Member

    Dear @emeryj ,

    I have attached the SNP and indel recalibration plots from VariantRecalibrator.

    For the JSON I used the ones found on your GitHub. I only modified them to run the workflow without Docker (our HPC does not support Docker yet). Also, I just noticed that I used dbsnp146 instead of dbsnp138 (is that the origin of these results?).

    Thank you for your help

  • NicoBxl Member

    Dear @emeryj ,

    Attached are the log files and the recalibration plots.

    Any thoughts yet regarding the strange tranche plot? Let me know if you need additional information. We really appreciate your help on this.

    Thank you

  • gauthier Member, Broadie, Moderator, Dev, admin

    Hi @NicoBxl ,

    Based on the VQSR outputs (the plots and the logs) there's nothing obviously wrong with your data. The annotation distributions look about how I would expect.

    The Ti/Tv for your callset is definitely lower than you told the tool to expect, which is why it's predicting so many false positives. In the log, the strictest tranche (most precise, least sensitive) has a Ti/Tv of 2.3449 over known variants, which is closer to the WGS expected value than the WES expected value.

    The WGS version of the joint calling workflow supplied in that github repo uses intervals over the whole genome. Did you restrict your variant calling to just the exome? In most captures there's a significant amount of coverage adjacent to the targets. We found recently that adding 150bp to each side increases the territory by about 4X, which would definitely pick up some unconstrained variants and bring down your Ti/Tv.

    Can you try running VariantRecalibrator just over the unpadded exome targets? I've uploaded the list here: gs://gatk-best-practices/exome-analysis-pipeline/exome_calling_regions.v1.unpadded.interval_list
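
    For anyone following along, the Ti/Tv statistic in question can be sketched in a few lines (illustrative Python, not a GATK tool; the expected values in the comments are the usual rules of thumb):

```python
# Illustrative Python (not a GATK tool): Ti/Tv over a set of SNVs.
# Transitions are A<->G and C<->T; any other single-base change is
# a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv(snvs):
    """snvs: list of (ref, alt) single-base pairs."""
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in snvs)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")

# Rule of thumb: exome callsets sit near 3.0, whole genomes near
# 2.0-2.1, so off-target (non-coding) calls drag an exome's ratio down.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ti_tv(calls))  # 3 transitions, 1 transversion -> 3.0
```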

    -Laura

  • NicoBxl Member

    Thank you @gauthier. I'll try that ASAP.

    I'll come back to you as soon as the analysis is done.

    Nicolas

  • NicoBxl Member

    Dear @gauthier ,

    Here is the tranche plot from running VQSR with the interval file you sent me. We didn't execute the whole pipeline from scratch using this interval file, only the VQSR part. There is an improvement compared to the initial analysis.

    An important point that I didn't mention in my first post is that my exome cohort is a mix of samples prepared using different exome kits. To have an interval list more representative of our dataset, we took the union of all intervals and executed VQSR using this "union set". Here is the tranche plot. It shows an even better Ti/Tv than with your interval file (expected, as it's more representative of our data).

    Do you expect such a tranche plot, or is the Ti/Tv still too low for exomes?

    As not all the steps used the same interval list (I wanted to go fast and test your interval file), I have now executed the whole pipeline using the "union set" interval file, as it gives me a better Ti/Tv. That may increase the Ti/Tv further. I will keep you updated as soon as the whole pipeline is finished.

    Thank you for helping us Laura (and the whole GATK team).

    Nicolas

  • gauthier Member, Broadie, Moderator, Dev, admin

    I'm a little surprised that the union of the captures does even better. Our intervals list should be all the coding regions from RefSeq plus some microRNAs plus the Illumina exome capture targets, so I didn't expect you to pick up more variants with high Ti/Tv by adding more intervals.

    The Ti/Tv values themselves don't concern me so much; what does is that the first two tranches add a lot of false positives compared with the rest (these aren't really validated false positives, but are estimated using the target Ti/Tv), which means the variants aren't being ranked very well. I admittedly don't look at these plots very often, but I'm used to seeing something more like the plots in the docs: https://software.broadinstitute.org/gatk/documentation/article?id=11084
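
    For the curious, the "estimated using the target Ti/Tv" step can be sketched as a two-component mixture: assume true positives have the target Ti/Tv while false positives look like random substitutions with a Ti/Tv of about 0.5, then solve for the false-positive fraction that reproduces the observed ratio. This is a back-of-the-envelope version of the idea, not VariantRecalibrator's exact code.

```python
# Back-of-the-envelope sketch (not VariantRecalibrator's exact code):
# solve for the false-positive fraction f such that mixing TPs
# (Ti/Tv = target) with FPs (Ti/Tv ~ 0.5, i.e. random substitutions)
# reproduces the observed Ti/Tv.
def est_fp_fraction(observed_titv, target_titv, fp_titv=0.5):
    frac = lambda titv: titv / (titv + 1.0)  # fraction of transitions
    p_obs, p_tp, p_fp = frac(observed_titv), frac(target_titv), frac(fp_titv)
    f = (p_tp - p_obs) / (p_tp - p_fp)
    return min(1.0, max(0.0, f))

# With a target of 3.0, an observed Ti/Tv of 2.34 implies roughly 12%
# of calls behaving like random substitutions.
print(round(est_fp_fraction(2.34, 3.0), 2))  # 0.12
```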

    One thing that might improve your data would be to put the QD hard filter back on top of the VQSR filters. I might even go as high as 4 or 5 for the SNP threshold. There are a lot of lower QD sites in the training data and VQSR usually doesn't successfully penalize low QD variants. It's also one of the less Gaussian annotations. And you are already applying the ExcessHet filter according to the published pipeline, right? We use a hard filter for that one since it's non-Gaussian and because we don't want to penalize sites that have less heterozygosity than expected because the cohort is a mixture of two different populations with drastically different allele frequencies.

    -Laura

  • NicoBxl Member
    edited July 25

    Hi Laura,

    Just to clarify: the "union set" I used is only the union of the interval files of the exome kits we used (not the union of your file and those interval files).
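
    For reference, building a union set of this sort amounts to merging overlapping intervals across the kits' lists; a minimal sketch (plain Python with made-up coordinates, not the actual tool used):

```python
# Sketch of building a "union set": merge overlapping intervals from
# several kits' BED-style lists into one combined list per chromosome.
def union_intervals(*interval_lists):
    """Each list holds (chrom, start, end) half-open intervals."""
    by_chrom = {}
    for lst in interval_lists:
        for chrom, start, end in lst:
            by_chrom.setdefault(chrom, []).append((start, end))
    merged = []
    for chrom in sorted(by_chrom):
        spans = sorted(by_chrom[chrom])
        cur_s, cur_e = spans[0]
        for s, e in spans[1:]:
            if s <= cur_e:              # overlap: extend the current span
                cur_e = max(cur_e, e)
            else:                       # gap: emit and start a new span
                merged.append((chrom, cur_s, cur_e))
                cur_s, cur_e = s, e
        merged.append((chrom, cur_s, cur_e))
    return merged

kit_a = [("chr1", 100, 200), ("chr1", 300, 400)]
kit_b = [("chr1", 150, 250)]
print(union_intervals(kit_a, kit_b))  # [('chr1', 100, 250), ('chr1', 300, 400)]
```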

    Yes, I applied the ExcessHet filter before VQSR:

    gatk VariantFiltration \
      -V $vcf_in \
      --filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet" \
      -O vcf_hg38_excesshet.vcf.gz
    

    I reran VQSR after adding an additional filter on QD:

    --filter-expression "QD < 5.0" --filter-name "QD5"

    Here's the tranche plot using your interval file and the QD < 5 filter. I also attached the other plots you asked for. Adding QD < 5 didn't seem to have an important impact on the VQSR results.

    Thank you
    Nicolas

  • gauthier Member, Broadie, Moderator, Dev, admin

    Your interval list definitely does better than mine, which makes sense since off-target reads are pretty unreliable. I think you should use your interval list for filtering and for the final callset. The two things left I'd suggest are looking at the variant calling detail metrics to make sure all your samples are reasonable (I'm happy to take a look if you want to post it here) and, as the more time-consuming option, adding some truth samples from 1000G like NA12878 into your callset and running GenotypeGVCFs and VQSR again. Then we could see how well the filtering does on a known sample. I found an NA12878 exome here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA12878/alignment/ but there are undoubtedly others that you can probably find via Genome in a Bottle.

    Also I didn't see an attachment.

  • NicoBxl Member

    Here's the attachment.

    I will post the variant calling metrics as soon as possible here. Also I will try to add NA12878 to my dataset and rerun the pipeline.

    Thank you again for your amazing help !

    Nicolas

  • NicoBxl Member

    Hi @gauthier ,

    Attached are the sample metrics (CollectVariantCallingMetrics) for my cohort. I used the VCF file after GenotypeGVCFs as input. I reran the whole pipeline with the "union set" intervals. For now NA12878 is not in it yet (the analysis should be finished by tonight).

    Here's the tranche plot. I attached the other plots from VQSR:

    We can see some improvements compared to the last tranche plot (step by step we get closer to what we expect :wink: )

    I will post the results with NA12878 as soon as I have the results.

    Thank you for your help

    Nicolas

  • gauthier Member, Broadie, Moderator, Dev, admin

    Your variant calling detail metrics look pretty bad, but they're also saying that there are no filtered SNPs or indels in any sample. These are the metrics from the post-VQSR VCF, right? Regardless, there are some samples with a lot more singletons than the others, which is usually cause for concern.

  • NicoBxl Member
    edited August 9

    Hi @gauthier ,

    The metrics file was pre-VQSR. As I already mentioned, we have a mixed dataset of exomes. In more detail, here is the breakdown of our dataset across the different exome kits:

    Exome kit interval file                                     Samples
    SureSelect_Human_All_Exon_V2_46Mb_S0293689_hg38.bed              40
    SureSelect_Human_All_Exon_V3_50Mb_S02972011_hg38.bed              7
    SureSelect_Human_All_Exon_V4+UTRs_71Mb_S03723424_hg38.bed        87
    SureSelect_Human_All_Exon_V5+UTRs_75Mb_S04380219_hg38.bed       227
    SureSelectClinicalResearchExomeV2_64Mb_S30409818_hg38.bed        61
    TwistHumanCoreExomeEnrichmentSystem_hg38.bed                     76
    

    The last results I sent you were built using a union set for the different samples. This union interval file was used at each step of the pipeline (BQSR -> HaplotypeCaller -> GenomicsDBImport -> GenotypeGVCFs -> VQSR).

    To avoid an additional bias at BQSR and HaplotypeCaller, I reran all samples with their respective exome kit intervals. Then for GenomicsDBImport -> GenotypeGVCFs -> VQSR I used the union set as before.

    So :

    • BQSR -> HC: exome kit intervals of the respective sample
    • GenomicsDBImport -> GenotypeGVCFs -> VQSR: union interval file of all the exome kit interval files

    Here is the resulting tranche plot. There is clearly an improvement compared to the last analysis.

    I also attached the pre- and post-VQSR metrics (I used the 99% tranche for now as it seems to be the branching point). The metrics didn't change much compared to the previous analysis, though.

    Instead of grouping all samples regardless of exome kit, do you think it would be a good idea to call the variants by performing the analysis on each subset of samples with the same exome kit, and then merge the filtered VCFs?

    Or to use the new CNNScoreVariants per sample?

    Or to use hard filtering?

    Thank you again for your help.

  • gauthier Member, Broadie, Moderator, Dev, admin

    The shape of the Ti/Tv plot is definitely much better, but the values are still disappointingly low. Now that I've seen the names of your capture kits, though, I have another idea. Based on the names, it looks like some of your captures include UTRs, and I would expect UTRs to have a Ti/Tv more like non-coding regions than like exons. Maybe the Ti/Tv will improve if you restrict it to just Gencode exons? If you don't have an interval list for exons, ours comes from:

    wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gtf.gz
    zcat gencode.v28.annotation.gtf.gz | awk '$3 == "exon" { print $1, $4, $5, $7, $18}' | tr ' ' '\t' | tr -d '";' >> tmp.interval_list
