If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Strange tranche plot after GATK4 germline SNPs pipeline

Dear GATK team,

Context : 504 WES human samples
Objective : Call SNP using GATK4 best practices

After applying the GATK4 germline SNP and indels best practives pipeline ( we used your .wdl file with cromwell ), VariantRecalibrator gave us this very strange (and bad) tranche plot.

We decided to go further and hard filter using the parameters suggested on your site :

for SNPs :
--filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet" \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \

and for indels :
--filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet" \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \

We had already analyzed these samples using GWAS. We decided to compare the concordance of the hard filered WES and the GWAS using SnpSift concordance. Results : on average 97% of concordance between the WES and GWAS samples.

So we are kind of lost now regarding the strange tranche plot that VariantRecalibrator gave us and we would like to solve this issue. Even if the hard filtering seems to work ok we are afraid to lose some important information ; or to have too many false positive calls..

In this context, could you help us and give us some advice to debug this ?

Thank you

The tranche plot :


  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @NicoBxl

    Our dev team is looking into this. I will get back to you shortly.

  • emeryjemeryj Member, Broadie

    Hello, @NicoBxl.

    I have a few questions about how you generated your data.

    First off, which wdl did you use for processing your exome data? Do you mean this workflow? We would recommend if you used that workflow on exome data that you do not run VQSR with DP as an analysis annotation.

    Could you clarify what you mean by running GWAS to generate your truth data, do you mean whole genome sequences for the same files?

    Can you send us the command you use to run VQSR? What were you using as known sites?

    What do the tranche results look like after applying hard filtering?

  • NicoBxlNicoBxl Member

    Hello @emeryj

    Thanks for the answer.

    I used the haplotypecaller-gvcf-gatk4.wdl and joint-discovery-gatk4.wdl .

    VQSR was not run with DP. It was run with
    -an FS -an MQ -an ReadPosRankSum -an MQRankSum -an QD -an SOR

    For the GWAS, we used a 650000 SNP array on the same samples. and comparing using SnpSift concordance we have 97% of concordance for each pair WES - GWAS for the same sample.

    Also I forgot to tell you that we also did some sample in WGS (in addition to the WES). We compared WES and WGS for the same samples and the concordance was quiet good (More analysis are required though to have an exhaustive view)

    Here is the command :

        ${gatk_path} --java-options "-Xmx100g -Xms100g" \
          VariantRecalibrator \
          -V ${sites_only_variant_filtered_vcf} \
          -O ${recalibration_filename} \
          --tranches-file ${tranches_filename} \
          --trust-all-polymorphic \
          -tranche ${sep=' -tranche ' recalibration_tranche_values} \
          -an ${sep=' -an ' recalibration_annotation_values} \
          -mode SNP \
          --sample-every-Nth-variant ${downsampleFactor} \
          --output-model ${model_report_filename} \
          --max-gaussians 6 \
          --target-titv 3.0 \
          --resource:hapmap,known=false,training=true,truth=true,prior=15 ${hapmap_resource_vcf} \
          --resource:omni,known=false,training=true,truth=true,prior=12 ${omni_resource_vcf} \
          --resource:1000G,known=false,training=true,truth=false,prior=10 ${one_thousand_genomes_resource_vcf} \
          --resource:dbsnp,known=true,training=false,truth=false,prior=2 ${dbsnp_resource_vcf}

    Ressources used (found on GATK bundle - all hg38) :

    • hapmap_3.3
    • 1000G_omni2.5
    • 1000G_phase1.snps.high_confidence
    • dbsnp_146

    I run VQSR on the hardfiltered vcf. Here is the plot :

    Thank you for your help.

  • emeryjemeryj Member, Broadie
    edited June 27

    @NicoBxl Thank you. Those results are still significantly outside of our expectations even after hard filtering. I suspect something has gone awry, possibly at an earlier stage in the pipeline.

    Can we see the the variant calling metrics output files from this run (both the detail and summary files from the workflow output)? It may also help figure out what is going on to try rerunning VariantRecalibrator with --rscript-file plot_file_basename set. This will produce some nice summary plots of the truth and test data that might illuminate what is going.

    As an addendum, what json parameters did you use when you ran haplotypecaller-gvcf-gatk4.wdl on this sample?

  • NicoBxlNicoBxl Member

    Dear @emeryj ,

    I join the SNP and indel recalibration plots from Variant Recalibrator.

    For the json I used the ones found on your github. I only modified it to be able to use the workflow without docker (our HPC does not accept docker yet). Also I just see that I use dbsnp146 instead of dbsnp138 (Is that the origin of these results ?)

    Thank you for your help

  • NicoBxlNicoBxl Member

    Dear @emeryj ,

    Hereby the log files and the recalibration plots.

    Any thoughts yet regarding the strange tranche plot ? Let me know if you need additional information. We really appreciate your help on this

    Thank you

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi @NicoBxl ,

    Based on the VQSR outputs (the plots and the logs) there's nothing obviously wrong with your data. The annotation distributions look about how I would expect.

    The Ti/Tv for your callset is definitely lower than you told the tool to expect, which is why it's predicting so many false positives. In the log, the strictest tranche (most precise, least sensitive) has a Ti/Tv of 2.3449 over known variants, which is closer to the WGS expected value than the WES expected value. The WGS version of the joint calling workflow supplied in that github repo uses intervals over the whole genome. Did you restrict your variant calling to just the exome? In most captures there's a significant amount of coverage adjacent to the targets. We found recently that adding 150bp to each side increases the territory by about 4X, which would definitely pick up some unconstrained variants and bring down your Ti/Tv. Can you try running VariantRecalibrator just over the unpadded exome targets? I've uploaded the list here: gs://gatk-best-practices/exome-analysis-pipeline/exome_calling_regions.v1.unpadded.interval_list


  • NicoBxlNicoBxl Member

    Thany you @gauthier . I'll try that asap.

    I'll come back to you as soon as the analysis is done.


  • NicoBxlNicoBxl Member

    Dear @gauthier ,

    Here is the tranche plot by using VQSR with the interval file you send me. We didn't execute the whole pipeline from scratch using this interval file. Only the VQSR part. There is an improvement compared to the initial analysis.

    An important point, that I didn't mention in my first post is that my exome cohort is a mix of samples prepared using different exome kits. To have an interval list more representative of our dataset, we did the union of all intervals and executed VQSR using this "union set". Here is the tranche plot. Even better Ti/Tv compared to your interval file (expected as it's more representative of our data).

    Do you expect such tranche plot ? or Ti/Tv is still too low for exomes ?

    As all the steps didn't use the same interval list ( I wanted to go fast and test your interval file ) I have now executed the whole pipeline using the "union set" interval file as it gives me a better Ti/Tv. It will maybe increase Ti/Tv. I will keep you updated as soon as the whole pipeline is finished

    Thank you for helping us Laura (and the whole GATK team).


  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    I'm a little surprised that the union of the captures does even better. Our intervals list should be all the coding regions from Refseq plus some micro RNAs plus the Illumina exome capture targets, so I didn't expect you to pick up more variants with high Ti/Tv by adding more intervals.

    The Ti/Tv values themselves don't concern me so much, but the fact that the first two tranches add a lot of false positives compared with the rest (where these aren't really validated false positives, but estimated using the target Ti/Tv), which means that the variants aren't being ranked very well. I admittedly don't look at these plots very often, but I'm used to seeing more like in the docs:

    One thing that might improve your data would be to put the QD hard filter back on top of the VQSR filters. I might even go as high as 4 or 5 for the SNP threshold. There are a lot of lower QD sites in the training data and VQSR usually doesn't successfully penalize low QD variants. It's also one of the less Gaussian annotations. And you are already applying the ExcessHet filter according to the published pipeline, right? We use a hard filter for that one since it's non-Gaussian and because we don't want to penalize sites that have less heterozygosity than expected because the cohort is a mixture of two different populations with drastically different allele frequencies.


  • NicoBxlNicoBxl Member
    edited July 25

    Hi Laura,

    Just to clarify the "union set" I used is only the union the interval files of the exome kits we used (not the union of your file and the interval files of the exome kits we used).

    Yes I applied ExcessHet filter before VQSR :

    gatk  VariantFiltration \
      -V $vcf_in \
      --filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet" \
      -O vcf_hg38_excesshet.vcf.gz

    I reran VQSR by adding an additional filter on QD

    --filter-expression "QD < 5.0" --filter-name "QD5"

    Here's the trancheplot using your interval file and QD<5 filter. I also added the other plots you asked in attachment. Adding QD < 5 didn't seem to have a important impact on the results of VQSR.

    Thank you

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Your interval list definitely does better than mine, which makes sense since off-target reads are pretty unreliable. I think you should use your interval list for filtering and for the final callset. The two things left I'd suggest are looking at the variant calling detail metrics to make sure all your samples are reasonable (I'm happy to take a look if you want to post it here) and the more time consuming option would be to add some truth samples from 1000G like NA12878 into your callset and run GenotypeGVCFs and VQSR again. Then we could see how well the filtering does on a known sample. I found an NA12878 exome here: but there are undoubtedly others that you can probably find via Genomes in a Bottle.

    Also I didn't see an attachment.

  • NicoBxlNicoBxl Member

    here's the attachment

    I will post the variant calling metrics as soon as possible here. Also I will try to add NA12878 to my dataset and rerun the pipeline.

    Thank you again for your amazing help !


  • NicoBxlNicoBxl Member

    Hi @gauthier ,

    Hereby in attachment the sample metrics (CollectVariantCallingMetrics) for my cohort . I used the vcf file after GenotypeVCF as input. I rerun the whole pipeline with the "union set" interval. For now NA12878 is not in it yet (Analysis should be finished for tonight).

    Here's the tranche plot. I put in attachment the other plots from VQSR :

    We can see some improvements compared the last tranche plot (step by step we go closer to what we expect :wink: )

    I will post the results with NA12878 as soon as I have the results.

    Thank you for your help


  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Your variant calling detail metrics look pretty bad, but they're also saying that there are no filtered snps or indels in any sample. These are the metrics from the post-VQSR vcf right? Regardless, there are some samples with a lot more singletons than the others, which is usually cause for concern.

  • NicoBxlNicoBxl Member
    edited August 9

    Hi @gauthier ,

    The metric file was pre-VQSR. As I already mentionned we have a mixed dataset of exomes. To be more detailed here is the division of our dataset in the different exome kits :

                                                           Var1 Freq
    1       SureSelect_Human_All_Exon_V2_46Mb_S0293689_hg38.bed   40
    2      SureSelect_Human_All_Exon_V3_50Mb_S02972011_hg38.bed    7
    3 SureSelect_Human_All_Exon_V4+UTRs_71Mb_S03723424_hg38.bed   87
    4 SureSelect_Human_All_Exon_V5+UTRs_75Mb_S04380219_hg38.bed  227
    5 SureSelectClinicalResearchExomeV2_64Mb_S30409818_hg38.bed   61
    6              TwistHumanCoreExomeEnrichmentSystem_hg38.bed   76

    The last results I send you were built using a union set for the different samples. This union interval file was used in each steps of the pipeline (BQSR -> Haplotypecaller -> GenomicsDBimport -> GenotypeGVCF -> VQSR).

    To avoid an additional bias at BQSR and Haplotyecaller I reran all samples with its respective exome kit interval. Then for GenomicsDBimport -> GenotypeGVCFs -> VQSR I used the union set as before.

    So :

    • BQSR -> HC : exome kit interval of the respective sample
    • GenomicsDBimport -> GenotypeGVCF -> VQSR : union interval file of all exome kits interval files

    Here is the resulting tranche plo. There is clearly an improvement compared to the last analysis

    I also attached the pre and post VQSR (I used 99% for now as it seems to be the branching point..). Metric didn't change a lot compared to previous analysis though..

    Do you think to call the variants by performing the analysis on each subset of samples with the same exome kit ; and then merge the filtered VCF may be a good idea instead of grouping all samples (unregarding its exome kit).

    or use the new CNNScoreVariants per sample ?

    or use hard filtering ?

    Thank you again for your help.

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    The shape of the Ti/Tv plot is definitely much better, but the values are still disappointingly low. But now that I've seen the names of your capture kits I have another idea. Based on the names, it looks like some of your captures include UTRs, and I would expect UTRs to have a Ti/Tv more like non-coding regions than like exons. Maybe the Ti/Tv will improve if you restrict it to just gencode exons? If you don't have an interval list for exons, ours comes from

    zcat gencode.v28.annotation.gtf.gz | awk '$3 == "exon" { print $1, $4, $5, $7, $18}' | tr ' ' '\t' | tr -d '";' >> tmp.interval_list

  • NicoBxlNicoBxl Member
    edited August 27

    Dear @gauthier ,

    I took your advice and reran everything ( BQSR -> HC -> GDBI -> GenoGVCFs -> VQSR ). I used this time the intersection of the gencode file (only exons) and the respective exome kit interval file. Unfortunately the tranche plot didn't improve a lot compared to the previous analysis.

    I also performed CollectVariantCallingMetrics on the joint genotyped vcf file. Compared to the previous analysis we can see an increase of Ti/Tv.

                                    bed  titv_union titiv_int_gencode delta_mean  pct_mean
      SureSelect_Human_All_Exon_V2_46Mb        2.70              2.82      0.124      4.61
      SureSelect_Human_All_Exon_V3_50Mb        2.75              2.87      0.117      4.24
      SureSelect_Human_All_Exon_V4+UTRs        2.49              2.58     0.0822      3.31
      SureSelect_Human_All_Exon_V5+UTRs        2.53              2.59     0.0570      2.29
      SureSelectClinicalResearchExomeV2        2.49              2.66      0.175      7.05
    TwistHumanCoreExomeEnrichmentSystem        2.88              2.93     0.0519      1.80

    In red : using exome kit
    In blue : using intersection of exome kit and gencode exons

    For BDBI and GenotypeGVCFs : union of exome kit intervals ; and union of 'intersection of exome kit and gencode exons" respectively.

    I attached the output of CollectVariantCallingMetrics.

    I'm not sure what I can do more to increase the quality of VQSR results. Should I proceed with ApplyVQSR with a strict threshold ? or do hard filtering ?

    Again thank you for your help.


  • NicoBxlNicoBxl Member
    edited August 28

    Dear @gauthier ,

    To add some context to my previous post, we have already 80 WGS analyzed succesfully with GATK4 best practices (in this case the tranche plot was good and we proceed to applyVQSR to have a filtered VCF to use for further analysis).

    Now we would like to use both WES dataset (the one you are currently helping us to analyze) and WGS dataset ( already analyzed ) to analyze germline variants. Is it ok to combine both sets of variants if WES is hard-filtered and WGS was VQSR-based filtered ? or do you suggest an other approach in this case ?

    Thank you

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    I went digging in our master variant calling database to find some exome callsets with comparable sample numbers to see if your number of singletons per sample is in line with what we usually see. They're kind of all over the place, but 250-350 would be a broad range for the mean per callset. Your detail stats you attached are showing 698 per sample with some crazy outliers up to ~3500. We have a few outliers too, but even the long tail doesn't go past 1000 most of the time. Even keeping in mind that these are unfiltered calls in your case, that seems really high. If you want your Ti/Tv to look better it might be worth trying it again omitting some of the outliers.

    Do you have ethnicities for your samples? Some of them have very low percentages of variants seen in dbSNP (as low as ~92%), although I'm comparing with filtered numbers that are around 98%. Your median is about in line with what I'm seeing, but with a long tail of samples with a lot of "novel" variation. African samples and samples from underrepresented populations like Arab and South Asian (India, Pakistan, Bangladesh) will have a lot of variation that hasn't been reported in dbSNP, especially African. Although systematic sequencing errors will also show up as novel variation.

    For our large callsets, the analysts doing QC often alternate between variant QC and sample QC, starting with filtered variants and omitting samples, then revisiting variant filtering with that subset of samples, etc. Here it might make sense to try sample QC first and then VQSR again. Have you looked at the metrics for your samples? Do they all have contamination less than 10% and less than 15% chimeric reads? You could probably be even stricter.

    I hope that helps. Even if you end up not using VQSR, I have some concerns about a few of these samples that I think are worth investigating either way.


  • NicoBxlNicoBxl Member
    edited October 1

    Hi @gauthier,

    Thanks a lot again for your invaluable help and sorry for my late reply, we had some important issues with our local cluster causing huge delays in our analyses.


    Following your advice, I plotted the number of NUM_SINGLETONS according to the underlying ancestry. As you suspected, the samples with the highest number of singletons are mostly from non-European origin (mostly Africans and south Asians) see the attached plot. I guess this should therefore not be considered as sequencing errors and thus a proxy of “bad” samples?

    However, I tried to exclude all samples with NUM_SINGLETONS < 1000 as suggested. I performed GenomicsDBImport and GenotypeGVFs without these samples, but this did not improve the Ti/Tv on the tranche plot, even slightly. Attached you will find the tranche plot and others from VQSR.

    2. Chimeric reads

    I checked the chimeric reads by applying GATK CollectMultipleMetrics (Picard). All samples have a chimeric read percentage under 2.5% (average 0.63 %) .

    **3. Contamination **

    Looking at contamination (GATK CalculateContamination), I found some potential outliers (two samples over 10% of contamination).

    4. HET/HOM ratio

    I performed other qc checks and found 7 outliers for the HET/HOM ratio (see the attached plot). Outliers from contamination were the same as the ones from HET/HOM ratio.

    I performed GenomicsDBImport and GenotypeGVFs without all these outliers (7 samples) and applied VQSR on the filtered dataset. Again, the tranche plot looks very similar.

    I also tried other options like excluding 9 samples with an excess of Missing genotype rate (by exome kit), and/or NOVEL_TITV (by exome kit) but it didn’t change anything and I don’t really understand why.

    Therefore, I don’t see how I can reach a NOVEL_TITV of 3.0-3.2 as mentioned if the GATK docs especially since after removing those potential outliers the data shouldn’t be considered as “bad”.

    Do you think that VQSR is still useful here or it’s better (and acceptable) to use hard filtering?

    Many thanks!

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi @NicoBxl ,

    If you have a set of hard filters that give you a good Ti/Tv ratio, then there's nothing wrong with using that. (The recommendations we have in the forum article are pretty liberal in that there will probably still be a lot of false positives and a mediocre Ti/Tv, but if you make the thresholds more strict you should be able to do pretty well.) . VQSR was meant to effectively automate this process and also to take into account correlations between annotations, but in practice hard filters can often do better, applying them just requires more experience.

    I'm still concerned by your singletons distribution, but if you want to move forward with your analysis hard filters may be the way to go. Did you ever get a chance to add NA12878? That would be very telling since 100% of her true variants should be in dbSNP. It's also very helpful to construct a ROC or precision-recall curve with respect to the NIST GiaB. We should be hitting numbers like >=97% SNP recall, > 98% SNP precision, ~82% indel recall, and at least 75% indel precision. (We're still sorting out our indel thresholds -- you can get better precision with a big loss in sensitivity like 98% precision with 75% recall.)


  • NicoBxlNicoBxl Member

    Hi @gauthier

    Thanks again! I followed your advice on tuning the hard-filtering procedure. With this aim, I made plots for six annotation (QD, QUAL, SOR, FS, MQ, MQRankSum, ReadPosRankSum) on the unfiltered VCF (547 samples with different capture kits).

    I choose 3 sets of hard filters:

    • Hard_1 :
      QD < 10 – QUAL > 500 – SOR > 2 – FS > 10 – MQ < 50 – MQRankSum < -5 – ReadPosRankSum < -5

    • Hard_2 :
      QD < 12 – QUAL < 500 – SOR < 1 – FS < 10 – MQ < 50 – MQRankSum < -5 – ReadPosRankSum < -5

    • Hard_standard (the recommendations you have in the forum) :
      QD < 2 – QUAL < 30 – SOR < 3 – FS < 60 – MQ < 40 – MQRankSum < -12.5 – ReadPosRankSum < -8.0

    and 2 VQSR thresholds (90 and 99).

    Then I ran GATK CollectVariantCallingmetrics to collect PCT_FILTERED_SNP, PCT_DBSNP, NOVEL_TITV, DBSNP_TITV and NUM_SINGLETONS for the different options (I added in the attached plots the unfiltered VCF for comparison).

    And here the median per filter

    We clearly see a decrease in number of SINGLETONS in both hard_1 and hard_2. However hard_2 is much more stringent than hard_1 (PCT_FILTERED_SNP twice bigger in hard_2 than hard_1). We also have for both sets some outliers ( > 1000 NUM_SINGLETONS). Would you recommend to discard those samples, at least for Europeans?

    As you suggested, I also checked NA12878 for these aforementioned filters.

    In terms of PCT_DBSNP we are close to what is expected (100%)

    filter PCT_DBSNP
    unfiltered_vcf 0.978
    vqsr_99 0.981
    vqsr_90 0.981
    hard_standard 0.987
    hard_1 0.991
    hard_2 0.991

    I used rtg-tools and to compute Precision and Recall (and built the ROC curve, attached). Recall and precision look quite good for all options (SNP >98% ; indel > 85%).

    Based on the ROC hard_1 and hard_2 look like the best options. Hard_2 has lower NUM_SINGLETONS than hard_1 (median of 332 vs. 434) but also, almost two times more filtered SNPs. What would you recommend?

    Thanks a lot.


  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi Nicholas,

    I'm disappointed we couldn't get to the bottom of your bad VQSR results, but your hard filters appear to make a big improvement.

    If you're using the same hard filter for SNPs and indels I would suggest that you use the Hard1 QD criterion to prevent losing too many large indels that will have significant reference bias. But you should modify your QUAL requirement. Unfortunately the QUAL isn't really the overall quality of the site, it's the amount of variant evidence seen in the whole cohort (not accounting for artifacts). If you throw away QUAL > 500 sites then you'll lose a lot of your common variation. I wouldn't suggest QUAL < 500 either because that's going to be too strict on your rare variants. The forum recommendation of QUAL < 30 aims to remove variants that are only seen in one sample and have poor allele balance, which QD should also flag, but just to be safe. Hard1 looks the best to me if you remove the QUAL > 500 criterion. More biased sites will have a higher SOR, so that part of the Hard1 criteria is good as well.


  • NicoBxlNicoBxl Member

    Hi @gauthier ,

    Thank you for you answer. And do not be disappointed, you helped us a lot for the analysis of our data :smile:

    Concerning hard_1 there was a typo in my previous post. It's QUAL < 500 and not QUAL > 500 of course (sorry for the misunderstanding). Looking at various possible QUAL thresold (using the same values for QD, SOR, etc.. than hard_1 ) :

    The percentage of filtered SNPs didn't change significantly using different QUAL values. Nevertheless I will try two sets of filters for downstream analysis : QUAL < 30 and QUAL < 500. And then compare the results to see if there is differences.

    Again thank you a lot for your precious help.


Sign In or Register to comment.