Impact of VQRS variant set size on model

edited July 2017 in Ask the GATK team

Hi,

We are evaluating the option to gather a set of 'good reference samples' to function as additional data in the VQRS step during WES analysis. We would like to do so, since we receive trio-based data, and three samples is typically not recommended for VQRS. As a measure of performance, we try to retrieve the Genome-in-a-bottle variants provided in Illumina's Platinum Call set, in the sample sample sequenced & called in-house. I was rather surprised to see that the sensitivity seems to go down by including more samples in the VQRS training. How can this be explained, and should we thus stick to a lower number of samples ?

image

Some more details:

  • Samples are of mixed ethnicity, and this cannot be split up due to insufficient samples.
  • All samples are prepped and sequenced using identical kits (SureSelect.V5 , HiSeq4000), but not in the same experiment
  • We follow best practice, but perform single sample variant calling, since we are looking for sporadic/de novo mutations.
  • Sensitivity is defined as 'Matching'/(matching + mismatch + not_called + low_quality + non_covered)
  • matching is defined as the exact same genotype, not just variant, in the PASS tranche.
  • 90 VCF files were called once, and 10 random sets per VQRS-training-size were extracted from this set (+ the GIAB-VCF) to train the model.

=> the variability is observed by variants moving from 'matching' to 'low_quality', due to failure to PASS the vqsr filtering.

SNP VQRS command (followed by the -input file.vcf list)
java -Djava.io.tmpdir=/tmp -Xmx10g -jar /opt/NGS/binaries/gatk/GATK_3.5.0/GenomeAnalysisTK.jar -nt 1 -S LENIENT -T VariantRecalibrator -R /opt/NGS/References/hg19/samtools/0.1.19/hg19.fasta -mode SNP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -recalFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/model_files/SNP.3.samples.set_1.iter_1.vcf' -tranchesFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/model_files/SNP.3.samples.set_1.iter_1.tranches' -rscriptFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/Rscripts/SNP.3.samples.set_1.iter_1.R' -minNumBad 1000 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /opt/NGS/References/hg19/gatk_bundle/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 /opt/NGS/References/hg19/gatk_bundle/1000G_omni2.5.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /opt/NGS/References/hg19/gatk_bundle/dbsnp_137.hg19.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /opt/NGS/References/hg19/gatk_bundle/1000G_phase1.snps.high_confidence.hg19.vcf

INDEL VQRS command (followed by the -input file.vcf list)
java -Djava.io.tmpdir=/tmp -Xmx10g -jar /opt/NGS/binaries/gatk/GATK_3.5.0/GenomeAnalysisTK.jar -nt 1 -S LENIENT -T VariantRecalibrator -R /opt/NGS/References/hg19/samtools/0.1.19/hg19.fasta -mode INDEL -an QD -an MQRankSum -an ReadPosRankSum -an FS -recalFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/model_files/INDEL.3.samples.set_1.iter_1.vcf' -tranchesFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/model_files/INDEL.3.samples.set_1.iter_1.tranches' -rscriptFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/Rscripts/INDEL.3.samples.set_1.iter_1.R' -minNumBad 1000 -mG 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 /opt/NGS/References/hg19/gatk_bundle/Mills_and_1000G_gold_standard.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /opt/NGS/References/hg19/gatk_bundle/dbsnp_137.hg19.vcf

Issue · Github
by shlee

Issue Number
2268
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

Best Answer

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @geertvandeweyer,

    I'll pass your question on to a developer, since it may interest them.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @geertvandeweyer,

    Our developer would like to know if your truthset sample (NA12878) is within your callset. Also, does the reduction in sensitivity correlate with an increase in specificity?

  • Hi @shlee ,

    Thank you for the feedback. We used the sample NA12877 (not NA12878) from the platinum data. I don't know exactly what you refer to as 'truthset' and 'callset', but the VCF of this sample was included as -input during VariantRecalibration. As far as I know. NA12877 is also part of the HapMap phase 3 study, so in that sense, it is part of the truth set.

    Best,
    Geert

  • edited July 2017

    Hi @shlee ,

    I forgot to answer the second question. With regard to specificity, that's hard to tell. The False Discovery Rate doesn't seem to alter much:

    image

    On the other hand, the main effect can be seen in the number of variants not passing the set VQRS filter. Both TP and FP variants are filtered out, leading in an overall reduction in sensivity. specificity was already good, so there's not much improvement there.

    image

    note: the numbers in these plots are #variants (FP or not passing filters) for the NA12877 sample only.

    Best,
    Geert

  • shleeshlee CambridgeMember, Broadie, Moderator

    Thanks @geertvandeweyer for sharing your findings. Our developer says that they have seen an increase in sensitivity in singletons for similar sample sizes as yours. I've attached a poster detailing such from October of 2016.

    image

  • Hi @shlee ,

    Does this reflect our use case? We've not regenotyped with varying numbers of samples. We've just trained the VQSR model, keeping input VCF files constant. What we see is that the number of true variants in the PASS tranch is reducing when we include more samples during training. The variants are all present in the initial VCF though.

    Geert

  • hi @shlee ,

    Thank you for the feedback. The provided information is highly appreciated. I understand that our usecase isn't fully complying to the guidelines, but I hoped to get a bit more insight in our observations. I can understand the load on your team, so I'm already glad you took some time to discuss it with the devs.

    Practically we already divide the exome in high and low conf regions, as you suggested, and results are very good , with over 97% retrieval in high conf regions.

    Best regards,

    Geert

Sign In or Register to comment.