Impact of VQRS variant set size on model
We are evaluating the option to gather a set of 'good reference samples' to function as additional data in the VQRS step during WES analysis. We would like to do so, since we receive trio-based data, and three samples is typically not recommended for VQRS. As a measure of performance, we try to retrieve the Genome-in-a-bottle variants provided in Illumina's Platinum Call set, in the sample sample sequenced & called in-house. I was rather surprised to see that the sensitivity seems to go down by including more samples in the VQRS training. How can this be explained, and should we thus stick to a lower number of samples ?
Some more details:
- Samples are of mixed ethnicity, and this cannot be split up due to insufficient samples.
- All samples are prepped and sequenced using identical kits (SureSelect.V5 , HiSeq4000), but not in the same experiment
- We follow best practice, but perform single sample variant calling, since we are looking for sporadic/de novo mutations.
- Sensitivity is defined as 'Matching'/(matching + mismatch + not_called + low_quality + non_covered)
- matching is defined as the exact same genotype, not just variant, in the PASS tranche.
- 90 VCF files were called once, and 10 random sets per VQRS-training-size were extracted from this set (+ the GIAB-VCF) to train the model.
=> the variability is observed by variants moving from 'matching' to 'low_quality', due to failure to PASS the vqsr filtering.
SNP VQRS command (followed by the -input file.vcf list)
java -Djava.io.tmpdir=/tmp -Xmx10g -jar /opt/NGS/binaries/gatk/GATK_3.5.0/GenomeAnalysisTK.jar -nt 1 -S LENIENT -T VariantRecalibrator -R /opt/NGS/References/hg19/samtools/0.1.19/hg19.fasta -mode SNP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -recalFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/model_files/SNP.3.samples.set_1.iter_1.vcf' -tranchesFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/model_files/SNP.3.samples.set_1.iter_1.tranches' -rscriptFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/Rscripts/SNP.3.samples.set_1.iter_1.R' -minNumBad 1000 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /opt/NGS/References/hg19/gatk_bundle/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 /opt/NGS/References/hg19/gatk_bundle/1000G_omni2.5.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /opt/NGS/References/hg19/gatk_bundle/dbsnp_137.hg19.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /opt/NGS/References/hg19/gatk_bundle/1000G_phase1.snps.high_confidence.hg19.vcf
INDEL VQRS command (followed by the -input file.vcf list)
java -Djava.io.tmpdir=/tmp -Xmx10g -jar /opt/NGS/binaries/gatk/GATK_3.5.0/GenomeAnalysisTK.jar -nt 1 -S LENIENT -T VariantRecalibrator -R /opt/NGS/References/hg19/samtools/0.1.19/hg19.fasta -mode INDEL -an QD -an MQRankSum -an ReadPosRankSum -an FS -recalFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/model_files/INDEL.3.samples.set_1.iter_1.vcf' -tranchesFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/model_files/INDEL.3.samples.set_1.iter_1.tranches' -rscriptFile '/home/wesdev/Validatie_Cellijn//VQSR.Models/Rscripts/INDEL.3.samples.set_1.iter_1.R' -minNumBad 1000 -mG 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 /opt/NGS/References/hg19/gatk_bundle/Mills_and_1000G_gold_standard.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /opt/NGS/References/hg19/gatk_bundle/dbsnp_137.hg19.vcf