VariantRecalibrator ignoring annotations in VCF after GenotypeGVCFs

bryandbryand San Francisco, USAMember

Hello,

I'm trying to run Variant Recalibrator on a data set which includes ~14 exomes. Using HC 3.5, I generated gVCF for each, then ran GenotypeGVCFs on them to generate a single multi-sample VCF. I then used the files in the hg38 bundle to run the following command line:

java -jar /usr/local/bioinformatics/src/gatk/GenomeAnalysisTK.jar -T VariantRecalibrator -R Homo_sapiens.GRCh38.dna.primary_assembly.fa -input in.vcf -recalFile out_snp.recal -tranchesFile out_snp.tranches --maxGaussians 4 -nt 4 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.vcf.gz -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_144.hg38.vcf.gz -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode SNP

Doing this, I get the error message: "##### ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator"

However, nearly every line of the VCF contains a QD annotation. I verified this using the commands:
$ grep -v "^#" in.vcf | wc -l
189927
$ grep -v "^#" in.vcf | grep QD= | wc -l
189925

As an example, here is one line from the VCF:
1 827209 . G C 5015.83 . AC=20;AF=0.909;AN=22;BaseQRankSum=1.48;ClippingRankSum=1.61;DP=123;ExcessHet=3.2222;FS=4.549;InbreedingCoeff=-0.1180;MLEAC=20;MLEAF=0.909;MQ=44.74;MQRankSum=-1.211e+00;QD=34.24;ReadPosRankSum=0.942;SOR=2.199 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,20:20:60:1|1:827209_G_C:871,60,0 1/1:0,4:4:12:1|1:827209_G_C:180,12,0 0/1:4,2:6:72:0|1:827209_G_C:72,0,248 0/1:3,13:16:87:0|1:827209_G_C:537,0,87 ./.:0,0:0 ./.:0,0:0 1/1:0,19:19:57:1|1:827209_G_C:816,57,0 1/1:0,4:4:12:1|1:827209_G_C:180,12,0 1/1:0,19:19:57:1|1:827209_G_C:847,57,0 1/1:0,2:2:6:1|1:827209_G_C:90,6,0 1/1:0,10:10:30:1|1:827209_G_C:450,30,0 1/1:0,13:13:39:1|1:827209_G_C:585,39,0 ./.:0,0:0 1/1:0,10:10:30:1|1:827209_G_C:429,30,0

As you can see, the QD score is present. Even when I remove the -an QD from the VariantRecalibrator call, GATK throws an error for the next annotation.

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bryand
    Hi,

    I suspect you do not have enough samples for VQSR. We recommend using at least 30 exome samples.

    However, you can either try hard filtering or pad your data with the 1000Genomes data or ExAC data. Have a look at this thread for information on how to pad your data.

    -Sheila

  • bryandbryand San Francisco, USAMember

    Hi Sheila,

    Thanks for getting back to me.

    I find your explanation surprising given the error message where GATK claims that there are NO variants which have a QD score, whereas all the variants actually do. Nevertheless, I'll give your suggestion a try - can I mix and match my WES with 1000G WGS, or should I use exome data which is there (which might originate from a totally different exome enrichment kit)?

    Thanks!
    Bryan

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    VariantRecalibrator has the most unhelpful error messages. When it says it didn't find variants with the QD annotation in the training set, that includes the case where there were in fact no variants in the training set (overlapping between your callset and the training resources). I do find it surprising that you wouldn't see any overlap with 14 exomes, but there are a few confounding factors here; one is that you're using hg38 which is technically not supported yet, and the other is the multithreading. Before doing anything drastic, try running without -nt and see what happens. You'll still need to get more exomes (because 14 is too few in any case) but I'm not 100% sure that that's the problem right now.

    And no you can't mix and match WGS with Exomes; the annotation profiles are too different. Best to stick with the same capture kits etc but if you have to mix and match that, it's a lesser evil.

Sign In or Register to comment.