GATK VQSR Errors GATK3.4/3.5 Valid VCF - but Allele not defined

We are having problems running GATK VQSR on recent VCF files generated using GATK HaplotypeCaller in the N+1 mode.

When running VariantRecalibrator we get the following error (with both 3.4 and 3.5), the error occurs.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): 
##### ERROR MESSAGE: The allele with index 5 is not defined in the REF/ALT columns in the record
##### ERROR ------------------------------------------------------------------------------------------


$JAVA -jar $GATK -T VariantRecalibrator -R $REF -input GATK-HC-3.4-DAMONA_10_Jan_2016.vcf.gz \
        -resource:ill770k,known=false,training=true,truth=true,prior=15.0 /home/aeonsim/refs/LIC-seqdHDs.AC1plus.DP10x.vcf.gz \
        -resource:VQSRMend,known=false,training=true,truth=false,prior=10.0  /home/projects/bos_taurus/damona/vcfs/recombination/GATK-HC-95.0tr-GQ40-10Dec-2015run.mend.Con.vcf.gz \
        -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 \
        -recalFile GATK-HC-DAMONA-Jan-2016.recal -tranchesFile GATK-HC-DAMONA-Jan-2016.tranches \
        -rscriptFile GATK-HC-DAMONA-Jan-2016.R

The VCF files were created using GenotypeGVCFs (v3.4-46-gbc02625) on GVCF files created with GATK HaplotypeCaller (N+1 mode, same version) which had been combined using GATK CombineGVCFs to merge ~750 whole bovine genome GVCFs to of 20-90 individuals. All stages occurred to finish successfully with out error until the VQSR stage. All processing of the GATK files was done using GATK tools (v3.4-46) and running the ValidateVariants tool on the files only gives warnings about “Reference allele is too long (116)” nothing about alleles not being defined in the Ref/Alt column.

java -jar GenomeAnalysisTK.jar -T ValidateVariants -R bosTau6.fasta  -V GATK.chr29.vcf.gz --dbsnp BosTau6_
dbSNP140_NCBI.vcf.gz -L chr29
ValidateVariants - Reference allele is too long (116) at position chr29:2513143; skipping that record. Set --referenceWindowStop >= 116
Successfully validated the input file.  Checked 531640 records with no failures.

Older versions of GATK worked successfully on part of this dataset last year, but now we've completed the dataset and rerun everything with the latest version of GATK (at the time 3.4-46) using only the GATK tools, GATK VQSR is unable to process the files that were created by it's own HaplotypeCaller (and from the same version), while validation tools supplied with GATK claim there is no problem and programs like bcftools are happily processing the files.

It appears to me that the problem may be around the introduction of the * in the Alt column for VCF4.2 (?) and a failure of VQSR to fully support this?

Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @aeonsim, we've run VQSR successfully on records with the * alt allele so I doubt that's the problem here. Out of curiosity, have you tried running with Java 1.7 instead of 1.8? Some of the list/array data structures have changed between 1.7 and 1.8 in ways that may be disruptive depending on how they're used, which is why running GATK tools on Java 1.8 is not yet supported. And the fact that the error is specifically expressed in terms of the index position makes me think that a data structure error is more likely than a content error.

  • @Geraldine_VdAuwera running with Java 1.7 makes no difference, I've finally managed to track the error down to one of the VCF files being used for the training set. After changing to a different VCF for the training set the error no longer occurs. I find somewhat surprising seeing I though those files were only used to get the positions...

Sign In or Register to comment.