The allele with index <##> is not defined in the REF/ALT columns in the record : CombineVariants

dantaki (La Jolla, Member)

Hello, I am having an issue combining VCFs. I am using GATK 3.8-1 for the CombineVariants step that produces the error.

I have a VCF containing SNPs and INDELs. I first split the VCF using GATK 4.0.5.1. This step does not produce an error, and I am able to run bgzip and tabix on the resulting VCFs without error.

/home/dantakli/bin/gatk-4.0.5.1/gatk SplitVcfs --INPUT $1 --SNP_OUTPUT $out\.snps.vcf --INDEL_OUTPUT $out\.indels.vcf --STRICT=false

My next step is to combine the SNPs from the previous command with another SNP VCF file that contains different samples (set1). At this combine step, I get the allele index error.

Here's the trace. set2 is the VCF that was split above, and it is the file that produces the error.

INFO  11:48:34,113 HelpFormatter - ------------------------------------------------------------------------------------
INFO  11:48:34,119 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50
INFO  11:48:34,119 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  11:48:34,119 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  11:48:34,119 HelpFormatter - [Sun Jul 15 11:48:34 PDT 2018] Executing on Linux 2.6.32-696.10.3.el6.x86_64 amd64
INFO  11:48:34,119 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02
INFO  11:48:34,122 HelpFormatter - Program Args: -T CombineVariants -R /reference/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr5:1-10000000 --genotypemergeoption UNIQUIFY --variant:set1,vcf set1.snps.hg38.chr5.vcf.gz --variant:set2,vcf set2.chr5.snps.vcf.gz -o /set1.set2.chr5.1-10000000.snps.vcf

...

##### ERROR MESSAGE: The allele with index 107548038 is not defined in the REF/ALT columns in the record

I ran ValidateVariants on the set2 SNP file and got the same error.

INFO  11:14:16,261 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50
...
INFO  11:14:16,262 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02
INFO  11:14:16,265 HelpFormatter - Program Args: -T ValidateVariants -L chr5:1-10000000 -R GRCh38_full_analysis_set_plus_decoy_hla.fa -V set2.chr5.snps.vcf.gz
...

INFO  11:14:16,301 NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/dantakli/bin/GenomeAnalysisTK-3.8-1/GenomeAnalysisTK.jar!/com/intel/gkl/native/libgkl_compression.so
INFO  11:14:16,313 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO  11:14:16,313 GenomeAnalysisEngine - Inflater: IntelInflater
INFO  11:14:16,313 GenomeAnalysisEngine - Strictness is SILENT
INFO  11:14:17,634 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  11:14:18,822 IntervalUtils - Processing 10000000 bp from intervals
WARN  11:14:18,822 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  11:14:18,890 GenomeAnalysisEngine - Preparing for traversal
....
##### ERROR MESSAGE: File set2.chr5.snps.vcf.gz fails strict validation: The allele with index 107548038 is not defined in the REF/ALT columns in the record
##### ERROR ------------------------------------------------------------------------------------------

I get the same error with the SNP+INDEL VCF (before splitting) too:

##### ERROR MESSAGE: File set2.chr5.vcf fails strict validation: The allele with index 107548038 is not defined in the REF/ALT columns in the record
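For anyone hitting the same message: as far as I understand it, the error means a genotype (GT) entry refers to an allele number that the record's REF/ALT columns do not define (0 is REF, 1 is the first ALT, and so on). An index as large as 107548038 looks more like a stray value sitting in a genotype column than a real allele. Below is a minimal Python sketch, not part of GATK, for locating such records. The function name `find_bad_allele_indices` is made up for illustration, and it assumes a tab-delimited VCF with GT listed in the FORMAT column.

```python
import re


def find_bad_allele_indices(lines):
    """Yield (line_number, chrom, pos, column_number, gt) for each record
    whose GT refers to an allele index not defined by REF/ALT.

    Hypothetical helper for illustration; reports at most one hit per sample.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns to check
        chrom, pos, alt, fmt = fields[0], fields[1], fields[4], fields[8]
        # REF counts as allele 0; each comma-separated ALT adds one more.
        n_alleles = 1 + (0 if alt == "." else len(alt.split(",")))
        keys = fmt.split(":")
        if "GT" not in keys:
            continue
        gt_position = keys.index("GT")
        # Sample columns start at the 10th column (1-based).
        for column_number, sample in enumerate(fields[9:], start=10):
            parts = sample.split(":")
            if gt_position >= len(parts):
                continue
            gt = parts[gt_position]
            for allele in re.split(r"[/|]", gt):
                if allele.isdigit() and int(allele) >= n_alleles:
                    yield (line_number, chrom, pos, column_number, gt)
                    break
```

For a bgzipped file such as set2.chr5.snps.vcf.gz, the lines can be read with Python's `gzip.open(path, "rt")` and passed straight to the function.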

I don't get this error when splitting the VCF into SNPs and INDELs. So why am I getting it when I combine the variants?

Thanks.

Best Answer

  • dantaki (La Jolla)
    Accepted Answer

    I solved it. The VCFs I'm working on have thousands of samples, so it's hard to spot-check for errors. There was a formatting error in the VCF: something odd happened and two VCF entries were combined into one record.
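With thousands of samples, a record like this is easier to find by counting columns than by eye. Here is a minimal Python sketch of that idea, assuming the merged entries leave the record with a different number of tab-separated fields than the `#CHROM` header line. The function name `find_malformed_records` is hypothetical, and this is not necessarily how the problem was actually tracked down in this case.

```python
def find_malformed_records(lines):
    """Yield (line_number, n_fields, expected) for each record whose
    tab-separated field count differs from the #CHROM header line.

    Hypothetical helper for illustration only.
    """
    expected = None
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        n_fields = len(line.split("\t"))
        if line.startswith("#CHROM"):
            expected = n_fields  # header defines the expected column count
            continue
        if expected is not None and n_fields != expected:
            yield (line_number, n_fields, expected)
```

For a bgzipped VCF, open it with `gzip.open(path, "rt")` and pass the file handle to the function. Each reported line number can then be inspected directly in the file.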
