
The allele with index <##> is not defined in the REF/ALT columns in the record : CombineVariants

dantaki (La Jolla, Member)

Hello, I am having an issue combining VCFs. I am using GATK 3.8-1 for the CombineVariants step, which is the step producing the error.

I have a VCF containing SNPs and INDELs. I first split the VCF using GATK 4.0.5.1. This step does not produce an error, and I am able to run bgzip and tabix on the resulting VCFs without error.

/home/dantakli/bin/gatk-4.0.5.1/gatk SplitVcfs --INPUT $1 --SNP_OUTPUT $out\.snps.vcf --INDEL_OUTPUT $out\.indels.vcf --STRICT=false

My next step is to combine the SNPs from the previous command with another SNP VCF file that contains different samples (set1). At this combine step, I get the allele index error.

Here's the trace. set2 is the VCF that was split above, and it is the file that produces the error.

INFO  11:48:34,113 HelpFormatter - ------------------------------------------------------------------------------------
INFO  11:48:34,119 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50
INFO  11:48:34,119 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  11:48:34,119 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  11:48:34,119 HelpFormatter - [Sun Jul 15 11:48:34 PDT 2018] Executing on Linux 2.6.32-696.10.3.el6.x86_64 amd64
INFO  11:48:34,119 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02
INFO  11:48:34,122 HelpFormatter - Program Args: -T CombineVariants -R /reference/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr5:1-10000000 --genotypemergeoption UNIQUIFY --variant:set1,vcf set1.snps.hg38.chr5.vcf.gz --variant:set2,vcf set2.chr5.snps.vcf.gz -o /set1.set2.chr5.1-10000000.snps.vcf

...

##### ERROR MESSAGE: The allele with index 107548038 is not defined in the REF/ALT columns in the record

I ran ValidateVariants on the set2 SNP file and got the same error.

INFO  11:14:16,261 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50
...
INFO  11:14:16,262 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02
INFO  11:14:16,265 HelpFormatter - Program Args: -T ValidateVariants -L chr5:1-10000000 -R GRCh38_full_analysis_set_plus_decoy_hla.fa -V set2.chr5.snps.vcf.gz
...

INFO  11:14:16,301 NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/dantakli/bin/GenomeAnalysisTK-3.8-1/GenomeAnalysisTK.jar!/com/intel/gkl/native/libgkl_compression.so
INFO  11:14:16,313 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO  11:14:16,313 GenomeAnalysisEngine - Inflater: IntelInflater
INFO  11:14:16,313 GenomeAnalysisEngine - Strictness is SILENT
INFO  11:14:17,634 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  11:14:18,822 IntervalUtils - Processing 10000000 bp from intervals
WARN  11:14:18,822 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  11:14:18,890 GenomeAnalysisEngine - Preparing for traversal
....
##### ERROR MESSAGE: File set2.chr5.snps.vcf.gz fails strict validation: The allele with index 107548038 is not defined in the REF/ALT columns in the record
##### ERROR ------------------------------------------------------------------------------------------

I get the same error with the SNP+INDEL VCF (before splitting) too:

##### ERROR MESSAGE: File set2.chr5.vcf fails strict validation: The allele with index 107548038 is not defined in the REF/ALT columns in the record
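A note on what the message refers to: in a VCF record, each sample's GT field lists allele indices, where 0 is REF and 1..n are the ALT alleles in order. An index such as 107548038 is far larger than any plausible allele count, which suggests a misplaced value rather than a real genotype. The sketch below is one possible way to scan a record for such indices. It assumes a plain-text, tab-delimited VCF data line, and the function name `bad_gt_indices` is hypothetical (it is not part of GATK).

```python
import re


def bad_gt_indices(line):
    """Return (column, index) pairs where a GT allele index is not defined by REF/ALT.

    `column` is the 1-based column number of the offending sample field.
    Hypothetical helper for illustration; assumes one tab-delimited VCF data line.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return []  # no FORMAT/sample columns to check

    # Number of defined alleles: REF plus each ALT (a lone "." means no ALT).
    alts = [a for a in fields[4].split(",") if a != "."]
    n_alleles = 1 + len(alts)

    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return []
    gt_pos = fmt.index("GT")

    problems = []
    for col, sample in enumerate(fields[9:], start=10):
        parts = sample.split(":")
        if gt_pos >= len(parts):
            continue  # trailing FORMAT keys may be dropped for a sample
        # GT alleles are separated by "/" (unphased) or "|" (phased).
        for allele in re.split(r"[/|]", parts[gt_pos]):
            if allele.isdigit() and int(allele) >= n_alleles:
                problems.append((col, int(allele)))
    return problems
```

Running this over each non-header line and printing the line number whenever the result is non-empty should point at the record to inspect by eye.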

I don't get this error when splitting the VCF into SNPs and INDELs. So why am I getting it when I combine the variants?

Thanks.

Best Answer

  • dantaki (La Jolla)
    Accepted Answer

    I solved it. The VCFs I'm working on have thousands of samples, so it's hard to spot-check for errors. There was a formatting error in the VCF: something odd happened and two VCF entries were combined into one record.
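With thousands of samples, a merged record like this is hard to see by eye but easy to detect by counting columns. Here is a minimal sketch of that check, assuming the problem shows up as a data line whose number of tab-separated fields differs from the `#CHROM` header line. The function names `open_vcf` and `find_malformed_records` are hypothetical, and this is not necessarily how the original file was diagnosed.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def find_malformed_records(lines):
    """Return (line_number, n_fields, expected_fields) for records whose
    column count does not match the #CHROM header line."""
    expected = None
    bad = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        n_fields = len(line.split("\t"))
        if line.startswith("#CHROM"):
            expected = n_fields  # header defines the expected column count
        elif expected is not None and n_fields != expected:
            bad.append((lineno, n_fields, expected))
    return bad


if __name__ == "__main__":
    import sys

    # Usage: python check_vcf_columns.py set2.chr5.snps.vcf.gz
    with open_vcf(sys.argv[1]) as handle:
        for lineno, n_fields, expected in find_malformed_records(handle):
            print(f"line {lineno}: {n_fields} fields, expected {expected}")
```

Two records fused on one line will normally report close to twice the expected field count, which makes the offending line easy to identify.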
