CombineVariants throwing a bug error

everestial007, Greensboro, Member
edited March 7 in Ask the WDL team

@Geraldine_VdAuwera @Sheila @shlee

I am running into a problem while merging several single-sample VCFs into a single multi-sample VCF. CombineVariants fails with a runtime error that is reported as a possible bug:

$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -V MA605_phased.vcf -V MA611_phased.vcf -V MA622_phased.vcf -V MA625_phased.vcf -V MA629_phased.vcf -V Ncm8_phased.vcf -V Sp3_phased.vcf -V Sp21_phased.vcf -V Sp76_phased.vcf -V Sp154_phased.vcf -V Sp164_phased.vcf -V SpNor33_phased.vcf -o RBphased_variants.AllSamples.Final.vcf -genotypeMergeOptions UNSORTED

INFO 18:53:40,486 HelpFormatter - ----------------------------------------------------------------------------------
INFO 18:53:40,489 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO 18:53:40,489 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 18:53:40,490 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 18:53:40,490 HelpFormatter - [Tue Mar 06 18:53:40 EST 2018] Executing on Linux 4.13.0-36-generic amd64
INFO 18:53:40,490 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
INFO 18:53:40,494 HelpFormatter - Program Args: -T CombineVariants -R /media/everestial007/SeagateBackup4.0TB2/New_Alignment_Set/RefNindex_lyrata/lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -V MA605_phased.vcf -V MA611_phased.vcf -V MA622_phased.vcf -V MA625_phased.vcf -V MA629_phased.vcf -V Ncm8_phased.vcf -V Sp3_phased.vcf -V Sp21_phased.vcf -V Sp76_phased.vcf -V Sp154_phased.vcf -V Sp164_phased.vcf -V SpNor33_phased.vcf -o RBphased_variants.AllSamples.Final.vcf -genotypeMergeOptions UNSORTED
INFO 18:53:40,497 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.13.0-36-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12.
INFO 18:53:40,498 HelpFormatter - Date/Time: 2018/03/06 18:53:40
INFO 18:53:40,498 HelpFormatter - ----------------------------------------------------------------------------------
INFO 18:53:40,498 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 18:53:40,829 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 18:53:40,831 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 18:53:40,832 GenomeAnalysisEngine - Strictness is SILENT
INFO 18:53:41,263 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 18:53:42,443 GenomeAnalysisEngine - Preparing for traversal
INFO 18:53:42,451 GenomeAnalysisEngine - Done preparing for traversal
INFO 18:53:42,452 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:53:42,452 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 18:53:42,452 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

ERROR --
ERROR stack trace

java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.valueOf(Integer.java:766)
at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:724)
at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:132)
at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
at htsjdk.variant.variantcontext.GenotypesContext.iterator(GenotypesContext.java:465)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.mergeGenotypes(GATKVariantContextUtils.java:1573)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:1223)
at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:361)
at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:143)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: ""
ERROR ------------------------------------------------------------------------------------------

However, the merge works if I leave some of the samples out:

$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o RBphased.F1_Samples.merged.vcf

INFO 18:55:20,023 HelpFormatter - ----------------------------------------------------------------------------------
INFO 18:55:20,026 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO 18:55:20,027 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 18:55:20,028 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 18:55:20,028 HelpFormatter - [Tue Mar 06 18:55:19 EST 2018] Executing on Linux 4.13.0-36-generic amd64
INFO 18:55:20,028 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
INFO 18:55:20,033 HelpFormatter - Program Args: -T CombineVariants -R /media/everestial007/SeagateBackup4.0TB2/New_Alignment_Set/RefNindex_lyrata/lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o RBphased.F1_Samples.merged.vcf
INFO 18:55:20,037 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.13.0-36-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12.
INFO 18:55:20,037 HelpFormatter - Date/Time: 2018/03/06 18:55:20
INFO 18:55:20,037 HelpFormatter - ----------------------------------------------------------------------------------
INFO 18:55:20,037 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 18:55:20,258 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 18:55:20,258 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 18:55:20,259 GenomeAnalysisEngine - Strictness is SILENT
INFO 18:55:20,653 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 18:55:21,264 GenomeAnalysisEngine - Preparing for traversal
INFO 18:55:21,267 GenomeAnalysisEngine - Done preparing for traversal
INFO 18:55:21,268 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:55:21,268 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 18:55:21,268 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 18:55:23,074 ProgressMeter - done 1934.0 1.0 s 15.6 m 88.9% 1.0 s 0.0 s

INFO 18:55:23,077 ProgressMeter - Total runtime 1.81 secs, 0.03 min, 0.00 hours

Done. There were no warn messages.

The thing is, all these files came out of the same pipeline (the "phaser" tool, https://github.com/secastel/phaser) and have the same file structure.

Can you suggest what the error is referring to, and whether there is a solution for it?

I am attaching the associated files if need be.

Update: To work around the merging issue, I did the following, but I am not sure it is fully sound (it may produce problematic records at some positions of the merged VCF).

# merging all the samples that worked together
$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -V MA605_phased.vcf -V MA611_phased.vcf -V MA622_phased.vcf -V MA625_phased.vcf -V MA629_phased.vcf -V Sp21_phased.vcf -V Sp76_phased.vcf -V Sp154_phased.vcf -V Sp164_phased.vcf -o RBphased.ms01e_02g_03g_04h.MA605_611_622_625_629.Sp21_76_154_164.merged.vcf

# merge using bcftools, for the samples that didn't work
$ bcftools merge Ncm8_phased.vcf.gz Sp3_phased.vcf.gz SpNor33_phased.vcf.gz -O v -o RBphased.Ncm8.Sp3_Nor33.vcf

# now merge the two files
$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V RBphased.ms01e_02g_03g_04h.MA605_611_622_625_629.Sp21_76_154_164.merged.vcf -V RBphased.Ncm8.Sp3_Nor33.vcf -o RBphased_variants.AllSamples.Final03.vcf

# but there is a validation error
$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V RBphased_variants.AllSamples.Final03.vcf

INFO 19:45:21,504 HelpFormatter - ----------------------------------------------------------------------------------
INFO 19:45:21,507 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO 19:45:21,508 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 19:45:21,508 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 19:45:21,508 HelpFormatter - [Tue Mar 06 19:45:21 EST 2018] Executing on Linux 4.13.0-36-generic amd64
INFO 19:45:21,509 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
INFO 19:45:21,514 HelpFormatter - Program Args: -T ValidateVariants -R /media/everestial007/SeagateBackup4.0TB2/New_Alignment_Set/RefNindex_lyrata/lyrata_genome.fa -V RBphased_variants.AllSamples.Final03.vcf
INFO 19:45:21,517 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.13.0-36-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12.
INFO 19:45:21,518 HelpFormatter - Date/Time: 2018/03/06 19:45:21
INFO 19:45:21,518 HelpFormatter - ----------------------------------------------------------------------------------
INFO 19:45:21,518 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 19:45:21,700 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 19:45:21,700 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 19:45:21,701 GenomeAnalysisEngine - Strictness is SILENT
INFO 19:45:22,156 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 19:45:22,559 GenomeAnalysisEngine - Preparing for traversal
INFO 19:45:22,564 GenomeAnalysisEngine - Done preparing for traversal
INFO 19:45:22,569 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 19:45:22,569 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 19:45:22,577 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: File /media/everestial007/SeagateBackup4.0TB2/RNAseq_Data_Analyses/phaser_to_ASE_on_DiploidGenome/02_outputs_RBphased_VCF/new_merge/RBphased_variants.AllSamples.Final03.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 7:21087512 are not observed at all in the sample genotypes
ERROR ------------------------------------------------------------------------------------------
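
The validation message above names a single position (7:21087512), so looking at that record directly is a reasonable first step. The following is a minimal sketch, assuming an uncompressed, tab-delimited VCF; the function name `show_record` is hypothetical and not part of GATK or bcftools.

```shell
# show_record FILE.vcf CHROM POS
# Prints the #CHROM column header plus any record at CHROM:POS.
show_record() {
    awk -F'\t' -v chrom="$2" -v pos="$3" '
        /^#CHROM/ { print; next }          # keep the column header for orientation
        /^#/      { next }                 # skip the other header lines
        $1 == chrom && $2 == pos { print } # the record named in the error message
    ' "$1"
}
```

For example, `show_record RBphased_variants.AllSamples.Final03.vcf 7 21087512` should print the record that ValidateVariants complains about, so the ALT column can be compared by eye with the GT values of each sample.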

Thanks,


Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    The error message suggests that one of your vcf files is malformed. Try running ValidateVariants on your VCFs to find the problem.
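
One way to follow this suggestion is to validate each input file on its own rather than the merged output. The sketch below reuses the ValidateVariants command line that appears elsewhere in this thread; the variable names `GATK_JAR`, `REF_GENOME` and `DRY_RUN`, the function name `validate_each` and the log file naming are illustrative assumptions. It also assumes that GATK exits with a non-zero status when validation fails.

```shell
# validate_each FILE.vcf [FILE.vcf ...]
# Runs GATK 3 ValidateVariants on every file and prints one status line per file.
# With DRY_RUN=1 the commands are only printed, not executed.
validate_each() {
    for vcf in "$@"; do
        log="${vcf%.vcf}.validate.log"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "java -Xmx6g -jar $GATK_JAR -T ValidateVariants -R $REF_GENOME -V $vcf"
        elif java -Xmx6g -jar "$GATK_JAR" -T ValidateVariants \
                 -R "$REF_GENOME" -V "$vcf" > "$log" 2>&1; then
            echo "OK   $vcf"
        else
            # Show the first ERROR MESSAGE line from the GATK log.
            echo "FAIL $vcf: $(grep 'ERROR MESSAGE' "$log" | head -n 1)"
        fi
    done
}
```

A call such as `validate_each *_phased.vcf` (with `GATK_JAR` and `REF_GENOME` set) would then give a per-file overview. Note that ValidateVariants stops at the first failing record, so a file that fails on one check may still have other problems further down.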

  • everestial007, Greensboro, Member

    @Geraldine_VdAuwera: I ran that, but since all the files are single-sample VCFs, validation stops at the AN tag. AN is not the real issue, though; it's something else. See the commands and messages below.

    All three samples fail validation: ms01e and ms02g at the AN tag, Sp3 at an ALT allele that is not observed in the genotypes.

    $ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V ms01e_phased.vcf
    .......
    ......
    ##### ERROR MESSAGE: File /media/everestial007/SeagateBackup4.0TB2/RNAseq_Data_Analyses/phaser_to_ASE_on_DiploidGenome/02_outputs_RBphased_VCF/new_merge/ms01e_phased.vcf fails strict validation: the Allele Number (AN) tag is incorrect for the record at position 2:181028, 6 vs. 2
    ##### ERROR ------------------------------------------------------------------------------------------

    $ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V ms02g_phased.vcf
    ......
    .......
    ##### ERROR MESSAGE: File /media/everestial007/SeagateBackup4.0TB2/RNAseq_Data_Analyses/phaser_to_ASE_on_DiploidGenome/02_outputs_RBphased_VCF/new_merge/ms02g_phased.vcf fails strict validation: the Allele Number (AN) tag is incorrect for the record at position 2:181028, 6 vs. 2
    ##### ERROR ------------------------------------------------------------------------------------------

    $ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V Sp3_phased.vcf
    .........
    ..........
    ##### ERROR MESSAGE: File /media/everestial007/SeagateBackup4.0TB2/RNAseq_Data_Analyses/phaser_to_ASE_on_DiploidGenome/02_outputs_RBphased_VCF/new_merge/Sp3_phased.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 2:180278 are not observed at all in the sample genotypes
    ##### ERROR ------------------------------------------------------------------------------------------

    Merging "ms01e" and "ms02g" works even with the invalid AN:

    $ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -o RBphased.ms01e_02g.vcf
    ...........
    .............
    Done. There were no warn messages.
    ------------------------------------------------------------------------------------------

    But merging "ms01e", "ms02g" and "Sp3" doesn't work:

    $ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V Sp3_phased.vcf -o RBphased.ms01e_02g.Sp3.vcf
    .......
    .......
    ##### ERROR MESSAGE: For input string: ""
    ##### ERROR ------------------------------------------------------------------------------------------
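
The stack trace in the original post shows `Integer.valueOf` failing on an empty string inside `AbstractVCFCodec.createGenotypeMap`. That suggests, though it does not prove, that some integer-valued field in a sample column of the failing files is empty. The sketch below scans an uncompressed VCF for empty sub-fields in the sample columns; the function name `find_empty_fields` is hypothetical, and nothing in the thread confirms which FORMAT key is affected or whether phaser writes such values.

```shell
# find_empty_fields FILE.vcf
# Prints CHROM, POS, column number, FORMAT key and the raw sample value for
# every sample sub-field that is an empty string (for example "0|1:12:").
find_empty_fields() {
    awk -F'\t' '
        /^#/ { next }                               # skip header lines
        {
            nkeys = split($9, keys, ":")            # FORMAT keys, e.g. GT:DP:GQ
            for (col = 10; col <= NF; col++) {
                if ($col == "") {                   # whole sample column is empty
                    print $1, $2, "col" col, "(all)", $col
                    continue
                }
                nvals = split($col, vals, ":")
                for (i = 1; i <= nvals; i++)
                    if (vals[i] == "")
                        print $1, $2, "col" col, (i <= nkeys ? keys[i] : "?"), $col
            }
        }
    ' "$1"
}
```

Running `find_empty_fields Sp3_phased.vcf` and comparing the result with the same call on `ms01e_phased.vcf` could show whether the files that break the merge differ from the ones that merge cleanly.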

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    You might be able to get past the AN thing by setting validation to lenient (iirc it’s possible to do it, though I don’t remember the argument off the top of my head).

    Unfortunately there’s not much we can do for you if your files have formatting issues that seem to have been introduced by a third party program. Have you tried validating the files you had before putting them through phaser?
