
Errors in CombineGVCFs

Hi,
I converted previously generated BAM files from an old project into GVCFs using GATK4. The BAM files were generated with an older GATK3 release. When I tried to combine 50 GVCF files with CombineGVCFs, I got a malformed GVCF file error. Following the solutions already posted for this error, I regenerated the idx files, ran ValidateVariants, and regenerated the GVCF. Regenerating the idx files didn't help. ValidateVariants failed for all my GVCFs (I am not sure why, since a test run on 3 files was successful). To deal with this later, I omitted the file reported as malformed and tried to join the remaining 49. Again, I got an error -

01:08:08.917 INFO ProgressMeter - 1:16621897 18.2 119613000 6580730.3
01:08:18.918 INFO ProgressMeter - 1:16754369 18.3 120875000 6589731.2
01:08:26.432 INFO CombineGVCFs - Shutting down engine
[November 22, 2019 1:08:26 AM MST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 23.99 minutes.
Runtime.totalMemory()=95565643776
htsjdk.tribble.TribbleException$InternalCodecException: The following invalid GT allele index was encountered in the file: <NON_REF>

I performed interval padding and interval merging as suggested, but it didn't help.

Commands I used:

HaplotypeCaller -

gatk HaplotypeCaller --java-options "-Xmx8G -XX:+UseParallelGC -XX:ParallelGCThreads=4" -R ../reference/GRCh37.fa -I test.bam -O test.raw.snps.indels.g.vcf -L b37_wgs_calling_regions.v1.list -ERC GVCF

CombineGVCFs -
gatk --java-options "-Xmx150g" CombineGVCFs -V test.list -R ../reference/GRCh37.fa -O Combine_49.g.vcf -ip 50 -imr ALL

Another error I encountered, with a different set of GVCFs, is a missing INFO definition in the VCF header.
As a test I combined 3 GVCF files, and they combined perfectly. Please help me resolve these errors. With so many errors in this step, I am thinking of going back to the old UnifiedGenotyper.

Thank you
Ankita
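
For the missing-INFO header error mentioned in the question, one thing that may help narrow it down is comparing which INFO IDs each gVCF header declares. The following is a minimal sketch, not a GATK feature: it assumes uncompressed `.g.vcf` text files, and the helper names `read_header`, `info_ids`, and `missing_info_by_file` are hypothetical.

```python
import re

# Matches a header meta line such as: ##INFO=<ID=END,Number=1,Type=Integer,...>
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def read_header(path):
    """Read only the '##' meta lines of an uncompressed VCF/gVCF (assumed plain text)."""
    lines = []
    with open(path) as fh:
        for line in fh:
            if not line.startswith("##"):
                break  # the '#CHROM' line and the records follow
            lines.append(line)
    return lines


def info_ids(header_lines):
    """Return the set of INFO IDs declared in an iterable of header lines."""
    ids = set()
    for line in header_lines:
        if not line.startswith("##"):
            break
        match = INFO_ID.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def missing_info_by_file(headers):
    """headers maps file name -> list of header lines.

    Returns file name -> sorted INFO IDs that some other file declares
    but this file does not. Files that declare everything are omitted.
    """
    per_file = {name: info_ids(lines) for name, lines in headers.items()}
    union = set().union(*per_file.values()) if per_file else set()
    return {
        name: sorted(union - ids)
        for name, ids in per_file.items()
        if union - ids
    }
```

Building the input with `{p: read_header(p) for p in paths}` over the files named in `test.list` and printing the result would show which files, if any, lack a definition that the others carry. It does not say which tool or version produced the difference.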

Answers

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @Ankita_N

    1. To reiterate what you said, you generated the bams using GATK3? Which tools from GATK3 did you use?
    2. Can you please run ValidateSam using the latest GATK4.1.4.0 to verify if the bam files are not malformed?
    3. Which version of GATK4 are you using for HaplotypeCaller and CombineGVCFs?
    4. Please post the entire error log when you run ValidateVariants. It is possible the gvcfs are malformed.
    5. Please post the entire error log when you run CombineGVCFs.
  • Ankita_N (Calgary; Member)
    Hi @bhanuGandham,

    Thank you for your reply.
    1. To reiterate what you said, you generated the bams using GATK3? Which tools from GATK3 did you use?
    These BAM files were generated using GATK v3.2-2. This is an old project that we want to reanalyze using the gVCF approach.
    After posting my query, I checked the BAM files and realized that they carry no tag indicating that indel realignment and base quality recalibration were performed. As far as I know, such tags (for example, the BD/BI tags) are the only way of knowing whether files were recalibrated.

    ### lines from BAM file ####
    A00454:25:H757HDSXX:2:1671:18014:7701 99 1 21408 0 151M = 21620 363 CTTGTCCCTTCCGTGACGGATGCCTGAGGAACCTTCCCCAAACTCTTCTGTCCCATCCCTGCCCTGCTCAAAATCCAATCACAGCTCCCTAACACGCCTGAATCAACTTGAAGTCCTGTCTTGAGTAATCCGTGGGCCCTAACTCACTCAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:151 AS:i:151 XS:i:151 RG:Z:001_S1 XA:Z:9,+21521,151M,0;15,-102509606,151M,0;19,+63016,151M,1;2,-114349461,151M,3;12,-82120,151M,4;
    A00454:25:H757HDSXX:1:2544:3812:14481 147 1 21408 0 150M = 21259 -299 GTTGTCCCTTCCGTGACGGATGCCTGAGGAACCTTCCCCAAACTCTTCTGTCCCATCCCTGCCCTGCTCAAAATCCAATCACAGCTCCCTAACACGCCTGAATCAACTTGAAGTCCTGTCTTGAGTAATCCGTGGGCCCTAACTCACTCA ,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:1 MD:Z:0C149 AS:i:149 XS:i:149 RG:Z:001_S1 XA:Z:9,-21521,150M,1;15,+102509607,150M,1;19,-63016,150M,2;2,+114349462,146M4S,2;12,+82121,146M4S,3;
    A00454:25:H757HDSXX:1:2119:8938:35759 163 1 21416 0 151M = 21622 357 TTCCGTGACGGATGCCTGAGGAACCTTCCCCAAACTCTTCTGTCCCATCCCTGCCCTGCTCAAATTCCAATCACAGCTCCCTAACACTCCTGAATCAACTTGAAGTCCTGTCTTGAGTAATCCGTGGGCCCTAACTCACTCATCCCGACTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:3 MD:Z:64A22G58A4 AS:i:136 XS:i:136 RG:Z:001_S1 XA:Z:9,+21529,151M,3;2,-114349453,151M,3;15,-102509598,151M,3;19,+63024,151M,4;12,-82112,151M,4;
    A00454:25:H757HDSXX:4:1510:20582:31720 147 1 21420 0 151M = 21179 -392 GTGATGGATGCTTGAGGAACCTTCCCCAAACTCTTCTGTCCCATCCCTGCCCTGCTCAAAATCCAATCACAGCTCCCTAACACTCCTGAATCAACTTGAAGTCCTGTCTTGAGTAATCCGTGGGCCCTAACTCACTCATCCCAACTCTTCA FFFFFFFFFFFFFF:FFFFFFF,FFFFFF:FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFF NM:i:3 MD:Z:4C6C71G67 AS:i:136 XS:i:136 RG:Z:001_S1 XA:Z:19,-63028,151M,3;2,+114349449,151M,3;9,-21533,151M,3;15,+102509594,151M,3;12,+82108,151M,4;

    2. Can you please run ValidateSam using the latest GATK4.1.4.0 to verify if the bam files are not malformed?

    ValidateSam doesn’t give any error.
    ### log ###

    INFO 2019-11-25 14:43:45 SamFileValidator Validated Read 750,000,000 records. Elap00,000: 42s. Last read position: GL000199.1:162,186
    INFO 2019-11-25 14:44:30 SamFileValidator Validated Read 760,000,000 records. Elap00,000: 44s. Last read position: */*
    INFO 2019-11-25 14:45:00 SamFileValidator Validated Read 770,000,000 records. Elap00,000: 30s. Last read position: */*
    INFO 2019-11-25 14:45:31 SamFileValidator Validated Read 780,000,000 records. Elap00,000: 30s. Last read position: */*
    INFO 2019-11-25 14:46:00 SamFileValidator Validated Read 790,000,000 records. Elap00,000: 29s. Last read position: */*
    INFO 2019-11-25 14:46:32 SamFileValidator Validated Read 800,000,000 records. Elap00,000: 31s. Last read position: */*
    INFO 2019-11-25 14:47:03 SamFileValidator Validated Read 810,000,000 records. Elap00,000: 31s. Last read position: */*
    INFO 2019-11-25 14:47:33 SamFileValidator Validated Read 820,000,000 records. Elap00,000: 30s. Last read position: */*
    INFO 2019-11-25 14:48:04 SamFileValidator Validated Read 830,000,000 records. Elap00,000: 30s. Last read position: */*
    No errors found
    [Mon Nov 25 14:49:51 MST 2019] picard.sam.ValidateSamFile done. Elapsed time: 50.54 minutes.

    3. Which version of GATK4 are you using for HaplotypeCaller and CombineGVCFs?
    (GATK) v4.1.4.0


    4. Please post the entire error log when you run ValidateVariants. It is possible the gvcfs are malformed.

    For every gVCF file, the error is:

    ### log ####

    GATK4/051.raw.snps.indels.g.vcf
    15:28:17.477 INFO ValidateVariants - Done initializing engine
    15:28:17.478 WARN ValidateVariants - IDS validation cannot be done because no DBSNP file was provided
    15:28:17.478 WARN ValidateVariants - Other possible validations will still be performed
    15:28:17.478 INFO ProgressMeter - Starting traversal
    15:28:17.478 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    15:28:17.509 INFO ValidateVariants - Shutting down engine
    [November 25, 2019 3:28:17 PM MST] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.12 minutes.
    Runtime.totalMemory()=2311061504
    ***********************************************************************

    A USER ERROR has occurred: Input 051.raw.snps.indels.g.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 1:10137 are not observed at all in the sample genotypes of type:

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.


    5. Please post the entire error log when you run CombineGVCFs.
    Even after I removed the file reported as malformed (its BAM does not appear to be malformed, since it passes ValidateSam), I still got an error when merging the remaining 49 files:

    ### LOG ###
    [November 22, 2019 1:08:26 AM MST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 23.99 minutes.
    Runtime.totalMemory()=95565643776
    htsjdk.tribble.TribbleException$InternalCodecException: The following invalid GT allele index was encountered in the file: <NON_REF>
    at htsjdk.variant.vcf.AbstractVCFCodec.oneAllele(AbstractVCFCodec.java:578)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseGenotypeAlleles(AbstractVCFCodec.java:602)
  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)
    edited December 2019

    @Ankita_N

    The issue is that there is a NON_REF in the GT field. This looks like a possible bug in the code. We need to find the line where that error occurs.

    1. Try to run ValidateVariants with --validation-type-to-exclude ALLELES and --warn-on-errors.
    2. Please provide the entire stack trace for CombineGVCFs.
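
To locate the line where the GT field holds something other than an allele index, a plain-text scan of each gVCF may be enough. This is a minimal sketch rather than anything GATK provides: it assumes an uncompressed, tab-separated gVCF, and the function name `bad_gt_records` is hypothetical.

```python
import re

ALLELE_INDEX = re.compile(r"[0-9]+")


def bad_gt_records(lines):
    """Yield (line_number, chrom, pos, gt) for records whose GT contains
    anything other than integer allele indices or '.'.

    `lines` is any iterable of VCF text lines, for example an open file.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue  # no FORMAT/sample columns to inspect
        keys = cols[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        for sample in cols[9:]:
            fields = sample.split(":")
            if gt_index >= len(fields):
                continue
            gt = fields[gt_index]
            alleles = re.split(r"[/|]", gt)  # unphased '/' or phased '|'
            if any(a != "." and not ALLELE_INDEX.fullmatch(a) for a in alleles):
                yield number, cols[0], cols[1], gt
                break  # one report per record is enough
```

Running it over each input file, for example with `for hit in bad_gt_records(open("051.raw.snps.indels.g.vcf")): print(hit)`, would print the line number, position, and offending GT string, which can then be posted alongside the stack trace.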
  • Ankita_N (Calgary; Member)
    Hi @bhanuGandham,

    Thank you for your reply.

    Sure, I can provide details after running ValidateVariants with --validation-type-to-exclude ALLELES and --warn-on-errors, as well as the stack trace for CombineGVCFs.

    Apart from that, I also found that the files might not have undergone GATK pre-processing steps such as indel realignment and base quality recalibration. I didn't find any PS tags in the BAM file (and no information is present in the headers either). Do you think this might be the reason for the errors that are coming up?

    ### few lines from BAM file #######


    A00454:25:H757HDSXX:3:2105:6641:36292 147 1 24684 0 150M = 24487 -347 TGCCCAGGACAGGGATGGCCCTCTCATCAGGTGGGGGTGAGTGGCAGCACCCACCTGCTGAAGATGTCTCCAGAGACCTTCTGCAGGTACTGCAGGGCATCCGCCATCTGCTGGACGGCCTCCTCTCGCCGCAGGTCTGGCTGGATGAGG FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF NM:i:1 MD:Z:148A1 AS:i:148 XS:i:150 RG:Z:001_S1 XA:Z:12,+78844,150M,0;9,-24797,150M,1;19,-66292,150M,1;15,+102506331,150M,1;2,+114346185,150M,2;
    A00454:25:H757HDSXX:3:1337:27236:34851 99 1 24686 0 151M = 24796 260 CCCAGGACAGGGATGGCCCTCTCATCAGGTGGGGGTGAGTGGCAGCACCCACCTGCTGAAGATGTCTCCAGAGACCTTCTGCAGGTACTGCAGGGCATCCACCATCTGCTGGACGGCCTCCTCTCGCCGCAGGTCTGGCTGGATGAAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF NM:i:1 MD:Z:100G50 AS:i:146 XS:i:146 RG:Z:001_S1 XA:Z:9,+24799,151M,1;19,+66294,151M,1;12,-78841,151M,2;15,-102506328,151M,3;2,-114346182,151M,4;
    A00454:25:H757HDSXX:2:1419:27896:20682 83 1 24686 0 151M = 24396 -441 CCCAGGACAGGGATGGCCCTCTCATCAGGTGGGGGTGAGTGGCAGCACCCACCTGCTGAAGATGTCTCCAGAGACCTTCTGCAGGTACTGCAGGGCATCCGCCATCTGCTGGACGGCCTCCTCTCGCCGCAGGTCTGGCTGGATGAAGGGC FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:151 AS:i:151 XS:i:151 RG:Z:001_S1 XA:Z:19,-66294,151M,0;9,-24799,151M,0;12,+78841,151M,1;15,+102506328,151M,2;2,+114346182,151M,3;
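
The tag and header check described in this post can be made repeatable with a small script. The sketch below is only an illustration: it assumes tab-separated SAM text such as `samtools view` output (the excerpts above lost their tabs when pasted), and the function names `optional_tags` and `program_ids` are hypothetical.

```python
def optional_tags(sam_records):
    """Collect the two-letter optional tag names (NM, MD, RG, ...) seen in
    tab-separated SAM alignment records."""
    tags = set()
    for line in sam_records:
        if line.startswith("@") or not line.strip():
            continue  # header lines start with '@'; skip blanks too
        # Columns 1-11 are mandatory; optional TAG:TYPE:VALUE fields follow.
        for field in line.rstrip("\n").split("\t")[11:]:
            tags.add(field.split(":", 1)[0])
    return tags


def program_ids(sam_header):
    """Return the ID values of @PG header lines, in file order."""
    ids = []
    for line in sam_header:
        if not line.startswith("@PG"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids
```

For the records shown above, `optional_tags` would report NM, MD, AS, XS, RG, and XA. Whether a given pre-processing tool recorded itself in a @PG line depends on the tool and version, so an empty result from `program_ids` is a hint rather than proof.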
  • Ankita_N (Calgary; Member)
    Overall, ValidateVariants works fine for some files and throws errors for others, even with the options that you suggested. I believe the gVCFs for which ValidateVariants reports errors are the ones causing trouble when combining gVCFs; I always got errors with these specific files. CombineGVCFs ran but ended with errors because of these files.


    ### ValidateVariants stack trace for one sample

    11:54:15.980 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    11:54:20.754 INFO ValidateVariants - Shutting down engine
    [December 2, 2019 11:54:20 AM MST] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.21 minutes.
    Runtime.totalMemory()=7429160960
    htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 2806121: the END value in the INFO field is not valid, for input source: 051.raw.snps.indels.g.vcf
    at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:883)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:436)
    at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48)
    at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
    at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:373)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:354)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:315)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
    at org.broadinstitute.hellbender.Main.main(Main.java:292)
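
Because the exception only gives an approximate line number, it may help to scan the whole file for suspicious END values. The sketch below makes assumptions that the trace does not confirm: it treats an END as invalid when it is not a plain integer or when it precedes POS, it also reports records with fewer than 8 columns (for example a truncated line), and it expects an uncompressed, tab-separated gVCF. The name `invalid_end_records` is hypothetical.

```python
import re

INTEGER = re.compile(r"[0-9]+")


def invalid_end_records(lines):
    """Yield (line_number, reason) for records whose INFO END looks wrong."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            yield number, "fewer than 8 columns"
            continue
        pos = cols[1]
        for entry in cols[7].split(";"):
            if entry != "END" and not entry.startswith("END="):
                continue
            value = entry[4:]  # empty when the key has no value
            if not INTEGER.fullmatch(value):
                yield number, f"END is not an integer: {value!r}"
            elif not INTEGER.fullmatch(pos) or int(value) < int(pos):
                yield number, f"END {value} precedes POS {pos}"
```

Each reported line number can be compared with the "approximately line number" in the exception to see whether the same record is involved.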

    ##### for another sample

    12:02:39.069 WARN ValidateVariants - Other possible validations will still be performed
    12:02:39.069 INFO ProgressMeter - Starting traversal
    12:02:39.069 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    12:02:43.460 INFO ValidateVariants - Shutting down engine
    [December 2, 2019 12:02:43 PM MST] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.16 minutes.
    Runtime.totalMemory()=6396313600
    htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 2166949: unparsable vcf record with allele , for input source: 073.raw.snps.indels.g.vcf
    at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887)
    at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:678)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:706)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:645)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:443)
    at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48)
    at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
    at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:373)
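
A similar scan can list records whose ALT column contains an allele that does not look like a normal gVCF allele. This is a rough sketch with a deliberately narrow rule, not the check htsjdk performs: it accepts base strings (A, C, G, T, N), `*`, and symbolic alleles such as `<NON_REF>`, treats everything else (including an empty allele or breakend notation) as suspicious, and assumes an uncompressed, tab-separated file. The name `bad_alt_records` is hypothetical.

```python
import re

# Assumed "normal" alleles for a HaplotypeCaller gVCF: bases, '*', or <SYMBOLIC>.
ALT_OK = re.compile(r"[ACGTNacgtn]+|\*|<[^<>\s]+>")


def bad_alt_records(lines):
    """Yield (line_number, bad_alleles) for records with suspicious ALT alleles.

    bad_alleles is None when the record has no ALT column at all.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            yield number, None
            continue
        if cols[4] == ".":
            continue  # '.' means no alternate allele
        bad = [a for a in cols[4].split(",") if not ALT_OK.fullmatch(a)]
        if bad:
            yield number, bad
```

Printing the flagged lines next to the approximate line number from the exception should show what the unparsable allele actually is in the file.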
  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @Ankita_N

    As requested above,

    1. Please provide the entire stack trace for CombineGVCFs.