GATK4beta6 annotation incompatibility between HaplotypeCaller and GenomicsDBImport

Happy New Year!

I'm attempting to joint genotype ~1000 exomes using GATK4. I've run HC per sample with the following command:

java -Xmx7g -jar gatk-package-4.beta.6-local.jar HaplotypeCaller \
    -ERC GVCF \
    -G StandardAnnotation -G AS_StandardAnnotation \
    --maxReadsPerAlignmentStart 0 \
    -GQB 5 -GQB 10 -GQB 15 -GQB 20 -GQB 25 -GQB 30 -GQB 35 -GQB 40 -GQB 45 -GQB 50 \
    -GQB 55 -GQB 60 -GQB 65 -GQB 70 -GQB 75 -GQB 80 -GQB 85 -GQB 90 -GQB 95 -GQB 99 \
    -I example.bam \
    -O example.g.vcf.gz \
    -R /path/to/GRCh38.d1.vd1.fa

I then attempted to create a GenomicsDB per chromosome with the following command:

java -Xmx70g -jar gatk-package-4.beta.6-local.jar GenomicsDBImport \
    -genomicsDBWorkspace chrX_db \
    --overwriteExistingGenomicsDBWorkspace true \
    --intervals chrX \
    -V gvcfs.list

I get the following error:

Exception: [January 2, 2018 9:36:26 AM EST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=2238185472
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at chrX:251751. Field had 4 values but the header says this should have 1 values based on header record INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
    at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1571)
    at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
    at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
    at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:176)
    at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1232)
    at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1282)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:443)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:137)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:158)
    at org.broadinstitute.hellbender.Main.main(Main.java:239)

The error refers to the following line in one of the GVCFs:

chrX 251751 . G A,<NON_REF> 46.56 . AS_RAW_BaseQRankSum=|30,1,33,1|;AS_RAW_MQ=0.00|7200.00|0.00;AS_RAW_MQRankSum=|60,2|;AS_RAW_ReadPosRankSum=|5,1,20,1|;AS_SB_TABLE=0,0|0,0|0,0;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:GQ:PL:SB 1/1:0,2,0:6:73,6,0,73,6,73:0,0,1,1

I haven't found a way to get past this error. I found this post from a while back with a very similar error:

https://gatkforums.broadinstitute.org/gatk/discussion/comment/43382#Comment_43382

But they seemed to indicate that it was fixed for them in GATK4beta6.

Any help/insight into how to resolve it, or, if it's an unimportant annotation, how to ignore it, would be greatly appreciated. Thanks!

Ben

Answers

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi again @bkelly01,

    From the link you shared, I see @Sheila said back in October:

    Looking at the code changes, it looks like Allele-specific support in GenotypeGVCFs is just being added now. Can you wait a bit for a new release of GATK4 beta? Or, you can continue on with GATK3. I think you are able to work around your issues in that.

    From GATK3 documentation, we see for AS_ReadPosRankSum (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_annotator_AS_ReadPosRankSumTest.php) that we expect:

    INFO=<ID=AS_ReadPosRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref read position bias">
    

    which differs from your GATK4 AS_RAW_ReadPosRankSum:

    INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
    

    The Number and Type differ. The error you get:

    htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at chrX:251751. Field had 4 values but the header says this should have 1 values based on header record

    suggests that if you change the INFO header line, perhaps to Number=A,Type=Float, you will no longer get this error.

  • bkelly01 (USA) Member

    Different error, but it's still complaining about the same annotation:

    htsjdk.tribble.TribbleException: Could not decode field AS_RAW_ReadPosRankSum with value |5 of declared type Float

    Full output:

    15:41:36.770 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/igm/apps/gatk/gatk-4.beta.6/gatk-package-4.beta.6-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [January 2, 2018 3:41:36 PM EST] GenomicsDBImport --genomicsDBWorkspace chrX_db_3 --variant TCGA-3C-AAAU.GERMLINE.g.vcf --overwriteExistingGenomicsDBWorkspace true --intervals chrX --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --batchSize 0 --consolidate false --validateSampleNameMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPrefetchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
    [January 2, 2018 3:41:36 PM EST] Executing as bjk003@rpl-igmmaster01 on Linux 3.10.0-693.2.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: 4.beta.6
    15:41:40.394 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
    15:41:40.394 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    15:41:40.394 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
    15:41:40.394 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    15:41:40.394 INFO GenomicsDBImport - Deflater: IntelDeflater
    15:41:40.394 INFO GenomicsDBImport - Inflater: IntelInflater
    15:41:40.394 INFO GenomicsDBImport - GCS max retries/reopens: 20
    15:41:40.394 INFO GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    15:41:40.394 INFO GenomicsDBImport - Initializing engine
    15:41:40.911 INFO IntervalArgumentCollection - Processing 156040895 bp from intervals
    15:41:40.915 INFO GenomicsDBImport - Done initializing engine
    Created workspace /igm/projects/tcga_runs/test/chrX_db_3
    15:41:41.682 INFO GenomicsDBImport - Vid Map JSON file will be written to chrX_db_3/vidmap.json
    15:41:41.682 INFO GenomicsDBImport - Callset Map JSON file will be written to chrX_db_3/callset.json
    15:41:41.682 INFO GenomicsDBImport - Importing to array - chrX_db_3/genomicsdb_array
    15:41:41.692 INFO ProgressMeter - Starting traversal
    15:41:41.693 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
    15:41:41.730 INFO GenomicsDBImport - Importing batch 1 with 1 samples
    15:41:42.344 INFO GenomicsDBImport - Shutting down engine
    [January 2, 2018 3:41:42 PM EST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.09 minutes.
    Runtime.totalMemory()=2228224000
    htsjdk.tribble.TribbleException: Could not decode field AS_RAW_ReadPosRankSum with value |5 of declared type Float
        at htsjdk.variant.variantcontext.VariantContext.decodeOne(VariantContext.java:1630)
        at htsjdk.variant.variantcontext.VariantContext.decodeValue(VariantContext.java:1601)
        at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1561)
        at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
        at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
        at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:176)
        at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1232)
        at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1282)
        at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:443)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:137)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:158)
        at org.broadinstitute.hellbender.Main.main(Main.java:239)

    Is this a problem with GenomicsDBImport not being able to handle HaplotypeCaller's correct output, or is HaplotypeCaller not outputting valid annotations?

  • shlee (Cambridge) Member, Broadie, Moderator

    Is this a problem with GenomicsDBImport not being able to handle HaplotypeCaller's correct output, or is HaplotypeCaller not outputting valid annotations?

    Great question @bkelly01.

    Seems like your VCF could have multiple header lines containing INFO=<ID=AS_ReadPosRankSum. Can you confirm this for us, e.g. with gzcat xyz.g.vcf.gz | grep 'INFO=<ID=AS_ReadPosRankSum' on the output of the HaplotypeCaller step? Thanks.

    I see the following open ticket that suggests some issue with header lines and GenomicsDB:

    I'll see what our developers think. In the meanwhile, would it be possible to get some test data from you to recapitulate the error? Instructions for prepping a bug report are at https://software.broadinstitute.org/gatk/guide/article?id=1894. Thanks again.

  • bkelly01 (USA) Member

    Definitely no duplicate header lines.

    Looking at the developer response, it seems that they know about the issue and will have a fixed version soon. Any idea whether this will be in the major release or a nightly?

  • shlee (Cambridge) Member, Broadie, Moderator

    @bkelly01, one of the developers states that this will be fixed sometime this week, in preparation for the official release of GATK4, which is planned for January 9. You can track the progress of the work in a different ticket: https://github.com/broadinstitute/gatk/issues/3707.

  • gauthier Member, Broadie, Moderator, Dev

    @bkelly01 there's another issue as well. In GATK 3.8 I changed the format of the AS_RAW RankSum annotations to be a rank sum z-score per allele, from the previous format where it was a list of values per allele (as you have). The changes didn't get ported to GATK4 until recently, i.e. after the last release, 4beta6. When allele-specific annotation support is available, it will use the new format. Your options are:

    1) re-run HaplotypeCaller (from version 3.8, or @shlee can tell you how to get a GATK4 nightly)
    2) drop the AS RankSums
    3) write code to convert the GVCF raw annotation lists to rank sum z-scores

    Most users only re-run HaplotypeCaller as a last resort. If you drop the AS RankSums, VQSR will default to using the non-AS RankSums downstream. Anecdotally I've found that AS_FS and AS_SOR are more powerful, so I don't think you'd be sacrificing that much. To drop the annotations, I think your best bet is to write a new VCF with a third-party tool (e.g. Python's pyvcf).
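    Dropping the AS RankSum annotations as suggested can also be done with plain text processing rather than pyvcf. A minimal, illustrative Python sketch (the annotation set, file names, and helper name are assumptions, not anything GATK provides; you would still need to bgzip and re-index the output afterwards):

```python
import gzip

# Annotations to drop; the exact set you need may differ. These are the
# allele-specific rank-sum fields from the example GVCF in this thread.
DROP = {"AS_RAW_BaseQRankSum", "AS_RAW_MQRankSum", "AS_RAW_ReadPosRankSum"}

def strip_annotations(in_path, out_path, drop=DROP):
    """Copy a gzipped or plain GVCF, removing the given INFO annotations
    from both the header and every record. Note: gzip output is not BGZF;
    re-compress with bgzip and re-index with tabix afterwards."""
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith("##INFO=<ID="):
                info_id = line[len("##INFO=<ID="):].split(",", 1)[0]
                if info_id in drop:
                    continue  # drop the header definition entirely
            elif not line.startswith("#"):
                fields = line.rstrip("\n").split("\t")
                info = fields[7].split(";")
                fields[7] = ";".join(
                    kv for kv in info if kv.split("=", 1)[0] not in drop
                ) or "."  # "." if every INFO field was removed
                line = "\t".join(fields) + "\n"
            fout.write(line)
```

    This keeps all other header lines and sample columns untouched; only the named INFO keys disappear.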

  • shlee (Cambridge) Member, Broadie, Moderator

    GATK4 nightlies can be found in Docker image format at https://hub.docker.com/r/broadinstitute/gatk-nightly/.

  • bkelly01 (USA) Member
    edited January 3

    @gauthier Just to be clear: there isn't a way to have GenomicsDBImport ignore an annotation; I'd have to actually go through and manually remove all of those annotations? Is there any way I could edit the header so that it ignores them?

    I have nearly 1000 GVCFs to edit. I'll automate it, sure, but I'm looking for any shortcut I can. :)

    Edit: I don't see AS_FS or AS_SOR in my GVCF, are they added by VQSR? It wasn't quite clear from your message.

    Thanks!

  • gauthier Member, Broadie, Moderator, Dev

    I wanted to give you a solution guaranteed to work. It's likely that if you remove the headers for the AS annotations that they will be ignored, but I'm honestly not sure. There's also a possibility of editing a JSON file, but that gets really gross really fast.

  • gauthier Member, Broadie, Moderator, Dev

    AS_FS and AS_SOR get generated from the AS_SB_TABLE annotation in the GenotypeGVCFs stage.

    If you don't want to wait for the GenomicsDB update, the new GATK4 CombineGVCFs should work now (from a nightly). With 1000 samples you may want to do a hierarchical merge. But you'll need to fix the old format HC AS annotations either way.
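    A rough sketch of how such a hierarchical merge could be scripted (the batch size, intermediate file names, and jar path here are illustrative assumptions, not values from this thread):

```python
# Build the CombineGVCFs command lines for a two-level (hierarchical) merge:
# first merge the GVCFs in batches, then merge the per-batch outputs into
# one GVCF suitable for GenotypeGVCFs.
def hierarchical_combine_commands(gvcfs, reference, batch_size=50,
                                  jar="gatk-package-4.beta.6-local.jar"):
    """Return a list of command lines; run them in order."""
    commands, intermediates = [], []
    for i in range(0, len(gvcfs), batch_size):
        out = "batch_%d.g.vcf.gz" % (i // batch_size)
        cmd = ["java", "-jar", jar, "CombineGVCFs", "-R", reference]
        for v in gvcfs[i:i + batch_size]:
            cmd += ["-V", v]
        cmd += ["-O", out]
        commands.append(cmd)
        intermediates.append(out)
    # second level: merge the per-batch outputs into a single GVCF
    final = ["java", "-jar", jar, "CombineGVCFs", "-R", reference]
    for v in intermediates:
        final += ["-V", v]
    final += ["-O", "allSamples.g.vcf.gz"]
    commands.append(final)
    return commands  # run each, e.g. subprocess.run(cmd, check=True)
```

    With very large cohorts you could add a third level, but two levels already keep any single CombineGVCFs invocation to a manageable number of inputs.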

  • bkelly01 (USA) Member

    @gauthier OK, just so I'm understanding this correctly, I have to fix (i.e. remove) the AS_*RankSum annotations from the GVCFs no matter what (because I'm not going to rerun HC on all of them). I can then merge the GVCFs together with the nightly CombineGVCFs and then GenotypeGVCFs once I get to a single file. Or alternatively I can wait for the fixed version of GenomicsDB that supports AS annotations and GenotypeGVCFs on that?

    I'll run a trial of just removing from the headers and report back.

    Thanks!

  • gauthier Member, Broadie, Moderator, Dev

    Now that I think about it, you could probably run CombineGVCFs on what you have now, with the AS_*RankSum values, if you're careful with the annotation arguments. Where the example from https://gatkforums.broadinstitute.org/gatk/discussion/9622/allele-specific-annotation-and-filtering has

    ./gatk CombineGVCFs -R $reference \
        -V mySample1.AS.g.vcf -V mySample2.AS.g.vcf -V mySample3.AS.g.vcf \
        -o allSamples.g.AS.vcf \
        -G StandardAnnotation -G AS_StandardAnnotation

    you would instead run CombineGVCFs and GenotypeGVCFs with -G StandardAnnotation -A AS_FS -A AS_SOR -A AS_MQ -A AS_QD. Then for VQSR, specify the classic ReadPosRankSum and MQRankSum and the others allele-specific. (With the -AS argument, of course, similar to the GATK3 example in https://gatkforums.broadinstitute.org/gatk/discussion/9622/allele-specific-annotation-and-filtering)

    GenomicsDB will be faster and the output will be smaller, but the above is an option if dropping the header lines doesn't work.

  • bkelly01 (USA) Member

    @gauthier So I tried stripping the header, and as I sort of suspected, it complained about there being annotations not found in the header.

    I also tried using CombineGVCFs from the nightly as such:

    java -jar gatk.jar CombineGVCFs -R GRCh38.d1.vd1.fa -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O test.noAnn.g.vcf.gz -G StandardAnnotation

    but it throws a cryptic error almost immediately:

    [January 4, 2018 1:53:33 PM EST] CombineGVCFs --output test.noAnn.g.vcf.gz --variant TCGA-3C-AAAU.GERMLINE.g.vcf.gz --variant ../TCGA-3C-AALI/TCGA-3C-AALI.GERMLINE.g.vcf.gz --reference ../../../archive/RESEARCH/TCGA/reference/GRCh38.d1.vd1.fa --annotation-group StandardAnnotation --convertToBasePairResolution false --breakBandsAtMultiplesOf 0 --ignore_variants_starting_outside_interval false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disable-tool-default-read-filters false
    [January 4, 2018 1:53:33 PM EST] Executing as bjk003@rpl-igmmaster01 on Linux 3.10.0-693.2.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: 4.beta.6-132-g7454c15-SNAPSHOT
    13:53:37.188 INFO CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    13:53:37.188 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    13:53:37.188 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    13:53:37.188 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    13:53:37.188 INFO CombineGVCFs - Deflater: IntelDeflater
    13:53:37.188 INFO CombineGVCFs - Inflater: IntelInflater
    13:53:37.188 INFO CombineGVCFs - GCS max retries/reopens: 20
    13:53:37.188 INFO CombineGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    13:53:37.188 INFO CombineGVCFs - Initializing engine
    13:53:37.761 INFO FeatureManager - Using codec VCFCodec to read file file:///igm/projects/tcga_runs/test/TCGA-3C-AAAU.GERMLINE.g.vcf.gz
    13:53:37.879 INFO FeatureManager - Using codec VCFCodec to read file file:///igm/projects/tcga_runs/test/../TCGA-3C-AALI/TCGA-3C-AALI.GERMLINE.g.vcf.gz
    13:53:43.848 INFO CombineGVCFs - Done initializing engine
    13:53:44.540 INFO ProgressMeter - Starting traversal
    13:53:44.540 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    13:53:44.643 INFO CombineGVCFs - Shutting down engine
    [January 4, 2018 1:53:44 PM EST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 0.18 minutes.
    Runtime.totalMemory()=2822242304
    java.lang.IllegalArgumentException: Invalid interval. Contig:chr1 start:0 end:0
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:687)
        at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:60)
        at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:36)
        at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.mergeWithNewVCs(CombineGVCFs.java:258)
        at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.apply(CombineGVCFs.java:133)
        at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.apply(MultiVariantWalkerGroupedOnStart.java:67)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:110)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
        at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:112)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:875)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:128)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:185)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:204)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:156)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:199)
        at org.broadinstitute.hellbender.Main.main(Main.java:279)

    It seems like stripping out the AS_*RankSum annotations is my only path forward. Will run a test on that and let you know how it goes.

  • shlee (Cambridge) Member, Broadie, Moderator

    Invalid interval. Contig:chr1 start:0 end:0

    Sounds like you even stripped out the contig lines from the header.

  • bkelly01 (USA) Member

    @shlee They're definitely still in there. I didn't touch the GVCF produced by HC at all in that test.

  • shlee (Cambridge) Member, Broadie, Moderator

    Looks like you are using CombineGVCFs from GATK4.beta.6, which may still be undergoing development as it was a tool that we recently started porting over. Please use CombineGVCFs from GATK3.7 or GATK3.8.

  • shlee (Cambridge) Member, Broadie, Moderator

    Hey @bkelly01,

    I'm just recalling that the Number field of the INFO line can have a '.' value. I encountered this issue a year ago when running MuTect1 and found a solution by combing the forum. My notes say to change the value of the Number field to '.'. This allows the number of values for the annotation to vary per record.

    To quote @flytrap who is quoting some other unknown source:

    The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2 and so on. There are also certain special characters used to define special cases: If the field has one value for each possible allele (including the reference), then this value should be 'R'

    Apologies for the late solution. I don't man the forum regularly, and I don't have much experience with VCF annotations and their respective errors.
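    A minimal sketch of that header edit in Python (the helper name is hypothetical; this only rewrites a single ##INFO header line, and whether GenomicsDBImport then accepts the file is untested):

```python
import re

# Relax the Number declaration of one INFO field to "." so the value count
# may vary per record, as described above.
def relax_info_number(header_line, info_id):
    """If header_line declares the given INFO ID, rewrite Number=<x> to Number=."""
    if header_line.startswith("##INFO=<ID=%s," % info_id):
        return re.sub(r"Number=[^,]+", "Number=.", header_line, count=1)
    return header_line  # any other line is passed through unchanged
```

    Applied line-by-line over a GVCF header, this touches only the targeted annotation's definition and leaves every other header line alone.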

  • bkelly01 (USA) Member

    @gauthier

    Stripping out the AS annotations seemed to work. I'm currently able to import in to GenomicsDB.

    A couple of quick questions, though. How long might I expect this to take? Does using batches, consolidating, and increasing the reader threads help? I kicked off chr1 using a batch size of 50, consolidate=true, and 32 reader threads; 10 hrs later it's still on the first batch. These are high-depth exomes. Any insight into what combination of parameters might help would be greatly appreciated. Thanks!

  • gauthier Member, Broadie, Moderator, Dev

    @bkelly01 In production we use a batch size of 50 and 5 reader threads. I'm told that more than 5 reader threads actually hurts performance scalability. We actually don't use the consolidate flag in production, though I don't know why not. I'm pretty sure those were the parameters we used to make a 20K WGS callset, so they've been thoroughly optimized.

    I think I saw that you're using either 4beta6 or the nightly, which is good, but I want to be explicit that you shouldn't use GenomicsDBImport from a version before that if you're using batching because there was a sample swap bug.

  • bkelly01 (USA) Member

    @gauthier I'm using 4beta6. The nightly I have seems to not allow a gvcf list for the -V parameter.

    So after about 60 hrs, I've got different chromosome jobs still running, ranging from batch 1 to 7 (of ~20 batches, 50 samples per batch). Is there any realistic chance of this finishing without taking weeks?

    I did also run in to a problem with running all of these imports from the same directory. It creates a .tmp_loader.json file in the directory you kick the imports off from, which is problematic when you kick off 24 imports, one for each chromosome. This json file seems to specify the location of the GenomicsDB which that job is to write to. With that file being overwritten each time I kicked off a new import, I got core dumps and jobs writing to the wrong db. I had specified a unique db for each, but with jobs reading from the same json file, what I had specified on the command line and what the jobs actually did didn't line up. I ended up creating a separate dir for each chromosome, and kicked each job off from those dirs, which solved the problem.
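    The per-directory workaround could be scripted along these lines (the function and directory names are illustrative, and the option values just mirror the commands earlier in the thread):

```python
import os

# Launch each per-chromosome GenomicsDBImport from its own working
# directory so the .tmp_loader.json files written to the launch directory
# cannot collide between concurrent jobs.
def plan_imports(chromosomes, gvcf_list, jar, workdir="imports"):
    """Create one directory per chromosome and return (cwd, command) pairs.
    Launch each job with subprocess.Popen(cmd, cwd=cwd)."""
    jobs = []
    for chrom in chromosomes:
        cwd = os.path.join(workdir, chrom)
        os.makedirs(cwd, exist_ok=True)
        cmd = ["java", "-Xmx70g", "-jar", os.path.abspath(jar),
               "GenomicsDBImport",
               "-genomicsDBWorkspace", "%s_db" % chrom,
               "--overwriteExistingGenomicsDBWorkspace", "true",
               "--intervals", chrom,
               "-V", os.path.abspath(gvcf_list)]
        jobs.append((cwd, cmd))
    return jobs
```

    Absolute paths for the jar and the GVCF list matter here, since each job runs with a different current directory.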

    Thank you for your help on this!

  • gauthier Member, Broadie, Moderator, Dev

    @bkelly01 GenotypeGVCFs no longer supports multiple inputs. It requires a single GenomicsDB instance or a single combined GVCF.

    For a ~1200 sample WGS callset I just ran, I scattered the import ~2000 ways and each scatter took about 3 hours. It should be linear in the number of samples and the size of the interval, in which case each chromosome would take (on average) a little over 200 hours or ab

    We run the workflow with GenomicsDBImport in the cloud, so each job gets its own virtual machine and we haven't seen that problem before. I've created multiple GDBs in the same directory, though not at the same time. I will note that as a usage limitation in the docs.

  • JaimeG (Spain) Member

    Hi everyone!
    I have a similar issue to @bkelly01's when using the latest release of GATK (gatk-4.0.2.1). I did the variant calls with the same version, and when I try to join GVCFs with CombineGVCFs it crashes on the last contig with this error:

    java.lang.IllegalArgumentException: Invalid interval. Contig:NW_017493675-1 start:575 end:316

    I have checked all GVCFs with ValidateVariants and they are all OK. When I look at the last two contigs in the GVCF files, these are the lengths:

    NW_017493675-1 510 . T <NON_REF> . . END=575 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
    NW_017493676-1 1 . T <NON_REF> . . END=316 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0

    The rest of the contigs are named like these, in case that matters...
    Any idea?
    Thanks for your help,
    Jaime.

  • Sheila (Broad Institute) Member, Broadie, Moderator

    @JaimeG
    Hi Jaime,

    Can you post the exact command you ran?

    Thanks,
    Sheila

  • JaimeG (Spain) Member
    edited March 20

    Hi @Sheila !
    This was part of the command (There are more variants in the command):

    java -Dsamjdk.use_async_io_read_samtools=false \
    -Dsamjdk.use_async_io_write_samtools=true  \
    -Dsamjdk.use_async_io_write_tribble=false  \
    -Dsamjdk.compression_level=1  \
    -jar /share/apps/local/gatk/GATK_4.0.2.1/gatk-package-4.0.2.1-local.jar  \
    CombineGVCFs -R /share/references/genomes/Juglans_regia/wgs.5d/GCF_001411555.1_wgs.5d_genomic.fa  \
    --variant /share/agro/trash/variant_calling/gatk/AE006403/gatk_haplotype_caller_agro_AE006403.g.vcf  \
    --variant /share/agro/trash/variant_calling/gatk/AE006404/gatk_haplotype_caller_agro_AE006404.g.vcf  \
    -O /share/agro/trash/variant_calling/gatk/combine_GVCFs_agro.vcf
    

    Thank you!,

    Jaime.

    Post edited by shlee on
  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @JaimeG,

    Sheila is away at a workshop. Can you double-check if your VCFs are coordinate-sorted by running them through SortVcf? Also, is it possible for you to narrow down the VCF file(s) that may be causing the error by subsetting the VCFs into smaller sets and running them through CombineGVCFs?
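    The narrowing-down step suggested above can be sketched as a bisection over the list of GVCFs. Here `fails` stands for a caller-supplied check that runs CombineGVCFs on a subset and reports whether it crashes; everything in this sketch, including the helper name, is illustrative:

```python
# Bisect a list of GVCFs to a minimal failing subset, assuming the failure
# is reproducible on one half at each split.
def find_failing_subset(gvcfs, fails):
    """Return a smallest sub-list (down to one file) on which fails() is True."""
    subset = list(gvcfs)
    while len(subset) > 1:
        mid = len(subset) // 2
        left, right = subset[:mid], subset[mid:]
        if fails(left):
            subset = left
        elif fails(right):
            subset = right
        else:
            break  # failure needs files from both halves; stop here
    return subset
```

    For an error triggered by interactions between files (as CombineGVCFs errors can be), the `break` branch keeps the last subset that still reproduces the crash rather than discarding it.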

  • ghislain (Oslo) Member

    Hi,
    I have exactly the same problem with CombineGVCFs (GATK ver. 4.0.2.0), and SortVcf doesn't fix the problem. The end coordinate is coming from the next contig.

    Error:

    java.lang.IllegalArgumentException: Invalid interval. Contig:HLA-DRB1*15:03:01:02 start:11569 end:11005
            at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:687)
            at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:61)
            at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:37)
            at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.onTraversalSuccess(CombineGVCFs.java:415)
            at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:895)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:159)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:201)
            at org.broadinstitute.hellbender.Main.main(Main.java:287)
    
    

    the g.vcf input:

    HLA-DRB1*15:02:01       1       .       G       <NON_REF>       .       .       END=10313       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    HLA-DRB1*15:03:01:01    1       .       G       <NON_REF>       .       .       END=11567       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    HLA-DRB1*15:03:01:02    1       .       G       <NON_REF>       .       .       END=11569       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    HLA-DRB1*16:02:01       1       .       G       <NON_REF>       .       .       END=11005       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    
    

    Ghislain

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @ghislain,

    Can you post the exact command that gives you the error? If you are using an intervals list, it looks like our developers implemented a new argument, -ignore_variants_starting_outside_interval, to address this particular error. Otherwise, I think this may be a bug and we will need a small snippet of data from you to recapitulate the error. Instructions for bug reports are at https://software.broadinstitute.org/gatk/guide/article?id=1894.

  • ghislain (Oslo) Member
    edited March 23

    Hi Shlee,

    I don't use intervals.

    The command line used is :

    java -Xmx16G -jar /opt/conda/share/gatk4-4.0.2.0-0/gatk-package-4.0.2.0-local.jar CombineGVCFs \
    -V /data/inputs/SAMPLE1.g.vcf \
    -V /data/inputs/SAMPLE2.g.vcf \
    -O /data/Outputs/combined.g.vcf \
    -R /references/Homo_sapiens_assembly38.fasta \
    -G AS_StandardAnnotation \
    -G StandardAnnotation
    

    I had this error with several WGS samples, at different positions and chromosomes; unfortunately the data are sensitive and I cannot export anything from the working environment.
    The data used in my previous post are just a small dataset able to generate what seems to be the same error; I'll attach it for the bug report.
    (Data uploaded: Discussion-GATK4beta6-annotation-incompatibility-between-HaplotypeCaller-and-GenomicsDBImport-Post-by-Ghislain.tar.xz)

    Post edited by ghislain on

    Issue · Github
    by shlee

    Issue Number: 3026
    State: closed
    Closed By: sooheelee
  • SkyWarrior (Turkey) Member
    edited March 23

    If you are only interested in the main reference contigs why not use the Broad's HG38 wgs calling regions list file to limit your calling intervals and get rid of this error?

    I believe those HLA contigs are not for the purpose of variant calling but for HLA typing only.

  • shlee (Cambridge) Member, Broadie, Moderator

    Thanks, @ghislain. I've received your data bundle. We will be in touch again soon.

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @ghislain, I can recapitulate your observation and have asked the developers to look into this error at https://github.com/broadinstitute/gatk/issues/4572.

    Given the HLA contigs appear to contain no variation in the parts of your data I looked into, perhaps you can move on in your endeavors by specifying -XL HLA-DRB1*16:02:01 in your CombineGVCFs commands. This, unfortunately, drops the last contig of data.

  • ghislain (Oslo) Member

    Hi,

    Thank you for the feed back.
    With the production pipeline we already ignore unplaced/unlocalized contigs, but unfortunately this problem occurs on all chromosomes. On the last run, with 900 WGS samples, we got:

    • Chr1: errors at 35 different positions of this chromosome
    • Chr2: 18
    • Chr3: 5
    • Chr4: 1
    • Chr5: 2
    • Chr6: 3
    • Chr7: 8
    • etc...
  • shlee (Cambridge) Member, Broadie, Moderator

    @ghislain, so this is a wider issue of dropping the last GVCF block. Ouch. I will also convey this to the developer.

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @ghislain,

    The fix for this bug is nearly complete. Once the following issues are labeled closed, the next release of GATK4 will work as expected:
