
GATK4beta6 annotation incompatibility between HaplotypeCaller and GenomicsDBImport

Happy New Year!

I'm attempting to joint genotype ~1000 exomes using GATK4. I've run HC per sample with the following command:

java -Xmx7g -jar gatk-package-4.beta.6-local.jar HaplotypeCaller \
    -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation \
    --maxReadsPerAlignmentStart 0 \
    -GQB 5 -GQB 10 -GQB 15 -GQB 20 -GQB 25 -GQB 30 -GQB 35 -GQB 40 -GQB 45 -GQB 50 \
    -GQB 55 -GQB 60 -GQB 65 -GQB 70 -GQB 75 -GQB 80 -GQB 85 -GQB 90 -GQB 95 -GQB 99 \
    -I example.bam -O example.g.vcf.gz -R /path/to/GRCh38.d1.vd1.fa

I then attempted to create a GenomicsDB per chromosome with the following command:

java -Xmx70g -jar gatk-package-4.beta.6-local.jar GenomicsDBImport \
    --genomicsDBWorkspace chrX_db \
    --overwriteExistingGenomicsDBWorkspace true \
    --intervals chrX -V gvcfs.list

I get the following error:

Exception:

[January 2, 2018 9:36:26 AM EST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=2238185472
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at chrX:251751. Field had 4 values but the header says this should have 1 values based on header record INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
    at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1571)
    at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
    at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
    at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:176)
    at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1232)
    at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1282)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:443)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:137)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:158)
    at org.broadinstitute.hellbender.Main.main(Main.java:239)

This refers to the following line in one of the GVCFs:

chrX 251751 . G A,<NON_REF> 46.56 . AS_RAW_BaseQRankSum=|30,1,33,1|;AS_RAW_MQ=0.00|7200.00|0.00;AS_RAW_MQRankSum=|60,2|;AS_RAW_ReadPosRankSum=|5,1,20,1|;AS_SB_TABLE=0,0|0,0|0,0;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:GQ:PL:SB 1/1:0,2,0:6:73,6,0,73,6,73:0,0,1,1

I haven't found a way to get past this error. I found this post from a while back with a very similar error:

https://gatkforums.broadinstitute.org/gatk/discussion/comment/43382#Comment_43382

But they seemed to indicate that it was fixed for them in GATK4beta6.

Any help/insight into how to resolve this, or, if it's an unimportant annotation, how to ignore it, would be greatly appreciated. Thanks!

Ben

Answers

  • shlee (Cambridge) · Member, Broadie, Moderator

    Hi again @bkelly01,

    From the link you shared, I see @Sheila said back in October:

    Looking at the code changes, it looks like Allele-specific support in GenotypeGVCFs is just being added now. Can you wait a bit for a new release of GATK4 beta? Or, you can continue on with GATK3. I think you are able to work around your issues in that.

    From the GATK3 documentation for AS_ReadPosRankSum (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_annotator_AS_ReadPosRankSumTest.php), we see that we expect:

    INFO=<ID=AS_ReadPosRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref read position bias">
    

    which differs from your GATK4 AS_RAW_ReadPosRankSum:

    INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
    

    The Number and Type differ. The error you get:

    htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at chrX:251751. Field had 4 values but the header says this should have 1 values based on header record

    suggests that if you change the INFO header line, perhaps to Number=A,Type=Float, you will no longer get this error.
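
    If you want to try that edit across a gzipped GVCF, here is a minimal stdlib-Python sketch (a throwaway helper of my own, not a GATK tool; note the output is plain gzip, so re-compress with bgzip and re-index with tabix before handing it back to GATK):

        # edit_as_header.py -- illustrative only: rewrite the
        # AS_RAW_ReadPosRankSum INFO header line in a gzipped GVCF.
        import gzip
        import sys

        OLD = 'ID=AS_RAW_ReadPosRankSum,Number=1,Type=String'
        NEW = 'ID=AS_RAW_ReadPosRankSum,Number=A,Type=Float'

        with gzip.open(sys.argv[1], 'rt') as fin, gzip.open(sys.argv[2], 'wt') as fout:
            for line in fin:
                if line.startswith('##INFO'):
                    line = line.replace(OLD, NEW)  # only touch the header
                fout.write(line)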

  • bkelly01 (USA) · Member

    Different error, but it's still complaining about the same annotation:

    htsjdk.tribble.TribbleException: Could not decode field AS_RAW_ReadPosRankSum with value |5 of declared type Float

    Full output:

    15:41:36.770 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/igm/apps/gatk/gatk-4.beta.6/gatk-package-4.beta.6-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [January 2, 2018 3:41:36 PM EST] GenomicsDBImport --genomicsDBWorkspace chrX_db_3 --variant TCGA-3C-AAAU.GERMLINE.g.vcf --overwriteExistingGenomicsDBWorkspace true --intervals chrX --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --batchSize 0 --consolidate false --validateSampleNameMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPrefetchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
    [January 2, 2018 3:41:36 PM EST] Executing as bjk003@rpl-igmmaster01 on Linux 3.10.0-693.2.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: 4.beta.6
    15:41:40.394 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
    15:41:40.394 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    15:41:40.394 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
    15:41:40.394 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    15:41:40.394 INFO GenomicsDBImport - Deflater: IntelDeflater
    15:41:40.394 INFO GenomicsDBImport - Inflater: IntelInflater
    15:41:40.394 INFO GenomicsDBImport - GCS max retries/reopens: 20
    15:41:40.394 INFO GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    15:41:40.394 INFO GenomicsDBImport - Initializing engine
    15:41:40.911 INFO IntervalArgumentCollection - Processing 156040895 bp from intervals
    15:41:40.915 INFO GenomicsDBImport - Done initializing engine
    Created workspace /igm/projects/tcga_runs/test/chrX_db_3
    15:41:41.682 INFO GenomicsDBImport - Vid Map JSON file will be written to chrX_db_3/vidmap.json
    15:41:41.682 INFO GenomicsDBImport - Callset Map JSON file will be written to chrX_db_3/callset.json
    15:41:41.682 INFO GenomicsDBImport - Importing to array - chrX_db_3/genomicsdb_array
    15:41:41.692 INFO ProgressMeter - Starting traversal
    15:41:41.693 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
    15:41:41.730 INFO GenomicsDBImport - Importing batch 1 with 1 samples
    15:41:42.344 INFO GenomicsDBImport - Shutting down engine
    [January 2, 2018 3:41:42 PM EST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.09 minutes.
    Runtime.totalMemory()=2228224000
    htsjdk.tribble.TribbleException: Could not decode field AS_RAW_ReadPosRankSum with value |5 of declared type Float
        at htsjdk.variant.variantcontext.VariantContext.decodeOne(VariantContext.java:1630)
        at htsjdk.variant.variantcontext.VariantContext.decodeValue(VariantContext.java:1601)
        at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1561)
        at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
        at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
        at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:176)
        at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1232)
        at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1282)
        at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:443)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:137)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:158)
        at org.broadinstitute.hellbender.Main.main(Main.java:239)

    Is this a problem with GenomicsDBImport not being able to handle HaplotypeCaller's correct output, or is HaplotypeCaller not outputting valid annotations?

  • shlee (Cambridge) · Member, Broadie, Moderator

    Is this a problem with GenomicsDBImport not being able to handle HaplotypeCaller's correct output, or is HaplotypeCaller not outputting valid annotations?

    Great question @bkelly01.

    Seems like your VCF could have multiple header lines containing INFO=<ID=AS_ReadPosRankSum. Can you confirm this for us, e.g. with gzcat xyz.g.vcf.gz | grep 'INFO=<ID=AS_ReadPosRankSum' on the output of the HaplotypeCaller step? Thanks.

    I see the following open ticket that suggests some issue with header lines and GenomicsDB:

    I'll see what our developers think. In the meanwhile, would it be possible to get some test data from you to recapitulate the error? Instructions for prepping a bug report are at https://software.broadinstitute.org/gatk/guide/article?id=1894. Thanks again.

  • bkelly01 (USA) · Member

    Definitely no duplicate header lines.

    Looking at the developer response, it seems that they know about the issue and will have a fixed version soon. Any idea whether this will be in the major release or a nightly?

  • shlee (Cambridge) · Member, Broadie, Moderator

    @bkelly01, one of the developers states that this will be fixed sometime this week, in preparation for the official release of GATK4, which is planned for January 9. You can track the progress of the work in a different ticket: https://github.com/broadinstitute/gatk/issues/3707.

  • gauthier · Member, Broadie, Moderator, Dev

    @bkelly01, there's another issue as well. In GATK 3.8 I changed the format of the AS_RAW RankSum annotations from the previous format, a list of values per allele (which is what you have), to a rank sum z-score per allele. The changes didn't get ported to GATK4 until recently, i.e. since the last release, 4beta6. When allele-specific annotation support is available, it will use the new format. Your options are:

    1) re-run HaplotypeCaller (from version 3.8, or @shlee can tell you how to get a GATK4 nightly);
    2) drop the AS RankSums; or
    3) write code to convert the GVCF raw annotation lists to rank sum z-scores.

    Most users only re-run HaplotypeCaller as a last resort. If you drop the AS RankSums, VQSR will default to using the non-AS RankSums downstream. Anecdotally, I've found that AS_FS and AS_SOR are more powerful, so I don't think you'd be sacrificing that much. To drop the annotations, I think your best bet is to write a new VCF with a third-party tool (e.g. Python's pyvcf).
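
    For option 2, a minimal stdlib-Python sketch of such a stripper (the script and its approach are illustrative, not an official GATK tool; it writes plain gzip, so the output needs bgzip re-compression and a fresh tabix index before GATK will take it):

        # strip_as_ranksums.py -- illustrative helper, not a GATK tool.
        # Drops the AS_*RankSum raw annotations (both the ##INFO header lines
        # and the per-record INFO fields) from a gzipped GVCF.
        import gzip
        import sys

        def keep(kv):
            # drop AS_RAW_BaseQRankSum, AS_RAW_MQRankSum, AS_RAW_ReadPosRankSum;
            # keep everything else, including AS_RAW_MQ and AS_SB_TABLE
            key = kv.split('=', 1)[0]
            return not (key.startswith('AS_RAW_') and key.endswith('RankSum'))

        with gzip.open(sys.argv[1], 'rt') as fin, gzip.open(sys.argv[2], 'wt') as fout:
            for line in fin:
                if line.startswith('##INFO=<ID=AS_RAW_') and 'RankSum' in line:
                    continue                  # matching header line: drop it
                if line.startswith('#'):
                    fout.write(line)          # other header lines pass through
                    continue
                fields = line.rstrip('\n').split('\t')
                # INFO is the 8th VCF column (0-based index 7)
                fields[7] = ';'.join(kv for kv in fields[7].split(';') if keep(kv)) or '.'
                fout.write('\t'.join(fields) + '\n')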

  • shlee (Cambridge) · Member, Broadie, Moderator

    GATK4 nightlies can be found in Docker image format at https://hub.docker.com/r/broadinstitute/gatk-nightly/.

  • bkelly01 (USA) · Member
    edited January 3

    @gauthier Just to be clear: there isn't a way to have GenomicsDBImport ignore an annotation, so I'd have to actually go through and remove all of those annotations myself? Is there any way I could edit the header so that the annotation gets ignored?

    I have nearly 1000 GVCFs to edit. I'll automate it, sure, but I'm looking for any shortcut I can find. :)

    Edit: I don't see AS_FS or AS_SOR in my GVCF; are they added by VQSR? It wasn't quite clear from your message.

    Thanks!

  • gauthier · Member, Broadie, Moderator, Dev

    I wanted to give you a solution guaranteed to work. It's likely that if you remove the headers for the AS annotations they will be ignored, but I'm honestly not sure. There's also the possibility of editing a JSON file, but that gets really gross really fast.

  • gauthier · Member, Broadie, Moderator, Dev

    AS_FS and AS_SOR get generated from the AS_SB_TABLE annotation in the GenotypeGVCFs stage.

    If you don't want to wait for the GenomicsDB update, the new GATK4 CombineGVCFs should work now (from a nightly). With 1000 samples you may want to do a hierarchical merge; see the sketch below. But you'll need to fix the old-format HC AS annotations either way.
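
    A sketch of what such a hierarchical merge could look like, driving CombineGVCFs from Python (the batch size, file names, and jar invocation are assumptions; the CombineGVCFs arguments mirror the ones used elsewhere in this thread):

        # hierarchical_combine.py -- illustrative two-level merge:
        # combine ~1000 GVCFs in batches, then combine the batch outputs.
        import subprocess

        def combine(gvcfs, out):
            cmd = ['java', '-jar', 'gatk.jar', 'CombineGVCFs',
                   '-R', 'GRCh38.d1.vd1.fa', '-O', out, '-G', 'StandardAnnotation']
            for g in gvcfs:
                cmd += ['-V', g]
            subprocess.run(cmd, check=True)

        gvcfs = [line.strip() for line in open('gvcfs.list')]
        batch_size = 100                        # assumed; tune to taste
        batch_outs = []
        for i in range(0, len(gvcfs), batch_size):
            out = 'batch_%02d.g.vcf.gz' % (i // batch_size)
            combine(gvcfs[i:i + batch_size], out)
            batch_outs.append(out)
        combine(batch_outs, 'allSamples.g.vcf.gz')  # second level: merge the merges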

  • bkelly01 (USA) · Member

    @gauthier OK, just so I'm understanding this correctly: I have to fix (i.e. remove) the AS_*RankSum annotations from the GVCFs no matter what (because I'm not going to rerun HC on all of them). I can then merge the GVCFs together with the nightly CombineGVCFs and run GenotypeGVCFs once I get to a single file. Alternatively, I can wait for the fixed version of GenomicsDB that supports AS annotations and run GenotypeGVCFs on that?

    I'll run a trial of just removing from the headers and report back.

    Thanks!

  • gauthier · Member, Broadie, Moderator, Dev

    Now that I think about it, you could probably run CombineGVCFs on what you have now, AS_*RankSum values and all, if you're careful with the annotation arguments. Where the example from https://gatkforums.broadinstitute.org/gatk/discussion/9622/allele-specific-annotation-and-filtering has

        ./gatk CombineGVCFs -R $reference \
            -V mySample1.AS.g.vcf -V mySample2.AS.g.vcf -V mySample3.AS.g.vcf \
            -o allSamples.g.AS.vcf \
            -G StandardAnnotation -G AS_StandardAnnotation

    you would instead run CombineGVCFs and GenotypeGVCFs with

        -G StandardAnnotation -A AS_FS -A AS_SOR -A AS_MQ -A AS_QD

    Then for VQSR, specify the classic ReadPosRankSum and MQRankSum, and the other annotations as allele-specific (with the -AS argument, of course, similar to the GATK3 example in https://gatkforums.broadinstitute.org/gatk/discussion/9622/allele-specific-annotation-and-filtering).

    GenomicsDB will be faster and the output will be smaller, but the above is an option if dropping the header lines doesn't work.

  • bkelly01 (USA) · Member

    @gauthier So I tried stripping the header, and as I sort of suspected, it complained about there being annotations not found in the header.

    I also tried using CombineGVCFs from the nightly as such:

    java -jar gatk.jar CombineGVCFs -R GRCh38.d1.vd1.fa -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O test.noAnn.g.vcf.gz -G StandardAnnotation

    but it throws a cryptic error almost immediately:

    [January 4, 2018 1:53:33 PM EST] CombineGVCFs --output test.noAnn.g.vcf.gz --variant TCGA-3C-AAAU.GERMLINE.g.vcf.gz --variant ../TCGA-3C-AALI/TCGA-3C-AALI.GERMLINE.g.vcf.gz --reference ../../../archive/RESEARCH/TCGA/reference/GRCh38.d1.vd1.fa --annotation-group StandardAnnotation --convertToBasePairResolution false --breakBandsAtMultiplesOf 0 --ignore_variants_starting_outside_interval false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disable-tool-default-read-filters false
    [January 4, 2018 1:53:33 PM EST] Executing as bjk003@rpl-igmmaster01 on Linux 3.10.0-693.2.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: 4.beta.6-132-g7454c15-SNAPSHOT
    13:53:37.188 INFO CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    13:53:37.188 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    13:53:37.188 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    13:53:37.188 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    13:53:37.188 INFO CombineGVCFs - Deflater: IntelDeflater
    13:53:37.188 INFO CombineGVCFs - Inflater: IntelInflater
    13:53:37.188 INFO CombineGVCFs - GCS max retries/reopens: 20
    13:53:37.188 INFO CombineGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    13:53:37.188 INFO CombineGVCFs - Initializing engine
    13:53:37.761 INFO FeatureManager - Using codec VCFCodec to read file file:///igm/projects/tcga_runs/test/TCGA-3C-AAAU.GERMLINE.g.vcf.gz
    13:53:37.879 INFO FeatureManager - Using codec VCFCodec to read file file:///igm/projects/tcga_runs/test/../TCGA-3C-AALI/TCGA-3C-AALI.GERMLINE.g.vcf.gz
    13:53:43.848 INFO CombineGVCFs - Done initializing engine
    13:53:44.540 INFO ProgressMeter - Starting traversal
    13:53:44.540 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    13:53:44.643 INFO CombineGVCFs - Shutting down engine
    [January 4, 2018 1:53:44 PM EST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 0.18 minutes.
    Runtime.totalMemory()=2822242304
    java.lang.IllegalArgumentException: Invalid interval. Contig:chr1 start:0 end:0
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:687)
        at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:60)
        at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:36)
        at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.mergeWithNewVCs(CombineGVCFs.java:258)
        at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.apply(CombineGVCFs.java:133)
        at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.apply(MultiVariantWalkerGroupedOnStart.java:67)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:110)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
        at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:112)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:875)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:128)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:185)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:204)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:156)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:199)
        at org.broadinstitute.hellbender.Main.main(Main.java:279)

    It seems like stripping out the AS_*RankSum annotations is my only path forward. Will run a test on that and let you know how it goes.

  • shlee (Cambridge) · Member, Broadie, Moderator

    Invalid interval. Contig:chr1 start:0 end:0

    Sounds like you even stripped out the contig lines from the header.

  • bkelly01 (USA) · Member

    @shlee They're definitely still in there. I didn't touch the GVCF produced by HC at all in that test.

  • shlee (Cambridge) · Member, Broadie, Moderator

    Looks like you are using CombineGVCFs from GATK4.beta.6, which may still be under development, as it is a tool we only recently started porting over. Please use CombineGVCFs from GATK3.7 or GATK3.8.

  • shlee (Cambridge) · Member, Broadie, Moderator

    Hey @bkelly01,

    I'm just recalling that the Number field of the INFO line can have a '.' value. I encountered this issue a year ago when running MuTect1 and found a solution by combing the forum. My notes say to change the value of the Number field to '.', which allows the number of values for the annotation to vary per record.

    To quote @flytrap, who is quoting some other unknown source:

    The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2 and so on. There are also certain special characters used to define special cases: If the field has one value for each possible allele (including the reference), then this value should be 'R'
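
    Concretely, applying that here would mean editing the header line from the original error to the following (illustrative only; this loosens validation rather than fixing the annotation values themselves):

        ##INFO=<ID=AS_RAW_ReadPosRankSum,Number=.,Type=String,Description="allele specific raw data for rank sum test of read position bias">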

    Apologies for the late solution. I don't man the forum regularly and I haven't much experience in VCF annotations and their respective errors.

  • bkelly01 (USA) · Member

    @gauthier

    Stripping out the AS annotations seemed to work. I'm currently able to import into GenomicsDB.

    A couple of quick questions, though: how long might I expect this to take, and does using batches, consolidating, and increasing the reader threads help? I kicked off chr1 using a batch size of 50, consolidate=true, and 32 reader threads; 10 hrs later it's still on the first batch. These are high-depth exomes. Any insight into what combination of parameters might help would be greatly appreciated. Thanks!

  • gauthier · Member, Broadie, Moderator, Dev

    @bkelly01 In production we use a batch size of 50 and 5 reader threads. I'm told that more than 5 reader threads actually hurts scalability. We don't actually use the consolidate flag in production, though I don't know why not. I'm pretty sure those were the parameters we used to make a 20K WGS callset, so they've been thoroughly optimized.
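
    For example, something along these lines (workspace and interval names are illustrative; the flags are the same ones visible in the GenomicsDBImport logs above):

        java -Xmx70g -jar gatk-package-4.beta.6-local.jar GenomicsDBImport \
            --genomicsDBWorkspace chr1_db \
            --batchSize 50 \
            --readerThreads 5 \
            --intervals chr1 \
            -V gvcfs.list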

    I think I saw that you're using either 4beta6 or the nightly, which is good, but I want to be explicit that you shouldn't use GenomicsDBImport from a version before that if you're using batching because there was a sample swap bug.

  • bkelly01 (USA) · Member

    @gauthier I'm using 4beta6. The nightly I have doesn't seem to accept a GVCF list for the -V parameter.

    So after about 60 hrs, I've got different chromosome jobs still running, ranging from batch 1 to batch 7 (of ~20 batches, 50 samples per batch). Is there any realistic chance of this finishing without taking weeks?

    I did also run into a problem with running all of these imports from the same directory. GenomicsDBImport creates a .tmp_loader.json file in the directory you kick the imports off from, which is problematic when you kick off 24 imports, one for each chromosome. This JSON file specifies the location of the GenomicsDB that the job writes to. With that file being overwritten each time I kicked off a new import, I got core dumps and jobs writing to the wrong db: I had specified a unique db for each, but with the jobs all reading from the same JSON file, what I specified on the command line and what the jobs actually did didn't line up. I ended up creating a separate directory for each chromosome and kicking each job off from there, which solved the problem.

    Thank you for your help on this!

  • gauthier · Member, Broadie, Moderator, Dev

    @bkelly01 GenotypeGVCFs no longer supports multiple inputs. It requires a single GenomicsDB instance or a single combined GVCF.

    For a ~1200 sample WGS callset I just ran, I scattered the import ~2000 ways and each scatter took about 3 hours. It should be linear in the number of samples and the size of the interval, in which case each chromosome would take (on average) a little over 200 hours, or about nine days.

    We run the workflow with GenomicsDBImport in the cloud, so each job gets its own virtual machine and we haven't seen that problem before. I've created multiple GDBs in the same directory, though not at the same time. I will note that as a usage limitation in the docs.
