Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Will you make the allele-specific HC annotations adhere to the VCF standard?

laashelaashe NorwayMember

Hi,

I am having an issue related to the allele-specific annotations added during the cohort germline calling pipeline (HaplotypeCaller -> CombineGVCFs -> GenotypeGVCFs -> VQSR) with the option "-G AS_StandardAnnotation".

Specifically I have to remove variants detected in too few samples (according to some arbitrary threshold) from CombineGVCFs results, before passing them on to GenotypeGVCFs. In some multiallelic sites there might be only one of the variants that needs to be removed, so I go through the INFO and FORMAT tags that have [ARG] number of values, and remove the values specific to the targeted alternate allele. The issue is that, to my understanding of the VCF standard, the allele-specific annotations added by "-G AS_StandardAnnotation" (prefixed "AS") do not adhere to it. They obviously contain multiple values, even though they are listed with "Number=1" in the file header, and they seem to use '|' as a separator in addition to ','. This makes it impossible for me to correctly trim these annotations when removing variants, and by not doing so, it seems to make the following GenotypeGVCFs step crash, with a failure relating to one of the allele-specific annotations ("AS_StrandBiasTest"), relevant partial log output pasted below.

java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at java.util.Collections$UnmodifiableList.get(Collections.java:1309)
        at org.broadinstitute.hellbender.tools.walkers.annotator.allelespecific.AS_StrandBiasTest.parseRawDataString(AS_StrandBiasTest.java:183)
        at org.broadinstitute.hellbender.tools.walkers.annotator.allelespecific.AS_StrandBiasTest.combineRawData(AS_StrandBiasTest.java:122)
        at org.broadinstitute.hellbender.tools.walkers.annotator.VariantAnnotatorEngine.combineAnnotations(VariantAnnotatorEngine.java:304)
        at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.mergeAttributes(ReferenceConfidenceVariantContextMerger.java:267)
        at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:101)
        at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:200)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:110)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:159)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:201)
        at org.broadinstitute.hellbender.Main.main(Main.java:287)

So I am just wandering if you either have a plan to change the formatting of the "AS"-tags, or if you could please tell me how to correctly interpret them for the purpose of trimming away allele-specific data? This would be greatly appreciated.

GATK version 4.0.2.0.

Regards,
Lars

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @laashe
    Hi Lars,

    Sorry for the delay.

    Specifically I have to remove variants detected in too few samples (according to some arbitrary threshold) from CombineGVCFs results, before passing them on to GenotypeGVCFs.

    What do you mean by this? You should not be using any GVCF for analysis. Wait until you get the final VCF from GenotypeGVCFs for any analysis/post-processing.

    -Sheila

  • laashelaashe NorwayMember

    Hi,

    Thank you for the reply, I am sorry for my late one, I have been quite busy since last week.

    We aim to gather variants from many different sites around the country, but due to political and juridical reasons we are not allowed to gather variants found in fewer than five samples on-site. Our plan is to run HC and CombineGVCFs on-site, and then run CombineGVCFs, GenotypeGVCFs, and VQSR centrally. To be on the correct side of the law we have to remove variants found in less than five samples before shipping data from the site to the central unit.

    To do this I have to trim away some variants from multi-allelic sites, which creates the problems described above. I hope you can help!

    Regards,
    Lars

  • SheilaSheila Broad InstituteMember, Broadie admin

    @laashe
    Hi Lars,

    I see. So, you would like to remove any sites that have variants in less than 5 samples? How did you do this? Have you tried using SelectVariants with JEXL to select for sites that are variant in at least 5 samples? Is the output of SelectVariants throwing the error?

    Thanks,
    Sheila

  • laashelaashe NorwayMember

    @Sheila
    Hi Sheila,

    We actually don't want to remove the site, as that would remove desired information regarding the site's callability etc, only the targeted variant and its related data (allele fraction etc). This means that for multi-allelic sites where there are both too rare (found in <5 samples) and less rare variants(>=5), we attempt to remove all data relating to the <5 variant(s). We have an in-house tool that works very well as long as the information is properly adhering to the VCF standard, but, as described above, fails to handle the AS_* tags added by the -G AS_StandardAnnotation option, as they are not following the VCF standard.

    I assume JEXL works on a site-basis and not allele-basis? If so it is unfortunately not suitable for our needs.

    So, I would still like to know how to correctly decompose the AS_* tags so that I can remove the relevant information. That is if you are not planning to change the formatting of the tags so that they actually adhere to the VCF standard, which would of course be the better option ;-)

    Regards,
    Lars

    Issue · Github
    by Sheila

    Issue Number
    3091
    State
    open
    Last Updated
    Assignee
    Array
  • SheilaSheila Broad InstituteMember, Broadie admin

    @laashe
    Hi Lars,

    I assume JEXL works on a site-basis and not allele-basis?

    I think that is correct. You may be able to tweak it to work on an allele-basis, but I am not sure how much extra effort that is.

    So, I would still like to know how to correctly decompose the AS_* tags so that I can remove the relevant information.

    Let me check with the team on this. I am pretty sure the AS annotation adhere to the VCF spec. Our outputs always adhere to the specs :smiley:

    -Sheila

  • laashelaashe NorwayMember

    @gauthier
    Hi Laura,

    Thank you very much for the informative reply, the information in the last paragraph of your reply is what I was looking for!

    I am sorry for being unclear on the "not adhering to the VCF spec" part of my issue. What I meant was that I could clearly see that there obviously were more than one value in the tags, while they were formatted as a single string. I now understand why you did this, thanks again for the clarification, it was just an issue of me not understanding how to decompose the data. Yes, you are technically adhering to the VCF standard, but in a way that is difficult to handle when needing to do custom operations.

    We are gathering data from institutions all over the country, aggregating centrally. We can not run GenotypeGVCF until the data is located centrally, and as explained above we need to remove some of the variants before centralising the data.

    @gauthier, @Sheila, thanks for your help the both of you, I hope the next VCF spec allows for lists of lists.

    Regards,
    Lars

  • gauthiergauthier Member, Broadie, Dev ✭✭✭

    Hi Lars,

    I opened an issue for proper allele-specific annotation parsing in GATK. You can monitor the progress here: https://github.com/broadinstitute/gatk/issues/4795

    Best,
    Laura

Sign In or Register to comment.