Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on October 14, 2019, due to the U.S. holiday. We will return to monitoring the forum on October 15.

How to left normalize multiallelic indels in ploidy=6 vcf files?

Hi,
I have generated 26 GVCFs (with ploidy parameter=6), then run GATK-GenotypeGVCFs, successfully obtaining my genotype calls. Now I am trying to left normalize indels, using GATK-LeftAlignAndTrimVariants, and I get an ERROR MESSAGE.

The command line:

java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T LeftAlignAndTrimVariants -R hg38/Homo_sapiens_assembly38.fasta --variant samples_6N.vcf -o samples_6N_LeftNorm.vcf --splitMultiallelics   

The ERROR message:

##### ERROR --
##### ERROR stack trace 
java.lang.IllegalStateException: Must initialize the cache of allele anyploid indices for ploidy 6
        at htsjdk.variant.variantcontext.GenotypeLikelihoods.getAlleles(GenotypeLikelihoods.java:532)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.getLikelihoodIndexes(GATKVariantContextUtils.java:681)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.determineLikelihoodIndexesToUse(GATKVariantContextUtils.java:639)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.subsetAlleles(GATKVariantContextUtils.java:610)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.splitVariantContextToBiallelics(GATKVariantContextUtils.java:1071)
        at org.broadinstitute.gatk.tools.walkers.variantutils.LeftAlignAndTrimVariants.map(LeftAlignAndTrimVariants.java:212)
        at org.broadinstitute.gatk.tools.walkers.variantutils.LeftAlignAndTrimVariants.map(LeftAlignAndTrimVariants.java:137)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Must initialize the cache of allele anyploid indices for ploidy 6
##### ERROR ------------------------------------------------------------------------------------------  

Could you, please, help me with this problem? I could not find a similar discussion.
Thanks a lot!

Issue · Github
by Sheila

Issue Number
3045
State
open
Last Updated

Answers

  • shleeshlee ✭✭✭✭✭ CambridgeMember, Broadie ✭✭✭✭✭

    Hi @lido_cm,

    There should be no need for you to left-align your variants if you called them using GATK tools. GATK already left-aligns. Are you finding otherwise?

  • Hi Shlee,
    thanks for your answer.
    I think my question was not clear enough (or I am confuse!), as my problem seems to be not only the left normalization, but mainly the split of multiallelic indels (into multiple rows) to further annotate them.
    I need to use LeftAlignAndTrimVariants to deal with the following well-described problem (by ANNOVAR):
    vcf_complications
    The final vcf from joint analysis turns SNVs into multi-nucleotide variants: "(...) T->A SNV, but because the deletion/insertion hijack the locus, it is written as CTTT->CTTA rather than T->A. Considering that an allele frequency database (say 1000 Genomes Project frequency database) would only have T->A but not CTTT->CTTA, then this variant will be missed by annotation software as a private variant, even if it is actually observed in 1000G."
    ANNOVAR recommends the use of bcftools norm as below:

    bcftools norm -m-both -o ex1.step1.vcf ex1.vcf.gz  
    bcftools norm -f human_g1k_v37.fasta -o ex1.step2.vcf ex1.step1.vcf  
    

    With my ploidy=6 vcf files, bcftools norm complains about PL files of the first multiallelic indel:

    Error at chr1:111489032, the tag PL has wrong number of fields  
    

    And GATK-LeftAlignAndTrimVariants gives the error message posted above.
    I hope you can help me with this issue.
    Thanks again, sorry if I was not clear.

  • shleeshlee ✭✭✭✭✭ CambridgeMember, Broadie ✭✭✭✭✭

    @lido_cm,

    There is no need to convert multiallelic sites to biallelic records. when using LeftAlignAndTrimVariants. The tool can handle multiallelic sites as I show here. So you should theoretically be able to run your command without specifying the --splitMutliallelics argument. That being said, I suggest seeing what ValidateVariants says about your VCF first.

    As for the error:

    java.lang.IllegalStateException: Must initialize the cache of allele anyploid indices for ploidy 6

    It looks like others have encountered this error and found an explanation here.

    If you suspect the tool may not handle the high ploidy, let us know. We'll have to file a bug report.

  • Hi shlee, thank you very much for your answer.

    I've run ValidateVariants and my vcf looks fine.
    Also, removing --splitMultiallelics from command line, LeftAlignAndTrimVariants runs fine, but it doesn't solve my problem with multiallelic sites, which I need to split to further annotate it correctly.

    The second link you have suggested to deal with ERROR MESSAGE: Must initialize the cache of allele anyploid indices for ploidy 6 also did not helped me. If I understood correctly, the bug was solved by removing "." sample fields. I get the same error message for a single sample VCF, but also with ploidy=6 (pool of individuals).

    I have the feeling that bcftools norm and LeftAlignAndTrimVariants cannot handle ploidy>2 to split multiallelic sites.

    To remind, using bcftools norm -m -both, I get "Error at chr1:111489032, the tag PL has wrong number of fields"
    This problematic row:

    chr1    111489032       rs138279022     AACACAC A,AACAC 4885.86 PASS AC=2,1;AF=0.333,0.167;AN=6;BaseQRankSum=-1.095e+00;ClippingRankSum=0.00;DB;DP=395;FS=1.331;MLEAC=2,1;MLEAF=0.333,0.167;MQ=60.52;MQRankSum=4.98;QD=14.90;ReadPosRankSum=-6.781e+00;SOR=0.789 GT:AD:DP:GQ:PL 0/0/0/1/1/2:181,136,11:328:97:4921,324,114,130,317,793,9678,4748,167,0,97,497,6920,4900,359,284,605,6589,5119,657,805,6467,5439,1184,6491,5998,6728,11522  
    

    Also, looking for samtools htsjdk.variant.variantcontext.GenotypeLikelihoods.getAlleles (the tool LeftAlignAndTrimVariants calls), I see that there is an internal function calculateAnyPloidPLCache:

    calculateAnyploidPLcache
    protected static java.util.List<java.util.List<java.lang.Integer>> calculateAnyploidPLcache(int altAlleles,                                                                                            int ploidy)  
    Calculate the cache of allele indices for each PL index for a ploidy. Calculation in @see #calculatePLIndexToAlleleIndices  
    
    Parameters:
    altAlleles - Number of alternate alleles  
    ploidy - Number of chromosomes in set  
    Returns:
    PL index to the alleles of general ploidy over all possible alternate alleles
    

    Is there a way to inform the ploidy so PL index can be estimated accordingly?
    Does it make sense?

    I've found a similar discussion here, and I hope something has changed since 2014.
    Please, do you have another suggestion?
    Thanks a lot!!!

    Issue · Github
    by shlee

    Issue Number
    3022
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    chandrans
  • shleeshlee ✭✭✭✭✭ CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for the followup @lido_cm. Can you tell us the version of GATK you used to generated the GVCF and the final callset and post the exact commands you ran for each analysis? I will consult with others on the team on this one.

  • Hi shlee, thank you very much for your help.

    I've used GATK v3.7-0, and below you can find the commands (and links to results):

    GVCF

    java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 4 -R hg38/Homo_sapiens_assembly38.fasta -I Genomes_recal.bam -L targetRegion38.bed -ERC GVCF -ploidy 6 -o Genomes_6N.g.vcf 2> gVCF_Genomes_ploidy6.log &  
    

    VCF

    java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg38/Homo_sapiens_assembly38.fasta -D dbsnp_146.hg38.vcf -V Genomes_6N.g.vcf -o Genomes_6N.vcf 2> genotype_GVCF_Genomes_6N.log &   
    

    validadeVariants_log

    java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T ValidateVariants -R hg38/Homo_sapiens_assembly38.fasta -V Genomes_6N.vcf --dbsnp dbsnp_146.hg38.vcf 2> validateVariants_Genomes_6N.log &   
    

    leftAlignAndTrimVariants_log

    java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T LeftAlignAndTrimVariants -R hg38/Homo_sapiens_assembly38.fasta --variant Genomes_6N.vcf -o Genomes_6N_LeftNorm.vcf --splitMultiallelics 2> leftNorm_Genomes_6N.log &  
    

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    Hi @lido_cm, we'll have to confirm with the devs but I agree with your suspicion that the tools aren't able to split multiallelics correctly in the non-diploid case. That seems like the most likely explanation given what you're observing in your various tests. Unfortunately if that's the case we're not in a position to do anything about it, at least not for a while. We still have a lot of work to do on our primary pipelines so it's unlikely that we'll be able to put effort into the non-diploid cases in the near future. I would recommend looking for other solutions out in the research community.

    That being said if you can provide us with the files to reproduce your issue we can put in a feature request just in case it tickles the inspiration of a developer waiting for their tests to run :)

  • Hi Geraldine, thank you for your reply.
    My files are linked to the previous comment.
    I've also linked this post to samtools/bcftools here, as the problem with bcftools looks the same. ;)
    Thanks!

  • shleeshlee ✭✭✭✭✭ CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for linking the file @lido_cm. @Sheila will follow up.

  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin

    @lido_cm
    Hi,

    Sorry for the delay. I am working on testing bugs in a new way, and things are taking a bit longer than expected. I should get back to you soon.

    -Sheila

  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin

    @lido_cm
    Hi again,

    It seems you are using a non-standard reference. Can you please upload the reference you are using? Instructions are here.

    Thanks,
    Sheila

  • Dear Sheila,
    I apologize for the late response. I've uploaded on GATK FTP server the following reference files:

    LIDO_hg38_fasta.tar.gz  #Homo_sapiens_assembly38.fasta   
    LIDO_hg38_fai.tar.gz #Homo_sapiens_assembly38.fasta.fai   
    LIDO_hg38_dict.tar.gz #Homo_sapiens_assembly38.dict   
    

    Thanks

  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin

    @lido_cm
    Hi,

    Thanks. I am having a look now and will get back to you soon.

    -Sheila

Sign In or Register to comment.