How to left normalize multiallelic indels in ploidy=6 vcf files?

I have generated 26 GVCFs (with ploidy parameter=6), then run GATK-GenotypeGVCFs, successfully obtaining my genotype calls. Now I am trying to left normalize indels, using GATK-LeftAlignAndTrimVariants, and I get an ERROR MESSAGE.

The command line:

java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T LeftAlignAndTrimVariants -R hg38/Homo_sapiens_assembly38.fasta --variant samples_6N.vcf -o samples_6N_LeftNorm.vcf --splitMultiallelics   

The ERROR message:

##### ERROR --
##### ERROR stack trace 
java.lang.IllegalStateException: Must initialize the cache of allele anyploid indices for ploidy 6
        at htsjdk.variant.variantcontext.GenotypeLikelihoods.getAlleles(
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.getLikelihoodIndexes(
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.determineLikelihoodIndexesToUse(
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.subsetAlleles(
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.splitVariantContextToBiallelics(
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions
##### ERROR
##### ERROR MESSAGE: Must initialize the cache of allele anyploid indices for ploidy 6
##### ERROR ------------------------------------------------------------------------------------------  

Could you, please, help me with this problem? I could not find a similar discussion.
Thanks a lot!

Issue · Github
by Sheila

Issue Number
Last Updated


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @lido_cm,

    There should be no need for you to left-align your variants if you called them using GATK tools. GATK already left-aligns. Are you finding otherwise?

  • Hi Shlee,
    thanks for your answer.
    I think my question was not clear enough (or I am confuse!), as my problem seems to be not only the left normalization, but mainly the split of multiallelic indels (into multiple rows) to further annotate them.
    I need to use LeftAlignAndTrimVariants to deal with the following well-described problem (by ANNOVAR):
    The final vcf from joint analysis turns SNVs into multi-nucleotide variants: "(...) T->A SNV, but because the deletion/insertion hijack the locus, it is written as CTTT->CTTA rather than T->A. Considering that an allele frequency database (say 1000 Genomes Project frequency database) would only have T->A but not CTTT->CTTA, then this variant will be missed by annotation software as a private variant, even if it is actually observed in 1000G."
    ANNOVAR recommends the use of bcftools norm as below:

    bcftools norm -m-both -o ex1.step1.vcf ex1.vcf.gz  
    bcftools norm -f human_g1k_v37.fasta -o ex1.step2.vcf ex1.step1.vcf  

    With my ploidy=6 vcf files, bcftools norm complains about PL files of the first multiallelic indel:

    Error at chr1:111489032, the tag PL has wrong number of fields  

    And GATK-LeftAlignAndTrimVariants gives the error message posted above.
    I hope you can help me with this issue.
    Thanks again, sorry if I was not clear.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭


    There is no need to convert multiallelic sites to biallelic records. when using LeftAlignAndTrimVariants. The tool can handle multiallelic sites as I show here. So you should theoretically be able to run your command without specifying the --splitMutliallelics argument. That being said, I suggest seeing what ValidateVariants says about your VCF first.

    As for the error:

    java.lang.IllegalStateException: Must initialize the cache of allele anyploid indices for ploidy 6

    It looks like others have encountered this error and found an explanation here.

    If you suspect the tool may not handle the high ploidy, let us know. We'll have to file a bug report.

  • Hi shlee, thank you very much for your answer.

    I've run ValidateVariants and my vcf looks fine.
    Also, removing --splitMultiallelics from command line, LeftAlignAndTrimVariants runs fine, but it doesn't solve my problem with multiallelic sites, which I need to split to further annotate it correctly.

    The second link you have suggested to deal with ERROR MESSAGE: Must initialize the cache of allele anyploid indices for ploidy 6 also did not helped me. If I understood correctly, the bug was solved by removing "." sample fields. I get the same error message for a single sample VCF, but also with ploidy=6 (pool of individuals).

    I have the feeling that bcftools norm and LeftAlignAndTrimVariants cannot handle ploidy>2 to split multiallelic sites.

    To remind, using bcftools norm -m -both, I get "Error at chr1:111489032, the tag PL has wrong number of fields"
    This problematic row:

    chr1    111489032       rs138279022     AACACAC A,AACAC 4885.86 PASS AC=2,1;AF=0.333,0.167;AN=6;BaseQRankSum=-1.095e+00;ClippingRankSum=0.00;DB;DP=395;FS=1.331;MLEAC=2,1;MLEAF=0.333,0.167;MQ=60.52;MQRankSum=4.98;QD=14.90;ReadPosRankSum=-6.781e+00;SOR=0.789 GT:AD:DP:GQ:PL 0/0/0/1/1/2:181,136,11:328:97:4921,324,114,130,317,793,9678,4748,167,0,97,497,6920,4900,359,284,605,6589,5119,657,805,6467,5439,1184,6491,5998,6728,11522  

    Also, looking for samtools htsjdk.variant.variantcontext.GenotypeLikelihoods.getAlleles (the tool LeftAlignAndTrimVariants calls), I see that there is an internal function calculateAnyPloidPLCache:

    protected static java.util.List<java.util.List<java.lang.Integer>> calculateAnyploidPLcache(int altAlleles,                                                                                            int ploidy)  
    Calculate the cache of allele indices for each PL index for a ploidy. Calculation in @see #calculatePLIndexToAlleleIndices  
    altAlleles - Number of alternate alleles  
    ploidy - Number of chromosomes in set  
    PL index to the alleles of general ploidy over all possible alternate alleles

    Is there a way to inform the ploidy so PL index can be estimated accordingly?
    Does it make sense?

    I've found a similar discussion here, and I hope something has changed since 2014.
    Please, do you have another suggestion?
    Thanks a lot!!!

    Issue · Github
    by shlee

    Issue Number
    Last Updated
    Closed By
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for the followup @lido_cm. Can you tell us the version of GATK you used to generated the GVCF and the final callset and post the exact commands you ran for each analysis? I will consult with others on the team on this one.

  • Hi shlee, thank you very much for your help.

    I've used GATK v3.7-0, and below you can find the commands (and links to results):


    java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 4 -R hg38/Homo_sapiens_assembly38.fasta -I Genomes_recal.bam -L targetRegion38.bed -ERC GVCF -ploidy 6 -o Genomes_6N.g.vcf 2> gVCF_Genomes_ploidy6.log &  


    java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg38/Homo_sapiens_assembly38.fasta -D dbsnp_146.hg38.vcf -V Genomes_6N.g.vcf -o Genomes_6N.vcf 2> genotype_GVCF_Genomes_6N.log &   


    java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T ValidateVariants -R hg38/Homo_sapiens_assembly38.fasta -V Genomes_6N.vcf --dbsnp dbsnp_146.hg38.vcf 2> validateVariants_Genomes_6N.log &   


    java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T LeftAlignAndTrimVariants -R hg38/Homo_sapiens_assembly38.fasta --variant Genomes_6N.vcf -o Genomes_6N_LeftNorm.vcf --splitMultiallelics 2> leftNorm_Genomes_6N.log &  


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @lido_cm, we'll have to confirm with the devs but I agree with your suspicion that the tools aren't able to split multiallelics correctly in the non-diploid case. That seems like the most likely explanation given what you're observing in your various tests. Unfortunately if that's the case we're not in a position to do anything about it, at least not for a while. We still have a lot of work to do on our primary pipelines so it's unlikely that we'll be able to put effort into the non-diploid cases in the near future. I would recommend looking for other solutions out in the research community.

    That being said if you can provide us with the files to reproduce your issue we can put in a feature request just in case it tickles the inspiration of a developer waiting for their tests to run :)

  • Hi Geraldine, thank you for your reply.
    My files are linked to the previous comment.
    I've also linked this post to samtools/bcftools here, as the problem with bcftools looks the same. ;)

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for linking the file @lido_cm. @Sheila will follow up.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin


    Sorry for the delay. I am working on testing bugs in a new way, and things are taking a bit longer than expected. I should get back to you soon.


  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    Hi again,

    It seems you are using a non-standard reference. Can you please upload the reference you are using? Instructions are here.


  • Dear Sheila,
    I apologize for the late response. I've uploaded on GATK FTP server the following reference files:

    LIDO_hg38_fasta.tar.gz  #Homo_sapiens_assembly38.fasta   
    LIDO_hg38_fai.tar.gz #Homo_sapiens_assembly38.fasta.fai   
    LIDO_hg38_dict.tar.gz #Homo_sapiens_assembly38.dict   


  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin


    Thanks. I am having a look now and will get back to you soon.


Sign In or Register to comment.