Attention:
The frontline support team will be slow on the forum because we are occupied with the GATK Workshop on March 21st and 22nd 2019. We will be back and more available to answer questions on the forum on March 25th 2019.

CalculateGenotypePosteriors error

amywilliamsamywilliams Ithaca, NYMember

Hi,

I've run the GATK best practices pipeline up through VQSR and have the recalibrated variants in a VCF file. Because I'm analyzing pedigree samples, I'm now attempting to run CalculateGenotypePosteriors. When I do this, I get the following error:

ERROR MESSAGE: Variant does not contain the same number of MLE allele counts as alternate alleles for record at 1:768589

However, when I looked at that variant in the recalibrated VCF file, I see that MLEAC=1, AC=1, and I can confirm that there is only one sample that is heterozygous, so both AC and MLEAC are correct.

Might this be a bug?

Thanks,
Amy

Issue · Github
by Geraldine_VdAuwera

Issue Number
845
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Amy,

    Can you please confirm that you are using the latest version (3.3-0)?

    If so, can you please post the variant records associated with this error?

  • amywilliamsamywilliams Ithaca, NYMember
    edited January 2015

    HI Geraldine,

    Yes, I'm using 3.3-0 (specifically 3.3-0-g37228af). Attached is a short VCF file with the header (fake sample ids), and the offending line from the VCF. If you need more, I'd like to transmit outside the forum.

    Post edited by amywilliams on
  • amywilliamsamywilliams Ithaca, NYMember

    Thanks, Geraldine, I'll send this right away.

  • amywilliamsamywilliams Ithaca, NYMember

    Hi Geraldine,

    Just posted this. Filename is calc_geno_post_bug.tgz

    Best,
    Amy

  • amywilliamsamywilliams Ithaca, NYMember

    Hi Geraldine,

    Is it possible to check the status of this bug?

    Best,
    Amy

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amywilliams
    Hi Amy,

    Thanks for checking. Unfortunately, there have been so many bugs that I have not submitted your bug report yet. I will try to get it in tonight.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amywilliams
    Hi Amy,

    Your bug report is in. However, I do not know when it will get fixed, but I will keep you updated.

    -Sheila

  • amywilliamsamywilliams Ithaca, NYMember

    Hi @Sheila, any updates?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amywilliams
    Hi Amy,

    Sorry, I do not have any updates. The developers have a lot on their plates right now, and unfortunately, this is on the backburner.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amywilliams
    Hi Amy,

    Unfortunately, the bug is taking longer to fix than expected. However, there is a quick workaround for the time being. You can simply use your supporting VCF subset to just biallelics (with SelectVariants).

    -Sheila

  • amywilliamsamywilliams Ithaca, NYMember

    @Sheila, thank you. I'll use the workaround. Let me know when there's a fix, too, but no big rush for me as I'm focused on biallelics.

  • amywilliamsamywilliams Ithaca, NYMember

    @Sheila, Finally had a moment to do this. Unfortunately the variant that had failed is biallelic and the filtered version of the data hits the same bug. Any thoughts on this? Is there a flag to tell the GATK to bypass such discrepancies?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @amywilliams ,

    Can you confirm that you have subset your supporting VCF file to only biallelics, not just your callset?

    So the problem would be a mismatch between the alt allele reported in your callset vs your supporting set?

  • amywilliamsamywilliams Ithaca, NYMember

    Hi @Geraldine_VdAuwera, yes, I used SelectVariants --restrictAllelesTo BIALLELIC and then used the resulting VCF as input to CalculateGenotypePosteriors. I encountered the same bug, and the failure was at the same site: a biallelic SNP.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @amywilliams Can you please post the variant records at that position in each of your files, the input variants and the supporting variants?

  • amywilliamsamywilliams Ithaca, NYMember

    @Geraldine_VdAuwera I uploaded the dataset that reproduces the bug before as filename calc_geno_post_bug.tgz. Do you still want this?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes, just as a quick spot check please. We haven't been able to get to your bug report yet due to some high priority items, unfortunately.

  • amywilliamsamywilliams Ithaca, NYMember

    Here is a VCF with the problem variant. Let me know if you need anything else.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amywilliams
    Hi Amy,

    Can you also post the record from the supporting variants file?

    Thanks,
    Sheila

  • amywilliamsamywilliams Ithaca, NYMember

    Is the supporting variants file the gvcf?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    No, it is the file you input to the --supporting argument. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors.php#--supporting

    I am assuming you uploaded the input vcf (-V) record above.

  • amywilliamsamywilliams Ithaca, NYMember

    Ah, sorry. I used the GATK's 1000G_phase1.snps.high_confidence.b37.vcf file. A VCF with the relevant line is attached.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited March 2015

    @amywilliams
    Hi Amy,

    The issue is that the site is not biallelic in the supporting vcf. Unfortunately, we have still not fixed this bug yet. You need to use Select Variants and restrict the supporting vcf file to only biallelics. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--restrictAllelesTo

    -Sheila

  • amywilliamsamywilliams Ithaca, NYMember

    Ah, good to know. Thanks for your help.

  • amywilliamsamywilliams Ithaca, NYMember

    One more question @Sheila, I'd previously used --restrictAllelesTo BIALLELIC, but that only applied to the provided VCF file. Is it possible to give SelectVariants the supporting variants to use for filtering?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amywilliams
    Yes, that is exactly what you need to do.

  • amywilliamsamywilliams Ithaca, NYMember

    How? I don't see an option for including supporting variants for SelectVariants. I can readily come up with an alternate solution if needed.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amywilliams
    You just input the supporting vcf as the input (-V) to SelectVariants.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    edited July 2015

    Update on this issue: this turns out to be a problem with the older VCF (4.0) specification, which allows the AC and AF entry counts to be "unbounded", leading to AC and AF fields where there are fewer entries than there are alleles, which the tool cannot handle properly.
    More recent version of the VCF spec (4.1 and up) require that there should be as many AC entries as alternate alleles. We are not going to patch the tool to handle the older spec (because it is a bad, bad, spec), so this tool should be used only with files complying with the newer VCF specs (4.1 and up).

    We've updated the resource bundle to include the file 1000G_phase3_v4_20130502.sites.vcf, which is a valid 4.2 VCF and should be used as supporting VCF for genotype refinement from now on.

    This resolves the issue, which we will now consider closed.

  • d0ct0rd0ct0r AarhusMember
    edited October 2015

    @Geraldine_VdAuwera I don't see 1000G_phase3_v4_20130502.sites.vcf in the bundle 2.8.

    Update: Sorry! I didn't see properly. But I am not able to download the files. I tried multiple times but couldn't download. Is there any external mirrors available to download these resources?

    Post edited by d0ct0r on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    No mirror, sorry. Can you clarify how you tried to download the file and what happened (ie error message etc)?

  • sbourgeoissbourgeois London, UKMember

    Hi @Geraldine_VdAuwera

    I converted 1000G_phase3_v4_20130502.sites.vcf to hg19 using Picard 2.10.0 using the following command:

    java -Xincgc -Xmx5G -jar /data/home/hhx037/picard-tools/picard.jar LiftoverVcf I=/data/WHRI-Panos-Scratch/REPOSITORY/GATK_resources/b37/1000G_phase3_v4_20130502.sites.vcf O=/data/WHRI-Panos-Scratch/REPOSITORY/GATK_resources/hg19/1000G_phase3_v4_20130502.hg19.sites.vcf CHAIN=/data/WHRI-Panos-Scratch/REPOSITORY/GATK_resources/liftover/b37tohg19.chain REJECT=/data/WHRI-Panos-Scratch/REPOSITORY/GATK_resources/hg19/1000G_phase3_v4_20130502.hg19.liftover_rejected.vcf R=/data/WHRI-Panos-Scratch/REPOSITORY/GATK_resources/hg19/ucsc.hg19.fasta

    and when running CalculateGenotypePosteriors I get the following error:

    OpenJDK 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
    OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N
    INFO 13:02:12,260 HelpFormatter - ----------------------------------------------------------------------------------
    INFO 13:02:12,266 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
    INFO 13:02:12,267 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 13:02:12,267 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 13:02:12,267 HelpFormatter - [Sat Jul 08 13:02:12 BST 2017] Executing on Linux 3.10.0-514.16.1.el7.x86_64 amd64
    INFO 13:02:12,267 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_131-b12
    INFO 13:02:12,270 HelpFormatter - Program Args: -T CalculateGenotypePosteriors -R /data/WHRI-Panos-Scratch/REPOSITORY/GATK_resources/hg19/ucsc.hg19.fasta -V /data/WHRI-Panos-Scratch/hhx037/piperi/filtered/piperi_snps_indels.vcf --supporting /data/WHRI-Panos-Scratch/REPOSITORY/GATK_resources/hg19/1000G_phase3_v4_20130502.hg19.sites.vcf -ped /data/WHRI-Panos-Scratch/hhx037/piperi/piperi.ped -o ./refined/piperi_snps_indels.postCGP.vcf
    INFO 13:02:12,274 HelpFormatter - Executing as [email protected] on Linux 3.10.0-514.16.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-b12.
    INFO 13:02:12,275 HelpFormatter - Date/Time: 2017/07/08 13:02:12
    INFO 13:02:12,275 HelpFormatter - ----------------------------------------------------------------------------------
    INFO 13:02:12,275 HelpFormatter - ----------------------------------------------------------------------------------
    INFO 13:02:12,496 GenomeAnalysisEngine - Strictness is SILENT
    INFO 13:02:12,813 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 13:02:13,584 PedReader - Reading PED file /data/WHRI-Panos-Scratch/hhx037/piperi/piperi.ped with missing fields: []
    INFO 13:02:13,599 PedReader - Phenotype is other? false
    INFO 13:02:13,882 GenomeAnalysisEngine - Preparing for traversal
    INFO 13:02:13,890 GenomeAnalysisEngine - Done preparing for traversal
    INFO 13:02:13,890 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 13:02:13,890 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 13:02:13,891 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

    ERROR --
    ERROR stack trace

    java.lang.ArrayIndexOutOfBoundsException: 2
    at org.broadinstitute.gatk.tools.walkers.variantutils.FamilyLikelihoodsUtils.getLikelihoodsAsMapSafeNull(FamilyLikelihoodsUtils.java:455)
    at org.broadinstitute.gatk.tools.walkers.variantutils.FamilyLikelihoodsUtils.updateFamilyGenotypes(FamilyLikelihoodsUtils.java:392)
    at org.broadinstitute.gatk.tools.walkers.variantutils.FamilyLikelihoodsUtils.calculatePosteriorGLs(FamilyLikelihoodsUtils.java:263)
    at org.broadinstitute.gatk.tools.walkers.variantutils.CalculateGenotypePosteriors.map(CalculateGenotypePosteriors.java:345)
    at org.broadinstitute.gatk.tools.walkers.variantutils.CalculateGenotypePosteriors.map(CalculateGenotypePosteriors.java:189)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk

    Is this caused by the liftover or it's a different issue? Any idea how to fix this?

    Cheers,

    Steph

  • zejunyanzejunyan EdinburghMember

    @Geraldine_VdAuwera
    Hi, I have enjoyed your workshop a lot at Edinburgh. Thank you again.
    I got a question about priors involved in Gene Refinement Workflow. As I have more than 10 samples in my chicken trio, in addition to Family Prior, I could use input-derived population prior as my second input for calculating GenotypePosterior. My question is how and where I can calculate input-derived population prior?

    Bests

    Dr Yan

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @zejunyan,

    Geraldine has yet to return to Boston. In the meanwhile, please check out these documents:

    If things are still unclear, please do followup by posting again to this same thread.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sbourgeois
    Hi Steph,

    Sorry for the delay. Can you confirm all of you input VCFs validate with ValidateVariants?

    Thanks,
    Sheila

  • sbourgeoissbourgeois London, UKMember

    @Sheila

    Hi Sheila,

    no worries, my issue has been solved (I created another thread); the issue lied that in the calling step I did not tell HCaller to use the intersection of the two files specified with -L.

    Thank you :)

    Steph

  • jdenvirjdenvir Marshall UniversityMember

    Sorry to resurrect an old thread, but it seems to be the only reference to this error that I can find. (I can open a new thread with this question if that is more convenient/appropriate).

    I'm running CalculateGenotypePosteriors on a vcf generated from HaplotypeCaller and GenotypeGVCFs using v3.7.0. The main difference between my use case and the original question here is that I'm using GRCh38 as the reference genome. The resource bundle for that genome build does not include a 1000G_*.sites.vcf file, so I am using 1000G_phase1.snps.high_confidence.hg38.vcf as the supporting file. This appears to be a valid vcf v 4.1 file, so according to the previous discussion (AIUI) should work.

    I get the error

    Variant does not contain the same number of MLE allele counts as alternate alleles for record at chr1:1274721

    The relevant line (line broken for readability) from the input VCF is

    chr1    1274721 .   G   A   15.52   PASS\
    AC=2;AF=0.125;AN=16;DP=16;ExcessHet=0.1472;FS=0.000;MLEAC=1;MLEAF=0.063;MQ=60.00;POSITIVE_TRAIN_SITE;QD=7.76;SOR=0.693;VQSLOD=6.64;culprit=FS\
        GT:AD:DP:GQ:PL\
        ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0\
    1/1:0,2:2:6:49,6,0  ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0\
    0/0:1,0:1:3:0,3,29  0/0:1,0:1:3:0,3,33  ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0\
    ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:2,0:2:6:0,6,71\
    ./.:0,0:0:.:0,0,0   0/0:1,0:1:3:0,3,29  ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:1,0:1:3:0,3,29\
    ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:2,0:2:6:0,6,64\
    ./.:0,0:0:.:0,0,0   0/0:6,0:6:18:0,18,191   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0
    

    and from the supporting file is

    chr1    1274721 rs11260574  G   A,C 228.28  PASS\
    AC=15;AF=0.00884;AN=1696;BaseQRankSum=-3.700;DB;DP=4462;Dels=0.00;FS=10.553;HRun=1;HaplotypeScore=0.4438;InbreedingCoeff=-0.0325;MQ=81.11;MQ0=1;MQRankSum=-0.274;QD=4.10;ReadPosRankSum=-1.693;SB=-158.74;VQSLOD=5.9135;pop=ALL
    

    I confess to not understanding all of this - I'm not really clear if the error is referring to the input vcf or the supporting vcf, and I'm not too clear on how MLEAC is computed, but the input vcf looks correct/valid. Fairly obvious things I note here (I'm not sure these are helpful):

    1. The ALT list is different in the input vcf to the supporting vcf, which doesn't seem unreasonable (we simply don't observe the C allele in our samples)
    2. The MLEAC value in the input vcf is not equal to the AC value in the vcf. I don't think this is an issue.
    3. The AC field in the supporting vcf appears not to have the correct number of values, given the ALT list. I'm not sure if this is the problem (because the error message cites MLEAC, not AC) but it seems inconsistent.

    Since there doesn't seem to be a version of the 1000G_phase3_v4_20130502.sites.vcf available for h38, is my best option to filter the supporting file I am using to only biallelic variants? Or is there another issue entirely.

    Here's the top of the log file from the run:

    INFO  15:33:02,371 HelpFormatter - ---------------------------------------------------------------------------------- 
    INFO  15:33:02,374 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
    INFO  15:33:02,375 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
    INFO  15:33:02,375 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
    INFO  15:33:02,375 HelpFormatter - [Tue Aug 15 15:33:02 EDT 2017] Executing on Linux 3.10.0-514.26.2.el7.x86_64 amd64 
    INFO  15:33:02,375 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 
    INFO  15:33:02,378 HelpFormatter - Program Args: -T CalculateGenotypePosteriors -R /seqdata/Genomes/Homo_sapiens/broad/Homo_sapiens_assembly38.fasta -supporting /seqdata/Genomes/Homo_sapiens/broad/1000G_phase1.snps.high_confidence.hg38.vcf -V recalibrated_variants/snp_indel_recalibrated.vcf -o snp_indel_with_posterior.vcf 
    INFO  15:33:02,387 HelpFormatter - Executing as [email protected] on Linux 3.10.0-514.26.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14. 
    INFO  15:33:02,388 HelpFormatter - Date/Time: 2017/08/15 15:33:02 
    INFO  15:33:02,388 HelpFormatter - ---------------------------------------------------------------------------------- 
    

    Thanks for any help,
    Jim

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited August 2017

    Hi Jim (@jdenvir),

    As you describe in [3], the record in the supporting file has two ALT alleles (A and C) but only one value each for annotations, e.g. AC or AF. Which allele do the annotation values refer to?

    chr1    1274721 rs11260574  G   A,C 228.28  PASS\
    AC=15;AF=0.00884;AN=1696;BaseQRankSum=-3.700;DB;DP=4462;Dels=0.00;FS=10.553;HRun=1;HaplotypeScore=0.4438;InbreedingCoeff=-0.0325;MQ=81.11;MQ0=1;MQRankSum=-0.274;QD=4.10;ReadPosRankSum=-1.693;SB=-158.74;VQSLOD=5.9135;pop=ALL
    

    I think it's likely this is what is tripping up the tool, despite the error message itself not being on point. Can you test this by removing one of the alleles or by removing such records altogether and then running your command?

  • jdenvirjdenvir Marshall UniversityMember

    Hi @shlee,

    Thanks for the response.

    As a quick way to test, I simply removed the second ALT from the line in the supporting vcf corresponding to the error: i.e. I edited it and simply deleted ",C" from the 5th column.

    This resulted in the error being reported at a later line, which had a similar structure (two alt alleles but only a single AC annotation value.

    Which allele do the annotation values refer to?

    No idea! This is in a file I downloaded from your resource bundle. Specifically, it is at

    ftp://[email protected]/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz

    So the questions now are

    1. Is the above file incorrectly formatted? If so, is there a chance you can get it fixed?
    2. How should I move forward? Should I filter the above file to only include biallelic snps? Or should I use ftp://[email protected]/bundle/b37/1000G_phase3_v4_20130502.sites.vcf.gz (which, as I understand it, is built relative to hg19: can you verify that?) and use something like LiftOver convert to hg38 coordinates? Or is there another reasonably straightforward way of generating an equivalent file for hg38?

    Thanks again,
    Jim

    Issue · Github
    by shlee

    Issue Number
    2428
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @jdenvir,

    We've narrowed down the error then to what our tools would normally call "Array index out of bounds" errors. Thanks for testing that. I will ask the team where this resource came from as I do not know how it was derived.

    In the meanwhile, today I actually learned that HaplotypeCaller's (and GenotypeGVCF's) new QUAL model (invoked with -newQual or --useNewAFCalculator) does something similar to CalculateGenotypePosteriors in that it will allow iteration over the genotypes in a cohort and use evidence within the cohort towards the prior.

    I'm not familiar with CalculateGenotypePosteriors nor how it uses the resource file, and so I'll need to ask someone on the team to followup.

  • jdenvirjdenvir Marshall UniversityMember

    @shlee Makes sense: I speak Java pretty well and can make a good guess as to how you parse this :smile:. I'll look for further posts on this discussion in the next few days.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @jdenvir,

    1. Is the above file incorrectly formatted? If so, is there a chance you can get it fixed?

    You have shown, and I have confirmed, yes, the file 1000G_phase1.snps.high_confidence.hg38.vcf.gz is incorrectly formatted.

    1. How should I move forward? Should I filter the above file to only include biallelic snps? Or should I use ftp://[email protected]/bundle/b37/1000G_phase3_v4_20130502.sites.vcf.gz (which, as I understand it, is built relative to hg19: can you verify that?) and use something like LiftOver convert to hg38 coordinates? Or is there another reasonably straightforward way of generating an equivalent file for hg38?

    I've consulted with a teammate and they say that CalculateGenotypePosteriors only uses biallelic sites, so you should filter out the multi-allelic sites. You can do this with SelectVariants. I have asked and it seems the provenance of the file is unknown to our team.

    You are correct about bundle/b37/1000G_phase3_v4_20130502.sites.vcf.gz. It is for hg19/GRCh37/b37 reference assemblies.

    I hope removing multialleleic sites allows you to move forward in your research.

  • aschoenraschoenr Member

    I was having the same problem @jdenvir was having and, since there were no further updates here, I thought I'd post that what @shlee suggested seems to be working for me. I am using the hg38 reference genome and gatk4 in my pipeline, a newer combination that others who may come across this issue in the future will likely be using as well.

    As suggested, I just made a biallelic only version of 1000G_phase1.snps.high_confidence.hg38.vcf.gz using SelectVariants:

    gatk SelectVariants \
    -R Homo_sapiens_assembly38.fasta \
    -V 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    -O 1000G_phase1.snps.high_confidence.hg38_BIALLELIC_ONLY.vcf.gz \
    --restrict-alleles-to BIALLELIC

    Note: the restrict-alleles-to flag has slightly changed since GATK3.

    CalculateGenotypePosteriors is running using this right now, hopefully I don't run into any further issues. Super @shlee to the rescue again!

  • aschoenraschoenr Member

    Looks like I may have spoken too soon. CalculateGenotypePosteriors finished running, but I did get an error message at the end. I will paste the tail of it below. Just thought I'd add that it looks like this error occurs afterCalculateGenotypePosteriors finished processing the VCF. I used ValidateVariants on the vcf that was produced and it passed. Maybe I don't need to worry about this error?

    13:28:12.263 INFO ProgressMeter - chr19:21187477 18.3 6686000 365047.7
    13:28:22.360 INFO ProgressMeter - chr19:47660978 18.5 6751000 365240.7
    13:28:32.479 INFO ProgressMeter - chr20:10590772 18.7 6815000 365369.5
    13:28:42.552 INFO ProgressMeter - chr20:29882175 18.8 6886000 365882.8
    13:28:52.689 INFO ProgressMeter - chr20:53518457 19.0 6951000 366050.8
    13:29:02.730 INFO ProgressMeter - chr21:17380342 19.2 7023000 366611.2
    13:29:12.757 INFO ProgressMeter - chr21:38420519 19.3 7084000 366597.4
    13:29:22.775 INFO ProgressMeter - chr22:21609497 19.5 7161000 367407.6
    13:29:32.921 INFO ProgressMeter - chr22:46209207 19.7 7224000 367451.9
    13:29:35.815 INFO CalculateGenotypePosteriors - Shutting down engine
    [November 6, 2018 1:29:35 EST PM] org.broadinstitute.hellbender.tools.walkers.variantutils.CalculateGenotypePosteriors done. Elapsed time: 19.73 minutes.
    Runtime.totalMemory()=3229089792
    java.lang.IllegalStateException: Likelihoods not of correct size: expected 3, observed 2
    at org.broadinstitute.hellbender.tools.walkers.variantutils.PosteriorProbabilitiesUtils.calculatePosteriorProbs(PosteriorProbabilitiesUtils.java:126)
    at org.broadinstitute.hellbender.tools.walkers.variantutils.PosteriorProbabilitiesUtils.calculatePosteriorProbs(PosteriorProbabilitiesUtils.java:58)
    at org.broadinstitute.hellbender.tools.walkers.variantutils.CalculateGenotypePosteriors.apply(CalculateGenotypePosteriors.java:316)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:110)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:159)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:202)
    at org.broadinstitute.hellbender.Main.main(Main.java:288)

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @aschoenr,

    It would be great to get to the bottom of what is causing the error. Based on the error:

    Likelihoods not of correct size: expected 3, observed 2

    It appears that for the number of alleles present in the record, there are not enough likelihoods that represent the genotype possibilities. Thanks for checking your VCF validates with ValidateVariants. You should check that all of the VCF records you started with are represented in the output. Please let us know the CalculateGenotypePosteriors command you are running and also what version of GATK you are using and whether the latest release (v4.0.11.0) recapitulates the error (if you are not on the latest release already).

    To help those on our end who may look into this, the relevant code that triggers the error is at https://github.com/broadinstitute/gatk/blob/4.0.11.0/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/PosteriorProbabilitiesUtils.java#L215-L218.

  • aschoenraschoenr Member
    edited November 2018

    Hi @shlee ,

    The error I posted happened when I was using version 4.0.2.1. I have updated to version 4.0.11.0. Here is the command I am running:

    gatk CalculateGenotypePosteriors \
    -V input_dir/2003_57_recalibrated_variants.vcf \
    -O output_dir/2003_57_recalibratedVariants.postCGP.vcf \
    -R ref_dir/Homo_sapiens_assembly38.fasta \
    --supporting ref_dir/1000G_phase1.snps.high_confidence.hg38_BIALLELIC_ONLY.vcf.gz \
    -ped ped_dir/2003_57.ped

    and I got a similar error after processing:


    10:14:35.857 INFO ProgressMeter - chr21:35948075 58.8 7077000 120287.8
    10:14:45.922 INFO ProgressMeter - chr22:17917600 59.0 7151000 121200.0
    10:14:56.070 INFO ProgressMeter - chr22:36552356 59.2 7200000 121681.6
    10:15:03.081 INFO CalculateGenotypePosteriors - Shutting down engine
    [November 8, 2018 10:15:03 EST AM] org.broadinstitute.hellbender.tools.walkers.variantutils.CalculateGenotypePosteriors done. Elapsed time: 59.73 minutes.
    Runtime.totalMemory()=12106858496
    java.lang.IllegalStateException: Likelihoods not of correct size: expected 3, observed 2
    at org.broadinstitute.hellbender.tools.walkers.variantutils.PosteriorProbabilitiesUtils.calculatePosteriorProbs(PosteriorProbabilitiesUtils.java:217)
    at org.broadinstitute.hellbender.tools.walkers.variantutils.PosteriorProbabilitiesUtils.calculatePosteriorProbs(PosteriorProbabilitiesUtils.java:129)
    at org.broadinstitute.hellbender.tools.walkers.variantutils.CalculateGenotypePosteriors.apply(CalculateGenotypePosteriors.java:339)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:153)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:151)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)


    Please let me know if I can provide any other information. Do you think the data that is produced here is OK to proceed with for my analysis? Also, this is being run on a family of 5 individuals (2 parents, 3 children). I assume this is OK and my data doesn't need to be broken down into trios for the genotype refinement workflow?

  • aschoenraschoenr Member
    edited November 2018

    .

    Post edited by aschoenr on
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @aschoenr,

    From Article#4723, under the section Caveats, we see the following:

    Right now family priors can only be applied to biallelic variants and population priors can only be applied to SNPs. Family priors only work for trios.

    At the very bottom, we see confirmation in September of this year that these caveats still held true. There have been no changes to the CalculateGenotypePosteriors nor PosteriorProbabilitiesUtils code files since August (when parameter names were updated) and so it's a safe bet that nothing has changed for these tools.

    Can you see if when you run a trio consisting of parents plus one child, your CalculateGenotypePosteriors analysis completes successfully? If this is the case, seems a shame you have to do the additional preprocessing on your family of five dataset. If you are interested, we can put in a request for you for the tool to accept larger family sizes than trios.

  • manolismanolis Member ✭✭
    edited November 2018

    Hi,

    I'm interested to the same issue...

    1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf file is corrupted; I tried to create a new one without any success.

    As @aschoenr , I get the 1000G_phase1.snps.high_confidence.hg38.vcf.gz file and with SelectVariants coupled by the "--restrict-alleles-to BIALLELIC" option I created a new one = 1000G_phase1.snps.high_confidence_BIALLELIC_ONLY.hg38.vcf.gz

    I used it to the CGP code with my cohor/raw vcf post-VQSR:

    ${GATK4} CalculateGenotypePosteriors \
    -R ${hg38} \
    -V ${raw} \
    -O ${CGP} \
    -supporting 1000G_phase1.snps.high_confidence_BIALLELIC_ONLY.hg38.vcf.gz \
    --tmp-dir ${tmp}

    13:33:43.454 INFO  CalculateGenotypePosteriors - Done initializing engine
    13:33:43.480 INFO  CalculateGenotypePosteriors - No PED file passed or no *non-skipped* trios found in PED file. Skipping family priors.
    13:33:43.568 INFO  ProgressMeter - Starting traversal
    13:33:43.569 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    13:33:53.589 INFO  ProgressMeter -        chr1:19420105              0.2                  9000          53897.6
    13:34:04.025 INFO  ProgressMeter -        chr1:52055186              0.3                 19000          55729.4
    ...
    ...
    13:43:11.567 INFO  ProgressMeter -        chrX:38802863              9.5                582000          61479.1
    13:43:20.235 INFO  CalculateGenotypePosteriors - No variants filtered by: AllowAllVariantsVariantFilter
    13:43:20.236 INFO  ProgressMeter -       chrX:154440335              9.6                590663          61456.3
    13:43:20.236 INFO  ProgressMeter - Traversal complete. Processed 590663 total variants in 9.6 minutes.
    13:43:21.282 INFO  CalculateGenotypePosteriors - Shutting down engine
    [November 9, 2018 1:43:21 PM CET] org.broadinstitute.hellbender.tools.walkers.variantutils.CalculateGenotypePosteriors done. Elapsed time: 9.68 minutes.
    

    I runned in both files (input and output) the VariantFiltration with GQ<20 and the number of variants with GQ<20 change in the two files

    Total variants 590663; 105 samplesl; GATK 4.0.11.0

    grep -v "^#" raw_GQ.vcf | grep "GQless20" | wc -l
    566012
    
    grep -v "^#" raw_postCGP_GQ.vcf | grep "GQless20" | wc -l
    476292
    

    Please @shlee could you confirm me that everything is ok?

    Many thanks

  • aschoenraschoenr Member

    Hi @shlee,

    I trimmed down my VCF file to contain just a single trio using vcf-subset:

    vcf-subset -c 2003003,2003057,2003058 2003_57_recalibrated_variants.vcf > 2003058.vcf

    I then modified my pedigree file for this trio (2003058.ped):

    2003_57 2003003 0 0 2 0
    2003_57 2003057 0 0 1 0
    2003_57 2003058 2003057 2003003 2 0

    and then tried again:

    gatk CalculateGenotypePosteriors \
    -V 2003058.vcf \
    -O 2003058_recalibratedVariants.postCGP.vcf \
    -R Homo_sapiens_assembly38.fasta \
    --supporting 1000G_phase1.snps.high_confidence.hg38_BIALLELIC_ONLY.vcf.gz \
    -ped 2003058.ped

    and ended up with a very similar error:


    17:12:01.683 INFO ProgressMeter - chr21:26626688 35.4 7042000 199109.4
    17:12:11.759 INFO ProgressMeter - chr21:45128696 35.5 7096000 199688.1
    17:12:21.948 INFO ProgressMeter - chr22:26089046 35.7 7163000 200614.9
    17:12:31.990 INFO ProgressMeter - chr22:47134396 35.9 7217000 201184.2
    17:12:34.562 INFO CalculateGenotypePosteriors - Shutting down engine
    [November 8, 2018 5:12:34 EST PM] org.broadinstitute.hellbender.tools.walkers.variantutils.CalculateGenotypePosteriors done. Elapsed time: 35.96 minutes.
    Runtime.totalMemory()=5766643712
    java.lang.IllegalStateException: Likelihoods not of correct size: expected 3, observed 2
    at org.broadinstitute.hellbender.tools.walkers.variantutils.PosteriorProbabilitiesUtils.calculatePosteriorProbs(PosteriorProbabilitiesUtils.java:217)
    at org.broadinstitute.hellbender.tools.walkers.variantutils.PosteriorProbabilitiesUtils.calculatePosteriorProbs(PosteriorProbabilitiesUtils.java:129)
    at org.broadinstitute.hellbender.tools.walkers.variantutils.CalculateGenotypePosteriors.apply(CalculateGenotypePosteriors.java:339)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:153)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:151)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)


    Again, it looks like it completed before throwing this error. Is it possible that the resulting VCF is OK?
    Is it also possible that I need to filter my input VCF (2003058.vcf in this case) to include only biallelic snps as I had to do for the 1000G_phase1.snps.high_confidence file earlier? Just FYI, my input VCF file came directly out of the germline short variant discovery workflow.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @aschoenr,

    It seems @manolis and I can run CalculateGenotypePosteriors without error.

    I suspect something in the -V input is amiss. Is there is anything in your pipeline that you could think of that could be the cause of the mismatch in alleles and likelihoods? If not, then would you mind submitting a small subset of your data so we can take a look? Bug report instructions are above and to the left. Thanks.

  • aschoenraschoenr Member
    edited November 2018

    Hi @shlee,

    I think I know what is wrong. The error occurs on chrX. When using Haplotypecaller in the first stage of the germline short variant discovery workflow, I process each individual's chromosomes as separate jobs, meaning a for a family of 5 I would have at least 115 independent jobs. All of these jobs are the same except for the male sex chromosomes. Instead of running haplotype caller regularly on these, I would explicitly set the ploidy to one (-ploidy 1 argument). The reason I did this was because of the discussion here: https://gatkforums.broadinstitute.org/gatk/discussion/4639/x-chromosome-gentyping. My impression was that I should set ploidy=1 for males on chrX and chrY. Looking back it seems that there is no real position on this.

    Would this likely be the reason this is happening? Do you think it would be worthwhile going back to that point and excluding the ploidy argument altogether for all variant calling?

    EDIT: I just reran this on chr1-chr22 and it finished with no errors. The error is definitely in the chrX data and I doubt it would be introduced by anything other than what I talked about above. If you could advise on what you think I should do I would appreciate it. I assume I was overthinking this at the time and should not have even worried about the ploidy argument for HaplotypeCaller. I assume I should just remove that argument altogether and then this will likely be resolved.

    Post edited by aschoenr on
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @aschoenr,

    My impression was that I should set ploidy=1 for males on chrX and chrY.

    I think you will find Article#7857 helpful--especially the bit about Analysis set references where PAR regions in Y are masked.

    It seems in your tests you've confirmed the allosomal contigs are the regions causing the error.

Sign In or Register to comment.