Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Error in HaplotypeCaller with GENOTYPE_GIVEN_ALLELES

I've got a few different VCFs from various experiments and sources, and am trying to get population level information for a gene of interest. I merged the vcfs with vcftools and ran into the un-called hom-refs problem, for example among a small family study most alleles are hom-ref and I get no genotype calls. After merging , the cohorts are almost mutually exclusive regarding which alleles are seen. I would need to impute uncalled SNPs but with the read data available it makes more sense to call reference genotypes at given locations. So I'll use --genotyping_mode=GENOTYPE_GIVEN_ALLELES and give the merged VCF as a reference for the genomic positions I want called for everyone. This has worked for UnifiedGenotyper last year, possibly with a different alleles format.

I cropped the alleles VCF to only contain one individual, at 600 loci. For many sites any one individual has a no-call genotype, and I find the HaplotypeCaller is trying to parse the VCF and chokes on a no-call site. Log and error are attached. Java chokes trying to parseDouble for a period character.

If I clip off the genotypes columns (only interested in sites) with cut -f1-5, this retains the chr,pos,rs,ref,alt and I think it should work, but the GATK refuses to read it as an invalid VCF

ERROR MESSAGE: Your input file has a malformed header: there are not enough columns present in the header line:

I don't think the HC should need an individual's genotype quality for the GIVEN_ALLELES option, this should just be specifying loci, right? That's almost more worrying that it wants to parse the individuals from the GIVEN_ALLELES file, but I suspect that's just code-reuse and a default behavior for any VCF reader, since the error came out of org.broadinstitute.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:113)

The error comes when it gets to this no-call line, with ./. filled in by vcftools merge operation.

14      23857606        rs115570443     G       C       100     PASS    .       GT:GL:DS        0|0:0.00,-5.00,-5.00:0.000
14      23857626        .       C       A       51.84   .       .       GT:GQ:DP:PL:AD  .:.:.:.:.

For the moment I'm going to try to edit the specified alleles VCF to put mock genotypes, and hope it doesn't impact the results.

Best Answer

Answers

  • pdexheimerpdexheimer Member ✭✭✭✭

    Your reasoning is right for GGA mode - it does not care about the genotypes at all. But the VCF spec says that the first 8 columns are required to be present, it's columns 9 and on that encode the genotypes and are therefore optional. So if you modify your cut command, I think you'll be in great shape. And you're right, HC does not use genotypes in GGA mode.

    Your stack trace for the parseDouble exception indicates that it's failing when trying to parse a GL element, which is actually only present in your variant with the genotype in the example. It's kind of strange that your file is mixing PL and GL, I wonder if it's not some interaction between the two that's causing problems?

  • KStammKStamm Member

    Since the alleles file comes from several different tools, the FORMAT column is sometimes "GT:GL:DS" or "GT:GQ:DP:PL:AD" or even "GT:GL:GQ:DP:DS:PL:AD"
    When I copied an apparently functioning genotype call from one line to the other, this caused a GL of "-0.72,-0.09,-5.00" to be parsed as a GQ, which failed a different java.parseDouble(). Maybe the tool is not respecting the FORMAT column, and assuming theyre all the same as the first entry (since 99% of the VCFs in the world probably have a single value for the FORMAT columns, it's a realistic coding-shortcut).

    I think I have the problem sorted now by replacing all FORMAT values with "GT" and all genotype calls with a simple "0/1", this fools the validator into using my sites. Of course I won't know if it completes successfully for a few more hours.

    I know no one wants to support a third party tool, but I think merged VCFs could be a common source of GENOTYPE_GIVEN_ALLELES, and if it had known to only review columns 1-5 (or also FILTER) we could bypass this type of error in the future.

Sign In or Register to comment.