ERROR MESSAGE: The provided VCF file is malformed

tiffy (Austria; Member)

Hello,

I have whole genome data for around 500 individuals for which I am running the GATK variant calling pipeline. I have run the exact same pipeline before on a much smaller data set and didn't experience any problems. Thus I am not sure what causes the problem described below and I hope you can provide me with some help on this issue.

Here is a short description of what I have done:

After all the data pre-processing steps, I have run GATK's HaplotypeCaller on each sample's bam-file using the following command:

"java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam --genotyping_mode DISCOVERY
--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o sample.gvcf"

As suggested in the Best Practice Guide, I combined multiple gvcf-files using the following command:

"java -Xmx50g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R reference.fasta --variant sample1.gvcf --variant sample2.gvcf (...) -o combined.gvcf".

This step runs through without any problem for all my samples but when I am trying to genotype them using

"java -Xmx100g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant combined1.gvcf --variant combined2.gvcf (...)
-o variants.vcf"

GATK throws the following error message: "ERROR MESSAGE: The provided VCF file is malformed at approximately line number 86995422: ./.:0:0:0:0,0,0 is not a valid start position in the VCF format"

Thank you in advance for your help!

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @tiffy

    Hi,

    What version of GATK are you using?

    Can you please try running GenotypeGVCFs on only 1 of the combined GVCF files and see if it runs?

    Thanks,
    Sheila

  • tiffy (Austria; Member)

    Hi Sheila,
    I should have mentioned this in my earlier post: I am running the latest GATK version (3.3-0), and I have already tried running GenotypeGVCFs on a single one of the combined gvcf-files. I get the exact same error message, that ./.:0:0:0:0,0,0 is not a valid start position, just at a different genomic position.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @tiffy

    Hi,

    Thanks. This may be an issue that has been fixed in the nightly builds. The issue was in CombineGVCFs. Can you please try with the latest nightly build and let us know if the error still occurs? https://www.broadinstitute.org/gatk/nightly

    -Sheila

  • tiffy (Austria; Member)

    Ok, thank you for your fast response.
    I will try re-running it and let you know how it goes.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @tiffy

    I am sorry for the last-minute update, but if you have not already run with the latest nightly build, you can try running again with GATK version 3.3. This may actually be a file system error and not an error that is specific to the nightly build fix.

    I realize it may be easier for you to continue using a stable release of GATK.

    -Sheila

  • tiffy (Austria; Member)

    Hi Sheila,

    when trying to combine my gvcf-files using CombineGVCFs (GATK version nightly-2015-01-14-g22d6966), some runs finish without any problems, while others abort with an error message like this (although all input gvcf-files have been produced using the exact same pipeline):

    "ERROR MESSAGE: Line 2170707: there aren't enough columns for line T . . END=1359729 GT:DP:GQ:MIN_DP:PL 0/0:3:6:3:0,6,90 (we expected 9 tokens, and saw 7 ), for input source: RB221_rmdup.gvcf"

    When I had a closer look at file RB221_rmdup.gvcf, the line GATK's error message is referring to has as many entries as any other line:
    NW_006501363.1 1359690 . T . . END=1359705 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,140
    NW_006501363.1 1359706 . T . . END=1359722 GT:DP:GQ:MIN_DP:PL 0/0:4:9:3:0,9,95
    NW_006501363.1 1359723 . T . . END=1359729 GT:DP:GQ:MIN_DP:PL 0/0:3:6:3:0,6,90
    NW_006501363.1 1359730 . G . . END=1359741 GT:DP:GQ:MIN_DP:PL 0/0:2:3:1:0,3,33
    NW_006501363.1 1359742 . C . . END=1359742 GT:DP:GQ:MIN_DP:PL 0/0:1:2:1:0,3,22

    Do you have any idea what might go wrong here?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    The fact that the line quoted in the error message seems to start at the ref allele instead of the contig identifier makes me think it's a whitespace formatting issue, though I have no idea why that would happen.

    I notice you're giving your files a .gvcf extension -- not sure this has anything to do with your issue, but you should use .vcf instead (or .g.vcf if you want to identify them as gvcf for yourself). Some file handling and parsing decisions are made internally based on extension names, and GATK expects a .vcf extension even for GVCFs (because they are just a subspecies of VCF).

  • tiffy (Austria; Member)

    Hi Geraldine,

    thank you for your reply.

    I have tried renaming my files using the .g.vcf extension but the issue remained. However, after I reran HaplotypeCaller on the problematic files, CombineGVCFs ran through without any problems. I have experienced this issue with several of my .g.vcf-files (some can be combined, some can't), but I am not sure what's causing the difference as I am using the exact same parameters for all runs...

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @tiffy
    Hi,

    After rerunning HaplotypeCaller on the problematic files, are there any remaining problematic files, or is the issue resolved completely?

    Thanks,
    Sheila

  • tiffy (Austria; Member)

    Hi Sheila,
    for some batches of files that I am combining, the issue is resolved entirely by rerunning HaplotypeCaller on the problematic files; for others I get the same error message for a different file (still not sure what's causing the issue though).

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    If it's a recurring but non-deterministic issue, it could be that your filesystem is glitching and causing write errors to the files.

  • tiffy (Austria; Member)

    Yes, that might be the case, as I reran the problematic jobs and this time everything worked well. Thank you.
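
For anyone hitting the same intermittent errors: since the failures above point to individual lines being damaged on write, it may help to scan each per-sample GVCF before combining and rerun HaplotypeCaller only on the files that fail. The following is a minimal sketch, not part of GATK. The function name `find_malformed_lines` and its `n_samples` parameter are made up for illustration, and it assumes the standard tab-separated VCF layout (8 fixed columns, a FORMAT column, and one column per sample). It only checks the column count and that POS is an integer, so it will not catch every kind of corruption.

```python
from typing import Iterable, List, Tuple


def find_malformed_lines(lines: Iterable[str], n_samples: int = 1) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for data lines that do not look like VCF records.

    Hypothetical helper: assumes tab-separated VCF text with
    8 fixed columns + FORMAT + one column per sample.
    """
    expected = 9 + n_samples
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) != expected:
            problems.append((number, f"expected {expected} columns, saw {len(fields)}"))
        elif not fields[1].isdigit():
            # A record that starts mid-line puts a non-numeric token in the POS column
            problems.append((number, f"POS is not an integer: {fields[1]!r}"))
    return problems
```

To use it, open a GVCF as text (for example `with open("sample.g.vcf") as handle:`) and pass the handle to `find_malformed_lines`; an empty result means no line failed these two checks. A line that starts at the REF column, like the one quoted in the CombineGVCFs error above, would be reported because it has too few columns.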
