format fields and sample entries in VCF files

eflynn90 (Washington DC, Member)

I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCFs can generate files in which some sample entries have fewer fields than the FORMAT column declares.

For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCFs:
1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 ..............

The second-to-last sample entry shown is ./.:.:1 (3 fields), while the FORMAT entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the programs I use will not parse VCF files when 1 < # sample fields < # FORMAT fields. If sample entries were extended by ":." for every missing trailing FORMAT field (unless only . or ./. was present in the sample column), the file would be parsable.

It's not a hard fix to do manually, but it would be nice to have this functionality in the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields.
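For anyone who wants to script the manual fix, here is a minimal Python sketch of the padding rule described above. It is not a GATK feature; the function name `pad_sample_fields` is made up for illustration, and it assumes an uncompressed, tab-delimited VCF with FORMAT in column 9. As far as I know the VCF specification does allow trailing sample fields to be dropped, so this only helps parsers that do not accept that.

```python
def pad_sample_fields(line: str) -> str:
    """Pad truncated sample entries in one VCF line with ':.'.

    Hypothetical helper (not part of GATK). Assumes tab-delimited columns
    with FORMAT in column 9 and samples from column 10 onward.
    """
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:
        return line  # no FORMAT/sample columns to fix
    n_format = len(cols[8].split(":"))
    for i in range(9, len(cols)):
        n_sample = len(cols[i].split(":"))
        # Entries with a single field ('.' or './.') are left alone,
        # matching the rule proposed in the post.
        if 1 < n_sample < n_format:
            cols[i] += ":." * (n_format - n_sample)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline


# Shortened version of the record from the post (INFO cut to one key).
record = (
    "1\t15820\trs200482301\tG\tT\t5909.59\tVQSRTrancheSNP99.90to100.00\t"
    "AC=21\tGT:AD:DP:GQ:PL\t0/0:.:40:66:0,66,990\t./.:.:1\t./.\n"
)
print(pad_sample_fields(record).split("\t")[10])  # ./.:.:1:.:.
```

To fix a whole file, call the function on every line while copying the input to a new file. Compressed (.vcf.gz) input would need to be opened with `gzip.open` in text mode first.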

Answers

  • eflynn90 (Washington DC, Member)

    Thanks! Please let me know if it's added as an option.

  • munzmatt (Germany, Member)
    edited February 4

    I am using GATK 3.8 and I just noticed that empty trailing fields are still dropped. Because of this, I got an error with verifyBamID. Is there an option to disable the dropping?

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    Hello @munzmatt

    There is a tool called VariantsToTable that can preserve empty fields for downstream analysis.

    Also, I am not sure I understand how your verifyBamID analysis was affected by the empty VCF fields. If you post an example, we could help troubleshoot the issue.

    In the meantime, you could use this tool to recreate the columns for the empty fields.

  • munzmatt (Germany, Member)

    Dear @AdelaideR,

    Thanks for the tool. I will try to recover the empty fields in my VCF with it.

    Regarding verifyBamID: I have opened an issue on the respective GitHub repository (https://github.com/Griffan/VerifyBamID/issues/13), and it seems that verifyBamID just cannot handle dropped trailing fields yet. But the authors want to fix it.

  • munzmatt (Germany, Member)

    Dear @AdelaideR,
    VariantsToTable can't produce output files in VCF format, so I can't use it to add missing fields to a VCF. Please correct me if I am wrong.

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    Hi @munzmatt

    Another user on the forum recommended using VCF-Simplify for this purpose.

    The original discussion can be found here
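
When posting an example for troubleshooting, as requested earlier in the thread, it helps to know exactly which records and samples have dropped trailing fields. The following Python sketch lists them. The function name `find_truncated_entries` and the sample names `S1`/`S2` are made up for illustration, and the code assumes an uncompressed, tab-delimited VCF. Note that it also reports single-field entries such as ./., which the original post treats as harmless.

```python
from typing import Iterable, Iterator, Tuple


def find_truncated_entries(
    lines: Iterable[str],
) -> Iterator[Tuple[str, str, str, int, int]]:
    """Yield (CHROM, POS, sample, n_sample_fields, n_format_fields) for
    every sample entry that has fewer fields than the FORMAT column.

    Hypothetical helper; assumes tab-delimited VCF lines.
    """
    sample_names = []
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            sample_names = cols[9:]  # sample IDs follow the FORMAT column
            continue
        if len(cols) < 10:
            continue  # record without FORMAT/sample columns
        n_format = len(cols[8].split(":"))
        for idx, entry in enumerate(cols[9:]):
            n_sample = len(entry.split(":"))
            if n_sample < n_format:
                # Fall back to a positional label if no header was seen.
                name = sample_names[idx] if idx < len(sample_names) else f"sample_{idx}"
                yield cols[0], cols[1], name, n_sample, n_format


# Tiny in-memory example; S1 and S2 are placeholder sample names.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t15820\trs200482301\tG\tT\t5909.59\tVQSRTrancheSNP99.90to100.00\t"
    "AC=21\tGT:AD:DP:GQ:PL\t0/0:.:40:66:0,66,990\t./.:.:1\n",
]
for hit in find_truncated_entries(example):
    print(hit)  # ('1', '15820', 'S2', 3, 5)
```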
