Large cohort VCFs in GATK4 - to combine or not ...

foxyjohn Member
edited October 2 in Ask the GATK team

Hi,

I've run somatic calling on a few thousand samples against a PoN. I'm now looking through the results and wondering how best to collate all these single-sample VCFs. Is there a tool like GenotypeGVCFs for VCFs? (CombineVariants is no longer available, and would presumably take too long anyway.)
If not, what would be a good strategy for pooling these result files into an analysis set? Is there an alternative analysis strategy to pooling?

Thanks.

Answers

  • foxyjohn Member
    Accepted Answer

    As an FYI,

    I'm using 'bcftools merge' to combine VCFs. All good.

    Thanks.

  • SkyWarrior Turkey Member ✭✭✭

    Can you confirm that the AF field is calculated properly for each site after bcftools merge?

    The last time I used bcftools merge for a job, the AF fields were not calculated properly, so I gave up using it.

  • foxyjohn Member

    Hi,

    Can you confirm what you mean by properly?

  • SkyWarrior Turkey Member ✭✭✭
    edited October 7

    I noticed that some bcftools commands do not recalculate AF after an operation such as merge or sample selection; they leave the original AF value as is. That's why I asked. For example, say you have a SNP with an AF of 0.005. If you select out the one sample carrying that SNP, the AF still stays at 0.005, even though it should be 0.5 for that single sample and 0 for the rest of the samples.

    Allele fraction is not calculated properly.

  • foxyjohn Member

    To me, though, the merge shouldn't be doing any calculation on the AFs. I just wanted to square off a bunch of VCFs into one file, as it were.
    What I think you're talking about is more like joint genotyping, where a combined AF is calculated across every sample in which a variant is found and reported as the called AF (a population AF). I can also see the merits of that, by the way.

  • SkyWarrior Turkey Member ✭✭✭

    Well, older tools such as CombineVariants from GATK3 recalculate AF even when VCFs are merged from multiple samples. That's why I mentioned this.
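
The AF discrepancy discussed above can be illustrated with a small sketch. It assumes AF means the site-level allele frequency derived from called genotypes (AC / AN) for a single ALT allele; the function name `allele_stats` and the 100-sample cohort are hypothetical and chosen only to reproduce the 0.005 / 0.5 / 0 numbers from the example in the thread. It is not how bcftools or GATK compute the field internally.

```python
def allele_stats(genotypes):
    """Return (AC, AN, AF) for ALT allele 1 from GT strings such as '0/1'.

    Hypothetical helper for illustration: missing alleles ('.') are skipped,
    and phased ('|') and unphased ('/') genotypes are treated the same.
    """
    ac = 0  # count of ALT allele 1
    an = 0  # count of all called alleles
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            an += 1
            if allele == "1":
                ac += 1
    af = ac / an if an else 0.0
    return ac, an, af


# Assumed cohort: one heterozygous carrier among 100 diploid samples.
cohort = ["0/1"] + ["0/0"] * 99

print(allele_stats(cohort))      # whole cohort:       (1, 200, 0.005)
print(allele_stats(cohort[:1]))  # carrier only:       (1, 2, 0.5)
print(allele_stats(cohort[1:]))  # remaining samples:  (0, 198, 0.0)
```

If a tool copies the INFO value through unchanged, the subset files would still report 0.005, which is the behaviour described above. bcftools ships a fill-tags plugin that is meant for recomputing tags of this kind after merging or subsetting; check its documentation for the exact options before relying on it.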
