
Large cohort VCFs in GATK4 - to combine or not ...

foxyjohn Member
edited October 2019 in Ask the GATK team

Hi,

I've run somatic variant calling on a few thousand samples against a PoN. I'm now looking through the results and wondering how best to collate all these individual VCFs. Is there a tool like GenotypeGVCFs for VCFs? (CombineVariants is no longer available, and would presumably take too long anyway.)
If not, what would be a good strategy for pooling these result files into an analysis set? Is there an alternative analysis strategy to pooling?

Thanks.

Answers

  • foxyjohn Member
    Accepted Answer

    As an FYI,

    I'm using 'bcftools merge' to combine VCFs. All good.

    Thanks.

  • SkyWarrior Turkey Member ✭✭✭

    Can you confirm that the AF field is calculated properly for each site after bcftools merge?

    The last time I used bcftools merge for a job, the AF fields were not calculated properly, so I gave up using it.

  • foxyjohn Member

    Hi,

    Can you confirm what you mean by properly?

  • SkyWarrior Turkey Member ✭✭✭
    edited October 2019

    I noticed that some of the bcftools commands do not recalculate AF after a VCF is processed, for example by a merge or a sample-selection operation; they leave the original AF value as is. That's why I asked. For example, say you have a SNP with an AF of 0.005. Once you select out the one sample carrying that SNP, the AF still stays at 0.005, even though it should be 0 for the rest of the samples and 0.5 for that single sample.

    Allele fraction is not calculated properly.

  • foxyjohn Member

    To me, though, the merge shouldn't be doing any calculation on the AFs. I just wanted to square off a bunch of VCFs into one file, as it were.
    What I think you're talking about is more like joint genotyping, where you calculate a combined AF across every sample in which a variant is found and report that as the called AF (a population AF), which I can also see the merits of, by the way.

  • SkyWarrior Turkey Member ✭✭✭

    Well, older tools such as CombineVariants from GATK3 recalculate AF even when VCFs are merged from multiple samples. That's why I mentioned this.
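
To make the arithmetic in SkyWarrior's example concrete, here is a minimal Python sketch of what "recalculating AF" means when AF is treated as a site-level allele frequency (alternate allele count divided by total called alleles). The cohort of 100 diploid samples, the genotype strings, and the function names are all assumptions made for illustration; this is not how bcftools or GATK implement it.

```python
from typing import Iterable, Tuple


def allele_counts(genotypes: Iterable[str]) -> Tuple[int, int]:
    """Return (AC, AN) for one biallelic site from GT strings such as '0/1'."""
    ac = an = 0
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing allele: counts toward neither AC nor AN
            an += 1
            if allele != "0":
                ac += 1
    return ac, an


def allele_frequency(genotypes: Iterable[str]) -> float:
    """AF = AC / AN, or NaN when no alleles were called."""
    ac, an = allele_counts(genotypes)
    return ac / an if an else float("nan")


# Hypothetical cohort: 100 diploid samples, one heterozygous carrier.
cohort = ["0/1"] + ["0/0"] * 99
carrier_only = cohort[:1]
without_carrier = cohort[1:]

print(allele_frequency(cohort))           # 0.005
print(allele_frequency(carrier_only))     # 0.5
print(allele_frequency(without_carrier))  # 0.0
```

A tool that only copies the AF value through a subset or merge would keep reporting 0.005 in all three cases, which is the behaviour described above. bcftools also ships a fill-tags plugin that is meant to recompute tags such as AC, AN, and AF from the genotypes; check its documentation for your version before relying on it.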
