Combining separately joint-called VCFs

Hello,

I have read through the guides and man pages I could find here, but I am still a bit confused. I have two joint-called VCFs produced with the same GATK 3.7 pipeline, one with 3000 samples and one with 1000 samples. Am I able to combine those VCFs, or is it wiser to re-joint-call the 4000 samples together?

https://gatkforums.broadinstitute.org/gatk/discussion/53/combining-variants-from-different-files-into-one

This page mentions (as an aside) joint calling in batches of 200 samples, and then combining the results. However, it does not mention how that combining would occur; the three combining methods it describes are for cases different from this one.

https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php

It seems like this tool is technically capable of merging VCFs, as are other non-GATK tools. However, I believe that merging VCFs is generally hard, with many edge cases, missing data, and so on. That is, after all, the reason for the gVCF workflow. I think the output of merging with that tool would be markedly different from a single joint-called VCF.

https://gatkforums.broadinstitute.org/gatk/discussion/23201/merging-population-vcf-files-without-gvcf

In this question you recommended not to attempt to merge vcfs, but this seems to conflict with the first link above.

https://software.broadinstitute.org/gatk/documentation/article?id=11019

This page does not mention the batching at all, I think because GenomicsDB and GATK4 are expected to scale better with more samples.

I hope you can clear up my confusion.

Thanks!

Answers

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @Evan_Benn

    "This page mentions (as an aside) joint calling in batches of 200 samples, and then combining the results."

    That is incorrect. It says: "We recommend combining the output gVCF in batches of e.g. 200 before putting them through joint genotyping with GenotypeGVCFs."

    "Am I able to combine those VCFs, or is it wiser to re-joint call the 4000 samples together?"

    You should do joint genotyping on all samples together.

  • Evan_Benn (Member)

    Thank you for your help Bhanu, I am still confused by that sentence though!

    I had been taking it to mean "call GenotypeGVCFs on batches of 200 gVCFs".

    My only other guess is that it means "create batches of 200 gVCFs and then call GenotypeGVCFs on all the gVCFs". But then what are the batches for?

    Thanks!

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)
    edited April 9

    Hi @Evan_Benn

    What it means is:
    1) Run HaplotypeCaller in batches of 200 to output gVCFs
    2) Combine the gVCFs generated in 1)
    3) Then run GenotypeGVCFs on the combined single gVCF

    Does that help clarify?

  • Evan_Benn (Member)

    Hi Bhanu,

    I had thought HaplotypeCaller was run on single BAMs, so I am very confused!

    1) 200 bams -> HaplotypeCaller -> 1 GVCF(200 samples)

    2) 1 GVCF(200 samples) -> ???

    3) 1 GVCF(200 samples) -> GenotypeGVCFs -> 1 VCF(200 samples)

    I am sure I am missing something. Thanks

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @Evan_Benn

    You are right, the wording in the doc is confusing. Let me confirm with the author and get back to you shortly.

  • Evan_Benn (Member)

    Thank you very much Bhanu for your thorough work here
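
The thread ends before the wording is confirmed with the doc's author, so the following is only a sketch of one common reading of the quoted sentence, not a confirmed answer: run HaplotypeCaller in GVCF mode once per BAM, merge the per-sample gVCFs with CombineGVCFs in batches of about 200, then run GenotypeGVCFs once over all the batch gVCFs. The script only builds GATK 3.x-style command lines and does not execute them. The jar path, reference name, file names, and the `plan` helper are hypothetical, and the flags should be checked against the tool docs for your GATK version.

```python
GATK_JAR = "GenomeAnalysisTK.jar"  # assumption: path to a GATK 3.x jar
REFERENCE = "reference.fasta"      # hypothetical reference file name


def chunk(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def haplotypecaller_cmd(bam):
    """One HaplotypeCaller run per BAM, emitting a per-sample gVCF."""
    gvcf = bam.rsplit(".", 1)[0] + ".g.vcf"
    return ["java", "-jar", GATK_JAR, "-T", "HaplotypeCaller",
            "-R", REFERENCE, "-I", bam,
            "--emitRefConfidence", "GVCF", "-o", gvcf]


def _multi_variant_cmd(tool, gvcfs, out):
    """Shared shape of CombineGVCFs and GenotypeGVCFs invocations."""
    cmd = ["java", "-jar", GATK_JAR, "-T", tool, "-R", REFERENCE]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    return cmd + ["-o", out]


def plan(bams, batch_size=200):
    """Return the full list of commands for the batched gVCF workflow."""
    cmds = [haplotypecaller_cmd(bam) for bam in bams]
    sample_gvcfs = [cmd[-1] for cmd in cmds]  # last element is the -o value

    batch_gvcfs = []
    for n, batch in enumerate(chunk(sample_gvcfs, batch_size), start=1):
        out = f"batch_{n:03d}.g.vcf"
        cmds.append(_multi_variant_cmd("CombineGVCFs", batch, out))
        batch_gvcfs.append(out)

    # A single joint-genotyping step over every batch, i.e. all samples.
    cmds.append(_multi_variant_cmd("GenotypeGVCFs", batch_gvcfs, "cohort.vcf"))
    return cmds


if __name__ == "__main__":
    bams = [f"sample_{i:04d}.bam" for i in range(1, 4001)]
    commands = plan(bams)
    print(len(commands), "commands planned")
    print(" ".join(commands[0]))
```

Under this reading the batches exist only to keep the number of inputs to GenotypeGVCFs manageable; all 4000 samples are still genotyped together in the final step, which matches the advice above to do joint genotyping on all samples together.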
