Best practices for joint genotyping of a very large sample size

We will soon be variant calling 25k+ WGS samples. We want to adopt the joint genotyping pipeline provided at https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.wdl, together with its inputs file at https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.hg38.wgs.inputs.json.

Two questions:
1. This pipeline uses GVCF files, and a GVCF is much larger than a VCF. I am not sure it is practical to have GVCFs for 25k+ samples.
2. Memory usage is another concern. Can we joint genotype 25k+ WGS samples at once?

That said, I am wondering whether we could divide the 25k+ samples into smaller groups (e.g. 1000 samples per group) and do joint genotyping group by group, without compromising variant calling quality too much. Dividing the cohort should save space, memory, and time.

BTW, where can I find the GVCFs such as "/home/bshifaw/data/joint_discovery/NA12878.g.vcf.gz" and "/home/bshifaw/data/joint_discovery/NA12878.g.vcf.gz.tbi"?

Thanks.

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi @moxu, sorry for the lag. I'll try to answer your questions:

    1. To clarify, the GVCF file here is just an intermediate used in the joint calling pipeline; the final output is a normal multisample VCF file (although even that will be very large, and you may prefer to make per-chromosome final VCFs for practical reasons). And to be clear, it is not possible to do joint calling on 25K samples in a principled way without using the GVCF-based workflow.

    2. Yes, and we have in fact done larger cohorts than that -- but you must use the version of the workflow that uses GenomicsDB, and there are some technical tweaks that can help. Have a look at our reference implementations for guidance.

    We do not recommend doing joint genotyping in batches because that will introduce batch effects that will confound any downstream analyses. And it should not be necessary if you're using the workflows we recommend. As I mentioned, we have run on larger cohorts than 25K.
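
To make the GenomicsDB-based approach more concrete, here is a minimal Python sketch that only builds the command lines; it does not run GATK. It writes a tab-separated sample map and assembles one GenomicsDBImport and one GenotypeGVCFs command per interval, so that every sample is genotyped together and the work is split by genomic region rather than by sample group. The helper names, directory layout, and the batch size of 50 are illustrative assumptions, not taken from the reference workflows; check the tool documentation for your GATK version before relying on any option.

```python
from pathlib import Path


def write_sample_map(gvcfs, map_path):
    """Write one 'sample_name<TAB>gvcf_path' line per sample; return the sample count."""
    lines = [f"{name}\t{path}" for name, path in sorted(gvcfs.items())]
    Path(map_path).write_text("\n".join(lines) + "\n")
    return len(lines)


def import_command(interval, sample_map, workspace_root):
    """Build a GenomicsDBImport command that loads all samples for one interval."""
    return [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", f"{workspace_root}/{interval}",
        "--sample-name-map", str(sample_map),
        "--batch-size", "50",  # assumed value: limits how many GVCFs are read at once
        "-L", interval,
    ]


def genotype_command(interval, reference, workspace_root, out_dir):
    """Build a GenotypeGVCFs command that reads one GenomicsDB workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace_root}/{interval}",
        "-L", interval,
        "-O", f"{out_dir}/{interval}.vcf.gz",  # one final VCF per interval
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for interval in ["chr20", "chr21"]:
        print(" ".join(import_command(interval, "sample_map.tsv", "genomicsdb")))
        print(" ".join(genotype_command(interval, "ref.fasta", "genomicsdb", "vcfs")))
```

Splitting by interval in this way also yields the per-chromosome final VCFs mentioned above, while each output still contains the full cohort.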

    Regarding the files, see the inputs JSON for the non-local version of that pipeline here: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.hg38.wgs.inputs.json. It contains the equivalent files, with locations in a publicly accessible Google Cloud bucket.
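
If it helps to locate those bucket paths, the following is a small Python sketch that walks a parsed inputs JSON and collects every string beginning with `gs://`. The keys and bucket names in the example are hypothetical stand-ins, not the contents of the real inputs file; in practice you would load the downloaded JSON with `json.load` instead.

```python
import json


def find_bucket_paths(value, prefix="gs://"):
    """Recursively collect every string under `value` that starts with `prefix`."""
    if isinstance(value, str):
        return [value] if value.startswith(prefix) else []
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        return [p for item in value for p in find_bucket_paths(item, prefix)]
    return []  # numbers, booleans, and null carry no paths


# Hypothetical stand-in for a workflow inputs JSON (keys and paths are invented).
example_inputs = json.loads("""
{
  "Workflow.sample_name_map": "gs://example-bucket/sample_map.txt",
  "Workflow.ref_fasta": "gs://example-bucket/ref.fasta",
  "Workflow.batch_size": 50,
  "Workflow.resources": ["gs://example-bucket/dbsnp.vcf.gz", "/local/file.vcf"]
}
""")

for path in find_bucket_paths(example_inputs):
    print(path)
```

Paths found this way can then be copied locally with a Google Cloud Storage client such as `gsutil cp`.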
