Variant calling for 2000 exome samples (multi-sample)

Hi,
I am trying to do multi-sample SNP calling for over 2000 exome samples.

I have a few questions about setting up the following workflow:

  1. Run HaplotypeCaller on each individual sample to generate a gVCF (with -L set to the whole exome)
  2. Run CombineGVCFs on lists of 50 samples each (so 40 combined gVCFs, since 2000 / 50 = 40)
  3. Run GenotypeGVCFs on the 40 combined gVCFs (with -L set to the whole exome)

Could anyone comment on the above workflow? Thanks.
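
The batching in step 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample, reference, and output file names are hypothetical, the jar name `GenomeAnalysisTK.jar` and the `-T ... -R ... --variant ... -o ...` layout assume a GATK 3.x command line, and the script only builds argument lists without running anything.

```python
from typing import List


def make_batches(paths: List[str], batch_size: int) -> List[List[str]]:
    """Split paths into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def combine_gvcfs_command(batch: List[str], reference: str, output: str) -> List[str]:
    """Build a GATK 3-style CombineGVCFs argument list for one batch."""
    cmd = ["java", "-jar", "GenomeAnalysisTK.jar",
           "-T", "CombineGVCFs", "-R", reference]
    for gvcf in batch:
        cmd += ["--variant", gvcf]  # one --variant per input gVCF
    cmd += ["-o", output]
    return cmd


# Hypothetical per-sample gVCF names produced by step 1.
gvcfs = [f"sample_{i:04d}.g.vcf" for i in range(1, 2001)]

batches = make_batches(gvcfs, 50)
commands = [
    combine_gvcfs_command(batch, "reference.fasta", f"batch_{n:02d}.g.vcf")
    for n, batch in enumerate(batches, start=1)
]

print(len(batches))  # number of CombineGVCFs runs needed
```

Each entry in `commands` could then be handed to a job scheduler or to `subprocess.run`, and the 40 resulting batch files would be the inputs to step 3.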

Answers

  • Sheila, Broad Institute (Member, Broadie admin)

    @Lavanya
    Hi,

    Your workflow looks fine. A few points to note:
    1) You can run CombineGVCFs on up to 200 GVCFs at a time, but 50 is fine too.
    2) You do not need to use -L after HaplotypeCaller, as the intervals are output in the vcf. http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals#latest

    -Sheila

  • Lavanya (Member)

    Hi Sheila,

    Thanks for your response. I have a few questions regarding parameters:
    1. -contamination 0.0: What is the impact of using this parameter in the HaplotypeCaller step?
    2. max_alternate_alleles 3: What is the impact when this parameter is changed from the default of 6 to 3? Does it have any impact on calling SNPs and INDELs?
    3. The final VCF from GenotypeGVCFs contains both SNPs and INDELs (with max_alternate_alleles 3). Is there any impact when the default value is changed from 6 to 3?
    4. UnifiedGenotyper has a parameter "-glm BOTH" to call both SNPs and INDELs. Is an explicit flag necessary for the GenotypeGVCFs step? I just want to confirm, as I would like to call both SNPs and INDELs.
    5. I have used --variant_index_type LINEAR and --variant_index_parameter 128000 when calling HaplotypeCaller, as suggested in the documentation (https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--max_alternate_alleles).
    What is the impact of these parameters on the results?
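
For reference, the per-sample HaplotypeCaller call being asked about might be assembled like the sketch below. It assumes a GATK 3.x command line; the file names and the helper `haplotypecaller_gvcf_command` are hypothetical, and the indexing arguments are kept behind a switch so they can be left out on builds that do not need them.

```python
from typing import List


def haplotypecaller_gvcf_command(
    bam: str,
    reference: str,
    intervals: str,
    output: str,
    max_alt_alleles: int = 6,       # default value mentioned in this thread
    contamination: float = 0.0,
    legacy_index_args: bool = False,
) -> List[str]:
    """Build a GATK 3-style HaplotypeCaller argument list in gVCF mode."""
    cmd = [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", intervals,
        "--emitRefConfidence", "GVCF",
        "-contamination", str(contamination),
        "--max_alternate_alleles", str(max_alt_alleles),
        "-o", output,
    ]
    if legacy_index_args:
        # Indexing arguments from the question; optional in this sketch.
        cmd += ["--variant_index_type", "LINEAR",
                "--variant_index_parameter", "128000"]
    return cmd


cmd = haplotypecaller_gvcf_command(
    "sample_0001.bam", "reference.fasta", "exome.intervals",
    "sample_0001.g.vcf", max_alt_alleles=3, legacy_index_args=True,
)
print(" ".join(cmd))
```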

    Thanks

  • Sheila, Broad Institute (Member, Broadie admin)

    @Lavanya
    Hi,

    1) You only need to set this to a value greater than 0 if you know there is contamination in your dataset. If there is no contamination, a nonzero value will cause biased downsampling.

    2) If a site has more than 3 potential alternate alleles, they will not all be reported. Only the three most likely alternate alleles will be reported.

    3) There is no impact if you already set HaplotypeCaller's max_alternate_alleles to 3. But if you left HaplotypeCaller's max_alternate_alleles at 6 and set GenotypeGVCFs' max_alternate_alleles to 3, only the top 3 alternate alleles will be reported in the final VCF.

    4) There is no need for a special flag, as HaplotypeCaller calls both INDELs and SNPs together.

    5) Those are simply parameters related to indexing. You needed to add them in version 3.3, but in the latest builds you no longer need to.
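
The effect of the cap described in points 2 and 3 can be pictured with a toy Python example. This is not GATK's internal code: the allele scores are made up, and the function `keep_top_alt_alleles` is a hypothetical stand-in that simply ranks alternate alleles by a likelihood-like score and keeps the best N.

```python
from typing import Dict, List


def keep_top_alt_alleles(alt_scores: Dict[str, float],
                         max_alternate_alleles: int) -> List[str]:
    """Return the highest-scoring alternate alleles, best first."""
    ranked = sorted(alt_scores, key=alt_scores.get, reverse=True)
    return ranked[:max_alternate_alleles]


# Made-up scores for five candidate alternate alleles at one site.
site = {"A": 0.40, "AT": 0.25, "G": 0.20, "ATT": 0.10, "C": 0.05}

print(keep_top_alt_alleles(site, 3))  # ['A', 'AT', 'G']
print(keep_top_alt_alleles(site, 6))  # ['A', 'AT', 'G', 'ATT', 'C']
```

With a cap of 3, the two lowest-scoring alleles are dropped. With a cap of 6, all five are kept because the site has fewer candidates than the cap.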

    -Sheila

  • Lavanya (Member)

    Thanks Sheila
