Variant calling for 2000 exome samples (multi-sample)

Hi,
I am trying to do multi-sample SNP calling for over 2000 exome samples.

I have a few questions about setting up the following workflow:

  1. Run HaplotypeCaller on each sample to generate a GVCF (with -L set to the whole-exome intervals)
  2. Run CombineGVCFs on batches of 50 samples (giving 40 combined GVCFs, since 2000 / 50 = 40)
  3. Run GenotypeGVCFs on the 40 combined GVCFs (with -L set to the whole-exome intervals)

Could anyone comment on the above workflow? Thanks.

Answers

  • Sheila (Broad Institute) Member, Broadie ✭✭✭✭✭

    @Lavanya
    Hi,

    Your workflow looks fine. A few points to note:
    1) You can run CombineGVCFs on up to 200 GVCFs at a time, but 50 is fine too.
    2) You do not need to pass -L again after the HaplotypeCaller step, because the intervals are already reflected in the GVCFs it outputs. http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals#latest
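The batching described in this thread can be sketched in Python as follows. This is a minimal illustration, assuming GATK 3.x command-line syntax (`-T`, `-R`, `--variant`, `-o`); the jar path, file names, and helper functions (`chunk`, `combine_gvcfs_cmd`, `genotype_gvcfs_cmd`) are hypothetical and not part of GATK. It only builds the argument lists and does not run anything. In line with the advice above, `-L` is not passed to the steps after HaplotypeCaller.

```python
from typing import List, Sequence

GATK_JAR = "GenomeAnalysisTK.jar"  # assumed location of a GATK 3.x jar


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def _variant_tool_cmd(tool: str, reference: str,
                      variants: Sequence[str], output: str) -> List[str]:
    """Build an argument list for a GATK 3.x tool that takes --variant inputs."""
    cmd = ["java", "-jar", GATK_JAR, "-T", tool, "-R", reference]
    for path in variants:
        cmd += ["--variant", path]
    return cmd + ["-o", output]


def combine_gvcfs_cmd(reference: str, gvcfs: Sequence[str],
                      output: str) -> List[str]:
    """Command that merges one batch of per-sample GVCFs."""
    return _variant_tool_cmd("CombineGVCFs", reference, gvcfs, output)


def genotype_gvcfs_cmd(reference: str, combined: Sequence[str],
                       output: str) -> List[str]:
    """Command that jointly genotypes the combined GVCFs."""
    return _variant_tool_cmd("GenotypeGVCFs", reference, combined, output)


# Hypothetical file names for 2000 per-sample GVCFs.
sample_gvcfs = [f"sample_{i:04d}.g.vcf" for i in range(2000)]
batches = chunk(sample_gvcfs, 50)  # 2000 / 50 = 40 batches
combined_gvcfs = [f"batch_{n:02d}.g.vcf" for n in range(len(batches))]

combine_cmds = [
    combine_gvcfs_cmd("reference.fasta", batch, out)
    for batch, out in zip(batches, combined_gvcfs)
]
genotype_cmd = genotype_gvcfs_cmd("reference.fasta", combined_gvcfs, "cohort.vcf")
```

Each entry of `combine_cmds` could be handed to `subprocess.run` or written to a job script. Changing the batch size (for example to 200, the upper limit mentioned above) only requires changing the second argument to `chunk`.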

    -Sheila

  • Lavanya, Member

    Hi Sheila,

    Thanks for your response. I have a few questions about parameters:
    1. -contamination 0.0: What is the impact of using this parameter in the HaplotypeCaller step?
    2. max_alternate_alleles 3: What is the impact of changing this parameter from the default of 6 to 3? Does it affect the calling of SNPs and indels?
    3. The final VCF from GenotypeGVCFs contains both SNPs and indels (with max_alternate_alleles 3). Is there any impact when the default value is changed from 6 to 3?
    4. UnifiedGenotyper has a parameter "-glm BOTH" to call both SNPs and indels. Is an explicit flag necessary for the GenotypeGVCFs step? I just want to confirm, as I would like to call both SNPs and indels.
    5. I used --variant_index_type LINEAR and --variant_index_parameter 128000 when running HaplotypeCaller, as suggested in its documentation (https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--max_alternate_alleles).
    What impact do these parameters have on the results?

    Thanks

  • Sheila (Broad Institute) Member, Broadie ✭✭✭✭✭

    @Lavanya
    Hi,

    1) You only need to set this to a value greater than 0 if you know there is contamination in your dataset. If there is no contamination, a non-zero value will cause biased downsampling.

    2) If a site has more than 3 potential alternate alleles, only the three most likely alleles will be reported; the others will not.

    3) No impact if you set HaplotypeCaller max_alternate_alleles to 3. But if you left HaplotypeCaller max_alternate_alleles at 6 and set GenotypeGVCFs max_alternate_alleles to 3, only the top 3 alternate alleles will be reported in the final VCF.

    4) There is no need for a special flag, as HaplotypeCaller calls both indels and SNPs together.

    5) Those are simply parameters related to indexing. You needed to add them in version 3.3, but in the latest builds you do not need to add them.
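To make the parameter discussion concrete, here is a hedged sketch of how the per-sample HaplotypeCaller command might be assembled. It assumes GATK 3.x syntax; the function name `haplotype_caller_cmd`, its keyword arguments, the jar path, and the file names are hypothetical. The default of 6 for `max_alternate_alleles` is taken from the thread, and the indexing arguments are added only on request, since they were needed in 3.3 but not in later builds.

```python
from typing import List, Optional

DEFAULT_MAX_ALT_ALLELES = 6  # default value as stated in this thread


def haplotype_caller_cmd(
    reference: str,
    bam: str,
    output_gvcf: str,
    intervals: Optional[str] = None,
    contamination: float = 0.0,
    max_alternate_alleles: int = DEFAULT_MAX_ALT_ALLELES,
    legacy_index_args: bool = False,
    gatk_jar: str = "GenomeAnalysisTK.jar",  # assumed jar location
) -> List[str]:
    """Build a GATK 3.x HaplotypeCaller argument list for GVCF output."""
    cmd = [
        "java", "-jar", gatk_jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-o", output_gvcf,
    ]
    if intervals:
        # Restrict calling to the exome targets at this step only.
        cmd += ["-L", intervals]
    if contamination > 0:
        # Only pass this when contamination is actually known to be present.
        cmd += ["-contamination", str(contamination)]
    if max_alternate_alleles != DEFAULT_MAX_ALT_ALLELES:
        cmd += ["--max_alternate_alleles", str(max_alternate_alleles)]
    if legacy_index_args:
        # Indexing arguments that GATK 3.3 required; later builds do not.
        cmd += ["--variant_index_type", "LINEAR",
                "--variant_index_parameter", "128000"]
    return cmd


# Example with hypothetical file names: cap alternate alleles at 3.
example_cmd = haplotype_caller_cmd(
    "reference.fasta", "sample_0001.bam", "sample_0001.g.vcf",
    intervals="exome.intervals", max_alternate_alleles=3,
)
```

Leaving `contamination` at 0.0 keeps the flag off the command line entirely, which matches the advice to set it only when contamination is known.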

    -Sheila

  • Lavanya, Member

    Thanks Sheila
