CombineGVCFs subsampling questions

shenglai (Chicago, Member)


I want to merge ~3000 HC outputs into one large cohort. However, even if I run it directly, scattering over 30M genome chunks, it still takes a long time to compute. So I think I should first merge them into several small cohorts and then merge all the small cohorts.
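
The two-stage merge described above can be planned with a short script. This is only a minimal sketch in Python: the file names and the batch size of 30 are placeholders, and the code just groups paths into batches. It does not call CombineGVCFs or any other GATK tool.

```python
def batch(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical per-sample gVCF paths (placeholders, not real files).
samples = [f"sample_{n:03d}.g.vcf" for n in range(1, 301)]

# Stage 1: each batch would be combined into one intermediate cohort.
batches = batch(samples, 30)

# Stage 2: the intermediate cohorts would then be combined into one.
for index, members in enumerate(batches, start=1):
    print(f"cohort_{index:02d}: {len(members)} samples, first = {members[0]}")
```

With 300 inputs and a batch size of 30 this yields 10 batches, which matches the test setup described in the next paragraph.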

I ran a subsampling test: one group of 300 samples vs. 10 groups of 30 samples. However, the md5sums of the outputs differ even after excluding the header. I understand that CombineGVCFs outputs carry some cohort-level information, but I'm wondering how important these differences are and how much they would matter in the downstream VQSR pipeline.
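
For reference, a header-independent checksum like the one described above could be computed along these lines. This is a sketch under assumptions: the function name `body_md5` is made up for illustration, and "header" is taken to mean every line starting with `#`.

```python
import gzip
import hashlib


def body_md5(path):
    """Return the MD5 of all non-header lines of a VCF/gVCF file.

    Header lines are assumed to be those starting with '#'.
    Files ending in '.gz' are read through gzip.
    """
    opener = gzip.open if path.endswith(".gz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        for line in handle:
            if not line.startswith(b"#"):
                digest.update(line)
    return digest.hexdigest()
```

Two files with matching records then give the same digest regardless of header differences, while any change in a record, including a single annotation value, changes it.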

Thank you!


  • Sheila (Broad Institute; Member, Broadie admin)

    Hi Shenglai,

    Can you confirm you used the same version of GATK throughout your analysis? Are you saying when you run CombineGVCFs then GenotypeGVCFs on the same sample but with different groups of GVCFs, you get different results? Can you post some example records, if so?


  • shenglai (Chicago, Member)
    edited September 2016

    Hi Sheila,

    1. Yes, I'm using nightly-2016-02-25-gf39d340 for all the testing.

    2. No. I'm only comparing the outputs of CombineGVCFs: (1) 300 samples --> 1 cohort; (2) 300 samples --> 10 cohorts of 30 samples each --> combine the 10 cohorts into 1 cohort. The outputs are slightly different. I'm wondering how much this would affect the downstream VQSR pipeline (GenotypeGVCFs, etc.).

    3. Example records are shown below:

    chr1    17365   rs369606208 C   G,<NON_REF> .   .   BaseQRankSum=0.538;ClippingRankSum=0.189;DP=2811;ExcessHet=3.01;MQRankSum=0.053;RAW_MQ=2182025.00;ReadPosRankSum=0.337  GT:AD:DP:GQ:MIN_DP:PGT:PID:PL:SB
    chr1    17365   rs369606208 C   G,<NON_REF> .   .   BaseQRankSum=0.688;ClippingRankSum=0.449;DP=2811;ExcessHet=3.01;MQRankSum=0.053;RAW_MQ=2182025.00;ReadPosRankSum=0.950  GT:AD:DP:GQ:MIN_DP:PGT:PID:PL:SB
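
To see which annotations differ between the two records above, the INFO columns can be compared key by key. The following is a small illustrative sketch in Python; the helper names `parse_info` and `diff_info` are made up for this example and are not part of GATK.

```python
def parse_info(info):
    """Parse a VCF INFO string ('K1=V1;K2=V2;FLAG') into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def diff_info(info_a, info_b):
    """Return {key: (value_a, value_b)} for every key whose value differs."""
    a, b = parse_info(info_a), parse_info(info_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# INFO columns copied from the two example records.
info_1 = ("BaseQRankSum=0.538;ClippingRankSum=0.189;DP=2811;ExcessHet=3.01;"
          "MQRankSum=0.053;RAW_MQ=2182025.00;ReadPosRankSum=0.337")
info_2 = ("BaseQRankSum=0.688;ClippingRankSum=0.449;DP=2811;ExcessHet=3.01;"
          "MQRankSum=0.053;RAW_MQ=2182025.00;ReadPosRankSum=0.950")

for key, (left, right) in diff_info(info_1, info_2).items():
    print(f"{key}: {left} vs {right}")
```

For these two records, only BaseQRankSum, ClippingRankSum and ReadPosRankSum differ; DP, ExcessHet, MQRankSum and RAW_MQ are identical.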

    I'm also wondering whether it's possible to seed any random selection processes so that the outputs are identical.

    Thank you!

  • Sheila (Broad Institute; Member, Broadie admin)


    Sorry for the late response. Are you using multi-threading? It looks like there are some slight differences in the Rank Sum annotations, which could be due to multi-threading.


  • shenglai (Chicago, Member)
    edited September 2016

    Yes, I'm scattering by ~30M chunks. I'm wondering whether the differences in the Rank Sum annotations would affect the VQSR pipeline and, if so, by how much. Thank you so much! BTW, are you saying that even if I stick with the same scattering method, the Rank Sum annotations would still be different?

  • shenglai (Chicago, Member)

    Thank you, @Geraldine_VdAuwera. I will check the value just in case. Thank you for your help!
