CombineGVCFs subsampling questions

Hi GATK!

I want to merge ~3000 HC outputs into one large cohort. However, even if I run it directly, scattering over 30 Mb genome chunks, it would still take a long time to compute. So I think I should first merge them into several small cohorts and then merge all the small cohorts.
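A minimal sketch of the two-stage merge described above, using the 300-sample, batches-of-30 layout from the test. The file names, reference, and interval are hypothetical placeholders, and the argument list assumes the GATK 3-style `-T CombineGVCFs` invocation; the commands are only built here, not run.

```python
from typing import List


def make_batches(gvcfs: List[str], batch_size: int) -> List[List[str]]:
    """Split the per-sample GVCF list into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [gvcfs[i:i + batch_size] for i in range(0, len(gvcfs), batch_size)]


def combine_command(inputs: List[str], output: str,
                    reference: str, interval: str) -> List[str]:
    """Build a GATK 3-style CombineGVCFs argument list (assumed syntax)."""
    cmd = ["java", "-jar", "GenomeAnalysisTK.jar", "-T", "CombineGVCFs",
           "-R", reference, "-L", interval, "-o", output]
    for path in inputs:
        cmd += ["--variant", path]
    return cmd


# Hypothetical inputs: 300 per-sample GVCFs merged in batches of 30.
samples = [f"sample_{i:03d}.g.vcf" for i in range(300)]
batches = make_batches(samples, 30)

# Stage 1: one intermediate cohort per batch.
stage1 = [combine_command(b, f"batch_{n:02d}.g.vcf", "ref.fasta", "chr1")
          for n, b in enumerate(batches)]

# Stage 2: merge the intermediate cohorts into the final cohort.
intermediates = [f"batch_{n:02d}.g.vcf" for n in range(len(batches))]
stage2 = combine_command(intermediates, "cohort.g.vcf", "ref.fasta", "chr1")
```

Each list in `stage1` and the single `stage2` list could be passed to `subprocess.run` once the paths point at real files.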

I ran a subsampling test: one group of 300 samples vs. 10 groups of 30 samples. However, the outputs have different md5sums even after excluding the header. I understand that CombineGVCFs outputs carry some cohort-level information, but I'm wondering how much these differences matter for the downstream VQSR pipeline.
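For reference, the header-excluded checksum comparison can be sketched as follows. This is only an illustration of the idea, not the exact command used for the test; it treats every line starting with `#` as header, and the function names are made up for this example.

```python
import gzip
import hashlib
from typing import Iterable


def body_md5(lines: Iterable[bytes]) -> str:
    """MD5 over all non-header lines (those not starting with '#')."""
    digest = hashlib.md5()
    for line in lines:
        if not line.startswith(b"#"):
            digest.update(line)
    return digest.hexdigest()


def file_body_md5(path: str) -> str:
    """Same checksum for a plain or gzip/bgzip-compressed VCF on disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return body_md5(handle)


# Two in-memory "files" with different headers but identical records.
run_a = [b"##source=runA\n", b"#CHROM\tPOS\n", b"chr1\t17365\n"]
run_b = [b"##source=runB\n", b"#CHROM\tPOS\n", b"chr1\t17365\n"]
same_body = body_md5(run_a) == body_md5(run_b)
```

A checksum like this only says whether the bodies differ at all; it does not say which records or annotations changed.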

Thank you!
Shenglai

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @shenglai
    Hi Shenglai,

    Can you confirm you used the same version of GATK throughout your analysis? Are you saying when you run CombineGVCFs then GenotypeGVCFs on the same sample but with different groups of GVCFs, you get different results? Can you post some example records, if so?

    Thanks,
    Sheila

  • shenglai (Chicago; Member)
    edited September 2016

    Hi Sheila,

    1. Yes, I'm using nightly-2016-02-25-gf39d340 for all the testing.

    2. No. I'm only comparing the outputs from CombineGVCFs: 1) 300 samples --> 1 cohort; 2) 300 samples --> 10 cohorts of 30 samples each --> the 10 cohorts combined into 1 cohort. The outputs are slightly different. I'm wondering how much this would affect the downstream pipeline (GenotypeGVCFs, VQSR, etc.).

    3. An example record, as it appears in each of the two outputs, is shown below:

    chr1    17365   rs369606208 C   G,<NON_REF> .   .   BaseQRankSum=0.538;ClippingRankSum=0.189;DP=2811;ExcessHet=3.01;MQRankSum=0.053;RAW_MQ=2182025.00;ReadPosRankSum=0.337  GT:AD:DP:GQ:MIN_DP:PGT:PID:PL:SB
    
    chr1    17365   rs369606208 C   G,<NON_REF> .   .   BaseQRankSum=0.688;ClippingRankSum=0.449;DP=2811;ExcessHet=3.01;MQRankSum=0.053;RAW_MQ=2182025.00;ReadPosRankSum=0.950  GT:AD:DP:GQ:MIN_DP:PGT:PID:PL:SB
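To make the difference between the two records explicit, here is a small sketch that parses the INFO column of each line and reports only the keys whose values differ. The helper names are hypothetical, and the records are pasted from above with the sample columns omitted.

```python
from typing import Dict, Optional, Tuple


def parse_info(record: str) -> Dict[str, str]:
    """Return the INFO column (8th VCF field) as a key -> value mapping."""
    info = record.split()[7]
    pairs = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")  # flag-only keys get value ""
        pairs[key] = value
    return pairs


def diff_info(a: str, b: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing INFO key to its (value in a, value in b)."""
    ia, ib = parse_info(a), parse_info(b)
    return {k: (ia.get(k), ib.get(k))
            for k in sorted(set(ia) | set(ib)) if ia.get(k) != ib.get(k)}


record_a = ("chr1 17365 rs369606208 C G,<NON_REF> . . "
            "BaseQRankSum=0.538;ClippingRankSum=0.189;DP=2811;ExcessHet=3.01;"
            "MQRankSum=0.053;RAW_MQ=2182025.00;ReadPosRankSum=0.337 "
            "GT:AD:DP:GQ:MIN_DP:PGT:PID:PL:SB")
record_b = ("chr1 17365 rs369606208 C G,<NON_REF> . . "
            "BaseQRankSum=0.688;ClippingRankSum=0.449;DP=2811;ExcessHet=3.01;"
            "MQRankSum=0.053;RAW_MQ=2182025.00;ReadPosRankSum=0.950 "
            "GT:AD:DP:GQ:MIN_DP:PGT:PID:PL:SB")

differences = diff_info(record_a, record_b)
for key, (left, right) in differences.items():
    print(f"{key}: {left} vs {right}")
# BaseQRankSum: 0.538 vs 0.688
# ClippingRankSum: 0.189 vs 0.449
# ReadPosRankSum: 0.337 vs 0.950
```

For this site, only three of the Rank Sum annotations differ; DP, ExcessHet, MQRankSum, and RAW_MQ are identical.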
    

    I'm also wondering if it's possible to seed any random selection processes to produce identical outputs?

    Thank you!

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @shenglai
    Hi,

    Sorry for the late response. Are you using multi-threading? It looks like there are some slight differences in the Rank Sum annotations which could be due to multi-threading.

    -Sheila

  • shenglai (Chicago; Member)
    edited September 2016

    Yes, I'm scattering by ~30 Mb chunks. I'm wondering whether the differences in the Rank Sum annotations would affect the VQSR pipeline and, if so, by how much. Thank you so much! BTW, are you saying that even if I stick with the same scattering method, the Rank Sum annotations would still differ?


    Issue · Github (by Sheila)
    Issue Number: 1262
    State: closed
    Closed By: vdauwera
  • shenglai (Chicago; Member)

    Thank you, @Geraldine_VdAuwera. I will check the value just in case. Thank you for your help!
