Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Strange results generated from HaplotypeCaller when comparing "-minPruning 3" with "-minPruning 2"

qingfengzealotqingfengzealot Member
edited August 2014 in Ask the GATK team

Hi GATK developer,

I am working on a human WES project with an average sequencing depth 42X per individual. Recent version of GATK (v3.2-2) was used in my study. To investigate which "-minPruning" setting is better for my project, I ran HaplotypeCaller and tested the "-minPruning 3" and "-minPruning 2", so I generate two sets of gvcf files. I found some interesting cases when I compared these two sets of files.

For example in the same individual (we name this individual as A):

Locus "chr1 1654007" from the GVCF file for individual A (-minPruning 3) is listed as below:

chr1 1654007 . A . . END=1654007 GT:DP:GQ:MIN_DP:PL 0/0:64:0:64:0,0,1922

Locus "chr1 1654007" from the GVCF file for individual A (-minPruning 2) is listed as below:

chr1 1654007 rs199558549 A T, 155.90 . DB;DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=27.73;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,6,0:6:18:189,18,0,189,18,189:0,0,4,2

You can see it is quite different for the results in terms of the final genotyping call and the sequencing depth ("DP" tag) generated from "-minPruning 3" and "-minPruning 2" setting. I am wondering what exactly happens for this locus so I check this locus using IGV. I found at this locus, there are 6 reads supporting the ALT allele ("T") and 64 reads supporting the REF allele ("A"). So I believe the result from "-minPruning 3" should be correct.
Just wondering is it normal considering the different setting of "-minPruning 3" and "-minPruning 2"or is it just a bug? If it is just because of the different setting of "-minPruning 3" and "-minPruning 2", I would say the setting of "-minPruning" is critical for the final results so a guideline of how to set the "-minPruning" based on the sequencing depth will be important. How do you think?

Thanks,

Qiongyi

Post edited by qingfengzealot on

Issue · Github
by Geraldine_VdAuwera

Issue Number
837
State
closed
Last Updated
Assignee
Array
Closed By
ldgauthier

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @qingfengzealot‌

    Hi Qiongyi,

    The -minPruning argument is related to the graph assembling stage of Haplotype Caller. Please read about this step here: http://www.broadinstitute.org/gatk/guide/article?id=4146

    When you set the -minPruning value to 2, any branch of the graph that has less than 2 reads to support it is thrown out. When you set the -minPruning value to 3, any branch that has less than 3 reads to support it is thrown out. In your case, the branches that support the T allele must have only 2 reads to support each of the 3 branches it shows up on. So, when -minPruning is 2, the T branches are still present in the haplotypes. But, when you set -minPruning to 3, the T branches are pruned because they only have 2 reads to support each of the branches.

    Increasing the -minPruning value will lead to faster processing and higher specificity, but will decrease sensitivity. Decreasing the -minPruning value will do the opposite, decreasing specificity but increasing sensitivity. You can see in your case that when you increase the value, you lose a potential variant. However, it is up to you to decide how sensitive you want to be.

    I hope this helps.

    -Sheila

  • Hi Sheila,

    Thanks for your quick response. Yes, "-minPruning" should be working exactly like you said and that's what I thought. However, in my example it is clear that we have 64 reads supporting the REF allele ("A") and 6 reads supporting the ALT allele ("T"), which I've checked through IGV. Here "64" and "6" reads have high quality of mapping scores. So I feel curious why "-minPruning 2" gave me "1/1:0,6" and "-minPruning 3" gave me "0/0:64". In any case, I feel "-minPruning 2" shouldn't discard the 64 reads that support the REF allele.

    Best regards,

    Qiongyi

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @qingfengzealot‌

    Hi Qiongyi,

    Okay. Can you please try this again with the latest nightly build? You can download it here: http://www.broadinstitute.org/gatk/nightly

    Please let me know if you get the same error. If the error persists, I will need you to upload a snippet of the file so we can debug locally.

    Thanks,
    Sheila

  • Hi Sheila,

    Thanks for your prompt reply and sorry for the delay. I've tried the latest nightly build version of GATK, but the error persists. I will generate a snippet of the file for debugging locally. I will let you know when the snippet of the file is ready.

    Thanks,

    Qiongyi

  • qingfengzealotqingfengzealot Member
    edited September 2014

    Hi Sheila,

    I've uploaded a bam snippet to your ftp. The file name is "minpruning_bug_20140919.tar.gz" under folder "/minpruning_bug_20140919_Qiongyi". It includes 7 files (please see details below).

    1) bed file: "minpruning_bug_20140919.bed";

    2) the BAM snippet file: "minpruning_bug.sorted.bam";

    3) the index (.bai) file: "minpruning_bug.sorted.bam.bai";

    4) the commands to reproduce this error: "work.sh";

    5) The standard output message when running the commands: output_message.txt;

    6-7) GVCF files from "-minPruning 2" and "-minPruning 3": "minpruning.2.g.vcf" and "minpruning.3.g.vcf";

    You may check the locus "chr1 1654007" from two GVCF files where the results are quite different. Hope these files would be helpful for your team to debug. Thanks.

    Cheers,

    Qiongyi

    @Sheila said:
    qingfengzealot‌

    Hi Qiongyi,

    Okay. Can you please try this again with the latest nightly build? You can download it here: http://www.broadinstitute.org/gatk/nightly

    Please let me know if you get the same error. If the error persists, I will need you to upload a snippet of the file so we can debug locally.

    Thanks,
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @qingfengzealot‌

    Hi Qiongyi,

    I have just submitted a bug report. I will let you know when the issue has been resolved.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @qingfengzealot I just realized we forgot to let you know about the resolution for your question. In-depth analysis showed that the program performed as expected. Here are the developer's comments:

    This is expected behavior for tail merging.

    So at first glance, it's not so surprising that increasing minPruning (i.e. increasing the amount of evidence necessary to keep a path in the graph) removes a variant. The raw graph (before pruning and branch recovery) looks like this:

    image

    Yes, the alternate T still has 6 reads supporting it (no local realignment), but the last two bases in the graph only have 2 reads supporting them. Working from the end of the dangling tail, we have to find a base to get merged back in that will pass the minPruning threshold, so we skip those two. By the time we get to the A with 4 reads of support, the branch length is only 3, which is less than the default minDanglingBranchLength of 4. In the case of the minPruning=3, those last two bases survive and the tail gets merged back in and the T->A variant called. Totally expected behavior based on the graph building parameters.

    The variant could probably be recovered for minPruning=3 using minDanglingBranchLength < 4, but the more merging the user attempts, the slower the code will run.

Sign In or Register to comment.