
rules for max_alternate_alleles in HaplotypeCaller

pawel_osipowski (Warsaw, Poland; Posts: 8; Member)
edited December 2013 in Ask the GATK team

Hi,

I can't work out how this parameter works. Please help. I ran the same files through the exact same command, changing only max_alternate_alleles: I set it to 1 in the first run (--max_alternate_alleles 1) and to 2 in the second. The outputs differed by 600 SNVs:

a) At some sites, the second run replaced the variant called in the first run with one that has a better score, e.g. CSB10A_v1_contig_682, position 232, ref.: G, first: GT (90.75), second: GTT (135.73). Scores are in brackets.

b) At some sites the second run, unlike the first, gave no SNVs because there were no mapped reads.

c) I'm not sure about this one, because I can't trace back what I think I saw: the opposite of a), where scores from the second command were worse than those from the first.

Could you explain why?

Paul

Answers

  • pawel_osipowski (Warsaw, Poland; Posts: 8; Member)

    Item a) got garbled after approval. It should read: ref.: G, first: GT (90.75), second: GTT (135.73). Scores are in brackets, of course.

  • Geraldine_VdAuwera (Posts: 5,902; Administrator, GATK Developer)

    Hi Paul,

    This argument sets a limit on the number of alternate alleles that the HaplotypeCaller will consider when evaluating haplotypes. If it sees more possibilities in the data than this argument allows, it will proceed with the most likely ones and discard the rest. The effect on the number of variant calls is not easy to predict, since the limit changes the decisions that the caller has to make depending on the data.

    Geraldine Van der Auwera, PhD

  • pawel_osipowski (Warsaw, Poland; Posts: 8; Member)
    edited December 2013

    Geraldine,

    Thank you for your answer. Do you think it's worth setting that argument to 1 when aligning reads from haploid genomes and calling variants? Can it be beneficial in any way? When I add the argument with value 1, I get more candidate variants from the caller than without it. I guess that, because of the constraint I set, the caller doesn't treat some variant regions as a single unit, but splits them and realigns them separately. That's how I get more variants from it, as in this case:

    with the argument set to 1:

    CSB10A_v1_contig_10002  3405    .       C       CT      529.73  .       AC=2;AF=1.00;AN=2;DP=19;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=35.25;MQ0=0;QD=27.88     GT:AD:DP:GQ:PL  1/1:0,17:17:51:567,51,0
    CSB10A_v1_contig_10002  3411    .       G       GTT     80.94   .       AC=2;AF=1.00;AN=2;DP=20;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=33.02;MQ0=0;QD=2.02      GT:AD:DP:GQ:PL  1/1:3,10:13:13:118,13,0
    CSB10A_v1_contig_10002  3427    .       GT      G       481.73  .       AC=2;AF=1.00;AN=2;DP=17;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=34.27;MQ0=0;QD=28.34     GT:AD:DP:GQ:PL  1/1:0,15:15:45:519,45,0

    without the change:

    CSB10A_v1_contig_10002  3405    .       C       CT      529.73  .       AC=2;AF=1.00;AN=2;DP=19;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=35.25;MQ0=0;QD=27.88     GT:AD:DP:GQ:PL  1/1:0,17:17:51:567,51,0
    CSB10A_v1_contig_10002  3427    .       GT      G       481.73  .       AC=2;AF=1.00;AN=2;DP=17;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=34.27;MQ0=0;QD=28.34     GT:AD:DP:GQ:PL  1/1:0,15:15:45:519,45,0

    Regards, Paul
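
A comparison like the one above can be automated instead of done by eye. The following is a minimal sketch, not a GATK tool: the helper names `read_sites` and `compare_sites` are made up for illustration, and it assumes plain, uncompressed VCF text in which the first six columns contain no whitespace. The demo input is the first six columns of the records quoted in the post.

```python
import io

def read_sites(handle):
    """Map (CHROM, POS) -> (REF, ALT, QUAL) for every record in a VCF text stream."""
    sites = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split(None, 6)  # the first six columns are enough here
        chrom, pos = fields[0], int(fields[1])
        sites[(chrom, pos)] = (fields[3], fields[4], fields[5])
    return sites

def compare_sites(a, b):
    """Return sites only in a, sites only in b, and shared sites whose record differs."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed

# Records from the post, truncated to the first six columns.
with_limit_1 = io.StringIO(
    "CSB10A_v1_contig_10002 3405 . C CT 529.73\n"
    "CSB10A_v1_contig_10002 3411 . G GTT 80.94\n"
    "CSB10A_v1_contig_10002 3427 . GT G 481.73\n"
)
default_run = io.StringIO(
    "CSB10A_v1_contig_10002 3405 . C CT 529.73\n"
    "CSB10A_v1_contig_10002 3427 . GT G 481.73\n"
)

only_1, only_default, changed = compare_sites(
    read_sites(with_limit_1), read_sites(default_run)
)
print(only_1)        # [('CSB10A_v1_contig_10002', 3411)]
print(only_default)  # []
print(changed)       # []
```

For real files, the two `io.StringIO` objects would be replaced by `open(...)` on the two output VCFs. The `changed` list is where cases such as a) in the first post (same site, different ALT allele or score) would show up.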

  • smk_84 (Posts: 59; Member)

    What would be a good value for the --max_alternate_alleles argument? I suppose it will vary between organisms. The current default in my case is six. I am running the analysis on soybean.

  • Sheila (Broad Institute; Posts: 295; Member, GATK Developer, Broadie, Moderator)

    @smk_84

    Hi,

    You are free to experiment with this! You can run it with the default settings, and then try rerunning with the maximum alternate alleles you find from the default (the output will tell you when there are more than the default alleles).

    This parameter is related to the size of the cohort (how many different samples are being analyzed together) and to how diverse you expect the population to be. If the samples are all from the same family, you expect them to be closely related; if they are all strangers, they may not be so closely related.

    Good luck!

    -Sheila
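
One way to explore this on your own data is to tally how many alternate alleles each site in an output VCF carries. The sketch below is only an illustration under stated assumptions: `alt_allele_counts` is a made-up helper, the input is plain VCF text, and the example records are hypothetical, not taken from any real run. It reads the VCF rather than the caller's log, so it can only show how many sites sit at or near the limit you used, which is a hint, not proof, that the limit was binding.

```python
from collections import Counter

def alt_allele_counts(lines):
    """Tally sites by the number of ALT alleles listed in the ALT column."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        alt = line.split(None, 5)[4]  # ALT is the fifth VCF column
        n_alt = 0 if alt == "." else len(alt.split(","))
        counts[n_alt] += 1
    return counts

# Hypothetical records for illustration only.
example = [
    "#CHROM POS ID REF ALT QUAL",
    "contig_A 10 . G GT 90.75",
    "contig_A 25 . C T,CT 60.00",
    "contig_B 7 . A . 30.00",
]

counts = alt_allele_counts(example)
print(sorted(counts.items()))  # [(0, 1), (1, 1), (2, 1)]
print(max(counts))             # 2
```

If many sites reach the value passed to --max_alternate_alleles, rerunning with a higher value and comparing the two outputs is a reasonable next experiment.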

  • jam (Posts: 9; Member)

    Hi, could you tell me if the default of 6 is based on human data and cohorts of maybe 1000? I deal with many different experiments, all non-human, with different cohort sizes and degrees of relatedness, so I think I will need to play with this quite a bit.

  • Sheila (Broad Institute; Posts: 295; Member, GATK Developer, Broadie, Moderator)

    @jam

    Hi,

    The number 6 is indeed based on human cohorts, probably from the 1000 Genomes project. It is meant as a reasonable compromise between computing requirements and what is likely to occur in populations.

    For larger cohort sizes, it should probably be increased. I am not sure about non-human though. You will need to adapt it to what you see in your data. The best thing to do is to play around with it, as you correctly assume.

    -Sheila
