The current GATK version is 3.7-0

rules for max_alternate_alleles in HaplotypeCaller

pawel_osipowski (Warsaw, Poland; Member)
edited December 2013 in Ask the GATK team

Hi,

I can't come to a clear conclusion about how this parameter works. Help me, please. I ran the same files with the exact same command, changing only max_alternate_alleles: in the first command I set it to 1 (--max_alternate_alleles 1) and in the second to 2. The outputs differed by 600 SNVs:

a) There are sites where HaplotypeCaller, in the second command, replaced the SNV with one that has a better score than in the first command, e.g.
CSB10A_v1_contig_682 232 ref.: G first: GT (90.75) second: GTT (135.73). Scores in brackets.

b) There are sites where, unlike the first command, the second command didn't give any SNVs, because there were no mapped reads.

c) I'm not sure about this one, because I can't trace back what I think I saw: the opposite of a), where scores from the second command were worse than those from the first.

Could you explain why?

Paul

Answers

  • pawel_osipowski (Warsaw, Poland; Member)

    a) went messy after approval: ref.: G first: GT (90.75) second: GTT (135.73). Scores in brackets, of course.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Paul,

    This argument sets a limit on the number of alternate alleles that the HaplotypeCaller will consider when evaluating haplotypes. If it sees more possibilities in the data than are allowed by this argument, it will proceed with the most likely and discard the rest. The effects on number of variant calls are not easy to predict since it changes the decisions that the caller has to make depending on the data.
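
As a rough illustration of the behaviour described above, the cap can be pictured as ranking the candidate alternate alleles by likelihood and keeping only the top few. This is a minimal sketch, not GATK's actual implementation: the function name `limit_alt_alleles` and the likelihood values are hypothetical.

```python
def limit_alt_alleles(candidates, max_alternate_alleles):
    """Keep the most likely alternate alleles, up to the given cap.

    candidates maps an allele string to a likelihood score (higher means
    more likely). The scores used below are made up for illustration.
    """
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:max_alternate_alleles]


# Hypothetical site with reference A and three candidate alternate alleles.
candidates = {"T": 0.5, "C": 0.3, "G": 0.2}
print(limit_alt_alleles(candidates, 1))  # ['T']
print(limit_alt_alleles(candidates, 2))  # ['T', 'C']
```

Because the alleles that survive the cap change which haplotypes are compared, calls at nearby positions can change as well, which is one reason the effect on the total number of calls is hard to predict.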

  • pawel_osipowski (Warsaw, Poland; Member)
    edited December 2013

    Geraldine,

    Thank you for your answer. Do you think it's worth changing that argument to 1 in order to align reads of haploid genomes and call variants? Can it be beneficial in any way? When I add the argument with value 1, I get more possible variants from the caller than without it. I guess that, because of the constraint I set, the caller doesn't treat some variant regions as one module but splits them and realigns them separately. That's how I get more variants from it, as in this case:

    with changed argument to 1:
    CSB10A_v1_contig_10002 3405 . C CT 529.73 . AC=2;AF=1.00;AN=2;DP=19;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=35.25;MQ0=0;QD=27.88 GT:AD:DP:GQ:PL 1/1:0,17:17:51:567,51,0
    CSB10A_v1_contig_10002 3411 . G GTT 80.94 . AC=2;AF=1.00;AN=2;DP=20;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=33.02;MQ0=0;QD=2.02 GT:AD:DP:GQ:PL 1/1:3,10:13:13:118,13,0
    CSB10A_v1_contig_10002 3427 . GT G 481.73 . AC=2;AF=1.00;AN=2;DP=17;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=34.27;MQ0=0;QD=28.34 GT:AD:DP:GQ:PL 1/1:0,15:15:45:519,45,0

    without change:
    CSB10A_v1_contig_10002 3405 . C CT 529.73 . AC=2;AF=1.00;AN=2;DP=19;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=35.25;MQ0=0;QD=27.88 GT:AD:DP:GQ:PL 1/1:0,17:17:51:567,51,0
    CSB10A_v1_contig_10002 3427 . GT G 481.73 . AC=2;AF=1.00;AN=2;DP=17;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=34.27;MQ0=0;QD=28.34 GT:AD:DP:GQ:PL 1/1:0,15:15:45:519,45,0

    Regards,
    Paul
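
When comparing two runs like the ones above, it can help to list the differences by site rather than by eye. The following is a small sketch under stated assumptions: the helper names `sites` and `compare_runs` are hypothetical, records are assumed to be whitespace- or tab-separated VCF lines, and only the CHROM, POS, REF and ALT columns are compared.

```python
def sites(vcf_lines):
    """Map (CHROM, POS) to (REF, ALT) for each VCF record line."""
    calls = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        calls[(chrom, pos)] = (ref, alt)
    return calls


def compare_runs(run_a, run_b):
    """Return sites only in A, sites only in B, and shared sites whose call differs."""
    a, b = sites(run_a), sites(run_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed


# Records abbreviated to the first five columns of the example above.
with_limit_1 = [
    "CSB10A_v1_contig_10002 3405 . C CT",
    "CSB10A_v1_contig_10002 3411 . G GTT",
    "CSB10A_v1_contig_10002 3427 . GT G",
]
default_run = [
    "CSB10A_v1_contig_10002 3405 . C CT",
    "CSB10A_v1_contig_10002 3427 . GT G",
]
only_limit_1, only_default, changed = compare_runs(with_limit_1, default_run)
print(only_limit_1)  # [('CSB10A_v1_contig_10002', 3411)]
```

The same functions accept an open file handle in place of a list, since they only iterate over lines.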

  • smk_84 (Member)

    What would be a good value for the --max_alternate_alleles argument? I suppose it will vary between organisms. The current default is six in my case. I am running the analysis on soybean.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @smk_84

    Hi,

    You are free to experiment with this! You can run it with the default settings, and then try rerunning with the maximum number of alternate alleles you find in that run (the output will tell you when a site has more alternate alleles than the default allows).

    This parameter is related to the size of the cohort (how many different samples are being analyzed together) and to how diverse you expect the population to be. If the samples are all from the same family, you expect them to be closely related, but if they are all strangers, they may not be so closely related.

    Good luck!

    -Sheila

  • jam (Member)

    Hi,
    Could you tell me if the default of 6 is based on human data and cohorts of maybe 1000 samples? I deal with many different experiments, all non-human, with different cohort sizes and relatedness, so I think I will need to play with this quite a bit.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @jam

    Hi,

    The number 6 is indeed based on human cohorts, probably from the 1000 Genomes project. It is meant as a reasonable compromise between computing requirements and what is likely to occur in populations.

    For larger cohort sizes, it should probably be increased. I am not sure about non-human though. You will need to adapt it to what you see in your data. The best thing to do is to play around with it, as you correctly assume.

    -Sheila
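
One way to see the computing side of that compromise: the number of distinct genotypes that have to be considered at a site grows quickly with the number of alleles. For a diploid sample with n alternate alleles there are (n+1)(n+2)/2 unordered genotypes. The sketch below only does this standard counting; the function name `genotype_count` is hypothetical, and it says nothing about how GATK itself organises the calculation.

```python
from math import comb  # available in Python 3.8+


def genotype_count(n_alt_alleles, ploidy=2):
    """Number of unordered genotypes for the reference plus n alternate alleles."""
    n_alleles = n_alt_alleles + 1  # include the reference allele
    # Multisets of size `ploidy` drawn from `n_alleles` alleles.
    return comb(n_alleles + ploidy - 1, ploidy)


for n_alt in (1, 2, 6, 40):
    print(n_alt, genotype_count(n_alt))
# 1 3
# 2 6
# 6 28
# 40 861
```

Each of those genotypes gets its own likelihood per sample, so raising the cap increases both the work done per site and the size of the per-sample genotype likelihood fields.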

  • Greg_Owens (Member)
    edited September 2014

    Hi,

    I'm considering this parameter for some non-human data with ~1000 samples. I'm running the pipeline with HaplotypeCaller on each single individual to produce .gvcf files and then GenotypeGVCFs on all the samples together. What I'd like to know is if I'm running HaplotypeCaller on one diploid individual, whether it is reasonable to set the max_alternate_alleles to 1? This makes sense biologically because there are only two alleles total but I'm not sure how the algorithm works exactly.

    Thanks!

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @Greg_Owens

    Hi,

    When running HaplotypeCaller, the samples are compared to the reference. For diploid samples, there is a possibility that the sample is heterozygous for two alleles that both differ from the reference. For example, if the reference is A and the sample is T/C, setting max_alternate_alleles to 1 will return only T or C as an alternate allele (whichever is more likely). So, you will miss an important second alternate allele.

    When working with 1000 samples, setting max_alternate_alleles to a higher number also allows you to include variants that are not seen in large quantities in one individual sample. But, if the variant is seen in smaller quantities in many samples, the variant is more likely to be a true variant.

    I hope this helps.

    -Sheila

  • Thanks Sheila. I went with max_alternate_alleles = 2 for HaplotypeCaller (which is being run on individual samples) for the reasons you talked about. For GenotypeGVCFs, I'm going to use max_alternate_alleles = 3, because I'm not going to use the indels and a SNP can only have four states, so at most three alternate alleles.

  • aneek (Member)
    edited August 2016

    Hi,
    For 50 human whole exome samples, what value do you suggest for the max_alternate_alleles parameter in GenotypeGVCFs? For your information, when using HaplotypeCaller to generate the GVCFs for individual samples I did not change this parameter, so it ran with the default value of 6. My purpose here is to generate a generic in-house database of allele frequencies for use in variant annotation.

    Thank you. Please advise.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @aneek
    Hi,

    We don't really have any recommendations for how many alternate alleles you should accept. The default is 6 in HaplotypeCaller, but it is up to you to decide what you are interested in. As you have seen, there is a warning that tells you when there are more alternate alleles at a site than the default 6. If you wish to include those extra alternate alleles, you should change the default settings.

    -Sheila

  • @Sheila

    Hi,

    Thank you very much for the explanation. As you suggested, I repeatedly tried different values of the --max_alternate_alleles parameter until I received no warning message during the GenotypeGVCFs step, and found that the maximum number of alternate alleles is 38, at one specific location. I therefore ran the program with --max_alternate_alleles 40, and it completed without any warning, which hopefully means the maximum has been reached.

    Although I did not see any computational problems and the program ran smoothly, I am worried about whether such a high --max_alternate_alleles value is OK for only 50 whole exome samples. Please advise.

    Another question: since I did not change this parameter from the default in the HaplotypeCaller step that generates the individual g.vcfs, is it wise to change it in the GenotypeGVCFs step, and to such a high value (40)?

    In short, I just want to be sure that I am not doing anything wrong that may end up producing a database file with wrong alternate allele frequencies.

    Thanks.

    Issue · Github (by Sheila): Issue Number 1205; State: closed; Closed By: vdauwera
  • Any suggestions, please?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @aneek, we do not provide guidance for this because it is entirely up to you as an analyst to decide what is appropriate for your study. In some cases it may be meaningful to capture all possible alternate alleles observed at a particular locus, especially if you have reason to believe that the polymorphism observed there holds some biological significance. However, in other cases the presence of very many alternate alleles just indicates technical difficulties in the sequencing process and is not useful (seeing 38 alleles in 50 samples seems like it might fall under that category). The development team has been working on methods for applying allele-specific filtering, which may prove useful for distinguishing such cases, but this is not yet ready for wide use. In the meantime you need to decide what you are willing to include in your analysis.

  • @Geraldine_VdAuwera

    Hi, thanks a lot. I understand. In my case it might be a technical error, since I am getting 38 alleles at a particular locus. Even so, if I set --max_alternate_alleles to 40 and proceed with the analysis, do you think it can affect the final output file (alternate allele frequencies etc.)?

    Also, is there any way (command line etc.) to detect how many loci have such a high number of alternate alleles in the samples?

    Thanks..

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @aneek
    Hi,

    No, there is no argument. You will need to look at the WARN statements that are output.

    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Yes, any sites where you allow that many alleles may have different allele frequencies as a result, because instead of constraining the calls to a few alleles, you can end up with calls split over many more possible alleles. Whether that's a problem or not is up to you to decide. I would say that the allele frequencies at those sites will be questionable anyway, since you do not know which is the correct configuration.

    There isn't any easy, straightforward way to test for this upfront that I know of.
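
After the fact, one possible workaround is to count the alternate alleles per site in the VCF that was produced. The sketch below is not a GATK feature: the function name `alt_allele_histogram` is hypothetical, it assumes standard VCF columns with a comma-separated ALT field, and it can only see alleles that survived whatever --max_alternate_alleles cap was used for that run.

```python
from collections import Counter


def alt_allele_histogram(vcf_lines):
    """Count sites by their number of alternate alleles (ALT column)."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        alt = line.split()[4]
        if alt == ".":
            n_alt = 0
        else:
            # Ignore the symbolic <NON_REF> allele found in GVCF records.
            n_alt = len([a for a in alt.split(",") if a != "<NON_REF>"])
        counts[n_alt] += 1
    return counts


# Made-up records, abbreviated to the first five columns.
records = [
    "chr1 100 . A T",
    "chr1 200 . G C,T",
    "chr1 300 . A AT,ATT,ATTT",
]
hist = alt_allele_histogram(records)
print(sorted(hist.items()))  # [(1, 1), (2, 1), (3, 1)]
print(sum(n for k, n in hist.items() if k > 2))  # sites with more than 2 ALT alleles: 1
```

Passing an open file handle instead of the list works the same way, since the function only iterates over lines.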

  • @Sheila @Geraldine_VdAuwera

    Hi,
    Thanks for everything, I understand. Since the warnings about more than 6 alleles are limited to a few specific sites, I think I should ignore them and proceed with the default value of the --max_alternate_alleles argument.

  • mglclinical (USA; Member)

    @Sheila, thanks for mentioning above that the default value of 6 for max_alternate_alleles is based on human cohorts, probably from the 1000 Genomes project.
