Ploidy level in HaplotypeCaller in GATK 4.0

prateekg04 (India, Member)
edited January 30 in Ask the GATK team

Hi,

Thanks for the new version of GATK (GATK4.0).

We have a pool of 48 samples from a diploid organism, so we are using a ploidy of 96 (48 x 2 = 96). Earlier, when I ran HaplotypeCaller in older versions of GATK, I got the error "not enough memory to run this program" and could not complete variant calling. With GATK 4.0 I no longer get this error, but I do get the WARN message below:

12:40:23.159 WARN HaplotypeCallerGenotypingEngine - Removed alt alleles where ploidy is 96 and original allele count is 3, whereas after trimming the allele count becomes 2. Alleles kept are:[T*, C]

The command line we used is:

java -jar -Xmx64g gatk-package-4.0.0.0-local.jar HaplotypeCaller -R tilling.fa -I C1_S1.sorted.bam -O C1_S1.vcf -stand-call-conf 20.0 -ploidy 96

Can you please help us understand what the WARN message means, and whether the command and options I am using are right or I need to include more options for efficient variant calling?

Thanks in advance.

Regards,
Prateek

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)
    edited January 30

    @prateekg04
    Hi Prateek,

    Thank you for adding an edited post. I just edited your original post (I cannot delete the new post, but perhaps you can so others will find the thread easier to read). For the future, you can simply click on the wheel looking icon in the top right of the post and select EDIT to edit a post :smile:

    The WARN message is telling you the number of alleles at the site combined with high ploidy is too much for the tool to handle, so it is removing some less common alternate alleles. The tool removes the least common alternate alleles so there is not as much work involved in determining genotypes (it takes more compute for higher ploidy/more alternate alleles). Have a look at the methods and algorithms section for more information on genotyping. You can change the default value with --max-genotype-count, however, it may be best to leave the default. 96 is very high ploidy, and if you are just looking for the most common alleles, 2 alternate alleles should be enough. What is your end goal? If you are looking for all possible alternate alleles at all sites, you can consider lowering the ploidy so more alternate alleles can be considered, or you can indeed increase the --max-genotype-count which will in turn increase compute.

    -Sheila
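
To make the warning concrete, here is a minimal Python sketch of the kind of trimming it describes. It rests on two assumptions that should be checked against your GATK version: that the number of possible genotypes for ploidy P and A alleles (ref included) is C(P + A - 1, A - 1), and that the default --max-genotype-count is 1024. The function names are hypothetical, and this is an illustration, not GATK's actual code.

```python
from math import comb


def genotype_count(ploidy: int, n_alleles: int) -> int:
    """Number of unordered genotypes for a given ploidy and allele count (ref included)."""
    return comb(ploidy + n_alleles - 1, n_alleles - 1)


def alleles_kept(ploidy: int, original_allele_count: int,
                 max_genotype_count: int = 1024) -> int:
    """Drop alleles one at a time until the genotype count fits under the cap.

    The default of 1024 is an assumption about the GATK4 default.
    """
    kept = original_allele_count
    while kept > 1 and genotype_count(ploidy, kept) > max_genotype_count:
        kept -= 1
    return kept


# Ploidy 96 with ref + 2 alt alleles, as in the WARN message above.
print(genotype_count(96, 3))  # 4753, which is above the assumed cap of 1024
print(alleles_kept(96, 3))    # 2, i.e. ref + 1 alt
```

Under these assumptions, three alleles at ploidy 96 would need 4753 genotypes, so one alternate allele is dropped, which is consistent with "original allele count is 3, whereas after trimming the allele count becomes 2".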

  • prateekg04 (India, Member)

    Hi Sheila,

    Thanks for your answer. However, our samples are pooled, so we can't reduce the ploidy level. Initially, I tried running with --max-genotype-count 4 to check whether the warning still appears, but the run fails with the error below:

    java.lang.IllegalArgumentException: VariantContext has only a single reference allele, but getLog10PNonRef requires at least one alt allele [VC HC40 @ Psy1:28 Q. of type=NO_VARIATION alleles=[A*] attr={} GT=[[C1_S1 ./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././. PL 0]]

    Can you please help us in this regard.

    Thanks in advance.

    Regards,
    Prateek

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @prateekg04
    Hi Prateek,

    If you limit --max-genotype-count to 4 with ploidy 96, the tool will crash, because 4 is too few genotypes. If you would like 1 alternate allele present in the VCF, you need to set --max-genotype-count to at least 97. Have a look at this page for the relationship between possible number of genotypes and ploidy. Notice the number of possible genotypes blows up after ~3 alleles (2 alternate alleles plus ref allele). That is why we recommend setting the number of alt alleles or ploidy to a lower number. Have a look at this thread for more information.

    -Sheila
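
The figure of 97 can be reproduced with a short Python sketch. It assumes the standard count of unordered genotypes, C(P + A - 1, A - 1) for ploidy P and A alleles including the reference; the helper name is hypothetical and not part of GATK.

```python
from math import comb


def min_genotype_cap(ploidy: int, n_alt: int) -> int:
    """Smallest genotype cap that still leaves room for n_alt alternate alleles."""
    n_alleles = n_alt + 1  # alternate alleles plus the reference allele
    return comb(ploidy + n_alleles - 1, n_alleles - 1)


# How quickly the requirement grows at ploidy 96.
for n_alt in (1, 2, 3):
    print(n_alt, min_genotype_cap(96, n_alt))
# 1 97
# 2 4753
# 3 156849
```

A cap of 4 is below 97, so under this model not even one alternate allele fits at ploidy 96. That is consistent with the error above, which reports a site left with only the reference allele.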

  • kjngo (Member)
    edited February 13

    Hi Sheila,

    My experimental setup is similar to Prateek's, but with a ploidy of 128 (2 x 64 individuals). Our goal is to detect rare variants. Which parameters should I adjust in order to have ~4 alleles (3 alternate alleles plus ref allele) written out to the VCF?

    Thanks in advance.

    Best Regards,
    Kathie

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @kjngo
    Hi Kathie,

    You should try setting --max_alternate_alleles and --max_num_PL_values to higher values.

    I hope that works. The links I provided in the post above should help as well.

    -Sheila

  • kjngo (Member)

    @Sheila
    Hi Sheila,

    Thank you for the suggestions. For GATK4 HaplotypeCaller, I don't see the option --max_num_PL_values. Is it only available in GATK3?

    Best Regards,
    Kathie

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @kjngo
    Hi Kathie,

    Ah, yes, you are correct. The new option is --max-genotype-count. The number of PL values is equal to the number of possible genotypes :smile:

    -Sheila
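
For a pooled design like the one asked about above (64 diploid individuals, 3 alternate alleles wanted), the required --max-genotype-count can be estimated the same way. The Python sketch below assumes one PL value per genotype and the genotype count C(P + A - 1, A - 1); the helper name is hypothetical, and the flags it emits are the ones already used in this thread (-ploidy and --max-genotype-count). A value this large will likely cost a lot of compute and memory, so treat it as a starting point rather than a recommendation.

```python
from math import comb


def pooled_calling_args(n_individuals: int, organism_ploidy: int, n_alt: int) -> list:
    """Build HaplotypeCaller arguments for a pool, sized to keep n_alt alternate alleles."""
    ploidy = n_individuals * organism_ploidy
    n_alleles = n_alt + 1  # alternate alleles plus the reference allele
    cap = comb(ploidy + n_alleles - 1, n_alleles - 1)
    return ["-ploidy", str(ploidy), "--max-genotype-count", str(cap)]


print(" ".join(pooled_calling_args(64, 2, 3)))
# -ploidy 128 --max-genotype-count 366145
```

If 3 alternate alleles prove too expensive, 2 alternate alleles at ploidy 128 would need a cap of 8385 under the same assumptions.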

  • kjngo (Member)

    @Sheila
    Hi Sheila,

    Thank you for the clarification. After increasing the --max-genotype-count parameter, I was able to see 3 alleles kept instead of just 2. However, I have run into something that may be a separate issue or may be related to this one. I'm trying to detect a control insertion-deletion in our pooled dataset, where a T in the reference is deleted and followed by an insertion of CCAAGTCTGTA. I was hoping that outputting more alternate alleles would solve this. Instead, what I observe in the VCF is that the insertion is broken into 2 separate insertion calls (G>GGCCAA and T>TCTGTA). Do you have any suggestions for parameters I could adjust to fix this?

    Thanks for your time and help.

    Best Regards,
    Kathie

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @kjngo
    Hi Kathie,

    Can you post some IGV screenshots of the sites and the VCF records?

    Thanks,
    Sheila
