
Different HaplotypeCaller variant calls based on Java version?

Hi, we are running HaplotypeCaller on two different Linux servers and getting slightly different VCF calls (only with SNVs, not Indels) with identical BAM files despite using the same script and having the same version of GATK, 4.0.3.0. The only difference between the servers is the Java versions: openjdk 1.8.0_161_b14 vs. 1.8.0_222_b10. Could this be causing differences in the variant calls? It seems like previous versions of GATK did have non-deterministic components, but that's not the case in v 4.0.3.0. Thank you.

Answers

  • SkyWarrior (Turkey), Member ✭✭✭

    Are the two servers using the same processor architecture? Is GATK using the same PairHMM acceleration on both servers?

  • rsinghania, Member
    The two servers both have x86_64, but there are significant differences in the number of CPUs (2 vs 40), Threads per core (1 vs 2), and cores per socket (1 vs 10). Not sure how I can find the pair hmm acceleration used by GATK. Let me know what other info I should provide. Thanks.
  • SkyWarrior (Turkey), Member ✭✭✭

    You may want to check the HaplotypeCaller log.

  • rsinghania, Member
    This is what we see in the log from one server:

    20:01:36.550 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    20:01:36.551 INFO IntelPairHmm - Available threads: 2
    20:01:36.551 INFO IntelPairHmm - Requested threads: 4
    20:01:36.551 WARN IntelPairHmm - Using 2 available threads, but 4 were requested
    20:01:36.551 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation


    And from the other server:

    13:00:21.900 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    13:00:21.901 INFO IntelPairHmm - Available threads: 40
    13:00:21.901 INFO IntelPairHmm - Requested threads: 4
    13:00:21.901 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation

    Would this make a difference?
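
One way to compare the two logs side by side is to pull the PairHMM lines into a small summary. The sketch below is only an illustration in Python: it assumes the log lines look exactly like the excerpts posted above, and the function names are made up for this example.

```python
import re

# Matches lines such as:
#   "20:01:36.551 INFO IntelPairHmm - Available threads: 2"
LINE_RE = re.compile(
    r"^\s*(\d{2}:\d{2}:\d{2}\.\d{3})\s+(INFO|WARN)\s+(\w+)\s+-\s+(.*)$"
)

# Log excerpts copied from the two servers in this thread.
SERVER_A_LOG = """\
20:01:36.550 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
20:01:36.551 INFO IntelPairHmm - Available threads: 2
20:01:36.551 INFO IntelPairHmm - Requested threads: 4
20:01:36.551 WARN IntelPairHmm - Using 2 available threads, but 4 were requested
20:01:36.551 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
"""

SERVER_B_LOG = """\
13:00:21.900 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
13:00:21.901 INFO IntelPairHmm - Available threads: 40
13:00:21.901 INFO IntelPairHmm - Requested threads: 4
13:00:21.901 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
"""


def summarize_pairhmm(log_text):
    """Collect the PairHMM-related settings reported in a HaplotypeCaller log."""
    summary = {
        "available_threads": None,
        "requested_threads": None,
        "implementation": None,
        "warnings": [],
    }
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        _, level, component, message = match.groups()
        if component not in ("IntelPairHmm", "PairHMM"):
            continue
        if message.startswith("Available threads:"):
            summary["available_threads"] = int(message.split(":")[1])
        elif message.startswith("Requested threads:"):
            summary["requested_threads"] = int(message.split(":")[1])
        elif message.startswith("Using the "):
            summary["implementation"] = message[len("Using the "):]
        elif level == "WARN":
            summary["warnings"].append(message)
    return summary


def differing_settings(summary_a, summary_b):
    """Return only the settings whose values differ between the two summaries."""
    return {
        key: (summary_a[key], summary_b[key])
        for key in summary_a
        if summary_a[key] != summary_b[key]
    }


if __name__ == "__main__":
    a = summarize_pairhmm(SERVER_A_LOG)
    b = summarize_pairhmm(SERVER_B_LOG)
    for key, (value_a, value_b) in differing_settings(a, b).items():
        print(f"{key}: {value_a!r} vs {value_b!r}")
```

For the two excerpts above, both servers report the same PairHMM implementation and the same requested thread count. The summaries differ only in the available threads (2 vs 40) and in the extra warning on the first server that it fell back to 2 threads.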
  • SkyWarrior (Turkey), Member ✭✭✭

    It is better to try the Java implementation of the PairHMM algorithm on both systems to rule out that possibility. Can you try that?

  • rsinghania, Member
    I don't know if we can try that, as it would be a significant deviation from our pipeline, but could you please provide some info on how we would go about invoking the Java implementation? Also, as originally noted, the Java versions differ between the two servers, and this may create issues. On the new server, we are not able to install the earlier Java version used on the old server, because we cannot find the source for openjdk 1.8.0_161. Could you please point us to its installation file? Thanks again.
  • SkyWarrior (Turkey), Member ✭✭✭

    Before jumping to conclusions, it is better to delineate the problem and find its root cause. If the Java version is truly the issue, then you may have to move your whole pipeline to the Docker version of GATK, as that would be the only way to keep the Java version the same across servers.

    First of all, using a different number of threads in the native PairHMM implementation could be an issue. By using the Java implementation instead, you can rule out that possibility.

    If you are still getting differences between different servers then you may need to check your reference genomes, interval files, resource files etc.

    Here is the parameter you need to test

    --pair-hmm-implementation,-pairHMM:Implementation
                                  The PairHMM implementation to use for genotype likelihood calculations  Default value:
                                  FASTEST_AVAILABLE. Possible values: {EXACT, ORIGINAL, LOGLESS_CACHING,
                                  AVX_LOGLESS_CACHING, AVX_LOGLESS_CACHING_OMP, EXPERIMENTAL_FPGA_LOGLESS_CACHING,
                                  FASTEST_AVAILABLE}
    

    Try the LOGLESS_CACHING option for the Java implementation. You may also try EXACT or ORIGINAL to see whether it makes any difference between servers.

    I have been coding Java for years and have not seen differences between versions of the VM other than features being added or removed for certain objects or libraries. Calculations should not differ enough to cause any issues.
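
To make the PairHMM test suggested above concrete, here is a minimal Python sketch that assembles a HaplotypeCaller command line with an explicit `--pair-hmm-implementation` value. The file names are placeholders and do not come from this thread. `-R`, `-I` and `-O` are the usual GATK4 reference, input and output arguments, and the helper name is hypothetical. The sketch only builds and prints the command; it does not run GATK.

```python
import shlex

# Values listed in the --pair-hmm-implementation help text quoted in this thread.
PAIR_HMM_IMPLEMENTATIONS = {
    "EXACT",
    "ORIGINAL",
    "LOGLESS_CACHING",
    "AVX_LOGLESS_CACHING",
    "AVX_LOGLESS_CACHING_OMP",
    "EXPERIMENTAL_FPGA_LOGLESS_CACHING",
    "FASTEST_AVAILABLE",
}


def haplotypecaller_command(reference, bam, output_vcf, pair_hmm="LOGLESS_CACHING"):
    """Build a HaplotypeCaller argument list that pins the PairHMM implementation."""
    if pair_hmm not in PAIR_HMM_IMPLEMENTATIONS:
        raise ValueError(f"unknown PairHMM implementation: {pair_hmm}")
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", output_vcf,
        "--pair-hmm-implementation", pair_hmm,
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute the pipeline's real inputs.
    for implementation in ("LOGLESS_CACHING", "EXACT", "ORIGINAL"):
        command = haplotypecaller_command(
            "reference.fasta",
            "sample.bam",
            f"sample.{implementation}.vcf.gz",
            pair_hmm=implementation,
        )
        print(" ".join(shlex.quote(part) for part in command))
```

Running the same printed command on both servers, with everything else unchanged, keeps the comparison down to one variable at a time.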

  • rsinghania, Member
    Ok we can have a look at the different options for the Pair HMM implementation. The detailed feedback is much appreciated!
  • rsinghania, Member
    Is there a document that describes what these different options for --pair-hmm-implementation do? Also, can we know what the default value FASTEST_AVAILABLE is for a given analysis - it doesn't seem to be in the log files. Will the FASTEST_AVAILABLE be one of the other options, and can this value differ depending on the server, i.e., can a faster server have a different FASTEST_AVAILABLE than a slower server? Thanks.
  • rsinghania, Member
    Hi, just wanted to ask again if FASTEST_AVAILABLE can be different depending on the server, i.e., can a faster server have a different FASTEST_AVAILABLE than a slower server? The issue persists when using the same number of threads on both the servers; i.e., the VCF differences between the two servers remain. Trying EXACT does not remove these differences...
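
When differences like these persist, it can help to list exactly which records differ between the two VCFs before looking for a cause. The following is a rough Python sketch, not a GATK tool: it assumes uncompressed, tab-delimited VCF text, keys each record on CHROM, POS, REF and ALT, and uses a simplified SNV test based on allele length. All function names are illustrative.

```python
def load_calls(vcf_text):
    """Map (CHROM, POS, REF, ALT) to the remaining columns of each VCF record."""
    calls = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information lines, and the header
        fields = line.split("\t")
        chrom, pos, _record_id, ref, alt = fields[:5]
        calls[(chrom, int(pos), ref, alt)] = tuple(fields[5:])
    return calls


def compare_calls(calls_a, calls_b):
    """Split records into site-level differences and same-site value changes."""
    keys_a, keys_b = set(calls_a), set(calls_b)
    return {
        "only_in_a": sorted(keys_a - keys_b),
        "only_in_b": sorted(keys_b - keys_a),
        "changed": sorted(
            key for key in keys_a & keys_b if calls_a[key] != calls_b[key]
        ),
    }


def is_snv(key):
    """Simplified check: REF and every ALT allele are a single character."""
    _chrom, _pos, ref, alt = key
    return len(ref) == 1 and all(len(allele) == 1 for allele in alt.split(","))


if __name__ == "__main__":
    import sys

    # Usage: python compare_vcfs.py server_a.vcf server_b.vcf
    with open(sys.argv[1]) as handle_a, open(sys.argv[2]) as handle_b:
        report = compare_calls(load_calls(handle_a.read()), load_calls(handle_b.read()))
    for category, keys in report.items():
        snvs = sum(1 for key in keys if is_snv(key))
        print(f"{category}: {len(keys)} records ({snvs} SNVs)")
```

Records listed under "changed" share the same site and alleles but differ in QUAL, FILTER, INFO or the sample columns. That is a different situation from a call being present on one server and absent on the other.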
  • bhanuGandham (Cambridge MA), Member, Administrator, Broadie, Moderator

    Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal/erroneous results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, see this [announcement](https://software.broadinstitute.org/gatk/blog?id=24419 "announcement") and check out our [support policy](https://gatkforums.broadinstitute.org/gatk/discussion/24417/what-types-of-questions-will-the-gatk-frontline-team-answer/p1?new=1 "support policy").
