We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Safe to use HaplotypeCallerSpark?

RafaelMarcondesRafaelMarcondes Louisiana State UniversityMember
Hi all,

I'm wondering how bad it is to use use HaplotypeCallerSpark in GATK 4.0.2.1/JDK 1.8? I realize it's in beta, but I'm wondering if that means "your results will be useless" or just "use with caution".

The reason I'm asking that is because it seems like Spark is the only way to multi-thread in GATK 4, and just 1 of my bams took 64 hrs to run on single node, and I have 110 bams, so parallelizing is a must.

Btw, when I tried to run HaplotypeCallerSpark in parallel with 48 nodes, my job crashed after running for two days. I thought since with 1 node it took 64 hrs, using 48 nodes would mean it would finish in wayyy less than 2 days.

Here's what I have:

gatk --java-options "-Xmx32g -XX:ParallelGCThreads=1" HaplotypeCallerSpark --spark-master local[48] -R myref.2bit -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi ,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply, however, we ask other community members to help out if you know the answer.

    For context, see this [announcement](https://software.broadinstitute.org/gatk/blog?id=24419 “announcement”) and check out our [support policy](https://gatkforums.broadinstitute.org/gatk/discussion/24417/what-types-of-questions-will-the-gatk-frontline-team-answer/p1?new=1 “support policy”).

  • wbsimeywbsimey California Academy of SciencesMember ✭✭
    edited December 2019

    @RafaelMarcondes ,
    I have not been able to get HaplotypeCallerSpark to work, but I have a simple bash script to parallelize many HaplotypeCaller jobs. Note that the server I am running on has 192 threads and 1TB RAM, so you will have to adjust memory and the number of simultaneous jobs in the script to match your server. Initially I assigned too much memory to each job (--java-options "-Xmx200g") and ran out of memory and went into swap, which slowed things to a crawl. You will also have to edit the directory names and filenames to match yours.

    Also, when you say "nodes" do you mean threads? My definition of "node" is one server of many in a cluster. My servers are single node (one motherboard with four Xeon chips).

    I tried to paste the code here using the "code" markdown, but it ends up a total mess. I tried many things but nothing works. I will leave the mess below, but if you want the script, I will send it to you.

    `

    #!/bin/bash
    

    function set_bamprefix {
    bamprefix=$(basename $1 | sed "s/_sorted.*//")
    }
    function set_outgVCFs {
    outgVCFs="${outdir}/${bamprefix}_sorted_dedup_rg.gVCF"
    }

    function HapCall {
    bamfile=$1
    Ref="../reference/Tse_SBAPGDGG_D3B.fa"
    set_bamprefix $bamfile
    set_outgVCFs
    #scaff="scaffold_26"

    HaplotypeCaller="gatk --java-options -Xmx${java_mem} HaplotypeCaller -R $Ref -I $bamfile -ERC GVCF -O $outgVCFs"
    
    echo "$HaplotypeCaller" >/dev/stderr
    time $HaplotypeCaller
    

    }

    # vars to set
    

    outdir="../gVCFs"
    bamdir="../elprep_bams"

    java_mem="50g"
    max_executing=20

    # see how many are currently executing and how many we can start up
    

    process_str="gatk --java-options.*HaplotypeCaller"
    num_executing=$(ps -ef | grep -v grep | grep "$process_str" -c)
    let "available = $max_executing - $num_executing"
    if [ "$available" -gt "0" ]; then
    echo Can start up to $available jobs >/dev/stderr
    else
    echo Already have $num_executing jobs executing, no slots available >/dev/stderr
    exit 1
    fi

    # loop through all files ready for processing and start up as many jobs as we can for those not yet done
    

    bam_files=$(ls $bamdir/*_sorted_dedup_rg.bam)
    for f in $bam_files
    do
    set_bamprefix $f
    set_outgVCFs
    echo bamprefix $bamprefix $outgVCFs
    if [ -f "$outgVCFs" ]; then
    echo "$bamprefix finished or in process" >/dev/stderr
    elif [ "$available" -gt "0" ]; then
    HapCall "$f"
    let "available = available-1"
    else
    echo No more job slots available >/dev/stderr
    exit
    fi
    done

    # if we get here there were no files that didnt have a corresponding gVCF file
    

    echo All files finished or in process now
    `

  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    I should add that my observations of HaplotypeCaller, via top and htop, suggest that it uses up to four threads maximum (Broad folks, please correct me if this is not true). If your server has 48 threads (24 cores) then you could run 12 simultaneous HaplotypeCaller jobs as long as you have enough RAM. If you ran 12 simultaneous jobs at 40g of memory each you should have at least 480g RAM, but safer to have 512g or more. If you max out your memory usage, your system will start using swap space, which will dramatically slow down your analyses.

Sign In or Register to comment.