
gatk3.8 vs gatk4 vs gatk4spark: the newer, the slower!

I used GATK 3.8, GATK 4.0.0, and the GATK4 Spark tools to test my data, and I got a surprising result: GATK4 is slower than GATK 3.8, and the Spark version is slower than both. The run times were 17.3 vs 19.2 vs 24 minutes. The commands are basic, as follows:

gatk3.8

java -jar /GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R cr.fa -I 10_dedup_reads.bam -o testgatk3.raw.variants.vcf

gatk4.0.0

/gatk-4.0.0.0/gatk HaplotypeCaller -R cr.fa -I 10_dedup_reads.bam -O 10.g.vcf.gz

gatkspark

/gatk-4.0.0.0/gatk HaplotypeCallerSpark -R cr.2bit -I 10_dedup_reads.bam -O 10.g.vcf.gz

I am sure that the I/O, the CPUs, and the memory did not reach their limits, so did I do something wrong? Thanks a lot for reading and replying to my question!

Issue · GitHub
#2880 (open), filed by Sheila

Answers

  • SkyWarrior (Turkey, Member)

    The default native thread count for the PairHMM library is 4 in GATK 4.0 but 1 in 3.8. Can you check the speed by setting GATK4's native thread count to 1 and trying again? At best the difference will be marginal, but also take heed that some of the optimizations made for 3.8 are no longer there in GATK4, for good reason I believe. Still, I am holding onto my legacy 3.8 scripts just to be sure that what I do stays consistent.
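
    As a hedged sketch of the suggested check: in GATK4 the PairHMM thread count is controlled by the `--native-pair-hmm-threads` argument (default 4); if the flag name differs in your build, `gatk HaplotypeCaller --help` lists it. The output filename here is illustrative.

```shell
# Re-run the GATK4 HaplotypeCaller with the native PairHMM restricted
# to a single thread, matching GATK 3.8's default behaviour.
/gatk-4.0.0.0/gatk HaplotypeCaller \
    -R cr.fa \
    -I 10_dedup_reads.bam \
    -O 10.gatk4.vcf.gz \
    --native-pair-hmm-threads 1
```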

  • @SkyWarrior said:
    The default native thread count for the PairHMM library is 4 in GATK 4.0 but 1 in 3.8. Can you check the speed by setting GATK4's native thread count to 1 and trying again? At best the difference will be marginal, but also take heed that some of the optimizations made for 3.8 are no longer there in GATK4, for good reason I believe. Still, I am holding onto my legacy 3.8 scripts just to be sure that what I do stays consistent.

    Thank you for your reply. I did some tests after adding the PairHMM attribute, but the difference is marginal, as you said. I would think that since GATK4 has been released, it should be correct, stable, and faster, as advertised.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @sacuba
    Hi,

    I will have someone else from the team get back to you.

    -Sheila

  • Hi Sheila,

    Could you please point me to this team member? I am interested in running GATK 4.0 with the Spark tools and, like @sacuba, I have noticed that the Spark version is slower.

    Thanks

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @SergioBenvenuti
    Hi Giuseppe,

    Geraldine @Geraldine_VdAuwera should get back to you all soon.

    -Sheila

  • Hi Sheila,

    thanks a lot! Looking forward to it!

    Giuseppe

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @sacuba and @SergioBenvenuti , sorry for the delayed response.

    For the *Spark tools, be aware that you need to specify spark arguments in order to get the spark parallelism to kick in; see this doc for more info.
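
    A minimal sketch of what that looks like (the `local[8]` master and the 8-thread count are illustrative assumptions; arguments after the bare `--` separator are passed through to Spark rather than to the tool):

```shell
# Local-mode Spark run: without a --spark-master setting, the expected
# Spark parallelism may not kick in.
/gatk-4.0.0.0/gatk HaplotypeCallerSpark \
    -R cr.2bit \
    -I 10_dedup_reads.bam \
    -O 10.g.vcf.gz \
    -- --spark-master 'local[8]'
```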

    For the rest, between 3.8 and 4.0, we're getting scattered reports from people who are not seeing much difference between them, with quite a bit of variation depending on the hardware they use. It seems there's some variability from run to run as well -- did you average over multiple runs or do single runs only?
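
    For anyone timing this, a small sketch for averaging wall-clock time over several runs; the `sleep 1` line is a placeholder standing in for the actual GATK invocation being benchmarked:

```shell
# Average wall-clock seconds over N runs of a command.
# Replace 'sleep 1' with the real GATK command to benchmark.
N=3
total=0
for i in $(seq 1 $N); do
    start=$(date +%s)
    sleep 1   # placeholder for the real GATK command
    end=$(date +%s)
    total=$(( total + end - start ))
done
echo "mean seconds over $N runs: $(( total / N ))"
```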

  • Hi @Geraldine_VdAuwera,

    thank you for the answer.
    You are right; in fact, when I run a GATK Spark tool I pay attention to specifying the Spark arguments, using a command line similar to the following:

    ./gatk BaseRecalibratorSpark \
            --input  inputfile.bam \
            --reference referencefile.2bit \
            --known-sites file.vcf \
            --intervals intervalfile.bed \
            --output recalibration_spark.table \
            -- --spark-runner SPARK --spark-master spark://${MASTER} \
            --driver-memory 80g \
            --num-executors 16 \
            --executor-memory 20g
    

    Please also see my post here: GATK 4.0.0.0 [BaseRecalibratorSpark low performance].

    Finally, my performance results are based on a series of runs with the aforementioned Spark tool, all of them consistently giving long computation times.

    Kind regards,
    Giuseppe

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Ah, got it. It looks like some of what you're observing may be a known bug that is being fixed as we speak. In general, the Spark tools are still very new, so you can expect some instability there. As always, feedback like yours is very important, so please do continue to let us know how the tools behave in your hands. Thanks!
