GATK 3.8 vs GATK4 vs GATK4 Spark: the newer, the slower!

I used GATK 3.8, GATK 4.0.0, and the GATK4 Spark tools to test my data, and got a surprising result: GATK4 is slower than GATK 3.8, and the Spark version is slower than both. The run times were 17.3 vs 19.2 vs 24 minutes. The commands are basic, as follows:

GATK 3.8

java -jar /GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R cr.fa -I 10_dedup_reads.bam -o testgatk3.raw.variants.vcf

GATK 4.0.0

/gatk-4.0.0.0/gatk HaplotypeCaller -R cr.fa -I 10_dedup_reads.bam -O 10.g.vcf.gz

GATK4 Spark

/gatk-4.0.0.0/gatk HaplotypeCallerSpark -R cr.2bit -I 10_dedup_reads.bam -O 10.g.vcf.gz
And I am sure that the I/O, the CPUs, and the memory did not reach their limits, so did I do something wrong? Thanks a lot for reading and replying to my question!

Issue · GitHub
by Sheila

Issue Number: 2880
State: closed
Closed By: vdauwera

Answers

  • SkyWarrior (Turkey; Member)

    The default native core request for the pairHMM library is 4 for GATK 4.0 but 1 for 3.8. Can you check the speed by changing the native core request for GATK4 to 1 and trying again? At best the difference will be marginal, but also take heed that some of the optimizations made for 3.8 are no longer there in GATK4, for a good reason I believe. Still, I am holding onto my legacy scripts with 3.8 just to be sure that what I do is consistent.
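
    For reference, a minimal sketch of that comparison: the `--native-pair-hmm-threads` argument does exist in GATK4's HaplotypeCaller, and pinning it to 1 mirrors the 3.8 default mentioned above. Paths, reference, and input/output file names are taken from the original post; adjust them to your own setup.

    ```shell
    # Sketch: rerun GATK4 HaplotypeCaller with the pairHMM native thread
    # count set to 1, matching GATK 3.8's default, for a fairer comparison.
    /gatk-4.0.0.0/gatk HaplotypeCaller \
        -R cr.fa \
        -I 10_dedup_reads.bam \
        -O 10.hmm1.vcf.gz \
        --native-pair-hmm-threads 1
    ```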

  • @SkyWarrior said:
    The default native core request for the pairHMM library is 4 for GATK 4.0 but 1 for 3.8. Can you check the speed by changing the native core request for GATK4 to 1 and trying again? At best the difference will be marginal, but also take heed that some of the optimizations made for 3.8 are no longer there in GATK4, for a good reason I believe. Still, I am holding onto my legacy scripts with 3.8 just to be sure that what I do is consistent.

    Thank you for your reply. I did some tests by adding the pairHMM attribute, but the difference is marginal, as you said. I think that since GATK4 has been released, it should be correct, stable, and faster, as advertised.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @sacuba
    Hi,

    I will have someone else from the team get back to you.

    -Sheila

  • Hi Sheila,

    Could you please point me to this team member? I am interested in running GATK 4.0 with the Spark tools, and like @sacuba, I have noticed that the Spark versions are slower.

    Thanks

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @SergioBenvenuti
    Hi Giuseppe,

    Geraldine @Geraldine_VdAuwera should get back to you all soon.

    -Sheila

  • Hi Sheila,

    thanks a lot! Looking forward to it!

    Giuseppe

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @sacuba and @SergioBenvenuti , sorry for the delayed response.

    For the *Spark tools, be aware that you need to specify Spark arguments in order for the Spark parallelism to kick in; see this doc for more info.

    For the rest, between 3.8 and 4.0, we're getting scattered reports from people who are not seeing much difference between them, with quite a bit of variation depending on the hardware they use. It seems there's some variability from run to run as well -- did you average over multiple runs, or do single runs only?
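
    One way to average over runs, as a rough sketch: the GATK command below is the one from the original post, and the simple `date`-based wall-clock measurement is an assumption about how one might time it, not an official benchmarking method.

    ```shell
    # Sketch: average wall-clock time over several identical runs,
    # since single runs can vary considerably from one to the next.
    runs=3
    total=0
    for i in $(seq 1 "$runs"); do
        start=$(date +%s)
        /gatk-4.0.0.0/gatk HaplotypeCaller -R cr.fa -I 10_dedup_reads.bam -O run${i}.vcf.gz
        end=$(date +%s)
        total=$(( total + end - start ))
    done
    echo "mean wall-clock seconds: $(( total / runs ))"
    ```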

  • Hi @Geraldine_VdAuwera,

    thank you for the answer.
    You are right; in fact, when I run a GATK Spark tool I pay attention to specifying the Spark arguments, using e.g. a command line similar to the following:

    ./gatk BaseRecalibratorSpark \
            --input  inputfile.bam \
            --reference referencefile.2bit \
            --known-sites file.vcf \
            --intervals intervalfile.bed \
            --output recalibration_spark.table \
            -- --spark-runner SPARK --spark-master spark://${MASTER} \
            --driver-memory 80g \
            --num-executors 16 \
            --executor-memory 20g
    

    Please see also my post here: GATK 4.0.0.0 [BaseRecalibratorSpark low performance].

    Finally, my performance results are based on a series of runs made with the aforementioned Spark tool, all of them consistently giving long computation times.

    Kind regards,
    Giuseppe

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Ah, got it. It looks like some of what you're observing may be a known bug that is being fixed as we speak. In general, the Spark tools are still extremely new, so you can expect some instability there. As always, feedback like yours is very important, so please do continue to let us know how they behave in your hands. Thanks!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    I should add -- we have some important updates coming soon to a subset of the Spark tools, and as part of that we're going to make some benchmarks available to help set expectations more clearly, since that has been a sticking point of late.

  • Dear Geraldine,

    Thank you for the answer, and my apologies for the very late reply.
    Looking forward to seeing your benchmarks!

    Best,
    Giuseppe
