Multithreading in GATK 4 is done with Spark now?

oskarv, Bergen, Member

In GATK4 I noticed I can't use -nt or -nct with tools that supported them in GATK 3.x. I understand these options were removed because of the complexity they introduced to the code, per this discussion: https://github.com/broadinstitute/gatk/issues/2345
So the current solution is to use either a temporary local Spark context via "--sparkMaster 'local[N]'", or a permanent local or remote Spark server. I tried running HaplotypeCallerSpark locally and it said it needed a .2bit reference file:

A USER ERROR has occurred: Bad input: Running this tool with BROADCAST strategy requires a 2bit reference. To create a 2bit reference from an existing fasta file, download faToTwoBit from the link on https://genome.ucsc.edu/goldenPath/help/twoBit.html, then run faToTwoBit in.fasta out.2bit

So I created one with faToTwoBit, but it still didn't work.
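For anyone following along, the conversion itself is just what the error message suggests (this assumes faToTwoBit, and optionally twoBitInfo as a sanity check, from the UCSC tools are on your PATH; the filenames match my reference):

```shell
# Convert the fasta reference to UCSC 2bit format, per the error message's instructions
faToTwoBit human_g1k_v37_decoy.fasta human_g1k_v37_decoy.2bit

# Optional sanity check: list the sequence names and lengths stored in the 2bit file
twoBitInfo human_g1k_v37_decoy.2bit stdout | head
```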

Here's the command I used:

gatk-launch HaplotypeCallerSpark -O output.vcf -R human_g1k_v37_decoy.2bit --input input.bam 

And the error message:

Exception in thread "main" java.lang.AssertionError: assertion failed: Version must be zero
        at scala.Predef$.assert(Predef.scala:170)
        at org.bdgenomics.adam.util.TwoBitFile.readHeader(TwoBitFile.scala:85)
        at org.bdgenomics.adam.util.TwoBitFile.<init>(TwoBitFile.scala:62)
        at org.broadinstitute.hellbender.engine.spark.datasources.ReferenceTwoBitSource.<init>(ReferenceTwoBitSource.java:43)
        at org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource.<init>(ReferenceMultiSource.java:41)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReference(GATKSparkTool.java:393)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:360)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:351)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:173)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:233)
17/08/02 15:27:14 INFO ShutdownHookManager: Shutdown hook called
17/08/02 15:27:14 INFO ShutdownHookManager: Deleting directory /tmp/travis/spark-f911bb61-2fb0-48d1-8c6a-49ff149f14e3

Is this a bug? Why do I need a .2bit reference file? And can I turn off the broadcast strategy to skip the .2bit requirement?

Answers

  • shlee, Cambridge, Member, Broadie ✭✭✭✭✭

    Hi @oskarv,

    GATK4 HaplotypeCaller is still in BETA as it is awaiting tie-outs. You should use GATK3 HaplotypeCaller for now.

  • oskarv, Bergen, Member
    edited August 2017

    I would say it's ok to use the regular (non-Spark) version of GATK4 HaplotypeCaller for most research purposes (i.e. not clinical diagnostics). The issue here arises from the immaturity of the Spark version.

    Alright, I'm just testing it right now. But local multithreading won't work with plain HaplotypeCaller, only with the *Spark tools, right?
    And what about the .2bit reference requirement: will it stay, or is it some kind of bug?

    Issue · Github: #2405, filed by Sheila, state closed, closed by chandrans
  • Sheila, Broad Institute, Member, Broadie, Moderator, admin

    @oskarv
    Hi,

    Have a look at this thread. I will also ask Geraldine to jump in here.

    -Sheila

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie, admin

    @oskarv At the moment there's some inconsistency in how the tools are named -- for tools that were ported from GATK3, there is a separate version that supports Spark, denoted by the *Spark suffix. The version of the tool without this suffix is not capable of multithreading. Meanwhile, tools that are new in GATK4 exist in a single form with dual capabilities (Spark and non-Spark), iirc, and as a result do not have the Spark suffix.

    I believe the 2bit requirement is only for running on proper Spark clusters, but I could be wrong; I will check with the team.

  • oskarv, Bergen, Member

    @Geraldine_VdAuwera
    I think I wasn't clear enough in my question earlier: I was trying to figure out whether I could run *Spark tools straight on my laptop with multithreading enabled. I tried HaplotypeCallerSpark for no specific reason, but I made it work with ApplyBQSRSpark.

    If anyone else reads this and is curious about the syntax to run Spark tools with e.g. 4 threads, it is:

    java -jar gatk-local.jar ToolNameSpark --sparkMaster local[4] -etc ...
    

    I tried using gatk-spark.jar, but it doesn't work for local execution, and HaplotypeCallerSpark still wants a fasta in .2bit format with a freshly compiled version of GATK4 from GitHub. However, since the error says "Running this tool with BROADCAST strategy requires a 2bit reference.", it sounds like not using the BROADCAST strategy would allow a regular fasta file. I don't know whether that's an actual option or just a cryptic error message, though.
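    The full tool arguments printed later in this thread show a --joinStrategy argument defaulting to BROADCAST, so there may be a way to opt out. The value SHUFFLE below is my guess and I haven't verified it against this beta; check the tool's --help output before trying it:

```shell
# Hypothetical: switch off the BROADCAST join strategy so a plain fasta might work.
# --joinStrategy SHUFFLE is an unverified guess; inspect --help for the accepted values.
java -jar gatk-local.jar HaplotypeCallerSpark \
    -R human_g1k_v37_decoy.fasta \
    --input input.bam \
    -O output.vcf \
    --joinStrategy SHUFFLE \
    --sparkMaster 'local[4]'
```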

    I like the direction of GATK 4 development, keep it up!

  • SkyWarrior, Turkey, Member ✭✭✭

    There's supposed to be a launch script for this. I remember the GATK team recommends using the script instead of calling the jar file directly. I have not tried GATK 4 yet, but I am very curious about the speed gains especially.

  • oskarv, Bergen, Member

    @SkyWarrior said:
    There's supposed to be a launch script for this. I remember the GATK team recommends using the script instead of calling the jar file directly. I have not tried GATK 4 yet, but I am very curious about the speed gains especially.

    Using the launch script works too; I just like to have a bit more control to understand the boundaries. The launch script applies some automatic settings that I think are useful, so it's probably what I'll use in the end anyway.
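    For completeness, the launch-script equivalent of the direct jar invocation above looks like this; the --sparkMaster syntax is the same one mentioned at the top of the thread, while the input/output arguments are placeholders for whatever your run needs:

```shell
# Same local Spark run via the gatk-launch wrapper script instead of the jar.
# Tool arguments are placeholders; --sparkMaster 'local[4]' requests 4 local threads.
gatk-launch ApplyBQSRSpark \
    -I input.bam \
    --bqsr_recal_file recal.table \
    -O output.bam \
    --sparkMaster 'local[4]'
```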

    But regarding your question on the speed gains: I haven't run ApplyBQSR with just one thread, so I don't know what the raw comparison would be. But I ran it with 8 threads for 10 minutes on a 40GB bam file from MarkDuplicates, and if I understand the output correctly, it had finished 380 tasks out of 1306, so roughly 30%. Let's say it would finish after 35 minutes; it was the first of eight stages, so 35*8/60 is roughly 4 hours and 40 minutes. But I think PrintReads in GATK 3.8 is faster on Intel hardware, since it uses Intel's Genomics Kernel Library, which speeds things up on Intel processors with AVX support. I have yet to test that.
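    To make the back-of-the-envelope estimate above explicit (the task counts and stage count are from my run; the rest is arithmetic):

```python
tasks_done, tasks_total = 380, 1306      # observed after 10 minutes of ApplyBQSRSpark with 8 threads
minutes_elapsed = 10
stages = 8                               # this was the first of eight stages

fraction_done = tasks_done / tasks_total             # ~0.29, i.e. roughly 30%
minutes_per_stage = minutes_elapsed / fraction_done  # ~34 minutes, call it 35
total_hours = minutes_per_stage * stages / 60        # ~4.6 hours overall

print(f"{fraction_done:.0%} done, ~{minutes_per_stage:.0f} min/stage, ~{total_hours:.1f} h total")
```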

  • oligoelemento, Poland, Member
    edited October 2017

    I get the same error, "A USER ERROR has occurred: Bad input: Running this tool with BROADCAST strategy requires a 2bit reference.", when running GATK 4 Beta 'BaseRecalibratorSpark'. I had previously generated the 2bit reference file with 'faToTwoBit'.

    It must be some error in the code that checks the reference file when using Spark; the non-Spark version of the command, 'BaseRecalibrator', works smoothly (but slowly) with the same params.

    Here are the details with '-DGATK_STACKTRACE_ON_USER_EXCEPTION':

    $ java -jar /opt/gatk4-beta/gatk-launch --javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' BaseRecalibratorSpark -knownSites /db/dbsnp/All_20170710.vcf.gz -R /db/ensembl/GRCh38.fa -I SRR2601758.grouped.bam -O ALN.bqsr
    # Using GATK jar /opt/gatk4-beta/gatk-package-4.beta.5-local.jar
    # Running:
    #     java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /opt/gatk4-beta/gatk-package-4.beta.5-local.jar BaseRecalibratorSpark -knownSites /home/alvaro/db/dbsnp/All_20170710.vcf.gz -R /home/alvaro/db/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa -I wes_normal/SRR2601758.grouped.bam -O /tmp/ALN.bqsr
    # 14:02:11.514 WARN  SparkContextFactory - Environment variables HELLBENDER_TEST_PROJECT and HELLBENDER_JSON_SERVICE_ACCOUNT_KEY must be set or the GCS hadoop connector will not be configured properly
    # 14:02:11.779 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/gatk4-beta/gatk-package-4.beta.5-local.jar!/com/intel/gkl/native/libgkl_compression.so
    # [October 31, 2017 2:02:11 PM CET] BaseRecalibratorSpark  --knownSites /home/alvaro/db/dbsnp/All_20170710.vcf.gz --output /tmp/ALN.bqsr --reference /home/alvaro/db/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa --input wes_normal/SRR2601758.grouped.bam  --joinStrategy BROADCAST --mismatches_context_size 2 --indels_context_size 3 --maximum_cycle_value 500 --mismatches_default_quality -1 --insertions_default_quality 45 --deletions_default_quality 45 --low_quality_tail 2 --quantizing_levels 16 --bqsrBAQGapOpenPenalty 40.0 --preserve_qscores_less_than 6 --enableBAQ false --computeIndelBQSRTables false --useOriginalQualities false --defaultBaseQualities -1 --readShardSize 10000 --readShardPadding 1000 --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --bamPartitionSize 0 --disableSequenceDictionaryValidation false --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
    # [October 31, 2017 2:02:11 PM CET] Executing as [email protected] on Linux 4.4.0-83-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11; Version: 4.beta.5
    # 14:02:11.974 INFO  BaseRecalibratorSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    # 14:02:11.974 INFO  BaseRecalibratorSpark - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    # 14:02:11.974 INFO  BaseRecalibratorSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    # 14:02:11.974 INFO  BaseRecalibratorSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    # 14:02:11.974 INFO  BaseRecalibratorSpark - Deflater: IntelDeflater
    # 14:02:11.974 INFO  BaseRecalibratorSpark - Inflater: IntelInflater
    # 14:02:11.974 INFO  BaseRecalibratorSpark - GCS max retries/reopens: 20
    # 14:02:11.975 INFO  BaseRecalibratorSpark - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    # 14:02:11.975 INFO  BaseRecalibratorSpark - Initializing engine
    # 14:02:11.975 INFO  BaseRecalibratorSpark - Done initializing engine
    # Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    # 17/10/31 14:02:12 INFO SparkContext: Running Spark version 2.0.2
    # 17/10/31 14:02:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    # 17/10/31 14:02:12 WARN Utils: Your hostname, moncayo resolves to a loopback address: 127.0.1.1; using 150.254.123.188 instead (on interface eth4)
    # 17/10/31 14:02:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    # 17/10/31 14:02:12 INFO SecurityManager: Changing view acls to: alvaro
    # 17/10/31 14:02:12 INFO SecurityManager: Changing modify acls to: alvaro
    # 17/10/31 14:02:12 INFO SecurityManager: Changing view acls groups to: 
    # 17/10/31 14:02:12 INFO SecurityManager: Changing modify acls groups to: 
    # 17/10/31 14:02:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(alvaro); groups with view permissions: Set(); users  with modify permissions: Set(alvaro); groups with modify permissions: Set()
    # 17/10/31 14:02:12 INFO Utils: Successfully started service 'sparkDriver' on port 39843.
    # 17/10/31 14:02:12 INFO SparkEnv: Registering MapOutputTracker
    # 17/10/31 14:02:12 INFO SparkEnv: Registering BlockManagerMaster
    # 17/10/31 14:02:12 INFO DiskBlockManager: Created local directory at /tmp/alvaro/blockmgr-8135c2e8-1b92-4c2a-b143-c50f57b15646
    # 17/10/31 14:02:12 INFO MemoryStore: MemoryStore started with capacity 15.8 GB
    # 17/10/31 14:02:12 INFO SparkEnv: Registering OutputCommitCoordinator
    # 17/10/31 14:02:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    # 17/10/31 14:02:13 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://150.254.123.188:4040
    # 17/10/31 14:02:13 INFO Executor: Starting executor ID driver on host localhost
    # 17/10/31 14:02:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35006.
    # 17/10/31 14:02:13 INFO NettyBlockTransferService: Server created on 150.254.123.188:35006
    # 17/10/31 14:02:13 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 150.254.123.188, 35006)
    # 17/10/31 14:02:13 INFO BlockManagerMasterEndpoint: Registering block manager 150.254.123.188:35006 with 15.8 GB RAM, BlockManagerId(driver, 150.254.123.188, 35006)
    # 17/10/31 14:02:13 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 150.254.123.188, 35006)
    # 17/10/31 14:02:13 INFO GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
    # 17/10/31 14:02:14 INFO SparkUI: Stopped Spark web UI at http://150.254.123.188:4040
    # 17/10/31 14:02:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    # 17/10/31 14:02:14 INFO MemoryStore: MemoryStore cleared
    # 17/10/31 14:02:14 INFO BlockManager: BlockManager stopped
    # 17/10/31 14:02:14 INFO BlockManagerMaster: BlockManagerMaster stopped
    # 17/10/31 14:02:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    # 17/10/31 14:02:14 INFO SparkContext: Successfully stopped SparkContext
    # 14:02:14.229 INFO  BaseRecalibratorSpark - Shutting down engine
    # [October 31, 2017 2:02:14 PM CET] org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark done. Elapsed time: 0.04 minutes.
    # Runtime.totalMemory()=1898446848
    # ***********************************************************************
    # 
    # A USER ERROR has occurred: Bad input: Running this tool with BROADCAST strategy requires a 2bit reference. To create a 2bit reference from an existing fasta file, download faToTwoBit from the link on https://genome.ucsc.edu/goldenPath/help/twoBit.html, then run faToTwoBit in.fasta out.2bit
    # 
    # ***********************************************************************
    # org.broadinstitute.hellbender.exceptions.UserException$Require2BitReferenceForBroadcast: Bad input: Running this tool with BROADCAST strategy requires a 2bit reference. To create a 2bit reference from an existing fasta file, download faToTwoBit from the link on https://genome.ucsc.edu/goldenPath/help/twoBit.html, then run faToTwoBit in.fasta out.2bit
    #         at org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark.runTool(BaseRecalibratorSpark.java:87)
    #         at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
    #         at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
    #         at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    #         at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    #         at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    #         at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    #         at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    #         at org.broadinstitute.hellbender.Main.main(Main.java:233)
    # 17/10/31 14:02:14 INFO ShutdownHookManager: Shutdown hook called
    # 17/10/31 14:02:14 INFO ShutdownHookManager: Deleting directory /tmp/alvaro/spark-18856b1a-0cbf-4c9e-ab8e-766f3f9985a1
    

    Issue · Github: #2686, filed by Sheila, state closed, closed by vdauwera
  • Sheila, Broad Institute, Member, Broadie, Moderator, admin
    edited November 2017

    @oligoelemento
    Hi,

    Can you confirm that this occurs with the latest beta?

    I will check with the team on what else could be happening.

    Thanks,
    Sheila

  • oligoelemento, Poland, Member

    It happened with version 4.beta.5; is it perhaps corrected in beta 6?

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin

    @oligoelemento
    Hi,

    It may be fixed in the latest beta 6 :smile: The developers are making many changes, and one of them may have fixed this issue. I am not 100% sure, though, so before we proceed it would help if you could check.

    Thanks,
    Sheila

  • oligoelemento, Poland, Member

    I can confirm that Beta 6 outputs the same error :(

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin

    @oligoelemento
    Hi,

    Okay. I will get back to you when I hear from the developers.

    -Sheila

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin

    @oligoelemento
    Hi,

    I heard back from the developers, and here is what they say:

    Keep in mind the tools are still in beta, so some kinks are being worked out. Right now the Spark version requires a 2bit reference. We are planning to change that in the future.

    What do you mean by "the regular BaseRecalibrator runs "slowly""? It should be dramatically faster than the GATK3 version. If you're seeing performance problems there, we'd like to know about it. Can you post some runtimes for GATK3 vs GATK4?

    Thanks,
    Sheila
