Error generated when running GATK with WDL/JSON

Part of the error is as follows:


java.lang.IllegalArgumentException: gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta exists on a filesystem not supported by this instance of Cromwell

The entire error message is as follows:


[2018-03-23 16:07:18,48] [error] WorkflowManagerActor Workflow 1315e829-1d7d-4e47-a092-4b1129f4ceec failed (during ExecutingWorkflowState): Evaluating size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB") failed: java.lang.IllegalArgumentException: gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta exists on a filesystem not supported by this instance of Cromwell. Supported filesystems are: MacOSXFileSystem. Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
java.lang.RuntimeException: Evaluating size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB") failed: java.lang.IllegalArgumentException: gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta exists on a filesystem not supported by this instance of Cromwell. Supported filesystems are: MacOSXFileSystem. Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
at cromwell.engine.workflow.lifecycle.execution.keys.ExpressionKey.processRunnable(ExpressionKey.scala:26)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.$anonfun$startRunnableNodes$4(WorkflowExecutionActor.scala:438)
at cats.instances.ListInstances$$anon$1.$anonfun$traverse$2(list.scala:65)
at cats.instances.ListInstances$$anon$1.loop$2(list.scala:58)
at cats.instances.ListInstances$$anon$1.$anonfun$foldRight$1(list.scala:58)
at cats.Eval$Compute.loop$1(Eval.scala:313)
at cats.Eval$Compute.value(Eval.scala:324)
at cats.Eval$Call.value(Eval.scala:257)
at cats.instances.ListInstances$$anon$1.traverse(list.scala:64)
at cats.instances.ListInstances$$anon$1.traverse(list.scala:12)
at cats.Traverse$Ops.traverse(Traverse.scala:19)
at cats.Traverse$Ops.traverse$(Traverse.scala:19)
at cats.Traverse$ToTraverseOps$$anon$3.traverse(Traverse.scala:19)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.cromwell$engine$workflow$lifecycle$execution$WorkflowExecutionActor$$startRunnableNodes(WorkflowExecutionActor.scala:432)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:152)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:150)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
at akka.actor.FSM.processEvent(FSM.scala:668)
at akka.actor.FSM.processEvent$(FSM.scala:662)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowExecutionActor.scala:43)
at akka.actor.LoggingFSM.processEvent(FSM.scala:801)
at akka.actor.LoggingFSM.processEvent$(FSM.scala:783)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.processEvent(WorkflowExecutionActor.scala:43)
at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:659)
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:653)
at akka.actor.Actor.aroundReceive(Actor.scala:514)
at akka.actor.Actor.aroundReceive$(Actor.scala:512)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$Timers$$super$aroundReceive(WorkflowExecutionActor.scala:43)
at akka.actor.Timers.aroundReceive(Timers.scala:40)
at akka.actor.Timers.aroundReceive$(Timers.scala:36)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.aroundReceive(WorkflowExecutionActor.scala:43)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
at akka.actor.ActorCell.invoke(ActorCell.scala:496)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Evaluating size(dbSNP_vcf, "GB") failed: java.lang.IllegalArgumentException: gs://broad-references/hg19/v0/Homo_sapiens_assembly19.dbsnp119.vcf exists on a filesystem not supported by this instance of Cromwell. Supported filesystems are: MacOSXFileSystem. Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
java.lang.RuntimeException: Evaluating size(dbSNP_vcf, "GB") failed: java.lang.IllegalArgumentException: gs://broad-references/hg19/v0/Homo_sapiens_assembly19.dbsnp119.vcf exists on a filesystem not supported by this instance of Cromwell. Supported filesystems are: MacOSXFileSystem. Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
at cromwell.engine.workflow.lifecycle.execution.keys.ExpressionKey.processRunnable(ExpressionKey.scala:26)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.$anonfun$startRunnableNodes$4(WorkflowExecutionActor.scala:438)
at cats.instances.ListInstances$$anon$1.$anonfun$traverse$2(list.scala:65)
at cats.instances.ListInstances$$anon$1.loop$2(list.scala:58)
at cats.instances.ListInstances$$anon$1.$anonfun$foldRight$1(list.scala:58)
at cats.Eval$Compute.loop$1(Eval.scala:310)
at cats.Eval$Compute.value(Eval.scala:324)
at cats.Eval$Call.value(Eval.scala:257)
at cats.instances.ListInstances$$anon$1.traverse(list.scala:64)
at cats.instances.ListInstances$$anon$1.traverse(list.scala:12)
at cats.Traverse$Ops.traverse(Traverse.scala:19)
at cats.Traverse$Ops.traverse$(Traverse.scala:19)
at cats.Traverse$ToTraverseOps$$anon$3.traverse(Traverse.scala:19)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.cromwell$engine$workflow$lifecycle$execution$WorkflowExecutionActor$$startRunnableNodes(WorkflowExecutionActor.scala:432)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:152)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:150)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
at akka.actor.FSM.processEvent(FSM.scala:668)
at akka.actor.FSM.processEvent$(FSM.scala:662)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowExecutionActor.scala:43)
at akka.actor.LoggingFSM.processEvent(FSM.scala:801)
at akka.actor.LoggingFSM.processEvent$(FSM.scala:783)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.processEvent(WorkflowExecutionActor.scala:43)
at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:659)
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:653)
at akka.actor.Actor.aroundReceive(Actor.scala:514)
at akka.actor.Actor.aroundReceive$(Actor.scala:512)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$Timers$$super$aroundReceive(WorkflowExecutionActor.scala:43)
at akka.actor.Timers.aroundReceive(Timers.scala:40)
at akka.actor.Timers.aroundReceive$(Timers.scala:36)
at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.aroundReceive(WorkflowExecutionActor.scala:43)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
at akka.actor.ActorCell.invoke(ActorCell.scala:496)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[2018-03-23 16:07:18,48] [info] WorkflowManagerActor WorkflowActor-1315e829-1d7d-4e47-a092-4b1129f4ceec is in a terminal state: WorkflowFailedState
[2018-03-23 16:07:24,26] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2018-03-23 16:07:27,54] [info] Message [akka.actor.FSM$Transition] from Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor/ServiceRegistryActor/MetadataService/WriteMetadataActor#1056125202] to Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor#-52822304] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2018-03-23 16:07:27,54] [info] Message [cromwell.core.actor.StreamActorHelper$StreamFailed] without sender to Actor[akka://cromwell-system/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2018-03-23 16:07:27,55] [info] Message [cromwell.core.actor.StreamActorHelper$StreamFailed] without sender to Actor[akka://cromwell-system/deadLetters] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2018-03-23 16:07:27,55] [info] Message [cromwell.core.actor.StreamActorHelper$StreamFailed] without sender to Actor[akka://cromwell-system/deadLetters] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2018-03-23 16:07:27,55] [error] Outgoing request stream error
akka.stream.AbruptTerminationException: Processor actor [Actor[akka://cromwell-system/user/StreamSupervisor-1/flow-3-0-mergePreferred#-1582581254]] terminated abruptly
[2018-03-23 16:07:27,55] [error] Outgoing request stream error
akka.stream.AbruptTerminationException: Processor actor [Actor[akka://cromwell-system/user/StreamSupervisor-1/flow-7-0-mergePreferred#-2054082751]] terminated abruptly
Workflow 1315e829-1d7d-4e47-a092-4b1129f4ceec transitioned to state Failed
[2018-03-23 16:07:27,57] [info] Automatic shutdown of the async connection
[2018-03-23 16:07:27,57] [info] Gracefully shutdown sentry threads.

[2018-03-23 16:07:27,57] [info] Shutdown finished.

The command line used was:


java -jar ~/xlib/java/cromwell-31.jar run PairedEndSingleSampleWf.wdl --inputs PairedEndSingleSampleWf.moTest_hg19.inputs.json

The WDL & JSON files are not allowed as attachments, so I'll copy/paste below. Basically I've 1) changed hg38 to hg19, and 2) changed the sample name.


{
"##_COMMENT1": "Take note of the .64 extensions on the reference files, issues between 32 and 64 bit OS",

"##_COMMENT2": "SAMPLE NAME AND UNMAPPED BAMS - read the README to find other examples.",
"PairedEndSingleSampleWorkflow.sample_name": "NA12878_hc",
"PairedEndSingleSampleWorkflow.base_file_name": "NA12878_hc",
"PairedEndSingleSampleWorkflow.flowcell_unmapped_bams": ["file://Users/moushengxu/xdata/NGS/NA12878/high_coverage_alignment/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam"],
"PairedEndSingleSampleWorkflow.final_gvcf_base_name": "NA12878_hc",
"PairedEndSingleSampleWorkflow.unmapped_bam_suffix": ".bam",

"##_COMMENT3": "REFERENCES",
"PairedEndSingleSampleWorkflow.fingerprint_genotypes_file": "gs://dsde-data-na12878-public/NA12878_hc.hg19.reference.fingerprint.vcf",
"PairedEndSingleSampleWorkflow.contamination_sites_ud": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.contam.UD",
"PairedEndSingleSampleWorkflow.contamination_sites_bed": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.contam.bed",
"PairedEndSingleSampleWorkflow.contamination_sites_mu": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.contam.mu",
"PairedEndSingleSampleWorkflow.scattered_calling_intervals": ["gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0001_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0002_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0003_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0004_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0005_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0006_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0007_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0008_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0009_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0010_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0011_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0012_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0013_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0014_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0015_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0016_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0017_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0018_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0019_of_50/scattered.interval_list", 
"gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0020_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0021_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0022_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0023_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0024_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0025_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0026_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0027_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0028_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0029_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0030_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0031_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0032_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0033_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0034_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0035_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0036_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0037_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0019_of_50/scattered.interval_list", 
"gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0039_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0040_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0041_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0042_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0043_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0044_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0045_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0046_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0047_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0048_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0049_of_50/scattered.interval_list", "gs://broad-references/hg19/v0/scattered_calling_intervals/temp_0050_of_50/scattered.interval_list"],
"PairedEndSingleSampleWorkflow.wgs_calling_interval_list": "gs://broad-references/hg19/v0/wgs_calling_regions.hg19.interval_list",
"PairedEndSingleSampleWorkflow.ref_dict": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.dict",
"PairedEndSingleSampleWorkflow.ref_fasta": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta",
"PairedEndSingleSampleWorkflow.ref_fasta_index": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai",
"PairedEndSingleSampleWorkflow.ref_alt": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.alt",
"PairedEndSingleSampleWorkflow.ref_sa": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.sa",
"PairedEndSingleSampleWorkflow.ref_amb": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.amb",
"PairedEndSingleSampleWorkflow.ref_bwt": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.bwt",
"PairedEndSingleSampleWorkflow.ref_ann": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.ann",
"PairedEndSingleSampleWorkflow.ref_pac": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.pac",
"PairedEndSingleSampleWorkflow.known_indels_sites_VCFs": [
"gs://broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz",
"gs://broad-references/hg19/v0/Homo_sapiens_assembly19.known_indels.vcf.gz"
],
"PairedEndSingleSampleWorkflow.known_indels_sites_indices": [
"gs://broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz.tbi",
"gs://broad-references/hg19/v0/Homo_sapiens_assembly19.known_indels.vcf.gz.tbi"
],
"PairedEndSingleSampleWorkflow.dbSNP_vcf": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.dbsnp119.vcf",
"PairedEndSingleSampleWorkflow.dbSNP_vcf_index": "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.dbsnp119.vcf.idx",
"PairedEndSingleSampleWorkflow.wgs_coverage_interval_list": "gs://broad-references/hg19/v0/wgs_coverage_regions.hg19.interval_list",
"PairedEndSingleSampleWorkflow.wgs_evaluation_interval_list": "gs://broad-references/hg19/v0/wgs_evaluation_regions.hg19.interval_list",

"##_COMMENT4": "PRIVATE REFERENCES",
"##PairedEndSingleSampleWorkflow.haplotype_database_file": "gs://gatk-aas-test-data/small/empty.vcf",

"##_COMMENT5": "DISK SIZES + MISC",
"PairedEndSingleSampleWorkflow.flowcell_small_disk": 100,
"PairedEndSingleSampleWorkflow.flowcell_medium_disk": 200,
"PairedEndSingleSampleWorkflow.agg_small_disk": 200,
"PairedEndSingleSampleWorkflow.agg_medium_disk": 300,
"PairedEndSingleSampleWorkflow.agg_large_disk": 400,
"PairedEndSingleSampleWorkflow.preemptible_tries": 3,
"PairedEndSingleSampleWorkflow.agg_preemptible_tries": 3,

"##_COMMENT6": "MISC",
"PairedEndSingleSampleWorkflow.break_bands_at_multiples_of": 1000000,
"PairedEndSingleSampleWorkflow.haplotype_scatter_count": 50

}

I do not know what "gs://" actually means. I guess it is an abbreviation for "geneset" internally accessible to Broadies? I don't have access to Broad internal databases.

Another question is that the gatk command line seems quite simple: you supply the reference genome, the BAM files, and a few other parameters, and you are done. Why does the JSON file contain so many parameters? It looks like the reference genome is broken into many smaller pieces. This might help divide and conquer to speed up the computation, but the problem is that we don't know how to modify these pieces when necessary.

Thanks for any advice.


Answers

  • danb Member, Broadie
    Accepted Answer

    gs stands for "Google Storage". Looks like you either want to run with a Google backend (docs here) or get a copy of that file to use locally.

  • moxu Member

    However, those gs://xxx files were provided by the GATK4 distribution. If these file links are no longer valid, how are we going to modify the .json file to make the GATK variant calling work?

  • moxu Member

    @danb said:
    gs stands for "Google Storage". Looks like you either want to run with a Google backend (docs here) or get a copy of that file to use locally.

    Hi Dan,

    Thanks for the reply. For the past 3 days I have been trying to get the "google backend" ready, but this is really a new area for me and I still cannot get it to work, or even get a clue about what I need to do. The link you provided is about Cromwell; Cromwell requires the Google Pipelines API, and the Google Pipelines API requires the Google Cloud SDK. I installed the Google Cloud SDK and got the server running, then ran the above-mentioned command again and got the same errors. I guess the learning curve is kind of steep here for me.

    You mentioned "get a copy of that file". Could you please tell me how to get such files? For instance, where and how do I get "gs://dsde-data-na12878-public/NA12878_hc.hg19.reference.fingerprint.vcf"?

    Your help would be highly appreciated!

  • danb Member, Broadie

    Hi @moxu , I'm afraid I may have led you down the wrong path. The configuration setting you want is the "filesystem" as documented here.

    Note the naming convention of the stanza makes it backend.providers.MacOSXFileSystem
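
    For reference, the filesystems configuration being pointed at looks roughly like the sketch below. This is a hedged sketch, not a verified config: the exact keys should be checked against the reference configuration shipped with your Cromwell version.

```hocon
// Sketch of a Cromwell config enabling gs:// paths on a local backend.
// Verify key names against your Cromwell version's reference.conf.
google {
  application-name = "cromwell"
  auths = [
    {
      name = "application-default"
      scheme = "application_default"
    }
  ]
}

engine {
  # Lets the engine itself evaluate expressions like size() on gs:// paths,
  # which is where the "filesystem not supported" error above comes from.
  filesystems {
    gcs { auth = "application-default" }
  }
}

backend {
  default = "Local"
  providers {
    Local {
      config {
        filesystems {
          local {}
          # Lets this backend resolve gs:// inputs with application-default credentials
          gcs { auth = "application-default" }
        }
      }
    }
  }
}
```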

  • moxu Member

    @danb said:
    Hi @moxu , I'm afraid I may have led you down the wrong path. The configuration setting you want is the "filesystem" as documented here.

    Note the naming convention of the stanza makes it backend.providers.MacOSXFileSystem

    Hi @danb , thanks much for the reply. But I still couldn't figure out how. I am very new to the Cromwell, WDL, CWL stuff. I tried "grep backend ." under the "broad-prod-wgs-germline-snps-indels" directory and found nothing.

    I worked with GATK before using the command line. The WDL stuff is supposed to improve performance by divide-and-conquer and to capture the environment necessary to run, which is great. However, I thought it would still be easy to make it work, even with WDL.

    Could you please give me instructions on how to modify the WDL and json files to make it work? Also, I don't know how to download the necessary files to local disk?

    Your help would be highly appreciated!

    Thanks!

  • shlee (Cambridge) Member, Administrator, Broadie, Moderator, admin
    edited April 2 Accepted Answer

    Hi @moxu,

    I'm a GATK/Picard support person, so bear with me. It appears you are running Cromwell locally. In this case, your input files should be either (i) local or (ii) take advantage of GATK4's NIO (new IO) feature that allows streaming Google Cloud Storage (GCS) files in memory for analysis. You can read more about GATK4 NIO streaming from GCS at https://github.com/broadinstitute/gatk#gcs. This feature basically sends cloud data in packets to your local system for in-memory analysis, without the data being stored anywhere locally.

    If you will run many analyses locally that require a particular file type, it makes sense for you to have a local copy of this file. I should think that a genome reference and its accompanying dictionary and index files etc are exactly the types of resources that you would keep a local copy of for efficient frequent access.

    In general, if you will use GATK4's NIO feature, then the type of the input should be String and not File in the pipeline script. I hear in future versions of Cromwell, you will not have to change the File type for GCS files and Cromwell will interpret them correctly for NIO. However, for now, you have to change this to String type.
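
    In draft-2 WDL, the change described above is just the declared type of the input. An illustrative snippet (not taken verbatim from the pipeline script):

```wdl
# Before: declared as File, Cromwell tries to localize the gs:// path
# and fails on a backend with no GCS filesystem configured.
# File ref_fasta

# After: declared as String, the gs:// path is passed through as plain text
# for GATK4 to stream via NIO.
String ref_fasta = "gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta"
```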

    Next, you should be sure you have access rights to the GCS files in question. You can check this with gsutil. For example, for your reference fasta in question, you would type:

    gsutil ls gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta
    

    If you have access, then this returns the same file path from the local command line. So then you can copy such a file to your local system with:

    gsutil cp gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta .
    

    P.S. Cromwell interprets local File types as symbolic links. It should not try to localize by copying the files, unless you are accessing files across disks.

  • moxu Member

    @shlee said:
    I'm a GATK/Picard support person, so bear with me. It appears you are running Cromwell locally. In this case, your input files should be either (i) local or (ii) take advantage of GATK4's NIO (new IO) feature that allows streaming Google Cloud Storage (GCS) files in memory for analysis. [...]

    After getting an account with GCS and authorizing access, it worked!

    Thanks a million!

  • shlee (Cambridge) Member, Administrator, Broadie, Moderator, admin

    Great to hear @moxu.

  • moxu Member

    @shlee said:
    Great to hear @moxu.

    I should say I was too quick to say "it worked". It actually didn't work, but it lasted longer (~2 hrs instead of 2 mins) before generating an error.

    • I ran "gcloud auth application-default login" to allow gcloud access

    • Now I can "gsutil ls xxx" & "gsutil cp xxx"

    • Should I use cromwell-31.jar or cromwell-29.jar? Or does it matter? They generate different error messages.

    With "java -jar ~/xlib/java/cromwell-29.jar run PairedEndSingleSampleWf.wdl --inputs PairedEndSingleSampleWf.hg38.inputs.json", I got error messages like


    ...
    "Workflow input processing failed:
    Workflow has invalid declarations: Could not evaluate workflow declarations:
    PairedEndSingleSampleWorkflow.bwa_ref_size:
    java.lang.IllegalArgumentException: Could not find suitable filesystem among Default to parse gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.
    Could not find suitable filesystem among Default to parse gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta"

    (This was after running "java -jar ~/xlib/java/cromwell-29.jar run PairedEndSingleSampleWf.wdl --inputs PairedEndSingleSampleWf.hg38.inputs.json" with the .wdl & .json included in the gatk4 download package.)

    ...

    It looks like Cromwell does not recognize what "gs://" is.

    Does "java -jar ~/xlib/java/cromwell-29.jar run PairedEndSingleSampleWf.wdl --inputs PairedEndSingleSampleWf.hg38.inputs.json" send the input genome to the cloud to compute, or does it read the "gs://" files from the cloud and run gatk locally?

    Thanks much!

  • shlee (Cambridge) Member, Administrator, Broadie, Moderator, admin

    @moxu, if you are specifying the gs:// type inputs as String types, then Cromwell should not be trying to localize them. Please see this document. The yellow box shows how different inputs are typed.

  • moxu Member

    @shlee said:
    @moxu, if you are specifying the gs:// type inputs as String types, then Cromwell should not be trying to localize them. Please see this document. The yellow box shows how different inputs are typed.

    Then what should I do to make gatk4 work? I don't want to learn WDL, CWL, or Docker if I don't have to :) , but we do need high performance (a humongous number of samples).

  • shlee (Cambridge) Member, Administrator, Broadie, Moderator, admin

    Hello @moxu,

    GATK4 runs on the command-line. You only need to meet the platform requirements listed at the top of this article to run GATK4. Namely, it is sufficient to have a Unix/Linux OS and Java 1.8. You can make sure most of the tools in the program will run correctly by following instructions here.

    For particular plotting tools, there are R and R package requirements that some of our users find tricky to install. Using Docker makes it easy for these types of users and also is actually rather convenient if you don't want to change your system's R configuration.

    We provide WDL pipelining scripts and Docker images for release versions for your convenience. Again, using these scripts and Docker is NOT necessary for you to run GATK4. I develop most tutorial content by running GATK4 on a Google Compute Engine VM (cloud) and on my MacBook Pro laptop. But I find it convenient to also run the GATK4 repository's WDL scripts for preliminary results (again on a GCE VM). I think perhaps you may also find these baseline WDL scripts a convenient starting point and the effort you put into setting up Cromwell and learning WDL will be worth it in the long run. May I ask why you are hesitant to learn these?

    I don't want to learn WDL, CWL, or Docker if I don't have to....

    If by high-performance you mean high throughput and compute-efficiency, then I believe multiple approaches are available to you, including Intel hardware that is wired for use with GATK and cloud solutions. I'm not familiar with the details of these. I've asked @LeeTL1220 to chime in here.

  • LeeTL1220 (Arlington, MA; Member, Broadie, Dev)

    @moxu
    1. Use cromwell 31. There are differences from v29 and, if you are using the latest version of the WDL, then there might be an issue there.
    2. I think you may need to modify the WDL. Wherever you are using a gs:// URL, the task (not the workflow) must use String as input. The WDL/Cromwell Engineers are working to eliminate this as a requirement, but that has not happened yet.
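
    For example, a sketch of the suggested change (the task and input names are made up for illustration, not taken from the pipeline):

    ```wdl
    task SizeReference {
      # Before: `File ref_fasta` -- Cromwell would try to localize the
      # gs:// path and fail on a backend without a GCS filesystem.
      # After: declared as String, the URL is passed through as text.
      String ref_fasta

      command {
        echo "reference is ${ref_fasta}"
      }
      output {
        String msg = read_string(stdout())
      }
    }
    ```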

    Tell me if this helps. If not, please post here.

  • moxu (Member)

    @LeeTL1220 said:
    @moxu
    1. Use cromwell 31. There are differences from v29 and, if you are using the latest version of the WDL, then there might be an issue there.
    2. I think you may need to modify the WDL. Wherever you are using a gs:// URL, the task (not the workflow) must use String as input. The WDL/Cromwell Engineers are working to eliminate this as a requirement, but that has not happened yet.

    Tell me if this helps. If not, please post here.

    1. I tried Cromwell 31, and got the same errors.

    2. There is one line in the "PairedEndSingleSampleWf.wdl" that has "gs://":

      String sub_strip_path = "gs://.*/"

      It appears inside the "scatter" block, as follows:

      scatter (unmapped_bam in flowcell_unmapped_bams) {

      Float unmapped_bam_size = size(unmapped_bam, "GB")

      String sub_strip_path = "gs://.*/"
      String sub_strip_unmapped = unmapped_bam_suffix + "$"
      String sub_sub = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "")
      ...

      How should I make the suggested changes? Or maybe I should change the .json file PairedEndSingleSampleWf.hg38.inputs.json?

    Thanks much!

  • jsoto (Broad Institute; Member, Broadie, Dev)
    String sub_strip_path = "gs://.*/"
    String sub_strip_unmapped = unmapped_bam_suffix + "$"
    String sub_sub = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "")
    

    These three lines can be replaced by:

    String sub_sub = basename(unmapped_bam, unmapped_bam_suffix)
    

    This will work with local or GCS paths.

    basename was added in cromwell 27 - https://github.com/broadinstitute/cromwell/releases/tag/27 - so as long as you are using anything newer than that you should be fine. Also for reference this is the usage of basename - https://github.com/broadinstitute/wdl/blob/develop/SPEC.md#string-basenamestring
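
    As a quick illustration of the call (the path below is made up):

    ```wdl
    String bam = "gs://my-bucket/inputs/sampleA.bam"
    # basename() strips the directory portion and, given a second
    # argument, the trailing suffix as well:
    String name = basename(bam, ".bam")  # "sampleA"
    ```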

  • moxu (Member)

    @shlee said:
    GATK4 runs on the command-line. You only need to meet the platform requirements listed at the top of this article to run GATK4. [...] May I ask why you are hesitant to learn these?

    I am trying the WDL/JSON files you provided online somewhere (sorry, I was not able to find the URL today). The workflow is used internally at Broad for production. I am working on the MVP project, which has more than half a million GWAS samples now and will soon surpass 1 million, so performance is very important for us. We cannot change the hardware: we have Linux clusters, and we cannot run on the cloud. We really want to try your production workflow first. Sorry for the part about "I don't want to learn WDL ...". I guess the purpose of WDL/JSON is to make the workflow portable, right? I really hope to be able to run your WDL without going to the cloud.

    Thanks!

  • moxu (Member)

    Hi Jsoto,

    I made the recommended changes in the .wdl file and tried again, but I still got the following error messages:


    [2018-05-04 15:41:28,64] [error] WorkflowManagerActor Workflow b9df9fc8-40a1-4b45-b54b-cce65af4a21c failed (during MaterializingWorkflowDescriptorState): Workflow input processing failed:
    Workflow has invalid declarations: Could not evaluate workflow declarations:
    PairedEndSingleSampleWorkflow.bwa_ref_size:
    java.lang.IllegalArgumentException: Could not find suitable filesystem among Default to parse gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.
    Could not find suitable filesystem among Default to parse gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.


    I tried "gsutil ls gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta", and the file can be seen. I tried "gsutil cp gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta ." and it actually copied the .fasta file to my local drive.

    Also, I have referenced my local .bam file as follows:

    "PairedEndSingleSampleWorkflow.sample_name": "NA12878_hc",
    "PairedEndSingleSampleWorkflow.base_file_name": "NA12878_hc",
    "PairedEndSingleSampleWorkflow.flowcell_unmapped_bams": ["file:///myhome/NA12878/high_coverage_alignment/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam"],
    "PairedEndSingleSampleWorkflow.final_gvcf_base_name": "NA12878_hc",
    "PairedEndSingleSampleWorkflow.unmapped_bam_suffix": ".bam",


    Is the "file:///xxx" syntax correct?

    Your help would be highly appreciated!

    Thanks!

    @jsoto said:
    [...] these three lines can be replaced by

    String sub_sub = basename(unmapped_bam, unmapped_bam_suffix)
    

    This will work with local or gcs paths [...]

  • moxu (Member)

    Hi @jsoto @shlee,

    It turned out that we cannot use any sort of remote access (e.g. cloud computing, sftp, etc.) in the variant-calling process. Could you make a version that uses local files only? That would be very helpful. Besides security, simplicity and performance are two more reasons for a local-files-only version: reading the gs:// files would take a significant amount of time.

    Your kind help would be highly appreciated!

    Thanks so much in advance!

    @jsoto said:
    [...] these three lines can be replaced by

    String sub_sub = basename(unmapped_bam, unmapped_bam_suffix)
    

    This will work with local or gcs paths [...]
