Questions regarding "minimal WDL for joint genotyping"

Hi, Geraldine recommended that we ask our questions here. They concern a WDL she shared with us, WGS_Joint_Analysis_160909.wdl, and its inputs file, WGS_Joint_Analysis_160909.inputs.json.
- In the inputs.json file, I see that all input files are specified using full path names. The output files for each job, however, do not have a full path specified. For instance, “unzipped_basename” for task UnzipGVCF is just defined as “temp_unzipped”. Would each task instance have an output directory unique to itself, assigned by the job scheduler (e.g. Cromwell)?
- I see that JointAnalysis.scattered_calling_intervals has 50 intervals. Does that mean the scatter calling GenotypeGVCFs would launch 50 Docker containers, each handling one interval and each requiring 10GB of memory (as specified in the runtime section of task GenotypeGVCFs)?
- Some of the “File” variables defined in tasks are not explicitly referenced inside the task. Are they used implicitly by the application called in the task? For instance, “File ref_dict” in task UnzipGVCF is not explicitly used in “command”, but it is probably used by the application GATK4.jar, which implicitly derives the file name of ref_dict from the fasta file and assumes ref_dict is located in the same directory as the fasta file?
- Geraldine mentioned that the workflow ran to completion on a whole-genome sample. Is there any information on how long it took to complete, and how large the input data was?
Thanks!
Kitty
Best Answer
KateN Cambridge, MA admin
Yes, each task output has an output directory unique to itself, assigned by Cromwell. At the end of a run, Cromwell will output a list of your output files, along with the full path of where they are located. An example output for a workflow called `helloWorld` with a task called `taskA` that creates an output specified as `output.txt` might look like this:
```
{
  "myOutput" : "/Users/knooblett/cromwell-executions/helloWorld/f8bcb3f5-9979-4c09-859a-6e901370fad9/call-taskA/output.txt"
}
```
You can find more information about that specific part of the pipeline here. Although it was written about a different pipeline, the ideas there should help you better understand the parallelization decisions. Your understanding of this part looks correct.
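As a rough back-of-the-envelope check on the numbers in the question (50 intervals, 10GB requested per GenotypeGVCFs call), the memory footprint can be sketched as below. This is only an illustration: the function name and the `max_concurrent` parameter are hypothetical, and how many shards actually run at the same time depends on your backend and its configuration.

```python
from typing import Optional


def peak_memory_gb(num_shards: int,
                   mem_per_shard_gb: int,
                   max_concurrent: Optional[int] = None) -> int:
    """Upper bound on memory requested at any one time by a scatter.

    num_shards       -- number of scattered calls (one per interval)
    mem_per_shard_gb -- memory requested in the task's runtime section
    max_concurrent   -- hypothetical cap on simultaneously running shards;
                        None means every shard may run at once
    """
    running = num_shards if max_concurrent is None else min(num_shards, max_concurrent)
    return running * mem_per_shard_gb


# 50 intervals at 10GB each, all running at once
print(peak_memory_gb(50, 10))
# Same scatter, assuming at most 8 shards run concurrently
print(peak_memory_gb(50, 10, max_concurrent=8))
```

The point of the sketch is that the 10GB figure is a per-call request, so the total depends on how many of the 50 calls are in flight together rather than on the scatter width alone.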
That is correct. While GATK makes an assumption that the ref.fasta, ref.fasta.fai, and ref.dict are all located in the same directory, Cromwell currently does not (though it is a point of development). In order for GATK to make its assumptions, Cromwell needs to be told these files exist, and to pull them into the working directory. You can do that by specifying the file variable, then not using it in the actual command.
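To make the sibling-file assumption concrete, here is a small sketch that derives the companion file names from the fasta path (following the ref.fasta, ref.fasta.fai, ref.dict naming above) and reports which ones are not among the files pulled into the working directory. The function names and the example paths are made up for illustration; this is not how GATK or Cromwell implement the lookup.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def expected_companions(ref_fasta: str) -> Dict[str, PurePosixPath]:
    """Companion files assumed to sit next to the reference fasta."""
    fasta = PurePosixPath(ref_fasta)
    return {
        # ref.fasta -> ref.fasta.fai (index appended to the full name)
        "ref_fasta_index": fasta.with_name(fasta.name + ".fai"),
        # ref.fasta -> ref.dict (extension replaced)
        "ref_dict": fasta.with_suffix(".dict"),
    }


def missing_companions(ref_fasta: str, localized_files: Iterable[str]) -> List[str]:
    """Names of companion files that are not among the localized inputs."""
    localized = {PurePosixPath(p) for p in localized_files}
    return [name for name, path in expected_companions(ref_fasta).items()
            if path not in localized]


# Hypothetical working directory where ref_dict was not declared as an input
print(missing_companions(
    "/work/inputs/ref.fasta",
    ["/work/inputs/ref.fasta", "/work/inputs/ref.fasta.fai"],
))
```

In the sketch, a file that the task never declares simply never shows up in the localized list, which is why declaring `File ref_dict` matters even when the command does not mention it.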
Currently, I do not have any data on that, though it is something we intend to document in the future.
I hope this answers all your questions, aside from the last one, which will be answered in future documentation. Please don't hesitate to ask if you need further clarification.
Thanks. That answers my questions.