
Implementing your workflows from git repository to run locally

dodausp (Denmark, Member)
edited August 12 in Ask the GATK team

First of all, thanks a lot for putting up those GATK workflows. As a non-bioinformatician at the core, I truly appreciate those shortcuts!
This question comes right after I completed your first tutorial on workflows, and now I want to explore more of them.
I was particularly interested in starting with your "cnn-variant-filter", after reading about it here.
So, I pulled the workflow and went straight to editing its json file (cram2filtered.inputs.json). The first thing I noticed is that all paths are set to gs://. The second is that there are a lot of inputs that I don't know where to retrieve. More specifically, these lines:

"Cram2FilteredVcf.reference_fasta": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
"Cram2FilteredVcf.reference_dict": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
"Cram2FilteredVcf.reference_fasta_index": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
"Cram2FilteredVcf.resource_fofn": "gs://gatk-best-practices/cnn-h38/resource_fofn.txt",
"Cram2FilteredVcf.resource_fofn_index": "gs://gatk-best-practices/cnn-h38/resource_fofn_index.txt",
"Cram2FilteredVcf.calling_intervals": "gs://broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list"

Another thing is that the workflow is set to run by using a .cram file as input. In my case, I would like to use BAM and BAI as inputs.

That being said, how could I set this up to run locally, using BAM and BAI files as inputs? Would using the gatk docker help here? If so, how would I use it?

Any help is very much appreciated!

Answers

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @dodausp

    I am moving this to the WDL forum and someone from that team should be able to help you out with this.

  • dodausp (Denmark, Member)

    Hi @bhanuGandham
    Thanks a lot! I was redirected and @Sushma was very helpful.
    That being said, I am afraid I am facing a very similar problem to the previous one.
    Since I am running it locally, I followed all the steps and downloaded all the data required by the workflow, as you can see in my local directory:

    (base) [email protected]:~/Desktop/gatk_output/gatk_reference_files$ ls -lR
    .:
    total 3181760
    drwxr-xr-x 2 doc doc       4096 Aug 15 13:38 fofn
    -rw-r--r-- 9 doc doc     581712 Aug 15 12:55 Homo_sapiens_assembly38.dict
    -rw-r--r-- 9 doc doc 3249912778 Aug 15 12:54 Homo_sapiens_assembly38.fasta
    -rw-r--r-- 9 doc doc     160928 Aug 15 12:55 Homo_sapiens_assembly38.fasta.fai
    -rw-r--r-- 5 doc doc    6836531 Aug 15 13:02 NA12878.cram
    -rw-r--r-- 1 doc doc        298 Aug 15 13:44 resource_fofn_index.txt
    -rw-r--r-- 1 doc doc        286 Aug 15 13:44 resource_fofn.txt
    -rw-r--r-- 5 doc doc     599399 Aug 15 12:56 wgs_calling_regions.hg38.interval_list
    
    ./fofn:
    total 1929872
    -rw-r--r-- 1 doc doc 1888262073 Aug 15 13:37 1000G_phase1.snps.high_confidence.hg38.vcf.gz
    -rw-r--r-- 1 doc doc    2128536 Aug 15 13:38 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
    -rw-r--r-- 1 doc doc   62043448 Aug 15 13:36 hapmap_3.3.hg38.vcf.gz
    -rw-r--r-- 1 doc doc    1552123 Aug 15 13:37 hapmap_3.3.hg38.vcf.gz.tbi
    -rw-r--r-- 1 doc doc   20685880 Aug 15 13:38 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    -rw-r--r-- 1 doc doc    1500013 Aug 15 13:38 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
    

    In the fofn sub-directory, I stored the files listed in resource_fofn_index.txt and resource_fofn.txt.
    Now, when I try to run the command (from the directory that contains the cromwell.jar file):

    java -jar cromwell-33.1.jar run ./gatk4-cnn-variant-filter/cram2filtered.wdl --inputs ./gatk4-cnn-variant-filter/cram2filtered.inputs.json
    

    I get several warnings and one error message (enclosed). I am also attaching the json file, so you can take a look and see if there is something that I am missing.

    Any support, again, is extremely helpful.

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @dodausp

    @SChaluvadi will help you out with this. She will get back to you shortly.

  • SChaluvadi (Member, Broadie, Moderator)

    @dodausp The error looks as though you do not have permissions for the cromwell-executions folder created by Cromwell. Can you try adding 'sudo' to the start of your command?

    sudo java -jar cromwell-33.1.jar run ./gatk4-cnn-variant-filter/cram2filtered.wdl --inputs ./gatk4-cnn-variant-filter/cram2filtered.inputs.json

    Does this change anything?

  • dodausp (Denmark, Member)

    Hi @SChaluvadi
    Oh, sorry, I forgot to mention that I also ran it with sudo. It still did not work, although it went a bit further.
    In any case, here is the log I get when running with sudo (file enclosed).

  • SChaluvadi (Member, Broadie, Moderator)

    @dodausp It looks like no input is being read in for the MergeVcfs step. In the error log, the command

    gatk --java-options "-Xmx2500m" MergeVcfs \
    -I -O "hg38_20k_na12878_DOL_cnn_scored.vcf.gz"

    has no input file after -I, so the script is failing. Can you check that you are assigning an input file to the command?

  • dodausp (Denmark, Member)

    Hi again, @SChaluvadi
    And thanks again!
    Would this hg38_20k_na12878_DOL_cnn_scored.vcf.gz be generated during the workflow? Or is it a file that should be referenced in the json file?
    I am asking because the input file being used here is the one referenced in the original json file from the GitHub repository, NA12878.cram.
    Sorry again for all this trouble, and thanks a lot!

  • SChaluvadi (Member, Broadie, Moderator)

    In this command, hg38_20k_na12878_DOL_cnn_scored.vcf.gz looks like the output to be generated (it is also not listed in the input json file). The input file referenced in the json is most likely the one that should be passed with the -I parameter. It doesn't look like it is being read in, so you might need to check the path to the input.
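
For anyone following along, the path check suggested above can be sketched in a few lines of Python. This is only an illustration, not part of the workflow: the function name check_input_paths is hypothetical, and the rule for deciding which strings are local paths (values starting with "/", "./" or "../") is an assumption made so that values such as docker image names are skipped.

```python
import json
import os


def check_input_paths(inputs_json_path):
    """Return (key, value, problem) tuples for inputs that look unusable locally."""
    with open(inputs_json_path) as handle:
        inputs = json.load(handle)

    problems = []
    for key, value in inputs.items():
        # Treat a single value and a list of values the same way.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if not isinstance(item, str):
                continue  # numbers, booleans, etc. cannot be paths
            if item.startswith("gs://"):
                # A bucket path left over from the original inputs file.
                problems.append((key, item, "still a gs:// path"))
            elif item.startswith(("/", "./", "../")) and not os.path.exists(item):
                # Assumption: only strings with these prefixes are local paths.
                problems.append((key, item, "file not found"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_inputs.py path/to/inputs.json
    if len(sys.argv) > 1:
        for key, value, problem in check_input_paths(sys.argv[1]):
            print(f"{key}: {value} ({problem})")
```

Running it against cram2filtered.inputs.json before starting Cromwell should list any entry that still points to a bucket or to a file that does not exist. Note that relative paths are resolved against the directory the script is run from.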

  • dodausp (Denmark, Member)

    Hi, @SChaluvadi
    It turns out that it was likely a case-sensitive typo.
    I changed the Cram2FilteredVcf.output_prefix field from hg38_20k_na12878 to hg38_20k_NA12878, and there was no error message (despite the warning messages) when the json file was executed.

    Problem is, I cannot find the output file now. This is extremely embarrassing, but where do I find it?

    As a small suggestion, I would ask you folks to correct that typo either in the json file (replace with uppercase NA) or in the name of the cram file (replace with lowercase na).

    Thanks a lot for the troubleshooting!

  • SChaluvadi (Member, Broadie, Moderator)

    Hello @dodausp, the output files can be found in a folder called cromwell-executions, which should have been created in the directory from which you ran the WDL. Inside cromwell-executions there may be several directories, each one unique and representing a submission, with subdirectories for each call and its outputs.

    Thanks for the feedback!
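
A quick way to see which calls in a submission finished cleanly is to read the rc file that sits in each call's execution directory. The sketch below is a minimal illustration under that assumption; the function names summarize_calls and failed_calls are hypothetical, and a call with no rc file is simply reported as None.

```python
import os


def summarize_calls(workflow_dir):
    """Map each call directory under workflow_dir to its return code (None if unknown)."""
    summary = {}
    for root, _dirs, files in os.walk(workflow_dir):
        if os.path.basename(root) != "execution":
            continue
        # Name the call by its path relative to the submission directory,
        # so scattered calls show up as e.g. call-X/shard-0.
        call_name = os.path.relpath(os.path.dirname(root), workflow_dir)
        rc = None
        if "rc" in files:
            with open(os.path.join(root, "rc")) as handle:
                text = handle.read().strip()
            if text.lstrip("-").isdigit():
                rc = int(text)
        summary[call_name] = rc
    return summary


def failed_calls(summary):
    """Return the calls whose return code is missing or non-zero."""
    return sorted(name for name, rc in summary.items() if rc != 0)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_calls.py cromwell-executions/<workflow>/<submission-id>
    if len(sys.argv) > 1:
        for name, rc in sorted(summarize_calls(sys.argv[1]).items()):
            print(f"{name}: rc={rc}")
```

For any call reported by failed_calls, the stderr file in the same execution directory is usually the first place to look.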

  • dodausp (Denmark, Member)

    Thanks again, @SChaluvadi
    That was also my first attempt, but I couldn't find the output file. I was expecting a vcf file. Is that wrong?
    In any case, if you could take a look at it, here is the full directory listing of the successful run (prefix d6a9ddf7):

    (base) [email protected]:~/Desktop/gatk_example_github/cromwell-executions/Cram2FilteredVcf/d6a9ddf7-7df5-4718-94e0-7c63e486fab2$ ls -lR
    .:
    total 12
    drwxr-xrwx 5 root root 4096 Aug 20 16:04 call-CramToBam
    drwxr-xrwx 4 root root 4096 Aug 20 16:04 call-MergeVCF_HC4
    drwxr-xrwx 5 root root 4096 Aug 20 16:04 call-SplitIntervals
    
    ./call-CramToBam:
    total 12
    drwxr-xrwx 2 root root 4096 Aug 20 16:05 execution
    drwxr-xrwx 3 root root 4096 Aug 20 16:03 inputs
    drwxrwxrwx 2 root root 4096 Aug 20 16:04 tmp.9348530f
    
    ./call-CramToBam/execution:
    total 9548
    -rw-r--r-- 1 root root      64 Aug 20 16:04 docker_cid
    -rw-r--r-- 1 root root 9739953 Aug 20 16:05 hg38_20k_NA12878.bam
    -rw-r--r-- 1 root root       4 Aug 20 16:05 rc
    -rw-r--r-- 1 root root    2527 Aug 20 16:03 script
    -rw-r--r-- 1 root root     411 Aug 20 16:03 script.background
    -rw-r--r-- 1 root root     178 Aug 20 16:05 script.kill
    -rw-r--r-- 1 root root    1421 Aug 20 16:03 script.submit
    -rw-r--r-- 1 root root       0 Aug 20 16:04 stderr
    -rw-r--r-- 1 root root       0 Aug 20 16:03 stderr.background
    -rw-r--r-- 1 root root       0 Aug 20 16:05 stderr.kill
    -rw-r--r-- 1 root root     370 Aug 20 16:04 stdout
    -rw-r--r-- 1 root root      71 Aug 20 16:05 stdout.background
    -rw-r--r-- 1 root root      65 Aug 20 16:05 stdout.kill
    
    ./call-CramToBam/inputs:
    total 4
    drwxr-xrwx 2 root root 4096 Aug 20 16:03 490818688
    
    ./call-CramToBam/inputs/490818688:
    total 3181160
    -rw-r--r-- 23 doc doc     581712 Aug 15 12:55 Homo_sapiens_assembly38.dict
    -rw-r--r-- 23 doc doc 3249912778 Aug 15 12:54 Homo_sapiens_assembly38.fasta
    -rw-r--r-- 23 doc doc     160928 Aug 15 12:55 Homo_sapiens_assembly38.fasta.fai
    -rw-r--r-- 12 doc doc    6836531 Aug 15 13:02 NA12878.cram
    
    ./call-CramToBam/tmp.9348530f:
    total 0
    
    ./call-MergeVCF_HC4:
    total 8
    drwxr-xrwx 2 root root 4096 Aug 20 16:05 execution
    drwxrwxrwx 2 root root 4096 Aug 20 16:04 tmp.eeb6f536
    
    ./call-MergeVCF_HC4/execution:
    total 32
    -rw-r--r-- 1 root root   64 Aug 20 16:04 docker_cid
    -rw-r--r-- 1 root root    2 Aug 20 16:05 rc
    -rw-r--r-- 1 root root 1772 Aug 20 16:04 script
    -rw-r--r-- 1 root root  417 Aug 20 16:04 script.background
    -rw-r--r-- 1 root root 1426 Aug 20 16:04 script.submit
    -rw-r--r-- 1 root root 3557 Aug 20 16:05 stderr
    -rw-r--r-- 1 root root    0 Aug 20 16:04 stderr.background
    -rw-r--r-- 1 root root   17 Aug 20 16:05 stdout
    -rw-r--r-- 1 root root   71 Aug 20 16:05 stdout.background
    
    ./call-MergeVCF_HC4/tmp.eeb6f536:
    total 0
    
    ./call-SplitIntervals:
    total 12
    drwxr-xrwx 3 root root 4096 Aug 20 16:04 execution
    drwxr-xrwx 3 root root 4096 Aug 20 16:03 inputs
    drwxrwxrwx 2 root root 4096 Aug 20 16:04 tmp.41d5c25e
    
    ./call-SplitIntervals/execution:
    total 2328
    -rw-r--r-- 1 root root 582927 Aug 20 16:04 0000-scattered.interval_list
    -rw-r--r-- 1 root root 582857 Aug 20 16:04 0001-scattered.interval_list
    -rw-r--r-- 1 root root 583968 Aug 20 16:04 0002-scattered.interval_list
    -rw-r--r-- 1 root root 586817 Aug 20 16:04 0003-scattered.interval_list
    -rw-r--r-- 1 root root     64 Aug 20 16:04 docker_cid
    drwxr-xr-x 2 root root   4096 Aug 20 16:04 glob-6f4bc12a708659d4f5f3eecd1cdffff7
    -rw-r--r-- 1 root root      0 Aug 20 16:04 glob-6f4bc12a708659d4f5f3eecd1cdffff7.list
    -rw-r--r-- 1 root root      2 Aug 20 16:04 rc
    -rw-r--r-- 1 root root   3705 Aug 20 16:03 script
    -rw-r--r-- 1 root root    421 Aug 20 16:03 script.background
    -rw-r--r-- 1 root root   1440 Aug 20 16:03 script.submit
    -rw-r--r-- 1 root root   3293 Aug 20 16:04 stderr
    -rw-r--r-- 1 root root     62 Aug 20 16:04 stderr.background
    -rw-r--r-- 1 root root      0 Aug 20 16:04 stdout
    -rw-r--r-- 1 root root     71 Aug 20 16:04 stdout.background
    
    ./call-SplitIntervals/execution/glob-6f4bc12a708659d4f5f3eecd1cdffff7:
    total 4
    -rw-r--r-- 1 root root 277 Aug 20 16:04 cromwell_glob_control_file
    
    ./call-SplitIntervals/inputs:
    total 4
    drwxr-xrwx 2 root root 4096 Aug 20 16:03 490818688
    
    ./call-SplitIntervals/inputs/490818688:
    total 3175068
    -rw-r--r-- 23 doc doc     581712 Aug 15 12:55 Homo_sapiens_assembly38.dict
    -rw-r--r-- 23 doc doc 3249912778 Aug 15 12:54 Homo_sapiens_assembly38.fasta
    -rw-r--r-- 23 doc doc     160928 Aug 15 12:55 Homo_sapiens_assembly38.fasta.fai
    -rw-r--r-- 12 doc doc     599399 Aug 15 12:56 wgs_calling_regions.hg38.interval_list
    
    ./call-SplitIntervals/tmp.41d5c25e:
    total 0
    

    As you will notice, there is no vcf file here. Am I looking for the wrong thing?

    And just to make sure I am looking into the right directory, here is the log file containing the submission run (d6a9ddf7-7df5-4718-94e0-7c63e486fab2).

    Many thanks again!

  • dodausp (Denmark, Member)
    edited August 21

    Hi again @SChaluvadi ,
    Never mind what I wrote above, about being able to run the code.
    I am back to the same issue - I am not able to run it.

    What is it that I'm doing wrong?
