Starting a WDL from a particular GATK step

I looked through the joint-discovery-gatk4-local.wdl pipeline; it implements both GenomicsDBImport and GenotypeGVCFs. I have already generated the database manually with GenomicsDBImport for 3000 samples and want to avoid regenerating it through the WDL. Is it possible to run the WDL starting from the next step, GenotypeGVCFs? If so, how do I go about doing that? That is, how do I give the WDL the database, the known SNPs (for VQSR), and the interval file (.bed) used in the GenomicsDBImport step, so that it runs the best-practices workflow from GenotypeGVCFs through VQSR? Note that I am currently running GenotypeGVCFs manually against the database I created, and it is very slow, which is why I plan to switch to a WDL. I am also working with a non-human model organism.


  • bhanuGandham (Cambridge, MA) Member, Administrator, Broadie, Moderator, admin

    Hi @eyeamnice

    1) VQSR is not ideal for non-model organisms. Take a look at this doc: https://software.broadinstitute.org/gatk/documentation/article?id=11097

    2) To run the WDL from the next step, GenotypeGVCFs, you will need to edit this WDL: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.wdl so that its inputs start from a GenomicsDB database rather than individual GVCFs. We have documentation on how to edit a WDL here: https://software.broadinstitute.org/wdl/documentation/topic?name=wdl-tutorials. Specifically: https://software.broadinstitute.org/wdl/documentation/article?id=7614
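
    As a rough illustration of that kind of edit, here is a minimal sketch in WDL draft-2 syntax (the syntax used by the snippets in this thread). It is not the published pipeline. The workflow name GenotypeFromGenomicsDB, the inputs workspace_tar, intervals_file, and gatk_docker, and the memory value are hypothetical; it assumes the existing workspace directory was tarred by hand into <workspace_dir>.tar and that intervals_file lists one interval per line (chr:start-end) rather than being a .bed file. The dbSNP and VQSR inputs are left out for brevity. Because a GenomicsDB workspace records its own sample names, inputs such as input_gvcfs, input_gvcfs_indices, and sample_names should no longer be needed in a workflow shaped like this.

```wdl
# Sketch only: WDL draft-2, hypothetical workflow and input names.
workflow GenotypeFromGenomicsDB {
  File workspace_tar      # assumption: tarball of the existing GenomicsDB workspace
  File intervals_file     # assumption: one interval per line, e.g. chr1:1-1000000
  File ref_fasta
  File ref_fasta_index
  File ref_dict
  String callset_name
  String gatk_path
  String gatk_docker
  Int disk_size

  Array[String] intervals = read_lines(intervals_file)

  # One GenotypeGVCFs shard per interval.
  scatter (idx in range(length(intervals))) {
    call GenotypeGVCFs {
      input:
        workspace_tar = workspace_tar,
        interval = intervals[idx],
        output_vcf_filename = "${callset_name}.${idx}.vcf.gz",
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        gatk_path = gatk_path,
        docker = gatk_docker,
        disk_size = disk_size
    }
  }

  output {
    Array[File] genotyped_vcfs = GenotypeGVCFs.output_vcf
    Array[File] genotyped_vcf_indices = GenotypeGVCFs.output_vcf_index
  }
}

task GenotypeGVCFs {
  File workspace_tar
  String interval
  String output_vcf_filename
  File ref_fasta
  File ref_fasta_index
  File ref_dict
  String gatk_path
  String docker
  Int disk_size

  command <<<
    set -e

    # Unpack the workspace into the task's working directory.
    tar -xf ${workspace_tar}
    # Assumes the tarball is named <workspace_dir>.tar
    WORKSPACE=$(basename ${workspace_tar} .tar)

    ${gatk_path} --java-options "-Xmx5g -Xms5g" \
      GenotypeGVCFs \
      -R ${ref_fasta} \
      -O ${output_vcf_filename} \
      -G StandardAnnotation \
      --only-output-calls-starting-in-intervals \
      -V gendb://$WORKSPACE \
      -L ${interval}
  >>>

  runtime {
    docker: docker
    memory: "7 GB"
    disks: "local-disk " + disk_size + " HDD"
  }

  output {
    File output_vcf = "${output_vcf_filename}"
    File output_vcf_index = "${output_vcf_filename}.tbi"
  }
}
```

    Two caveats on this sketch. Cromwell typically localizes input files into a directory separate from the task's working directory, so the tar -xf step is kept so that the gendb:// path resolves to a directory that actually exists where the command runs. Also, every shard receives and unpacks the full tarball, which may be costly for a 3000-sample workspace; importing one workspace per interval, as the published pipeline does, avoids that.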

  • eyeamnice (AZ) Member
    Having gone through the tutorial, I understand that I need to edit the WDL. My problem is how to specify the sample names in the database and how to pass the database created by GenomicsDBImport as an input. For example, after modifying the WDL to exclude VQSR and the related steps, I used wdltool.jar to generate the following JSON file:

    "JointGenotyping.medium_disk_override": "(optional) Int?",
    "JointGenotyping.input_gvcfs_indices": "Array[File]",
    "JointGenotyping.gatk_path_override": "(optional) String?",
    "JointGenotyping.eval_interval_list": "File",
    "JointGenotyping.small_disk_override": "(optional) Int?",
    "JointGenotyping.dbsnp_vcf": "File",
    "JointGenotyping.callset_name": "String",
    "JointGenotyping.unpadded_intervals_file": "File",
    "JointGenotyping.dbsnp_vcf_index": "File",
    "JointGenotyping.ref_fasta_index": "File",
    "JointGenotyping.sample_names": "Array[String]",
    "JointGenotyping.ref_dict": "File",
    "JointGenotyping.huge_disk_override": "(optional) Int?",
    "JointGenotyping.gatk_docker_override": "(optional) String?",
    "JointGenotyping.snp_filter_level": "Float",
    "JointGenotyping.input_gvcfs": "Array[File]",
    "JointGenotyping.large_disk_override": "(optional) Int?",
    "JointGenotyping.indel_filter_level": "Float",
    "JointGenotyping.ref_fasta": "File"

    Is this correct for the minimal pipeline? How do I specify the existing GenomicsDBImport database in this case? The database already contains the indices and everything else it needs, so are the lines below still necessary?

    "JointGenotyping.input_gvcfs_indices": "Array[File]",
    "JointGenotyping.sample_names": "Array[String]",
    "JointGenotyping.callset_name": "String"

    Finally, do you think this part of the WDL is accurate? I do have a dbsnp_vcf file to use as known SNPs.

    call GenotypeGVCFs {
    workspace_tar = my_genomicsdb_database,
    interval = intervals.bed,
    output_vcf_filename = "output.vcf.gz",
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    ref_dict = ref_dict,
    dbsnp_vcf = dbsnp_vcf,
    dbsnp_vcf_index = dbsnp_vcf_index,
    disk_size = medium_disk,
    gatk_path = gatk_path

    task GenotypeGVCFs {
    File workspace_tar
    String interval

    String output_vcf_filename

    File ref_fasta
    File ref_fasta_index
    File ref_dict

    File dbsnp_vcf
    File dbsnp_vcf_index

    String gatk_path
    String docker
    Int disk_size

    command <<<
    set -e

    #tar -xf ${workspace_tar}
    WORKSPACE=$( basename ${workspace_tar})

    ${gatk_path} --java-options "-Xmx5g -Xms5g" \
    GenotypeGVCFs \
    -R ${ref_fasta} \
    -O ${output_vcf_filename} \
    -D ${dbsnp_vcf} \
    -G StandardAnnotation \
    --only-output-calls-starting-in-intervals \
    --use-new-qual-calculator \
    -V gendb://$WORKSPACE \
    -L ${interval}

    I am trying my best to edit the WDL based on the documentation. Thank you.
  • bhanuGandham (Cambridge, MA) Member, Administrator, Broadie, Moderator, admin
    edited July 2019

    Hi @eyeamnice

    @SChaluvadi from the WDL team should be able to help you out with this. We will get back to you shortly.

  • eyeamnice (AZ) Member
    Hi @bhanuGandham

    Thank you. I just want to be able to run a WDL starting from GenotypeGVCFs, using the database generated by GenomicsDBImport.