
How to run HaplotypeCaller on pooled DNA data locally using scatter-gather implemented in WDL?

Hello WDL team,

I have whole-genome data of pooled DNA for several populations (one BAM file per population) and access to a local desktop with 125 GB RAM and 32 cores. My goal is to do SNP calling on the cohort of population files using GVCF mode, with Cromwell running WDL in "server" mode. For this I plan to:

a) Scatter-gather HaplotypeCaller (-ERC mode) by scaffold, for each population file separately
b) Once I have all the g.vcfs, run the joint genotyping step (GenotypeGVCFs) to do SNP calling for all populations at once (sketched just below)
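For step b), the joint-genotyping task I have in mind would look roughly like the sketch below. It is only a sketch: I am assuming GATK 3.x argument names, the -Xmx value and output basename are placeholders, and whether -ploidy and --max_alternate_alleles belong here or in the -ERC step is exactly question 1 below.

task genotypeGVCFs {
    File GATK
    File RefFasta
    File RefIndex
    File RefDict
    Array[File] gvcfs          # one -ERC GVCF per population pool
    String outputBasename

    command {
        java -Xmx8g -jar ${GATK} \
            -T GenotypeGVCFs \
            -R ${RefFasta} \
            -V ${sep=' -V ' gvcfs} \
            -ploidy 100 \
            --max_alternate_alleles 6 \
            -o ${outputBasename}.vcf.gz
    }
    output {
        File jointVCF = "${outputBasename}.vcf.gz"
        File jointVCFIndex = "${outputBasename}.vcf.gz.tbi"
    }
}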

Questions:

1. Since I am using pooled DNA (40 to 50 diploid individuals per population), I plan to set -ploidy to 100 and --max_alternate_alleles to 6. Should I set these parameters when generating the g.vcfs (-ERC) or during SNP calling (GenotypeGVCFs)?

2. I made a test run of the scatter part of step a) on one file, to get a sense of how long it could take locally; the commands I used are below, adapted from the public WG pipeline. I limited Cromwell to 10 concurrent jobs and ran HaplotypeCaller -ERC per scaffold with the -L flag and default threading (1 CPU?). So far the program is using 99-110 GB of RAM, but 2 days have passed and it has not finished yet. How can I make it run faster?

workflow ScatterHaplotypeCaller {
    File scaffoldsFile
    Array[String] callingScaffolds = read_lines(scaffoldsFile)
    scatter(subScaffold in callingScaffolds) {
        call haplotypeCaller { input: scaffold=subScaffold }
    }
}

task haplotypeCaller {
    File GATK
    File RefFasta
    String sampleName
    File inputBAM
    File RefIndex
    File RefDict
    File bamIndex
    String scaffold
    command {
        java -jar ${GATK} \
            -T HaplotypeCaller \
            -R ${RefFasta} \
            -I ${inputBAM} \
            -L ${scaffold} \
            -ERC GVCF \
            -mbq 20 \
            -minPruning 5 \
            -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf
    }
    output {
        File GVCF = "${sampleName}.{scaffold}.rawLikelihoods.g.vcf"
    }
}
  • 2.1 I have 14 000 scaffolds in the reference genome; maybe this fine-grained scattering is what makes the run slow? Should I define intervals of several scaffolds instead of scattering per scaffold? If so, how large should they be? It is a 1 GB genome, so maybe I should aim to split it into ~50 intervals? (A sketch of what I mean is below, after the quoted task.)

  • 2.2 Could I safely use multi-threading (-nct 4) in HaplotypeCaller -ERC when scattering with WDL?
    I know that several users have reported problems with multi-threading, but the GATK documentation recommends -nct 4 and the Intel white paper shows that 4 threads significantly reduce running time.

  • 2.4 In the public WG pipeline, within the command section of the HaplotypeCaller task definition (see below), the Java memory is set with java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m and, in the runtime section, with memory: "10 GB". Is it necessary to set a memory limit when running GATK locally? (On my desktop each run uses ~10 GB, which is why I set Cromwell to at most 10 concurrent jobs.) If so, can I use those very same settings?

# Call variants on a single sample with HaplotypeCaller to produce a GVCF
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  File interval_list
  String gvcf_basename
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  Float? contamination
  Int disk_size
  Int preemptible_tries

  # tried to find lowest memory variable where it would still work, might change once tested on JES
  command {
    java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m \
      -jar /usr/gitc/GATK35.jar \
      -T HaplotypeCaller \
      -R ${ref_fasta} \
      -o ${gvcf_basename}.vcf.gz \
      -I ${input_bam} \
      -L ${interval_list} \
      -ERC GVCF \
      --max_alternate_alleles 3 \
      -variant_index_parameter 128000 \
      -variant_index_type LINEAR \
      -contamination ${default=0 contamination} \
      --read_filter OverclippedRead
  }
  runtime {
    docker: "broadinstitute/genomes-in-the-cloud:2.2.3-1469027018"
    memory: "10 GB"
    cpu: "1"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptible_tries
  }
  output {
    File output_gvcf = "${gvcf_basename}.vcf.gz"
    File output_gvcf_index = "${gvcf_basename}.vcf.gz.tbi"
  }
}
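Regarding 2.1 above, this is the kind of change I have in mind: instead of scattering over 14 000 individual scaffold names, scatter over ~50 pre-made interval files (plain-text .list files, each holding a subset of the scaffold names). A sketch only; the file names, the -Xmx value, and the grouping into 50 files are my own assumptions, and the remaining task inputs would come from the inputs JSON as before.

workflow ScatterHaplotypeCallerGrouped {
    # intervalGroupsFile lists ~50 paths (one per line), each pointing to a
    # plain-text .list file holding a subset of the 14 000 scaffold names
    File intervalGroupsFile
    Array[File] intervalGroups = read_lines(intervalGroupsFile)

    scatter (group in intervalGroups) {
        call haplotypeCallerGrouped { input: intervalList=group }
    }
}

task haplotypeCallerGrouped {
    File GATK
    File RefFasta
    File RefIndex
    File RefDict
    File inputBAM
    File bamIndex
    String sampleName
    File intervalList    # .list file with many scaffold names; passed directly to -L

    command {
        java -Xmx8g -jar ${GATK} \
            -T HaplotypeCaller \
            -R ${RefFasta} \
            -I ${inputBAM} \
            -L ${intervalList} \
            -ERC GVCF \
            -mbq 20 \
            -minPruning 5 \
            -o ${sampleName}.rawLikelihoods.g.vcf
    }
    output {
        # each scatter shard runs in its own Cromwell execution directory,
        # so identical file names across shards do not overwrite each other
        File GVCF = "${sampleName}.rawLikelihoods.g.vcf"
    }
}

Each shard would then process a few hundred scaffolds in a single GATK invocation rather than one.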

3. When scattering by scaffold for one file in WDL, should the output g.vcf of the HaplotypeCaller task include the scaffold name or not? That is, -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf or -o ${sampleName}.rawLikelihoods.g.vcf? I ask because, if I don't include the scaffold name, maybe the g.vcfs will be overwritten by the latest scaffold run. Is that right?
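For context, the gather step I have in mind afterwards (per sample) is something like the sketch below, using Picard MergeVcfs; the picard.jar path and -Xmx value are placeholders, and I would double-check that merging per-scaffold GVCFs this way keeps them valid for GenotypeGVCFs.

task gatherGVCFs {
    File picard                 # path to picard.jar (placeholder)
    Array[File] scatteredGVCFs  # the per-scaffold GVCFs from the scatter
    String sampleName

    command {
        java -Xmx4g -jar ${picard} MergeVcfs \
            I=${sep=' I=' scatteredGVCFs} \
            O=${sampleName}.rawLikelihoods.g.vcf.gz
    }
    output {
        File mergedGVCF = "${sampleName}.rawLikelihoods.g.vcf.gz"
    }
}

In the workflow this would be called after the scatter block, e.g. call gatherGVCFs { input: scatteredGVCFs=haplotypeCaller.GVCF, sampleName=... }, since Cromwell gathers the scattered outputs into an Array[File] automatically.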

Thanks very much in advance for any help!!


Answers

  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    Hi @apfuentes -

    I hope this is helpful - I can't help with the biology or GATK tooling so this is just a comment rather than an answer!

    Since you mentioned it's been running for 2 days already: because you're in server mode, you can see how your workflow is getting on via a timing diagram by pointing a browser at:

    http://<HOST>:<PORT>/api/workflows/v1/<WORKFLOWID>/timing
    
  • Hi @ChrisL,
    Thanks for your comment. I actually tried that URL but it did not seem to work (see image attached), even though the alternative "status" command works (it shows "running"); the output command does not show anything, and when I tried "abort" it did not work either. Any thoughts on that?

  • Hi again @ChrisL,
    Never mind, I clicked the "Continue" button and got the Gantt charts. Attached are some screenshots of the run. Most of the jobs (1 job = 1 scaffold) are still queued in Cromwell (blue bars) and only a few have completed (green bars). If I am interpreting these bars correctly, the setup I used is extremely inefficient... any ideas on how to make it better? I forgot to mention: I am using GATK 3.7 and Cromwell version 26. Thanks very much for any help.

    [Screenshot: Cromwell timing diagram; most per-scaffold jobs queued (blue), a few completed (green)]

  • Hi @Geraldine_VdAuwera,
    Thanks a lot for your quick response and for taking the time to answer my questions! I really appreciate it.

    1. Lowering ploidy: Could you please confirm whether I am on the right track in deciding on a reasonable ploidy for my study case?

    • 1.1. I understand that with ploidy 2 the algorithm looks for alleles at roughly 50%/50% in a sample, ploidy 3 expects alleles at ~33%, ploidy 10 looks for alleles at 10%, and ploidy 20 looks for alleles at 5%. The higher the ploidy, the higher the sensitivity for rare alleles, but also the longer the HaplotypeCaller run time.
      The importance of choosing a ploidy >2 for DNA pools is that the algorithm would then assign a higher genotype likelihood to alleles with frequency <50%, which would otherwise get a lower likelihood under the default ploidy of 2. Is that correct?

    • 1.2. As my samples are pooled DNA from a population (50 individuals per pool = ploidy 100) and I most likely won't have the power or sequencing depth (50x) to detect rare variants, a ploidy of 10 could work for me: it means at least ~5 reads (individuals) should carry the variant allele in the population for it to be called, and this approach would let me compare relatively frequent variants between populations. Is that right?

    2. Is this what you mean by "setting for a less sensitive allele fraction detection threshold"?

    Thanks a lot!!

    Cheers,

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    Right, that's the general idea, although there's more to genotype likelihoods than just straight up read counts -- quality of mapping and bases also comes into play. But it's a reasonable approximation for evaluating limitations.

    The main reason high ploidy is a problem for runtime is that we have to calculate likelihoods for explodingly large numbers of possible genotypes. It's a combinatorial problem. In the latest version we mitigate this by discarding some of the genotypes as too unlikely, but it was done as a patch job and some flaws remain. Consider running some tests on a subset of intervals that show a lot of variation to see how performance scales with ploidy. We haven't analyzed that systematically ourselves because we only work with diploid data, but I think in your case it may be a worthwhile exercise.
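    To give a rough sense of the scale (plain multiset counting, not a GATK benchmark): with ploidy $P$ and $A$ alleles at a site (reference plus alternates), the number of possible genotypes is

        $\binom{P + A - 1}{A - 1}$

    so a diploid site with 2 alleles has 3 genotypes, ploidy 10 with 7 alleles (6 alternates plus the reference) has $\binom{16}{6} = 8008$, and ploidy 100 with 7 alleles has $\binom{106}{6} \approx 1.7 \times 10^{9}$.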
  • Thanks a lot for the clarification and valuable recommendations!

    I have a question regarding Cromwell in server mode:

    • If I submit a bash script with several curl commands, one per scaffold, how can I make Cromwell take one submission at a time?

    I ask because, instead of scattering over 14 000 scaffolds for each of the 10 BAM files, I am now trying to scatter by sample (10 BAM files) for each scaffold. I created a bash script with one curl command for each of the 14 000 scaffolds, like this:

    #!/bin/bash -l
    
    curl -v "localhost:port/api/workflows/v1" -F wdlSource=@/home/PATH/script.wdl -F workflowInputs=@/home/PATH/script.wdl_inputs.scaffold1.json ;
    echo "########## scaffold1 done #########" ;
    
    curl -v "localhost:port/api/workflows/v1" -F wdlSource=@/home/PATH/script.wdl -F workflowInputs=@/home/PATH/script.wdl_inputs.scaffold2.json ;
    echo "########## scaffold2 done #########" ;
    
    curl -v "localhost:port/api/workflows/v1" -F wdlSource=@/home/PATH/script.wdl -F workflowInputs=@/home/PATH/script.wdl_inputs.scaffold3.json ;
    echo "########## scaffold3 done #########" ;
    
    .
    .
    .
    curl -v "localhost:port/api/workflows/v1" -F wdlSource=@/home/PATH/script.wdl -F workflowInputs=@/home/PATH/script.wdl_inputs.scaffold14135.json ;
    echo "########## scaffold14135 done #########" ;
    
    

    First I tested a bash script with 3 curl commands (for 3 scaffolds) and noticed that Cromwell started running subjobs (BAM files) for all three scaffolds. Then I tried a script with curl commands for 10 scaffolds, and the same happened: a few subjobs ran in each of the 10 scaffolds. This suggests all the commands are submitted to Cromwell at once (I was expecting them to run one by one). I read the curl manual but did not find any option that controls how many commands are submitted from a bash script. Any ideas?

    Best wishes,

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    I'm not sure I understand this approach. Normally we'd handle the scattering within the WDL. I get that you're trying to bypass some of the limitations I mentioned, but I suspect this will only make it worse, to be honest. But maybe I'm not seeing something. Can you explain your rationale?
  • I apologize for not being clear. For now I am skipping the "stitch scaffolds into super-scaffolds" option, since it would mean starting from scratch with read mapping, etc.; also, I would like to avoid the possibility of reads mapping to the wrong scaffolds within large stitched scaffolds... that is possible, right?

    1. The reference genome has 14000 scaffolds and I have 10 pool samples (=10 BAM files). One option could be:

    • Run HaplotypeCaller -ERC for one scaffold and all 10 samples at once (scatter-gather by sample) => I'd obtain one GVCF file for that scaffold with data for the 10 samples.
    • Repeat for the rest of the scaffolds => I'd obtain 14000 GVCF files, each with data for the 10 samples.
    • Merge the 14000 GVCFs (using MergeVcfs from Picard tools) => I'd obtain one large GVCF file with all scaffolds and all samples.
    • Because GenotypeGVCFs requires a GVCF per sample, I might need to split the large GVCF by sample. Which tool could I use for this? I can't find one in Picard that extracts per-sample data (a candidate I am considering is sketched below).

    Does this make sense, or is it too complicated? Would it be better to try the stitching option?
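    One candidate I am considering to test for the splitting step is GATK's SelectVariants with -sn, sketched below as a WDL task. I am not sure whether it keeps the GVCF reference blocks intact, so I would verify on a single small scaffold first; the -Xmx value is a placeholder.

    task splitGVCFBySample {
        File GATK
        File RefFasta
        File RefIndex
        File RefDict
        File combinedGVCF
        File combinedGVCFIndex
        String sampleName          # one of the 10 pool sample names

        command {
            java -Xmx4g -jar ${GATK} \
                -T SelectVariants \
                -R ${RefFasta} \
                -V ${combinedGVCF} \
                -sn ${sampleName} \
                -o ${sampleName}.g.vcf.gz
        }
        output {
            File sampleGVCF = "${sampleName}.g.vcf.gz"
        }
    }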

    2. How can I submit multiple WDL scripts to Cromwell but have the engine take one at a time (i.e. start the next only after the previous one is done)? I tested a bash file with multiple curl commands, each submitting one WDL script plus a JSON input file (e.g. curl -v "localhost:port/api/workflows/v1" -F wdlSource=@/home/PATH/script.wdl -F workflowInputs=@/home/PATH/script.wdl_inputs.scaffold1.json), but Cromwell seems to take all the WDL scripts at once, because it runs subjobs within the scatter-gather part of several WDL scripts simultaneously.

    Thanks very much for your help

  • EADG (Kiel; Member)

    Hi @apfuentes,

    I can help you with your second question. You have to change your application.conf in the following way:
    system.max-concurrent-workflows = 1
    system.new-workflow-poll-rate = 120s
    system.max-workflow-launch-count = 1

    And then start your Cromwell server with it, like this:
    java -Dconfig.file=/path/to/application.conf -jar ....
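    For example, a minimal application.conf with only these overrides could look like the sketch below (the include line pulls in Cromwell's default settings; the values are the same as above):

    # minimal override file; keep Cromwell's defaults and change only these keys
    include required(classpath("application"))

    system {
      max-concurrent-workflows = 1
      max-workflow-launch-count = 1
      new-workflow-poll-rate = 120s
    }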

    You can also peek at this thread: processing-samples-successivly ;)

    Greetings EADG

  • Hi @EADG,
    Thank you very much for your recommendation, it was really helpful and solved my problem :smile:
    Best wishes
