Does MergeVCFs have a limit to how many gvcf files it can merge?

I have tried running the haplotypecaller-gvcf-gatk4.wdl on one pool (I added arguments to the HaplotypeCaller command to adjust for our data, and -L with a master list that has all the scaffolds in our reference genome). The HaplotypeCaller command works and finishes, but the MergeVCFs command doesn't finish even after 5 days. I have 9,000 scaffolds that each have a gVCF file after HaplotypeCaller; is there some limit to the number of gVCF files that MergeVCFs can combine?
I would like to have one gVCF for each pool after the HaplotypeCaller to be able to use JointGenotyping to get a final VCF with all 12 pools.
Is there a better tool or a different way to combine the gVCF after HaplotypeCaller to get a single gVCF per pool to use in JointGenotyping? I have seen forum posts from 2015 that say CatVariants might be a good solution, but I don't know if that is appropriate for this pipeline.
Thank you!
Best Answers
bshifaw admin
When I looked again at the workflow log file, I can see that for two days the job results were being received, and occasionally the HaplotypeCaller was restarted
The HaplotypeCaller runs were most likely preempted
If that is the case, then I have re-started this WDL workflow on another sample. I'm going to let it run for a week and see if it finishes running.
Good call, this job is very large and will require a considerable amount of time.
Is there a way to improve the run time without racking up huge costs?
One way I can think of that isn't already implemented in the haplotypecaller-gvcf-gatk4.wdl is the use of NIO ("protocol called NIO that allows us to only retrieve the piece of the file we care about for a given task, which neatly blows away the localization problem and delivers substantial savings" link). You can alter aspects of the workflow to improve run time while sacrificing cost, such as decreasing preemption or the amount of scattering, and then offset that increase in cost by using NIO. This is more of a theory than an answer. There is already an NIO version of the HaplotypeCaller workflow available in the git repo you mentioned, called haplotypecaller-gvcf-gatk4-nio.wdl
bshifaw admin
One thing I noticed is the large amount of scattering taking place during an individual run. The workflow bases the number of times the HaplotypeCaller task is scattered on the number of intervals in the provided interval list. I can't view your interval list, but I assume it has over 3000 intervals. This is one way of scattering the HaplotypeCaller task, but in this case it can be overwhelming when thousands of intervals are used to spin up thousands of jobs. Another way of scattering the task is to evenly divide the intervals into a set number of groups; this is done in the five-dollar-genome-analysis-pipeline using the Utils.ScatterIntervalList task. You could incorporate that script into your method, so that you aren't running 3000+ scatter jobs.
bshifaw admin
I'm thinking of running the HC command without the preemptible allowance, because I think I am just wasting time allowing the job to be preempted.
Be aware it will increase the cost of your run.
How do I know whether to increase the memory or the CPUs of the job thoughtfully- that is without asking for more than I need and wasting money?
Good question. You could add a few commands to the task command block that write the system resources being used to a log file, then use this file to determine whether the task is being limited by CPU, memory, etc. (FYI: you can adjust the CPU count.)
I would run HaplotypeCaller on the 24 largest scaffolds I have using the original scatter method (one HC task for each line in the interval file; 24 scaffolds in the interval file), probably on preemptible VMs to save money, and then merge the resulting 24 gvcf files to produce one gvcf. I would run the remaining 8286 very, very small scaffolds using the non-scattered protocol (just a wdl of HC with -L for the 8286 scaffolds within the body of the command) to produce one gvcf file for all of those small scaffolds. Then I would use MergeVCFs on the two gvcf files to get one gvcf that contains all 8310 scaffolds.
This sounds good to me and well thought out.
Answers
Hello @cmt - Are you setting the -ploidy option with HaplotypeCaller?
Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
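To illustrate the arithmetic with a made-up example (the file names and pool size here are placeholders, not from this thread): a pool of 10 diploid individuals would be called with 10 * 2 = 20, e.g.
```
# Hypothetical example: a pool of 10 diploid individuals -> ploidy = 10 * 2 = 20
gatk HaplotypeCaller \
  -R reference.fasta \
  -I pool1.bam \
  -O pool1.g.vcf.gz \
  -ERC GVCF \
  --sample-ploidy 20
```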
It would be helpful to see a copy of the error log from the run as well as the exact command used.
Thank you for helping me out! I did use -ploidy 20 in my HaplotypeCaller command. I have a true ploidy of between 92 and 96 per pool, but what I have learned from other posts on the forum is that anything more than 21 is an overwhelming computational load. I might switch to ploidy 10 if that would help get genotypes.
The WDL that I used is the last and largest chunk of code at the bottom of this page. It is basically the exact one posted on GitHub as haplotypecaller-gvcf-gatk4-nio.wdl, but I added more arguments to HaplotypeCaller and also changed the names of the input and output files for the MergeVCFs command. (Maybe my messing with the names of the inputs and outputs is what went wrong?)
There was no explicit error message for the MergeVCF not working, but when I try to look at the error logs, I get this message:
``` {"status":404,"response":{"req":{"method":"HEAD","url":"[edited out link]//fc-6ebc1ade-bf3e-4d51-a270-eebefb480ded.storage.googleapis.com/4b7942b4-f236-41ae-9c16-9347dc4fabf9%2FHaplotypeCallerGvcf_GATK4%2F78499ac8-3480-46d6-b0b7-3555480e4f41%2Fcall-MergeGVCFs%2FMergeGVCFs.log","headers":{"user-agent":"node-superagent/3.8.3","authorization":"Bearer ya29.c.Elp4BjkUA8I_aShpEVb1fEFJ0D6_t6gr58ryQ9hRkRRf_--0zPyBsBezV7VT7AJqqkAwV_hYIfLy7bTPxSdBkTms3UNY1DPjwev4I9Cl677ZHuyftSseCX5x1NE"}},"header":{"x-guploader-uploadid":"AEnB2Upy2ocbjeKYf9P5UoH895oaZ8WSP4kNJwf4jsa2KJ8kkQa1eLlmaZedpUZF1qVEd3Iyl-2Tsn4Dq4dQEbnwXhKZOcPvJfSRRUEUG5p5u_oJnZJlCl0","content-type":"application/xml; charset=UTF-8","content-length":"332","date":"Thu, 20 Dec 2018 17:09:36 GMT","expires":"Thu, 20 Dec 2018 17:09:36 GMT","cache-control":"private, max-age=0","server":"UploadServer","alt-svc":"quic=\":443\"; ma=2592000; v=\"44,43,39,35\"","connection":"close"},"status":404}```
There was a workflow log in the google bucket that is quite large, but this little portion shows that it finished the HaplotypeCaller command and started working on the MergeVCF portion:
```2018-12-15 16:00:27,100 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.HaplotypeCaller:5:4]: Status change from Running to Success
2018-12-15 16:16:06,041 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.HaplotypeCaller:2:2]: Status change from Running to Success
2018-12-15 16:16:11,979 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Starting HaplotypeCallerGvcf_GATK4.MergeGVCFs
2018-12-15 16:16:13,316 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: `set -e
/gatk/gatk --java-options "-Xmx2G" \
MergeVcfs \
--INPUT /cromwell_root/fc-6ebc1ade-bf3e-4d51-a270-eebefb480ded/4b7942b4-f236-41ae-9c16-9347dc4fabf9/HaplotypeCallerGvcf_GATK4/78499ac8-3480-46d6-b0b7-3555480e4f41/call-HaplotypeCaller/shard-0/301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz --INPUT /cromwell_root/fc-6ebc1ade-bf3e-4d51-a270-eebefb480ded/4b7942b4-f236-41ae-9c16-9347dc4fabf9/HaplotypeCallerGvcf_GATK4/78499ac8-3480-46d6-b0b7-3555480e4f41/call-HaplotypeCaller/shard-1/301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz ```
It never finished the MergeVCFs step; after 4 days I aborted it, and there were no files in the Google bucket except for the script. These are the last couple of lines in the workflow.log before I aborted the MergeVCFs:
```'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(8), attempt 4)
2018-12-19 16:24:53,900 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(5593), attempt 3)
2018-12-19 16:24:53,992 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(21), attempt 4)
2018-12-19 16:25:00,054 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(5), attempt 4)
2018-12-19 16:25:03,124 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Restarting HaplotypeCallerGvcf_GATK4.MergeGVCFs
2018-12-19 16:26:04,335 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: job id: operations/EOHcnZX7LBiZtoL9geuiBCCT043OohUqD3Byb2R1Y3Rpb25RdWV1ZQ
2018-12-19 16:26:19,692 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: Status change from - to Initializing
2018-12-19 17:49:49,825 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Aborting workflow```
This is the full WDL for this run of the HaplotypeCaller and the MergeVCFs:
```
# WORKFLOW DEFINITION
workflow HaplotypeCallerGvcf_GATK4 {
File input_bam
File input_bam_index
File ref_dict
File ref_fasta
File ref_fasta_index
File scattered_calling_intervals_list
Boolean? make_gvcf
Boolean making_gvcf = select_first([make_gvcf,true])
String? gatk_docker_override
String gatk_docker = select_first([gatk_docker_override, "broadinstitute/gatk:4.0.6.0"])
String? gatk_path_override
String gatk_path = select_first([gatk_path_override, "/gatk/gatk"])
String? gitc_docker_override
String gitc_docker = select_first([gitc_docker_override, "broadinstitute/genomes-in-the-cloud:2.3.1-1500064817"])
Array[File] scattered_calling_intervals = read_lines(scattered_calling_intervals_list)
Int ploidy
#is the input a cram file?
Boolean is_cram = sub(basename(input_bam), ".*\\.", "") == "cram"
String sample_basename = if is_cram then basename(input_bam, ".cram") else basename(input_bam, ".bam")
String vcf_basename = sample_basename
String output_suffix = if making_gvcf then ".g.vcf.gz" else ".vcf.gz"
String output_filename = vcf_basename + output_suffix
# We need disk to localize the sharded input and output due to the scatter for HaplotypeCaller.
# If we take the number we are scattering by and reduce by 20 we will have enough disk space
# to account for the fact that the data is quite uneven across the shards.
Int potential_hc_divisor = length(scattered_calling_intervals) - 20
Int hc_divisor = if potential_hc_divisor > 1 then potential_hc_divisor else 1
if ( is_cram ) {
call CramToBamTask {
input:
input_cram = input_bam,
sample_name = sample_basename,
ref_dict = ref_dict,
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index,
docker = gitc_docker
}
}
# Call variants in parallel over grouped calling intervals
scatter (interval_file in scattered_calling_intervals) {
# Generate GVCF by interval
call HaplotypeCaller {
input:
input_bam = select_first([CramToBamTask.output_bam, input_bam]),
input_bam_index = select_first([CramToBamTask.output_bai, input_bam_index]),
interval_list = interval_file,
output_filename = output_filename,
ref_dict = ref_dict,
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index,
hc_scatter = hc_divisor,
make_gvcf = making_gvcf,
docker = gatk_docker,
gatk_path = gatk_path,
ploidy = ploidy
}
}
# Merge per-interval GVCFs
call MergeGVCFs {
input:
input_gvcfs = HaplotypeCaller.output_gvcf,
input_gvcfs_indexes = HaplotypeCaller.output_gvcf_index,
output_filename = output_filename,
docker = gatk_docker,
gatk_path = gatk_path
}
# Outputs that will be retained when execution is complete
output {
File output_gvcf = MergeGVCFs.output_gvcf
File output_gvcf_index = MergeGVCFs.output_gvcf_index
}
}
# TASK DEFINITIONS
task CramToBamTask {
# Command parameters
File ref_fasta
File ref_fasta_index
File ref_dict
File input_cram
String sample_name
# Runtime parameters
String docker
Int? machine_mem_gb
Int? disk_space_gb
Boolean use_ssd = false
Int? preemptible_attempts
Float output_bam_size = size(input_cram, "GB") / 0.60
Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
Int disk_size = ceil(size(input_cram, "GB") + output_bam_size + ref_size) + 20
command {
set -e
set -o pipefail
samtools view -h -T ${ref_fasta} ${input_cram} |
samtools view -b -o ${sample_name}.bam -
samtools index -b ${sample_name}.bam
mv ${sample_name}.bam.bai ${sample_name}.bai
}
runtime {
docker: docker
memory: select_first([machine_mem_gb, 15]) + " GB"
disks: "local-disk " + select_first([disk_space_gb, disk_size]) + if use_ssd then " SSD" else " HDD"
preemptible: select_first([preemptible_attempts, 3])
}
output {
File output_bam = "${sample_name}.bam"
File output_bai = "${sample_name}.bai"
}
}
# HaplotypeCaller per-sample in GVCF mode
task HaplotypeCaller {
String input_bam
String input_bam_index
String interval_list
String output_filename
File ref_dict
File ref_fasta
File ref_fasta_index
Boolean make_gvcf
Int hc_scatter
Int ploidy
String gatk_path
String? java_options
String java_opt = select_first([java_options, "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10"])
# Runtime parameters
String docker
Int? mem_gb
Int? disk_space_gb
Boolean use_ssd = false
Int? preemptible_attempts
Int machine_mem_gb = select_first([mem_gb, 7])
Int command_mem_gb = machine_mem_gb - 1
Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
Int disk_size = ceil(((size(input_bam, "GB") + 30) / hc_scatter) + ref_size) + 20
command <<<
set -e
${gatk_path} --java-options "-Xmx${command_mem_gb}G ${java_opt}" \
HaplotypeCaller \
-R ${ref_fasta} \
-I ${input_bam} \
-L ${interval_list} \
--disable-read-filter NotDuplicateReadFilter \
--pcr-indel-model NONE \
--sample-ploidy ${default = 20 ploidy} \
--max-alternate-alleles 3 \
--min-pruning 3 \
--max-genotype-count 10000 \
--read-filter OverclippedReadFilter \
-G StandardAnnotation \
-G AS_StandardAnnotation \
-G StandardHCAnnotation \
-new-qual \
-ERC GVCF \
-O ${output_filename}
>>>
runtime {
docker: docker
memory: machine_mem_gb + " GB"
disks: "local-disk " + select_first([disk_space_gb, disk_size]) + if use_ssd then " SSD" else " HDD"
preemptible: select_first([preemptible_attempts, 3])
}
output {
File output_gvcf = "${output_filename}"
File output_gvcf_index = "${output_filename}.tbi"
}
}
# Merge GVCFs generated per-interval for the same sample
task MergeGVCFs {
Array[File] input_gvcfs
Array[File] input_gvcfs_indexes
String output_filename
String gatk_path
# Runtime parameters
String docker
Int? mem_gb
Int? disk_space_gb
Boolean use_ssd = false
Int? preemptible_attempts
Int machine_mem_gb = select_first([mem_gb, 3])
Int command_mem_gb = machine_mem_gb - 1
command <<<
set -e
${gatk_path} --java-options "-Xmx${command_mem_gb}G" \
MergeVcfs \
--INPUT ${sep=' --INPUT ' input_gvcfs} \
--OUTPUT ${output_filename}
>>>
runtime {
docker: docker
memory: machine_mem_gb + " GB"
disks: "local-disk " + select_first([disk_space_gb, 100]) + if use_ssd then " SSD" else " HDD"
preemptible: select_first([preemptible_attempts, 3])
}
output {
File output_gvcf = "${output_filename}"
File output_gvcf_index = "${output_filename}.tbi"
}
}
```
Hi @cmt
The status":404 isn't a tool error. Would it be possible to share your workspace with [email protected]?
I shared my workspace with [email protected].
I moved on from trying to get MergeVCFs to work, and have been troubleshooting using CatVariants to combine all the gvcf files that were generated by HaplotypeCaller. Though MergeVCFs didn't work, the HC command did finish for the sample and produced 8310 gvcf files, one for each of my linkage groups or scaffolds.
Because the syntax for CatVariants is so strange (//software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_CatVariants.php), I am having a lot of trouble getting it to work.
There are a lot of errors each time I try to run it, and I adjust something in the code each time. The main issue is that I don't understand how to run the command on FireCloud if it is bypassing the GATK engine (this part of the command seems to throw errors: java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants ).
The second issue is that there seems to be a limit to the number of files you can combine, though the documentation says there isn't. I always get this error, though I am trying to combine subsets of the gvcf files, not all 8310:
/cromwell_root/script: line 24: /gatk/gatk: Argument list too long
How can I combine my 8310 gvcf files that I have already generated, and, going forward, how do I get the MergeVCFs command to work in the example GATK pipeline posted on GitHub (haplotypecaller-gvcf-gatk4-nio.wdl)?
Thank you for your help!
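(A possible workaround for the "Argument list too long" error above: that message comes from the shell's limit on total command-line length, not from MergeVcfs itself. If I'm reading the Picard documentation correctly, MergeVcfs also accepts a single file ending in .list containing one input path per line, so the 8310 paths never have to be spelled out on the command line. A minimal sketch, with placeholder paths and output name:)
```
# Hypothetical sketch: pass one .list file instead of thousands of --INPUT flags.
# Assumes the per-scaffold GVCFs have already been localized under ./gvcfs/ and that
# MergeVcfs accepts a '.list' file of paths (worth verifying on a small test first).
ls gvcfs/*.g.vcf.gz > gvcfs.list

/gatk/gatk --java-options "-Xmx2G" \
  MergeVcfs \
  --INPUT gvcfs.list \
  --OUTPUT merged.g.vcf.gz
```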
I have taken another look at the original issue that I was having with MergeVCFs, and I think that I probably did not give it enough time to run considering the number of gvcf files that it was merging. When I looked again at the workflow log file, I can see that for two days the job results were being received, and occasionally the HaplotypeCaller was restarted:
2018-12-19 16:24:42,329 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(2262), attempt 3)
2018-12-19 16:24:43,353 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Restarting HaplotypeCallerGvcf_GATK4.HaplotypeCaller
2018-12-19 16:24:49,975 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(15), attempt 3)
2018-12-19 16:24:53,674 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41
Having the HaplotypeCaller restart so often is what threw me off. In the monitor tab of FireCloud it just showed that for days the HaplotypeCaller was being restarted, so I finally aborted the run.
Right before I aborted the run I think it was starting to try MergeVCFs:
2018-12-19 16:25:03,124 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Restarting HaplotypeCallerGvcf_GATK4.MergeGVCFs
2018-12-19 16:26:04,335 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: job id: operations/EOHcnZX7LBiZtoL9geuiBCCT043OohUqD3Byb2R1Y3Rpb25RdWV1ZQ
2018-12-19 16:26:19,692 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: Status change from - to Initializing
If that is the case, then I have re-started this WDL workflow on another sample. I'm going to let it run for a week and see if it finishes running.
Is there a way to improve the run time without racking up huge costs? I'm obviously very new to GCP and FireCloud, and do not have a good sense of how to change these parameters.
Thank you!
The HaplotypeCaller runs were most likely preempted
Good call, this job is very large and will require a considerable amount of time.
One way I can think of that isn't already implemented in the haplotypecaller-gvcf-gatk4.wdl is the use of NIO ("protocol called NIO that allows us to only retrieve the piece of the file we care about for a given task, which neatly blows away the localization problem and delivers substantial savings" link). You can alter aspects of the workflow to improve run time while sacrificing cost, such as decreasing preemption or the amount of scattering, and then offset that increase in cost by using NIO. This is more of a theory than an answer. There is already an NIO version of the HaplotypeCaller workflow available in the git repo you mentioned, called haplotypecaller-gvcf-gatk4-nio.wdl
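To make the NIO idea concrete: in the WDL, the essential difference is that the large inputs are declared as String (a gs:// path that GATK streams from directly) rather than File (which Cromwell copies onto the VM's disk before the task starts). A rough sketch of the pattern, not the exact published workflow; the memory, disk and docker values are placeholders:
```
# Rough sketch of the NIO pattern; not the published haplotypecaller-gvcf-gatk4-nio.wdl.
# Declaring input_bam as String means the gs:// path is handed straight to GATK,
# which streams only the regions it needs instead of localizing the whole BAM.
task HaplotypeCallerNio {
  String input_bam        # gs:// path, streamed via NIO, never localized
  String input_bam_index  # gs:// path to the matching index
  File ref_fasta          # the reference is small enough to localize normally
  File ref_fasta_index
  File ref_dict
  File interval_list
  String output_filename
  String gatk_path

  command {
    ${gatk_path} --java-options "-Xmx6G" \
      HaplotypeCaller \
      -R ${ref_fasta} \
      -I ${input_bam} \
      -L ${interval_list} \
      -ERC GVCF \
      -O ${output_filename}
  }
  runtime {
    docker: "broadinstitute/gatk:4.0.6.0"
    memory: "7 GB"
    disks: "local-disk 50 HDD"   # disk only needs to hold the reference and the output GVCF
  }
  output {
    File output_gvcf = "${output_filename}"
    File output_gvcf_index = "${output_filename}.tbi"
  }
}
```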
One thing I noticed is the large amount of scattering taking place during an individual run. The workflow bases the number of times the HaplotypeCaller task is scattered on the number of intervals in the provided interval list. I can't view your interval list, but I assume it has over 3000 intervals. This is one way of scattering the HaplotypeCaller task, but in this case it can be overwhelming when thousands of intervals are used to spin up thousands of jobs. Another way of scattering the task is to evenly divide the intervals into a set number of groups; this is done in the five-dollar-genome-analysis-pipeline using the Utils.ScatterIntervalList task. You could incorporate that script into your method, so that you aren't running 3000+ scatter jobs.
The FireCloud run that I started last week finally finished, but with an error. I was running just HaplotypeCaller (which finished with no errors) and then MergeVCFs, which finished with the error: message: The job was aborted from outside Cromwell
The workflow.log has just these couple of lines that relate to the MergeVCF command:
2019-01-11 15:33:40,696 INFO - WorkflowExecutionActor-63ac6674-20cb-4a1b-9f25-b3b4caaea7aa [UUID(63ac6674)]: Restarting HaplotypeCallerGvcf_GATK4.MergeGVCFs
2019-01-11 15:34:00,836 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(63ac6674)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: job id: operations/EOLDvvqBLRikqNislMm02YcBIJPTjc6iFSoPcHJvZHVjdGlvblF1ZXVl
2019-01-11 15:34:08,767 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(63ac6674)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: Status change from - to Initializing
2019-01-11 19:38:44,690 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(63ac6674)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: Status change from Initializing to Cancelled
I did not cancel the run; do you think that it was cancelled because I reached some firecloud or GCP limit, maybe because there were 8310 gvcf files to merge and after a certain point the job was automatically cancelled? Do you have any ideas of what could have caused the run to cancel?
I'm going to try using the Utils.ScatterIntervalList task from the five-dollar-genome-analysis-pipeline for the next pool to reduce the number of gvcf files that need to be merged and hopefully to avoid this in the future. But I now have two pools that have 8310 gvcf files that need to be merged into one. Can you recommend a way to do that?
Thank you so much for all your help!
This was due to a GCP limit: GCP will cancel a job if it takes longer than 6 days. The message is hidden in the operations log.
Operation:operations/EOLDvvqBLRikqNislMm02YcBIJPTjc6iFSoPcHJvZHVjdGlvblF1ZXVl
Another tool for merging GVCFs would be GenomicsDBImport, but try testing your updated WDL with the Utils.ScatterIntervalList task first. You can then determine whether it's worth spending time on GenomicsDBImport or just rerunning the data with the updated WDL.
I think I can't use the GenomicsDBImport because I have a ploidy of 20, which is a shame because it definitely seems like the easiest route.
Looking at the Utils.ScatterIntervalList task I was confused about what it was actually doing, and how best to apply it to my job. If I have 8310 scaffolds (I can't stitch them together because I have the aligned sequence bam files, not the raw reads to be able to realign to an altered version of the genome) do I use the intervals list (-L) in the HaplotypeCaller command and then also the task Utils.ScatterIntervalList? That seems like Utils.ScatterIntervalList takes my intervals and splits them up even further which does not solve the problem of how to combine all 8310 resulting gvcfs after HaplotypeCaller, but I might just not be understanding it. There is no way to have intervals that span multiple scaffolds, right? So could I use no intervals (-L) in HaplotypeCaller but then scatter the job with Utils.ScatterIntervalList ? And if I go that route, I still need to provide Utils.ScatterIntervalList with a list of intervals, which would be my 8310 scaffolds, right? And that is OK, because it can hopefully gather all the 8310 gvcf files together using the Utils.ScatterIntervalList instead of with MergeVCFs which times out.
I hope that question makes sense!
Thanks!
@cmt another option might be to run the samples through HaplotypeCaller and then run GenotypeGVCFs? GenotypeGVCFs does not require a ploidy specification.
Here is another thread about poolseq data that had a similar issue. What type of data and organism are you working with?
I would like to use GenotypeGVCF, and I have in some tests that were smaller than these full pools that I am trying to run through the pipeline. The issue that I run into is that GenotypeGVCF can take multiple GVCF files, but aren't they each intended to be different discrete samples? I have 8310 files from one sample (one pool of 50 individual fish) that I want to combine into 1 gvcf file for that sample before using GenotypeGVCF to genotype 12 samples (12 gvcf files; 12 pools of 50 fish each) at once. Combining all 8310 gvcf files is where I run into trouble- MergeVCFs times out before it can finish, though it works if I do smaller tests (ie 30 scaffolds instead of all 8310).
I can't reduce the number of scaffolds that I have, so I have to figure out either a hierarchical way of running HaplotypeCaller to produce fewer GVCF files per sample, or some way to combine all the GVCF files from one sample before moving on to the GenotypeGVCFs with multiple samples, maybe a hierarchical way of running MergeVCFS. Do you have any ideas?
Thanks!
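(One hedged idea for the two pools that already have 8310 per-scaffold GVCFs: a two-level, hierarchical merge, i.e. merge in batches of a few hundred files and then merge the batch outputs. A rough bash sketch of that idea follows; the directory layout, the batch size of 500 and the output names are made up, and it assumes MergeVcfs both accepts a '.list' file of input paths and merge-sorts records by position regardless of which batch a contig lands in, so it's worth confirming on a small subset first.)
```
# Hypothetical two-level merge so that no single MergeVcfs call sees all 8310 files.
set -e

# Level 0: list all per-scaffold GVCFs and split the list into batches of 500 paths.
ls gvcfs/*.g.vcf.gz > all_gvcfs.txt
split -l 500 all_gvcfs.txt batch_

# Level 1: merge each batch into one intermediate GVCF.
for b in batch_*; do
  mv "$b" "$b.list"
  /gatk/gatk --java-options "-Xmx2G" MergeVcfs \
    --INPUT "$b.list" \
    --OUTPUT "$b.merged.g.vcf.gz"
done

# Level 2: merge the ~17 intermediate GVCFs into the final per-pool GVCF.
ls batch_*.merged.g.vcf.gz > merged_batches.list
/gatk/gatk --java-options "-Xmx2G" MergeVcfs \
  --INPUT merged_batches.list \
  --OUTPUT pool.merged.g.vcf.gz
```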
Regarding your earlier question about ScatterIntervalList: the task uses IntervalListTools (Picard) "to scatter the input interval list into scatter_count sub interval lists". Meaning the input interval list you provided is broken down into sub interval list files (the number of files is determined by the scatter_count variable), each containing a subset of the intervals from the input interval list. This set of sub interval list files is then used in the HaplotypeCaller scatter call, so that each scatter shard uses one sub interval list file to run its own job.
This is different from the way the workflow is currently being run. It looks like your workflow is scattering the HaplotypeCaller call based on each line in the interval list, and your interval list has 8k lines, hence the 8k scattered jobs.
Note, scattering the HaplotypeCaller call is used to parallelize the workflow; it isn't required. The workflow can simply run so that the HaplotypeCaller task is called once using one interval list, instead of being called multiple times, once for each line in the interval list. Not scattering the call would also mean you wouldn't have to use MergeGVCFs, since the workflow would be executing one HaplotypeCaller task creating one GVCF.
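To make the grouped-interval idea concrete, here is a rough WDL sketch loosely modeled on the Utils.ScatterIntervalList task from the five-dollar-genome-analysis-pipeline. The IntervalListTools flags are written from memory of the Picard documentation, and the tool expects a Picard-style .interval_list file (with a sequence dictionary header) rather than a bare list of scaffold names, so treat this as a starting point rather than the published task:
```
task ScatterIntervalList {
  File interval_list   # Picard-style .interval_list covering all scaffolds
  Int scatter_count    # e.g. 50 balanced groups instead of 8310 single-scaffold shards
  String gatk_path
  String docker

  command <<<
    set -e
    mkdir out
    ${gatk_path} --java-options "-Xmx1G" \
      IntervalListTools \
      --SCATTER_COUNT ${scatter_count} \
      --SUBDIVISION_MODE BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
      --UNIQUE true \
      --SORT true \
      --INPUT ${interval_list} \
      --OUTPUT out
  >>>
  runtime {
    docker: docker
    memory: "2 GB"
    disks: "local-disk 10 HDD"
    preemptible: 3
  }
  output {
    # One sub interval list per group; the workflow-level scatter loops over these
    # instead of over read_lines(scattered_calling_intervals_list).
    Array[File] out = glob("out/*/*.interval_list")
  }
}
```
With something like this in place, the scatter block would iterate over ScatterIntervalList.out, so the number of HaplotypeCaller jobs (and the number of GVCFs handed to MergeGVCFs) is scatter_count rather than 8310.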
I will restructure how I'm running HaplotypeCaller and give it a shot without the additional scattering and without using MergeVCFs and see if that solves my problem.
Not to get too ahead of myself, but I think this is the answer to my issues- thanks :)
Good luck! Feel free to let us know how it went, this is helpful information for other forum users.
I have a couple more questions about optimizing my HC run, thank you for all your help so far!
I removed the scatter that had HaplotypeCaller running a task for each of the lines of my interval list (8310) and instead have it running on a test of 6 intervals, using -L in the body of the HC command, with no additional scattering/parallelization. It has been running for more than 30 hours because it was preempted a couple of times and had to restart from the beginning of the HC task after each preemption. It looks like maybe it just restarted again. I'm thinking of running the HC command without the preemptible allowance, because I think I am just wasting time allowing the job to be preempted.
But I also wonder if I am not running HC on a fast or powerful enough VM on Google. How do I know whether to increase the memory or the CPUs of the job thoughtfully- that is without asking for more than I need and wasting money? I have asked for 7GB of memory in the wdl, but no explicit number of CPUs. Maybe another clue that my wdl isn't optimized for this task is that the HaplotypeCaller.log has these lines:
20:44:44.919 INFO IntelPairHmm - Available threads: 2
20:44:44.919 INFO IntelPairHmm - Requested threads: 4
20:44:44.920 WARN IntelPairHmm - Using 2 available threads, but 4 were requested
In addition to optimizing the VM I'm running HC on in terms of computing power and memory, I want to optimize the time it takes to run HC using some sort of parallelization, because I think HC is stalling on the large scaffolds since they are so big. So the other question I had is whether you think this new plan is a good idea: I would run HaplotypeCaller on the 24 largest scaffolds I have using the original scatter method (one HC task for each line in the interval file; 24 scaffolds in the interval file), probably on preemptible VMs to save money, and then merge the resulting 24 gvcf files to produce one gvcf. I would run the remaining 8286 very, very small scaffolds using the non-scattered protocol (just a wdl of HC with -L for the 8286 scaffolds within the body of the command) to produce one gvcf file for all of those small scaffolds. Then I would use MergeVCFs on the two gvcf files to get one gvcf that contains all 8310 scaffolds.
What do you think?
Thanks!!
Be aware it will increase the cost of your run.
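For reference, in this WDL that just means making the HaplotypeCaller task's runtime block resolve to zero preemptible attempts, either by setting the optional preemptible_attempts input to 0 in the inputs JSON or by hard-coding it in the task. A sketch of the hard-coded version, reusing the runtime lines from the WDL posted above:
```
runtime {
  docker: docker
  memory: machine_mem_gb + " GB"
  disks: "local-disk " + select_first([disk_space_gb, disk_size]) + if use_ssd then " SSD" else " HDD"
  preemptible: 0   # 0 preemptible attempts = run on a regular (non-preemptible) VM
}
```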
Good question. You could add a few commands to the task command block that write the system resources being used to a log file, then use this file to determine whether the task is being limited by CPU, memory, etc. (FYI: you can adjust the CPU count.)
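A rough sketch of what that could look like at the top of the HaplotypeCaller command block; the 60-second interval and the monitoring.log name are arbitrary, and it assumes the usual free/uptime/df utilities are present in the container image:
```
# Hypothetical resource monitor for the HaplotypeCaller task: started in the background
# just before the gatk call, it appends a timestamped snapshot of memory, CPU load and
# disk usage to monitoring.log every 60 seconds.
(
  while true; do
    echo "=== $(date) ===" >> monitoring.log
    free -m >> monitoring.log                # memory use in MB
    uptime >> monitoring.log                 # load averages vs. the number of CPUs
    df -h /cromwell_root >> monitoring.log   # disk use of the working volume
    sleep 60
  done
) &
MONITOR_PID=$!

# ...the existing ${gatk_path} HaplotypeCaller command from the task goes here, unchanged...

# Optional: stop the monitor once HaplotypeCaller has finished.
kill $MONITOR_PID
```
Adding monitoring.log to the task's output block (e.g. File monitoring_log = "monitoring.log") would make Cromwell copy it back to the bucket so it can be inspected after the run.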
This sounds good to me and well thought out.