
Does MergeVcfs have a limit to how many gVCF files it can merge?

cmt · Seattle, WA · Member
I am running the GATK Best Practices pipeline on FireCloud on 12 pooled WGS samples (50 individuals per barcode, one barcode per pool, but I'm using a ploidy of 20), using GATK4 and a WDL adapted from the examples on GitHub: github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/haplotypecaller-gvcf-gatk4.wdl

I have tried running haplotypecaller-gvcf-gatk4.wdl on one pool (I added arguments to the HaplotypeCaller command to adjust for our data, and -L with a master list of all the scaffolds in our reference genome). The HaplotypeCaller command works and finishes, but the MergeVcfs command still hasn't finished after 5 days. I have 9,000 scaffolds that each have a gVCF file after HaplotypeCaller; is there some limit to the number of gVCF files that MergeVcfs can combine?

I would like to have one gVCF per pool after HaplotypeCaller, so that I can use joint genotyping to get a final VCF covering all 12 pools.

Is there a better tool, or a different way to combine the gVCFs after HaplotypeCaller, to get a single gVCF per pool for joint genotyping? I have seen forum posts from 2015 saying that CatVariants might be a good solution, but I don't know whether it is appropriate for this pipeline.

Thank you!

Answers

  • AdelaideR · Member, Admin

    Hello @cmt - Are you setting the -ploidy option with HaplotypeCaller?

    Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
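
    For illustration, a pool of 50 diploid individuals would give 50 * 2 = 100. A minimal sketch of passing that to HaplotypeCaller (the file names here are placeholders, not taken from this thread):

    ```
    # Illustration only: 50 diploid individuals per pool -> --sample-ploidy 100.
    # ref.fasta / pool1.bam / pool1.g.vcf.gz are placeholder names.
    /gatk/gatk HaplotypeCaller \
        -R ref.fasta \
        -I pool1.bam \
        --sample-ploidy 100 \
        -ERC GVCF \
        -O pool1.g.vcf.gz
    ```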

    It would be helpful to see a copy of the error log from the run as well as the exact command used.

  • cmt · Seattle, WA · Member
    edited December 2018
    Hi @AdelaideR
    Thank you for helping me out! I did use -ploidy 20 in my HaplotypeCaller command. The true ploidy is between 92 and 96 per pool, but what I have learned from other posts on the forum is that anything more than 21 is an overwhelming computational load. I might switch to ploidy 10 if that would help get genotypes.

    The WDL that I used is the last and largest chunk of code at the bottom of this post. It is basically the exact one posted on GitHub (haplotypecaller-gvcf-gatk4-nio.wdl), but I added more arguments to HaplotypeCaller, and I also changed the names of the input and output files for the MergeVcfs command. (Maybe my messing with the names of the inputs and outputs is what went wrong?)

    There was no explicit error message for MergeVcfs not working, but when I try to look at the error logs, I get this message:

    ``` {"status":404,"response":{"req":{"method":"HEAD","url":"[edited out link]//fc-6ebc1ade-bf3e-4d51-a270-eebefb480ded.storage.googleapis.com/4b7942b4-f236-41ae-9c16-9347dc4fabf9%2FHaplotypeCallerGvcf_GATK4%2F78499ac8-3480-46d6-b0b7-3555480e4f41%2Fcall-MergeGVCFs%2FMergeGVCFs.log","headers":{"user-agent":"node-superagent/3.8.3","authorization":"Bearer ya29.c.Elp4BjkUA8I_aShpEVb1fEFJ0D6_t6gr58ryQ9hRkRRf_--0zPyBsBezV7VT7AJqqkAwV_hYIfLy7bTPxSdBkTms3UNY1DPjwev4I9Cl677ZHuyftSseCX5x1NE"}},"header":{"x-guploader-uploadid":"AEnB2Upy2ocbjeKYf9P5UoH895oaZ8WSP4kNJwf4jsa2KJ8kkQa1eLlmaZedpUZF1qVEd3Iyl-2Tsn4Dq4dQEbnwXhKZOcPvJfSRRUEUG5p5u_oJnZJlCl0","content-type":"application/xml; charset=UTF-8","content-length":"332","date":"Thu, 20 Dec 2018 17:09:36 GMT","expires":"Thu, 20 Dec 2018 17:09:36 GMT","cache-control":"private, max-age=0","server":"UploadServer","alt-svc":"quic=\":443\"; ma=2592000; v=\"44,43,39,35\"","connection":"close"},"status":404}```

    There was a workflow log in the Google bucket that is quite large, but this little portion shows that it finished the HaplotypeCaller command and started working on the MergeVcfs portion:

    ```2018-12-15 16:00:27,100 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.HaplotypeCaller:5:4]: Status change from Running to Success
    2018-12-15 16:16:06,041 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.HaplotypeCaller:2:2]: Status change from Running to Success
    2018-12-15 16:16:11,979 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Starting HaplotypeCallerGvcf_GATK4.MergeGVCFs
    2018-12-15 16:16:13,316 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: `set -e

    /gatk/gatk --java-options "-Xmx2G" \
    MergeVcfs \
    --INPUT /cromwell_root/fc-6ebc1ade-bf3e-4d51-a270-eebefb480ded/4b7942b4-f236-41ae-9c16-9347dc4fabf9/HaplotypeCallerGvcf_GATK4/78499ac8-3480-46d6-b0b7-3555480e4f41/call-HaplotypeCaller/shard-0/301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz --INPUT /cromwell_root/fc-6ebc1ade-bf3e-4d51-a270-eebefb480ded/4b7942b4-f236-41ae-9c16-9347dc4fabf9/HaplotypeCallerGvcf_GATK4/78499ac8-3480-46d6-b0b7-3555480e4f41/call-HaplotypeCaller/shard-1/301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz ```

    It never finished MergeVcfs; after 4 days I aborted it, and there were no files in the Google bucket except for the script. These are the last couple of lines in the workflow.log before I aborted the MergeVcfs:

    ```'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(8), attempt 4)
    2018-12-19 16:24:53,900 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(5593), attempt 3)
    2018-12-19 16:24:53,992 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(21), attempt 4)
    2018-12-19 16:25:00,054 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(5), attempt 4)
    2018-12-19 16:25:03,124 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Restarting HaplotypeCallerGvcf_GATK4.MergeGVCFs
    2018-12-19 16:26:04,335 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: job id: operations/EOHcnZX7LBiZtoL9geuiBCCT043OohUqD3Byb2R1Y3Rpb25RdWV1ZQ
    2018-12-19 16:26:19,692 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: Status change from - to Initializing
    2018-12-19 17:49:49,825 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Aborting workflow```

    This is the full WDL for this run of HaplotypeCaller and MergeVcfs:

    ```
    # WORKFLOW DEFINITION

    workflow HaplotypeCallerGvcf_GATK4 {
      File input_bam
      File input_bam_index
      File ref_dict
      File ref_fasta
      File ref_fasta_index
      File scattered_calling_intervals_list

      Boolean? make_gvcf
      Boolean making_gvcf = select_first([make_gvcf, true])

      String? gatk_docker_override
      String gatk_docker = select_first([gatk_docker_override, "broadinstitute/gatk:4.0.6.0"])
      String? gatk_path_override
      String gatk_path = select_first([gatk_path_override, "/gatk/gatk"])
      String? gitc_docker_override
      String gitc_docker = select_first([gitc_docker_override, "broadinstitute/genomes-in-the-cloud:2.3.1-1500064817"])

      Array[File] scattered_calling_intervals = read_lines(scattered_calling_intervals_list)

      Int ploidy

      # Is the input a CRAM file?
      Boolean is_cram = sub(basename(input_bam), ".*\\.", "") == "cram"

      String sample_basename = if is_cram then basename(input_bam, ".cram") else basename(input_bam, ".bam")
      String vcf_basename = sample_basename
      String output_suffix = if making_gvcf then ".g.vcf.gz" else ".vcf.gz"
      String output_filename = vcf_basename + output_suffix

      # We need disk to localize the sharded input and output due to the scatter for HaplotypeCaller.
      # If we take the number we are scattering by and reduce by 20 we will have enough disk space
      # to account for the fact that the data is quite uneven across the shards.
      Int potential_hc_divisor = length(scattered_calling_intervals) - 20
      Int hc_divisor = if potential_hc_divisor > 1 then potential_hc_divisor else 1

      if ( is_cram ) {
        call CramToBamTask {
          input:
            input_cram = input_bam,
            sample_name = sample_basename,
            ref_dict = ref_dict,
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            docker = gitc_docker
        }
      }

      # Call variants in parallel over grouped calling intervals
      scatter (interval_file in scattered_calling_intervals) {

        # Generate GVCF by interval
        call HaplotypeCaller {
          input:
            input_bam = select_first([CramToBamTask.output_bam, input_bam]),
            input_bam_index = select_first([CramToBamTask.output_bai, input_bam_index]),
            interval_list = interval_file,
            output_filename = output_filename,
            ref_dict = ref_dict,
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            hc_scatter = hc_divisor,
            make_gvcf = making_gvcf,
            docker = gatk_docker,
            gatk_path = gatk_path,
            ploidy = ploidy
        }
      }

      # Merge per-interval GVCFs
      call MergeGVCFs {
        input:
          input_gvcfs = HaplotypeCaller.output_gvcf,
          input_gvcfs_indexes = HaplotypeCaller.output_gvcf_index,
          output_filename = output_filename,
          docker = gatk_docker,
          gatk_path = gatk_path
      }

      # Outputs that will be retained when execution is complete
      output {
        File output_gvcf = MergeGVCFs.output_gvcf
        File output_gvcf_index = MergeGVCFs.output_gvcf_index
      }
    }

    # TASK DEFINITIONS

    task CramToBamTask {
      # Command parameters
      File ref_fasta
      File ref_fasta_index
      File ref_dict
      File input_cram
      String sample_name

      # Runtime parameters
      String docker
      Int? machine_mem_gb
      Int? disk_space_gb
      Boolean use_ssd = false
      Int? preemptible_attempts

      Float output_bam_size = size(input_cram, "GB") / 0.60
      Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
      Int disk_size = ceil(size(input_cram, "GB") + output_bam_size + ref_size) + 20

      command {
        set -e
        set -o pipefail

        samtools view -h -T ${ref_fasta} ${input_cram} |
        samtools view -b -o ${sample_name}.bam -
        samtools index -b ${sample_name}.bam
        mv ${sample_name}.bam.bai ${sample_name}.bai
      }
      runtime {
        docker: docker
        memory: select_first([machine_mem_gb, 15]) + " GB"
        disks: "local-disk " + select_first([disk_space_gb, disk_size]) + if use_ssd then " SSD" else " HDD"
        preemptible: select_first([preemptible_attempts, 3])
      }
      output {
        File output_bam = "${sample_name}.bam"
        File output_bai = "${sample_name}.bai"
      }
    }

    # HaplotypeCaller per-sample in GVCF mode
    task HaplotypeCaller {
      String input_bam
      String input_bam_index
      String interval_list
      String output_filename
      File ref_dict
      File ref_fasta
      File ref_fasta_index
      Boolean make_gvcf
      Int hc_scatter
      Int ploidy

      String gatk_path
      String? java_options
      String java_opt = select_first([java_options, "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10"])

      # Runtime parameters
      String docker
      Int? mem_gb
      Int? disk_space_gb
      Boolean use_ssd = false
      Int? preemptible_attempts

      Int machine_mem_gb = select_first([mem_gb, 7])
      Int command_mem_gb = machine_mem_gb - 1

      Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
      Int disk_size = ceil(((size(input_bam, "GB") + 30) / hc_scatter) + ref_size) + 20

      command <<<
        set -e

        ${gatk_path} --java-options "-Xmx${command_mem_gb}G ${java_opt}" \
          HaplotypeCaller \
          -R ${ref_fasta} \
          -I ${input_bam} \
          -L ${interval_list} \
          --disable-read-filter NotDuplicateReadFilter \
          --pcr-indel-model NONE \
          --sample-ploidy ${default = 20 ploidy} \
          --max-alternate-alleles 3 \
          --min-pruning 3 \
          --max-genotype-count 10000 \
          --read-filter OverclippedReadFilter \
          -G StandardAnnotation \
          -G AS_StandardAnnotation \
          -G StandardHCAnnotation \
          -new-qual \
          -ERC GVCF \
          -O ${output_filename}
      >>>

      runtime {
        docker: docker
        memory: machine_mem_gb + " GB"
        disks: "local-disk " + select_first([disk_space_gb, disk_size]) + if use_ssd then " SSD" else " HDD"
        preemptible: select_first([preemptible_attempts, 3])
      }

      output {
        File output_gvcf = "${output_filename}"
        File output_gvcf_index = "${output_filename}.tbi"
      }
    }

    # Merge GVCFs generated per-interval for the same sample
    task MergeGVCFs {
      Array[File] input_gvcfs
      Array[File] input_gvcfs_indexes
      String output_filename

      String gatk_path

      # Runtime parameters
      String docker
      Int? mem_gb
      Int? disk_space_gb
      Boolean use_ssd = false
      Int? preemptible_attempts

      Int machine_mem_gb = select_first([mem_gb, 3])
      Int command_mem_gb = machine_mem_gb - 1

      command <<<
        set -e

        ${gatk_path} --java-options "-Xmx${command_mem_gb}G" \
          MergeVcfs \
          --INPUT ${sep=' --INPUT ' input_gvcfs} \
          --OUTPUT ${output_filename}
      >>>

      runtime {
        docker: docker
        memory: machine_mem_gb + " GB"
        disks: "local-disk " + select_first([disk_space_gb, 100]) + if use_ssd then " SSD" else " HDD"
        preemptible: select_first([preemptible_attempts, 3])
      }

      output {
        File output_gvcf = "${output_filename}"
        File output_gvcf_index = "${output_filename}.tbi"
      }
    }
    ```
  • bshifaw · Member, Broadie, Moderator, Admin
    edited January 3

    Hi @cmt

    The "status":404 isn't a tool error. Would it be possible to share your workspace with [email protected]?

  • cmt · Seattle, WA · Member
    Hi @bshifaw

    I shared my workspace with [email protected]

    I moved on from trying to get MergeVcfs to work, and have been troubleshooting CatVariants to combine all the gVCF files that were generated by HaplotypeCaller. Though MergeVcfs didn't work, the HaplotypeCaller command did finish for the sample, and produced 8310 gVCF files, one for each of my linkage groups or scaffolds.

    Because the syntax for CatVariants is so strange (//software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_CatVariants.php), I am having a lot of trouble getting it to work.

    There are a lot of errors each time I try to run it, and I adjust something in the code each time. The main issue is that I don't understand how to run the command on FireCloud if it bypasses the GATK engine (this part of the command seems to throw errors: java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants).

    The second issue is that there seems to be a limit to the number of files you can combine, though the documentation says there isn't. I always get this error, even though I am trying to combine subsets of the gVCF files, not all 8310:

    ```
    /cromwell_root/script: line 24: /gatk/gatk: Argument list too long
    ```
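
    For what it's worth, "Argument list too long" is the shell's limit on total command-line length, not a limit in the tool itself. A minimal sketch of one possible workaround, assuming GATK4's --arguments_file option and a hypothetical gvcfs.list file containing one gVCF path per line:

    ```
    # Hypothetical workaround sketch: pass the --INPUT arguments via a file
    # instead of on the command line, to stay under the shell's ARG_MAX limit.
    sed 's/^/--INPUT /' gvcfs.list > merge.args
    /gatk/gatk --java-options "-Xmx2G" MergeVcfs \
        --arguments_file merge.args \
        --OUTPUT merged.g.vcf.gz
    ```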

    How can I combine the 8310 gVCF files that I have already generated? And going forward, how do I get the MergeVcfs command to work in the example GATK pipeline posted on GitHub (haplotypecaller-gvcf-gatk4-nio.wdl)?

    Thank you for your help!
  • cmt · Seattle, WA · Member
    Hi @bshifaw

    I have taken another look at the original issue I was having with MergeVcfs, and I think I probably did not give it enough time to run, considering the number of gVCF files it was merging. When I looked again at the workflow log file, I could see that for two days the job results were being retrieved, and occasionally HaplotypeCaller was restarted:

    ```
    2018-12-19 16:24:42,329 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(2262), attempt 3)
    2018-12-19 16:24:43,353 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Restarting HaplotypeCallerGvcf_GATK4.HaplotypeCaller
    2018-12-19 16:24:49,975 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Job results retrieved (FetchedFromJobStore): 'HaplotypeCallerGvcf_GATK4.HaplotypeCaller' (scatter index: Some(15), attempt 3)
    2018-12-19 16:24:53,674 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41
    ```

    Having the HaplotypeCaller restart so often is what threw me off. In the monitor tab of FireCloud it just showed that for days the HaplotypeCaller was being restarted, so I finally aborted the run.

    Right before I aborted the run, I think it was just starting to try MergeVcfs:

    ```
    2018-12-19 16:25:03,124 INFO - WorkflowExecutionActor-78499ac8-3480-46d6-b0b7-3555480e4f41 [UUID(78499ac8)]: Restarting HaplotypeCallerGvcf_GATK4.MergeGVCFs
    2018-12-19 16:26:04,335 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: job id: operations/EOHcnZX7LBiZtoL9geuiBCCT043OohUqD3Byb2R1Y3Rpb25RdWV1ZQ
    2018-12-19 16:26:19,692 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(78499ac8)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: Status change from - to Initializing
    ```


    If that is the case, then I have restarted this WDL workflow on another sample. I'm going to let it run for a week and see if it finishes.

    Is there a way to improve the run time without racking up huge costs? I'm obviously very new to GCP and FireCloud, and I don't have a good sense of how to tune these parameters.

    Thank you!
  • cmt · Seattle, WA · Member
    Thank you @bshifaw! I will look at that pipeline and see if I can use that method to scatter the task.
  • cmt · Seattle, WA · Member
    Hi @bshifaw,

    The FireCloud run that I started last week finally finished, but with an error. I was running just HaplotypeCaller (which finished with no errors) and then MergeVcfs, which finished with the error message: "The job was aborted from outside Cromwell".

    The workflow.log has just these couple of lines relating to the MergeVcfs command:

    ```
    2019-01-11 15:33:40,696 INFO - WorkflowExecutionActor-63ac6674-20cb-4a1b-9f25-b3b4caaea7aa [UUID(63ac6674)]: Restarting HaplotypeCallerGvcf_GATK4.MergeGVCFs
    2019-01-11 15:34:00,836 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(63ac6674)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: job id: operations/EOLDvvqBLRikqNislMm02YcBIJPTjc6iFSoPcHJvZHVjdGlvblF1ZXVl
    2019-01-11 15:34:08,767 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(63ac6674)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: Status change from - to Initializing
    2019-01-11 19:38:44,690 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(63ac6674)HaplotypeCallerGvcf_GATK4.MergeGVCFs:NA:1]: Status change from Initializing to Cancelled
    ```

    I did not cancel the run. Do you think it was cancelled because I reached some FireCloud or GCP limit, maybe because there were 8310 gVCF files to merge and after a certain point the job was automatically cancelled? Do you have any ideas about what could have caused the run to be cancelled?

    I'm going to try using the Utils.ScatterIntervalList task from the five-dollar-genome-analysis-pipeline for the next pool, to reduce the number of gVCF files that need to be merged and hopefully avoid this in the future. But I now have two pools that each have 8310 gVCF files that need to be merged into one. Can you recommend a way to do that?

    Thank you so much for all your help!
  • bshifaw · Member, Broadie, Moderator, Admin

    This was due to a GCP limit: GCP will cancel a job if it takes longer than 6 days. The message is hidden in the operations log.
    Operation: operations/EOLDvvqBLRikqNislMm02YcBIJPTjc6iFSoPcHJvZHVjdGlvblF1ZXVl

    ```
    "error": {
      "code": 1,
      "details": [],
      "message": "Operation canceled at 2019-01-11T11:37:30-08:00 because it is older than 6 days"
    }
    ```

    Another tool for merging gVCFs would be GenomicsDBImport, but try testing your updated WDL with the Utils.ScatterIntervalList task first. You can then determine whether it's worth spending time on GenomicsDBImport or just rerunning the data with the updated WDL.

  • cmt · Seattle, WA · Member
    Thanks @bshifaw!

    I think I can't use GenomicsDBImport because I have a ploidy of 20, which is a shame because it definitely seems like the easiest route.

    Looking at the Utils.ScatterIntervalList task, I was confused about what it actually does and how best to apply it to my job. I have 8310 scaffolds (I can't stitch them together, because I have the aligned BAM files, not the raw reads, so I can't realign to an altered version of the genome). Do I use the intervals list (-L) in the HaplotypeCaller command and then also the Utils.ScatterIntervalList task? It seems like Utils.ScatterIntervalList takes my intervals and splits them up even further, which does not solve the problem of how to combine all 8310 resulting gVCFs after HaplotypeCaller, but I might just not be understanding it. There is no way to have intervals that span multiple scaffolds, right? So could I use no intervals (-L) in HaplotypeCaller but still scatter the job with Utils.ScatterIntervalList? And if I go that route, I still need to provide Utils.ScatterIntervalList with a list of intervals, which would be my 8310 scaffolds, right? And that is OK, because it can hopefully gather all 8310 gVCF files together using Utils.ScatterIntervalList instead of with MergeVcfs, which times out.

    I hope that question makes sense!

    Thanks!
  • AdelaideR · Member, Admin

    @cmt Another option might be to run the samples through HaplotypeCaller and then run GenotypeGVCFs. GenotypeGVCFs does not require a ploidy specification.

    Here is another thread about pool-seq data that had a similar issue. What type of data and organism are you working with?

  • cmt · Seattle, WA · Member
    Hi @AdelaideR!

    I would like to use GenotypeGVCFs, and I have used it in some tests that were smaller than the full pools I am now trying to run through the pipeline. The issue I run into is that GenotypeGVCFs can take multiple gVCF files, but aren't they each intended to be different discrete samples? I have 8310 files from one sample (one pool of 50 individual fish) that I want to combine into one gVCF file for that sample before using GenotypeGVCFs to genotype 12 samples (12 gVCF files; 12 pools of 50 fish each) at once. Combining all 8310 gVCF files is where I run into trouble: MergeVcfs times out before it can finish, though it works if I do smaller tests (e.g., 30 scaffolds instead of all 8310).

    I can't reduce the number of scaffolds that I have, so I have to figure out either a hierarchical way of running HaplotypeCaller that produces fewer gVCF files per sample, or some way to combine all the gVCF files from one sample before moving on to GenotypeGVCFs with multiple samples, maybe a hierarchical way of running MergeVcfs (something like the sketch below). Do you have any ideas?
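
    A rough sketch of the kind of hierarchical MergeVcfs run I mean, assuming a hypothetical gvcfs.list with one per-scaffold gVCF path per line (the batch size and file names are made up):

    ```
    # Hypothetical two-level merge: combine 500 gVCFs at a time, then merge
    # the intermediate files into a single per-pool gVCF.
    split -l 500 gvcfs.list batch_
    for b in batch_*; do
        sed 's/^/--INPUT /' "$b" > "$b.args"
        /gatk/gatk --java-options "-Xmx4G" MergeVcfs \
            --arguments_file "$b.args" \
            --OUTPUT "merged.$b.g.vcf.gz"
    done
    ls merged.batch_*.g.vcf.gz | sed 's/^/--INPUT /' > final.args
    /gatk/gatk --java-options "-Xmx4G" MergeVcfs \
        --arguments_file final.args \
        --OUTPUT pool1.g.vcf.gz
    ```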


    Thanks!
  • bshifaw · Member, Broadie, Moderator, Admin

    Regarding your earlier question about ScatterIntervalList: the task uses Picard's IntervalListTools "to scatter the input interval list into scatter_count sub interval lists". That means the input interval list you provide is broken down into sub interval list files (the number of files is determined by the scatter_count variable), each containing a subset of the intervals from the input interval list. This set of sub interval list files is then used in the HaplotypeCaller scatter, so that each scattered job runs on its own sub interval list file.

    This is different from the way the workflow is currently being run. It looks like your workflow is scattering the HaplotypeCaller call based on each line in the interval list, and your interval list has 8k lines, hence the 8k scattered jobs.

    Note that scattering the HaplotypeCaller call is used to parallelize the workflow; it isn't required. The workflow can simply run so that the HaplotypeCaller task is called once with one interval list, instead of being called once per line of the interval list. Not scattering the call would also mean you wouldn't have to use MergeGVCFs, since the workflow would execute one HaplotypeCaller task and create one gVCF.
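
    For concreteness, a minimal sketch of that scattering step on its own, using Picard's IntervalListTools through the GATK4 wrapper (the file names and scatter count here are made up):

    ```
    # Hypothetical example: split scaffolds.interval_list into 50 balanced
    # sub interval lists (one per HaplotypeCaller shard) under scatter_dir/.
    # BALANCING_WITHOUT_INTERVAL_SUBDIVISION keeps each scaffold intact.
    /gatk/gatk IntervalListTools \
        --INPUT scaffolds.interval_list \
        --SCATTER_COUNT 50 \
        --SUBDIVISION_MODE BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
        --UNIQUE true \
        --SORT true \
        --OUTPUT scatter_dir
    ```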

  • cmt · Seattle, WA · Member
    @bshifaw That makes sense!

    I will restructure how I'm running HaplotypeCaller and give it a shot without the additional scattering and without using MergeVCFs and see if that solves my problem.

    Not to get too ahead of myself, but I think this is the answer to my issues. Thanks :)
  • AdelaideR · Member, Admin

    Good luck! Feel free to let us know how it went, this is helpful information for other forum users.

  • cmt · Seattle, WA · Member
    Hi @AdelaideR and @bshifaw!

    I have a couple more questions about optimizing my HC run, thank you for all your help so far!

    I removed the scatter that had HaplotypeCaller running a task for each of the lines of my interval list (8310) and instead have it running on a test of 6 intervals, using -L in the body of the HC command but no additional scattering/parallelization. It has been running for more than 30 hours, because it was preempted a couple of times and had to restart the HC task from the beginning after each preemption. It looks like maybe it just restarted again. I'm thinking of running the HC command without the preemptible allowance, because I think I am just wasting time allowing the job to be preempted.

    But I also wonder if I am not running HC on a fast or powerful enough VM on Google. How do I know whether to increase the memory or the CPUs of the job thoughtfully, that is, without asking for more than I need and wasting money? I have asked for 7 GB of memory in the WDL, but no explicit number of CPUs. Maybe another clue that my WDL isn't optimized for this task is that the HaplotypeCaller.log has these lines:

    ```
    20:44:44.919 INFO IntelPairHmm - Available threads: 2
    20:44:44.919 INFO IntelPairHmm - Requested threads: 4
    20:44:44.920 WARN IntelPairHmm - Using 2 available threads, but 4 were requested
    ```
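
    My guess (an assumption on my part, not confirmed anywhere in this thread) is that the default VM has only 2 cores while GATK's PairHMM requests 4 threads by default. A sketch of the two knobs I could try, with placeholder file names: request more cores via cpu: 4 in the task's runtime block, or lower the thread request to match the cores I have:

    ```
    # Sketch with placeholder names: match PairHMM threads to available cores.
    # (Alternatively, add `cpu: 4` to the WDL runtime block to get 4 cores.)
    /gatk/gatk --java-options "-Xmx6G" HaplotypeCaller \
        -R ref.fasta \
        -I pool1.bam \
        -L scaffolds.interval_list \
        --native-pair-hmm-threads 2 \
        -ERC GVCF \
        -O pool1.g.vcf.gz
    ```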

    In addition to optimizing the VM I'm running HC on in terms of computing power and memory, I want to reduce the time it takes to run HC with some sort of parallelization, because I think HC is stalling on the largest scaffolds. So the other question I had is whether you think this new plan is a good idea: I would run HaplotypeCaller on the 24 largest scaffolds using the original scatter method (one HC task for each line in a 24-scaffold interval file, probably on preemptible VMs to save money) and then merge the resulting 24 gVCF files to produce one gVCF. I would run the remaining 8286 very small scaffolds using the non-scattered protocol (just a WDL of HC with -L for the 8286 scaffolds within the body of the command) to produce one gVCF file for all of those small scaffolds. Then I would run MergeVcfs on those two gVCF files to get one gVCF that contains all 8310 scaffolds (roughly the final step sketched below).
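
    A minimal sketch of that final merge step, with made-up file names:

    ```
    # Hypothetical final step: merge the gVCF from the 24 scattered large
    # scaffolds with the gVCF from the 8286 small scaffolds.
    /gatk/gatk --java-options "-Xmx4G" MergeVcfs \
        --INPUT pool1.large_scaffolds.g.vcf.gz \
        --INPUT pool1.small_scaffolds.g.vcf.gz \
        --OUTPUT pool1.all_scaffolds.g.vcf.gz
    ```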

    What do you think?

    Thanks!!