Metric collection task stalls

I have a metric collection task that's part of a workflow that I've run on 372 samples. For 2 of those samples, this particular task seems to stall. The task is supposed to read in a BAM file and output a few metrics files and a PDF. From the stderr logs I can see that it successfully reads the entire BAM file, but it never writes any output files and becomes stuck. I've tried rerunning the workflow with call caching on but get the same result. I've also downloaded the input BAM file for one sample and run it locally with the same Docker image, and it finishes without any issues. I can't figure out what is going wrong or how else to debug this issue. Any insight you have would be very helpful. Thanks!

Answers

  • SChaluvadi (Member, Broadie, Moderator, admin)

    Hey @jurhoades - more than happy to take a look! In case it's a workspace-specific issue, can you share the workspace name?

  • jurhoades (Boston, Member)

    Sure, it is blood-biopsy/MRD_duplex_bams. This is the original run of one of the samples and this is me trying to rerun it. The specific task that gets stuck is called CollectDuplexSeqMetrics. Thanks!

  • SChaluvadi (Member, Broadie, Moderator, admin)

    @jurhoades -

    I've looked into your workflow and I have a few thoughts/questions to help us start debugging:

    From the logs, under Call #2 for the CollectDuplexSeqMetrics task, I was able to see that this step was aborted because it had been running for more than 6 days without completing. This step needs to run faster, so depending on what CollectDuplexSeqMetrics is doing, you will need to determine how best to speed it up. Based on some reading, you might, for example, be able to speed it up by restricting the calculation of metrics to a set of regions using the --intervals parameter. Here is the specific message I was referring to:
    "message": "Operation canceled at 2018-10-18T18:26:18-07:00 because it is older than 6 days"

    You mentioned that you tried to troubleshoot by downloading the "input BAM" and testing it locally, and that it was successful. Would you be able to say how long that test took on your local machine? Can you also clarify whether the input BAM you referred to is the output of the previous step (FGBioGroupReadsByUmi) or the original BAM, tested only with CollectDuplexSeqMetrics?
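
For anyone trying to find a message like the one quoted above in their own run, here is a minimal sketch in Python. It assumes the workflow metadata has already been saved as JSON; the nesting shown in `sample` is hypothetical and made up for illustration, and only the message text comes from this thread. The helper `find_messages` is not part of any tool, it just walks whatever dicts and lists it is given.

```python
import json
from typing import Any, Iterator


def find_messages(node: Any, needle: str) -> Iterator[str]:
    """Yield every string stored under a "message" key that contains `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "message" and isinstance(value, str) and needle in value:
                yield value
            else:
                # Keep descending; the real metadata layout is not assumed here.
                yield from find_messages(value, needle)
    elif isinstance(node, list):
        for item in node:
            yield from find_messages(item, needle)


# Hypothetical metadata fragment: the key names and nesting are assumptions.
sample = json.loads("""
{"calls": {"CollectDuplexSeqMetrics": [
  {"attempt": 1, "failures": [
    {"message": "Operation canceled at 2018-10-18T18:26:18-07:00 because it is older than 6 days"}
  ]}
]}}
""")

hits = list(find_messages(sample, "Operation canceled"))
for hit in hits:
    print(hit)
```

Because the search is recursive and only keys on the word "message", it should keep working if the real metadata nests the failure somewhere else, at the cost of possibly matching unrelated messages that contain the same phrase.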

  • jurhoades (Boston, Member)

    Thanks for looking into this! This task usually runs relatively quickly. On my local machine it ran in an hour. I'm thinking that job timed out because it was stuck and not because it was still doing work, although I can't say for sure. I went to the task that timed out and downloaded the input bam_file for that task. I also confirmed that it was the same file as the one output from the FGBioGroupReadsByUmi task.

  • jurhoades (Boston, Member)

    @SChaluvadi
    I've restarted one of the failed samples here and it is currently running. I tried connecting to the VM but it appears I don't have the right permissions: ERROR: (gcloud.compute.ssh) Could not fetch resource: - Required 'compute.instances.get' permission for 'projects/blood-biopsy/zones/us-central1-b/instances/ggp-11699669849065487397'. It hasn't gotten to the point where it stalls yet as of posting this. I'll keep checking to see if it eventually gets stuck again with increased memory.

  • SChaluvadi (Member, Broadie, Moderator, admin)

    @jurhoades - Looks like the workflow finished successfully! Can you confirm that everything looks good from your end?

  • jurhoades (Boston, Member)

    It looks like it did! It must've been a memory issue, even though it didn't complain about running out of memory before. I'll adjust my workflow to give more memory to this task. Thanks for your help!
