CNVDiscovery pipeline: no ".done" files?

hjzhou Member
edited May 2018 in GenomeSTRiP

Hi staff,

In the CNVDiscovery pipeline working directory "cnv_stage2", seq_chr2 always restarts all of its jobs when I rerun the pipeline, unlike the other chromosomes. I checked the seq_chr2 folder: even though I can see all the P0xxx.genotypes.. files, there are no ".done" files at all, whereas the other chromosome folders do contain ".done" files. Is there a bug peculiar to chromosome 2 in stage 2? Or are the ".done" files all produced at once at the very end for each chromosome? (This seems unlikely, as the progress seems stuck; the log shows

INFO  13:02:52,137 QGraph - 8 Pend, 456 Run, 0 Fail, 0 Done

after 3 days of running, yet I can see all 456 partitions' files in the seq_chr2 folder.)
I wonder if I could just use

touch .P0xxx.genotypes.vcf.gz.done

to mark these partitions as "done"?
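Something like the following is what I have in mind — a rough sketch assuming Queue's convention that an output file such as P0001.genotypes.vcf.gz is considered complete when a hidden marker .P0001.genotypes.vcf.gz.done exists next to it (the function name is my own):

```shell
# Sketch: mark every partition output in a directory as "done".
# Assumes the hidden .<output>.done marker convention described above;
# the function name is made up for this example.
mark_partitions_done() {
    dir="$1"
    for vcf in "$dir"/P*.genotypes.vcf.gz; do
        [ -e "$vcf" ] || continue        # glob matched nothing; skip
        marker="$dir/.$(basename "$vcf").done"
        [ -e "$marker" ] || touch "$marker"
    done
}

# Usage: mark_partitions_done cnv_stage2/seq_chr2
```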


  • bhandsaker Member, Broadie, Moderator admin

    Touching the files will make Queue keep going, but you aren't addressing the underlying problem.

    First, you should ask whether the 456 jobs really completed. Are they still running on your cluster?
    If they did complete, it would be good to verify, if you can, whether they exited normally (to the extent your cluster keeps history for completed jobs).
    If they completed, and if they exited normally, then the problem is that Queue didn't get notified (or didn't process the notification correctly). This is often a problem with the drmaa API, which Queue uses to poll the job scheduler at regular intervals for the completion status of the jobs it has submitted.

  • hjzhou Member

    Thank you very much bhandsaker. I checked the status of some completed jobs; the "exit_status" values are all 0. Then it must be related to the drmaa API.
    I have another question, though. Is Genome STRiP designed to submit jobs at increasing intervals? Take stage 2, chr22 for example: at the very beginning it could submit 4 "partition" jobs in a minute, but by the end it took 2 minutes to submit a single "partition" job. For stage 2, chr2, which is a big chromosome, it took 3 days to go from P0001 to P0456, with 15-20 minutes to submit a single partition job at the end. I think the same is true for SVPreprocessing and SVDiscovery. I wonder whether it is something related to the operating system; we use CentOS 6.9.

  • bhandsaker Member, Broadie, Moderator admin

    This increasing time between dispatches doesn't sound normal. Pure speculation, but I wonder if there's some problem with the drmaa API such that (a) you are missing the completion events (or Queue is not able to match them up with the jobs) and (b) as the number of completions increases, Queue spends more and more time polling and looking for completed jobs.

    If you are capable, you could try instrumenting the Queue code (it is Scala code in GATK) to debug this. We actually use a fork of the GATK Queue code with some bug fixes. Let me know if you want access.
