

DepthOfCoverage in parallel mode does not actually run in parallel

danilovkiri (Moscow, Russia), Member

Hi.

I've been using the DepthOfCoverage tool to estimate coverage for human WGS data aligned with BWA-MEM, filtered with samtools, and passed through MarkDuplicates. I ran DepthOfCoverage both in parallel mode (with -nt and --omitIntervalStatistics) and in single-threaded mode. All the data is stored on an SSD and processed on a server with 12 physical cores. Surprisingly, the processing speed reported by ProgressMeter is twice as fast in single-threaded mode (15 sec per 1 million sites vs. 30 sec). I understand the limitations of I/O, but this is confusing when compared with some other (non-Spark) GATK tools, which do speed up in -nt or -nct mode while reading and writing.
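For reference, the two invocations I compared look roughly like this (a sketch only; the reference, BAM, and output paths are placeholders for my actual files):

```shell
# Parallel attempt: ~30 sec per 1 million sites (slower!)
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
    -R reference.fasta -I sample.bam -o sample.coverage \
    -nt 12 --omitIntervalStatistics

# Single-threaded run: ~15 sec per 1 million sites
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
    -R reference.fasta -I sample.bam -o sample.coverage
```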

Is this behaviour expected? Any comment would be greatly appreciated.

Answers

  • bhanuGandham (Cambridge, MA), Member, Administrator, Broadie, Moderator

    Hi @danilovkiri

    Which version of GATK are you using? Multi-threading options have been deprecated in the latest versions. We currently only support GATK4, and the latest release is v4.1.1.0.

  • danilovkiri (Moscow, Russia), Member

    @bhanuGandham thank you for your reply.

    Since DepthOfCoverage has not been ported to GATK4 yet, I'm using GATK 3.8, obviously.

    So there is no support for tools that, although not deprecated, are still necessary and have no equivalents in the newest GATK version. Am I right?

  • bhanuGandham (Cambridge, MA), Member, Administrator, Broadie, Moderator

    @danilovkiri

    My apologies. We should support it, given that DepthOfCoverage hasn't been ported over to GATK4. Let me look into this further and I will get back to you shortly.

    PS: DepthOfCoverage will be added to GATK4 in the next couple of months.

  • bhanuGandham (Cambridge, MA), Member, Administrator, Broadie, Moderator

    Hi @danilovkiri

    I have not personally compared the speed of DepthOfCoverage in GATK 3.8 to other non-Spark tools, but I can ask around.
    Can you give more specifics about which non-Spark tools you compared against, and what their processing times were relative to DepthOfCoverage?
    Also, can you post the exact command you used for DepthOfCoverage?

  • danilovkiri (Moscow, Russia), Member
    Accepted Answer

    Hi @bhanuGandham

    I guess I figured this out. The speed of DepthOfCoverage is mostly and severely limited by writing the per-base depth output, so using --omitIntervalStatistics and enabling -nt barely improve performance unless --omitDepthOutputAtEachBase is also used. Disabling the per-base depth output makes DepthOfCoverage run about 30 times faster on 12 cores. So, if the file with depths at each base is not needed, it is worth using the --omitDepthOutputAtEachBase argument.

    However, I do need this file, so it seems my question is moot given the I/O limitations. As for the processing-speed comparison, I meant the GATK 3.8 HaplotypeCaller (though I use the newest one now). I understand that these tools perform completely different procedures, and the speed of HC mostly depends on active-region complexity, i.e. how much the reads deviate from the reference sequence, so this comparison may not be relevant.

    I suppose the problem with DepthOfCoverage when using -nt while writing per-base depth is that the tool writes the depths in consecutive genomic order (which is perhaps not strictly necessary), so chunks of data produced by different threads queue up waiting to be written to a single file. Please correct me if I'm wrong. If I'm not, it would be great not to encounter this behaviour in future versions of the tool.
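The serialization I'm describing can be sketched as a toy model (pure Python, not the actual GATK implementation; the chunk sizes and fake depth values are made up for illustration): workers compute per-base depths for chunks in parallel, but a finished chunk can only be flushed once every earlier chunk has been written, so fast workers' results sit in a reorder buffer behind the slowest preceding chunk.

```python
# Toy model of the ordered-output bottleneck: parallel workers produce
# per-chunk depth tables, but the single output stream must be written
# in genomic order, so out-of-order results wait in a reorder buffer.
import heapq
import io
from concurrent.futures import ThreadPoolExecutor, as_completed

def depth_for_chunk(chunk_id, positions):
    # Stand-in for the real per-base depth computation.
    return chunk_id, [(pos, 30) for pos in positions]  # fake depth of 30

def run(chunks, n_threads, out):
    next_chunk = 0
    pending = []  # min-heap of finished chunks waiting their turn
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(depth_for_chunk, i, c) for i, c in enumerate(chunks)]
        for fut in as_completed(futures):  # results arrive in any order
            heapq.heappush(pending, fut.result())
            # Flush only chunks that are next in genomic order; anything
            # else stays buffered -- this is the serialization point.
            while pending and pending[0][0] == next_chunk:
                _, rows = heapq.heappop(pending)
                for pos, depth in rows:
                    out.write(f"{pos}\t{depth}\n")
                next_chunk += 1

chunks = [range(i * 3, i * 3 + 3) for i in range(4)]  # 4 chunks of 3 sites
buf = io.StringIO()
run(chunks, n_threads=4, out=buf)
print(buf.getvalue().splitlines()[0])  # -> 0	30
```

The buffer guarantees ordered output, but total throughput is gated by the single writer and by the slowest not-yet-flushed chunk, which would match the observed lack of speedup with -nt.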

  • bhanuGandham (Cambridge, MA), Member, Administrator, Broadie, Moderator

    @danilovkiri

    That is correct, and we are going to make improvements to this process in the new DepthOfCoverage tool in GATK 4.1, to be released shortly.

    Thank you for this detailed solution to your question. This will be helpful to others in the community.
