PathSeq running too slow

Gaurav1983 Member
edited May 29 in Ask the GATK team

Hi,

I am trying to run the PathSeq pipeline on single-end RNA-seq data (a 2.37 GB unaligned BAM file). The pipeline is still running after 3 days.

Below is a snapshot of the log file from the step that is taking the most time. Is there any option I can use to speed up this process?

Thanks

Regards

Gaurav

Command-line parameters used:

gatk-4.1.2.0/gatk --java-options "-Xmx300000m" \
            PathSeqPipelineSpark H-CRC-07TT-APC_S71_L006_unaligned.bam \
            --output H-CRC-07TT-APC_S71_L006.pathseq.bam \
            --scores-output H-CRC-07TT-APC_S71_L006.pathseq.tsv \
            --filter-metrics H-CRC-07TT-APC_S71_L006.pathseq.filter_metrics \
            --score-metrics H-CRC-07TT-APC_S71_L006.pathseq.score_metrics \
            --kmer-file cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_host.bfi \
            --filter-bwa-image cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_host.fa.img \
            --microbe-bwa-image cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_microbe.fa.img \
            --microbe-fasta cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_microbe.fa \
            --taxonomy-file cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_taxonomy.db \
            --bam-partition-size 4000000 \
            --is-host-aligned false \
            --skip-quality-filters false \
            --min-clipped-read-length 60 \
            --filter-bwa-seed-length 19 \
            --host-min-identity 30 \
            --filter-duplicates true \
            --skip-pre-bwa-repartition false \
            --min-score-identity 0.9 \
            --identity-margin 0.02 \
            --divide-by-genome-length true \
-- \
        --spark-runner LOCAL --spark-master local[4]

Partial log file:

19/05/28 21:58:04 INFO BlockManagerInfo: Added rdd_65_23 in memory on 192.168.22.18:38230 (size: 4.0 MB, free: 155.9 GB)
19/05/28 21:58:23 INFO Executor: Finished task 24.0 in stage 38.0 (TID 9505). 2070 bytes result sent to driver
19/05/28 21:58:23 INFO TaskSetManager: Starting task 28.0 in stage 38.0 (TID 9509, localhost, executor driver, partition 28, ANY, 4995 bytes)
19/05/28 21:58:23 INFO TaskSetManager: Finished task 24.0 in stage 38.0 (TID 9505) in 11085359 ms on localhost (executor driver) (25/117)
19/05/28 21:58:23 INFO Executor: Running task 28.0 in stage 38.0 (TID 9509)
19/05/28 21:58:23 INFO ShuffleBlockFetcherIterator: Getting 592 non-empty blocks out of 592 blocks
19/05/28 21:58:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/05/29 00:03:30 INFO MemoryStore: Block rdd_65_24 stored as bytes in memory (estimated size 4.2 MB, free 153.4 GB)
19/05/29 00:03:30 INFO BlockManagerInfo: Added rdd_65_24 in memory on 192.168.22.18:38230 (size: 4.2 MB, free: 155.9 GB)
19/05/29 00:03:41 INFO Executor: Finished task 25.0 in stage 38.0 (TID 9506). 2070 bytes result sent to driver
19/05/29 00:03:41 INFO TaskSetManager: Starting task 29.0 in stage 38.0 (TID 9510, localhost, executor driver, partition 29, ANY, 4995 bytes)
19/05/29 00:03:41 INFO TaskSetManager: Finished task 25.0 in stage 38.0 (TID 9506) in 9629953 ms on localhost (executor driver) (26/117)
19/05/29 00:03:41 INFO Executor: Running task 29.0 in stage 38.0 (TID 9510)
19/05/29 00:03:41 INFO ShuffleBlockFetcherIterator: Getting 592 non-empty blocks out of 592 blocks
19/05/29 00:03:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/05/29 00:09:28 INFO MemoryStore: Block rdd_65_25 stored as bytes in memory (estimated size 4.0 MB, free 153.4 GB)
19/05/29 00:09:28 INFO BlockManagerInfo: Added rdd_65_25 in memory on 192.168.22.18:38230 (size: 4.0 MB, free: 155.9 GB)
19/05/29 00:09:40 INFO Executor: Finished task 26.0 in stage 38.0 (TID 9507). 2070 bytes result sent to driver
19/05/29 00:09:40 INFO TaskSetManager: Starting task 30.0 in stage 38.0 (TID 9511, localhost, executor driver, partition 30, ANY, 4995 bytes)
19/05/29 00:09:40 INFO Executor: Running task 30.0 in stage 38.0 (TID 9511)
19/05/29 00:09:40 INFO TaskSetManager: Finished task 26.0 in stage 38.0 (TID 9507) in 9169125 ms on localhost (executor driver) (27/117)
19/05/29 00:09:40 INFO ShuffleBlockFetcherIterator: Getting 592 non-empty blocks out of 592 blocks
19/05/29 00:09:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/05/29 00:35:58 INFO MemoryStore: Block rdd_65_26 stored as bytes in memory (estimated size 4.0 MB, free 153.4 GB)
19/05/29 00:35:58 INFO BlockManagerInfo: Added rdd_65_26 in memory on 192.168.22.18:38230 (size: 4.0 MB, free: 155.9 GB)
19/05/29 00:36:09 INFO Executor: Finished task 27.0 in stage 38.0 (TID 9508). 2070 bytes result sent to driver
19/05/29 00:36:09 INFO TaskSetManager: Starting task 31.0 in stage 38.0 (TID 9512, localhost, executor driver, partition 31, ANY, 4995 bytes)
19/05/29 00:36:09 INFO Executor: Running task 31.0 in stage 38.0 (TID 9512)
19/05/29 00:36:09 INFO TaskSetManager: Finished task 27.0 in stage 38.0 (TID 9508) in 10080656 ms on localhost (executor driver) (28/117)
19/05/29 00:36:09 INFO ShuffleBlockFetcherIterator: Getting 592 non-empty blocks out of 592 blocks
19/05/29 00:36:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/05/29 00:41:31 INFO MemoryStore: Block rdd_65_27 stored as bytes in memory (estimated size 4.0 MB, free 153.4 GB)
19/05/29 00:41:31 INFO BlockManagerInfo: Added rdd_65_27 in memory on 192.168.22.18:38230 (size: 4.0 MB, free: 155.9 GB)
19/05/29 00:41:50 INFO Executor: Finished task 28.0 in stage 38.0 (TID 9509). 2070 bytes result sent to driver
19/05/29 00:41:50 INFO TaskSetManager: Starting task 32.0 in stage 38.0 (TID 9513, localhost, executor driver, partition 32, ANY, 4995 bytes)
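
To put the log excerpt above in numbers, here is a minimal sketch that averages the task durations Spark reports on its "Finished task ... in N ms" lines. The function name `mean_task_hours` is hypothetical and not part of GATK or Spark; it only assumes the log line layout shown in this post.

```shell
# Hypothetical helper: read a Spark log on stdin and print the mean duration,
# in hours, of the tasks reported on "TaskSetManager: Finished task" lines.
mean_task_hours() {
    awk '/TaskSetManager: Finished task/ {
        # The duration is the number between the words "in" and "ms".
        for (i = 1; i < NF; i++)
            if ($i == "in" && $(i + 2) == "ms") { sum += $(i + 1); n++ }
    }
    END { if (n) printf "%.2f\n", sum / n / 3600000 }'
}

# Example usage (pathseq.log is a placeholder file name):
#   mean_task_hours < pathseq.log
```

The four finished tasks in the excerpt average roughly 2.8 hours each. With 117 tasks in stage 38 and four local threads (`local[4]`), that suggests on the order of 80 hours for this stage alone, assuming the remaining tasks take about as long as these four.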
Post edited by bshifaw

Answers

  • bshifaw Member, Broadie, Moderator admin

    Hi @Gaurav1983 ,

    I'm asking one of the dev members about this question. I'll get back to you soon.

  • bshifaw Member, Broadie, Moderator admin

    Here is what the dev team advised.

    Usually this happens with microbe-rich data. One thing that can help is to not generate filtering metrics. Also, assuming their data is microbe-rich, they may try down-sampling the BAM to ~10M reads before running PathSeq, as it will still provide a good estimate of the overall microbial composition for the most abundant organisms.

    Hope this helps!
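
As a rough illustration of the down-sampling advice above, the sketch below computes the fraction of reads to keep so that about 10M remain. The helper name `downsample_fraction` and the file names are hypothetical; the commented usage assumes samtools is installed alongside GATK and is meant as a starting point rather than a tested recipe.

```shell
# Hypothetical helper (not part of GATK or samtools): print the fraction of
# reads to keep so that roughly $2 reads remain out of $1 total reads.
# Prints 1.0000 when the file already has no more reads than the target.
downsample_fraction() {
    awk -v total="$1" -v target="$2" 'BEGIN {
        f = (total > target) ? target / total : 1
        printf "%.4f\n", f
    }'
}

# Example usage (input.bam and downsampled.bam are placeholder names):
#   total=$(samtools view -c input.bam)              # count records in the BAM
#   frac=$(downsample_fraction "$total" 10000000)    # fraction for ~10M reads
#   gatk DownsampleSam -I input.bam -O downsampled.bam -P "$frac"
```

The down-sampled BAM would then replace the original input in the PathSeqPipelineSpark command, with the `--filter-metrics` line left out so that filtering metrics are not generated.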

  • Gaurav1983 Member

    Hi @bshifaw

    Thanks for your reply. I reduced the number of reads, and now it finishes quickly for single-end reads, but for paired-end reads it still took two days to finish.

    Would it be advantageous to pre-filter the host reads to speed up the process?

    Would increasing the value of bam-partition-size speed up the process, given that I am running the job on a fat node with 500 GB of RAM?

    Regards
    Gaurav

  • bshifaw Member, Broadie, Moderator admin

    Hi @Gaurav1983

    Here is a response from the dev team:

    Pre-filtering host-aligned reads by the BAM mapping flag would make PathSeq faster, though this can result in the loss of some microbial reads. Increasing the BAM partition size may be beneficial if the BAM is large and contains a high proportion of microbial reads. Is the paired-end BAM larger than the single-end BAM? Omitting --filter-metrics can result in significant performance improvements if the input BAM is large.
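
To make "the BAM mapping flag" concrete: in the SAM format, bit 0x4 of the FLAG field marks a read as unmapped. The sketch below tests that bit; the function name `is_unmapped` and the file names are hypothetical. The commented samtools command assumes the reads have already been aligned to the host reference, which is not the case for the unaligned BAM in the original question.

```shell
# Hypothetical helper: succeed (exit status 0) when the SAM FLAG value in $1
# has the 0x4 "read unmapped" bit set.
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# Example usage (host_aligned.bam and unmapped.bam are placeholder names).
# Keep only reads that did not map to the host:
#   samtools view -b -f 4 -o unmapped.bam host_aligned.bam
# For paired-end data, -f 12 keeps only pairs where both mates are unmapped.
```

As the reply notes, this trades some sensitivity for speed, because microbial reads that happen to align to the host are dropped by the filter.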
