How to run PathSeqPipelineSpark on the local machine?

How to run PathSeqPipelineSpark on a "normal" machine (even just a laptop) with multiple CPU cores?

Answers

  • markw (Cambridge, MA; Member, Broadie, Dev, admin)
    edited January 2018

    Hello Yinga,

    Thanks for your interest in using PathSeq. PathSeqPipelineSpark (and in fact any GATK Spark tool) can be run on your local machine by omitting the Spark arguments. See the first usage example in the tool documentation. If you want to specify how many CPU cores to use, pass the Spark arguments after a standalone -- separator, like this:

    ./gatk PathSeqPipelineSpark \
      ... \
      -- \
      --spark-runner LOCAL --spark-master local[4]
    

    This would use 4 cores. For more information, see this doc (note that it cites the GATK4-beta --sparkMaster argument instead of the --spark-master argument used in the new GATK4 release).

    Note that you will need the reference files that are built from the host and pathogen references. Pre-built references are available for download on the GATK Resource Bundle FTP server in /bundle/beta/PathSeq.

    Additionally, a WDL is now available in the master branch on GitHub in /scripts/pathseq/WDL. A README file there describes how it works.
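
As a rough illustration of the answer above, the Spark arguments for a local run could be assembled by a small shell helper. This is only a sketch: the function name local_spark_args and the getconf-based core detection are assumptions and not part of GATK; only --spark-runner and --spark-master come from the thread. The command is printed rather than executed, and the ... placeholder for the tool's own arguments is left as-is.

```shell
# Hypothetical helper (not part of GATK): build the Spark arguments for a local run.
# Uses the core count given as $1, or falls back to the number of online cores.
local_spark_args() {
  cores="${1:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}"
  printf '%s\n' "--spark-runner LOCAL --spark-master local[${cores}]"
}

# Print the command instead of running it; "..." stands for the tool's own arguments.
echo "./gatk PathSeqPipelineSpark ... -- $(local_spark_args 4)"
```

When running the command for real, quoting the value (for example 'local[4]') keeps the shell from treating the square brackets as a glob pattern.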

  • jorgez (Member)
    Hello Markw,

    I am trying to follow your guidelines to select local cores but I get the following error:

    A USER ERROR has occurred: spark-master is not a recognized option

    In my case I am not running PathSeqPipelineSpark but Mutect2 (GATK v4.0.10.0), like this:

    ```
    gatk Mutect2 \
    -R GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --tumor-sample HCC1143_tumor \
    --input hcc1143_N_subset50K.bam \
    --input hcc1143_T_subset50K.bam \
    --output mutect2.vcf \
    -- --spark-runner LOCAL --spark-master local[1]
    ```

    Any help with selecting local cores would be appreciated.

    Thanks
    Jorge
  • bhanuGandham (Cambridge, MA; Member, Administrator, Broadie, Moderator, admin)

    Hi @jorgez

    You see this error because Mutect2 is not a Spark tool, so it does not accept the Spark options.

  • jorgez (Member)
    Hi,

    Thanks so much for letting me know.

    Is there a built-in way to parallelise Mutect2, then?

    Jorge
  • bhanuGandham (Cambridge, MA; Member, Administrator, Broadie, Moderator, admin)

    Hi @jorgez

    No, there is no single built-in way to parallelize Mutect2. Parallelization is only offered for certain tools because, for the others, it generates errors due to the way their algorithms are designed. Hence we are being very cautious and provide separate Spark tools for the ones that can be parallelized.
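
One workaround that sits outside the tool itself, and is not described in this thread, is to split the work by genomic interval with the GATK -L argument and run one Mutect2 process per interval. The sketch below only prints such commands. The function name mutect2_scatter_cmds, the interval names and the output file naming are assumptions; the remaining arguments are copied from the command posted earlier in the thread.

```shell
# Hypothetical sketch (not a built-in GATK feature): print one Mutect2 command
# per interval instead of running them. Interval names and output naming are
# assumptions; the other arguments are copied from the command earlier in the thread.
mutect2_scatter_cmds() {
  for interval in "$@"; do
    printf '%s\n' "gatk Mutect2 -R GRCh38_full_analysis_set_plus_decoy_hla.fa --tumor-sample HCC1143_tumor --input hcc1143_N_subset50K.bam --input hcc1143_T_subset50K.bam -L ${interval} --output mutect2.${interval}.vcf"
  done
}

# Print the commands for three example intervals.
mutect2_scatter_cmds chr1 chr2 chr3
```

The printed commands could then be launched side by side, and the per-interval VCFs would need to be merged in a separate step that is not shown here.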

  • jorgez (Member)

    Hi,

    I understand,

    Thanks
    Jorge
