
A Gene Pattern Question

ayjoon · Houston, TX · Member
edited November 2016 in Ask the GATK team

Hi, I was referred here but am not sure if this is the right place to ask. Any comments will be appreciated.
I am planning to run the GenePattern module "CopyNumberInferencePipeline" on 240 tumor .CEL files and a roughly equal number of normal .CEL files. Since the module documentation states that the maximum number of CEL files to be processed is 200, my strategy is to run the samples in batches. However, the noise-reduction step relies on the normal samples, so it would be best to run all the samples together, which exceeds the stated maximum. Any advice? If there is someone I can talk to about this, please point me to them.
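The batching strategy described could be sketched as below — a hypothetical Python illustration (the filenames and the helper name are made up; only the 200-file cap comes from the module docs). The tradeoff it makes visible is exactly the concern raised: each batch carries only a share of the normals, which weakens the noise-reduction reference.

```python
import math

def make_batches(tumors, normals, cap=200):
    """Split tumor and normal CEL file lists into batches of at most
    `cap` files, giving each batch a proportional share of the normals.
    Hypothetical helper, not part of GenePattern."""
    total = len(tumors) + len(normals)
    n_batches = math.ceil(total / cap)
    t_chunk = math.ceil(len(tumors) / n_batches)
    n_chunk = math.ceil(len(normals) / n_batches)
    batches = []
    for i in range(n_batches):
        batch = (tumors[i * t_chunk:(i + 1) * t_chunk]
                 + normals[i * n_chunk:(i + 1) * n_chunk])
        batches.append(batch)
    return batches

# 240 tumors + 240 normals -> 3 batches of 160 files each, under the cap
tumors = [f"tumor_{i}.CEL" for i in range(240)]
normals = [f"normal_{i}.CEL" for i in range(240)]
batches = make_batches(tumors, normals)
```

Whether results from such per-batch noise reduction remain comparable across batches is the open question here, not the mechanics of the split.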

Issue · Github
by Sheila

Issue Number: 1500
State: closed
Closed By: sooheelee
Comments

  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭

    Hi @ayjoon,

    Please direct GenePattern questions to gp-help (at) broadinstitute (dot) org.

    And actually, the CopyNumberInferencePipeline module documentation states:

    Due to memory constraints this pipeline is estimated to run with a maximum of about 200 CEL files. The estimated run time with 200 samples is about 40 hours.

    I would suggest two solutions. First, you can try your 240 files on the public server. Second, you can see if the tool is available on GenePattern @ Indiana University, which has more compute resources. Information on [email protected] is here and the actual site is here.

  • ayjoon · Houston, TX · Member

    Thank you for the response, shlee! In fact, I was referred here through [email protected] The section you quoted is exactly where my concern comes from: I interpret it as saying NOT to exceed 200 CEL files on the public server. I can only use the public server because one of the components of the pipeline is not available outside the Broad server (a permissions issue). I also went to Indiana University first, but because of the same permissions issue they lack part of the module and referred me back to the Broad.

  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭
    edited December 2016

    @ayjoon -- The key word is estimated. It's a bit funny that gp-help should refer you to the GATK forum. I would suggest asking them whether there is a Docker version of GenePattern (with a binary version of the tool) that would then allow you to run your analyses in the cloud.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
    Oh, sorry for making you the ball in a support ping-pong game, @ayjoon. I assume the GenePattern support team sent you to us because the module you want to use involves some GATK or GATK-based tools; however, it's not one of our supported products. If a specific GATK tool in the pipeline were throwing an error, we might be able to help with that, but we can't help with the overall pipeline because we have no part in defining it. In any case, it sounds like the problem you ran into is an infrastructure-related limitation, which is even further out of scope for us... So I'm afraid you've fallen into the gap between our teams' scopes. That said, @shlee's suggestion of finding a way to run this in the cloud makes sense. You could ask the FireCloud team if they can help you set this up.
  • bahill · Cambridge, MA · Member, Broadie

    Hi all,

    Figured we could save @ayjoon some round trips by converging here.

    @Geraldine_VdAuwera, yes, that's the crux. We don't have any nodes with enough memory to run this job, and I'm not sure whether simply providing more memory would address the issue, or whether some optimization and/or threading is needed as well; i.e., I do not know how this thing scales.

    We have not set this up in Docker (we're getting there), but regardless, this particular pipeline makes use of some human data which can't leave the Broad, so it's stuck here.

    Are there any former CGA folks still kicking around over there in FireCloud (Gordon?) who could help answer this? (They have shut down the forum that used to support these sorts of questions, and provided no alternative source of assistance.) Basically, we need to know the following:

    • if partitioning the data is a viable/valid option
    • if there is an alternative to the human data file being used for noise reduction in Tangent
    • if there is an alternative to this pipeline in FireCloud
    • if providing more memory is really going to help, or if there would need to be optimization as well.

    @ayjoon - thanks for bearing with us, and apologies for bouncing you between our help forums. We are actually only down the hall from each other, but work on separate teams, so this can happen.

    To that end - @Geraldine_VdAuwera , I'm happy to stop by and see if we can't sort this out together/find the folks who can.

  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭

    @bahill, I would suggest you ask @esalinas on the FireCloud team.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Yeah, Eddie is probably the one to ask about this. We should discuss over coffee some time, @bahill, but it's unlikely my team can do much to help with this stuff.

  • birger · Member, Broadie, CGA-mod ✭✭✭

    Let's start with Gordon as he provided some help with the construction of the GenePattern pipeline. I'll ask Gordon to respond on this forum.

  • birger · Member, Broadie, CGA-mod ✭✭✭

    There is a copy number analysis workflow available on FireCloud. It is a component of our suite of somatic mutation calling workflows. Please see http://gatkforums.broadinstitute.org/firecloud/discussion/7512/broad-mutation-calling-best-practice-workflows#latest, and in particular the subsection on the Broad Mutation Calling Copy Number Workflow.

  • ayjoon · Houston, TX · Member

    @birger Does that workflow cover SNP6 data, specifically, .CEL files (not WES)?

  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭

    @ayjoon,

    For CNV calling on SNP6 data, try out GISTIC or ABSOLUTE.
