Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

Latest Release: 12/4/18
Release Notes can be found here.

Cross product between entities for Mutect2 normal-normal analysis

davidbendavidben BostonMember, Broadie, Dev ✭✭
edited March 7 in Ask the FireCloud Team

What I want to do
One of the standard evaluations of somatic variant callers is normal-normal calling, where you take a bunch of replicate normal (non-tumor) bams from the same individual eg NA12878 and run your caller over every pair of samples, assigning one sample as the "tumor" and one as the "normal."

If I were writing a wdl outside of Firecloud I could implement this with the cross product:

workflow NormalNormal {
  Array[File] bams

  scatter (pair in cross(bams, bams)) {
    call Mutect2 { input: tumor = pair.left, normal = pair.right }
  }
}

My understanding of why this is tricky
As I understand the data model, it's baked into Firecloud that you run a method over each sample, which means you don't know about the other samples. That is, you can't perform the cross because your input is a File egsample.bam, and not an Array[File] eg sample_set.bams.

Hacky solution 1
I suppose one could set up a bunch of pairs, basically by implementing the cross manually and then uploading the resulting data model, and run the analysis over a pair set. Besides being really ugly this is not very maintainable because the part of the workflow that forms all pairs out of the list of samples lives outside of Firecloud.

What I mean is that even though the natural data model of the problem is

sample bam
sample1 sample1.bam
sample2 sample2.bam
sample3 sample3.bam

I would run on pairs:

pair tumor_sample normal_sample
pair1 sample1 sample2
pair2 sample2 sample1
pair3 sample1 sample3
pair4 sample3 sample1
pair5 sample2 sample3
pair5 sample3 sample2

Hacky solution 2
I could also imagine the following hack. The data model would be a single, dummy, sample, with the attribute "bam_paths" which is a file with a path to a different bam on each line. Then the method would take in this FoFN, use read_tsv to get the Array[File] of bams, and then scatter over the cross.

Similarly, there's another analysis I want to do that scatters Mutect2 linearly (not sure what the word is but I mean trivially, without a cross or anything) over samples, but then does a cross over the output, the pairwise overlap between callsets FWIW. I'm also not sure how to do that.

Is there a good way to do these things?

Post edited by Geraldine_VdAuwera on
Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @davidben, my instinct is that it should be possible to do through sample sets but I might be dreaming -- let me check with the FC team. If that's not possible we'll find you the least hacky workaround possible -- which might be your solution 2.

    Not sure I understand your second request, do you mean run a WDL across each sample individually then collect the output and take that into another WDL as an array?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ok, I got confirmation that this should work:

    Make a sample set that includes all the samples you want in your cross, set the root entity type for your method to sample set, then specify this.samples.bam to your Array input. This will give your WDL the list of bams from your samples.

    If I'm right about your second question, then you would do essentially the same thing; except you would first run your M2 method per sample (by setting its root entity to sample then running it on the sample set with expression this.samplesin the launch dialog), then you would run your cross analysis on the sample set in the same way as for your first problem.

    Does that make sense? Let me know if you run into any trouble. At some point we want to write some kind of cookbook of recipes for this sort of thing since it's really not intuitive to figure out.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    Thanks @Geraldine_VdAuwera! I will try that.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    @Geraldine_VdAuwera It's working!

    For posterity, let me summarize.

    Simple parallelization with sets
    If you have a method whose root entity is sample and you want to run it on many samples, you must

    1. Define a sample set in the data model containing all your samples.
    2. Launch the method on the sample set and use the launch expression this.samples. Note that the root entity of the method is still "sample" and not sample set.

    The wdl looks like

    workflow MyWorkflow {
        File bam
    }
    

    In the task configurationbam is set to this.bam, where this means a single sample.

    Running a method on an entire set
    If your method inherently requires a set as an input, that is, if you're not just running on each sample in the set independently, you must

    1. Define a sample set in the data model containing all your samples.
    2. Write a method whose root entity is sample set. The wdl for this method will take some Array[] as input as shown below.
    3. Launch the method on the sample set with no launch expression.

    The wdl for such a method looks like

    workflow MyWorkflow {
        Array[File] bams
    }
    

    In the task configurationbams is set to this.samples.bam, where this means the sample set.

Sign In or Register to comment.