Batch effects begone: Introducing the Functional Equivalence data processing pipeline spec

ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭
edited February 12 in Announcements

By Eric Banks, Director, Data Sciences Platform and original member of the GATK development team

Ever since the GATK started getting noticed by the research community (mainly as a result of our contribution to the 1000 Genomes Project), people have asked us to share the pipelines we use to process data for variant discovery. Historically we have shied away from providing our actual scripts, not because we didn't want to share, but because the scripts themselves were very specific to the infrastructure we were using at the Broad. Fortunately we've been able to move beyond that thanks to the development of WDL and Cromwell, which allow potentially limitless portability of our pipeline scripts.

But it was also because there is a fair amount of wiggle room in terms of how to implement a pipeline to achieve correct results, depending on whether you care more about speed, cost or other factors. So instead we formulated "Best Practices", which I'll talk more about in a minute, to provide a blueprint of what are the key steps in the pipeline.

Today though we're taking that idea a step further: in collaboration with several other major genomics institutions, we defined a "Functional Equivalence" specification that is intended to standardize pipeline implementations, with the ultimate goal of eliminating batch effects and thereby promoting data interoperability. That means if you use a pipeline that follows this specification, you can rest assured that you will be able to analyze your results against all compatible datasets, including huge resources like gnomAD and TOPMed.


The GATK Best Practices were always meant to be just a set of guidelines for processing sequencing data and performing variant discovery, enumerating the key steps that we found produced the best results. To turn that into an actual pipeline, you have to make certain implementation choices -- how to chain the steps together, set command-line parameters, even which tools to use since for some of the steps there are several valid options. This means there is room for different implementations depending on factors like whether you care more about cost or about runtime, for example. However, any difference between implementations has the potential to cause subtle batch effects in downstream analyses that compare datasets produced by those variant pipelines. This is not a purely theoretical concern -- we have seen such batch effects occur in real analyses, with subtle but important consequences for the scientific results. In fact, this has been so prevalent that the Broad Institute has historically reprocessed from scratch any data that it received from other genome centers in order to avoid batch effects.

Clearly, with the amount of data in today’s large sequencing projects, that strategy of reprocessing everything is no longer feasible. So for the past year, we worked closely with several of the other large genome sequencing and analysis centers (New York Genome Center, Washington University, University of Michigan, Baylor College of Medicine) to develop a standardized pipeline specification that would promote compatibility among our respective institutions' pipelines. And I'm proud to say we accomplished our goal! It took a lot of testing and evaluations, but the consortium was able to define very precisely what are the components of a pipeline implementation from unmapped reads to an analysis-ready CRAM file that will make it “functionally equivalent” to any other implementation that adheres to this standard specification. This means that any data produced through such functionally equivalent pipelines will be directly comparable without risk of batch effects.

The consortium has already published the specification itself in Github here, and we're currently preparing a manuscript detailing the methodology we used as well as the consequences for downstream analysis, which we will submit for peer review and journal publication. We'll post updates and link to the preprint and final article as they become available.

The genome centers involved in this effort have all committed to using this standard to process all of our genomes, and together we’ve already processed over 150,000 human whole genomes with it. Since collectively we account for a substantial proportion of all genomes that get sequenced in the world, that should already simplify the lives of the large number of researchers who get their data from any one of our centers. That being said, there are plenty of other genome sequencing facilities out in the world (yes, we're very aware --and happy-- that there is a wonderful world beyond North America), and there are also many researchers who do their own processing and analysis. We hope that these service providers and individual analysts will consider adopting the Functional Equivalence pipeline specification in their own work to further increase the number of genomic datasets that will be fully compatible in this way. Based on our experience working with datasets from different provenances, we expect this will have far-reaching positive consequences for the ability of biomedical scientists to cross-analyze datasets. As an example, major datasets like the next version of gnomAD (~70,000 genomes, plus countless exomes) and TOPMed (another ~70,000 genomes) are now being processed through pipelines that conform to this Functional Equivalence standard. These datasets constitute incredibly valuable resources for analyses that e.g. rely on known allele frequencies, yet we know that any cross-analysis you run against them is vulnerable to batch effects that skew results UNLESS your samples were also processed through a functionally equivalent pipeline.

That last point is really important, so let me give an example to illustrate it. Imagine you want to find a causal variant in a sample you really care about, so you run variant calling and then compare the resulting callset against gnomAD in order to find the population-based allele frequencies. And, behold, you find a SNP that’s not in gnomAD! This could be the rare variant that you’ve been searching for… or it can be an artifact that arose because you didn’t process your sample with a pipeline that’s functionally equivalent to the pipeline used to make gnomAD. In case you're curious, we've found that the most egregious batch effects arise from using different aligners (or even different versions of the same aligner), which is why the functional equivalence specification includes the requirement to run the exact same version of BWA. And we’ve found that even simple tasks like marking of duplicate reads can have some drastic effects on things like downstream calling of structural variation. More details to come in the paper!

In my next blog post (coming soooon) I will point you to the Broad’s actual implementation and describe how we’ve made it really cheap to run. Stay tuned!

Post edited by Geraldine_VdAuwera on

Comments

  • chapmanbchapmanb Boston, MAMember

    Eric;
    Thanks for all the work and the detailed standardization in the GitHub repo. I'm excited about this and am keen to bring the bcbio pipeline up to date with all of these so folks can choose to remain compatible with what you are doing on big projects.

    Are you able to share some of the validation sets you're using to make these decisions? I'd be especially interested in putting together a set of difficult regions that drive some of the choices like duplicate marking algorithms. Having these as a complement to the type of whole genome validations that Genome in a Bottle project would be a huge help to avoiding regressions and testing out new tools relative to the current command line specific best practices.

    Practically, I have a couple of questions about the decisions as we look to adjust our pipeline:

    • Do you have a sense of the downstream effects of dropping the -M flag for bwa-mem? I'm open to doing this but have been using -M for a while so wanted to see if you'd evaluated secondary/supplementary marking for structural variant calling with tools like manta, lumpy and delly.
    • Similarly, what value/effect does -Y have on downstream analyses? If you've looked at this already it would save a lot of work (and worry about changing from what we have).
    • Have you evaluated biobambam2's markduplicates and bamsormadup (https://github.com/gt1/biobambam2)? We use this over Picard for performance reasons and it's been functionally equivalent in our testing. It would be useful to know if you've found issues where Picard outperforms it and make decisions about changing or mitigating those effects.

    Thank you again for all this work.

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Hi Brad!

    It was the work of a whole bunch of people across many different genome centers. I take zero credit for it -- I just get the privilege of blogging about it.

    Although you ask great questions, I need to ask you to be patient and wait for the paper to come out. I solemnly swore to the consortium that I wouldn't go into the technical details here until the paper comes out (or is in preprint). Hopefully shouldn't be too far from away (and then we can dive right in!).

  • fiamhfiamh Glasgow, UKMember

    Awesome work, been waiting for this since it the initial discussions. I second Brad's comment about the availability of comparison tools. We'd love to get to a stage of functional equivalence where BAMs (or variant calls, down the road) are considered acceptable even if they may not have used the exact same methods to get to the processed data.

  • chapmanbchapmanb Boston, MAMember

    Eric -- thanks for this and for thanks to everyone who did the crazy amount of work that went into it. We'll look at benchmarking and validation what we can from our side based on the current recommendations and looking forward to discussing more details when the preprint is available. Thank you again.

Sign In or Register to comment.