This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!
Fake your data -- for Science!
Part 1 of a series on the theme of synthetic data
Don't get me wrong, I'm not suddenly advocating for fraudulent research. What I'm talking about is creating synthetic sequence data for testing pipelines, sharing tools and generally increasing the computational reproducibility of published studies, so that we can all more easily build on each other's work.
The majority of the effort around computational reproducibility has so far focused on better ways to share and run code, as far as I can tell. With great results -- it's been transformative to see the community adopt tooling like version control, containers and Jupyter notebooks. Yet you can give me all the containers and notebooks in the world; if I don't have appropriate data to run that code on, none of it helps me.
Most of the genomic data that gets generated for human biomedical research is subject to very strict access restrictions. These protections exist for good reason, but on the downside, they make it much harder to train researchers in key methodologies until after they have been granted access to specific datasets — if they can get access at all. There are certainly open datasets like 1000 Genomes and ENCODE that can be used beyond their original research purposes for some types of training and testing. However they don't cover the full range of what is needed in the field in terms of technical characteristics (eg exome vs WGS, amount of coverage, number of samples for scale testing etc); not by a long way.
That's where fake data comes in -- we can create synthetic datasets to use as proxies for the real data. This is not a new idea of course; people have been using synthetic data for some time, as in the ICGC-TCGA DREAM Mutation challenges, and there is already a rather impressive range of command-line software packages available for generating synthetic genomic data. It's even possible to introduce (or "spike in") variants into sequencing data, real or fake, on demand. So that's all pretty cool. But in practice these packages tend to be mostly used by savvy tool developers for small-scale testing and benchmarking purposes, and rarely (if ever? send me links!) by biomedical researchers for providing reproducible research supplements.
And frankly, it's no surprise. It's actually kinda hard.