How to best build a new bioinformatics infrastructure
I'm new to bioinformatics, but I am a computer scientist with a background in distributed computing environments.
I have been given the exciting task of building a new infrastructure for the following purposes...
DNA sequencing:
New DNA sequencing technologies have revolutionized biological and medical research. Today it is possible to sequence a complete human genome in less than one week and at low cost. DNA sequencing can also be used for gene expression analysis (RNA-seq), identification of mutations (SNPs), probing of binding sites for DNA- and RNA-binding proteins (ChIP-seq and CLIP-seq), sequencing of ancient DNA, studying the biodiversity of ecosystems (metagenomics), and much more. Each such experiment typically generates at least 200 million short DNA sequences of 100 bases each (one lane of an Illumina HiSeq machine). Handling and analyzing these 20 billion base pairs currently requires a bioinformatics expert. We are already using these technologies in medical applications such as disease classification and diagnosis, in studying the bacterial ecosystem of the human gut, and in several other projects.
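To give a sense of the raw data volumes per sequencing run, here is a rough back-of-envelope calculation based on the numbers above. The bytes-per-base factors are my own assumptions (uncompressed FASTQ, typical BAM compression) and will vary with format and compression settings.

```python
# Back-of-envelope sizing for one Illumina HiSeq lane, using the figures quoted
# above (200 million reads of 100 bases each). Bytes-per-base factors are rough
# assumptions, not measured values.
reads_per_lane = 200_000_000
read_length = 100
bases_per_lane = reads_per_lane * read_length          # 20 billion bases

fastq_bytes_per_base = 2   # assumed: one sequence char + one quality char, headers ignored
bam_bytes_per_base = 1     # assumed: aligned, compressed BAM

print(f"bases per lane     : {bases_per_lane / 1e9:.0f} Gbp")
print(f"uncompressed FASTQ : ~{bases_per_lane * fastq_bytes_per_base / 1e9:.0f} GB")
print(f"aligned BAM        : ~{bases_per_lane * bam_bytes_per_base / 1e9:.0f} GB")
```

Multiply by the number of lanes and experiments per year to get an idea of the storage the infrastructure has to absorb.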
High-throughput imaging:
The selection of drug treatments based on the genetic make-up of a specific patient is the future of personalized medicine. In breast cancer research, inhibitors that block a specific step in homologous recombination (HR) are thought to eradicate tumor cells in some forms of breast cancer. To identify such drug candidates, we study the enzymatic steps of HR in live human cells using a PerkinElmer Opera high-throughput (100,000 images/day) confocal microscope available at the Center for Advanced BioImaging. Such a screen will examine 100-200 cells at three different drug concentrations for each molecule in a drug library of >10,000 small molecules. The subsequent analysis of 3-6 million cells at three imaging wavelengths will require large computing capabilities for object detection, segmentation, geometric alignment and quantification, for which access to bioinformatics and computational analysis is crucial. Several groups in the department use imaging technologies, and there is a strong need for computational resources and, not least, a professional storage solution.
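Again, just to quantify the scale implied above (the per-image size is my own assumption, not a measured value):

```python
# Quick sizing of the imaging screen, using the figures quoted above.
compounds = 10_000                  # size of the small-molecule library
concentrations = 3                  # drug concentrations per compound
cells_low, cells_high = 100, 200    # cells imaged per condition

print(f"cells to analyse : {compounds * concentrations * cells_low / 1e6:.0f}"
      f"-{compounds * concentrations * cells_high / 1e6:.0f} million")

images_per_day = 100_000
mb_per_image = 2                    # assumed raw size of one confocal image, in MB
print(f"raw data per day : ~{images_per_day * mb_per_image / 1000:.0f} GB")
```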
Protein structure analysis:
Proteins are biological macromolecules that play a central role in biology, biotechnology and medicine, and atomic-resolution structures of proteins can provide crucial insight into the mechanisms by which they function. As such, the generation and analysis of protein structures forms the basis for a broad range of experimental studies in biochemistry, biophysics and molecular biology. Many protein structures have already been determined and are available in publicly accessible databases, but detailed and quantitative analyses require a computational approach. Further, it is now possible to model the structures of many proteins by exploiting structural information on other proteins with similar sequences (so-called homology modeling); again, reliable modeling requires specific computational expertise. The department hosts several research groups that can (i) determine protein structures experimentally through nuclear magnetic resonance spectroscopy and X-ray crystallography, (ii) model or predict the structures and dynamical properties of proteins, and (iii) use computational methods to predict the effect of protein mutations on biophysical and biochemical properties. The substantial computational resources that will be available in the BIO-Computing core facility will be essential to unleash the full combined potential of these individual research activities and to make them available to all research groups in the department.
Can Hadoop/HDFS + GATK (+ ?) be used to best address the above requirements? (A rough sketch of the per-sample sequencing workload I have in mind follows below.)
If yes: which components (HW + SW) would you prefer, and how would you combine them?
If no: which other tools would you recommend to (better) accomplish this?
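To make the sequencing part of the question concrete, here is a minimal sketch of the per-sample workflow I imagine the infrastructure would have to run many times in parallel (bwa + samtools + GATK). All file names, sample names and thread counts are placeholders, and the flags should be checked against the tool versions actually installed; this only illustrates the kind of workload, it is not a finished pipeline.

```python
#!/usr/bin/env python3
"""Hypothetical per-sample alignment + variant-calling sketch (bwa/samtools/GATK4).
Paths and sample names are placeholders; verify flags against your installed versions."""
import subprocess

sample = "sample01"     # placeholder sample name
ref = "ref.fasta"       # placeholder path to an indexed reference genome


def run(cmd: str) -> None:
    """Print and execute a shell command, aborting on failure."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)


# 1. Align paired-end reads (with a read group for GATK) and sort the output.
run(f"bwa mem -t 8 -R '@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA' {ref} "
    f"{sample}_R1.fastq.gz {sample}_R2.fastq.gz "
    f"| samtools sort -@ 4 -o {sample}.sorted.bam -")
run(f"samtools index {sample}.sorted.bam")

# 2. Mark PCR duplicates (GATK4's Picard-based MarkDuplicates).
run(f"gatk MarkDuplicates -I {sample}.sorted.bam -O {sample}.dedup.bam "
    f"-M {sample}.dup_metrics.txt")
run(f"samtools index {sample}.dedup.bam")

# 3. Call variants per sample with HaplotypeCaller (GVCF mode for later joint calling).
run(f"gatk HaplotypeCaller -R {ref} -I {sample}.dedup.bam "
    f"-O {sample}.g.vcf.gz -ERC GVCF")
```

The question is essentially how to run hundreds of such jobs (plus the imaging and structure workloads) reliably, and where to keep all the data.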
Thanks in advance!