How to best build a new bioinformatics infrastructure

maymann, Copenhagen, Denmark (Member)
edited November 2013 in Ask the GATK team

Hi community,

I'm new to bioinformatics, but I am a computer scientist with a background in distributed computing environments.

I have been given the exciting task of building a new infrastructure for the following purposes...

DNA sequencing:
New DNA sequencing technologies have revolutionized biological and medical research. Today, it is possible to sequence a complete human genome in less than one week and at low cost. DNA sequencing can also be used for gene expression analysis (RNA-seq), identification of mutations (SNPs), probing of binding sites for DNA and RNA binding proteins (ChIP-seq and CLIP-seq), sequencing of ancient DNA, studying the biodiversity of ecosystems (metagenomics), and much more. Each such experiment typically generates at least 200 million short DNA sequences of 100 bases each (one lane of the Illumina HiSeq machine). Handling and analyzing these 20 billion bases currently requires a bioinformatics expert. We are currently using these technologies in medical applications such as disease classification and diagnosis, in studying the bacterial ecosystem of the human gut, and in several other areas.
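
To make the data volume concrete, here is a minimal back-of-the-envelope sketch in Python. The read count and read length are the figures quoted above; the uncompressed FASTQ size is only a rough estimate, and the 60-byte header length is an assumption chosen for illustration (real headers vary by instrument and pipeline).

```python
# Back-of-the-envelope size of one sequencing lane.
READS_PER_LANE = 200_000_000  # "at least 200 million short DNA sequences"
BASES_PER_READ = 100          # "100 bases each"


def bases_per_lane(reads=READS_PER_LANE, read_length=BASES_PER_READ):
    """Total number of sequenced bases in one lane."""
    return reads * read_length


def approx_fastq_bytes(reads, read_length, header_bytes=60):
    """Rough uncompressed FASTQ size in bytes.

    A FASTQ record has four lines: header, sequence, '+', and qualities,
    each ending in a newline. header_bytes=60 is an assumed average.
    """
    per_record = (header_bytes + 1) + (read_length + 1) + 2 + (read_length + 1)
    return reads * per_record


if __name__ == "__main__":
    print(f"{bases_per_lane():,} bases per lane")
    size_gb = approx_fastq_bytes(READS_PER_LANE, BASES_PER_READ) / 1e9
    print(f"roughly {size_gb:.0f} GB of uncompressed FASTQ per lane")
```

Under these assumptions a single lane already amounts to tens of gigabytes before alignment, which is why storage is as much a concern as compute.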

BioImaging:
The selection of drug treatments based on the genetic make-up of a specific patient is the future of personalized medicine. In breast cancer research, inhibitors that block a specific step in homologous recombination (HR) are thought to eradicate tumor cells in some forms of breast cancer. To identify such drug candidates, we study the enzymatic steps of HR in live human cells using a PerkinElmer Opera high-throughput (100,000 images/day) confocal microscope available at the Center for Advanced BioImaging. Such a screen will examine 100-200 cells at three different drug concentrations for each molecule in a drug library of >10,000 small molecules. The subsequent analysis of 3-6 million cells at three imaging wavelengths will require large computing capabilities for object detection, segmentation, geometric alignment and quantification, for which access to bioinformatics and computational analysis is crucial. Several groups in the department use imaging technologies, and there is a strong need for computational resources and, not least, a professional storage solution.
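
The 3-6 million figure follows directly from the screen design described above. A small Python sketch of that arithmetic, using only the numbers from the paragraph (the function name is made up for illustration):

```python
def cells_in_screen(cells_per_condition, concentrations, molecules):
    """Total cells imaged: cells per condition x concentrations x molecules."""
    return cells_per_condition * concentrations * molecules


if __name__ == "__main__":
    low = cells_in_screen(100, 3, 10_000)
    high = cells_in_screen(200, 3, 10_000)
    print(f"{low:,} to {high:,} cells")  # prints "3,000,000 to 6,000,000 cells"
```

Since the library holds more than 10,000 molecules, these values are lower bounds for one full screen.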

Protein structure analysis:
Proteins are biological macromolecules that play a central role in biology, biotechnology and medicine, and atomic resolution structures of proteins can provide crucial insight into the mechanisms by which they function. As such, the generation and analysis of protein structures form the basis for a broad range of experimental studies in biochemistry, biophysics and molecular biology. Many protein structures have already been determined and are available in publicly accessible databases, but detailed and quantitative analyses require a computational approach. Further, it is now possible to model the structures of many proteins by exploiting structural information on other proteins with similar sequences (so-called homology modeling); again, reliable modeling requires specific computational expertise. The department hosts several research groups that can (i) determine protein structures experimentally through nuclear magnetic resonance spectroscopy and X-ray crystallography, (ii) model or predict the structures and dynamical properties of proteins, and (iii) use computational methods to predict the effect of protein mutations on biophysical and biochemical properties. The substantial computational resources that will be available in the BIO-Computing core facility will be essential to unleash the full combined potential of these individual research activities, and to make these available to all research groups at the department.

Can Hadoop/HDFS + GATK (+ ?) be used to best meet the above requirements?

  • If yes: which components (HW + SW) would you prefer, and how would you set them up?

  • If no: which other tools would you recommend to accomplish this (better)?

Thanks in advance :) !

~maymann

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    Accepted Answer

    Hi @maymann,

    Welcome to the field! GATK is appropriate for identifying mutations, specifically SNPs and Indels, in genome sequencing data. Have a look at the Best Practices documentation for more details.

    However, GATK does not currently run on Hadoop, although this is something we are starting to look into. For large-scale work, we recommend multithreading (available internally through the -nt and -nct arguments) and scatter-gather parallelism using Queue on a computing cluster.
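
    A minimal Python sketch of the two ideas above, under stated assumptions: it is a generic stand-in for scatter-gather, not Queue itself. Only -nt and -nct come from this answer; the other flags (-T, -R, -I, -o, -L) and the jar name follow the GATK 2.x/3.x command-line style, the file names and intervals are hypothetical, and which tools accept -nt versus -nct varies, so check the tool documentation before relying on either.

```python
from concurrent.futures import ThreadPoolExecutor


def build_gatk_command(tool, reference, bam, output,
                       interval=None, nt=None, nct=None):
    """Assemble a GATK 2.x/3.x-style command line as an argument list."""
    cmd = ["java", "-jar", "GenomeAnalysisTK.jar", "-T", tool,
           "-R", reference, "-I", bam, "-o", output]
    if interval is not None:
        cmd += ["-L", interval]   # restrict this job to one interval
    if nt is not None:
        cmd += ["-nt", str(nt)]   # data threads
    if nct is not None:
        cmd += ["-nct", str(nct)]  # CPU threads per data thread
    return cmd


def scatter_gather(intervals, worker, max_workers=4):
    """Scatter: run worker on each interval in parallel.
    Gather: return the results in the original interval order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, intervals))


if __name__ == "__main__":
    # Hypothetical inputs. A real worker would launch the command
    # (for example via subprocess.run) and return the output path.
    def plan(interval):
        return build_gatk_command("UnifiedGenotyper", "ref.fasta",
                                  "sample.bam", f"{interval}.vcf",
                                  interval=interval, nct=4)

    for command in scatter_gather(["chr1", "chr2", "chr3"], plan):
        print(" ".join(command))
```

    On a cluster, each scattered job would be submitted to the scheduler rather than run in a local thread pool, and the gather step would merge the per-interval outputs into one file.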

  • maymann, Copenhagen, Denmark (Member)

    Hi Geraldine,

    Thanks for your quick and kind reply :)!

    If we use GATK, could the above requirements be met? And if so, what would be the ideal large-scale setup for this kind of work?

    Thanks in advance :) !

    ~maymann

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    Accepted Answer

    GATK will solve some of your requirements, but certainly not all (that's a long list of very different things you posted...). I do not know of any software that will do all of the above. But GATK will cover important points of your DNA sequence analysis needs.

    Your setup will depend enormously on what infrastructure is available to you. We are not able to provide detailed recommendations on this point, sorry. Good luck!

  • maymann, Copenhagen, Denmark (Member)

    Hi,

    Geraldine: thanks again for your quick and kind reply :)!

    All: I'm building a new infrastructure from scratch, so I need to decide which software to base the solution on, derive the OS and hardware requirements from that, and then estimate the supporting infrastructure (network, rack space, power, cooling, etc.).

    Any help/reference at this point would be much appreciated.

    Thanks in advance :) !

    ~maymann
