Using ReduceReads as part of a GATK pipeline

mpw6 (Posts: 10; Member)
edited October 2012 in Ask the GATK team

We are trying to find out whether ReduceReads will help with the overwhelming file sizes in the SNP calling we are doing on whole-genome BAM files. We have been using a protocol similar to the one described in the Best Practices document: "Best: multi-sample realignment with known sites and recalibration". My question is: what is the best point in the pipeline to run ReduceReads?

Answers

  • ebanks, Broad Institute (Posts: 686; Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Dev, GP Member)

    After all of the data processing steps, right before variant calling. Note that ReduceReads will want to work on single-sample files.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Mark_DePristo (Posts: 153; Administrator, GATK Dev)

    I would no longer do multi-sample realignment. I'd do per-sample realignment with known sites, recalibrate, and then run ReduceReads per sample to make a reduced BAM. That's the recommended option today, and it allows you to avoid the (ridiculously) expensive joint realignment step. Let us know your experiences; I'm very interested to hear how well this works for you.

    --
    Mark A. DePristo, Ph.D.
    Co-Director, Medical and Population Genetics
    Broad Institute of MIT and Harvard

  • mpw6 (Posts: 10; Member)

    So then if I need to combine samples, I should reduce the reads and then combine?

  • ebanks, Broad Institute (Posts: 686; Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Dev, GP Member)

    The GATK can combine BAMs on the fly, so you shouldn't need to physically combine them.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mpw6 (Posts: 10; Member)

    OK, that leads to a question I had earlier but never posted. It is really outside the scope of this thread, but since you mentioned GATK combining BAMs: I stumbled when I got to VariantAnnotator, since it didn't seem to support multiple BAMs.

  • ebanks, Broad Institute (Posts: 686; Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Dev, GP Member)

    VariantAnnotator absolutely does support multiple BAMs as input...

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mpw6 (Posts: 10; Member)

    Thank you. I'll try it again, and if I hit the same snag, I'll start another thread asking where I went wrong with VariantAnnotator.

  • mpw6 (Posts: 10; Member)

    My mistake. It did accept multiple BAMs. It was just taking an enormous amount of time to process. I'll try reducing the reads.

  • Mark_DePristo (Posts: 153; Administrator, GATK Dev)

    Why are you running VariantAnnotator anyway? You can just tell UnifiedGenotyper to add all of the annotations you want while calling.

    --
    Mark A. DePristo, Ph.D.
    Co-Director, Medical and Population Genetics
    Broad Institute of MIT and Harvard

  • mpw6 (Posts: 10; Member)

    I just asked that question of my collaborator. We're going to rework our protocol. Thank you for all the help. I'm off to find more info about parallelizing now.

  • mpw6 (Posts: 10; Member)

    After some time working on analysis for our Nature paper, I have returned my focus to this issue and have results to discuss. Using the default arguments with ReduceReads and using the resulting BAM as input to UnifiedGenotyper, I get a result set that is smaller than I would get using non-reduced inputs. If I split the file by chromosome intervals, I get an even smaller VCF file. We are guessing here that this is due to some threshold values dropping reads that would otherwise be processed. I am still trying to determine if the smaller files are subsets of the larger ones, but I'm hoping that you might have advice regarding the arguments that would allow for equivalent outputs.

  • mpw6 (Posts: 10; Member)

    I was using GenomeAnalysisTK-2.0-38-g45f7b0d. I hadn't seen that 2.1 was available. I will try that instead. Thank you.
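The workflow suggested in the replies above (reduce each sample separately, then pass all reduced BAMs to UnifiedGenotyper and request annotations during calling) might look roughly like the sketch below. It only builds the command lines and does not run anything. The jar path, file names, and the helper functions `reduce_reads_cmd` and `unified_genotyper_cmd` are assumptions made for illustration; the flags (`-T`, `-R`, `-I`, `-o`, `-A`) follow GATK 2.x command-line conventions and should be checked against the documentation for the installed version.

```python
from typing import Iterable, List

GATK_JAR = "GenomeAnalysisTK.jar"  # assumed location of the GATK 2.x jar


def reduce_reads_cmd(reference: str, sample_bam: str, reduced_bam: str) -> List[str]:
    """Build a ReduceReads command for ONE sample.

    The input BAM is assumed to be already realigned and recalibrated,
    since ReduceReads runs right before variant calling.
    """
    return ["java", "-jar", GATK_JAR, "-T", "ReduceReads",
            "-R", reference, "-I", sample_bam, "-o", reduced_bam]


def unified_genotyper_cmd(reference: str,
                          reduced_bams: Iterable[str],
                          out_vcf: str,
                          annotations: Iterable[str] = ()) -> List[str]:
    """Build a UnifiedGenotyper command over several reduced BAMs.

    Each BAM gets its own -I, so the files are combined on the fly and
    never merged on disk. Each requested annotation gets its own -A, so
    a separate VariantAnnotator pass is not needed.
    """
    cmd = ["java", "-jar", GATK_JAR, "-T", "UnifiedGenotyper", "-R", reference]
    for bam in reduced_bams:
        cmd += ["-I", bam]
    for name in annotations:
        cmd += ["-A", name]
    cmd += ["-o", out_vcf]
    return cmd


if __name__ == "__main__":
    samples = ["sampleA", "sampleB"]  # hypothetical sample names
    for s in samples:
        print(" ".join(reduce_reads_cmd("ref.fasta", s + ".recal.bam", s + ".reduced.bam")))
    print(" ".join(unified_genotyper_cmd(
        "ref.fasta", [s + ".reduced.bam" for s in samples],
        "calls.vcf", annotations=["QualByDepth"])))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths are real. Building the commands as lists rather than strings avoids shell quoting problems with unusual file names.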

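For the open question of whether the smaller VCF is a subset of the larger one, a minimal sketch is shown here. It assumes plain-text, tab-delimited VCF files and compares calls only by chromosome, position, REF, and ALT; genotypes, quality, and filters are ignored. The function names `vcf_sites` and `missing_from` are made up for this example and are not part of GATK.

```python
from typing import Iterable, List, Set, Tuple

Site = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def vcf_sites(lines: Iterable[str]) -> Set[Site]:
    """Collect the variant sites from the lines of a VCF file."""
    sites: Set[Site] = set()
    for line in lines:
        # Skip blank lines and header lines (both "##" meta lines and "#CHROM").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def missing_from(smaller: Iterable[str], larger: Iterable[str]) -> List[Site]:
    """Return sites present in `smaller` but absent from `larger`.

    An empty result means `smaller` is a subset of `larger`.
    """
    return sorted(vcf_sites(smaller) - vcf_sites(larger))


if __name__ == "__main__":
    # Hypothetical file names for the reduced and non-reduced call sets.
    with open("reduced.vcf") as small, open("full.vcf") as big:
        extra = missing_from(small, big)
    print("subset" if not extra else "%d sites only in reduced.vcf" % len(extra))
```

Swapping the two arguments lists the calls that were lost after reduction, which is probably the more useful set to inspect against depth or quality thresholds.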