Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Data Processing Pipeline [RETIRED]

delangeldelangel Posts: 71GSA Member mod
edited September 2013 in Methods and Workflows

Please note that the DataProcessingPipeline qscript is no longer available.

The DPP script was only provided has an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have taken the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what are the processing steps and what parameters you need to set/adjust.

Data Processing Pipeline

The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.

Introduction

Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis ready BAM files (one per sample sequenced).

Requirements

This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.

Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.

Command-line arguments

Required Parameters

Argument (short-name) Argument (long-name) Description
-i <BAM file / BAM list> --input <BAM file / BAM list> input BAM file - or list of BAM files.
-R <fasta> --reference <fasta> Reference fasta file.
-D <vcf> --dbsnp <dbsnp vcf> dbsnp ROD to use (must be in VCF format).

Optional Parameters

Argument (short-name) Argument (long-name) Description
-indels <vcf> --extra_indels <vcf> VCF files to use as reference indels for Indel Realignment.
-bwa <path> --path_to_bwa <path> The path to the binary of bwa (usually BAM files have already been mapped - but if you want to remap this is the option)
-outputDir <path> --output_directory <path> Output path for the processed BAM files.
-L <GATK interval string> --gatk_interval_string <GATK interval string> the -L interval string to be used by GATK - output bams at interval only
-intervals <GATK interval file> --gatk_interval_file <GATK interval file> an intervals file to be used by GATK - output bams at intervals

Modes of Operation (also optional parameters)

Argument (short-name) Argument (long-name) Description
-p <name> --project <name> the project name determines the final output (BAM file) base name. Example NA12878 yields NA12878.processed.bam
-knowns --knowns_only Perform cleaning on knowns only.
-sw --use_smith_waterman Perform cleaning using Smith Waterman
-bwase --use_bwa_single_ended Decompose input BAM file and fully realign it using BWA and assume Single Ended reads
-bwape --use_bwa_pair_ended Decompose input BAM file and fully realign it using BWA and assume Pair Ended reads

The Pipeline

Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis read reads (bam file):

the data processing pipeline

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:

BWA alignment

This option is for datasets that have already been processed using a different pipeline or different criteria, and you want to reprocess it using this pipeline. One example is a BAM file that has been processed at the lane level, or did not perform some of the best practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample level bams and entering the pipeline.

Sample Level Processing

This is the where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.

Indel Realignment

This is a two step process. First we create targets using the Realigner Target Creator (either for knowns only, or including data indels), then we realign the targets using the Indel Realigner (see [Local realignment around indels]) with an optional smith waterman realignment. The Indel Realigner also fixes mate pair information for reads that get realigned.

Base Quality Score Recalibration

This is a crucial step that re-adjusts the quality score using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, Dinucleotide Covariate

The Outputs

The Data Processing Pipeline produces 3 types of output for each file: a fully processed bam file, a validation report on the input bam and output bam files, a analysis before and after base quality score recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.

Processed Bam File

The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.

Validation Files

We validate each unprocessed sample level BAM file and each final processed sample level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM has to pass all validation, but we provide the log of the validation as an informative companion to your input. The validation file is named : <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.

Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is a strict report on how your BAM file is looking. Some errors are not critical, but the output files (both pre.validation and post.validation) should give you some input on how to make your dataset better organized in the BAM format.

Base Quality Score Recalibration Analysis

PDF plots of the base qualities are generated before and after recalibration for further analysis on the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named : <project name>.<sample name>.pre and <project name>.<sample name>.post.

Examples

  1. Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)

    java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/GATK/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -p myFancyProjectName \ -i myDataSet.list \ -R reference.fasta \ -D dbSNP.vcf \ -run

  2. Performing realignment and the full data processing pipeline in one pair-ended bam file

    java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -bwa path/to/bwa \ -i test.bam \ -R reference.fasta \ -D dbSNP.vcf \ -p myProjectWithRealignment \ -bwape \ -run

DataProcessingPPL.jpeg
834 x 2309 - 211K
Post edited by Geraldine_VdAuwera on

Comments

  • ericminikelericminikel Posts: 24Member

    Nice -- so to actually download this DataProcessingPipeline.scala is this where I'll find the most recent version?

    https://github.com/broadgsa/gatk/tree/master/public/scala/qscript/org/broadinstitute/sting/queue/qscripts

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Yes, that's correct.

    Geraldine Van der Auwera, PhD

  • SophiaSophia Posts: 33Member

    The input bam files used by the DataProcessingPipeline.scala script have to have read groups already, correct?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Yes, that's correct.

    Geraldine Van der Auwera, PhD

  • SophiaSophia Posts: 33Member

    I am sure many of us are interested in adding further GATK analysis steps to this base script, like me.

    As a proof of principle that I have understood the way the scala file works, I tried to add a UnifiedGenotyper step at the end of the current DataProcessingPipeline that uses recalBam as an input.

    I started from the DataProcessingPipeline.scala file from the QueueLite-2.1-11-gfb37f33 resources. Without any modifications, this script worked for me with an input list of 2 bam files.

    I made the following modifications, but did not get it to work so far:

    1. I added the following line to the "put each sample through the pipeline" section, indicating the extension for the file that will be created as an output of the UnifiedGenotyper step (ugvcf):

           val ugvcf = swapExt(bam, ".bam", ".ug.vcf")
      
    2. I added an "add" step after the "add" step that calculates the covariates for the recalibrated bam file

           cov(recalBam, postRecalFile),
           ugvcf(recalBam, ugvcf)
      
    3. Taking the other GATK walker classes as an example, I added a new class in the section "Classes (GATK walkers)", modifying what seemed necessary:

          // Unified Genotyper:
      
            case class ugvcf (inBam: File, outVcf: File) extends UnifiedGenotyper with CommandLineGATKArgs {
            this.input_file :+= inBam
            this.baq = CalculationMode.CALCULATE_AS_NECESSARY
            this.out = outVcf
            if (!qscript.intervalString.isEmpty) this.intervalsString ++= Seq(qscript.intervalString)
            else if (qscript.intervals != null) this.intervals :+= qscript.intervals
            this.scatterCount = nContigs
            this.isIntermediate = false
            this.analysisName = queueLogDir + outVcf + ".ug"
            this.jobName = queueLogDir + outVcf + ".ug"
              }
      

    When running the following queue command:

              java -jar QueueLite.jar DataProcessingPipeline_UG.scala -p bamprocess_and_ug -i myDataSet.list -R hsapiens.fa -D dbsnp.vcf.gz -l ERROR
    

    I get the following error:

              ERROR 16:09:21,321 QScriptManager - DataProcessingPipeline_121016.scala:332: ugvcf of type java.io.File does not take parameters 
              ERROR 16:09:21,327 QScriptManager -           ugvcf(recalBam, ugvcf) 
              ERROR 16:09:21,329 QScriptManager -                ^ 
              ERROR 16:09:22,108 QScriptManager - one error found 
    

    It is not obvious to me what I have to do to make this class "functional" or to make it read the output from the previous step. Maybe you can enlighten me?

    Many thanks in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Hi @Sophia,

    You're on the right track, your problem is just a programming error. I can't set a precedent for helping people debug their own code, but I'll give you a hint: you're giving the same name to two very different things...

    Geraldine Van der Auwera, PhD

  • SophiaSophia Posts: 33Member
    edited October 2012

    Thanks a lot, @Geraldine_VdAuwera, I found it, I was a little bit blind, and thought there might be some major things wrong/missing.

    For anyone who is interested, the correct lines of code would be:

                 val ugVariants = swapExt(bam, ".bam", ".ug.vcf")
    

    [...]

                 ugvcf(recalBam, ugVariants)
    
    Post edited by Sophia on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    It's usually the small easy things that are hard to find, because you look for something complex... :)

    Geraldine Van der Auwera, PhD

  • SophiaSophia Posts: 33Member

    Another question on the default DataProcessingPipeline:

    It always executes the RevertSam step in the beginning of the workflow, although I am not using any of the BWA options:

                 java -jar QueueLite.jar DataProcessingPipeline.scala -p firstproject -i myDataSet.list -R hsapiens.fa -D dbsnp.vcf.gz
    

    Since I am using quite plain unprocessed bam files as an input (with readgroups added), I do not expect any trouble from omitting this step and starting directly with the intervals for indel realignment step. What do you think about that?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    The RevertSam/BWA do-over step is something we find useful internally, but it is by no means essential. Feel free to skip it if you don't think it's necessary for your data set.

    Geraldine Van der Auwera, PhD

  • SophiaSophia Posts: 33Member

    I have been trying to avoid the revert step to be executed (at least for executions without realignment), and am finding it somewhat tricky. I suspect it will be necessary to interfere at one or several of the following parts of the script:

    1. Here, when there is no BWA option specified, the bam files are put through the revert step which is what I want to stop. However, when eliminating the "else..." part, I lack one parameter (namely realignedBAMS) for the subsequent execution of createSampleFiles.

      val realignedBAMs = if (useBWApe || useBWAse  || useBWAsw) {performAlignment(bams)} else {revertBams(bams, false)}
      val sampleBAMFiles: Map[String, Seq[File]] = createSampleFiles(bams, realignedBAMs)
      
    2. At the beginning of the "put each sample through the pipeline" section, I have to use the sampleBAMFiles map:

       for ((sample, bamList) <- sampleBAMFiles)
      
    3. The createSampleFiles method asks for 2 lists of bam files, original and realigned ones. But what if I do not perform realignment?

    My aim is to generate the sampleBAMFiles map directly for my original input bam files (as specified with -i in the Queue command line) and use this as an initial input for the following steps (clean, dedup, cov etc.).

    What would be the most straightforward way to tackle this? I am a bit stuck here, and maybe I am thinking too complicated again. Would you have any hints for me?

    Many thanks in advance and cheers, Sophia

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Hi Sophia,

    You should be able to run this pipeline without reverting/realigning -- that's what it's set up to do by default. It does execute a step called revert(), but if you look at the code for that step, you'll see that as long as removeAlignmentInformation is false (which it should be by default unless you specify a realigner), all it does is rename the file.

    If you do want to get rid of that step entirely, you can bypass it very simply by loading your input files into a file list the old-fashioned way. See some of the example qscripts for inspiration -- the other pipeline scripts are unnecessarily complicated for what you want to do.

    Good luck!

    Geraldine Van der Auwera, PhD

  • SophiaSophia Posts: 33Member

    Thanks Geraldine, I can see which lines of the script you are referring to. Still strange, because I have been running the script without the BWA realignment options all the while but my *reverted.bam.out files contain a full picard command line:

           net.sf.picard.sam.RevertSam INPUT=rg.bam OUTPUT=reverted.bam SORT_ORDER=coordinate        
           REMOVE_ALIGNMENT_INFORMATION=false VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true  
           RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ,  
           PG, MD, MQ] VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 
           CREATE_MD5_FILE=false
    

    ... and say they have run for like 20 min. Also the reverted.bam files are slightly bigger than the input.bam files. So it seems to be running the revert step after all, although REMOVE_ALIGNMENT_INFORMATION is indeed set to false.

    As for the list of input files, I might load them to a list easily, but then I would not be able to take advantage of the createSampleFiles method to retrieve the readgroups and sample names e.g. for multi-bam samples...

    I will keep playing around with this then...

    cheers!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Ah, fair points all around. Let me clarify, because I was thinking of realignment (from scratch) only when talking about reverting, and mischaracterized what RevertSam does (my comment about it "just renaming the file" was incorrect). Realignment itself is not necessary, and you can happily skip that -- which is what RevertSam does when removeAlignmentInformation is false. But admittedly it does revert the rest of the information like base qualities that might have been already recalibrated. This is recommended if your BAMs have been messed around with previously, to start from a clean slate, which will lead to better results. However if your BAMs are fresh off the aligner, none of that should be necessary, and you are right to think that it would be more efficient to skip RevertSam completely.

    I agree the createSampleFiles method is more powerful and useful, so if you feel comfortable experimenting with getting it to work, by all means, have at it :)

    Geraldine Van der Auwera, PhD

  • mikedmiked Posts: 2Member
    edited November 2012

    Hello,

    I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:

    • MarkDuplicates
    • Count Covariates
    • Table Recalibration
    • Realigner Target Creator
    • Indel Realigner

    What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples.

    Any help on this would be appreciated.

    Mike

    Post edited by Geraldine_VdAuwera on
  • jjmmiijjmmii Posts: 3Member
    edited November 2012

    Hi,

    I've been getting an error when following the second example script.

    http://pastebin.com/Mv5qBNZy

    I ran with -l DEBUG. At first the debugger shows that I am missing R packages, and I installed them. Now it is showing the following output: "Error in order(allJobs$analysisName, allJobs$startTime, decreasing = T) : argument 1 is not a vector:"

    Furthermore, the first ERROR line in every attempt is

    http://pastebin.com/pJFbcYd9

    I also noticed that the command doesn't get past the realignment stage, because when I deliberately type the wrong path for bwa, the errors are still the same.

    Please give me some advice on what I can do to elucidate what's wrong.

    Apologies for not knowing how to embed code or output. Any links to teach me that would be great too.

    Jamie

    Post edited by jjmmii on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin
    edited January 2013

    Hi Jamie,

    I'm sorry but at this time we don't provide support for using the pipeline scripts. I suggest you check carefully what are the required inputs for this script and whether your command line fulfills this.

    To embed multi-line code, just paste it in with a 4-space indent like this:

    ##This is code
    <the syntax is shown as pasted, all markup is displayed>
    

    For a single line:

    Use back-ticks

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • TiphaineTiphaine Posts: 53Member

    In your Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis read reads (bam file) here, you put MarkDuplicate after Realignment but in Best Practices, you put the step of MarkDuplicate before Realignment. What is the good order for you and why?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    These two steps can be switched around without noticeable impact. This document is an older one; in our later docs we've grouped non-GATK tools at the start of the pipeline mainly for practical reasons.

    Geraldine Van der Auwera, PhD

  • sboylesboyle Posts: 19Member

    Geraldine, would you please provide access to your newer data processing pipeline docs? For example the one you mention in your pervious post where you have regrouped tool use. Thank you in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    That simply refers fro example to the Best Practices document: http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

    Phase I of that document describes the data processing part.

    Geraldine Van der Auwera, PhD

  • weijiazhangweijiazhang Posts: 1

    I was trying to download DataProcessingPipeline.scala from the site listed above and found that the script was removed. Where can I get the script? Thx

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Sorry, this script is no longer publicly available. Please read the message at the top of this page for an explanation of why we removed the script and how to emulate its functionality.

    Geraldine Van der Auwera, PhD

  • AviAvi Posts: 4Member

    Hi Jamie, Did you ever figure out why you were getting this error - "Error in order( .......... argument 1 is not a vector" ? I'm getting the same error and just thought I'd ask before making a separate post ! Thanks, Avi

    @jjmmii said: Hi,

    I've been getting an error when following the second example script.

    http://pastebin.com/Mv5qBNZy

    I ran with -l DEBUG. At first the debugger shows that I am missing R packages, and I installed them. Now it is showing the following output: "Error in order(allJobs$analysisName, allJobs$startTime, decreasing = T) : argument 1 is not a vector:"

    Furthermore, the first ERROR line in every attempt is

    http://pastebin.com/pJFbcYd9

    I also noticed that the command doesn't get past the realignment stage, because when I deliberately type the wrong path for bwa, the errors are still the same.

    Please give me some advice on what I can do to elucidate what's wrong.

    Apologies for not knowing how to embed code or output. Any links to teach me that would be great too.

    Jamie

  • AviAvi Posts: 4Member

    Fixed it. In case anyone comes across the same error try updating your ggplot2 package, I think ggplot has some significant differences between some of the recent versions ! After updating, I was able to produce all the run statistic plots correctly. Avi

Sign In or Register to comment.