
Data pre-processing for variant discovery

Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)


This is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This includes alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.


Reference Implementations

Pipeline | Summary | Notes | Github | Terra
---------|---------|-------|--------|------
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | hg38 & b37
$5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38
Generic data pre-processing | uBAM to analysis-ready BAM | universal | yes | hg38 & b37

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.

Expected input

This workflow is designed to operate on individual samples, for which the data is initially organized in distinct subsets called read groups. These correspond to the intersection of libraries (the DNA product extracted from biological samples and prepared for sequencing, which includes fragmenting and tagging with identifying barcodes) and lanes (units of physical separation on the DNA sequencing chips) generated through multiplexing (the process of mixing multiple libraries and sequencing them on multiple lanes, for risk and artifact mitigation purposes).

Our reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM.
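For example, a FASTQ pair can be converted to uBAM with Picard's FastqToSam (shown here via the GATK4 wrapper). This is a minimal sketch: all file names and read-group values below are illustrative placeholders, not part of any reference implementation.

```shell
# Convert one read group's paired FASTQs to an unmapped BAM (uBAM).
# File names and read-group fields are placeholders; substitute your own.
gatk FastqToSam \
    --FASTQ sample_L001_R1.fastq.gz \
    --FASTQ2 sample_L001_R2.fastq.gz \
    --OUTPUT sample_L001_unmapped.bam \
    --SAMPLE_NAME sample1 \
    --READ_GROUP_NAME sample1_L001 \
    --LIBRARY_NAME lib1 \
    --PLATFORM ILLUMINA
```

One uBAM is produced per read group; a multiplexed sample therefore yields several uBAMs, one per library-lane combination.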

Main steps

We begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by data generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read.

Map to Reference

Tools involved: BWA, MergeBamAlignment

This first processing step is performed per-read group and consists of mapping each individual read pair to the reference genome, which is a synthetic single-stranded representation of common genome sequence that is intended to provide a common coordinate framework for all genomic analysis. Because the mapping algorithm processes each read pair in isolation, this can be massively parallelized to increase throughput as desired.
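As a sketch of how this step can look on the command line (assuming GATK4, BWA, and an indexed reference with a sequence dictionary are available; file names, paths, and thread count are placeholders), the per-read-group mapping is typically streamed from SamToFastq through bwa mem into MergeBamAlignment, so that the read-group metadata from the uBAM is preserved in the aligned output:

```shell
# Per-read-group mapping: stream the uBAM through bwa mem, then merge the
# alignments back with the unmapped BAM to retain read-group metadata.
gatk SamToFastq \
    --INPUT sample_L001_unmapped.bam \
    --FASTQ /dev/stdout \
    --INTERLEAVE true \
    --NON_PF true \
| bwa mem -K 100000000 -p -t 8 ref.fasta /dev/stdin \
| gatk MergeBamAlignment \
    --ALIGNED_BAM /dev/stdin \
    --UNMAPPED_BAM sample_L001_unmapped.bam \
    --REFERENCE_SEQUENCE ref.fasta \
    --OUTPUT sample_L001_aligned.bam
```

Piping avoids writing intermediate FASTQ and raw-alignment files to disk; each read group can be mapped in a separate, parallel invocation of this pipeline.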

Mark Duplicates

Tools involved: MarkDuplicatesSpark / MarkDuplicates + SortSam


This second processing step is performed per-sample and consists of identifying read pairs that are likely to have originated as duplicates of the same original DNA fragment through some artifactual process. These are considered non-independent observations, so the program tags all but a single read pair within each set of duplicates, causing the marked pairs to be ignored by default during the variant discovery process. At this stage the reads also need to be sorted into coordinate order for the next step of the pre-processing; MarkDuplicatesSpark performs both the duplicate marking and the sorting for this stage. This phase of the pipeline has historically been a performance bottleneck due to the large number of comparisons made between read pairs in a sample, so MarkDuplicatesSpark uses Apache Spark to parallelize the process and take better advantage of all available resources. The tool can be run locally even without access to a dedicated Spark cluster.
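A minimal MarkDuplicatesSpark invocation might look like the following (file names are placeholders; input should be queryname-sorted or queryname-grouped, as produced by the mapping step, and the output is coordinate-sorted by default):

```shell
# Mark duplicates and coordinate-sort in one step, per sample.
# Multiple per-read-group BAMs can be passed with repeated -I arguments.
gatk MarkDuplicatesSpark \
    -I sample_L001_aligned.bam \
    -I sample_L002_aligned.bam \
    -O sample_dedup_sorted.bam \
    -M sample_duplicate_metrics.txt
```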

MarkDuplicates and SortSam:

As an alternative to MarkDuplicatesSpark, this step can be performed using the Picard implementation of MarkDuplicates for the duplicate marking, followed by SortSam to sort the reads. Both of these tools are currently implemented as single-threaded tools and consequently cannot take advantage of core parallelism. On machines with a high number of cores, or with access to fast disk drives, it is advised to run MarkDuplicatesSpark rather than MarkDuplicates followed by SortSam. MarkDuplicatesSpark produces bit-wise equivalent output to this approach when run following the Best Practices, so either set of tools is valid.
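The non-Spark alternative is the two-tool sequence sketched below (file names are placeholders):

```shell
# Mark duplicates with Picard MarkDuplicates, then coordinate-sort with SortSam.
gatk MarkDuplicates \
    --INPUT sample_L001_aligned.bam \
    --INPUT sample_L002_aligned.bam \
    --OUTPUT sample_dedup.bam \
    --METRICS_FILE sample_duplicate_metrics.txt

gatk SortSam \
    --INPUT sample_dedup.bam \
    --OUTPUT sample_dedup_sorted.bam \
    --SORT_ORDER coordinate
```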

Base (Quality Score) Recalibration

Tools involved: BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)

This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or from instrumentation defects in the sequencer.

The recalibration procedure involves collecting covariate measurements from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes, but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized, but it is computationally trivial and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, followed by a final file merge operation to produce a single analysis-ready file per sample.
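The two recalibration passes can be sketched as follows (the reference, known-sites files, and BAM names are placeholders):

```shell
# Pass 1: collect covariates and build the recalibration model, using known
# variant sites to mask true variation so it is not counted as error.
gatk BaseRecalibrator \
    -I sample_dedup_sorted.bam \
    -R ref.fasta \
    --known-sites dbsnp.vcf.gz \
    --known-sites known_indels.vcf.gz \
    -O recal_data.table

# Pass 2: apply the model to adjust base qualities and emit the
# analysis-ready BAM.
gatk ApplyBQSR \
    -I sample_dedup_sorted.bam \
    -R ref.fasta \
    --bqsr-recal-file recal_data.table \
    -O sample_recalibrated.bam
```

When scattering over genomic regions, each shard runs both tools with an `-L` intervals argument, the per-shard recalibration tables are combined with GatherBQSRReports before the ApplyBQSR pass, and the recalibrated shards are merged at the end.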

Additional Information

Post edited by Geraldine_VdAuwera


  • foxDie00 (Member)

    Is it ok to sort and index the raw aligned BAM that is later merged with the uBAM, or should I align the raw BAM only, merge with uBAM, then only sort & index the final merged BAM?

  • foxDie00 (Member)

    Also, when will tutorials for each of these steps be updated for GATK4?

  • Sheila (Broad Institute; Member, Broadie)


    I think this article will help.

    Most of the theory behind the tools is the same in GATK4, but I see what you are saying about commands needing to be updated. I will make a note and see if we can update them asap.


  • @Sheila said:

    I think this article will help.

    Most of the theory behind the tools is the same in GATK4, but I see what you are saying about commands needing to be updated. I will make a note and see if we can update them asap.


    Hi Sheila, I am a little confused: have the json file and wdl file on this GitHub page, https://github.com/gatk-workflows/gatk4-data-processing, already been updated for hg38 or not?
    I was using Docker on my local computer to run the Best Practices data pre-processing with the two files mentioned above. I deleted the runtime part. I always get this error report:
    [2018-08-14 10:28:58,02] [error] WorkflowManagerActor Workflow c83e2d4f-03e8-48a7-9571-629c1b15651a failed (during ExecutingWorkflowState): Job PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem:16:1 exited with return code 127 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    Check the content of stderr for potential additional information: /Users/shuanglu/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/c83e2d4f-03e8-48a7-9571-629c1b15651a/call-SamToFastqAndBwaMem/shard-16/execution/stderr.
    /Users/shuanglu/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/c83e2d4f-03e8-48a7-9571-629c1b15651a/call-SamToFastqAndBwaMem/shard-16/execution/script: line 32: samtools: command not found
    /Users/shuanglu/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/c83e2d4f-03e8-48a7-9571-629c1b15651a/call-SamToFastqAndBwaMem/shard-16/execution/script: line 30: /usr/gitc/bwa: No such file or directory
    Error: Unable to access jarfile /usr/gitc/picard.jar
    I ran docker run -i -t broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 and ls, and found that there is a picard file in that path. I feel lost.

  • Sheila (Broad Institute; Member, Broadie)


    Someone will help you here.


  • Angry_Panda (Member)
    edited September 2018

    Dear GATK team,
    I have a question about the sample data: NA12878_24RG_small.

    What is the Small unaligned bam list of NA12878 (NA12878_24RG_small.txt)?
    In these Best Practices workflows, the GATK team usually offers NA12878_24RG_small or NA12878 sample data for testing. I cannot find out what the NA12878_24RG_small data is, or what the difference is between it and the normal NA12878. What does '24RG' imply here?
    Thanks for reading, and looking forward to your answer.

  • Looking at the workflow on GitHub, it looks like one input json file is required for each sample that I want to preprocess.
    So should I create one json file for each of them and launch each analysis separately? I know that I can write a new WDL workflow that imports the PreProcessingForVariantDiscovery workflow and scatters among many samples, but I was wondering if there's another way.

  • Hi, I found a contradiction between the GitHub WDL script and the statement here about Mark Duplicates
    (Tools involved: MarkDuplicates, SortSam).

    In github page offered wdl script: https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl
    # Aggregate aligned+merged flowcell BAM files and mark duplicates
    # We take advantage of the tool's ability to take multiple BAM inputs and write out a single output
    # to avoid having to spend time just merging BAM files.
    call MarkDuplicates {...}
    # Sort aggregated+deduped BAM file and fix tags
    call SortAndFixTags {...}

    Generally, in this WDL script, you call MarkDuplicates and then sort, but in this article the sort step is described as coming first.
    I got confused.

    Also, I ran the processing-for-variant-discovery-gatk4.wdl script offered on the GitHub page with my own NA12878 uBAM file list (4 uBAM files generated from FastqToSam) and the hg38 reference data offered in the json files. (https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/haplotypecaller-gvcf-gatk4.wdl + https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json)
    I found a slightly odd result: each of my 4 input uBAMs is about 33 GB. Checking the cromwell-execution folder, my aligned.duplicates-marked.sorted.bam is around 120 GB. After BQSR, the final output BAM file is around 70 GB. Then I ran the HaplotypeCaller WDL script, also offered on the GitHub page, and got a really tiny g.vcf file, 2.7 GB (NA12878.hg38.g.vcf.gz).

    I used exactly the same WDL file linked above; in the json file I just changed the input to my own 4 BAM files passed as a uBAM list, and kept the reference and known-sites resource files offered in the json file above. And of course I adjusted all file paths to absolute paths on my virtual machine.

    Do you think this is normal, or did I do something wrong?

  • picasa1983 (France; Member)


    I am trying to run the pipeline, but I ran into an issue:

    [2019-10-12 18:27:15,07] [info] WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
    [2019-10-12 18:27:15,22] [info] MaterializeWorkflowDescriptorActor [de8c2b0b]: Parsing workflow as WDL draft-2
    [2019-10-12 18:27:16,72] [info] WorkflowManagerActor Workflow de8c2b0b-ab99-42a4-837b-1a84aa1f8425 failed (during MaterializingWorkflowDescriptorState
    ): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
    Required workflow input 'PreProcessingForVariantDiscovery_GATK4.known_indels_sites_VCFs' not specified
    Required workflow input 'PreProcessingForVariantDiscovery_GATK4.dbSNP_vcf' not specified
    Required workflow input 'PreProcessingForVariantDiscovery_GATK4.known_indels_sites_indices' not specified
    Required workflow input 'PreProcessingForVariantDiscovery_GATK4.dbSNP_vcf_index' not specified

    I am working with a non-model organism and I don't have any known SNPs or indels. Is it possible to skip those parameters?
