Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Best Practice Variant Detection with the GATK v4, for release 2.0 [RETIRED]

Mark_DePristo Posts: 153 | Administrator, GSA Member, admin
edited February 4 in Methods and Workflows

Introduction

1. The basic workflow

Our current best practice for making SNP and indel calls is divided into four sequential steps: initial mapping, refinement of the initial reads, multi-sample indel and SNP calling, and finally variant quality score recalibration. These steps are the same for targeted resequencing, whole exomes, deep whole genomes, and low-pass whole genomes.

[Figure: the four-step GATK Best Practices workflow]

Example commands for each tool are available on the individual tool's wiki entry. There is also a list of which resource files to use with which tool.

Note that, depending on the specific attributes of a project, the values used in each command may need to be selected or modified by the analyst. Anyone running our tools should take care to understand what each parameter does and to evaluate which value best fits the data and project design.

2. Lane, Library, Sample, Cohort

There are four major organizational units for next-generation DNA sequencing processes that are used throughout this documentation:

  • Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.
  • Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.
  • Sample: A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Here we treat samples as independent individuals whose genome sequence we are attempting to determine. From this perspective, tumor / normal samples are different despite coming from the same individual.
  • Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends intimately on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individuals in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples), we divide the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.

This document describes how to call variation within a single analysis cohort, comprising one or many samples, each of one or many libraries that were sequenced on at least one lane of an NGS machine.

Note that many GATK commands can be run at the lane level, but will give better results when they see all of the data for a single sample, or even all of the data for all samples. Unfortunately, there is a trade-off in computational cost when running these commands across all of your data simultaneously.

3. Testing data: 64x HiSeq on chr20 for NA12878

In order to help individuals get up to speed, evaluate their command lines, and generally become familiar with the GATK tools we recommend you download the raw and realigned, recalibrated NA12878 test data from the GATK resource bundle. It should be possible to apply all of the approaches outlined below to get excellent results for realignment, recalibration, SNP calling, indel calling, filtering, and variant quality score recalibration using this data.

4. Where can I find out more about the new GATK 2.0 tools you are talking about?

In our GATK 2.0 slide archive.


Phase I: Raw data processing

1. Raw FASTQs to raw reads via mapping

The GATK data processing pipeline assumes that one of the many NGS read aligners (see this review) has been applied to your raw FASTQ files. For Illumina data we recommend BWA because it is accurate, fast, well-supported, open-source, and emits SAM files natively (which are then easy to convert to BAM).
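
As a concrete illustration, here is a minimal sketch of mapping one Illumina paired-end lane with BWA and converting the output to a sorted, indexed BAM. The file names and the read group string are placeholders, and the exact options may differ between versions; check the BWA and Picard documentation before running.

    bwa index ref.fasta
    bwa aln ref.fasta lane_1.fastq > lane_1.sai
    bwa aln ref.fasta lane_2.fastq > lane_2.sai
    # -r attaches the read group (ID/SM/LB/PL) that the GATK relies on downstream
    bwa sampe -r "@RG\tID:lane1\tSM:NA12878\tLB:lib1\tPL:ILLUMINA" \
        ref.fasta lane_1.sai lane_2.sai lane_1.fastq lane_2.fastq > lane.sam
    # convert to a coordinate-sorted, indexed BAM with Picard
    java -jar SortSam.jar INPUT=lane.sam OUTPUT=lane.bam \
        SORT_ORDER=coordinate CREATE_INDEX=true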

2. Raw reads to analysis-ready reads

The three key processes used here are:

  • Local realignment around indels: Reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs. We look for the most consistent placement of the reads with respect to the indel in order to clean up these artifacts.
  • MarkDuplicates: Duplicately sequenced molecules shouldn't be counted as additional evidence for or against a putative variant. By marking these reads as duplicates the algorithms in the GATK know to ignore them.
  • Base quality score recalibration: The per-base estimate of error known as the base quality score is the foundation upon which all statistical calling algorithms are based. We've found that the estimates provided by the sequencing machines are often inaccurate and, worse, biased. Through recalibration an empirically accurate error model is assigned to the bases to create an analysis-ready BAM file. Note: if you have old data that was recalibrated with an old version of BQSR, you need to rerun it with the new version so that insertion and deletion qualities can be added to your recalibrated BAM file. A command-line sketch of these three steps follows this list.
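
For reference, here is a minimal command-line sketch of the three processes for a single lane. It assumes GATK 2.x argument names and uses placeholder file names (ref.fasta, known_indels.vcf, dbsnp.vcf, etc.); consult each tool's documentation page for the full argument list.

    # 1. Mark duplicates (Picard)
    java -jar MarkDuplicates.jar INPUT=lane.bam OUTPUT=dedup.bam \
        METRICS_FILE=dedup.metrics

    # 2. Local realignment around indels
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta \
        -I dedup.bam -known known_indels.vcf -o realign.intervals
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta \
        -I dedup.bam -targetIntervals realign.intervals \
        -known known_indels.vcf -o realigned.bam

    # 3. Base quality score recalibration
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta \
        -I realigned.bam -knownSites dbsnp.vcf -o recal.grp
    java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta \
        -I realigned.bam -BQSR recal.grp -o recal.bam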

There are several options here, from the easy and fast basic protocol to the more comprehensive but computationally expensive pipeline. For example, there are two types of realignment, which require vastly different amounts of processing power:

  • Realignment only at known sites is very efficient and can operate with little coverage (1x genome-wide per lane), but it can only realign reads at known indels.
  • Fully local realignment uses mismatching bases to determine whether a site should be realigned, and relies on sufficient coverage to discover the correct indel allele in the reads for alignment. It is much slower (it involves a Smith-Waterman step) but can discover new indel sites in the reads. If you have a database of known indels (for human, this database is extensive), then at this stage you would also include these indels during realignment, which vastly improves sensitivity, specificity, and speed. (A sketch contrasting the two modes follows this list.)
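
To illustrate where the difference shows up on the command line, the knowns-only mode can be selected through IndelRealigner's consensus determination model. The sketch below assumes GATK 2.x argument names and the placeholder files from the sketch above; check the IndelRealigner documentation for the current option name and values.

    # Known-sites-only realignment (fast; no Smith-Waterman step)
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta \
        -I dedup.bam -targetIntervals realign.intervals \
        -known known_indels.vcf \
        --consensusDeterminationModel KNOWNS_ONLY -o realigned.bam

    # Fully local realignment (slower; can discover new indels from the reads):
    # omit --consensusDeterminationModel so the default read-based model is used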

Fast: lane-level realignment (at known sites only) and lane-level recalibration

This protocol uses lane-level local realignment around known indels (very fast, as there's no sample-level processing) to clean up lane-level alignments. This results in better quality scores, as they are less biased by indel alignment artifacts.

for each lane.bam
    dedup.bam <- MarkDuplicates(lane.bam)
    realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
    recal.bam <- recal(realigned.bam)

Fast + per-sample processing

Here we are essentially just merging the recalibrated lane.bams for a sample, dedupping the reads, and calling it quits. This doesn't perform indel realignment across lanes, so it leaves in some indel artifacts. For humans, which now have an extensive list of known indels (get them from the GATK bundle!), the lane-level realignment around known indels is going to make up for the lack of cross-lane realignment. This protocol is appropriate if you are going to use callers like the HaplotypeCaller, UnifiedGenotyper with BAQ, or samtools with BAQ that are less sensitive to the initial alignment of reads, or if your project has limited coverage per sample (< 8x), where per-sample indel realignment isn't more empowered than per-lane realignment. For other situations, or for organisms with a limited database of segregating indels, it's better to use the advanced protocol if you have deep enough data per sample.

for each sample
    recals.bam <- merged lane-level recal.bams for sample
    dedup.bam <- MarkDuplicates(recals.bam)
    sample.bam <- dedup.bam

Better: recalibration per lane then per-sample realignment with known indels

As with the basic protocol, this protocol assumes the per-lane processing has already been completed. It is essentially the basic protocol but with per-sample indel realignment added.

for each sample
    recals.bam <- merged lane-level recal.bams for sample
    dedup.bam <- MarkDuplicates(recals.bam)
    realigned.bam <- realign(dedup.bam) [with known sites included if available]
    sample.bam <- realigned.bam

This is the protocol we use at the Broad in our fully automated pipeline because it gives an optimal balance of performance, accuracy and convenience.

Best: per-sample realignment with known indels then recalibration

Rather than doing the lane-level cleaning and recalibration, this process aggregates all of the reads for each sample and then does full dedupping, realignment, and recalibration, yielding the best single-sample results. The big change here is sample-level cleaning followed by recalibration, giving you the most accurate quality scores possible for a single sample.

for each sample
    lanes.bam <- merged lane.bams for sample
    dedup.bam <- MarkDuplicates(lanes.bam)
    realigned.bam <- realign(dedup.bam) [with known sites included if available]
    recal.bam <- recal(realigned.bam)
    sample.bam <- recal.bam

This protocol can be hard to implement in practice unless you can afford to wait until all of the data is available to do data processing for your samples.

Misc. notes on the process

  • MarkDuplicates only needs to be run at the library level. So the sample-level dedupping isn't necessary if you only ever run a library on a single lane. If you run the same library on many lanes (as can be necessary for whole exome sequencing, for example), you should dedup at the library level.
  • The base quality score recalibrator is read group aware, so running it on a merged BAM file containing multiple read groups is the same as running it on each BAM file individually. There's some memory cost (so it's best not to recalibrate too many read groups simultaneously), but for reasonable projects this is fine.
  • Local realignment preserves read meta-data, so you can realign and then recalibrate just fine.
  • Multi-sample realignment with known sites and recalibration isn't really recommended any longer. It's extremely computationally expensive and isn't necessary for advanced callers with advanced filters like the UnifiedGenotyper / HaplotypeCaller and VQSR. It's better to use one of the protocols above and then an advanced caller that is robust to indel artifacts.
  • However, note that for contrastive calling projects -- such as cancer tumor/normals -- we generally recommend realigning the tumor and the normal together to avoid slight alignment differences between the two tissue types.

3. Reducing BAMs to minimize file sizes and improve calling performance

ReduceReads is a novel (perhaps even breakthrough?) GATK 2.0 data compression algorithm. The purpose of ReduceReads is to take a BAM file with NGS data and reduce it down to just the information necessary to make accurate SNP and indel calls, as well as to genotype reference sites (hard to achieve), using GATK tools like UnifiedGenotyper or HaplotypeCaller. ReduceReads accepts a BAM file as input and produces a valid BAM file (it works in IGV!), but with a few extra tags that the GATK can use to make accurate calls.

You can find more information about reduced reads in some of our presentations in the archive.

ReduceReads works well for exomes or high-coverage (at least 20x average coverage) whole genome BAM files. In this case we highly recommend using ReduceReads to minimize the file sizes. Note that ReduceReads performs a lossy compression of the sequencing data that works well with the downstream GATK tools, but may not be supported by external tools. Also, we recommend that you archive your original BAM file, or at least a copy of your original FASTQs, as ReduceReads is highly lossy and doesn't qualify as an archival data compression format.

Using ReduceReads on your BAM files will cut their sizes down to approximately 1/100 of the original, allowing the GATK to process tens of thousands of samples simultaneously without excessive IO and processing burdens. Even for single samples, ReduceReads cuts the memory requirements, IO burden, and CPU costs of downstream tools significantly (10x or more), so we recommend you preprocess analysis-ready BAM files with ReduceReads.

for each sample
    sample.reduced.bam <- ReduceReads(sample.bam)
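
As a minimal sketch (GATK 2.x syntax; file names are placeholders):

    java -jar GenomeAnalysisTK.jar -T ReduceReads \
        -R ref.fasta -I sample.bam -o sample.reduced.bam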

Phase II: Initial variant discovery and genotyping

1. Input BAMs for variant discovery and genotyping

After the raw data processing step, the GATK variant detection process assumes that you have aligned, duplicate-marked, and recalibrated BAM files for all of the samples in your cohort. Because the GATK can dynamically merge BAM files, it isn't critical to have merged lane files into sample BAMs, or even sample BAMs into cohort BAMs. In general we try to create sample-level BAMs for deep data sets (deep WG or exomes) and merged cohort files by chromosome for low-pass WG.

For this part of the document, I'm going to assume that you have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort. Note that some of the data processing steps, such as multiple-sample local realignment, will merge BAMs for many samples into a single BAM. If you've gone down this route, you just need to modify the GATK commands as necessary to take a single BAM for all samples rather than one BAM per sample.

2. Multi-sample SNP and indel calling

The next step in the standard GATK data processing pipeline, whole genome or targeted, deep or shallow, is to apply the HaplotypeCaller or UnifiedGenotyper to identify sites among the cohort samples that are statistically non-reference. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort. It's at this stage that we use the meta-data in the BAM files extensively -- read groups for reads, with sample, platform, etc. -- to enable us to do the multi-sample merging and genotyping correctly. It was a pain for data processing, yes, but now life is easy for downstream calling and analysis.

Selecting an appropriate quality score threshold

A common question is what confidence score threshold to use for variant detection. We recommend the following (illustrated just below this list):

  • Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.
  • Shallow (< 10x coverage per sample) data: because variants necessarily have lower quality with shallower coverage, we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.
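
In practice these thresholds are passed to the callers through their minimum confidence arguments, typically -stand_call_conf (and -stand_emit_conf); the exact argument names may vary by version, so check the tool documentation. For example:

    # deep data (> 10x per sample)
    -stand_call_conf 30.0
    # shallow data, 100 samples or fewer
    -stand_call_conf 4.0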

Standard protocol: HaplotypeCaller

raw.vcf <- HaplotypeCaller(sample1.bam, sample2.bam, ..., sampleN.bam)
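
As a sketch, a minimal HaplotypeCaller command for a deep-coverage cohort might look like this (GATK 2.x syntax; file names and thresholds are placeholders, and the dbSNP file is optional):

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R ref.fasta \
        -I sample1.bam -I sample2.bam -I sampleN.bam \
        --dbsnp dbsnp.vcf \
        -stand_call_conf 30.0 \
        -o raw.vcf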

Standard protocol: UnifiedGenotyper

raw.vcf <- UnifiedGenotyper(sample1.bam, sample2.bam, ..., sampleN.bam)
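
The corresponding UnifiedGenotyper sketch (again GATK 2.x syntax with placeholder file names); -glm BOTH requests SNP and indel calling in a single pass:

    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R ref.fasta \
        -I sample1.bam -I sample2.bam -I sampleN.bam \
        --dbsnp dbsnp.vcf \
        -glm BOTH \
        -stand_call_conf 30.0 \
        -o raw.vcf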

Choosing HaplotypeCaller or UnifiedGenotyper

  • We believe the best possible caller in the GATK is the HaplotypeCaller, which combines a local de novo assembler with a more advanced HMM likelihood function than the UnifiedGenotyper. It should produce excellent SNP, MNP, indel, and short SV calls. It should be the go-to calling algorithm for most projects. It is, for example, how we make our Phase II call set for 1000 Genomes.
  • Currently the HaplotypeCaller only supports diploid calling. If you want to call non-diploid samples you'll need to use the UnifiedGenotyper.
  • At the moment the HaplotypeCaller does not support multithreading. For now you should indeed stick with the UG if you wish to use the -nt option. However you can use Queue to parallelize execution of HaplotypeCaller.
  • If for some reason you cannot use the HaplotypeCaller, fall back to the UnifiedGenotyper protocol below. Otherwise try out the HaplotypeCaller and let us know about your experiences here on the forum!

Phase III: Integrating analyses: getting the best call set possible

This raw VCF file should be as sensitive to variation as you'll get without imputation. At this stage, you can assess things like sensitivity to known variant sites or genotype chip concordance. The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference. All of the subsequent steps are designed to separate out the false positive machine artifacts from the true positive genetic variants.

1. Statistical filtering of the raw calls

The process used here is the Variant Quality Score Recalibrator (VQSR), which builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant in the callset is a true genetic variant or a machine/alignment artifact. All filtering criteria are learned from the data itself.

2. Analysis ready VCF protocol

Take a look at our FAQ page for specific recommendations on which training sets and command-line arguments to use with various project designs. Users are expected to do some analysis of their own to determine the best way to use this tool for their particular project. These recommendations are only meant as jumping-off points!

We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:

    snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
    indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
    recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
    analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
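
Concretely, the SNP half of this protocol might look like the sketch below (GATK 2.x syntax; the resource files, priors, and annotations shown are placeholders, so take the actual values from the FAQ page mentioned above). The indel half is the same pair of commands run with -mode INDEL, applying the indel model to the recalibratedSNPs.rawIndels.vcf output as in the pseudocode above.

    # build the SNP error model
    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta \
        -input raw.vcf -mode SNP \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp.vcf \
        -an QD -an FS -an MQRankSum -an ReadPosRankSum \
        -recalFile snp.recal -tranchesFile snp.tranches

    # apply the SNP model to the raw callset
    java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta \
        -input raw.vcf -mode SNP \
        -recalFile snp.recal -tranchesFile snp.tranches \
        --ts_filter_level 99.0 \
        -o recalibratedSNPs.rawIndels.vcf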

3. Notes about small whole exome projects or small target experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples. Also, for experiments that employ targeted resequencing of a small region (for example, a few hundred genes), VQSR may not be empowered regardless of the number of samples in the experiment. For users with experiments containing fewer exome samples or with a small target region there are several options to explore (listed in priority order of what we think will give the best results):

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline).
  • Use the VQSR with the smaller SNP callset but experiment with the argument settings (see the sketch after this list). For example, try adding --maxGaussians 4 --percentBad 0.05 to your command line. Note that this is very dependent on your dataset, and you may need to try some very different settings. It may not even work at all. Unfortunately we cannot give you any specific advice, so please do not post questions on the forum asking for help finding the right parameters.
  • Use hard filters (detailed below).
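
For instance, the experimental settings mentioned above would be added to the SNP model-building command roughly as follows (a sketch only; as noted, the right values are dataset-dependent and may not work at all):

    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta \
        -input raw.vcf -mode SNP \
        --maxGaussians 4 --percentBad 0.05 \
        ... (resources, annotations, and output files as in the sketch above)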

Recommendations for very small data sets (in terms of either the number of samples or the size of the targeted regions)

These recommended arguments for VariantFiltration are only to be used when ALL other options are unavailable:

You will need to compose filter expressions (see here, here and here for details) to filter on the following annotations and values; a VariantFiltration sketch follows the lists below:

For SNPs:

  • QD < 2.0
  • MQ < 40.0
  • FS > 60.0
  • HaplotypeScore > 13.0
  • MQRankSum < -12.5
  • ReadPosRankSum < -8.0

For indels:

  • QD < 2.0
  • ReadPosRankSum < -20.0
  • InbreedingCoeff < -0.8
  • FS > 200.0

Note that the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
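
As a sketch, these filters can be applied with VariantFiltration roughly as follows (GATK 2.x syntax; file and filter names are placeholders, and you would typically separate SNPs and indels with SelectVariants first so that each set gets its own expression):

    # SNPs
    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta \
        --variant raw_snps.vcf \
        --filterExpression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
        --filterName "snp_hard_filter" \
        -o filtered_snps.vcf

    # Indels
    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta \
        --variant raw_indels.vcf \
        --filterExpression "QD < 2.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8 || FS > 200.0" \
        --filterName "indel_hard_filter" \
        -o filtered_indels.vcf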

For shallow-coverage (<10x): you cannot use filtering to reliably separate true positives from false positives. You must use the protocol involving variant quality score recalibration.

The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be binomial sampling, but in practice it is closer to a Gaussian distribution. Regardless, the DP threshold should be set at 5 or 6 sigma above the mean coverage across all samples, so that a DP > X filter eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
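
As a rough worked example (illustration only; compute the actual mean and spread from your own data): for a whole genome dataset with about 30x mean coverage across samples, the standard deviation of depth is on the order of sqrt(30), roughly 5.5, so a 6-sigma cutoff puts the threshold at about 30 + 6 x 5.5, roughly 63, i.e. a filter expression of approximately DP > 63.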

That said, all of the caveats about determining the right parameters, etc, are annoying and are largely eliminated by variant quality score recalibration.

[Attachment: BPP-wRR-white_small.png, 660 x 406]
Post edited by Geraldine_VdAuwera on

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard


Comments

  • thondeboerthondeboer Posts: 15Member

    Is there a link to Version 3 somewhere? I need to reference this since that is what we currently use

  • TristanTristan La Jolla, CAPosts: 11Member

    Is the ApplyRecalibration the final step then? I've only run it the once but so far I seem to have the same number of lines though the scores and number of lines with "PASS" have all changed. I'm assuming we should take it from there?

    Love the new site, looks great =)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Tristan,

    ApplyRecalibration is the final step of recalibration. What you're seeing is normal -- it is a very simple tool; given a minimum VQSLOD (say, 99.5%) specified by the user, ApplyRecalibration goes through the whole file, marking variants as 'pass' or 'fail' based on whether their VQSLODs are above or below that threshold. No lines are added or taken out of your VCF.

    There are additional steps that can be done with GATK tools to examine the quality of your callset, before moving on to the actual analyses you want to perform on it. See for example the documentation of the VariantEval tool. Based on the results, you may want to go back to the recalibration step and apply a different cutoff to your callset (i.e. repeat ApplyRecalibration with a different VQSLOD value). This is not included in the Best Practices because it depends more on your experiment, dataset and results, whereas the Best Practices constitute a set of core principles that should apply to all experiments.

    Glad you like the new site!

    Geraldine Van der Auwera, PhD

  • ericminikelericminikel Posts: 24Member

    The Queue DataProcessingPipeline seems like a great way to accomplish Phase I pretty much out of the box. Is there anything comparable for Phase III?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin
    edited September 2012

    Not as such but we encourage you to look at the example scripts on the repository and use them to develop your own.

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • JayceJayce Posts: 7Member

    hi,

    Can I know if we still need to filter on StrandBias (SB) >= -1.0 for indels? Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Jayce,

    All the key recommendations are in this Best Practices document; if it's not in here, it's not essential. Which doesn't mean it can't be used -- that's between you and your dataset...

    Good luck!

    Geraldine Van der Auwera, PhD

  • sboylesboyle Posts: 21Member

    Hello,

    I am preparing to learn to use GATK for the first time. I am new to this type of analysis and am interested in running through some exercises prior to jumping into my own data. In this article you mention applying this approach to the NA12878 data. You mention using raw and realigned, recalibrated data. I have found NA12878 data under bundle/1.5/b37, but I'm not sure whether this is the correct location? There are also many different files, but all of them are called cleaned and recalled. Also, some of them are labeled as HG19 even though they are in the b37 folder. I ideally would like to reference the HG19 build.

    Which files (and their locations on the server please) would you consider the raw, realigned, and recalibrated data?

    Also, is there a section of the website where you work through a specific variant calling example and provide output files that we can compare our results to? It would be extremely helpful for feeling that we are on the correct track. I could have missed it if there is already something like this.

    Thank you very much for your help! ~Sean

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Sean,

    Unfortunately we don't really have any start-to-finish tutorials yet; it's near the top of my to-do list, but I can't promise you an ETA.

    If you want to practice the workflow, I would recommend starting with NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam, which is chromosome #20 from NA12878. You can use Picard's RevertSam tool to revert the file to raw reads, then work through BWA alignment (to the reference of your choosing) and all subsequent steps as indicated in this article. Then you can use the callsets we provide in the bundle to evaluate whether things worked properly.

    Good luck!

    Geraldine Van der Auwera, PhD

  • sboylesboyle Posts: 21Member

    Thanks for the suggestion! I also came across this website while searching the web. It seems to be a good resource for exome analysis using GATK that includes command line examples. I imagine you've come across it, but in case you haven't... http://seqanswers.com/wiki/How-to/exome_analysis

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Thanks for the link, that's an interesting walkthrough. Just keep in mind that command lines may change, as we occasionally update argument names and things like that, so be sure to check that everything is current before running.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member
    edited November 2012

    Hi:

    I saw a few places, including the URL referred to by sboyle above, suggesting to use Picard's FixMateInformation.jar utility for paired-end data; is it really necessary? I have run through GATK without it for such data and it seems fine (no errors and no complaints from GATK). Just want to ask your opinion from the GATK group's point of view.

    Also for the best practice V4, for the step: 2. Raw reads to analysis-ready reads at phase I, I only have lane-level sample data (each sample in one lane, no sample in multi-lanes), so I only did the Fast: lane-level realignment (at known sites only) and lane-level recalibration, which is

    for each lane.bam
        dedup.bam <- MarkDuplicate(lane.bam)
        realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
        recal.bam <- recal(realigned.bam)
    

    since I do not have multi-lane samples, I do not have to run Fast + per-sample processing, Better: per-sample realignment with known indels and recalibration, or Best: multi-sample realignment with known sites and recalibration, right? It seems that all of these schemes, Better, Best etc., are only for the situation where some samples run on multiple lanes; is my understanding correct?

    In the paragraph for Fast + per-sample processing, it mentions cross-lane realignment; I'm not sure what that means exactly. If every sample of mine is in a single lane, it is not needed, right?

    Thanks

    Mike

    Post edited by Geraldine_VdAuwera on
  • cardilloxcardillox Posts: 6Member

    Hello, thank you for this guide.

    I have some doubts about phase I; I cannot understand how the realignment and the recalibration should be performed. Should I use an alignment software such as BWA?

    Thank you very much

    cardillox

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    To clarify: the initial mapping step (as shown here) is done by an aligner such as BWA. Realignment and recalibration are done by the GATK tools (RealignerTargetCreator+IndelRealigner then BaseRecalibrator+PrintReads) as explained in this and other documents.

    Geraldine Van der Auwera, PhD

  • cardilloxcardillox Posts: 6Member
    edited November 2012

    Dear Geraldine, Thank you for your answer. I did not read all the documents yet, but in this document I cannot find any reference to RealignerTargetCreator and the other tools that you mentioned.

    I apologize for the inconvenience

    Have a nice day

    Post edited by cardillox on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    That's okay, we'll try to clarify this documentation in the near future.

    Geraldine Van der Auwera, PhD

  • julianjulian Posts: 4Member

    Hi,

    what does the term 'multisample' in the section 'Best: multi-sample realignment with known sites and recalibration' refer to? Looking at the workflow code and the text, it seems to do all the processing for each sample individually.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Julian, that was actually an inappropriate title (leftover from a previous version), I've rewritten it to make more sense. Thanks for pointing this out!

    Geraldine Van der Auwera, PhD

  • FarahFarah Posts: 8Member

    Hi, I've downloaded and installed the GATK bundle, run the example and got all the output as required. I now want to try this on my real data. I have 6 trios, each composed of two affected kids and one normal parent. I want to filter out homozygous or compound heterozygous mutations as my final output. I read that GATK works better if I analyze the cohort together, should I go this route, given my complex sample? Also I was reading about Queue, should I install that too before I start? Finally is it just a matter of following the same steps as in the example (with the caveat that I have to do this for multiple samples, which I am not sure of how to?) on the real data?

    Sorry about all the questions, being new to bioinformatics this is a little overwhelming

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Farah,

    You can analyze all the samples in your cohort together, yes. I would recommend that you get comfortable with running commands individually before you try to use Queue. Since you have a reasonably small amount of samples you don't really need Queue at this point anyway. I also recommend you use the figure in this article as a roadmap. For each step, look up the tool, understand what is its purpose, and run the tool on your samples using the basic parameters that are detailed in the documentation. You may also find it useful to look at the presentations from our recent user education workshop before you start, particularly the videos (which just came online today, whoo).

    See http://www.broadinstitute.org/gatk/guide/events and http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk

    If you run into problems, check out the documentation and be sure to search the forum for answers. If you still can't find a solution, post a question and we'll try to help you out.

    Good luck!

    Geraldine Van der Auwera, PhD

  • FarahFarah Posts: 8Member

    whoo hoo, you have no idea how happy I am clicking on the links you sent! Thanks so much! I'll do this step-wise as you suggest and hopefully won't bug you too much along the way. I just learned that our institute (which uses its own cancer-specific pipeline, while my samples are intellectual disability so I think GATK will be more suitable) has its own formats for the .bam files generated by BWA, but it's not GATK compatible. So I'm going to start from the fastq files. I had to regenerate the fastq files from the raw bam files though, as the institute doesn't store the original fastq. I did this using Picard, I hope it will still be compatible.

    Thanks again for the help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    I'm glad you're finding the workshop materials helpful :)

    Sounds like you're on the right track with your files. Picard tools are great, we use them for a lot of things too. They are developed by another group here at Broad. All standard formats (bam, vcf etc) are compatible between their tools and ours.

    Geraldine Van der Auwera, PhD

  • FarahFarah Posts: 8Member

    Hi Geraldine, I've followed the complete workshop and it was great. Very clear, informative and I feel quite ready to start now. The one thing I still need answered is a very basic question; how do I run the walkers on multiple files at the same time? Since it is better to run the analyses as a cohort rather than as individual files. The workshop gave command line for single file analysis only. So for example if I need to run a walker on the cohort, do I then simply call more input files than one? for example - java -jar GenomeAnalysisTK.jar - T WalkerName \ - R reference.fasta -I Trio1child1.bam -I Trio1child2.bam -I Trio1mother.bam -o output1 -o output2 -o output3

    and so on. I have 18 samples, i.e, 6 trios where a trio is a composed of 2 affected kids and the normal parent (often mother).

    Or am I completely on the wrong track here?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    You've got it right! Although you can also pass the filenames in a single text file, one filename (with full path) per line, like so: -I my_bamfiles.list

    Geraldine Van der Auwera, PhD

  • FarahFarah Posts: 8Member

    Great, thanks very much Geraldine. Just got to see this on getting back from another job. I'll set it up and start right away. Excited!

  • LaviniaLavinia Posts: 37Member
    edited February 2013

    Hi, I have just started working with GATK and was pleased to discover this page. Can I just confirm in the above recommendations for very small data sets, a QD < 2?, surely this should be >? The example for SelectVariants is a QD > 10. (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html).

    Update - I think I see my error, the above is for flagging, so > 2 is flagged, as opposed to > 10 for removal, thanks.

    Post edited by Lavinia on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi there,

    It sounds like you are getting confused between VariantFiltration, where you say what you want to remove "QD < 2", and SelectVariants, where you say what you want to keep "QD > 2". I'll clarify the docs on this point.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi,

    I saw on this page that the basic workflow diagram shows that at phase 2, when calling variants, GATK can create SNP, indel, and structural variation (SV) calls. Which one can be used to get structural variations? Is it the UnifiedGenotyper or HaplotypeCaller or some other new module in GATK v2.0?

    Also for indels derived from unified genotyper, any size limit for the indels derived (max 50bp, 100bp or no limit)?

    Thanks

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    UnifiedGenotyper and HaplotypeCaller both call SNPs and indels. For SVs, you'll need to use a different tool: GenomeSTRiP is a program based on the GATK (but not part of GATK) that allows identification of SVs. See http://www.broadinstitute.org/gatk/guide/topic?name=third-party-tools

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Thanks for the info, Geraldine! However, I found this link below from forum: http://gatkforums.broadinstitute.org/discussion/1541/ask-your-questions-about-genomestrip-here#latest

    It seems we can run GenomeSTRiP from GATK? I did not see any jar file for this tool -- or do we have to download it from their site? Then why do you say it is based on GATK?

    Thanks again

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Mike, GenomeSTRiP uses the same underlying engine as the GATK, and it is made by our colleagues at the Broad Institute, so we are hosting their forum and documentation. However it is built and distributed as a separate program, so you need to download it from their site.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Thanks a lot for the info, Geraldine! Mike

  • mikemike Posts: 103Member

    Hi Geraldine:

    Just realized that you did not address one of my questions mentioned above: for indels derived from the UnifiedGenotyper (or HaplotypeCaller), is there any size limit for the indels derived (max 50bp, 100bp, or no limit)?

    Thanks in advance!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    The tools themselves do not set a hard limit; however in practice there are some limits to what they are able to detect. This is largely dependent on the length of reads in your dataset, since for deletions for example, if you do not have full overlap, you cannot with certitude distinguish deletion from lack of coverage.

    Geraldine Van der Auwera, PhD

  • LaviniaLavinia Posts: 37Member

    Hi, minor point: shouldn't the statement above regarding BWA, "emits BAM files natively", be adjusted, as (AFAIK) BWA will generate a SAM file but other tools are needed to convert a SAM file into a BAM file? Might be helpful for other newbies like myself, cheers.

  • jpofmarsjpofmars Posts: 3Member

    Hi,

    I want to filter snp with using hard filters method. So I would like to know if the length of reads is important for the definition of parameters (in particular with the ReadPosRankSum Test). I have reads of 75 bp I'm wondering if the given parameters in the "Best Practice Variant Detection" are adapted to my dataset.

    Thanks in advance for your answer.

    JP

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Not really -- maybe if you had extremely short or extremely long reads, but we do test a variety of read lengths, and for 75bp you should be fine with our best practices recommendations.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    @Lavinia, that was a fair point -- I edited the doc accordingly. Thanks for pointing it out.

    Geraldine Van der Auwera, PhD

  • JiahaoJiahao Posts: 2Member

    Hi,

    I found that after I did multi-sample (multiple treated versus one untreated) realignment to detect the induced mutations, I lost many SNPs in the untreated sample compared to the results from single-sample realignment. I am wondering whether multi-sample realignment will remove rare mutations present in only a few samples?

    Thanks a lot in advance for your answers.

    Jiahao

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Jiahao,

    When you do multi-sample calling, the genotyper uses the cohort information to evaluate confidence in calls that do not have strong support in individual samples. In general this has a positive effect of "rescuing" variants that occur in several samples even though they are not well supported in the individual samples. However this can occasionally lower the confidence in rare calls that are not present in other samples. This should only affect marginal calls, but if you are looking for very rare variants then perhaps it would be better to call samples individually.

    Geraldine Van der Auwera, PhD

  • luca_beltrameluca_beltrame Posts: 10Member

    A somewhat naive question: I'm thinking about using VQSR on my small targeted region sample set, and I got a number of bams from the 1000 genomes to empower VQSR appropriately. If I understood correctly, I have to make one single multi-sample VCF file including both my samples and the additional bams, then run VQSR. The resulting VCF will however include also calls from the "extra" samples, which I am not interested in. How can I effectively "get rid" of the extra samples post VQSR ?

    Thanks in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Luca,

    You can use SelectVariants to extract the variants from only the samples you're interested in, as in this example from the tech doc:

    Select two samples out of a VCF with many samples:
    
     java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T SelectVariants \
       --variant input.vcf \
       -o output.vcf \
       -sn SAMPLE_A_PARC \
       -sn SAMPLE_B_ACTG
    

    Geraldine Van der Auwera, PhD

  • luca_beltrameluca_beltrame Posts: 10Member

    @Geraldine_VdAuwera said:

    You can use SelectVariants to extract the variants from only the samples you're interested in, as in this example from the tech doc:

    Thanks a lot, noted and added to my processing pipeline.

  • MattBMattB Posts: 19Member

    Can you please give me some advice as to what I would do re phase 1 if I have multiple samples (3 to 4) multiplexed per lane? You discuss the situation of having a sample split across multiple lanes but not multiple samples per lane. Would I recalibrate each lane as a whole entity (I'm assuming this is better as you can take advantage of a whole lane's worth of data for the recalibration), then split up these lanes to the sample level (via read groups, set differently for each sample in the lane) for subsequent deduplication and realignment? Each lane was also run from a single pooled library.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi @MattB,

    If your samples are duly tagged with read group information, the GATK engine will correctly identify and process them at sample level where appropriate. You can just pass a single BAM file containing all your data, if that's what you have.

    Geraldine Van der Auwera, PhD

  • MattBMattB Posts: 19Member

    Thanks, so with respect to the different phase 1 methods, fast, better or best, what order would I perform these steps? Would the following work:

    0) Merge all of my .bam files from BWA into oneBig.bam (with ID and SM set appropriately in the @RG line; these mirror each other as there is only one alignment per sample)

    1) MarkDuplicates on the whole study (despite multiple libraries, will the LB: bit in the @RG line be needed to differentiate?)

    2) Realign everything

    3) Recal everything

    I guess I can then split these by read group before variant detection if I wanted?

  • MattBMattB Posts: 19Member

    Just to come back to this. After re-reviewing the info you have here http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files, my question regards base quality score recalibration: how do I get GATK to take into account an error model for the whole lane rather than for each sample in each lane? From the looks of things it will only function per read group, and I will have 3 to 4 of these per lane, so recalibration will happen in 3 or 4 separate chunks for a lane rather than the whole thing?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Well, the thing to keep in mind is that if you merge all your BAMs together into a big one, the processing of that big BAM is going to be very computationally demanding. Also, note that the recalibrator will process read groups individually, so you will not get the "whole lane data" advantage that you think from recalibrating multiple samples together.

    Actually, let me take a step back and give you a quick run-down of what we do in-house.

    Our setup is a bit complex because we have samples spread over multiple lanes, with multiple samples per lane. So when we get the FastQs, we separate out the read data by read group into individual files, so that after alignment we have one bam file per read group. We run dedup-realign-recal on each bam file, then merge the bams of read groups that belong to the same sample to produce one bam file per sample. Then we do another round of realign-recal on the sample bams as a form of cross-lane cleanup.

    But if you don't have samples spread across different lanes, you don't need to do all of this. The simplest is probably to separate out the samples using their read group tags into individual bam files and process them separately through dedup-realign-recal. I realize this contradicts what I said earlier; technically that (earlier suggestion to just process all together) is still an option (with some advantages, e.g. all samples aligned the same way), but if you plan to split up your samples before variant calling anyway, you might as well split everything earlier on and save yourself the compute resources of processing everything together.

    Does this make sense? We're planning to rewrite and simplify the best practices recommendations (because right now we realize they're pretty confusing), so feel free to give feedback on what is or isn't clear.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Oh, just saw your new post (took me a while to put a response together, sorry -- like I said we're rethinking our recommendations so it needed a bit of internal discussion).

    For the BQSR, the unit of aggregated information is the read group. This is convenient in many aspects but the downside is that you can't get the recalibrator to recalibrate over the entire lane if it is composed of different read groups. As you say, it will process the lane info in chunks according to read group. You could of course cheat and change the read groups, but I wouldn't recommend it. If you only have 3 or 4 different read groups per lane, that should still be enough data for the recalibrator to do its work properly.

    Geraldine Van der Auwera, PhD

  • MattBMattB Posts: 19Member

    Hey, thanks very much for all your comments; this clears up many things I've been wondering about for quite some time. I should now be able to get done what I want to do. I agree the intro to phase 1 is really quite confusing, so any rewrite would be useful.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Glad to help, Matt. We'll try to clarify that intro to phase 1 in the near future.

    Geraldine Van der Auwera, PhD

  • SHWSHW Posts: 2Member

    Hi, I have one sample spread over multiple lanes (aliquots from the same library). I am trying to follow "Best: per-sample realignment with known indels then recalibration" and want to make sure my understanding of designating read groups is correct: we should set the same "read group" in all lane.bams, is that correct?

    Besides, in the in-house procedure, what would be the difference between (dedup-realign-recal) and (only dedup) for each lane.bam? Since eventually in the cross-lane cleanup stage a (realign-recal) would be applied to the lanes.bam.

    Thank you very much for your help :)

    Han

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Han,

    You should assign a separate read group to each of your lane.bams.

    In our in-house procedure, the benefit of doing a first pass of realignment and recalibration is to have the process done at the smallest coherent unit of aggregate information, where it is most needed and pertinent. Then, doing an additional cross-lane cleanup helps make sure that there is homogeneity among the different parts of the dataset. But it's at the lane level that the cleanup really matters most. And just trying to do all the cleanup on all the data lumped together does not give as good results.

    Geraldine Van der Auwera, PhD

  • SHWSHW Posts: 2Member

    Hi, Thank you for your immediate response. In that sense it seems quite consistent (conceptually) between the in-house procedure and the best practice. Now I understand how to get it done properly. Thank you very much for your help :)

    Han

  • praveenrajspraveenrajs Posts: 5Member

    Hi, I've a few questions related to variant calling with GATK.

    1) As we know, a non-diploid variant calling algorithm provides greater sensitivity for making calls in tumor samples, as tumor samples are usually aneuploid. Does GATK assume the sample is diploid? How sensitive is GATK for calling variants in tumor samples, compared to other calling algorithms such as SNVmix or CaVEMan?

    2) Hope you've come across the criterion 'the number of reads with unique start sites' while making variant calls. Setting a value (2 or 4) improves the call confidence. Is this feature (directly or indirectly) available in UnifiedGenotyper or HC?

    3) Is running Genome STRiP on exome data a good practice? If no, what is your suggestion for exome SV discovery?

    Thanks, Raj

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi Raj,

    1) The GATK currently includes two variant calling tools, the UnifiedGenotyper (UG) and HaplotypeCaller (HC). The UG assumes diploidy by default but it is possible to set your own desired ploidy. The HC currently does not have that capability. Other tools are largely ploidy-agnostic. Please note in any case that the GATK tools are not specifically designed for calling somatic mutations. For those purposes please see the software developed by the Broad's cancer group, such as MuTect and SomaticIndelDetector.

    2) I'm not sure exactly what you mean, but I can tell you that the UG applies a number of quality control read filters by default to only consider informative reads and thus improve call confidence.

    3) We don't do SV discovery, but you can discuss this with the authors of GenomeStrip in their own forum, which we host here: http://gatkforums.broadinstitute.org/categories/genomestrip

    Geraldine Van der Auwera, PhD

  • flyingflyersflyingflyers Posts: 13Member

    Following this GATK guidance, after alignment with BWA, I MarkDuplicates using Picard -- just marking them, not physically removing the duplicates. Then I run GATK (RealignerTargetCreator, IndelRealigner, BaseRecalibrator, PrintReads, UnifiedGenotyper, VariantRecalibrator, ApplyRecalibration). Then I realize the coverage is capped below 250, whereas the average coverage of my sample is 300.

    1. What is the cutoff of "good reads" defined by UnifiedGenotyper by default, like for the depth(5~250?), Base Quality(>=30?), Mapping Quality(>=17?).

    2. If I want to remove or modify the above restrictions, for downsampling depth, base quality and mapping quality, which step and how should I adjust them?

    3. In Picard, I didn't remove the duplicated reads. Do you recommend removing the duplicated reads in the Picard step or doing it in the ReduceReads step in GATK? Which field in the BAM file does GATK recognize as marking duplicates if I only MarkDuplicates using Picard? What would the difference be if I physically remove the duplicates versus only marking them, especially when a sample's high coverage is achieved due to a high PCR duplicate rate?

    Thank you so much in advance for clarifying the above points.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi @flyingflyers,

    There is not really a hard cutoff for base and mapping qualities; rather each base's contribution to the genotype calculation will be weighed by the qualities. The program uses this to calculate a confidence score (which means, how confident the program is that the site is variant or not) and you can set thresholds based on that score to determine how confident the calls should be. See the UG documentation for details on the -stand_emit_conf and -stand_call_conf arguments.

    You can adjust the downsampling level using the -dcov argument; see the CommandLineGATK (engine) documentation for details on that.

    Finally, it doesn't matter if you physically remove duplicates or not as long as you mark them. The GATK will simply ignore any reads marked as duplicates.

    I hope this helps clarify things for you.

    Geraldine Van der Auwera, PhD

  • rob123kingrob123king Posts: 8Member

    New to NGS. Working on SNP detection between two Illumina 100bp x50 tomato whole genome resequencing data sets. There is a reference available and annotation. Using this information, can I follow the protocols and not need anything extra, such as a file of known indels? I was going to use BWA-MEM and Picard tools to convert from SAM to BAM, then mark duplicates using Picard tools, then follow the guide from here http://seqanswers.com/wiki/How-to/exome_analysis whilst referring to your guide to check I am using the correct parameters, as the example is human and exome. Does this sound logical? Thanks in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    The general workflow looks sound, just make sure you do read up on our best practices and check arguments, since they do evolve over time. I'd also recommend you have a look at the videos/presentations from our last user workshop, which folks who are new to the field seem to find quite useful: http://www.broadinstitute.org/gatk/guide/events?id=2038

    One issue you're going to run into is the need for known snps (indels are not so important) for base recalibration, which is often an issue for non-human organisms (but maybe for tomato it is available, I'm not sure). And you may not have enough variants to run variant recalibration, in which case you'll have to use hard-filtering.

    We've just finished drafting some fairly comprehensive tutorials to the best practices tools, which I'll try to post in the next few days, so keep an eye out for that.

    Good luck!

    Geraldine Van der Auwera, PhD

  • rob123kingrob123king Posts: 8Member

    Thanks gives me a bit of confidence moving forward with it. There are some snp files here http://www-plb.ucdavis.edu/labs/maloof/tomatosnp/downloads.html What needs to be in the files, coordinates? thanks again

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    A vcf file with sites is needed -- see the FAQ on input files, I think it's explained there.

    Geraldine Van der Auwera, PhD

  • rob123kingrob123king Posts: 8Member

    I'm currently proceeding but using samtools at the moment but would like to compare with GATK. I've found known SNPs on the ncbi to download in the form of fasta or bin file. Do you know of a tool that could convert it to vcf or to bed? The vcf tools just seem to analyse/confirm a vcf file rather than convert.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    We have a tool called VariantsToVCF that can convert certain formats to VCF format. Please see the tool documentation for more details.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 183Member

    Why isn't there a step to clean up the raw reads at the beginning?

  • CarneiroCarneiro Posts: 271Administrator, GSA Member admin

    You mean indel realignment? It is there.

  • blueskypyblueskypy Posts: 183Member

    By "clean up the raw reads", I mean, for example, removing the low quality or short reads. Thanks!

  • blueskypyblueskypy Posts: 183Member

    For example, many GATK functions apply a number of read filters automatically. But why doesn't this Best Practice workflow do such filtering at the very beginning, so that the following steps do not have to deal with those bad reads any more?

    As I understand it, those read filters in GATK just mark the bad reads, but do not remove them. Does this mean the next step has to process those marks again?

  • CarneiroCarneiro Posts: 271Administrator, GSA Member admin

    Throwing away data is a controversial decision.

    The GATK smartly teases out the reads that don't pass read filters and the walker never sees them. So marking instead of removing is a safe option. If you're keen on removing data from your datasets, you're free to do so. But we are not recommending this in the best practices pipeline.

  • blueskypyblueskypy Posts: 183Member
    edited May 2013

    Multi-sample realignment with known sites and recalibration isn't really recommended any longer. It's extremely computational expensive and isn't necessary for advanced callers with advanced filters like the Unified Genotyper / HaplotypeCaller and VQSR. It's better to use one of the protocols above and then an advanced caller that is robust to indel artifacts.

    Does it mean that the "Best: per-sample realignment with known indels then recalibration" isn't recommended?

    Also this is a bit in disagreement with Geraldine's comment:

    Our setup is a bit complex because we have samples spread over multiple lanes, with multiple samples per lane. So when we get the FastQs, we separate out the read data by read group into individual files, so that after alignment we have one bam file per read group. We run dedup-realign-recal on each bam file, then merge the bams of read groups that belong to the same sample to produce one bam file per sample. Then we do another round of realign-recal on the sample bams as a form of cross-lane cleanup.

    Should it be "Then we do another round of dedup-realign on the sample bams", i.e., the 'Better' protocol? If Geraldine was referring to the 'Best' protocol, should it be "Then we do another round of dedup-realign-recal on the sample bams"?

    Post edited by blueskypy on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Ah, this is all a bit messy I'm afraid. We're in the process of rewriting this document more clearly. In brief, the comment of mine which you quote is accurate; that's what we do here at the Broad. It doesn't correspond directly to either of the "Best" or "Better" protocols as described here. The closest is the "Better" one. You are correct that the "Better" one is no longer recommended -- not because it is bad, but because it is unnecessary. I hope this clears things up a little -- my apologies for the confusion.

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 30Member

    Hello there. I now have 10 whole-genome sequencing samples (with a coverage of 9X for each sample) from 10 chicken breeds (one sample per breed); should I conduct VQSR? I noticed two sentences in this article: "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples" and "For shallow-coverage (<10x): you cannot use filtering to reliably separate true positives from false positives. You must use the protocol involving variant quality score recalibration." Do these mean I should use hard filters rather than running VQSR? The second question: in VQSR, the input file has to be the VCF produced by multi-sample variant calling, right? Many thanks!

  • JackJack Posts: 30Member

    Was my question too silly....? I was just not sure about the VQSR step after reading the Best Practice....

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    No question is too silly :) But questions asked on the weekend often have to wait for Monday, and today was a busy one.

    So, to answer your question... You can always try VQSR; if your dataset is too small the program will give you a warning. But note that you'll need to have truth/training resources available. If your dataset is too small or you don't have the necessary resources, then you switch to hard-filtering. Since your coverage is approx. 9x, it's borderline for filtering -- not ideally suitable, but you may not have the choice. In that case, start with the recommended parameters, but be sure to experiment with tweaking them to try to figure out what works best for your data.

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 30Member

    Thanks for your helpful advice, Geraldine!

  • JeremyLeipzigJeremyLeipzig Posts: 1Member

    Are there some typical benchmarks available? I heard GATK2 is now spending more time in "phase 2" than "phase 1".

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi @JeremyLeipzig,

    Do you mean runtime/speed performance benchmarks, or accuracy? And do you mean GATK2 takes more time to run phase 2 vs 1, or are you referring to development effort?

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 183Member
    edited June 2013

    hi, I have one sample sequenced in 4 lanes and my pipeline flow is dedup-realign-recal for each lane bam file and then dedup-realign on the merged bam file. I have two questions:

    1. Is the reason for not merging the lane bam files at the very beginning the recal step? Are there any other reasons?

    2. Can I do realign-recal for all the lane bam files in a single command using options like "-I lane1.bam -I lane4.bam" and "-o lane1.recal.bam -o lane4.recal.bam"? If so, I guess the input and output orders have to be identical, right?

    Thanks a lot!

    Post edited by blueskypy on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin
    1. If you have a lot of coverage, realignment works better if you first realign per lane. If you realign directly on the merged file, that multiplies the depth, and in complex regions with deep coverage the realigner will tend to quit and skip to the next region. This is on purpose, to avoid having the realigner spend a gazillion hours pedaling through popcorn. So if you first realign by lane, you're less likely to end up with un-realigned regions.

    2. Not exactly. If you pass in all the files in one command, the data will be processed together (not per-lane). But there is a way to get the IndelRealigner to output separate bams corresponding to the separate inputs. See the nWayOut argument of the realigner: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_IndelRealigner.html#--nWayOut
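    For illustration (file names are placeholders), a command that realigns several lanes together but writes one output bam per input would look roughly like this; the suffix given to -nWayOut replaces the .bam extension of each input to form its output name:

    java -jar GenomeAnalysisTK.jar \
        -T IndelRealigner \
        -R reference.fasta \
        -I lane1.bam \
        -I lane2.bam \
        -targetIntervals all_lanes.intervals \
        -nWayOut .realigned.bam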

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 183Member

    Thank you soooo much Geraldine! Your reply is very educational!

    So in order to process each lane bam file separately, I have to do something like the following?

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -I lane1.sorted.dupMarked.bam -o lane1.bad.intervals ...
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -I lane2.sorted.dupMarked.bam -o lane2.bad.intervals ...
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Yep, that's right. Folks who have a lot of files to process typically set up a shell script to cycle through them without having to manually write and enter each command line.
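    For example, a minimal loop (with placeholder file names and the remaining required arguments elided) might look like:

    for lane in lane1 lane2 lane3 lane4; do
        java -jar GenomeAnalysisTK.jar \
            -T RealignerTargetCreator \
            -R reference.fasta \
            -I ${lane}.sorted.dupMarked.bam \
            -o ${lane}.intervals
    done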

    Sorry if you got a "Geraldine disagreed" notification, by the way. My mouse slipped on a banana peel and clicked the wrong button. And if you didn't receive any notification, just ignore this little postscript...

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 183Member

    @Geraldine_VdAuwera said: Ah, this is all a bit messy I'm afraid. We're in the process of rewriting this document more clearly. In brief, the comment of mine which you quote is accurate; that's what we do here at the Broad. It doesn't correspond directly to either of the "Best" or "Better" protocols as described here. The closest is the "Better" one. You are correct that the "Better" one is no longer recommended -- not because it is bad, but because it is unnecessary. I hope this clears things up a little -- my apologies for the confusion.

    I still don't fully understand. Did you mean that, for samples sequenced in multiple lanes, Broad does the following? dedup, realign, recal on individual lane files, then dedup, realign, recal on merged file

    What's the point of doing recal on the merged file, since recal processes different read groups separately anyway?

    thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Oh darn, that was my mistake. I wrote "realign-recal" instead of "dedup-realign". Your comment on May 31 was correct, I read it wrong. Well spotted, my apologies for the confusion. Give me one more week and I will have a shiny new Best Practices document ready that will iron out all of these issues.
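    Spelled out as a rough sketch (tool names per GATK 2.x and Picard of that vintage; file names, jar paths and the known-sites resource are placeholders), the per-lane then per-sample flow would be:

    # Per lane: dedup -> realign -> recal
    java -jar MarkDuplicates.jar INPUT=lane1.bam OUTPUT=lane1.dedup.bam METRICS_FILE=lane1.metrics
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I lane1.dedup.bam -o lane1.intervals
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I lane1.dedup.bam -targetIntervals lane1.intervals -o lane1.realn.bam
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I lane1.realn.bam -knownSites known_sites.vcf -o lane1.recal.grp
    java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I lane1.realn.bam -BQSR lane1.recal.grp -o lane1.recal.bam
    # ... repeat for the other lanes of the sample ...

    # Per sample: merge the lane bams, then dedup -> realign (no second recal)
    java -jar MergeSamFiles.jar INPUT=lane1.recal.bam INPUT=lane2.recal.bam OUTPUT=sample.merged.bam
    java -jar MarkDuplicates.jar INPUT=sample.merged.bam OUTPUT=sample.dedup.bam METRICS_FILE=sample.metrics
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I sample.dedup.bam -o sample.intervals
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I sample.dedup.bam -targetIntervals sample.intervals -o sample.realigned.bam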

    Geraldine Van der Auwera, PhD

  • ecyehecyeh Posts: 8Member

    Is realignment before HaplotypeCaller still necessary or still recommended? I have found info here and there but an explicit answer will be more helpful. Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Yes, realignment is still recommended before calling with HC.

    Geraldine Van der Auwera, PhD

  • smk_84smk_84 Posts: 55Member

    Hi, this may seem like a stupid question after using GATK for this long, but I'm not sure whether only uniquely mapping reads should be used. The GATK Best Practices document does not mention uniquely mapping reads, but I know folks who recommend using only uniquely mapping reads for SNP/genotype calling.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    No worries, there are no stupid questions :)

    There is no absolute requirement for uniquely mapping reads, but the GATK callers use only primary alignments and discard any secondary alignments.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    To clarify my previous statement: although there is no direct requirement of unique mapping, the GATK callers (and most non-QC tools) do require a non-zero mapping quality score, and we expect that aligners only assign MAPQ>0 for uniquely mapping reads.
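    (Not something GATK requires, but if you do want to pre-filter a bam down to MAPQ > 0 reads for use with other tools, a simple samtools sketch -- file names are placeholders -- would be:)

    samtools view -b -q 1 input.bam > unique.bam
    samtools index unique.bam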

    Geraldine Van der Auwera, PhD

  • BiocybermanBiocyberman Posts: 3Member

    Hello, I am trying to apply the GATK best practices to my colorspace data. These best practices look good in general, but I am still quite close to clueless. I did watch the workshop videos. My problem is that the Scala snippets in this document look simple enough, but they are out of (code) context. A start-to-end Scala file with comments would be much more helpful.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin
    edited August 2013

    Greetings, shiny metal person.

    There are two points here :

    1) Understanding what processing /analysis steps need to be applied to the data

    2) Figuring out how to chain the steps into a pipeline

    Number 1 is the most important. Once you get that, you can already work through the entire Best Practices workflow step by step manually. To complement the current BP document (which is admittedly a little confusing) and videos you can have a look at the individual tutorials for each step (which live in the Tutorials section) as well as the corresponding method articles and technical docs. I know this sounds like a lot of disparate resources, but our cybermats are hard at work consolidating them into a more coherent whole. Stay tuned, it's only a matter of days now.

    Number 2 depends on how you plan to pipeline your analyses. Are you going to use Queue, or something else? If Queue, you're in luck, because the cybermats are also working on developing better documentation on how to do that. But that will take a couple more weeks before it is ready for public consumption.

    It sounds like you are more worried about #2, correct? We do have some (very imperfect) docs about that in the Developer Zone section of the guide, which may help tide you over until we get the new and improved versions up.

    But basically there is a trade-off between the amount of time you can wait for great docs vs. the amount of effort you're willing to put into working with the "meh" docs...

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • sannesanne NorwayPosts: 9Member

    Hi! Thanks for your excellent website, super helpful for those like me who are new to whole-genome analyses. I'm looking for some advice regarding realignment and recalibration. I have low-coverage (average 6X) resequencing data for 40 individuals (from 8 populations). I have a good reference genome, but no list of known indels. Samples were not sequenced over multiple lanes, so I have all the data nicely together already. My question is: is realignment of indels that are noted in the original BAM file possible with such low coverage? And if not, would it be valuable (and possible) to do it per population rather than per individual? What about base recalibration (in case I do realignment)? Or should I forget about all this and just use robust SNP callers and apply hard-filtering of SNPs? (Note I don't have known SNPs either.)

    Many thanks for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi @sanne, glad you find the site helpful.

    Low coverage in this case does indeed present a good opportunity to try realigning per population. Typically we don't recommend this because it is computationally intensive, but with such low coverage it shouldn't be a problem. It will give you more consistent alignments, which is always nice. Known indels are not required.

    Base recalibration does require a set of high-confidence SNPs; if you don't have any available for your organism, you can bootstrap a set. This is described elsewhere on the site (try the BQSR method article). Because you have such low coverage, I would strongly recommend you try to do this, as it is likely to improve the quality of your results quite a bit.
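    In rough outline (a sketch only; required arguments are abbreviated and file names are placeholders), the bootstrapping loop looks like this:

    # 1. Call variants on the unrecalibrated (but realigned) data
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref.fasta -I sample.realn.bam -o raw.vcf

    # 2. Hard-filter down to the highest-confidence calls
    #    (ideally also keep only PASSing sites afterwards, e.g. with SelectVariants -ef)
    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta --variant raw.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" --filterName "lowconf" -o highconf.vcf

    # 3. Use those sites as the 'known' sites for base recalibration
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I sample.realn.bam \
        -knownSites highconf.vcf -o recal.grp
    java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I sample.realn.bam -BQSR recal.grp -o sample.recal.bam

    # 4. Repeat steps 1-3 on the recalibrated bam until the callset converges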

    Good luck!

    Geraldine Van der Auwera, PhD

  • sannesanne NorwayPosts: 9Member

    Thanks for the super quick reply Geraldine, very helpful. I have a follow-up question (of course!). I assume that in order to realign per population, I have to run RealignerTargetCreator and IndelRealigner on a merged bam file containing the 5 individuals of that population. From the read groups I can then later split the output back into the separate individuals if needed. However, I'm still struggling to get my head around these read groups. Am I correct in thinking that RGID should refer to sample names/numbers and RGSM should refer to the population? That way each individual has a unique ID but they are united into populations by RGSM. Many thanks again!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    @sanne, sorry for the less-quick response this time, we were on holiday.

    To clarify one thing first: to run any GATK tool on multiple samples, you don't actually need to merge the bam files; just pass the bam files in together in your command line, either with -I sample1 -I sample2 etc or as a list of samples like -I samples.list. By default the output will be merged, but for the realignment step you can keep the outputs separate by using the -nWayOut argument.

    Regarding read groups: RGID is the actual read group identifier (which refers to the sequencing unit, and should be indicated in your sequencing metadata), while RGSM is the individual/sample identifier. It's important to understand the distinction and how they correspond to the biological material because sometimes several samples have the same read group (if they are pooled) and sometimes a sample is spread across several read groups. Or it's both, which is even more confusing, I know. There is no identifier for the population as such; whatever you feed in together will be considered to constitute a population.

    Does this make sense?

    Geraldine Van der Auwera, PhD

  • sannesanne NorwayPosts: 9Member

    Hi Geraldine, hope you had a nice holiday! Thanks for the clarification on the merging, that sounds straightforward and I should be able to go on from here. As for the RGs, I have to admit they remain confusing! I can understand that when, for example, individual samples are sequenced over multiple lanes/runs, then RGID is informative, but in my case I have a single library for each sample, and 10 samples/libraries were combined for sequencing on a single lane. The libraries were indexed, so they were not sequenced as a pooled dataset, and I now have bam files for each individual sample. So unless RGID should in my case refer to the lane (and RGSM to the individual samples sequenced on that lane), I think for my dataset RGID and RGSM are equivalent and can have the same name/number... would that be correct?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    @sanne, I had a great time, thanks! No internet (no forum!) for 4 days... lovely :)

    Actually you do want the RGID to correspond to the lane of data, mainly because when you do BaseRecalibration, the tool will operate separately on each read group, and since the lane is sort of the basic unit of sequencing (any systematic machine errors will be common per lane) you want recalibration to be done per lane. So in your case since all the samples were in one lane, you would give the same RGID to all samples (but separate RGSMs of course). Sorry if that wasn't clear in my earlier comment, that's what I meant to say.
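    For illustration (the Picard jar name and all tag values here are placeholders), assigning read groups so that RGID reflects the lane and RGSM the individual might look like:

    java -jar AddOrReplaceReadGroups.jar \
        INPUT=sampleA.bam OUTPUT=sampleA.rg.bam \
        RGID=flowcell1.lane1 RGLB=libA RGPL=illumina \
        RGPU=flowcell1.lane1.indexA RGSM=sampleA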

    Geraldine Van der Auwera, PhD

  • sannesanne NorwayPosts: 9Member

    Brilliant, the RG is totally clear now! And also great to learn that base recalibration should be done per lane, I hadn't quite realised that yet. I hope you don't mind that I'm going to ask yet another follow-up question... This time about the bootstrapping for the base recalibration, which seems the best way forward for my dataset. What I'm wondering is which samples to use to get a high-confidence SNP set that can be used for the base recalibration. Can I use the same set for multiple lanes, or should I create a new one each time? And if the latter, should I use all samples per lane to create the set? The composition of the lanes varies (i.e., samples from different populations were spread over multiple lanes to reduce lane effects per population). I apologise for all the questions, but you've been truly helpful and I've learned so much thanks to your website and this forum! :)
