It looks like you're new here. If you want to get involved, click one of these buttons!
Mark_DePristo
Posts: 132Administrator, GSA Official Member admin
Our current best practice for making SNP and indel calls is divided into four sequential steps: initial mapping, refinement of the initial reads, multi-sample indel and SNP calling, and finally variant quality score recalibration. These steps are the same for targeted resequencing, whole exomes, deep whole genomes, and low-pass whole genomes.

Example commands for each tool are available on the individual tool's wiki entry. There is also a list of which resource files to use with which tool.
Note that due to the specific attributes of a project the specific values used in each of the commands may need to be selected/modified by the analyst. Care should be taken by the analyst running our tools to understand what each parameter does and to evaluate which value best fits the data and project design.
There are four major organizational units for next-generation DNA sequencing processes that used throughout this documentation:
This document describes how to call variation within a single analysis cohort, comprised for one or many samples, each of one or many libraries that were sequenced on at least one lane of an NGS machine.
Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost by running these commands across all of your data simultaneously.
In order to help individuals get up to speed, evaluate their command lines, and generally become familiar with the GATK tools we recommend you download the raw and realigned, recalibrated NA12878 test data from the GATK resource bundle. It should be possible to apply all of the approaches outlined below to get excellent results for realignment, recalibration, SNP calling, indel calling, filtering, and variant quality score recalibration using this data.
In our GATK 2.0 slide archive.
The GATK data processing pipeline assumes that one of the many NGS read aligners (see this review) has been applied to your raw FASTQ files. For Illumina data we recommend BWA because it is accurate, fast, well-supported, open-source, and emits SAM files natively (which are then easy to convert to BAM).
The three key processes used here are:
There are several options here from the easy and fast basic protocol to the more comprehensive but computationally expensive pipeline. For example, there are two types of realignment which constitute a vastly different amount of processing power required:
This protocol uses lane-level local realignment around known indels (very fast, as there's no sample level processing) to clean up lane-level alignments. This results in better quality scores, as they are less biased for indel alignment artefacts.
for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
recal.bam <- recal(realigned.bam)
Here we are essentially just merging the recalibrated lane.bams for a sample, dedupping the reads, and calling it quite. It doesn't perform indel realignment across lanes, so it leaves in some indels artifacts. For humans, which now have an extensive list of indels (get them from the GATK bundle!) the lane-level realignment around known indels is going to make up for the lack of cross-lane realignment. This protocol is appropriate if you are going to use callers like the HaplotypeCaller, UnifiedGenotyper with BAQ, or samtools with BAQ that are less sensitive to the initial alignment of reads, or if your project has limited coverage per sample (< 8x) where per-sample indel realignment isn't more empowered than per-lane realignment. For other situations or for organisms with limited database of segregating indels, it's better to use the advanced protocol if you have deep enough data per sample.
for each sample
recals.bam <- merged lane-level recal.bams for sample
dedup.bam <- MarkDuplicates(recals.bam)
sample.bam <- dedup.bam
As with the basic protocol, this protocol assumes the per-lane processing has been already completed. This protocol is essentially the basic protocol but with per-sample indel realignment.
for each sample
recals.bam <- merged lane-level recal.bams for sample
dedup.bam <- MarkDuplicates(recals.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
sample.bam <- realigned.bam
This is the protocol we use at the Broad in our fully automated pipeline because it gives an optimal balance of performance, accuracy and convenience.
Rather than doing the lane level cleaning and recalibration, this process aggregates all of the reads for each sample and then does a full dedupping, realign, and recalibration, yielding the best single-sample results. The big change here is sample-level cleaning followed by recalibration, giving you the most accurate quality scores possible for a single sample.
for each sample
lanes.bam <- merged lane.bams for sample
dedup.bam <- MarkDuplicates(lanes.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
recal.bam <- recal(realigned.bam)
sample.bam <- recal.bam
This protocol can be hard to implement in practice unless you can afford to wait until all of the data is available to do data processing for your samples.
ReduceReads is a novel (perhaps even breakthrough?) GATK 2.0 data compression algorithm. The purpose of ReducedReads is to take a BAM file with NGS data and reduce it down to just the information necessary to make accurate SNP and indel calls, as well as genotype reference sites (hard to achieve) using GATK tools like UnifiedGenotyper or HaplotypeCaller. ReduceReads accepts as an input a BAM file and produces a valid BAM file (it works in IGV!) but with a few extra tags that the GATK can use to make accurate calls.
You can find more information about reduced reads in some of our presentations in the archive.
ReduceReads works well for exomes or high-coverage (at least 20x average coverage) whole genome BAM files. In this case we highly recommend using ReduceReads to minimize the file sizes. Note that ReduceReads performs a lossy compression of the sequencing data that works well with the downstream GATK tools, but may not be supported by external tools. Also, we recommend that you archive your original BAM file, or at least a copy of your original FASTQs, as ReduceReads is highly lossy and doesn't quality as an archive data compression format.
Using ReduceReads on your BAM files will cut down the sizes to approximately 1/100 of their original sizes, allowing the GATK to process tens of thousands of samples simultaneously without excessive IO and processing burdens. Even for single samples ReduceReads cuts the memory requirements, IO burden, and CPU costs of downstream tools significantly (10x or more) and so we recommend you preprocess analysis-ready BAM files with ReducedReads.
for each sample
sample.reduced.bam <- ReduceReads(sample.bam)
After the raw data processing step, the GATK variant detection process assumes that you have aligned, duplicate marked, and recalibrated BAM files for all of the samples in your cohort. Because the GATK can dynamically merge BAM files, it isn't critical to have merged files by lane into sample bams, or even samples bams into cohort bams. In general we try to create sample level bams for deep data sets (deep WG or exomes) and merged cohort files by chromosome for WG low-pass.
For this part of the this document, I'm going to assume that you have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort. Note that some of the data processing steps, such as multiple sample local realignment, will merge BAMS for many samples into a single BAM. If you've gone down this route, you just need to modify the GATK commands as necessary to take not multiple BAMs, one for each sample, but a single BAM for all samples.
The next step in the standard GATK data processing pipeline, whole genome or targeted, deep or shallow, is to apply the Haplotype Caller or Unified Genotyper to identify sites among the cohort samples that are statistically non-reference. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort. It's in this stage that we use the meta-data in the BAM files extensively -- read groups for reads, with samples, platforms, etc -- to enable us to do the multi-sample merging and genotyping correctly. It was a pain for data processing, yes, but now life is easy for downstream calling and analysis.
A common question is the confidence score threshold to use for variant detection. We recommend:
raw.vcf <- HaplotypeCaller(sample1.bam, sample2.bam, ..., sampleN.bam)
raw.vcf <- UnifiedGenotyper(sample1.bam, sample2.bam, ..., sampleN.bam)
-nt option. However you can use Queue to parallelize execution of HaplotypeCaller.This raw VCF file should be as sensitive to variation as you'll get without imputation. At this stage, you can assess things like sensitivity to known variant sites or genotype chip concordance. The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference. All of the subsequent steps are designed to separate out the false positive machine artifacts from the true positive genetic variants.
The process used here is the Variant quality score recalibrator which builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant in the callset is a true genetic variant or a machine/alignment artifact. All filtering criteria are learned from the data itself.
Take a look at our FAQ page for specific recommendations on which training sets and command-line arguments to use with various project designs. It is expected that analysis will be required by the user when determining the best way to use this tool for different projects. These recommendations are only meant as jumping off points!
We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples. Also, for experiments that employ targeted resequencing of a small region (for example, a few hundred genes), VQSR may not be empowered regardless of the number of samples in the experiment. For users with experiments containing fewer exome samples or with a small target region there are several options to explore (listed in priority order of what we think will give the best results):
--maxGaussians 4 --percentBad 0.05 to your command line. Note that this is very dependent on your dataset, and you may need to try some very different settings. It may even not work at all. Unfortunately we cannot give you any specific advice, so please do not post questions on the forum asking for help finding the right parameters. These recommended arguments for VariantFiltration are only to used when ALL other options are not available:
You will need to compose filter expressions (see here, here and here for details) to filter on the following annotations and values:
For SNPs:
QD < 2.0MQ < 40.0FS > 60.0HaplotypeScore > 13.0 MQRankSum < -12.5ReadPosRankSum < -8.0For indels:
QD < 2.0ReadPosRankSum < -20.0InbreedingCoeff < -0.8FS > 200.0Note that the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
For shallow-coverage (<10x): you cannot use filtering to reliably separate true positives from false positives. You must use the protocol involving variant quality score recalibration.
The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
That said, all of the caveats about determining the right parameters, etc, are annoying and are largely eliminated by variant quality score recalibration.
-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard
Comments
Is there a link to Version 3 somewhere? I need to reference this since that is what we currently use
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Never mind...I found it: http://gatkforums.broadinstitute.org/discussion/15/retired-best-practice-variant-detection-with-the-gatk-v3
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Is the ApplyRecalibration the final step then? I've only run it the once but so far I seem to have the same number of lines though the scores and number of lines with "PASS" have all changed. I'm assuming we should take it from there?
Love the new site, looks great =)
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Tristan,
ApplyRecalibration is the final step of recalibration. What you're seeing is normal -- it is a very simple tool; given a minimum VQSLOD (say, 99.5%) specified by the user, ApplyRecalibration goes through the whole file, marking variants as 'pass' or 'fail' based on whether their VQSLODs are above or below that threshold. No lines are added or taken out of your VCF.
There are additional steps that can be done with GATK tools to examine the quality of your callset, before moving on to the actual analyses you want to perform on it. See for example the documentation of the VariantEval tool. Based on the results, you may want to go back to the recalibration step and apply a different cutoff to your callset (i.e. repeat ApplyRecalibration with a different VQSLOD value). This is not included in the Best Practices because it depends more on your experiment, dataset and results, whereas the Best Practices constitute a set of core principles that should apply to all experiments.
Glad you like the new site!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
1 • Off Topic Disagree Agree 1Like WTF •The Queue DataProcessingPipeline seems like a great way to accomplish Phase I pretty much out of the box. Is there anything comparable for Phase III?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Not as such but we encourage you to look at the example scripts on the repository and use them to develop your own.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •hi,
Can i know if we still need to filter the StrandBias (SB) > >= -1.0 for indel? thanks
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Jayce,
All the key recommendations are in this Best Practices document; if it's not in here, it's not essential. Which doesn't mean it can't be used -- that's between you and your dataset...
Good luck!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hello,
I am preparing to learn to use GATK for the first time. I am new to this type of analysis and am interested in running through some exercises prior to jumping into my own data. In this article you mention applying this approach to the NA12878 data. You mention using raw and realigned, recalibrated. I have found NA12878 data under bundle/1.5/b37, but I'm not sure whether this is the correct location? There are also many different files, but all of them are called cleaned and recalled. Also, some of them are labled as HG19 even though they are in the b37 folder. I ideally would like to reference to the HG19 build.
Which files (and their locations on the server please) would you consider the raw, realigned, and recalibrated data?
Also, is there a section of the website where you work through a specific variant calling example and provide output files that we can compare our results to? It would be extremely helpful for feeling that we are on the correct track. I could have missed it if there is already something like this.
Thank you very much for your help! ~Sean
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Sean,
Unfortunately we don't really have any start-to-finish tutorials yet; it's near the top of my to-do list, but I can't promise you an ETA.
If you want to practice the workflow, I would recommend starting with NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam, which is chromosome #20 from NA12878. You can use Picard's RevertSam tool to revert the file to raw reads, then work through BWA alignment (to the reference of your choosing) and all subsequent steps as indicated in this article. Then you can use the callsets we provide in the bundle to evaluate whether things worked properly.
Good luck!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for the suggestion! I also came across this website while searching the web. It seems to be a good resource for exome analysis using GATK that includes command line examples. I image you've come across it, but in case you haven't... http://seqanswers.com/wiki/How-to/exome_analysis
- Spam
- Abuse
- Troll
1 • Off Topic Disagree Agree 1Like WTF •Thanks for the link, that's an interesting walkthrough. Just keep in mind that command lines may change, as we occasionally update argument names and things like that, so be sure to check that everything is current before running.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi:
I saw a few places including the URL referred by sboyle as above suggesting to use picard FixMateInformation.jar utility for paired end data, is it really necessary? I have run through GATK without that for such data, it seems fine (no error and no complains from GATK). Just want to ask your opinion from GATK group point of view.
Also for the best practice V4, for the step: 2. Raw reads to analysis-ready reads at phase I, I only have lane-level sample data (each sample in one lane, no sample in multi-lanes), so I only did the Fast: lane-level realignment (at known sites only) and lane-level recalibration, which is
since I do not have multi-lane samples, I do not have to run Fast + per-sample processing, Better: per-sample realignment with known indels and recalibration, or Best: multi-sample realignment with known sites and recalibration, right? It seems that all of these schemes, Better, Best etc, are only for situation when some samples run at multiple lanes, is my understanding correct?
at the paragraph for Fast + per-sample processing, it mentioned: cross-lane realignment, not sure what it means exactly. if every sample of mine are in a single lane, it does not need, right?
Thanks
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Answered in this thread:
http://gatkforums.broadinstitute.org/discussion/1836/best-practice-v4#latest
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hello, thank you for this guide.
I have some doubts about the phase I, I cannot understand how the realignment and the recalibration should be performed. Should I use a alignment software such as BWA?
Thank you very much
cardillox
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •To clarify: the initial mapping step (as shown here) is done by an aligner such as BWA. Realignment and recalibration are done by the GATK tools (RealignerTargetCreator+IndelRealigner then BaseRecalibrator+PrintReads) as explained in this and other documents.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Dear Geraldine, Thank you for your answer. I did not read all the documents yet, but in this document i cannot find any reference to RealignerTargetCreator and other tools that you mentioned.
I apologize for the inconvenience
Have a nice day
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •That's okay, we'll try to clarify this documentation in the near future.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
1 • Off Topic Disagree Agree 1Like WTF •Hi,
what does the term 'multisample' in the section 'Best: multi-sample realignment with known sites and recalibration' refer to? Looking at the workflow code and the text, it seems to do all the processing for each sample individually.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Julian, that was actually an inappropriate title (leftover from a previous version), I've rewritten it to make more sense. Thanks for pointing this out!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I've downloaded and installed the GATK bundle, run the example and got all the output as required. I now want to try this on my real data. I have 6 trios, each composed of two affected kids and one normal parent. I want to filter out homozygous or compound heterozygous mutations as my final output. I read that GATK works better if I analyze the cohort together, should I go this route, given my complex sample? Also I was reading about Queue, should I install that too before I start? Finally is it just a matter of following the same steps as in the example (with the caveat that I have to do this for multiple samples, which I am not sure of how to?) on the real data?
Sorry about all the questions, being new to bioinformatics this is a little overwhelming
Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Farah,
You can analyze all the samples in your cohort together, yes. I would recommend that you get comfortable with running commands individually before you try to use Queue. Since you have a reasonably small amount of samples you don't really need Queue at this point anyway. I also recommend you use the figure in this article as a roadmap. For each step, look up the tool, understand what is its purpose, and run the tool on your samples using the basic parameters that are detailed in the documentation. You may also find it useful to look at the presentations from our recent user education workshop before you start, particularly the videos (which just came online today, whoo).
See http://www.broadinstitute.org/gatk/guide/events and http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk
If you run into problems, check out the documentation and be sure to search the forum for answers. If you still can't find a solution, post a question and we'll try to help you out.
Good luck!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •whoo hoo, you have no idea how happy I am clicking on the links you sent! Thanks so much! I'll do this step-wise as you suggest and hopefully won't bug you too much along the way. I just learned that our institute (which uses it's own cancer specific pipeline, while my samples are intellectual disablity so I think GATK will be more suitable) has it's own formats for the .bam files generated by BWA, but it's not GATK compatible. So I'm going to start from the fastq files. I had to regenerate the fastq files from the raw bam files though as the institute doesn't store the original fastq. I did this using picard, I hope it will still be compatible.
Thanks again for the help!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I'm glad you're finding the workshop materials helpful :)
Sounds like you're on the right track with your files. Picard tools are great, we use them for a lot of things too. They are developed by another group here at Broad. All standard formats (bam, vcf etc) are compatible between their tools and ours.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine, I've followed the complete workshop and it was great. Very clear, informative and I feel quite ready to start now. The one thing I still need answered is a very basic question; how do I run the walkers on multiple files at the same time? Since it is better to run the analyses as a cohort rather than as individual files. The workshop gave command line for single file analysis only. So for example if I need to run a walker on the cohort, do I then simply call more input files than one? for example - java -jar GenomeAnalysisTK.jar - T WalkerName \ - R reference.fasta -I Trio1child1.bam -I Trio1child2.bam -I Trio1mother.bam -o output1 -o output2 -o output3
and so on. I have 18 samples, i.e, 6 trios where a trio is a composed of 2 affected kids and the normal parent (often mother).
Or am I completely on the wrong track here?
Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •You've got it right! Although you can also pass the filenames in a single text file, one filename (with full path) per line, like so: -I my_bamfiles.list
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Great, thanks very much Geraldine. Just got to see this on getting back from another job. I'll set it up and start right away. Excited!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I have just started working with GATK and was pleased to discover this page. Can I just confirm in the above recommendations for very small data sets, a QD < 2?, surely this should be >? The example for SelectVariants is a QD > 10. (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html).
Update - I think I see my error, the above is for flagging, so > 2 is flagged, as opposed to > 10 for removal, thanks.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
It sounds like you are getting confused between VariantFiltration, where you say what you want to remove "QD < 2", and SelectVariants, where you say what you want to keep "QD > 2". I'll clarify the docs on this point.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi,
I saw on this page, the basic workflow diagram showed that at phase 2, when call variants, GATK can create SNP, indels and structure variations (SVs). which one can be used to get structure variations? Is it Unified genotyper or HyploTypeCaller or some other new modules in the GATK V2.0?
Also for indels derived from unified genotyper, any size limit for the indels derived (max 50bp, 100bp or no limit)?
Thanks
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •UnifiedGenotyper and HaplotypeCaller both call SNPs and Indels. For SVs, you'll need to us a different tool: GenomeStrip is a program based on the GATK (but not part of GATK) that allows identification of SVs. See http://www.broadinstitute.org/gatk/guide/topic?name=third-party-tools
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for the info, Geraldine! However, I found this link below from forum: http://gatkforums.broadinstitute.org/discussion/1541/ask-your-questions-about-genomestrip-here#latest
It seems we can tun GenomeSTRiP from GATK? I did not see any jar file for this tool or we have to download from their site, then why you say it based on GATK?
Thanks again
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Mike, GenomeSTRiP uses the same underlying engine as the GATK, and it is made by our colleagues at the Broad Institute, so we are hosting their forum and documentation. However it is built and distributed as a separate program, so you need to download it from their site.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks a lot for the info, Geraldine! Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine:
Just realized that you did not address one of my questions mentioned above: Also for indels derived from unified genotyper (or hyplotypecaller), any size limit for the indels derived (max 50bp, 100bp or no limit)?
Thanks in advance!
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The tools themselves do not set a hard limit; however in practice there are some limits to what they are able to detect. This is largely dependent on the length of reads in your dataset, since for deletions for example, if you do not have full overlap, you cannot with certitude distinguish deletion from lack of coverage.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, minor point shouldn't the statement above regarding BWA: "emits BAM files natively" be adjusted as (AFAIK) BWA will generate a SAM file but other tools are need to convert a SAM file into a BAM file, might be helpful for other newbies like myself, cheers.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi,
I want to filter snp with using hard filters method. So I would like to know if the length of reads is important for the definition of parameters (in particular with the ReadPosRankSum Test). I have reads of 75 bp I'm wondering if the given parameters in the "Best Practice Variant Detection" are adapted to my dataset.
Thanks in advance for your answer.
JP
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Not really -- maybe if you had extremely short or extremely long reads, but we do test a variety of read lengths, and for 75bp you should be fine with our best practices recommendations.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •@Lavinia, that was a fair point -- I edited the doc accordingly. Thanks for pointing it out.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi,
I found after I did multi-samples (multiple treated versus one untreated) realignment to detect the induced mutations, I lost many SNPs in the untreated sample compared to the results from single-sample realignment. I am wondering will the multi-sample realignment will remove the rare mutations in only a few samples?
Thanks a lot in advance for your answers.
Jiahao
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Jiahao,
When you do multi-sample calling, the genotyper uses the cohort information to evaluate confidence in calls that do not have strong support in individual samples. In general this has a positive effect of "rescuing" variants that occur in several samples even though they are not well supported in the individual samples. However this can occasionally lower the confidence in rare calls that are not present in other samples. This should only affect marginal calls, but if you are looking for very rare variants then perhaps it would be better to call samples individually.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •A somewhat naive question: I'm thinking about using VQSR on my small targeted region sample set, and I got a number of bams from the 1000 genomes to empower VQSR appropriately. If I understood correctly, I have to make one single multi-sample VCF file including both my samples and the additional bams, then run VQSR. The resulting VCF will however include also calls from the "extra" samples, which I am not interested in. How can I effectively "get rid" of the extra samples post VQSR ?
Thanks in advance.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Luca,
You can use SelectVariants to extract the variants from only the samples you're interested in, as in this example from the tech doc:
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks a lot, noted and added to my processing pipeline.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Can you please give me some advise as to what I would do re phase 1 if I have multiple samples (3 to 4) multiplexed per lane? You discuss the situation of having a sample split across multiple lanes but not multiple samples per lane. Would I recall each lane as a whole entity (I'm assuming this is better as you can take advantage of a whole lanes worth of data for the recalibration) then split up these lanes to the sample level (via read groups, set differently for each sample in the lane) for subsequent de duplication and realignment? Each lane was also run from a single pooled library.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @MattB,
If your samples are duly tagged with read group information, the GATK engine will correctly identify and process them at sample level where appropriate. You can just pass a single BAM file containing all your data, if that's what you have.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks, so with respect to the different phase 1 methods, fast, better or best, what order would I perform these steps? Would the following work:
0) Merge all of my .bam files from BWA in to oneBig.bam (with ID and SM set appropriately in the @RG line, these mirror each other as there is only one alignment per sample)
1) MarkDuplicates on the whole study (despite multiple libraries, will the LB: bit in the @RG line be needed to differentiate?)
2) Realign everything
3) Recal everything
I guess I can then split these by read group before variant detection if I wanted?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Just to come back to this. So after re reviewing the info you have here http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files my question is as regards base quality score recalibration, how do I get GATK to take into account an error model for the whole lane rather than each sample in each lane. Since from the looks of things it will only function per read group and I will have 3 to 4 of these per lane? Thus recalibration will happen in 3 or 4 separate chunks for a lane rather than the whole thing?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Well, the thing to keep in mind is that if you merge all your BAMs together into a big one, the processing of that big BAM is going to be very computationally demanding. Also, note that the recalibrator will process read groups individually, so you will not get the "whole lane data" advantage that you think from recalibrating multiple samples together.
Actually, let me take a step back and give you a quick run-down of what we do in-house.
Our setup is a bit complex because we have samples spread over multiple lanes, with multiple samples per lane. So when we get the FastQs, we separate out the read data by read group into individual files, so that after alignment we have one bam file per read group. We run dedup-realign-recal on each bam file, then merge the bams of read groups that belong to the same sample to produce one bam file per sample. Then we do another round of realign-recal on the sample bams as a form of cross-lane cleanup.
But if you don't have samples spread across different lanes, you don't need to do all of this. The simplest is probably to separate out the samples using their read group tags into individual bam files and process them separately through dedup-realign-recal. I realize this contradicts what I said earlier; technically that (earlier suggestion to just process all together) is still an option (with some advantages, e.g. all samples aligned the same way), but if you plan to split up your samples before variant calling anyway, you might as well split everything earlier on and save yourself the compute resources of processing everything together.
Does this make sense? We're planning to rewrite and simplify the best practices recommendations (because right now we realize they're pretty confusing), so feel free to give feedback on what is or isn't clear.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Oh, just saw your new post (took me a while to put a response together, sorry -- like I said we're rethinking our recommendations so it needed a bit of internal discussion).
For the BQSR, the unit of aggregated information is the read group. This is convenient in many aspects but the downside is that you can't get the recalibrator to recalibrate over the entire lane if it is composed of different read groups. As you say, it will process the lane info in chunks according to read group. You could of course cheat and change the read groups, but I wouldn't recommend it. If you only have 3 or 4 different read groups per lane, that should still be enough data for the recalibrator to do its work properly.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hey thanks very much for all your comments this clears up many thing I've been wondering for quite some time. I should now be able to get done what I want to do. I agree the intro to phase 1 is really quit confusing, so any rewrite would be useful.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Glad to help, Matt. We'll try to clarify that intro to phase 1 in the near future.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I have one sample spread over multiple lanes (aliquots from the same library). I try to follow "Best: per-sample realignment with known indels then recalibration" and want to make sure my understanding is correct in designating read groups : We should set the same "read group" in all lane.bams, is that correct?
Besides, in the in-house procedure what would be the difference between (dedup-realign-recal) and (only dedup) for each lane.bam? Since eventually in the cross-lane cleanup stage a (realign-recal) would be applied on the lanes.bam.
Thank you very much for your help :)
Han
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Han,
You should assign as separate read group to each of your lane.bams.
In our in-house procedure, the benefit of doing a first pass of realignment and recalibration is to have the process done at the smallest coherent unit of aggregate information, where it is most needed and pertinent. Then, doing an additional cross-lane cleanup helps make sure that there is homogeneity among the different parts of the dataset. But it's at the lane level that the cleanup really matters most. And just trying to do all the cleanup on all the data lumped together does not give as good results.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
1 • Off Topic Disagree Agree 1Like WTF •Hi, Thank you for your immediate response. In that sense it seems quite consistent (conceptually) between the in-house procedure and the best practice. Now I understand how to get it done properly. Thank you very much for your help :)
Han
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I've a few questions related to variant calling with GATK.
1) As we know that, a non-diploid variant calling algorithm provides great sensitivity to make calls in tumor samples as tumor samples are usually aneuploid. Does GATK assume the sample is diploid? How sensitive is GATK for calling variants in tumor samples - or as compared to other calling algorithms SNVmix or CaVEMan?
2) Hope you've come across the criteria 'The number of reads with unique start sites' while making variant calls. Setting a value (2 or 4) improve the call confidence. Does this feature (directly or indirectly) available in UnifiedGenotyper or HC?
3) Is running Genome STRiP on exome data a good practice? If no, what is your suggestion for exome SV discovery?
Thanks, Raj
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Raj,
1) The GATK currently includes two variant calling tools, the UnifiedGenotyper (UG) and HaplotypeCaller (HC). The UG assumes diploidy by default but it is possible to set your own desired ploidy. The HC currently does not have that capability. Other tools are largely ploidy-agnostic. Please note in any case that the GATK tools are not specifically designed for calling somatic mutations. For those purposes please see the software developed by the Broad's cancer group, such as MuTect and SomaticIndelDetector.
2) I'm not sure exactly what you mean, but I can tell you that the UG applies a number of quality control read filters by default to only consider informative reads and thus improve call confidence.
3) We don't do SV discovery, but you can discuss this with the authors of GenomeStrip in their own forum, which we host here: http://gatkforums.broadinstitute.org/categories/genomestrip
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Following this guidance of GATK, after aligment with BWA, I MarkDuplicates using picard. Just mark it, didn't physically remove the duplicates. Then I run gatk (RealignerTargetCreator, IndelRealigner, BaseRecalibrator, PrintReads, UnifiedGenotyper, VariantRecalibrator, ApplyRecalibration). Then I realize the coverage are set below 250, whereas the average coverage of my sample is 300.
What is the cutoff of "good reads" defined by UnifiedGenotyper by default, like for the depth(5~250?), Base Quality(>=30?), Mapping Quality(>=17?).
If I want to remove or modify the above restrictions, for downsampling depth, base quality and mapping quality, which step and how should I adjust them?
In picard, I didn't remove the duplicated reads. Do you recommend to remove the duplicated reads in picard step or do it in ReduceRead step in GATK. Which field in BAM file does GATK recognize as duplicates if I only MarkDuplicates using picard? What the difference will supposed to be if I physically remove the duplicates and only MarkDuplicates, especially when some sample the high coverage is achieved due to high PCR duplicate rate?
Thank you so much in advance for clarifying the above points.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @flyingflyers,
There is not really a hard cutoff for base and mapping qualities; rather each base's contribution to the genotype calculation will be weighed by the qualities. The program uses this to calculate a confidence score (which means, how confident the program is that the site is variant or not) and you can set thresholds based on that score to determine how confident the calls should be. See the UG documentation for details on the
-stand_emit_confand-stand_call_confarguments.You can adjust the downsampling level using the
-dcovargument; see the CommandLineGATK (engine) documentation for details on that.Finally, it doesn't matter if you physically remove duplicates or not as long as you mark them. The GATK will simply ignore any reads marked as duplicates.
I hope this helps clarify things for you.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •New to NGS. Working on SNP detection between two illumina 100bp x50 tomato whole genome resquencing data sets. There is a reference available and annotation. Using this information can I follow the protocols and not need anything extra, such as a file for known indels? I was going to use BWA-MEM and picard tools to convert from sam to bam then mark duplicates using picard tools, then follow the guide from here http://seqanswers.com/wiki/How-to/exome_analysis whilst referring to your guide to check using correct parameters as the example is human and exome. Does this sound logical? Thanks in advance.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The general workflow looks sound, just make sure you do read up on our best practices and check arguments, since they do evolve over time. I'd also recommend you have a look at the videos/presentations from our last user workshop, which folks who are new to the field seem to find quite useful: http://www.broadinstitute.org/gatk/guide/events?id=2038
One issue you're going to run into is the need for known snps (indels are not so important) for base recalibration, which is often an issue for non-human organisms (but maybe for tomato it is available, I'm not sure). And you may not have enough variants to run variant recalibration, in which case you'll have to use hard-filtering.
We've just finished drafting some fairly comprehensive tutorials to the best practices tools, which I'll try to post in the next few days, so keep an eye out for that.
Good luck!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks gives me a bit of confidence moving forward with it. There are some snp files here http://www-plb.ucdavis.edu/labs/maloof/tomatosnp/downloads.html What needs to be in the files, coordinates? thanks again
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •A vcf file with sites is needed -- see the FAQ on input files, I think it's explained there.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I'm currently proceeding but using samtools at the moment but would like to compare with GATK. I've found known SNPs on the ncbi to download in the form of fasta or bin file. Do you know of a tool that could convert it to vcf or to bed? The vcf tools just seem to analyse/confirm a vcf file rather than convert.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •We have a tool called VariantsToVCF that can convert certain formats to VCF format. Please see the tool documentation for more details.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Why isn't there the step to clean up the raw reads at the beginning?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •You mean indel realignment? It is there.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •By "clean up the raw reads", I mean, for example, removing the low quality or short reads. Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •for example, many GATK functions apply a number of read filters automatically. But why does not this Best Practice workflow do such filtering at the very beginning so that the following steps does not have to deal with those bad reads any more?
As I understand, those read filters in GATK just mark the bad reads, but do not remove them. Does it mean the next step has to process those marks again?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •throwing away data is a controversial decision.
The GATK smartly teases out the reads that don't pass read filters and the walker never sees them. So marking instead of removing is a safe option. If you're keen on removing data from your datasets, you're free to do so. But we are not recommending this in the best practices pipeline.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •thanks for the explanation!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •