The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.
Register now for the upcoming GATK Best Practices workshop, Feb 20-22 in Leuven, Belgium. Open to all comers! More info and signup at http://bit.ly/2i4mGxz

# (howto) Call variants with HaplotypeCaller

edited February 2016

#### Objective

Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

#### Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

• Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms "called" and "filtered" are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-L 20 \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants.vcf


Note that -L specifies that we only want to run the command on a subset of the data (here, chromosome 20). This is useful for testing as well as other purposes, as documented here. For example, when running on exome data, we use -L to specify a file containing the list of exome targets corresponding to the capture kit that was used to generate the exome libraries.

#### Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• earthMember Posts: 29

FYI, for users like myself who are devotees of Murphy and his laws, the first mention of genotyping_mode in this tutorial has a single dash and beware, it is not a real ascii dash. If you unwittingly copy and paste that into your command it won't work.

First you'll discover you need a second dash. But even after you add that you might be misled to think that DISCOVERY isn't a valid setting, as suggested by these error messages:

##### ERROR Invalid argument value 'DISCOVERY' at position 7.

What really confused me was if it didn't like genotyping_mode why was it complaining about DISCOVERY; and if it did like genotyping_mode, well, I couldn't see what was wrong with DISCOVERY. And why was it telling me it thought '–-genotyping_mode' was an invalid argument value? And what's the difference between an ERROR MESSAGE and a message telling me there's an ERROR?

These tutorials are very helpful, by the way. In spite of the pit- and prat-falls.

@BobHarris
Hi,

These tutorials are not meant to be copied and pasted! We want to give you an idea of the arguments you might want to use, but we do expect you to read the tool documentation and understand all of the possible arguments so you can decide which are most appropriate for your analysis. Of course, we understand users might want to copy and paste then blame us for any improper outputs, so we purposely put in hard to copy arguments.

As for the error messages, you are right they are slightly confusing, but in this case, the invalid argument value comes from the program thinking –-genotyping_mode and DISCOVERY are values for the previous argument. The ERROR MESSAGE in the beginning is stating there is indeed an error message and the second ERROR points out what the possible other errors are because there is more than one error here.

-Sheila

For the record, I think @Sheila is joking -- we certainly do not make it difficult to copy/paste commands on purpose. We've had a few errors creep in because the text above was in part copy-pasted from a paper manuscript (the 2014 Current Protocols paper). I've fixed the errors you pointed out. If you see any others let us know and we'll fix those too.

Regarding the error messages there is some misunderstanding due to the formatting. There is only one error message (prefaced by ERROR MESSAGE) but it's reporting several problems, so there's enough text to lead to multiple lines, which are all prefaced by the word ERROR by convention.

As to the nature of the errors, what you're seeing here is a domino effect, where one error in an argument name, value or formatting can lead to further errors in subsequent arguments as the parser fails to correctly recognize which is which. The fact that the parser complains that –-genotyping_mode was an invalid argument value suggests that something in front of it is wrong. Fun fact: in your post (and possibly your command?) the first dash is one of those fake non-ASCII dashes... If it helps, we run into that problem fairly frequently too and it's annoying as hell. On the bright side, once you've run into this problem you're a lot more likely to recognize it immediately (and not spend a lot of time tearing your hair out) the next time it happens.

Geraldine Van der Auwera, PhD

@BobHarris
Hi,

Yes, to clarify I was indeed joking about putting in hard to copy arguments to throw off users. We would never do anything like that, at least not on purpose!

-Sheila

• earthMember Posts: 29

Right, I did know Sheila was joking.

I did in fact think that "Invalid argument value '–-genotyping_mode' at position 6" meant that the parser thought the quoted text was to be a value for some preceding argument. But my command up to that point was "java -jar GenomeAnalysisTK.jar --analysis_type HaplotypeCaller --reference_sequence apple.fa --input_file orange.bam --genotyping_mode DISCOVERY". (I've corrected the hyphen in that). I wasn't able to identify what argument it might have thought "--genotyping_mode" was a value for, since the previous argument, "--input_file", had been supplied a value, "orange.bam". That error message could be improved if it also told me what argument the parser thinks this allegedly bad value belongs to.

The expectation that users won't cut and paste code or keywords posted in a tutorial is ill-considered. Being myself a lousy typist, I'm (usually) much less likely to screw up an unfamiliar command line if I copy and paste from an example, editing to change the parts specific to me. I did in fact read all of the tutorial text before running my command, and did look through the output of HaplotypeCaller --help to get some brief understanding of each argument before I actually tried running the command.

If, as Geraldine writes, non-ascii dashes are a fairly frequent problem, it would seem useful to fix the problem in the parser. It would save headaches for users and support staff alike.

OK, just wanted to make sure because humor can be easily lost in translation, and taken at face value that statement could be a bit upsetting

I agree with you that tutorial commands should be cut-and-pastable, @BobHarris. The modus operandi of these tutorials is that we tell you to run a specific command line to achieve a specific purpose. Obviously we're hoping people read the explanation of what the key arguments do, but we don't expect anyone to customize the command while in the context of the tutorial. The warning against straight-up copy-pasting is meant for other types of documents (such as the tool docs) where the commands given are usually generic examples that we'd rather people not take for granted as applicable to their use case.

The dashes problem is not frequent enough to make me willing to have the parser start accepting other types of characters -- I see that as a slippery slope to more problems. And to some extent it's a fairly innocuous and easy to resolve problem (we recognize it very easily, at least) that introduces people to the idea that characters are not all created equal and that word processors are evil. I know that view might be a bit controversial but it's how I feel when I have my educator hat on. Now, ask me again when I'm actually trying to run commands for my own purposes. Then there's swearing and shaking of fists

Geraldine Van der Auwera, PhD

• Member Posts: 82

Hello,

I have a query regarding generating a single sample VCF file from BAM file for exome sequencing. Is it possible to exclude some intronic regions while generating a VCF because those intronic regions we do not want in our analysis.

Suppose if it want keep variants present only CDS +12/-12 positions. Which command should I use at this step or hardfiltering step?

Thanks

@aneek
Hi,

-Sheila

• Member Posts: 82
edited April 2016

@ Sheila
Thank you for quick response. I have understood that I have to use an interval padding option with -L option to restrict my exome analysis up to CDS +12/-12 positions. However to do that should I have to use
"-L chr:1-12" option to add padding 12 bp around all exons? Please correct me if I am wrong.

No, that would mean you want to run on bases 1 to 12 of a contig called chr. The interval padding option is -ip.

Geraldine Van der Auwera, PhD

• Member Posts: 82

@Geraldine_VdAuwera

Thank you for the correction. So now if I use the following command would serve my purpose of to restrict my exome analysis up to CDS +12/-12 positions.

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-L file_interval.bed \
-ip 12
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants.vcf

Please correct me again if I am wrong.

Yes that's correct.

Geraldine Van der Auwera, PhD

• Member Posts: 82

@Geraldine_VdAuwera

Thank you very much for the guidance.

After getting the VCF file, if I want to filter out known variants having greater than 0.01 MAF in 1000genome, ExAC and EVS, from the VCF file itself, before going to the annotation step, is there any tool in GATK to do that?

In other words if there is any tools or option in GATK which can filter the VCF file by removing known variants having greater than 0.01 MAF in 1000genome, ExAC and EVS?

If not is there any other way to do that?

Thank you.

@aneek
Hi,

You will need to use SelectVariants to select the sites in the VCFs that have greater than 0.01 MAF. Then, use SelectVariants again to remove the sites from your own VCF.

-Sheila

• Member Posts: 82

@Sheila

Is their any specific example in the tutorial to do that. It is not much clear to me about what are the options should I use in the command line. I want to keep the novel variants along with the known variants having MAF < 0.01 in population in the output VCF. MAF is in terms of POPULATION metric and not the Allele FRACTION here.

Do I need to feed reference data from 1000 genome, EVS and ExAC in the commandline based on which SelectVariants will perform the task. Because what I guess that the program will need some reference with which it will compare MAF of each variant in the input VCF and will generate an output VCF having only those variants which have MAF less than equals to 0.01 along with the other variants which are novel (not reported in any of these databases).

Any help would be highly appreciated. Thank you.

• Member Posts: 82

Hi, any help or suggestions regarding my queries..

@aneek
Hi,

-Sheila

• Member Posts: 82
edited April 2016

@Sheila

Thank you very much for the information. Another thing is how can I restrict the information provided in the INFO column.

Suppose I want to generate a VCF having only "AC=4;AF=1.00;AN=4;DB;DP=76;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.21;SF=0,1;SOR=0.693" in the INFO column excluding the excess information like "BaseQRankSum=1.973;ClippingRankSum=0.932;MQRankSum=0.282;QD=28.65;ReadPosRankSum=-0.065 etc. which it provides for some variants while using "HaplotypeCaller".

@aneek
Hi,

You can use -sites_only.

-Sheila

• Member Posts: 82

@Sheila

Thank you very much. It is written in the link that using this argument will generate a VCF file with first 8 columns including the INFO column (excluding the genotypes). How will it restrict the information as I have mentioned in my earlier query in the INFO column. Shall I have to put any value for this argument.

Suppose if I want to keep only "AC=4;AF=1.00;AN=4" information in the INFO column what value should I use for -sites_only argument?

In other words is there any control in GATK to restrict the amount of information in the INFO column of VCF.

Thanks..

@aneek
Hi,

You can use -XA to exclude any annotations you don't want.

-Sheila

• PolandMember Posts: 15
edited June 2016

HI, Is there any way to make it run faster as I am running it "GATK 3.5" on cluster "using one node 28 core 100gb of rams " and it takes 7 days and it is not even finished "my input two .bam samples each is 109 GB"

I think the solution may be to use the Queue system http://gatkforums.broadinstitute.org/gatk/discussion/1306/overview-of-queue but it is not clear how it will control rams or how it will speed it up

@medhat
Hi,

You can try using WDL which is replacing Queue and is easier to use than Queue.

-Sheila

• PolandMember Posts: 15

Thanks a lot for the answer, but after reading the documentation there is no clear way how to use it with slurm cluster and to use multi node ?

Apologies for the delay in the response, I've been clarifying a few things to hopefully give you a response to satisfy all your questions.

At this time, WDL does not support a Slurm backend, so I would actually recommend Queue. We do have some presentations about getting started with Queue, which you can find here.

There is no explicit ram control in Queue itself. When writing the scala script for your Queue pipeline, you can specify the usual JVM memory parameters within the GATK command. At the Slurm level, you can use jobNative arguments (listed in the Advanced Queue presentation linked above) to specify memory requirements in the Queue command.

In addition to that, you can also speed up the runtime using multithreading, which is done with the GATK argument -nt or -nct. You can read more about how that works here.

• USAMember Posts: 78

The default values for following parameters in HaplotypeCaller and GenotypeGVCFs are :

-stand_emit_conf 30
-stand_call_conf 30

But, for Single Sample DNA-Seq analysis, it seems to me the suggestions are :

-stand_emit_conf 10
-stand_call_conf 30

Please let me know if my understanding is correct regarding the recommended value of 10 for stand_emit_conf parameter especially in single sample DNA-Seq exome analysis.

Thanks,
mglclinical

• USAMember Posts: 78

I am wondering if anyone has any opinion on my question ?

Thanks,
mglclinical

Hmm, aside from the crickets... frankly it depends more on the level of sensitivity you want to achieve then whether you're working with a single sample or with a cohort. -stand_emit_conf 10 will allow you to consider sites where the evidence is very unconvincing, and that would most likely be filtered out in the next step anyway. Keep in mind that those sites will have a status of filtered, so if you want them to be included in subsequent analysis you need to use the appropriate arguments for each tool. If you don't want to have to deal with that, drop stand_call_conf too.

Note also that we're probably going to get rid of the stand_emit_conf argument in the next version because we don't find it useful any more and it causes a lot of confusion. I think it will be easier to have just one dial to turn to set the sensitivity floor.

Geraldine Van der Auwera, PhD

• USAMember Posts: 78

@Geraldine_VdAuwera , thank you for your opinion

• guangzhou, ChinaMember Posts: 1

Hi @Geraldine_VdAuwera ,does the arguement "stand_emit_conf" have been deleted in the up-to-date nightly version？ I meet some error while call variant with HaplotypeCaller: "ERROR MESSAGE: Arguement with name 'stand_emit_conf' isn't defined."

@LeanStream
Hi,

Yes, that argument has been deprecated. It was supposed to be a convenience function to do a sort of "pre-filtering", but mostly it ended up being confusing and pointless. You can still use stand_call_conf.

-Sheila

• SeattleMember Posts: 2
edited January 18

I got this error when I applied the example command.

##### ERROR MESSAGE: Invalid command line: The parameter standard_min_confidence_threshold_for_emitting is deprecated.  This argument is no longer used in GATK versions 3.7 and newer. Please see the online documentation for the latest usage recommendations.


Then I found this post. So, should I just remove the -stand_emit_conf 10?

Also, which value should be a good starting point for This also comes with a lower default value for the -stand_call_conf threshold

Thanks,
Roden

#### Issue · Github January 18 by Geraldine_VdAuwera

Issue Number
1643
State
open
Last Updated
Yes that's correct. We'll update the tutorial accordingly. Thanks for reporting this!

Geraldine Van der Auwera, PhD

• karolinska InstitutetMember Posts: 4

Hello, Could you please comment which will be a good starting value for -stand_call_conf when performing HaplotypeCaller in RNA-seq data. Thanks

The RNAseq doc specifies a call threshold of 20.

Geraldine Van der Auwera, PhD

• USAMember Posts: 7

I have 19 samples (WGS). As recommended by best practices, I ran HaplotypeCaller on individual samples to generate the corresponding gVCF file. The problem is, it takes a huge amount of time (~99 hours). Is there a way to shorten this time amount.
Also I was curious, if I called HaplotypeCaller with multiple BAM files (different samples) directly to generate raw variants, will it give me incorrect results?
I mean for exome data, I would definitely first generate gVCFs; but for WGS data, given that it takes so much time, can we directly call HaplotypeCaller for vcf files (in the interest of time saving).

Thanks
Abhishek

@abhishek_maj08 Calling HC on multiple samples takes a lot of time as well -- the computationl resource expenditure increases dramatically. So we parallelize HaplotypeCaller over intervals to accelerate processing. We have a intervals that are appropriate for WGS, if you need it.

Geraldine Van der Auwera, PhD

• USAMember Posts: 7

@Geraldine_VdAuwera Yes please. It will be very helpful if you provide the intervals. Also just to clarify, when you are talking about parallelizing HC, are you talking about generating GVCF or the joint analysis?

We parallelize both over the same intervals. Our workflow for the first part of the pipeline (up to GVCF) is available here if you want to check it out, and we have some additional workflows coming out soon for the rest.

The intervals lists are in our resource bundle -- see the Downloads page for details of how to access it.

Geraldine Van der Auwera, PhD

• USAMember Posts: 7

@Geraldine_VdAuwera
Thank you for the workflow. I was looking at the wgs_calling_regions.hg38.interval_list and I observed that there are chunks in each chromosome that have been left out. Could you please tell me how these intervals were selected/chosen? I mean why a segment of a chromosome is selected or left out?