(howto) Call variants with HaplotypeCaller

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,980Administrator, Dev admin
edited February 11 in Tutorials

Objective

Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

Prerequisites

  • TBD

Steps

  1. Determine the basic parameters of the analysis
  2. Call variants in your sequence data

1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

  • Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

  • Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

  • Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms "called" and "filtered" are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.


2. Call variants in your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T HaplotypeCaller \ 
    -R reference.fa \ 
    -I preprocessed_reads.bam \  
    -L 20 \ 
    --genotyping_mode DISCOVERY \ 
    -stand_emit_conf 10 \ 
    -stand_call_conf 30 \ 
    -o raw_variants.vcf 

Note that -L specifies that we only want to run the command on a subset of the data (here, chromosome 20). This is useful for testing as well as other purposes, as documented here. For example, when running on exome data, we use -L to specify a file containing the list of exome targets corresponding to the capture kit that was used to generate the exome libraries.

Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • BobHarrisBobHarris earthPosts: 29Member

    FYI, for users like myself who are devotees of Murphy and his laws, the first mention of genotyping_mode in this tutorial has a single dash and beware, it is not a real ascii dash. If you unwittingly copy and paste that into your command it won't work.

    First you'll discover you need a second dash. But even after you add that you might be misled to think that DISCOVERY isn't a valid setting, as suggested by these error messages:

    ERROR MESSAGE: Invalid argument value '–-genotyping_mode' at position 6.
    ERROR Invalid argument value 'DISCOVERY' at position 7.

    What really confused me was if it didn't like genotyping_mode why was it complaining about DISCOVERY; and if it did like genotyping_mode, well, I couldn't see what was wrong with DISCOVERY. And why was it telling me it thought '–-genotyping_mode' was an invalid argument value? And what's the difference between an ERROR MESSAGE and a message telling me there's an ERROR?

    These tutorials are very helpful, by the way. In spite of the pit- and prat-falls.

  • SheilaSheila Broad InstitutePosts: 3,241Member, Broadie, Moderator, Dev admin

    @BobHarris
    Hi,

    These tutorials are not meant to be copied and pasted! We want to give you an idea of the arguments you might want to use, but we do expect you to read the tool documentation and understand all of the possible arguments so you can decide which are most appropriate for your analysis. Of course, we understand users might want to copy and paste then blame us for any improper outputs, so we purposely put in hard to copy arguments.

    As for the error messages, you are right they are slightly confusing, but in this case, the invalid argument value comes from the program thinking –-genotyping_mode and DISCOVERY are values for the previous argument. The ERROR MESSAGE in the beginning is stating there is indeed an error message and the second ERROR points out what the possible other errors are because there is more than one error here.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,980Administrator, Dev admin

    For the record, I think @Sheila is joking -- we certainly do not make it difficult to copy/paste commands on purpose. We've had a few errors creep in because the text above was in part copy-pasted from a paper manuscript (the 2014 Current Protocols paper). I've fixed the errors you pointed out. If you see any others let us know and we'll fix those too.

    Regarding the error messages there is some misunderstanding due to the formatting. There is only one error message (prefaced by ERROR MESSAGE) but it's reporting several problems, so there's enough text to lead to multiple lines, which are all prefaced by the word ERROR by convention.

    As to the nature of the errors, what you're seeing here is a domino effect, where one error in an argument name, value or formatting can lead to further errors in subsequent arguments as the parser fails to correctly recognize which is which. The fact that the parser complains that –-genotyping_mode was an invalid argument value suggests that something in front of it is wrong. Fun fact: in your post (and possibly your command?) the first dash is one of those fake non-ASCII dashes... If it helps, we run into that problem fairly frequently too and it's annoying as hell. On the bright side, once you've run into this problem you're a lot more likely to recognize it immediately (and not spend a lot of time tearing your hair out) the next time it happens.

    Geraldine Van der Auwera, PhD

  • SheilaSheila Broad InstitutePosts: 3,241Member, Broadie, Moderator, Dev admin

    @BobHarris
    Hi,

    Yes, to clarify I was indeed joking about putting in hard to copy arguments to throw off users. We would never do anything like that, at least not on purpose! :smile:

    -Sheila

  • BobHarrisBobHarris earthPosts: 29Member

    Right, I did know Sheila was joking.

    I did in fact think that "Invalid argument value '–-genotyping_mode' at position 6" meant that the parser thought the quoted text was to be a value for some preceding argument. But my command up to that point was "java -jar GenomeAnalysisTK.jar --analysis_type HaplotypeCaller --reference_sequence apple.fa --input_file orange.bam --genotyping_mode DISCOVERY". (I've corrected the hyphen in that). I wasn't able to identify what argument it might have thought "--genotyping_mode" was a value for, since the previous argument, "--input_file", had been supplied a value, "orange.bam". That error message could be improved if it also told me what argument the parser thinks this allegedly bad value belongs to.

    The expectation that users won't cut and paste code or keywords posted in a tutorial is ill-considered. Being myself a lousy typist, I'm (usually) much less likely to screw up an unfamiliar command line if I copy and paste from an example, editing to change the parts specific to me. I did in fact read all of the tutorial text before running my command, and did look through the output of HaplotypeCaller --help to get some brief understanding of each argument before I actually tried running the command.

    If, as Geraldine writes, non-ascii dashes are a fairly frequent problem, it would seem useful to fix the problem in the parser. It would save headaches for users and support staff alike.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,980Administrator, Dev admin

    OK, just wanted to make sure because humor can be easily lost in translation, and taken at face value that statement could be a bit upsetting :)

    I agree with you that tutorial commands should be cut-and-pastable, @BobHarris. The modus operandi of these tutorials is that we tell you to run a specific command line to achieve a specific purpose. Obviously we're hoping people read the explanation of what the key arguments do, but we don't expect anyone to customize the command while in the context of the tutorial. The warning against straight-up copy-pasting is meant for other types of documents (such as the tool docs) where the commands given are usually generic examples that we'd rather people not take for granted as applicable to their use case.

    The dashes problem is not frequent enough to make me willing to have the parser start accepting other types of characters -- I see that as a slippery slope to more problems. And to some extent it's a fairly innocuous and easy to resolve problem (we recognize it very easily, at least) that introduces people to the idea that characters are not all created equal and that word processors are evil. I know that view might be a bit controversial but it's how I feel when I have my educator hat on. Now, ask me again when I'm actually trying to run commands for my own purposes. Then there's swearing and shaking of fists :)

    Geraldine Van der Auwera, PhD

  • aneekaneek Posts: 65Member

    Hello,

    I have a query regarding generating a single sample VCF file from BAM file for exome sequencing. Is it possible to exclude some intronic regions while generating a VCF because those intronic regions we do not want in our analysis.

    Suppose if it want keep variants present only CDS +12/-12 positions. Which command should I use at this step or hardfiltering step?

    Thanks

  • SheilaSheila Broad InstitutePosts: 3,241Member, Broadie, Moderator, Dev admin

    @aneek
    Hi,

    I think this article should help.

    -Sheila

  • aneekaneek Posts: 65Member
    edited April 5

    @ Sheila
    Thank you for quick response. I have understood that I have to use an interval padding option with -L option to restrict my exome analysis up to CDS +12/-12 positions. However to do that should I have to use
    "-L chr:1-12" option to add padding 12 bp around all exons? Please correct me if I am wrong.

    Post edited by aneek on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,980Administrator, Dev admin

    No, that would mean you want to run on bases 1 to 12 of a contig called chr. The interval padding option is -ip.

    Geraldine Van der Auwera, PhD

  • aneekaneek Posts: 65Member

    @Geraldine_VdAuwera

    Thank you for the correction. So now if I use the following command would serve my purpose of to restrict my exome analysis up to CDS +12/-12 positions.

    java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fa \
    -I preprocessed_reads.bam \
    -L file_interval.bed \
    -ip 12
    --genotyping_mode DISCOVERY \
    -stand_emit_conf 10 \
    -stand_call_conf 30 \
    -o raw_variants.vcf

    Please correct me again if I am wrong.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,980Administrator, Dev admin

    Yes that's correct.

    Geraldine Van der Auwera, PhD

  • aneekaneek Posts: 65Member

    @Geraldine_VdAuwera

    Thank you very much for the guidance.

    After getting the VCF file, if I want to filter out known variants having greater than 0.01 MAF in 1000genome, ExAC and EVS, from the VCF file itself, before going to the annotation step, is there any tool in GATK to do that?

    In other words if there is any tools or option in GATK which can filter the VCF file by removing known variants having greater than 0.01 MAF in 1000genome, ExAC and EVS?

    If not is there any other way to do that?

    Thank you.

  • SheilaSheila Broad InstitutePosts: 3,241Member, Broadie, Moderator, Dev admin

    @aneek
    Hi,

    You will need to use SelectVariants to select the sites in the VCFs that have greater than 0.01 MAF. Then, use SelectVariants again to remove the sites from your own VCF.

    -Sheila

  • aneekaneek Posts: 65Member

    @Sheila

    Is their any specific example in the tutorial to do that. It is not much clear to me about what are the options should I use in the command line. I want to keep the novel variants along with the known variants having MAF < 0.01 in population in the output VCF. MAF is in terms of POPULATION metric and not the Allele FRACTION here.

    Do I need to feed reference data from 1000 genome, EVS and ExAC in the commandline based on which SelectVariants will perform the task. Because what I guess that the program will need some reference with which it will compare MAF of each variant in the input VCF and will generate an output VCF having only those variants which have MAF less than equals to 0.01 along with the other variants which are novel (not reported in any of these databases).

    Any help would be highly appreciated. Thank you.

  • aneekaneek Posts: 65Member

    Hi, any help or suggestions regarding my queries..

  • SheilaSheila Broad InstitutePosts: 3,241Member, Broadie, Moderator, Dev admin

    @aneek
    Hi,

    I think this thread and this thread may help.

    -Sheila

  • aneekaneek Posts: 65Member
    edited April 13

    @Sheila

    Thank you very much for the information. Another thing is how can I restrict the information provided in the INFO column.

    Suppose I want to generate a VCF having only "AC=4;AF=1.00;AN=4;DB;DP=76;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.21;SF=0,1;SOR=0.693" in the INFO column excluding the excess information like "BaseQRankSum=1.973;ClippingRankSum=0.932;MQRankSum=0.282;QD=28.65;ReadPosRankSum=-0.065 etc. which it provides for some variants while using "HaplotypeCaller".

    Post edited by aneek on
  • SheilaSheila Broad InstitutePosts: 3,241Member, Broadie, Moderator, Dev admin

    @aneek
    Hi,

    You can use -sites_only.

    -Sheila

  • aneekaneek Posts: 65Member

    @Sheila

    Thank you very much. It is written in the link that using this argument will generate a VCF file with first 8 columns including the INFO column (excluding the genotypes). How will it restrict the information as I have mentioned in my earlier query in the INFO column. Shall I have to put any value for this argument.

    Suppose if I want to keep only "AC=4;AF=1.00;AN=4" information in the INFO column what value should I use for -sites_only argument?

    In other words is there any control in GATK to restrict the amount of information in the INFO column of VCF.

    Thanks..

  • SheilaSheila Broad InstitutePosts: 3,241Member, Broadie, Moderator, Dev admin

    @aneek
    Hi,

    You can use -XA to exclude any annotations you don't want.

    -Sheila

Sign In or Register to comment.