(howto) Call variants with HaplotypeCaller

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,347Administrator, Dev admin
edited February 11 in Tutorials

Objective

Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

Prerequisites

  • TBD

Steps

  1. Determine the basic parameters of the analysis
  2. Call variants in your sequence data

1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

  • Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

  • Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

  • Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms "called" and "filtered" are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.


2. Call variants in your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T HaplotypeCaller \ 
    -R reference.fa \ 
    -I preprocessed_reads.bam \  
    -L 20 \ 
    --genotyping_mode DISCOVERY \ 
    -stand_emit_conf 10 \ 
    -stand_call_conf 30 \ 
    -o raw_variants.vcf 

Note that -L specifies that we only want to run the command on a subset of the data (here, chromosome 20). This is useful for testing as well as other purposes, as documented here. For example, when running on exome data, we use -L to specify a file containing the list of exome targets corresponding to the capture kit that was used to generate the exome libraries.

Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • BobHarrisBobHarris earthPosts: 11Member

    FYI, for users like myself who are devotees of Murphy and his laws, the first mention of genotyping_mode in this tutorial has a single dash and beware, it is not a real ascii dash. If you unwittingly copy and paste that into your command it won't work.

    First you'll discover you need a second dash. But even after you add that you might be misled to think that DISCOVERY isn't a valid setting, as suggested by these error messages:

    ERROR MESSAGE: Invalid argument value '–-genotyping_mode' at position 6.
    ERROR Invalid argument value 'DISCOVERY' at position 7.

    What really confused me was if it didn't like genotyping_mode why was it complaining about DISCOVERY; and if it did like genotyping_mode, well, I couldn't see what was wrong with DISCOVERY. And why was it telling me it thought '–-genotyping_mode' was an invalid argument value? And what's the difference between an ERROR MESSAGE and a message telling me there's an ERROR?

    These tutorials are very helpful, by the way. In spite of the pit- and prat-falls.

  • SheilaSheila Broad InstitutePosts: 2,678Member, Broadie, Moderator, Dev admin

    @BobHarris
    Hi,

    These tutorials are not meant to be copied and pasted! We want to give you an idea of the arguments you might want to use, but we do expect you to read the tool documentation and understand all of the possible arguments so you can decide which are most appropriate for your analysis. Of course, we understand users might want to copy and paste then blame us for any improper outputs, so we purposely put in hard to copy arguments.

    As for the error messages, you are right they are slightly confusing, but in this case, the invalid argument value comes from the program thinking –-genotyping_mode and DISCOVERY are values for the previous argument. The ERROR MESSAGE in the beginning is stating there is indeed an error message and the second ERROR points out what the possible other errors are because there is more than one error here.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,347Administrator, Dev admin

    For the record, I think @Sheila is joking -- we certainly do not make it difficult to copy/paste commands on purpose. We've had a few errors creep in because the text above was in part copy-pasted from a paper manuscript (the 2014 Current Protocols paper). I've fixed the errors you pointed out. If you see any others let us know and we'll fix those too.

    Regarding the error messages there is some misunderstanding due to the formatting. There is only one error message (prefaced by ERROR MESSAGE) but it's reporting several problems, so there's enough text to lead to multiple lines, which are all prefaced by the word ERROR by convention.

    As to the nature of the errors, what you're seeing here is a domino effect, where one error in an argument name, value or formatting can lead to further errors in subsequent arguments as the parser fails to correctly recognize which is which. The fact that the parser complains that –-genotyping_mode was an invalid argument value suggests that something in front of it is wrong. Fun fact: in your post (and possibly your command?) the first dash is one of those fake non-ASCII dashes... If it helps, we run into that problem fairly frequently too and it's annoying as hell. On the bright side, once you've run into this problem you're a lot more likely to recognize it immediately (and not spend a lot of time tearing your hair out) the next time it happens.

    Geraldine Van der Auwera, PhD

  • SheilaSheila Broad InstitutePosts: 2,678Member, Broadie, Moderator, Dev admin

    @BobHarris
    Hi,

    Yes, to clarify I was indeed joking about putting in hard to copy arguments to throw off users. We would never do anything like that, at least not on purpose! :smile:

    -Sheila

  • BobHarrisBobHarris earthPosts: 11Member

    Right, I did know Sheila was joking.

    I did in fact think that "Invalid argument value '–-genotyping_mode' at position 6" meant that the parser thought the quoted text was to be a value for some preceding argument. But my command up to that point was "java -jar GenomeAnalysisTK.jar --analysis_type HaplotypeCaller --reference_sequence apple.fa --input_file orange.bam --genotyping_mode DISCOVERY". (I've corrected the hyphen in that). I wasn't able to identify what argument it might have thought "--genotyping_mode" was a value for, since the previous argument, "--input_file", had been supplied a value, "orange.bam". That error message could be improved if it also told me what argument the parser thinks this allegedly bad value belongs to.

    The expectation that users won't cut and paste code or keywords posted in a tutorial is ill-considered. Being myself a lousy typist, I'm (usually) much less likely to screw up an unfamiliar command line if I copy and paste from an example, editing to change the parts specific to me. I did in fact read all of the tutorial text before running my command, and did look through the output of HaplotypeCaller --help to get some brief understanding of each argument before I actually tried running the command.

    If, as Geraldine writes, non-ascii dashes are a fairly frequent problem, it would seem useful to fix the problem in the parser. It would save headaches for users and support staff alike.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,347Administrator, Dev admin

    OK, just wanted to make sure because humor can be easily lost in translation, and taken at face value that statement could be a bit upsetting :)

    I agree with you that tutorial commands should be cut-and-pastable, @BobHarris. The modus operandi of these tutorials is that we tell you to run a specific command line to achieve a specific purpose. Obviously we're hoping people read the explanation of what the key arguments do, but we don't expect anyone to customize the command while in the context of the tutorial. The warning against straight-up copy-pasting is meant for other types of documents (such as the tool docs) where the commands given are usually generic examples that we'd rather people not take for granted as applicable to their use case.

    The dashes problem is not frequent enough to make me willing to have the parser start accepting other types of characters -- I see that as a slippery slope to more problems. And to some extent it's a fairly innocuous and easy to resolve problem (we recognize it very easily, at least) that introduces people to the idea that characters are not all created equal and that word processors are evil. I know that view might be a bit controversial but it's how I feel when I have my educator hat on. Now, ask me again when I'm actually trying to run commands for my own purposes. Then there's swearing and shaking of fists :)

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.