The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.
Register now for the upcoming GATK Best Practices workshop, Feb 20-22 in Leuven, Belgium. Open to all comers! More info and signup at http://bit.ly/2i4mGxz

# HaplotypeCaller - single block subsitution or multiple SNVs?

Member Posts: 9
edited October 2012

Hello GATK team,

I notice that HaplotypeCaller will sometimes call a single block substitution variant where UnifiedGenotyper would call two separate SNVs, e.g.:

## HC

chr12 93147901 . CGTGCCA TGTGCCT 6490.01 . AC=4;AF=0.667;AN=6;ActiveRegionSize=260;ClippingRankSum=0.815;DP=153;EVENTLENGTH=0;FS=4.419;MLEAC=4;MLEAF=0.667;MQ=70.00;MQRankSum=1.712;NVH=2;NumHapAssembly=12;NumHapEval=7;QD=57.95;QDE=28.98;ReadPosRankSum=-1.668;TYPE=SNP;extType=MNP GT:GQ:PL 1/1:99:3489,175,0 0/0:99:0,123,2700 1/1:86:3001,86,0

## UG

chr12 93147901 . C T 3801.01 . AC=4;AF=0.667;AN=6;BaseQRankSum=-7.623;DP=164;Dels=0.00;FS=2.258;HaplotypeScore=0.8824;MLEAC=4;MLEAF=0.667;MQ=70.00;MQ0=0;MQRankSum=0.370;QD=32.77;ReadPosRankSum=-0.525;SB=-1.332e+03 GT:AD:DP:GQ:PL 1/1:0,63:63:99:2089,172,0 0/0:48,0:48:99:0,120,1581 1/1:1,52:53:99:1712,106,0

chr12 93147907 . A T 3678.01 . AC=4;AF=0.667;AN=6;BaseQRankSum=-2.338;DP=162;Dels=0.00;FS=0.000;HaplotypeScore=1.2101;MLEAC=4;MLEAF=0.667;MQ=70.00;MQ0=0;MQRankSum=-0.710;QD=31.98;ReadPosRankSum=-1.209;SB=-1.345e+03 GT:AD:DP:GQ:PL 1/1:0,61:61:99:1932,156,0 0/0:47,0:47:99:0,120,1525 1/1:1,53:54:99:1746,110,0

I can see other cases where two SNVs the same distance apart (6 bp) are output as two separate variants by HC, e.g..

## HC

chr1 13110 . G A 62.24 . AC=1;AF=0.167;AN=6;BaseQRankSum=-2.631;DP=77;Dels=0.00;FS=18.534;HaplotypeScore=3.7523;MLEAC=1;MLEAF=0.167;MQ=33.69;MQ0=0;MQRankSum=-2.277;QD=3.28;ReadPosRankSum=-0.718;SB=-5.869e-03 GT:AD:DP:GQ:PL 0/0:20,0:20:51:0,51,582 0/1:13,6:19:97:97,0,337 0/0:37,1:38:81:0,81,1122

chr1 13116 . T G 344.26 . AC=3;AF=0.500;AN=6;BaseQRankSum=4.784;DP=59;Dels=0.00;FS=9.743;HaplotypeScore=2.8686;MLEAC=3;MLEAF=0.500;MQ=33.73;MQ0=0;MQRankSum=-5.601;QD=5.83;ReadPosRankSum=-0.393;SB=-4.601e+01 GT:AD:DP:GQ:PL 0/1:11,5:16:71:71,0,272 0/1:5,5:11:99:104,0,130 0/1:20,12:32:99:208,0,537

I have two questions:

1) What determines whether two SNVs lying close together are output as a single block substitution or as two separate variants? Does it make a difference that all three samples in this file are called as homozygous for the chr12 variant? I've examined the chr1 variant in IGV and there are reads that overlap both variants for the sample with heterozygous genotypes for both.

2) Is there any option to have HC produce multiple SNV variant lines rather than a block substitution in cases like the chr12 example? (I can't see one, but thought I'd ask just in case). I ask because some downstream programs struggle with block substitutions, and also because I'd like to know how many variants are called by HC compared to UG. I would consider the chr12 example two separate variants as the two SNVs affect different codons, although I appreciate that opinions may differ as to what exactly constitutes a distinct variant in some cases.

Katherine

Post edited by Geraldine_VdAuwera on
Tagged:

• Dev Posts: 122 ✭✭✭

Hi Katherine,

One of the advantages of the HaplotypeCaller over the UnifiedGenotyper is that it considers regions as haplotypes instead of treating each locus independently. In your first example the HC has determined that the two variants segregate together in your samples (using a likelihood-based R^2 calculation) and so it creates a single variant to explain the data. In the second example the HC thinks the variants don't appear to be on the same haplotypes and so they are output as separate variants. The LD calculation gets more powerful with more samples. Short of calling your three samples individually (so that there is no combining of component variants) there is currently no way to modify this behavior. We'll think about adding something like this in the future.

Cheers,

• Dev Posts: 122 ✭✭✭

Hi Katherine,

One of the advantages of the HaplotypeCaller over the UnifiedGenotyper is that it considers regions as haplotypes instead of treating each locus independently. In your first example the HC has determined that the two variants segregate together in your samples (using a likelihood-based R^2 calculation) and so it creates a single variant to explain the data. In the second example the HC thinks the variants don't appear to be on the same haplotypes and so they are output as separate variants. The LD calculation gets more powerful with more samples. Short of calling your three samples individually (so that there is no combining of component variants) there is currently no way to modify this behavior. We'll think about adding something like this in the future.

Cheers,

• Member Posts: 9

Hi GATK team,

I would like to check that my understanding of the current behaviour (v2.5) is correct.

1. The default behaviour is now to print each individual variant separately. I.e. for biallelic variants, either the reference or the alternate allele will be of length 1
2. --mergeVariantsViaLD restores the behaviour described above, where individual variants deemed to segregate together are merged into a singleblock substitution variant
3. The VariantsToAllelicPrimitives walker can be used to partially convert the output of (2) to (1). This will only work for biallelic block substitutions where all individual variants are SNVs. It won't work for block variants containing one or more indel variants.

Is this correct?

Thanks very much,

Katherine

• Member Posts: 9

Thanks for the reply, Geraldine, and thanks to the rest of the team too for implementing these options. Having variants reported individually means that standard annotation programs can be used for HaplotypeCaller output, which is great news. Of course, it's nice to have the option of reporting block substitutions, too.

Katherine