Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

HaplotypeCaller - single block subsitution or multiple SNVs?

KatherineSKatherineS Posts: 9Member
edited October 2012 in Ask the GATK team

Hello GATK team,

I notice that HaplotypeCaller will sometimes call a single block substitution variant where UnifiedGenotyper would call two separate SNVs, e.g.:

HC

chr12 93147901 . CGTGCCA TGTGCCT 6490.01 . AC=4;AF=0.667;AN=6;ActiveRegionSize=260;ClippingRankSum=0.815;DP=153;EVENTLENGTH=0;FS=4.419;MLEAC=4;MLEAF=0.667;MQ=70.00;MQRankSum=1.712;NVH=2;NumHapAssembly=12;NumHapEval=7;QD=57.95;QDE=28.98;ReadPosRankSum=-1.668;TYPE=SNP;extType=MNP GT:GQ:PL 1/1:99:3489,175,0 0/0:99:0,123,2700 1/1:86:3001,86,0

UG

chr12 93147901 . C T 3801.01 . AC=4;AF=0.667;AN=6;BaseQRankSum=-7.623;DP=164;Dels=0.00;FS=2.258;HaplotypeScore=0.8824;MLEAC=4;MLEAF=0.667;MQ=70.00;MQ0=0;MQRankSum=0.370;QD=32.77;ReadPosRankSum=-0.525;SB=-1.332e+03 GT:AD:DP:GQ:PL 1/1:0,63:63:99:2089,172,0 0/0:48,0:48:99:0,120,1581 1/1:1,52:53:99:1712,106,0
chr12 93147907 . A T 3678.01 . AC=4;AF=0.667;AN=6;BaseQRankSum=-2.338;DP=162;Dels=0.00;FS=0.000;HaplotypeScore=1.2101;MLEAC=4;MLEAF=0.667;MQ=70.00;MQ0=0;MQRankSum=-0.710;QD=31.98;ReadPosRankSum=-1.209;SB=-1.345e+03 GT:AD:DP:GQ:PL 1/1:0,61:61:99:1932,156,0 0/0:47,0:47:99:0,120,1525 1/1:1,53:54:99:1746,110,0

I can see other cases where two SNVs the same distance apart (6 bp) are output as two separate variants by HC, e.g..

HC

chr1 13110 . G A 62.24 . AC=1;AF=0.167;AN=6;BaseQRankSum=-2.631;DP=77;Dels=0.00;FS=18.534;HaplotypeScore=3.7523;MLEAC=1;MLEAF=0.167;MQ=33.69;MQ0=0;MQRankSum=-2.277;QD=3.28;ReadPosRankSum=-0.718;SB=-5.869e-03 GT:AD:DP:GQ:PL 0/0:20,0:20:51:0,51,582 0/1:13,6:19:97:97,0,337 0/0:37,1:38:81:0,81,1122
chr1 13116 . T G 344.26 . AC=3;AF=0.500;AN=6;BaseQRankSum=4.784;DP=59;Dels=0.00;FS=9.743;HaplotypeScore=2.8686;MLEAC=3;MLEAF=0.500;MQ=33.73;MQ0=0;MQRankSum=-5.601;QD=5.83;ReadPosRankSum=-0.393;SB=-4.601e+01 GT:AD:DP:GQ:PL 0/1:11,5:16:71:71,0,272 0/1:5,5:11:99:104,0,130 0/1:20,12:32:99:208,0,537

I have two questions:

1) What determines whether two SNVs lying close together are output as a single block substitution or as two separate variants? Does it make a difference that all three samples in this file are called as homozygous for the chr12 variant? I've examined the chr1 variant in IGV and there are reads that overlap both variants for the sample with heterozygous genotypes for both.

2) Is there any option to have HC produce multiple SNV variant lines rather than a block substitution in cases like the chr12 example? (I can't see one, but thought I'd ask just in case). I ask because some downstream programs struggle with block substitutions, and also because I'd like to know how many variants are called by HC compared to UG. I would consider the chr12 example two separate variants as the two SNVs affect different codons, although I appreciate that opinions may differ as to what exactly constitutes a distinct variant in some cases.

Thanks very much in advance.

Katherine

Post edited by Geraldine_VdAuwera on

Best Answers

  • rpoplinrpoplin Posts: 122GATK Developer mod
    Answer ✓

    Hi Katherine,

    One of the advantages of the HaplotypeCaller over the UnifiedGenotyper is that it considers regions as haplotypes instead of treating each locus independently. In your first example the HC has determined that the two variants segregate together in your samples (using a likelihood-based R^2 calculation) and so it creates a single variant to explain the data. In the second example the HC thinks the variants don't appear to be on the same haplotypes and so they are output as separate variants. The LD calculation gets more powerful with more samples. Short of calling your three samples individually (so that there is no combining of component variants) there is currently no way to modify this behavior. We'll think about adding something like this in the future.

    Cheers,

Answers

  • KatherineSKatherineS Posts: 9Member

    Hi GATK team,

    I would like to check that my understanding of the current behaviour (v2.5) is correct.

    1. The default behaviour is now to print each individual variant separately. I.e. for biallelic variants, either the reference or the alternate allele will be of length 1
    2. --mergeVariantsViaLD restores the behaviour described above, where individual variants deemed to segregate together are merged into a singleblock substitution variant
    3. The VariantsToAllelicPrimitives walker can be used to partially convert the output of (2) to (1). This will only work for biallelic block substitutions where all individual variants are SNVs. It won't work for block variants containing one or more indel variants.

    Is this correct?

    Thanks very much,

    Katherine

  • KatherineSKatherineS Posts: 9Member

    Thanks for the reply, Geraldine, and thanks to the rest of the team too for implementing these options. Having variants reported individually means that standard annotation programs can be used for HaplotypeCaller output, which is great news. Of course, it's nice to have the option of reporting block substitutions, too.

    Katherine

Sign In or Register to comment.