Variant calling inconsistencies using GVCF

sgrzegorsgrzegor Ann ArborMember

Before starting, I'm a grad student without formal training in bioinformatics so I've mainly learned from online forums. That being said, if this question is stupid or I'm missing critical information, don't hold back.

I have an issue similar to this thread http://gatkforums.broadinstitute.org/gatk/discussion/5638/incorrect-missing-genotypes-using-haplotypecaller but it doesn't appear to be a data quality issue as was seen there.

To summarize:
I followed the best practices workflow using GATK 3.7, although I omitted base quality score recalibration. I have a a combination of WGS and WES and 21 samples for which I generated gvcf files and then genotyped concurrently. I noticed that occasionally single samples would be called ./. in regions where every other sample is 1/1. I focused on one of these oddities and went to the single sample that was causing the problem.
image
Sample 1270 is the original bam that is called improperly while 1271 is called properly. Sample 1270 has 40 alternate reads and 0 reference reads.

I ran the following:

$ GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R GRCz10.fa \ 
-I 1270.mark.bam \ 
-o 1270origregion.g.vcf \
-L 16:17352455-17392053 \
-bamout 1270origregion.bam \
-ERC GVCF 

The region (-L) is the surrounding gene from the region list I used on the whole genome data although this sample specifically is from an exome capture.

In the resulting GVCF I see:

16  17361797    .   C   <NON_REF>   .   .   END=17361797    GT:DP:GQ:MIN_DP:PL  0/0:40:0:40:0,0,0

In the bamout generated I see:
image

This looks odd as the "ArtificialHaploytypes" (the top 4 reads) generated do not contain the variant. The remaining 24 reads are from the sample. It seems that the caller is pulling the DP of 40 from the original file but for some reason calling it reference.

Out of curiosity I bootstrapped this with Haplotype Caller not in GCVF mode:

$ GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R GRCz10.fa \ 
-I 1270.origregion.bam \ 
-o 1270bootstrap.bam \
-L 16:17352455-17392053 \
-bamout 1270bootstrap.bam

And now the vcf includes the variant

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1270    HC
16  17361797    .   C   T   780.17  .   AC=2;AF=0.500;AN=4;BaseQRankSum=-0.659;ClippingRankSum=0.000;DP=29;ExcessHet=0.7918;FS=10.017;MLEAC=2;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=31.21;ReadPosRankSum=-0.913;SOR=0.472 GT:AD:DP:GQ:PL  1/1:0,25:25:75:814,75,00/0:4,0:4:12:0,12,151

and the bamout looks to include the variant in the artificial haplotypes

image

If I do a traditional multisample haplotype caller using the 1270 and 1271 shown above, it will call both properly. However, if I call g.vcf and the genotype them it misses the call in 1270. If I call 1270 in the standard mode it also misses the call. The base qualities and mapping qualities in the region seem to be fine, with the caveat that I didn't recalibrate scores.

The last thing I notice is that the active region is truncated for sample 1270. But if I call 1270 with 1271 the active region extends 100 bp upstream:
image

If anyone has any experience with this, tips for troubleshooting or needs more information, please let me know.

Thanks
Steve

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, that is odd, and interesting. Can you just check if you get the same results with the latest nightly build?

  • sgrzegorsgrzegor Ann ArborMember

    HC from the nightly build (GenomeAnalysisTK-nightly-2017-04-06-g34bd8a3) gave the same results in both default and GVCF modes.

    However, I tried both the nightly and the stable 3.7 with the region (-L) set to the 200 bp immediately surrounding the missing snp. This seemed to select a different active region that didn't truncate the ends and correctly called the snp. However at a 250bp region and a large 100kbp region, the initial problems recur.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sgrzegor
    Hi Steve,

    Can you submit a bug report so I can take a look locally? If so instructions are here.

    Thanks,
    Sheila

  • sgrzegorsgrzegor Ann ArborMember

    Hi Sheila,
    I submitted the bug report today, the file is called sgrzegor.tar.gz and should contain everything needed to replicate the error.

    Thank you,
    Steve

    Issue · Github
    by Sheila

    Issue Number
    1980
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sgrzegor
    Hi Steve,

    Thanks. I will take a look soon.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sgrzegor
    Hi,

    Sorry for the delay. I tried with the latest version of GATK, and I did not see the site present in the output from HaplotypeCaller in normal mode or in GVCF mode?

    -Sheila

  • sgrzegorsgrzegor Ann ArborMember

    @Sheila
    No worries about the delay.
    Do you mean you don't see the variant present at 17361797 in the HC output? That's the problem, in the raw reads it seems there is a SNP with roughly 40x coverage of the alternate allele, but haplotype caller seems to be saying its reference at that location. My apologies if I'm misinterpreting your statement.

    Thanks,
    Steve

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sgrzegor
    Hi Steve,

    Ah, I see. I did think the issue was the contradiction between the GVCF and VCF. However, on second look, it looks like an issue with the assembly. If you add either --forceActive or --allowNonUniqueKmersInRef to the command, the site is called 1/1.

    -Sheila

  • sgrzegorsgrzegor Ann ArborMember

    @Sheila
    Ahh okay. That makes sense, so i imagine this is due to the high duplication rate in our models genome? For processing large groups would either of these commands be more reliable in terms of accuracy and/or runtime?
    Thank you,
    Steve

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sgrzegor
    Hi Steve,

    I know there was some chatter about making --allowNonUniqueKmersInRef the default in highly repetitive regions, but there has not been a definitive decision made on that yet. The major issue is that it could introduce false positives and take up a lot of compute.

    I think if you know the regions where the genome is repetitive, it is worth adding the flag.

    -Sheila

Sign In or Register to comment.