Why is HaplotypeCaller not calling this Sanger sequencing confirmed variant?

cgilliescgillies Ann ArborMember

Hi,

I am using HaplotypeCaller in GVF mode and GenotypeGVCFs to call some high-depth targeted sequencing data. I have some variants that have been confirmed using Sanger sequencing and I am using these variants as a gold standard to evaluate the pipeline. HaplotypeCaller is missing some of these variants and I would like to understand why.

Here is a variant in the g.vcf file that I thought should of been called:

15 75644465 . C . . END=75644465 GT:DP:GQ:MIN_DP:PL 0/0:546:0:546:0,0,1860

The allele depth is 335 C / 266 T. It looks like the variant should of been called but it is not present in the vcf file. I have attached a pileup at that site.

Here are the commands I was using with GATK 3.4-0. I have followed the best practices pipeline with the exception of duplicate removal because of the way the platform I am using works.

export REGION=15:75643465-75645465
$JAVA -Xmx2048m -jar $GATK -T HaplotypeCaller -R $REF -I $BAM --dbsnp $DBSNP --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -L $REGION -o $GVCF

$JAVA -Xmx2048m -jar $GATK -T GenotypeGVCFs -R $REF -L $REGION -o $OUT/sample.vcf --variant $GVCF --dbsnp $DBSNP

Do you know why this variant was not called heterozygous?

Issue · Github
by Sheila

Issue Number
1112
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
chandrans

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cgillies
    Hi,

    Can you post a bamout screenshot of the site? I am assuming the post is of the original bam file.

    Also, can you check the base qualities and mapping qualities at the site to make sure they are good? The PLs suggest there was evidence for both hom ref and het, but the default is to be called hom ref.

    -Sheila

  • cgilliescgillies Ann ArborMember
    edited July 2015

    @Sheila
    Thank you for your response. Yes, the image I posted was the original bam. I generated a bamout with the following command and posted it:

    $JAVA -Xmx2048m -jar $GATK -T HaplotypeCaller -R $REF -I $BAM --dbsnp $DBSNP --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -L $REGION -o $GVCF --dontUseSoftClippedBases -bamout $OUT/sample.bamout.bam --bamWriterType ALL_POSSIBLE_HAPLOTYPES

    The allele balance is now 332 C / 210 T. It still seems like it should of been called. The mapping quality for the reads in this region is 60. The base quality for the Ts is 25 and the base quality of the Cs is (25-30).

  • cgilliescgillies Ann ArborMember
    edited July 2015

    @Sheila
    Do you have any other suggestions for fixing this? I have found six examples across my cohort of 473 patients where we have a variant confirmed by Sanger sequencing, but HaplotypeCaller did not call the variant at the site. They all appear to be similar to the example I have posted about above.

    Thank you so much for your help.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cgillies
    Hi,

    Can you tell me more about what kind of data you are working with? It look like all the reads start at the exact same position. Are they all duplicate reads? How did you pre-process the data? Did you follow our Best Practices?

    Thanks,
    Sheila

  • cgilliescgillies Ann ArborMember
    edited July 2015

    @Sheila
    Thanks for your help Sheila!

    Can you tell me more about what kind of data you are working with? It look like all the reads start at the exact same position. Are they all duplicate reads?

    Yes, you are correct those reads are duplicates. The data comes from a two-stage amplification. The first amplification happens in the Fluidigm system and the second amplification is on the standard PCR machine. This two step process allows us to efficiently sequence ~ 20 genes with high depth (400x-800x mean depth) for our studies. Please see PMID: 23188109 for more details if you are interested. But the bottom line is the reads that we are producing are going to be duplicates so we cannot perform the remove duplicates step.

    How did you pre-process the data? Did you follow our Best Practices?

    Yes, I followed the best practices document, step by step. I did not remove duplicates because of our platform. The workflow is as follows:

    (1) Trim adapters using cutadapt
    (2) Perform alignment using bwa-mem
    (3) Sort and index bam with picard tools
    (4) Use RealignerTargetCreator with IndelRealigner for indel realignment
    (5) Use BaseRecalibrator and then PrintReads to recalibrate bases
    (6) Use HaplotypeCaller --emitRefConfidence GVCF for each sample's BAM
    (7) Use CombineGVCFs to merge batches of ~50 samples
    (8) Use GenotypeGVCFs on the combined GVCFs.

    Would you like me to post the BAM subsetted over the problematic region for you to look at?

    Thanks again

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cgillies
    Hi,

    Yes, I am not sure what is going on here. Can you submit a bug report? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • DaliaDalia Member

    Hi, we also found that HaplotypeCaller was not working correctly for PCR amplicon sequencing. It seems that due to PCR errors, it sometimes generates very large numbers of haplotypes, and then some obvious good quality variants get lost. We stick to using UnifiedGenotyper for PCR amplicon data. Also, there would be little advantage in using HaplotypeCaller anyway, because we look at separate amplicons, so assembly of reads into haplotypes is mostly irrelevant.

  • cgilliescgillies Ann ArborMember

    @Sheila
    I have put an archive named cgillies_HaplotypeCallerNotCallingSangerConfirmedVariant.zip with a snippet of problematic bam file. Please see commands.txt in the archive. Thank you for your help!

    @Dalia
    Thanks for posting this. Do you have any other recommended settings when using the UnifiedGenotyper for PCR amplicon data? Could you post the command you are using? Do you recommend calling SNPs and Indels separately with the UnifiedGenotyper?

  • DaliaDalia Member

    @cgillies
    We use default settings for UnifiedGenotyper, calling SNVs and indels together. The only difference is that we disable downsampling for amplicon sequencing (increase -dcov value so that no downsampling is performed), because downsampling was causing problems for PCR amplicon calling. You do get a lot of false positive calls, so you need to do hard filtering afterwards. We found ABHet metric the most useful for discrimination of true and false positives, when we plot ABHet distribution, we can clearly see a mixture of two distributions which allows us to decide on filtering threshold.

  • cgilliescgillies Ann ArborMember

    @Dalia

    Thanks for the information. I used the -dt NONE parameter to turn off downsampling for the UnifiedGenotyper. Yes, I agree that the allele balance heterozygous metric is very informative.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cgillies
    Hi again,

    Can you tell me if any of the samples in your cohort have the variant called?

    Thanks,
    Sheila

  • cgilliescgillies Ann ArborMember

    @Sheila
    Hi,

    Yes, one sample has the variant. It looks like changing the kmerSize parameter seems to fix the problem. Will this affect the sensitivity of calling other variants?

    Thanks,

    Chris

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cgillies
    Hi Chris,

    I am happy the kmer size helps. The change will affect sensitivity. However, it seems like for amplicon sequencing, the smaller kmer size helps. I am putting in a ticket to have the difference looked into. Until then, can you tell me if the smaller kmer size gives more variants in the VCF than with a larger kmer size? Do most of the missed variants with the larger kmer size appear at the beginning of reads?

    Thanks,
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cgillies
    Hi again Chris,

    It seems like the reads in the region could be mapped to other regions on chromosome 15, and some of those regions are very close to the region of interest. I suspect those reads get mapped differently when using different kmer sizes. Unfortunately, at the Broad, we do not do much testing with RADSeq data, so we cannot comment on how accurate the results will be. You can see from other users, they have found alternate methods.

    -Sheila

  • cgilliescgillies Ann ArborMember

    @Sheila
    Thanks for your help. I decided to use the UnifiedGenotyper for my use case. There are many false positives, but I have found that training an SVM to predict positive and negative sites removes the majority of the false sites. So I think my results are pretty accurate now in terms of Sanger confirmed sites. Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cgillies
    Hi Chris,

    What annotations are you using in your SVM? We are curious what could be better than VQSR! :smirk:

    -Sheila

  • daniela.roblesdaniela.robles Cambridge, UKMember

    Hi all,

    I hope you're well. I'm happy I found this post as it is very relevant to my work!

    In the project I am currently working on, we also have data generated from amplicon sequencing and thus we cannot perform "normal" duplicate marking or filtering steps like end-distance bias or strand-bias on called variants, because these sites are generally covered by reads in only one direction (These data were generated by the Fluidigm unidirectional sequencing protocol).

    Before launching variant calling on the 9000 samples we have sequenced, I wanted to make sure I understand the called variants. I was having the same problem cgillies had, in which one clear SNP (annotated in exac, dbSNP, etc) was not being called by GATK HC even though I could clearly see it in the bamout. Then I added -kmerSize 15 to the command as suggested by Sheila and indeed, the variant is now called.

    21 43863521 . A G 6902.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=0.778;ClippingRankSum=-0.352;DP=237;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=70.00;MQRankSum=0.440;QD=29.13;ReadPosRankSum=1.145;SOR=9.541 GT:AD:DP:GQ:PL 1/1:1,235:236:99:6931,678,0

    I wanted to ask your advice on adding this -kmerSize 15 option to the command - I don't think we can use UnifiedGenotyper for this project because it's a case/control analysis and thus we need gVCFs, something I don't think UG can generate (If I'm wrong please let me know). We are interested in capturing all possible variants, and are willing to Sanger-validate them afterwards so a few false positives shouldn't matter too much, we're more interested in sensitivity. Do you have any advice for this?

    Please let me know if you want me to do some tests with the data, or if you require more information.

    Thanks very much!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @daniela.robles
    Hi,

    I did see this post, so I deleted your other post. We prefer users to not post several new threads if there is already a relevant thread posted. I try to check each and every thread for any new updates :smile:

    We are very interested in improving Haplotype Caller for amplicon sequencing. Right now, the GATK tools are tested on Illumina 30X data. If you can send me some test data with variants you know are true but are not called with Haplotype Caller, I can try to play around with some parameters and talk with the team about what to do. Unfortunately, after talking to Chris @cgillies it turns out he got the best results using Unified Genotyper. Unified Genotyper does not output GVCFs (and we do not recommend using it anymore!). However, if you can upload a "bug report", I will see what I can do. http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • daniela.roblesdaniela.robles Cambridge, UKMember

    @Sheila

    Ah, thanks so much Sheila! I wasn't sure what to do as it was my first post here.

    Yes - we will compile a list of these variants (we're intersecting GATK HC with GATK UG and Samtools mpileup) and will confirm them by some other method. Then I can send you data and a list of the variants that validated. At the moment I am assuming that variants annotated in ExAc are real, although we just found a case that even though it has a high annotated AF might be a systematic calling error. So we will investigate this further and then update :smile:

    Thanks very much for your answer. I will submit a bug report and let you know what we find.

  • pachewychomppachewychomp OregonMember

    I'm very curious whether there are any updates on the amplicon variant calling front? I have honed my settings in pretty well, for the most part, but there are still some edge cases I would love to take care of!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @pachewychomp
    Hi,

    There has not been much effort on this front. Did you mark duplicates? If you have some test cases to share, we would love to have a look.

    -Sheila

  • pachewychomppachewychomp OregonMember

    Hi @Sheila,

    No, I do not mark duplicates. I have messed around with marking duplicates on this data type, but it leads to a call set that is incorrect, and vastly underestimates read depth, as you might expect.

    The main key to success has been to take minPruning down to 1. This makes sense considering the heavily duplicated regions we are dealing with.

    I do not use either BQSR or IndelRealigner. I did notice with other variant callers that IndelRealigner seemed helpful, but I'm following the suggestions that IndelRealigner is not useful for HaploytpeCaller/MuTect2 any longer. I'm doing further testing to determine how BQSR will effect this workflow.

    Also, kmerSize did change my results, strangely. Certain variants were being called, but at very low allele fractions. Looking at bamout files it was clear that local realignment was incorrectly placing fragments and thus skewing the frequency of the call. I gave kmerSize the value 13 and this cleared up, giving me the correct call with correct allele fraction that I was looking for. I'm doing final testing to ensure I have not caused other variant calls to now be miscalled.

    The other thing I've had to do is break the triallelic_site calls out of the VCF and reprocess just those sites a second time. The tool I wrote simply find the coordinates for all triallelic_site calls then makes an interval_list file out of them, and then sends that in as a -L option to a second round of variant calling. This has worked quite well, and has resolved the issue where an alternate allele at very low VAF is being placed in the callset as opposed to the call that seems to be more obvious, and is of higher VAF. I think this could be a big problem for people that are performing very high depth amplicon sequencing.

    If you can tell me which files I can provide that would help you look at some of these things I will try to get those together soon. I will need to make extra sure I am not divulging any PHI, so it will take some time.

    Cheers!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @pachewychomp
    Hi,

    Thanks for the information and sharing! Yes, we would love to improve the variant calling on amplicon data, so if you can share test cases with us, that will be very helpful. Instructions for upload are [here](https://software.broadinstitute.org/gatk/documentation/article?id=1894.

    -Sheila

    P.S. This article under "7. Try fiddling with graph arguments (ADVANCED)" explains the extra argument -allowNonUniqueKmersInRef which you may try out as well.

Sign In or Register to comment.