You expect to see a variant at a specific site, but it's not getting called

Geraldine_VdAuwera (Administrator, GATK Dev)
edited April 26 in Common Problems

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should), you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present; after reassembly, however, we find there is really a large deletion!

[Image: IGV screenshot showing the original bam (top) and the reassembled bamout (bottom)]

2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.
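
To make the distinction concrete, here is a small sketch (not GATK's actual code; the quality threshold and pileup values are made up for illustration) of how the unfiltered DP can overstate the evidence the caller actually uses:

```python
# Sketch: unfiltered depth (DP) vs. the depth the caller actually uses.
# The threshold below is illustrative, not GATK's actual default.
MIN_BASE_QUALITY = 10

# Each pileup entry: (base, phred-scaled base quality)
pileup = [("A", 35), ("A", 32), ("G", 8), ("G", 6), ("G", 9), ("A", 30)]

dp_unfiltered = len(pileup)
usable = [(b, q) for b, q in pileup if q >= MIN_BASE_QUALITY]

# Count usable reads supporting the putative alternate allele "G"
alt_support = sum(1 for b, q in usable if b == "G")
print(dp_unfiltered, len(usable), alt_support)  # 6 reads total, 3 usable, 0 supporting G
```

Here DP would report 6, but every read supporting the apparent variant fails the quality filter, so the caller sees no usable evidence for it.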

3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
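
The two rules above can be sketched like this (illustrative only; `effective_quality` is a made-up helper, not a GATK function):

```python
# Sketch: a base's effective quality is capped by its read's mapping quality,
# and reads with mapping quality 255 ("unknown") are dropped entirely.
def effective_quality(base_qual, map_qual):
    if map_qual == 255:          # mapping quality unavailable
        return None              # read is ignored altogether
    return min(base_qual, map_qual)

print(effective_quality(38, 60))   # 38: well-mapped read, base quality stands
print(effective_quality(38, 5))    # 5: poorly mapped read caps the base
print(effective_quality(38, 255))  # None: read ignored outright
```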

4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This limit can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, so it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.
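
To give a sense of the scaling (a sketch, not GATK code): the number of distinct genotypes that must be evaluated is the number of multisets of size `ploidy` drawn from the alleles, which grows very quickly as alleles are added.

```python
from math import comb

# Sketch: candidate genotypes = multisets of size `ploidy` drawn from
# `n_alleles` alleles (the multiset coefficient).
def n_genotypes(n_alleles, ploidy=2):
    return comb(n_alleles + ploidy - 1, ploidy)

for alts in (1, 2, 5, 10):
    # ref + alts alleles, diploid: 1 alt -> 3 genotypes, 2 -> 6, 5 -> 21, 10 -> 66
    print(alts, n_genotypes(1 + alts))
```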

5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias, which can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site, you are probably seeing false positives.


Geraldine Van der Auwera, PhD

Comments

  • nikmal (Member)

    Thank you for this useful information!

    I have a question however: if I view a BAM file with recalibrated base qualities in IGV - does IGV then report the recalibrated score or the score before recalibration? That is, the "Base phred quality" score that shows in the info box that pops up when you hover over a locus/read.

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    IGV will report the recalibrated qualities, because they are written in place of the originals. You can still see the original qualities in the read information: if you set the appropriate flag to keep the original quals when you ran the second step of recalibration, they will have been written to the OQ tag.


  • alirezakj (Member)

    Hi Geraldine,

    If someone is doing a mutation rate analysis, say in yeast (like exposing the yeast to radiation) to measure the mutation rate, then different mutations might happen at different locations in each yeast cell. Let's say I want to call all of those variants, even if they are seen in only one read (but with the good mapping quality and good phred score that one can see in IGV by mousing over).

    As far as I know, UG and HC are very sensitive, so as not to miss anything, and they don't just look at the base quality score, mapping quality and DP to decide how likely a variant is to be true; they also consider things like HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum etc.

    I know that we shouldn't trust everything we see in IGV. However, my question is: is it possible that UG misses some of those mutations because they are seen in only one read? Since the coverage in those regions is very high, might UG ignore a variant that is present in only one read (or some other scenario like that)?

    Would you please elaborate on the sensitivity of UG? Because no matter how many times you tell people not to trust everything they see in IGV, they still tend to say "I can see it right there"!

    This is a question for a lot of people and me!

    Thanks,

  • alirezakj (Member)

    Sorry, in my post "HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum" are annotations and aren't necessarily used by the variant calling algorithm.

  • pdexheimer (Member, GATK Dev, DSDE Dev)

    Hi @alirezakj -

    I'm not sure that "sensitivity" is the right word for what you're looking for. I suppose it is in the strictest sense, but in practical terms I think it's something else. You don't ever mention what kind of read depths you're using, but with any significant depth the probability of seeing an "alternate allele" in one read at any given locus approaches 1. The Illumina sequencers have been getting better and better, but they've still got a pretty significant error rate.

    UG and HC (and most other variant callers) help to mitigate that error rate by using the expected ploidy of the library in question. So if we're using human data (which is what the GATK was initially developed for), we would expect the reads supporting a putative variant to be near 50% or 100% of the reads at that site. If the read support varies significantly from those levels, it's less likely to be a real variant. It sounds like in your experiment you have a whole bunch of yeast cells and you're looking for evidence of variation in any of them - this is a completely different analysis.

    If you haven't already, you may want to look at the ploidy parameter of UG. It won't resolve your genotypes to the cell level, but I would think you could, e.g., try to find variants with 5% penetrance by setting the ploidy to 20.

    Caveat emptor: I've done exactly one UG run with a non-default ploidy - I used 8. It's entirely possible that 20 is way too high and that you'll be swallowed by our dying Sun before the analysis finishes. Or maybe it will finish in two hours. I have no idea. But I think it's your best hope for doing this type of analysis in GATK
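
The arithmetic behind that ploidy suggestion can be sketched as follows (illustrative Python; `expected_allele_fraction` is a made-up helper, not a GATK function):

```python
# Sketch: with ploidy P, a single variant copy is expected at a 1/P allele
# fraction, which is why raising -ploidy lets the model entertain
# low-fraction variants in a pooled sample.
def expected_allele_fraction(variant_copies, ploidy):
    return variant_copies / ploidy

print(expected_allele_fraction(1, 2))   # 0.5  (ordinary diploid het)
print(expected_allele_fraction(1, 20))  # 0.05 (the 5% case discussed above)
```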

  • alirezakj (Member), edited February 2014

    Thanks @pdexheimer,

    Let me ask it this way: I have some mitochondrial sequencing data with 10,000x coverage (10,000 mean depth at any given position), and of course -ploidy is 1.

    In my analysis, any high-quality variant should be called even if it is seen in one read among thousands, as long as it has high mapping quality and base quality! Is it possible that UG would miss calling it because it is only in one read? Does UG ignore it because all the other reads disagree with that one read, even though everything in that read is of high quality?

    From what you said, I understand that UG would call such variants only if I increase the ploidy number. Right?

  • pdexheimer (Member, GATK Dev, DSDE Dev)

    One out of ten thousand? I certainly hope so, no matter the quality.

    As is always the case in bioinformatics, there's uncertainty here. Mitochondria are subject to some degree of heteroplasmy, for instance. But the error rate of the sequencer itself is much, much higher than one in ten thousand - to say nothing of the errors introduced in library prep (PCR polymerases aren't perfect) and alignment. To me, there's absolutely no way you should take a single read out of many (think dozens, not thousands) as significant evidence of variation.
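
A back-of-the-envelope calculation illustrates the point (assuming, hypothetically, a per-base error rate of about 1e-3, roughly Q30): at 10,000x depth, at least one erroneous read at any given site is essentially guaranteed.

```python
# Sketch: expected number of erroneous reads at a deep site, and the
# probability of seeing at least one. The error rate is an assumption.
error_rate = 1e-3
depth = 10_000

expected_errors = error_rate * depth                 # ~10 erroneous reads
p_at_least_one = 1 - (1 - error_rate) ** depth       # essentially 1

print(expected_errors)   # 10.0
print(p_at_least_one)    # very close to 1
```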

  • alirezakj (Member), edited February 2014

    Thanks @pdexheimer,

    So the answer is that UG would not call it, due to the pile of evidence (thousands of reads), even though that one read is of very high quality, right? Due to the sequencing machines' error rate, the algorithm would not trust that variant, right?

  • pdexheimer (Member, GATK Dev, DSDE Dev)

    Right. One high-quality read is not enough to outweigh the evidence of many mixed-quality reads. The definition of "many" is left as an exercise for the reader, but it ultimately depends on ploidy and error rates.

  • alirezakj (Member), edited February 2014

    In fact, I would never trust such a variant even if it is of high quality. But I just want to show this post to the people who ask it again and again, saying "I can see it right there" :) Thanks a lot, I think you closed this thread quite well.

  • wchen (Member)

    We are comparing variants called using GATK HaplotypeCaller and MiSeq Reporter. It's an amplicon-based Illumina TruSeq targeted gene panel. Strangely, every variant's PL contains a 0 (vcf from one sample, no VQSR done) for the variants matching MiSeq:

    CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT
    chr13,28624294,rs1933437,G,A,1040.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=0.887;ClippingRankSum=-1.822;DB;DP=81;FS=34.541;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.681;QD=12.85;ReadPosRankSum=7.066,GT:AD:DP:GQ:PL,0/1:25,52:77:99:1069,0,387
    chr17,7579472,rs1042522,G,C,1719.77,.,AC=2;AF=1.00;AN=2;DB;DP=91;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=18.90,GT:AD:DP:GQ:PL,1/1:0,76:76:99:1748,221,0
    chr20,31022959,rs6058694,T,C,1861.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=-4.081;ClippingRankSum=-0.704;DB;DP=134;FS=26.555;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.615;QD=13.89;ReadPosRankSum=8.110,GT:AD:DP:GQ:PL,0/1:28,105:133:99:1890,0,316
    chr4,106196951,rs2454206,A,G,431.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=-2.851;ClippingRankSum=-0.036;DB;DP=84;FS=2.199;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.459;QD=5.14;ReadPosRankSum=-0.109,GT:AD:DP:GQ:PL,0/1:35,26:61:99:460,0,715
    chr17,7579472,rs1042522,G,C,852.77,.,AC=2;AF=1.00;AN=2;DB;DP=45;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.76;MQ0=0;QD=18.95,GT:AD:DP:GQ:PL,1/1:0,35:35:99:881,104,0

    Also, the total read count should be over 300k, but the DP here is very small. I downsampled to 3000 and marked the duplicates but didn't remove them. Any clue why all this happened? And how should I calculate the variant frequency here? Thanks!

  • pdexheimer (Member, GATK Dev, DSDE Dev)

    @wchen‌ -

    PL is by definition relative to the called genotype, so will always be 0 for the called genotype.

    What does the log output of HaplotypeCaller say at the end, in the read summary? I suspect you'll see an absurdly high number of reads discarded because they're duplicates; marked reads are skipped by HC. Regarding your depth, the duplicate filters play some role in that, but it seems HaplotypeCaller 3.2 also downsampled a bit too aggressively. I'm not sure which version you used, or whether that's still the case in 3.3.

    I'm not sure what to tell you about variant frequency, because it depends on which definition you're using. The vcf already contains an allele frequency for each variant, but you only have 1 sample, so that's not very interesting. As I understand it, you only ran HC over the variants reported by another tool; if you're just looking for how many of them are actually variant here, you probably want a tool like VariantEval (or maybe GenotypeConcordance). Of course, if you just ran HaplotypeCaller in the default mode, it only outputs calls at variant sites. VariantEval will also help you calculate things like variants per kilobase, if that's what you're after.
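
The PL convention described above (phred-scaled genotype likelihoods normalized so the called, most-likely genotype gets 0) can be sketched like this; the unnormalized values are made up, chosen so the result matches the 1069,0,387 in the first record posted above:

```python
# Sketch: VCF PLs are phred-scaled genotype likelihoods normalized so the
# most likely genotype has PL = 0. Input values here are illustrative.
def normalize_pls(phred_scores):
    best = min(phred_scores.values())
    return {gt: s - best for gt, s in phred_scores.items()}

unnormalized = {"0/0": 1300, "0/1": 231, "1/1": 618}
print(normalize_pls(unnormalized))  # {'0/0': 1069, '0/1': 0, '1/1': 387}
```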

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    It is true that HaplotypeCaller does some fairly aggressive downsampling, and unfortunately it seems we have a bug that occurs when you try to override HaplotypeCaller's downsampling defaults. It's a hard one to fix because the downsampling system is rather complex, so in the meantime our recommendation is to not touch the downsampling settings for HaplotypeCaller. Rest assured that the tool downsamples data to a level that gives it all the data it needs to make an informed call.


  • bwubb (Member)

    UG missed calling a SNP for us on chr17 in a set of 78 WES samples (it should appear in two of those).

    Using the same data, but trimmed down to just chr17, UG made the call. Is there an explainable reason for this?

    Coverage, Base Qual, and Mapping Qual are all good. There are no additional alternate alleles.

    Full disclosure, I am still using 3.1-1

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    @bwubb Is it a borderline call? Any excess coverage? This could be due to downsampling differences triggered by the different interval lists (which are used to generate the random seed for downsampling).


  • bwubb (Member)

    @Geraldine_VdAuwera‌ How might I tell if it is borderline? It looks pretty ok to me.

    17 7577121 . G A 1569.73 . ABHet=0.360;ABHom=0.998;AC=2;AF=0.013;AN=156;BaseCounts=74,3,6646,2;BaseQRankSum=-11.619;DP=6729;Dels=0.00;FS=11.934;GC=53.37;HRun=0;HW=0.0;HaplotypeScore=2.5790;InbreedingCoeff=-0.0133;LowMQ=0.0003,0.0009,6729;MLEAC=2;MLEAF=0.013;MQ=59.92;MQ0=2;MQRankSum=-0.894;OND=3.897e-03;PercentNBaseSolid=0.0000;QD=13.42;ReadPosRankSum=-2.719;VariantType=SNP

    The samples: 0/1:0.470:47,53:101:99:0:1298,0,1394 and 0/1:0.250:4,12:16:99:0:329,0,99

    So the second one has pretty low coverage.

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    Hmm, given the coverage for those two samples, there's no downsampling going on, so that doesn't explain it. Aside from a filesystem hiccup / memory issue that caused a skipped position, I can't think of why this would happen. And it's hard to test without running the whole thing again. If you can run again on the full cohort, I would suggest trying in allSites mode, and see if something gets output at that position. Or maybe first try on just those two samples and see what happens.

    Unfortunately there's not much else I can do aside from encouraging you to upgrade and use HaplotypeCaller instead of UG.


  • bwubb (Member)

    Thank you, I will try on the full cohort again.

    We are fairly married to UG; tumor samples have never worked well with HC whenever we've tried it out.

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    Oh, forgot you were dealing with tumor samples, hmm. Have you ever tried MuTect?


  • wchen (Member), edited November 2014

    @pdexheimer

    sorry to get back late. Thank you for the explanation of PL.

    Yes, you are right - HC 3.2 aggressively deduped the reads; please see the log file:

    INFO 04:28:20,187 MicroScheduler - 15936835 reads were filtered out during the traversal out of approximately 16074010 total reads (99.15%)
    INFO 04:28:20,187 MicroScheduler - -> 15771195 reads (98.12% of total) failing DuplicateReadFilter
    INFO 04:28:20,187 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 04:28:20,188 MicroScheduler - -> 147175 reads (0.92% of total) failing HCMappingQualityFilter
    INFO 04:28:20,188 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 04:28:20,188 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 04:28:20,188 MicroScheduler - -> 18465 reads (0.11% of total) failing NotPrimaryAlignmentFilter
    INFO 04:28:20,189 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

    How can I suppress duplicate removal in GATK? It's a tumor sample without a normal control, so we are interested in the variant frequency relative to the tumor percentage.

    I actually compared all GATK called variants to those called by MiSeq.

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    @wchen, what you're showing is not downsampling, it's read filtering, which is a separate process. Your problem is that almost all your reads are getting filtered out because they are marked as duplicates, as shown in this line:

    15771195 reads (98.12% of total) failing DuplicateReadFilter
    

    What sequencing design/platform was used to generate these data?


  • wchen (Member), edited November 2014

    @Geraldine_VdAuwera
    Illumina TruSeq chemistry, paired-end. It's amplicon-based. How can I run GATK without marking duplicates?
    By the way, what does downsampling do exactly? I set it to -dcov 3000 since we have very high coverage for a tumor sample without a normal match. Thanks!

    Post edited by wchen on
  • Geraldine_VdAuwera (Administrator, GATK Dev)

    Ah, that explains it -- for amplicon sequencing you should skip the MarkDuplicates step.

    Downsampling to coverage reduces the number of reads that the program will use for an analysis, to the amount that is sufficient for the analysis to be empowered, without being slowed down by excess/redundant data. This is done because very high coverage (above several hundred x) is typically unnecessary and counterproductive. The read depth is reduced by randomly throwing out reads until the desired quantity is achieved, so the remaining set is representative of the overall data. Each tool's default downsampling settings are tuned for its particular requirements, so you generally don't need to modify them unless you are trying to do something specific.

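
The downsampling-to-coverage idea described above can be sketched like this (illustrative only, not GATK's actual sampling logic; the target of 3000 echoes the -dcov value discussed in this thread):

```python
import random

# Sketch: randomly keep at most `target` reads at a site, so the retained
# subset stays representative of the full pileup.
def downsample(reads, target, seed=42):
    if len(reads) <= target:
        return list(reads)
    rng = random.Random(seed)          # fixed seed for reproducibility
    return rng.sample(reads, target)   # uniform sample without replacement

site_reads = [f"read{i}" for i in range(10_000)]
kept = downsample(site_reads, 3_000)
print(len(kept))  # 3000
```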

  • wchen (Member), edited November 2014

    @Geraldine_VdAuwera
    I thought GATK wouldn't run without duplicates having been marked by Picard; am I right? If not, how do I make GATK not dedup? Thanks!

    Post edited by wchen on
  • Geraldine_VdAuwera (Administrator, GATK Dev)

    Nope, GATK does not require duplicates to be marked at the programmatic level. We just recommend doing it per the Best Practices because for most experimental designs it's the right thing to do, but amplicon sequencing is one notable exception.


  • wchen (Member)

    I tried not marking the duplicates, just added the read group, then ran the indel realigner in GATK 3.3, but got this error:
    Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/gatk/engine/CommandLineGATK : Unsupported major.minor version 51.0

    java -jar -Xmx4g .../GenomeAnalysisTK-3.3.0/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -I addrg.bam -known 1000G_phase1.indels.hg19.vcf -o realigner_3.3.intervals

    What caused this error? Was it related to skipping MarkDuplicates? Thanks!

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    @wchen, This error occurs when you run with the wrong Java version. GATK 3.3 uses Java 1.7. Perhaps you are running on a different machine that is not set up the same way, or you just upgraded your GATK version?


  • wchen (Member)

    @Geraldine_VdAuwera

    You are right. The wrong java version was called on a different machine, also the GATK 3.3 was newly installed. Thanks!

  • raphael123 (Montreal, Member)

    Hi Geraldine! I am using GenomeAnalysisTK-3.2-2 on 4 different samples:

    $gatk -T UnifiedGenotyper -R $hg19 -I recal.bam --dbsnp $dbsnp -o raw.vcf

    I took one case where, in 2 of the four samples, an obvious SNP is not being called at all:

    It s a real problem.

    I'll add a question: how can I call ALL the variants? I would prefer to do the filtering myself, using my own assumptions.

    Thanks a lot for your help.

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    Try using our Best Practices calling workflow which involves the HaplotypeCaller instead of UnifiedGenotyper. If that doesn't work, then we can try to help you. But we can't help you use an older, inferior tool.


  • raphael123 (Montreal, Member)

    Sorry, I implemented both, but I have dyslexia; I felt stupid, and you saved me!

  • Geraldine_VdAuwera (Administrator, GATK Dev)

    @raphael123 Don't worry about it, but remember that we can only help you if we have all the information we need. It is better to be overly descriptive than not descriptive enough. In a case like this, you can tell us everything you tried, step by step, and show the results (posting the vcf record is usually very helpful).


  • mb47 (Member)

    Hi. I have just tried the HaplotypeCaller on a new dataset, and it is not calling any variants in a mapping that should have plenty of them (and visual inspection seems to support that). Base qualities at the SNP locations all seem healthy. The main suspect seems to be MAPQ. The mapping has been produced with BWA, and for some reason it has decided to give ALL my reads MAPQ scores of 6 or less. I am mapping barley exome reads to the barley genome, which is a) huge, b) fragmented and c) very incomplete, so that might be causing BWA to freak out. I have set the -mmq parameter to 0 for testing, and the HaplotypeCaller log output tells me that it is no longer filtering out any reads now. Could you please clarify for me your point 3 above -- does the local reassembly of reads still take into account the reads' MAPQ scores even if -mmq is set to zero?

  • mb47 (Member)

    P.S. I forgot to mention that this is a single sample which is almost completely homozygous but has plenty of differences with the reference sample, i.e. the SNPs I can see visually are all homozygous for the alternate allele. Could this be part of the problem? Does the HaplotypeCaller not call SNPs like these by default? I am not using any optional command line parameters apart from -mmq.
