Why didn't the UG or HC call my SNP? I can see it right there in IGV!

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin
edited June 24 in FAQs

Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs) and in the HaplotypeCaller (for all variants including indels). So, before you post this issue in our support forum, please do a little bit of investigation on your own, as follows.

To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.

Here is a checklist of questions you should ask yourself:

  • How many overlapping deletions are there at the position?

The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.

  • What do the base qualities look like for the non-reference bases?

Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.

  • What do the mapping qualities look like for the reads with the non-reference bases?

A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

  • Are there a lot of alternate alleles?

By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.

  • Are you working with SOLiD data?

SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.

  • Specifically for Haplotype Caller

In addition to the reasons above, Haplotype Caller has another reason why some variants do not get called when it seems obvious in the original bam file.

Haplotype Caller performs a local reassembly of the reads in the active region. Please refer here for more details: http://www.broadinstitute.org/gatk/guide/article?id=4148

This reassembly is important, because when mapping reads to the whole genome, it is easy to make an error. When reassembling reads in a much smaller region, such as the active region, the alignment is more likely to be accurate.

The reads you see in the alignment of the original bam file may suggest a variant should be called. However, due to the realignment, the reads may no longer support the variant. In order to see the new alignment of reads, you can use -bamout argument. You can then compare the aligned reads from the original bam file to the newly aligned reads in the -bamout file.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

image

IGV View of HC Assembly Difference.png
1313 x 782 - 167K
Post edited by Sheila on

Geraldine Van der Auwera, PhD

Comments

  • nikmalnikmal Posts: 23Member

    Thank you for this useful information!

    I have a question however: if I view a BAM file with recalibrated base qualities in IGV - does IGV then report the recalibrated score or the score before recalibration? That is, the "Base phred quality" score that shows in the info box that pops up when you hover over a locus/read.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    IGV will report the recalibrated qualities, because they are written in place of the originals. It is possible to still see the original qualities in the read information; if you set the appropriate flag to keep the original quals when you did the second step of recalibration, they will be written to the OQ tag.

    Geraldine Van der Auwera, PhD

  • alirezakjalirezakj Posts: 51Member

    Hi Geraldine,

    If someone is doing a mutation rate analysis, say in yeast (like exposing the yeast to radiations) to measure the mutation rate. Then different mutations in different locations in each yeast might happen. Let's say, I want to call all of those variants even if they are seen in only one read (but with good mapping quality and good phred score that one can see in IGV by mousing over).

    As far as I know, UG and HC are very sensitive to not miss anything and that they don't just look at the base Q score, MPQ and DP to decide how true a variant is, but they also consider things like HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum etc.

    I know that we shouldn't trust everything we see in IGV, however my question is, is it possible that UG miss calling some of those mutation because they might be seen in only one read but because the coverage for those regions are very high, UG might ignore calling some of them because they are only present in one read? (or some other scenario like that)

    Would you please clarify more on the sensitivity of UG, because no matter how many times you tell people don't trust everything you see in IGV they again tend to say "I can see it right there"!

    This is a question for a lot of people and me!

    Thanks,

  • alirezakjalirezakj Posts: 51Member

    Sorry, in my post "HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum" are annotations and aren't necessarily used for variant calling algorithm.

  • pdexheimerpdexheimer Posts: 372Member, GSA Collaborator ✭✭✭

    Hi @alirezakj -

    I'm not sure that "sensitivity" is the right word for what you're looking for. I suppose it is in the strictest sense, but in practical terms I think it's something else. You don't ever mention what kind of read depths you're using, but with any significant depth the probability of seeing an "alternate allele" in one read at any given locus approaches 1. The Illumina sequencers have been getting better and better, but they've still got a pretty significant error rate.

    UG and HC (and most other variant callers) help to mitigate that error rate by using the expected ploidy of the library in question. So if we're using human data (which is what the GATK was initially developed for), we would expect to see the reads supporting putative variations to be near 50 or 100% of the reads at that site. If the read support varies significantly from those levels, it's less likely to be a real variation. It sounds like in your experiment, you have a whole bunch of yeast cells and you're looking for evidence of variation in any of them - this is a completely different analysis.

    If you haven't already, you may want to look at the ploidy parameter of UG. It won't resolve your genotypes to the cell level, but I would think you could, e.g., try to find variants with 5% penetrance by setting the ploidy to 20.

    Caveat emptor: I've done exactly one UG run with a non-default ploidy - I used 8. It's entirely possible that 20 is way too high and that you'll be swallowed by our dying Sun before the analysis finishes. Or maybe it will finish in two hours. I have no idea. But I think it's your best hope for doing this type of analysis in GATK

  • alirezakjalirezakj Posts: 51Member
    edited February 17

    Thanks @pdexheimer,

    Let me ask it this way, l have some mitochondrial seq with 10,000 x coverage/10,000 mean depth at any given position and of course the -ploidy is 1.

    In my analysis any high quality variant should be called even if it is seen in one read among thousands but with high MPQ and BQS! Is it possible that UG would miss calling it because it is only in one read? Dose UG ignore it, because all the other reads disagree with that one read even though everything in that read is of high quality?

    From what you said, I understand that UG would call such variants only if I increase the ploidy number. Right?

    Post edited by alirezakj on
  • pdexheimerpdexheimer Posts: 372Member, GSA Collaborator ✭✭✭

    One out of ten thousand? I certainly hope so, no matter the quality.

    As is always the case in bioinformatics, there's uncertainty here. Mitochondria are subject to some degree of heteroplasmy, for instance. But the error rate of the sequencer itself is much, much higher than one out of ten thousand - to say nothing of the errors in library prep (PCR polymerases aren't perfect) and alignment. To me, there's absolutely no way you should take a single read out of many (think dozens, not thousands) as significant evidence of variation

  • alirezakjalirezakj Posts: 51Member
    edited February 17

    Thanks @pdexheimer,

    So the answer is that UG would not call it do to the pile of evidence (thousands of reads)! Even though that one read is of vary high quality, right? Due to seq machines error rate the algorithm would not trust that variant, right?

    Post edited by alirezakj on
  • pdexheimerpdexheimer Posts: 372Member, GSA Collaborator ✭✭✭

    Right. One high-quality read is not enough to outweigh the evidence of many mixed-quality reads. The definition of "many" is left as an exercise for the reader, but it ultimately depends on ploidy and error rates

  • alirezakjalirezakj Posts: 51Member
    edited February 17

    In fact, I would never trust such a variant even if it is of high quality. But I just want to show this post to the people who ask it gain and again saying "I can see it right there" :) thanks a lot, I think you closed this thread quit well.

    Post edited by alirezakj on
  • wchenwchen Posts: 13Member

    We are comparing variants called using GATK HaplotypeCaller and MiSeq reporter. It's amplicon-based Illumina Truseq targeted gene panel. Strangely, all variants's PL is 0 (vcf from one sample, no VQSR done) for those matched MiSeq variants:

    CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT chr13,28624294,rs1933437,G,A,1040.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=0.887;ClippingRankSum=-1.822;DB;DP=81;FS=34.541;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.681;QD=12.85;ReadPosRankSum=7.066,GT:AD:DP:GQ:PL,0/1:25,52:77:99:1069,0,387 chr17,7579472,rs1042522,G,C,1719.77,.,AC=2;AF=1.00;AN=2;DB;DP=91;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=18.90,GT:AD:DP:GQ:PL,1/1:0,76:76:99:1748,221,0 chr20,31022959,rs6058694,T,C,1861.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=-4.081;ClippingRankSum=-0.704;DB;DP=134;FS=26.555;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.615;QD=13.89;ReadPosRankSum=8.110,GT:AD:DP:GQ:PL,0/1:28,105:133:99:1890,0,316 chr4,106196951,rs2454206,A,G,431.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=-2.851;ClippingRankSum=-0.036;DB;DP=84;FS=2.199;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.459;QD=5.14;ReadPosRankSum=-0.109,GT:AD:DP:GQ:PL,0/1:35,26:61:99:460,0,715 chr17,7579472,rs1042522,G,C,852.77,.,AC=2;AF=1.00;AN=2;DB;DP=45;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.76;MQ0=0;QD=18.95,GT:AD:DP:GQ:PL,1/1:0,35:35:99:881,104,0

    Also the total reads should be over 300k, but the DP here is so small. I did downsampling to 3000, marked the duplicates but didn't remove them. Any clue why all these happened? How to calculate the variant frequency here? Thanks!

  • pdexheimerpdexheimer Posts: 372Member, GSA Collaborator ✭✭✭

    @wchen‌ -

    PL is by definition relative to the called genotype, so will always be 0 for the called genotype.

    What does the log output of HaplotypeCaller say at the end, in the read summary? I suspect you'll see an absurdly high number of reads discarded because they're duplicates. Marked reads are skipped by HC. Regarding your depth, the duplicate filters play some role in that, but it seems like HaplotypeCaller 3.2 down sampled a bit too aggressively. I'm not sure what version you used, or if that's still the case in 3.3.

    I'm not sure what to tell you about variant frequency, because it depends on which definition you're using. The vcf already contains an allele frequency for each variant, but you only have 1 sample so that's not real interesting. As I understand, you only ran HC over the variants reported by another tool - if you're just looking for how many of them were actually variant here you probably want to use a tool like VariantEval (or maybe GenotypeConcordance). Of course, if you just ran HaplotypeCaller in the default mode, it only output calls at variant sites. VariantEval will also help you calculate things like variants per kilobase, if that's what you're after

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    It is true that HaplotypeCaller does some fairly aggressive downsampling, and unfortunately it seems we have a bug that occurs when you try to override HaplotypeCaller's downsampling defaults. It's a hard one to fix because the downsampling system is rather complex, so in the meantime our recommendation is to not touch the downsampling settings for HaplotypeCaller. Rest assured that the tool downsamples data to a level that gives it all the data it needs to make an informed call.

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 50Member

    UG missed calling a SNP for us on chr17 in a set of 78 WES samples. (It should appear in two of those)

    Using the same data, but trimmed down to just chr17, UG made the call. Is there a explainable reason for this?

    Coverage, Base Qual, and Mapping Qual are all good. There are no additional alternate alleles.

    Full disclosure, I am still using 3.1-1

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    @bwubb Is it a borderline call? Any excess coverage? This could be due to downsampling differences triggered by the different interval lists (which are used to generate the random seed for downsampling).

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 50Member

    @Geraldine_VdAuwera‌ How might I tell if it is borderline? It looks pretty ok to me.

    17 7577121 . G A 1569.73 . ABHet=0.360;ABHom=0.998;AC=2;AF=0.013;AN=156;BaseCounts=74,3,6646,2;BaseQRankSum=-11.619;DP=6729;Dels=0.00;FS=11.934;GC=53.37;HRun=0;HW=0.0;HaplotypeScore=2.5790;InbreedingCoeff=-0.0133;LowMQ=0.0003,0.0009,6729;MLEAC=2;MLEAF=0.013;MQ=59.92;MQ0=2;MQRankSum=-0.894;OND=3.897e-03;PercentNBaseSolid=0.0000;QD=13.42;ReadPosRankSum=-2.719;VariantType=SNP

    The samples: 0/1:0.470:47,53:101:99:0:1298,0,1394 and 0/1:0.250:4,12:16:99:0:329,0,99

    So the second one has pretty low coverage.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    Hmm, given the coverage for those two samples, there's no downsampling going on, so that doesn't explain it. Aside from a filesystem hiccup / memory issue that caused a skipped position, I can't think of why this would happen. And it's hard to test without running the whole thing again. If you can run again on the full cohort, I would suggest trying in allSites mode, and see if something gets output at that position. Or maybe first try on just those two samples and see what happens.

    Unfortunately there's not much else I can do aside from encouraging you to upgrade and use HaplotypeCaller instead of UG.

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 50Member

    Thank you, I will try on the full cohort again.

    We are fairly married to UG. Tumor samples have never worked well with the HC every time we try it out.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    Oh, forgot you were dealing with tumor samples, hmm. Have you ever tried MuTect?

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.