False positives in variant calls?

neurobryneurobry Jena, GermanyPosts: 5Member

Hello all,

We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.

We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:

scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.

In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous.Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.

A reverse example:

scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.

Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.

I've also uploaded screenshots of IGV if that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not used VQSR, as we did not have a list of high confidence SNPs at the time.

false_hetero.png
936 x 220 - 6K
false_homo.png
1083 x 220 - 4K

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,467Administrator, GATK Developer admin

    Hi there,

    Can you please tell me what version of GATK you are using and post your command line?

    Geraldine Van der Auwera, PhD

  • neurobryneurobry Jena, GermanyPosts: 5Member

    Sorry for the delay - I was off on vacation.

    In response to your question:

    v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02

    Calling: -T UnifiedGenotyper -R genome.th66masked.fa (-I different mapping.bam files) -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 500 -nt 12

    Filtering: -T VariantFiltration -R genome.th66masked.fa -V GRZ_FLI_snp.raw.vcf -o GRZ_FLI_snp.vcf --clusterWindowSize 10 --filterExpression DP<10 || DP>200 --filterName DP --filterExpression QUAL<20 --filterName QUAL

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,467Administrator, GATK Developer admin

    And now I was away and so am responding late -- but we'll get there.

    So, your command lines look fine. The filtering shouldn't be at issue here; it does look like it's the genotype assignment that's not making sense. We had some cases like this before 2.5 but I'm a little surprised this would happen with 2.5. You can always try again with the latest version to make sure it's not something we've fixed already; and consider trying making the calls with HaplotypeCaller instead of UG.

    Geraldine Van der Auwera, PhD

  • neurobryneurobry Jena, GermanyPosts: 5Member

    Hi Geraldine,

    So we've used 2.6.5 with both HC and UG and we see the same behavior. What is strange is that this behavior actually seems to be an artifact from combining analyses. For example in the line:

    scaffold00061 678225 . G T 97.88 PASS AC=2;AF=0.333;AN=6;BaseQRankSum=-2.308;DP=67;Dels=0.00;FS=5.269;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=23.96;MQ0=0;MQRankSum=1.241;QD=3.26;ReadPosRankSum=0.211 GT:AD:DP:GQ:PL 0/1:6,8:14:76:116,0,76 0/1:15,1:16:13:13,0,141 0/0:37,0:37:42:0,42,542 ./.

    This file was produced after combining 4 different analyses (corresponding to the genotypes for each of the 4 samples). Yet in these original vcf files, only the first sample (0/1:6,8) has an entry. So this error seems to be introduced during the combining stage. FYI, our command to combine is essentially the same as for the calling:

    -T UnifiedGenotyper -R genome.th66masked.fa (-I different mapping.bam files) -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 500

    Do you have any suggestions for a workaround?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,467Administrator, GATK Developer admin

    Just to be sure I understand the problem correctly, you're saying that you're combining four VCF files, each containing calls for a different sample, and when you combine them you suddenly get genotypes for sites that aren't called in some samples, is that right? So for the line you posted, you would expect 0/1:6,8:14:76:116,0,76 ./. ./. ./.? If so, can you please post your command line and a few sites flanking that position in all four VCFs?

    Geraldine Van der Auwera, PhD

  • neurobryneurobry Jena, GermanyPosts: 5Member

    Let me see if I can explain it a bit better. We have 4 individuals that we have sequenced DNA for (fishes). We've used GATK to call variants in those 4 fishes, which give us fish1.pass.vcf, fish2.pass.vcf, etc. We call this "individual variant calling".

    So for individual variant calling, our Unified Genotyper command line was:

    -T UnifiedGenotyper -R genome.th66masked.fa -I fish1.bam -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 500 -nt 12 
    -T UnifiedGenotyper -R genome.th66masked.fa -I fish2.bam -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 500 -nt 12 
    -T UnifiedGenotyper -R genome.th66masked.fa -I fish3.bam -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 500 -nt 12
    -T UnifiedGenotyper -R genome.th66masked.fa -I fish4.bam -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 500 -nt 12

    Additionally, we performed a combined analysis using the command line: -T UnifiedGenotyper -R genome.th66masked.fa -I fish1.bam -I fish2.bam -I fish3.bam -I fish4.bam -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 500 -nt 12

    We did NOT use a vcf merge script along the lines of vcf-merge. This is important because we're also interested in sites that are heterozygous in one individual, but homozygous reference in another (to find the sex-determining region in this fish).

    I hope that clarifies the problem that we're having.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,467Administrator, GATK Developer admin

    Oh, got it -- that's completely different from what I understood from your original post.

    I think what's happening here is that you're seeing the effect of single- vs. multisample calling (which is a feature, not a bug):

    When you call the samples individually, anything that looks too iffy is going to be dismissed by the caller as an artifact. However, when you call several samples together, the caller will take into account what it sees in other samples. So if you have a strong call for a variant in one sample, the caller may also accept making the same call in another sample where it would normally have been too marginal on its own. Basically if I see an ALT allele in Fish #1, I will admit the possibility that it is also present in Fish #2, whereas on the evidence of Fish #2 alone I would have dismissed it.

    In your data that's what leads to Fish #2 getting that het call with AD of 15,1. The genotype is debatable (I would want to look at the evidence for the site) but knowing that Fish #1 is clearly het at this site, with that same ALT allele, makes me think that maybe it's worth looking at as opposed to throwing it out entirely. Then you have Fish #3 which is clearly hom-ref; that call seems indisputably good, and it only didn't come out in the individual analysis because you didn't specify that you wanted to see ref calls in the output (which is quite sensible, no problem there). Finally, Fish #4 doesn't have a genotype, probably because there was no useable coverage at that site in that sample.

    Does that make sense?

    Geraldine Van der Auwera, PhD

  • neurobryneurobry Jena, GermanyPosts: 5Member

    It does make sense, and it's something that we will need to figure out the best way to deal with. Part of the problem is that because this is a fish, we have an ancestral gene duplication event which we need to keep in mind. That means that we have a lot of paralogous sequence, and so we really have to worry a lot about false positives. Do you have any suggestions for the best way to filter out false positives due to paralogues? When looking at IGV, we've noticed some heterozygous variant calls are due to reads which contain clusters - in other words, one read will have a variation from the reference in multiple sites, and a different read will have the exact same variation at multiple sites.

    I'm considering just discarding variations which occur within 100bp of each other (this is a highly inbred species, so we expect true variation to occur only in the sex-determining region in males), but since we don't have a lot of experience in variation calling, any tips from your side would be greatly appreciated!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,467Administrator, GATK Developer admin

    Hmm, I see. We don't really have any specific recommendations for dealing with that, as in our view it's more a mapping problem (ensuring the reads are mapped to the right region) than a calling problem. One thing you can do, if these paralogous regions are known and are not of interest to you, is to exclude them from your analysis using the -XL argument. If you don't already have a list of intervals that contain the paralogous regions, it may be worthwhile to invest some effort in putting together such a list.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.