HaplotypeCaller: SNPs with three genotypes have higher missing rates

Dear GATK team,

I called SNPs from 150 samples of WGS data on a non-model species (coral). The reference is a draft genome of 500 Mb, and each sample has roughly 15 M paired-end reads. The species is diploid.

I ran the GATK pipeline following the best practices and called genotypes using HaplotypeCaller. Since I needed to speed up calculations and be as conservative as possible, I set the --minPruning flag to 10. I then kept only bi-allelic SNPs.

Now, I have noticed that the missing rate per SNP is generally higher for SNPs with three genotypes (the most frequent case) than for those with two genotypes.

In other words, if I filter SNPs by missing rate, I end up with a genotype matrix in which most of the SNPs have only two genotypes (one homozygote plus the heterozygote).
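For concreteness, the filtering I am describing can be sketched like this (a minimal Python illustration with made-up genotype strings, not actual GATK output):

```python
# Minimal sketch (hypothetical data, not GATK output): classify a bi-allelic
# SNP by how many distinct genotypes are observed, and compute its missing rate.
def classify_snp(genotypes):
    """genotypes: list of calls like '0/0', '0/1', '1/1', or './.' (missing)."""
    called = [g for g in genotypes if g != "./."]
    n_genotype_classes = len(set(called))
    missing_rate = 1 - len(called) / len(genotypes)
    return n_genotype_classes, missing_rate

# A SNP showing all three genotypes, with one sample missing:
print(classify_snp(["0/0", "0/1", "1/1", "./."]))  # -> (3, 0.25)
```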

Any idea what could be the cause? Is it possible that a --minPruning value of 10 systematically creates more missing calls at SNPs with three genotypes (particularly for homozygotes)?

Thank you in advance.




  • oselm Member

    I think I found what the problem is, but I am not sure how to correct it.

    I ran HaplotypeCaller using the following command:

    java -jar GenomeAnalysisTK.jar -nct 18 -T HaplotypeCaller -R reference.fasta -I bam/batch1SAMPLE1.bam -o vcf/batch1SAMPLE1.g.vcf --minPruning 10 -ERC GVCF

    I then genotyped all the samples from this batch using:

    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant vcf/batch1SAMPLE1.g.vcf --variant vcf/batch1SAMPLE2.g.vcf --variant vcf/batch1SAMPLE3.g.vcf --variant vcf/batch1SAMPLE4.g.vcf --variant vcf/batch1SAMPLE5.g.vcf --variant vcf/batch1SAMPLE6.g.vcf -o batch1.vcf

    Now, if I look at the joint VCF (batch1.vcf), I see this variant (note: SAMPLE1 is the first genotype column):

    CONTIG1 287 . A AT 196.88 . AC=2;AF=0.200;AN=10;DP=23;ExcessHet=0.2482;FS=0.000;MLEAC=2;MLEAF=0.200;MQ=31.87;QD=32.81;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL ./.:7,0:7:.:.:.:0,0,0 0/0:4,0:4:12:.:.:0,12,124 0/0:1,0:1:3:.:.:0,3,18 1/1:0,6:6:18:1|1:287_A_AT:243,18,0 0/0:2,0:2:6:.:.:0,6,81 ./.:0,0:0:.:.:.:0,0,0
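    Splitting that record by its FORMAT field makes the pattern easier to see. The sketch below just parses the line above as whitespace-separated text (a quick illustration, not a real VCF parser):

```python
# Parse the joint-VCF record above (copied verbatim, whitespace-separated here)
# to list each sample's genotype, depth, and likelihoods.
record = ("CONTIG1 287 . A AT 196.88 . "
          "AC=2;AF=0.200;AN=10;DP=23;ExcessHet=0.2482;FS=0.000;"
          "MLEAC=2;MLEAF=0.200;MQ=31.87;QD=32.81;SOR=0.693 "
          "GT:AD:DP:GQ:PGT:PID:PL "
          "./.:7,0:7:.:.:.:0,0,0 0/0:4,0:4:12:.:.:0,12,124 "
          "0/0:1,0:1:3:.:.:0,3,18 1/1:0,6:6:18:1|1:287_A_AT:243,18,0 "
          "0/0:2,0:2:6:.:.:0,6,81 ./.:0,0:0:.:.:.:0,0,0")
cols = record.split()
keys = cols[8].split(":")          # FORMAT: GT, AD, DP, GQ, PGT, PID, PL
for i, sample in enumerate(cols[9:], start=1):
    f = dict(zip(keys, sample.split(":")))
    print(f"SAMPLE{i}: GT={f['GT']} DP={f['DP']} PL={f['PL']}")
```

    Note that SAMPLE1 is reported as ./. despite DP=7: its likelihoods are PL=0,0,0, i.e. uninformative, which is quite different from SAMPLE6, where the depth itself is 0.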

    In the g.vcf for SAMPLE1, the calls covering this region are:

    CONTIG1 273 . C <NON_REF> . . END=278 GT:DP:GQ:MIN_DP:PL 0/0:7:6:7:0,6,90
    CONTIG1 279 . G <NON_REF> . . END=280 GT:DP:GQ:MIN_DP:PL 0/0:7:3:7:0,3,45
    CONTIG1 281 . A <NON_REF> . . END=288 GT:DP:GQ:MIN_DP:PL 0/0:7:0:7:0,0,0
    CONTIG1 289 . T <NON_REF> . . END=290 GT:DP:GQ:MIN_DP:PL 0/0:7:15:7:0,15,225
    CONTIG1 291 . A <NON_REF> . . END=294 GT:DP:GQ:MIN_DP:PL 0/0:6:0:6:0,0,0
    CONTIG1 295 . A <NON_REF> . . END=300 GT:DP:GQ:MIN_DP:PL 0/0:5:15:5:0,15,186

    If I understand this correctly, GenotypeGVCFs assigned the genotype at position 287 based on the reference block covering it (the one starting at position 281), which has low confidence (GQ = 0). I don't understand why it did not use the surrounding blocks (for instance the one starting at position 289, which is actually closer).
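    To check which block applies, here is a small sketch (positions, GQs, and PLs transcribed from the g.vcf records above) that looks up the reference block covering a given position:

```python
# The gVCF reference blocks above, as (start, end, GQ, PL) tuples.
blocks = [
    (273, 278, 6,  (0, 6, 90)),
    (279, 280, 3,  (0, 3, 45)),
    (281, 288, 0,  (0, 0, 0)),
    (289, 290, 15, (0, 15, 225)),
    (291, 294, 0,  (0, 0, 0)),
    (295, 300, 15, (0, 15, 186)),
]

def block_covering(pos):
    """Return the (start, end, GQ, PL) block containing pos, if any."""
    for start, end, gq, pl in blocks:
        if start <= pos <= end:
            return start, end, gq, pl

print(block_covering(287))  # -> (281, 288, 0, (0, 0, 0))
```

    So position 287 falls squarely inside the 281-288 block with GQ = 0 and PL = 0,0,0; the neighboring high-confidence block only starts at 289.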

    Because of this, the genotype is considered low quality and set to missing. The same situation happens for many 0/0 genotypes, which are then removed when applying missingness filters. Eventually, this leaves very few SNPs with all three genotypes.

    My question: any idea what the solution could be?

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Here is a quote from @Geraldine_VdAuwera in an earlier thread on a similar issue.

    "Your main problem here is that you're hitting the low-coverage limitations of the algorithms. These tools are designed for high-throughput sequencing data, so they expect to see many short(ish) reads rather than a couple of long ones. The probability models are tuned accordingly, so when they see just one or two observations, they tend to consider that inconclusive (hence the many no-calls) unless the observation quality is particularly good.

    In HC there is the additional constraint at the reassembly step of how many reads must support a variant allele for it to be retained as a node in the graph. That is why you are seeing the most improvements with the min pruning and dangling branch tweaks in HC, as those lower the bar for the amount of data that needs to be seen per node in the reassembly graph."

    So, is it possible that one of your genotypes has low coverage that is biasing the variant calls? In other words, does low coverage in one sample reduce the confidence in a true SNP across the overall data set?

    Just a thought. We are about to go on break until January 3, but please feel free to respond.

  • oselm Member

    @AdelaideR said:

    So, is it possible that one of your genotypes has low coverage that is biasing the variant calls? In other words, does the low coverage in one sample reduce the estimation of a true SNP based on the lower confidence in the overall data set?

    Would one "low coverage" sample bias SNP calling only for some types of variants? I am asking because true '0/0' genotypes seem to be called as './.' more frequently; any idea why? Is it correct to say that the likelihood calculation differs between '0/0' and '0/1' or '1/1' calls? If that is the case, what could make '0/0' calls more likely to be missing?

    Thank you.

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin


    "." is a placeholder for "empty" and this happens when coverage is too low in many cases. So, based on your description of the frequency of "./." that may indicate that the coverage is too low for the amount of information provided. This could be true for a non-model species with an incomplete or fragmented reference.

    Have you taken a look at the BAM files in IGV for these regions, to visualize what might be leading to the no-calls? How good is the reference in these regions?
