MQRankSum and ReadPosRankSum for SNPs in a haploid organism?

Apologies if this has been addressed previously. I'm working with genomic resequencing data for a haploid organism, and I have created a VCF file using GenotypeGVCFs from 33 gVCFs created using HaplotypeCaller (using best practices). I did not set the ploidy option when using GenotypeGVCFs as directed. My final aim is to filter for a subset of high-quality SNPs for a downstream analysis.

As I understand it, the MQRankSum and ReadPosRankSum annotations can only be calculated if at least one individual has a heterozygous genotype (both ref and alt alleles) at that position. Around 15% of my SNPs have values for these annotations. Can anyone explain what this means for a haploid? Are these sites good candidates to filter outright?

An example SNP is below:

chromosome_1 3316 . G A 492.42 . AC=3;AF=0.136;AN=22;BaseQRankSum=0.731;ClippingRankSum=1.70;DP=1488;FS=0.000;MLEAC=3;MLEAF=0.136;MQ=31.59;MQRankSum=-5.660e-01;QD=16.98;ReadPosRankSum=0.731;SOR=1.308 GT:AD:DP:GQ:PL 0:86,0:86:99:0,1800 0:89,0:89:99:0,1800 0:147,0:147:99:0,1800 0:51,5:56:99:0,1800 1:1,4:5:80:80,0 0:271,0:271:99:0,1800 1:0,8:8:99:247,0 0:21,1:22:99:0,814 .:0,0 1:5,11:16:99:211,0 0:140,72:212:99:0,1800 0:242,13:255:99:0,1800 0:252,0:252:99:0,1800 .:0,0 .:0,0 .:0,0 0:1,0:1:44:0,44 .:0,0 .:0,0 0:3,0:3:99:0,112 0:1,0:1:39:0,39 .:0,0 .:0,0 .:0,0 .:0,0 0:17,0:17:99:0,360 0:1,0:1:37:0,37 0:4,0:4:99:0,135 0:10,0:10:99:0,270 0:3,0:3:99:0,119 0:3,0:3:99:0,111 0:3,0:3:99:0,119 .:0,0
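One way to see which samples could be contributing to the RankSum values is to look at the per-sample AD (allele depth) field. In the record above, several haploid calls still carry reads for both alleles (for example 51,5 and 140,72). The sketch below lists such samples for a single VCF line. The function name `mixed_read_samples` and the shortened example record are illustrative assumptions, and this is only a quick check, not a description of how GATK itself decides when to compute the annotation.

```python
def mixed_read_samples(vcf_line):
    """Return (sample_index, GT, ref_depth, alt_depth) for samples whose AD
    field shows reads supporting both the ref and an alt allele.

    Hypothetical helper for illustration; assumes a standard VCF data line
    with whitespace-separated columns and an AD key in the FORMAT column.
    """
    fields = vcf_line.split()
    format_keys = fields[8].split(":")      # column 9 is FORMAT
    ad_index = format_keys.index("AD")
    mixed = []
    for i, sample in enumerate(fields[9:]):  # sample columns start at column 10
        parts = sample.split(":")
        if len(parts) <= ad_index:
            continue                         # AD missing for this sample
        depths = parts[ad_index].split(",")
        if "." in depths:
            continue                         # AD present but uncalled
        ref_depth = int(depths[0])
        alt_depth = sum(int(d) for d in depths[1:])
        if ref_depth > 0 and alt_depth > 0:
            mixed.append((i, parts[0], ref_depth, alt_depth))
    return mixed


# Shortened version of the record above (INFO trimmed, five samples kept).
example = ("chromosome_1 3316 . G A 492.42 . AC=3 GT:AD:DP:GQ:PL "
           "0:86,0:86:99:0,1800 0:51,5:56:99:0,1800 1:1,4:5:80:80,0 "
           ".:0,0 1:0,8:8:99:247,0")

for index, gt, ref_depth, alt_depth in mixed_read_samples(example):
    print(f"sample {index}: GT={gt} ref={ref_depth} alt={alt_depth}")
```

Samples flagged this way are the ones where a ref-versus-alt comparison is even possible, so they are a reasonable place to start when deciding how to treat these sites.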



    Hi Rory,

    Can you confirm that the RankSum annotations do not appear in the final VCF if you do set ploidy in GenotypeGVCFs? I don't think you should use the RankSum annotations for haploid samples, as the annotation is meant for diploid samples.


    Hi Sheila, sorry for the slow reply. I can confirm that these annotations do still appear if ploidy is set to 1 in the GenotypeGVCFs command. Do you have any insight on whether it's best to ignore these annotations, or actively filter them? Thanks!

    We don't have any recommendations for using or not using rank sum annotations in haploid (or non-diploid) samples. I think the best thing to do is try both ways (filtering with and without the rank sum annotations) and see which works best for your dataset.


    I have a question about ReadPosRankSum in haploids. I went through the definition on the GATK doc page for ReadPosRankSumTest, and also the explanation of the statistics behind the score under Rank Sum Test.

    However, I could not find any information about why the ploidy of a sample would make this score invalid for haploids. I would really like some clarification on why this score is only valid for diploids.

    Many thx.


    Hi @qiangfu

    ReadPosRankSumTest is invalid for haploids because it compares the distributions of relative read positions for alt reads and ref reads, and we can't have both in a haploid sample.
    There is this note in the documentation:


    • The read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
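To illustrate why a mixture of reads is needed, here is a minimal sketch of a rank sum (Mann-Whitney U) z-score using the normal approximation. It is a generic textbook version, not GATK's code: the function name `rank_sum_z` is hypothetical, the inputs stand in for per-read values such as position within the read, and GATK's handling of ties and of which reads are included may differ.

```python
import math


def rank_sum_z(alt_values, ref_values):
    """Normal-approximation z-score for the Mann-Whitney U statistic.

    Returns None when either group is empty, which mirrors the point that
    the test needs reads supporting both alleles. Assumption: no tie
    correction is applied to the variance.
    """
    n_alt, n_ref = len(alt_values), len(ref_values)
    if n_alt == 0 or n_ref == 0:
        return None

    # Pool the values and remember which group each one came from.
    combined = sorted([(v, "alt") for v in alt_values] +
                      [(v, "ref") for v in ref_values])

    # Assign 1-based ranks, giving tied values the average of their ranks.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    rank_sum_alt = sum(r for r, (_, label) in zip(ranks, combined)
                       if label == "alt")
    u_alt = rank_sum_alt - n_alt * (n_alt + 1) / 2
    mean_u = n_alt * n_ref / 2
    sd_u = math.sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12)
    return (u_alt - mean_u) / sd_u


# Made-up positions: alt reads cluster at low values, ref reads at high ones.
print(rank_sum_z([1, 2], [10, 11, 12]))
# With no alt reads there is nothing to compare.
print(rank_sum_z([], [10, 11, 12]))
```

In this sketch the score is negative when the alt values tend to be smaller than the ref values, near zero when the two groups look alike, and undefined when one group has no reads.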

    Thanks for the clarification.

    Indeed, for a haploid species there should theoretically be only ref or only alt. I got confused because for viruses, or even bacteria, there is sometimes more than one population in a sample, so ref and alt both appear at the same position... But those reads do not come from the same origin (there are at least two different populations), so the assumption behind the ReadPos bias test no longer applies.

    Hi @qiangfu Feel free to post a follow up question.

