
MQRankSum and ReadPosRankSum for SNPs in a haploid organism?

rorycraig, Edinburgh (Member)

Hi,

Apologies if this has been addressed previously. I'm working with genomic resequencing data for a haploid organism, and I have created a VCF file using GenotypeGVCFs from 33 gVCFs created using HaplotypeCaller (using best practices). I did not set the ploidy option when using GenotypeGVCFs as directed. My final aim is to filter for a subset of high-quality SNPs for a downstream analysis.

As I understand it, the MQRankSum and ReadPosRankSum annotations can only be calculated if at least one individual has a heterozygous genotype (both ref and alt alleles) at that position. Around 15% of my SNPs have values for these annotations. Can anyone explain what this means for a haploid? Are these sites good candidates to filter outright?

An example SNP is below:

chromosome_1 3316 . G A 492.42 . AC=3;AF=0.136;AN=22;BaseQRankSum=0.731;ClippingRankSum=1.70;DP=1488;FS=0.000;MLEAC=3;MLEAF=0.136;MQ=31.59;MQRankSum=-5.660e-01;QD=16.98;ReadPosRankSum=0.731;SOR=1.308 GT:AD:DP:GQ:PL 0:86,0:86:99:0,1800 0:89,0:89:99:0,1800 0:147,0:147:99:0,1800 0:51,5:56:99:0,1800 1:1,4:5:80:80,0 0:271,0:271:99:0,1800 1:0,8:8:99:247,0 0:21,1:22:99:0,814 .:0,0 1:5,11:16:99:211,0 0:140,72:212:99:0,1800 0:242,13:255:99:0,1800 0:252,0:252:99:0,1800 .:0,0 .:0,0 .:0,0 0:1,0:1:44:0,44 .:0,0 .:0,0 0:3,0:3:99:0,112 0:1,0:1:39:0,39 .:0,0 .:0,0 .:0,0 .:0,0 0:17,0:17:99:0,360 0:1,0:1:37:0,37 0:4,0:4:99:0,135 0:10,0:10:99:0,270 0:3,0:3:99:0,119 0:3,0:3:99:0,111 0:3,0:3:99:0,119 .:0,0

Cheers,
Rory
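
One way to see why a haploid call set can still carry RankSum annotations is to look at the per-sample AD (allelic depth) field, since several samples in the record above have reads supporting both alleles. The sketch below is only an illustration: `mixed_read_samples` is a hypothetical helper, the record is abridged from the example above (INFO shortened, six of the samples kept), and it assumes whitespace-separated columns at a biallelic site.

```python
def mixed_read_samples(vcf_line):
    """Return (sample_index, ref_reads, alt_reads) for every sample whose
    AD field shows reads supporting both the ref and the alt allele."""
    fields = vcf_line.split()
    format_keys = fields[8].split(":")
    ad_index = format_keys.index("AD")
    mixed = []
    for i, sample in enumerate(fields[9:]):
        values = sample.split(":")
        if len(values) <= ad_index or "." in values[ad_index]:
            continue  # no usable AD for this sample
        ref_reads, alt_reads = (int(x) for x in values[ad_index].split(",")[:2])
        if ref_reads > 0 and alt_reads > 0:
            mixed.append((i, ref_reads, alt_reads))
    return mixed


# Abridged version of the example record (assumption: six samples kept).
record = (
    "chromosome_1 3316 . G A 492.42 . AC=3;AN=22 GT:AD:DP:GQ:PL "
    "0:86,0:86:99:0,1800 0:51,5:56:99:0,1800 1:1,4:5:80:80,0 "
    "1:0,8:8:99:247,0 .:0,0 0:140,72:212:99:0,1800"
)

for index, ref_reads, alt_reads in mixed_read_samples(record):
    print(f"sample {index}: {ref_reads} ref reads, {alt_reads} alt reads")
```

Samples flagged this way are worth inspecting by hand; in a haploid they may point to mismapped reads, paralogous sequence, or a mixed sample rather than a true heterozygote.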

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator admin)

    @rorycraig
    Hi Rory,

    Can you confirm that the RankSum annotations do not appear in the final VCF if you do set ploidy in GenotypeGVCFs? I don't think you should use the RankSum annotations for haploid samples, as the annotation is meant for diploid samples.

    -Sheila

  • rorycraig, Edinburgh (Member)

    Hi Sheila, sorry for the slow reply. I can confirm that these annotations do still appear if ploidy is set to 1 in the GenotypeGVCFs command. Do you have any insight on whether it's best to ignore these annotations, or actively filter them? Thanks!

  • Sheila, Broad Institute (Member, Broadie, Moderator admin)

    @rorycraig
    Hi,

    We don't have any recommendations for using or not using rank sum annotations in haploid (or non-diploid) samples. I think the best thing to do is try both ways (filtering with and without the rank sum annotations) and see which works best for your dataset.

    -Sheila
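
The "try both ways" comparison suggested above could be sketched roughly as follows. This is an illustration, not a recommendation: the thresholds are the generic SNP starting points from GATK's hard-filtering documentation, which were chosen with diploid data in mind, and `parse_info` and `failed_filters` are hypothetical helper names. A site that lacks an annotation is simply not tested on it.

```python
# Assumed generic SNP hard-filter starting points; tune them for your data.
# Each entry maps an annotation to (comparison, threshold); a site fails
# a filter when the comparison is true.
BASE_FILTERS = {
    "QD": ("<", 2.0),
    "FS": (">", 60.0),
    "MQ": ("<", 40.0),
    "SOR": (">", 3.0),
}
RANKSUM_FILTERS = {
    "MQRankSum": ("<", -12.5),
    "ReadPosRankSum": ("<", -8.0),
}


def parse_info(info):
    """Parse 'QD=16.98;FS=0.000' into a dict of floats.
    Flags and non-numeric or multi-valued entries are skipped."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            continue
        try:
            parsed[key] = float(value)
        except ValueError:
            continue
    return parsed


def failed_filters(info, use_ranksum):
    """Return the names of the filters that this INFO string fails."""
    filters = dict(BASE_FILTERS)
    if use_ranksum:
        filters.update(RANKSUM_FILTERS)
    values = parse_info(info)
    failed = []
    for key, (op, threshold) in filters.items():
        if key not in values:
            continue  # annotation absent: do not fail the site on it
        value = values[key]
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            failed.append(key)
    return failed


# Hypothetical INFO string, used only to show the two modes side by side.
info = "QD=20.0;FS=1.2;MQ=60.0;SOR=0.9;ReadPosRankSum=-9.5"
print("with rank sums:   ", failed_filters(info, use_ranksum=True))
print("without rank sums:", failed_filters(info, use_ranksum=False))
```

Running both modes over the whole call set and comparing which sites change status gives a concrete basis for deciding whether the rank sum annotations help or hurt for a given dataset.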

  • qiangfu, Belgium (Member)

    Hi,

    I have a question about ReadPosRankSum in haploids. I went through the definition on the GATK documentation page for ReadPosRankSumTest, as well as the explanation of the statistics behind the score in the Rank Sum Test article.

    However, I could not find any information about sample ploidy that would make this score invalid for haploids. I would really like some clarification on why this score is only valid for diploids.

    Many thanks.

    -Qiang

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator admin)

    Hi @qiangfu

    ReadPosRankSumTest is invalid for haploids because it compares the distributions of relative read positions for alt-supporting and ref-supporting reads, and a haploid sample cannot carry both alleles.
    There is this note in the documentation:

    Caveat

    • The read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
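
The caveat above can be illustrated with a minimal rank sum sketch. This is a plain Mann-Whitney z-score using the normal approximation, without tie or continuity corrections, so it is not GATK's exact implementation; `rank_sum_z` is a hypothetical name and the read positions are made-up values. The point is only that the statistic is undefined when either the alt or the ref group of reads is empty.

```python
import math


def rank_sum_z(alt_values, ref_values):
    """Approximate z-score of a Mann-Whitney rank sum test comparing
    alt-read values against ref-read values. Returns None when either
    group is empty, because there is then nothing to compare."""
    n_alt, n_ref = len(alt_values), len(ref_values)
    if n_alt == 0 or n_ref == 0:
        return None
    # Pool the values, remembering which group each one came from.
    pooled = [(v, "alt") for v in alt_values] + [(v, "ref") for v in ref_values]
    pooled.sort(key=lambda pair: pair[0])
    # Sum the ranks of the alt values; tied values share their average rank.
    alt_rank_sum = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        alt_in_tie = sum(1 for k in range(i, j) if pooled[k][1] == "alt")
        alt_rank_sum += average_rank * alt_in_tie
        i = j
    u = alt_rank_sum - n_alt * (n_alt + 1) / 2
    mean_u = n_alt * n_ref / 2
    sd_u = math.sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12)
    return (u - mean_u) / sd_u


# Made-up read positions: alt alleles sit nearer the read ends than ref.
print(rank_sum_z([2, 3, 5], [40, 45, 50, 55]))  # negative z-score
# Only ref reads, as expected at a clean haploid site: no value.
print(rank_sum_z([], [40, 45, 50, 55]))  # None
```
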

  • qiangfu, Belgium (Member)

    Thanks for the clarification.

    Indeed, a haploid species should in theory show only ref or only alt at a site. I got confused because for viruses, or even bacteria, a sample sometimes contains more than one population, so ref and alt reads appear at the same position. But those reads do not come from the same origin (there are at least two different populations), so the assumption behind the ReadPos bias test no longer applies.

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator admin)

    Hi @qiangfu, feel free to post a follow-up question.
