Stats in PhaseByTransmission progress log and output files differ: how to interpret this?

Hi, I have a question regarding PhaseByTransmission output files and the numbers as they appear in the log.

I have a trio for which I did the following:

  • called variants with HaplotypeCaller (GATK 3.5) in GVCF mode
  • joint-genotyped the three samples

I then ran PhaseByTransmission (PBT) on the trio VCF (having updated to GATK 3.6 in the meantime) with various priors and with a Mendelian violations output file. Take the run with a prior of 1.5e-8 as an example. It completed without problems and produced the following log:

INFO 15:10:38,157 PhaseByTransmission - Number of complete trio-genotypes: 6229265
INFO 15:10:38,157 PhaseByTransmission - Number of trio-genotypes containing no call(s): 66600
INFO 15:10:38,158 PhaseByTransmission - Number of trio-genotypes phased: 4815257
INFO 15:10:38,158 PhaseByTransmission - Number of resulting Het/Het/Het trios: 1402162
INFO 15:10:38,158 PhaseByTransmission - Number of remaining single mendelian violations in trios: 1718
INFO 15:10:38,159 PhaseByTransmission - Number of remaining double mendelian violations in trios: 0
INFO 15:10:38,159 PhaseByTransmission - Number of complete pair-genotypes: 0
INFO 15:10:38,159 PhaseByTransmission - Number of pair-genotypes containing no call(s): 0
INFO 15:10:38,159 PhaseByTransmission - Number of pair-genotypes phased: 0
INFO 15:10:38,160 PhaseByTransmission - Number of resulting Het/Het pairs: 0
INFO 15:10:38,160 PhaseByTransmission - Number of remaining mendelian violations in pairs: 0
INFO 15:10:38,160 PhaseByTransmission - Number of genotypes updated: 132226

As mentioned, I also generated a Mendelian violations file. My confusion started when I looked at that file: it lists 37296 variants as Mendelian violations, whereas the PBT log reports only 1718 remaining single Mendelian violations.

My question is: how are the reported Mendelian violations determined, and is there a way to retrieve exactly the ones PBT counts as single Mendelian violations? I am afraid this is not clear to me from the documentation.

PS: I suspected this might be due to filtering, so I tried filtering on TP and DP (TP > 40 and DP > 10), but I still get almost double the number in the log. Could the exact filtering used for the log report be shared?

Thanks!

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    Hi there, sorry for the late response. PBT was contributed by an external developer and we don't have any internal expertise on its inner workings. As with many tools, we have lots of tests covering the results it returns to ensure the accuracy of the output, but none on the contents of the log, so I wouldn't be surprised if the log were somehow incomplete. Unfortunately, it's not something we can devote effort to investigating at this time. You're welcome to look at the code on GitHub, of course.
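For an independent count to compare against both the log and the violations file, the trio genotypes can be classified directly from their GT fields. The following is a minimal sketch and is not taken from the PBT source: the function names (`parse_gt`, `mv_count`, `tally`) are hypothetical, it assumes diploid autosomal genotypes, and its definition of "single" versus "double" violation (the smallest number of child alleles that neither parent assignment can explain) is an assumption that may not match PBT's own accounting.

```python
from collections import Counter


def parse_gt(gt):
    """Return a tuple of two allele indices, or None for a no-call
    or non-diploid genotype. Phased ("|") and unphased ("/") are
    treated the same way."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return tuple(int(a) for a in alleles)


def mv_count(mother, father, child):
    """Return 0, 1 or 2: the smallest number of child alleles that
    cannot be explained by transmission, one from each parent.
    Returns None if any of the three genotypes is a no-call."""
    m, f, c = parse_gt(mother), parse_gt(father), parse_gt(child)
    if m is None or f is None or c is None:
        return None
    best = 2
    # Try both assignments of the child's two alleles to the parents.
    for from_mother, from_father in ((c[0], c[1]), (c[1], c[0])):
        misses = (from_mother not in m) + (from_father not in f)
        best = min(best, misses)
    return best


def tally(trios):
    """Count sites per violation class for an iterable of
    (mother_gt, father_gt, child_gt) string triples."""
    return Counter(mv_count(m, f, c) for m, f, c in trios)


if __name__ == "__main__":
    # Made-up genotypes for illustration only.
    example = [
        ("0/1", "0/1", "0/1"),  # consistent
        ("0/0", "0/0", "0/1"),  # one unexplained allele
        ("0/0", "0/0", "1/1"),  # two unexplained alleles
        ("./.", "0/0", "0/1"),  # no-call in the trio
    ]
    print(tally(example))
```

Running such a tally on the GT fields of both the input VCF and the PBT output VCF would show whether the two counts differ, which may help narrow down where the discrepancy with the log comes from.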