Stats in PhaseByTransmission progress log and output files differ: how to interpret this?
Hi, I have a question regarding PhaseByTransmission output files and the numbers as they appear in the log.
I have a trio for which I did the following:
- called variants with HaplotypeCaller (GATK 3.5) in gVCF mode
- joint genotyped the three samples
On the trio VCF, I ran PBT (having updated to GATK 3.6 in the meantime) with various priors, each time writing a Mendelian violations file. Take the run with a prior of 1.5e-8 as an example. It completed without problems and logged the following:
INFO 15:10:38,157 PhaseByTransmission - Number of complete trio-genotypes: 6229265
INFO 15:10:38,157 PhaseByTransmission - Number of trio-genotypes containing no call(s): 66600
INFO 15:10:38,158 PhaseByTransmission - Number of trio-genotypes phased: 4815257
INFO 15:10:38,158 PhaseByTransmission - Number of resulting Het/Het/Het trios: 1402162
INFO 15:10:38,158 PhaseByTransmission - Number of remaining single mendelian violations in trios: 1718
INFO 15:10:38,159 PhaseByTransmission - Number of remaining double mendelian violations in trios: 0
INFO 15:10:38,159 PhaseByTransmission - Number of complete pair-genotypes: 0
INFO 15:10:38,159 PhaseByTransmission - Number of pair-genotypes containing no call(s): 0
INFO 15:10:38,159 PhaseByTransmission - Number of pair-genotypes phased: 0
INFO 15:10:38,160 PhaseByTransmission - Number of resulting Het/Het pairs: 0
INFO 15:10:38,160 PhaseByTransmission - Number of remaining mendelian violations in pairs: 0
INFO 15:10:38,160 PhaseByTransmission - Number of genotypes updated: 132226
As mentioned, I also generated a Mendelian violations file. My confusion started when I looked at that file: it lists 37296 variants as Mendelian violations, whereas the PBT log reports only 1718 remaining single Mendelian violations.
My question is: how are the reported Mendelian violations determined, and is there a way to retrieve exactly the ones PBT counts as single Mendelian violations? I am afraid this is not clear to me from the documentation.
PS: I suspected this may be due to filtering, so I tried filtering on TP and DP (TP > 40 and DP > 10), but I am still left with almost double the number in the log. Could the exact filtering used for the log counts be shared?
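To make the comparison concrete, here is a minimal sketch of a plain genotype-based Mendelian check combined with the TP/DP thresholds mentioned above. It is only an illustration and not PBT's internal logic: the function names are hypothetical, genotypes are assumed to be diploid VCF-style GT strings, and TP and DP are assumed to have already been read from the VCF as numbers (whether DP means the per-sample or the site-level value is left open here).

```python
from itertools import product


def parse_gt(gt):
    """Split a diploid GT string ('0/1' or '1|0') into two alleles.

    Returns None when the genotype is not diploid or contains a no-call.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return tuple(alleles)


def is_mendelian_violation(mother_gt, father_gt, child_gt):
    """Return True if the child genotype cannot be formed from one maternal
    and one paternal allele, False if it can, and None if any of the three
    genotypes is a no-call.
    """
    mother, father, child = (parse_gt(g) for g in (mother_gt, father_gt, child_gt))
    if mother is None or father is None or child is None:
        return None
    # Every unordered allele pair the parents could transmit.
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) not in possible


def passes_filter(tp, dp, min_tp=40, min_dp=10):
    """Apply the thresholds from the post: TP > 40 and DP > 10 (assumed numeric)."""
    return tp > min_tp and dp > min_dp
```

Counting the sites where `is_mendelian_violation(...)` is true and `passes_filter(...)` holds gives a number from the output files that can be set against the 1718 in the log. Any remaining gap would then point to additional criteria inside PBT that this sketch does not model.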