Feedback on approach to create a custom truth set for VQSR

sp580 Germany


I would like to ask for feedback on my approach to constructing a truth set for VQSR, since no such resource exists for my species.

What I am doing is:
1/ call variants following the GATK best practices, with joint calling by GenotypeGVCFs
2/ call variants with another caller (samtools mpileup -> bcftools call)
3/ filter each call set, retaining only sites at which every sample has a depth of at least 10 (DP>=10) and a genotype quality of at least 30 (GQ>=30) in the FORMAT fields
4/ use the retained sites common to both callers as the truth set for VQSR
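
To make steps 3/ and 4/ concrete, here is a minimal Python sketch of the filter and the intersection. It is only an illustration, not the exact commands I ran: the function names (`passing_sites`, `truth_set`) are made up for this post, it assumes both VCFs carry DP and GQ in the per-sample FORMAT column, it treats a missing value as a failure, and it matches sites on the exact (CHROM, POS, REF, ALT) tuple, so it assumes both callers represent a variant identically.

```python
def parse_sample(fmt, sample):
    """Map FORMAT keys to one sample's values, e.g. {'GT': '0/1', 'DP': '14'}."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def to_int(value):
    """Return an int, or None for missing values such as '.' or an absent key."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def site_passes(fields, min_dp=10, min_gq=30):
    """True only if every sample at this site has DP >= min_dp and GQ >= min_gq."""
    fmt = fields[8]
    for sample in fields[9:]:
        values = parse_sample(fmt, sample)
        dp = to_int(values.get("DP"))
        gq = to_int(values.get("GQ"))
        # Assumption: a missing DP or GQ counts as a failed sample.
        if dp is None or gq is None or dp < min_dp or gq < min_gq:
            return False
    return True


def passing_sites(vcf_lines, min_dp=10, min_gq=30):
    """Collect (CHROM, POS, REF, ALT) keys of sites that pass in all samples."""
    sites = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if site_passes(fields, min_dp, min_gq):
            sites.add((fields[0], int(fields[1]), fields[3], fields[4]))
    return sites


def truth_set(gatk_lines, other_lines, min_dp=10, min_gq=30):
    """Sites that pass the filter in both call sets (step 4/)."""
    return (passing_sites(gatk_lines, min_dp, min_gq)
            & passing_sites(other_lines, min_dp, min_gq))


# Toy records with two samples each; the order of FORMAT keys may differ per caller.
gatk = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:DP:GQ\t0/1:12:40\t1/1:15:99",
    "chr1\t200\t.\tC\tT\t50\t.\t.\tGT:DP:GQ\t0/1:8:40\t1/1:15:99",
]
other = [
    "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:GQ:DP\t0/1:35:11\t1/1:60:20",
    "chr1\t300\t.\tG\tA\t50\t.\t.\tGT:GQ:DP\t0/1:35:11\t1/1:60:20",
]
print(truth_set(gatk, other))
```

In this toy example only chr1:100 survives: chr1:200 fails because one sample has DP=8, and chr1:300 is called by only one of the two callers.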

My reasoning is that sites called by two different algorithms, with GQ>=30 and DP>=10 in every sample of the cohort, are very likely to be true variants, so their annotations can be used to learn what a good variant looks like.

I would like to know whether this reasoning makes sense to you and, if so, what you would suggest I change, add, or remove. For example, I am not completely convinced about retaining a site only if all samples meet the minimum GQ and DP; what if only one sample passes the condition?
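
To show what I mean by "all samples" versus "only one sample", here is a small Python sketch in which the rule becomes a tunable fraction of passing samples. The names `passing_fraction`, `keep_site` and `min_fraction` are hypothetical and exist only for this illustration; the per-sample (DP, GQ) pairs are toy values, not real data.

```python
def passing_fraction(samples, min_dp=10, min_gq=30):
    """Fraction of samples with DP >= min_dp and GQ >= min_gq.

    `samples` is a list of (DP, GQ) tuples; None marks a missing value
    and is counted as a failure (an assumption of this sketch).
    """
    if not samples:
        return 0.0
    ok = sum(
        1
        for dp, gq in samples
        if dp is not None and gq is not None and dp >= min_dp and gq >= min_gq
    )
    return ok / len(samples)


def keep_site(samples, min_fraction=1.0, min_dp=10, min_gq=30):
    """Keep the site if at least `min_fraction` of the samples pass.

    min_fraction=1.0 reproduces the "all samples" rule;
    min_fraction=1/len(samples) reproduces "at least one sample".
    """
    return passing_fraction(samples, min_dp, min_gq) >= min_fraction


# Toy site: four samples, one of them with low depth.
site = [(12, 40), (15, 99), (8, 45), (20, 60)]

print(passing_fraction(site))                  # 3 of 4 samples pass
print(keep_site(site, min_fraction=1.0))       # "all samples" rule
print(keep_site(site, min_fraction=0.75))      # intermediate rule
print(keep_site(site, min_fraction=1 / len(site)))  # "at least one sample" rule
```

The "all samples" rule is the strictest choice and drops more sites as the cohort grows, since a single low-depth sample removes the site. The "at least one sample" rule keeps more sites but lets low-confidence genotypes into the truth set. An intermediate fraction sits between the two.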

I greatly appreciate your feedback. Thanks in advance!


