Contamination estimation while following the Best Practices workflow
I've started following the Best Practices guide to process >80 whole-genome samples (reference-confidence model workflow). During variant calling with HaplotypeCaller, a few of the gVCFs blew up in size (> 1 TB, while the rest were around 75 GB). I suspect that these samples are contaminated. I'd like to:
1) Confirm that the samples that produced large gVCFs are contaminated
2) Check that the rest of the samples are unaffected (higher priority)
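For the first step, shortlisting the suspect samples can be scripted. Below is a minimal sketch that flags gVCFs whose size is far above the median. The directory argument, the `*.g.vcf*` file pattern, and the 5x threshold are assumptions for illustration, not part of the Best Practices workflow, and file size alone says nothing definite about contamination.

```python
from pathlib import Path
from statistics import median


def flag_oversized(sizes, factor=5.0):
    """Return the names whose size exceeds factor * median size.

    `sizes` maps a file (or sample) name to its size in bytes.
    `factor` is an arbitrary threshold chosen for illustration.
    """
    if not sizes:
        return []
    cutoff = factor * median(sizes.values())
    return sorted(name for name, size in sizes.items() if size > cutoff)


def gvcf_sizes(directory, pattern="*.g.vcf*"):
    """Map each gVCF file name in `directory` to its size in bytes.

    The glob pattern is an assumption about how the files are named;
    index files (.idx, .tbi) are skipped.
    """
    return {
        p.name: p.stat().st_size
        for p in Path(directory).glob(pattern)
        if not p.name.endswith((".idx", ".tbi"))
    }


if __name__ == "__main__":
    import sys

    # Directory holding the gVCFs; defaults to the current directory.
    sizes = gvcf_sizes(sys.argv[1] if len(sys.argv) > 1 else ".")
    for name in flag_oversized(sizes):
        print(f"{name}\t{sizes[name] / 1e9:.1f} GB")
```

Using the median rather than the mean keeps the cutoff from being dragged upward by the very files being flagged.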
I've started looking at the GATK-friendly ContEst. It seems to require genotyped VCF files as input. Is it appropriate to run GenotypeGVCFs individually on each gVCF and use the resulting VCF as input for ContEst? Or can I use the "raw" gVCFs?
I've also considered using cleanCall, but it sounds like I'd need to either repeat variant calling with samtools (producing the gVCFs took ages, so I'm trying to avoid this) or somehow do a gVCF -> VCF -> PED conversion.