Too Many Missing 0 Calls in GVCF

I have 5555 patient samples. For each patient I have a VCF file with the 1 and 2 calls, and a GVCF file with just the 0 calls. I used CombineVariants to merge only the VCF files. I then wrote a Python script to loop through the GVCF files and collect the quality 0 calls. Although some of these GVCF files contain up to 400 million 0 genotype calls, the final 0,1,2 matrix I created is surprisingly sparse in 0 calls.

For instance, in ONE sample I examined 3.7 million SNPs of interest, and the combined major allele frequency across all of these SNPs is 27% (that is, I looked at 3.7 million SNPs in one patient, and the major allele is represented at a frequency of only 27%). In short, there are surprisingly few 0 calls in my GVCF files that match up with the 1,2 SNP calls (filtered PASS) in my VCF files over the 5555 patients.
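To make the 27% figure concrete, here is a minimal sketch of how such a per-sample frequency could be computed. It assumes the matrix codes each genotype as the number of alternate alleles (0, 1 or 2), that the major allele is the reference allele, and that a site with no call is stored as `None`. The function name `ref_allele_fraction` is hypothetical. It reports the fraction twice, once over called sites only and once over all sites, because a low value may come from missing entries rather than from real 1 or 2 genotypes.

```python
def ref_allele_fraction(genotypes):
    """Return (fraction over called sites, fraction over all sites).

    genotypes: list of 0, 1, 2 (alt-allele counts) or None for a missing call.
    Assumption: the major allele is the reference allele at every site.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        # No usable genotypes at all for this sample.
        return None, 0.0
    # Each diploid genotype carries (2 - alt count) reference alleles.
    ref_alleles = sum(2 - g for g in called)
    frac_called = ref_alleles / (2 * len(called))
    # Missing sites contribute no reference alleles but still count as sites.
    frac_all = ref_alleles / (2 * len(genotypes))
    return frac_called, frac_all


if __name__ == "__main__":
    # Toy sample: three called sites and one missing site.
    print(ref_allele_fraction([0, 1, 2, None]))
```

If the two numbers differ a lot for a sample, the sparsity is mostly missing data, not an excess of variant genotypes.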

Is there a way to infer the missing calls from the GVCF files? Or would it make a difference if I merged my GVCF files first?
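One possible cause worth checking, as a sketch only: in block-compressed GVCFs, a homozygous-reference record typically covers a range of positions, from `POS` to the `END=` value in the INFO column. A script that matches SNP positions against `POS` alone would then miss every site that falls inside a block but not at its start. The code below assumes single-sample GVCF lines with the standard tab-separated VCF columns, a `GT` field in FORMAT, and non-overlapping reference blocks; the function names `parse_ref_blocks` and `is_hom_ref` are hypothetical.

```python
import bisect


def parse_ref_blocks(gvcf_lines):
    """Map each chromosome to a sorted list of (start, end) hom-ref intervals."""
    blocks = {}
    for line in gvcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, info = fields[0], int(fields[1]), fields[7]
        fmt, sample = fields[8], fields[9]
        gt = sample.split(":")[fmt.split(":").index("GT")]
        if gt not in ("0/0", "0|0"):
            continue  # keep only homozygous-reference records
        end = pos  # a record without END= covers a single position
        for item in info.split(";"):
            if item.startswith("END="):
                end = int(item[4:])
        blocks.setdefault(chrom, []).append((pos, end))
    for chrom in blocks:
        blocks[chrom].sort()
    return blocks


def is_hom_ref(blocks, chrom, pos):
    """True if pos lies inside any hom-ref block on chrom (1-based, inclusive)."""
    intervals = blocks.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2",
        "chr1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=199\tGT:DP:GQ\t0/0:30:60",
        "chr1\t200\t.\tC\tT,<NON_REF>\t50\t.\t.\tGT:DP:GQ\t0/1:30:60",
        "chr1\t201\t.\tG\t<NON_REF>\t.\t.\tEND=300\tGT:DP:GQ\t0/0:25:50",
    ]
    ref_blocks = parse_ref_blocks(example)
    print(is_hom_ref(ref_blocks, "chr1", 150))  # inside the first block
    print(is_hom_ref(ref_blocks, "chr1", 200))  # the variant site itself
```

Looking up each SNP of interest with a range query like this, instead of an exact match on `POS`, would show whether the missing 0 calls are simply hidden inside reference blocks.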



