This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!
format fields and sample entries in VCF files
I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCF can generate files where some samples have fewer fields than the format column.
For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCF:
1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 ..............
The second to last sample entry is ./.:.:1 (3 fields), while the format entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the software that I use will not parse VCF files when 1 < # sample fields < # format fields. If sample entries were extended by ":." for every empty FORMAT field (unless only . or ./. was present in sample column), it would make the file parsable.
It's not too hard of a manual fix, but it might be nice to add the functionality into the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields.