format fields and sample entries in VCF files

eflynn90, Washington DC, Member

I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCFs can generate files in which some sample entries have fewer fields than the FORMAT column.

For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCFs:
1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 ..............

The second-to-last sample entry is ./.:.:1 (3 fields), while the FORMAT entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the programs I use will not parse VCF files when 1 < # sample fields < # FORMAT fields. If sample entries were extended by ":." for every missing FORMAT field (unless only . or ./. was present in the sample column), the file would become parsable.

It's not a hard fix to do manually, but it might be nice to add the functionality to the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields.
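
For anyone hitting the same parser problem, here is a rough sketch of the manual fix in Python. It is not part of GATK; the function names `pad_sample`, `pad_line` and `pad_vcf` are made up for illustration. It assumes a standard tab-delimited VCF where FORMAT is column 9 and the samples start at column 10, and it leaves single-field entries such as . or ./. untouched, as described above.

```python
from typing import Iterable, Iterator


def pad_sample(sample: str, n_format: int) -> str:
    """Pad one sample entry with '.' until it has n_format fields.

    Entries with a single field (e.g. '.' or './.') are returned unchanged.
    """
    fields = sample.split(":")
    if len(fields) == 1:
        return sample
    if len(fields) < n_format:
        fields += ["."] * (n_format - len(fields))
    return ":".join(fields)


def pad_line(line: str) -> str:
    """Pad every sample entry on one VCF data line; pass header lines through."""
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    # Assumption: fewer than 10 columns means there are no sample columns.
    if len(cols) < 10:
        return line
    n_format = len(cols[8].split(":"))
    cols[9:] = [pad_sample(s, n_format) for s in cols[9:]]
    return "\t".join(cols) + newline


def pad_vcf(lines: Iterable[str]) -> Iterator[str]:
    """Apply pad_line to each line of an uncompressed VCF."""
    for line in lines:
        yield pad_line(line)
```

To use it as a filter on an uncompressed VCF, something like `sys.stdout.writelines(pad_vcf(sys.stdin))` (after `import sys`) should do. It only appends placeholders and never touches existing values, so the ./.:.:1 entry above would become ./.:.:1:.:.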

Answers

  • eflynn90, Washington DC, Member

    Thanks! Please let me know if it's added as an option.
