Genotypes in the combined VCF generated by CombineVariants differ from those in the original VCFs
Hi GATK team,
I am generating a combined VCF from 150+ VCFs to build a cohort of sorts. The purpose is to calculate the cohort frequency of each variant. However, I found that the genotypes in the combined VCF do not match the originals. Here is my command line:
java -jar /GATK/GenomeAnalysisTK-2.7-4/GenomeAnalysisTK.jar -R refernce.fasta \
-T CombineVariants \
--variant sample1.vcf \
--variant sample2.vcf \
Here is an example of one variant position in the combined VCF. The record is very long, so I have only included the relevant columns here.
1 22082967 rs35545280 CAAA CAA,C,CA,CAAAA 76.73 PASS AC=109,6,104,6;AF=0.368,0.020,0.351,0.020;AN=296;DB;DP=9869;GC=48.13;MQ0=0;PercentNBaseSolid=0.0000;RU=A;STR;set=filterInvariant-filterInvariant2-filterInvariant3… GT:DP:GQ
In this record, sample 1 has this variant, and its genotype is shown as "0/3:46:99",
but in sample1.vcf it is listed as
1 22082967 rs35545280 CA C 88.73 Low_Confidence AC=2;AF=1.00;AN=2;BaseCounts=0,76,0,0;BaseQRankSum=1.905;DB;DP=76;FS=0.000;GC=48.13;HaplotypeScore=195.0572;IndelType=DEL.NOVEL_2.;LowMQ=0.0000,0.0000,76;MLEAC=1;MLEAF=0.500;MQ=68.40;MQ0=0;MQRankSum=0.423;PercentNBaseSolid=0.0000;QD=1.17;RPA=20,19;RU=A;ReadPosRankSum=-0.741;STR;set=FilteredInAll GT:AD:DP:GQ:PL 1/1:0,17:76:19:587,19,0
As you can see, the genotype for sample 1 is 0/3 in the combined VCF, but in the original VCF it is 1/1, which is homozygous. So when I calculate the cohort frequency, I am not sure which genotype to use for this variant in sample 1.
To give you a better idea, here is another sample with the same variant, as it appears in the combined VCF and in sample2.vcf.
In the combined VCF, sample 2 is shown as "0/3:91:99".
In sample2.vcf, the record is:
1 22082967 rs35545280 CAA C,CA 549.19 PASS AC=1,1;AF=0.500,0.500;AN=2;BaseCounts=0,105,0,0;BaseQRankSum=-1.897;DB;DP=105;FS=0.000;GC=48.13;HaplotypeScore=238.8742;IndelType=MULTIALLELIC_INDEL;LowMQ=0.0000,0.0000,105;MLEAC=1,1;MLEAF=0.500,0.500;MQ=68.78;MQ0=0;MQRankSum=-0.992;PercentNBaseSolid=0.0000;QD=5.23;RPA=20,18,19;RU=A;ReadPosRankSum=-1.344;STR;set=variant2 GT:AD:DP:GQ:PL 1/2:0,11,25:105:99:1363,307,775,501,0,585
Here the genotype is 1/2, but in the combined VCF it is shown as "0/3".
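To make the comparison concrete, here is a minimal Python sketch of how I would expect the allele indices to line up between the per-sample records and the combined record. It assumes that the combined record simply pads every allele with the extra bases of the longer REF (CA becomes CAAA, so C becomes CAA, and so on). The function name remap_genotype is my own, for illustration only; it is not part of GATK and I do not know whether CombineVariants works this way internally.

```python
def remap_genotype(gt, ref, alts, combined_ref, combined_alts):
    """Translate GT allele indices from a per-sample record into the allele
    index space of a combined record.

    Assumption (mine, not GATK's documented behaviour): the combined REF is
    the sample REF plus some extra trailing bases, and every sample allele
    is padded with those same bases in the combined record.
    """
    if not combined_ref.startswith(ref):
        raise ValueError("combined REF does not extend the sample REF")
    suffix = combined_ref[len(ref):]

    sample_alleles = [ref] + list(alts)
    combined_alleles = [combined_ref] + list(combined_alts)

    sep = "|" if "|" in gt else "/"
    remapped = []
    for idx in gt.split(sep):
        if idx == ".":
            # Keep no-calls as they are.
            remapped.append(idx)
        else:
            padded = sample_alleles[int(idx)] + suffix
            # Raises ValueError if the padded allele is absent from the combined record.
            remapped.append(str(combined_alleles.index(padded)))

    # Unphased genotypes are conventionally written in ascending index order.
    if sep == "/" and "." not in remapped:
        remapped.sort(key=int)
    return sep.join(remapped)


combined_ref = "CAAA"
combined_alts = ["CAA", "C", "CA", "CAAAA"]

# sample1.vcf: REF=CA, ALT=C, GT=1/1
print(remap_genotype("1/1", "CA", ["C"], combined_ref, combined_alts))          # 1/1
# sample2.vcf: REF=CAA, ALT=C,CA, GT=1/2
print(remap_genotype("1/2", "CAA", ["C", "CA"], combined_ref, combined_alts))   # 1/3
```

Under that assumption I would expect 1/1 for sample 1 and 1/3 for sample 2 in the combined VCF, whereas both are reported as 0/3.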
Please advise whether there is a parameter I should add to the command line to solve this problem.