Genotypes in the combined VCF generated by CombineVariants differ from the original VCFs
Hi GATK team,
I am generating a combined VCF from 150+ VCFs to build a cohort, so that I can calculate the cohort frequency of each variant. However, the genotypes in the combined VCF do not match those in the original VCFs. Here is my command line:
java -jar /GATK/GenomeAnalysisTK-2.7-4/GenomeAnalysisTK.jar -R refernce.fasta \
-T CombineVariants \
--variant sample1.vcf \
--variant sample2.vcf \
Here is an example of one variant position in the combined VCF. The full record is very long, so I have included only the relevant columns.
1 22082967 rs35545280 CAAA CAA,C,CA,CAAAA 76.73 PASS AC=109,6,104,6;AF=0.368,0.020,0.351,0.020;AN=296;DB;DP=9869;GC=48.13;MQ0=0;PercentNBaseSolid=0.0000;RU=A;STR;set=filterInvariant-filterInvariant2-filterInvariant3… GT:DP:GQ
In this record, sample 1 has this variant and it shows as "0/3:46:99",
but in the sample1.vcf, it is listed as
1 22082967 rs35545280 CA C 88.73 Low_Confidence AC=2;AF=1.00;AN=2;BaseCounts=0,76,0,0;BaseQRankSum=1.905;DB;DP=76;FS=0.000;GC=48.13;HaplotypeScore=195.0572;IndelType=DEL.NOVEL_2.;LowMQ=0.0000,0.0000,76;MLEAC=1;MLEAF=0.500;MQ=68.40;MQ0=0;MQRankSum=0.423;PercentNBaseSolid=0.0000;QD=1.17;RPA=20,19;RU=A;ReadPosRankSum=-0.741;STR;set=FilteredInAll GT:AD:DP:GQ:PL 1/1:0,17:76:19:587,19,0
As you can see, the genotype for sample 1 is 0/3 in the combined VCF, but in its original VCF it is 1/1, which is homozygous. So when I calculate the cohort frequency, I am unsure how to match the genotype of this variant for sample 1.
To give you a better idea, here is another sample at the same variant, first as it appears in combined.VCF and then its record in sample2.VCF.
In the combined.VCF, sample 2 shows as "0/3:91:99".
In the sample2.VCF, the record is:
1 22082967 rs35545280 CAA C,CA 549.19 PASS AC=1,1;AF=0.500,0.500;AN=2;BaseCounts=0,105,0,0;BaseQRankSum=-1.897;DB;DP=105;FS=0.000;GC=48.13;HaplotypeScore=238.8742;IndelType=MULTIALLELIC_INDEL;LowMQ=0.0000,0.0000,105;MLEAC=1,1;MLEAF=0.500,0.500;MQ=68.78;MQ0=0;MQRankSum=-0.992;PercentNBaseSolid=0.0000;QD=5.23;RPA=20,18,19;RU=A;ReadPosRankSum=-1.344;STR;set=variant2 GT:AD:DP:GQ:PL 1/2:0,11,25:105:99:1363,307,775,501,0,585
Here the genotype is 1/2, but in the combined VCF it shows as "0/3".
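To make the mismatch concrete, the following is a minimal sketch (not GATK's implementation) of how allele indices would be expected to translate when a per-sample record is merged under a longer REF allele. It assumes the merged REF simply extends the sample REF at the same position, so each sample allele is padded with the extra REF bases before being looked up in the merged allele list. The function name `remap_genotype` is hypothetical; the allele strings and genotypes are taken from the records above.

```python
def remap_genotype(sample_ref, sample_alts, gt, merged_ref, merged_alts):
    """Translate a GT string from a per-sample record into the allele
    indices of a merged record at the same position.

    Assumption: merged_ref starts with sample_ref, so padding each sample
    allele with the remaining REF bases gives its merged representation.
    """
    if not merged_ref.startswith(sample_ref):
        raise ValueError("merged REF must extend the sample REF")
    suffix = merged_ref[len(sample_ref):]

    sample_alleles = [sample_ref] + list(sample_alts)
    merged_alleles = [merged_ref] + list(merged_alts)

    sep = "|" if "|" in gt else "/"
    out = []
    for idx in gt.split(sep):
        if idx == ".":
            out.append(".")  # keep no-calls as they are
            continue
        padded = sample_alleles[int(idx)] + suffix
        # Raises ValueError if the padded allele is absent from the merged record.
        out.append(str(merged_alleles.index(padded)))

    if sep == "/":
        # Unphased genotypes are conventionally written in ascending index order.
        out.sort(key=lambda a: (a == ".", int(a) if a != "." else 0))
    return sep.join(out)


merged_ref = "CAAA"
merged_alts = ["CAA", "C", "CA", "CAAAA"]

# sample1.vcf: REF=CA, ALT=C, GT=1/1
print(remap_genotype("CA", ["C"], "1/1", merged_ref, merged_alts))         # 1/1
# sample2.vcf: REF=CAA, ALT=C,CA, GT=1/2
print(remap_genotype("CAA", ["C", "CA"], "1/2", merged_ref, merged_alts))  # 1/3
```

Under that assumption, sample 1's 1/1 would translate to 1/1 and sample 2's 1/2 to 1/3 in the combined record, rather than the 0/3 reported for both.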
Please advise whether there is a parameter I should add to the command line to solve this problem.