Different reference allele in two files processed with same pipeline and smae reference files
Dear GATK team,
I have a set of samples for which I have WES and WGS data. I have run the same pipeline (mostly following GATK's Best Proctices) on both datasets with the same reference sets and target sites down to VQSR. So now I have to VCFs ready to analyze. My idea is to merge these datasets and I believe my situation parallels to the third case described in this document gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one.
Therefore, I used -T CombineVariants to merge these two files into one. Up until here everything runs ok and I obtain a working final VCF. However when I look deeper into it, there are some variants that have been kept duplicated but present different reference allele. When I look back at the separate VCF files, these differences are already there. My question is, how two sets of files that have been processed exactly in the same way, can present different reference sites? I paste couple of examples for clarification:
chr10 100167436 . T C 177.06 PASS AC=1;AF=0.011;AN=94; set=WGS
chr10 100167436 . TGTCACCAGGGGTCACCAGGGATGAGGACC CGTCACCAGGGGTCACCAGGGATGAGGACC,T 56690.81 PASS AC=50,2;AF=0.024,9.533e-04;AN=2098; set=WES
chr10 101462504 . CT CTT,C 1539.20 PASS AC=34,48;AF=0.014,0.020;AN=2348; set=WES
chr10 101462504 . C T,CT . PASS AC=0,0;AF=0.00,0.00;AN=114; set=WGS
chr10 102295645 . G GT 627.02 PASS AC=7;AF=0.061;AN=114; set=WGS
chr10 102295645 . GT GTT,G,TT 63780.89 PASS AC=203,403,17;AF=0.088,0.174,7.328e-03;AN=2320; set=WES
Any thoughts on this behaviour? and possible solutions?
Thanks in advance