This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!
AD in VCF doesn't match BAM
I'm using GATK to call variants in my RNA-Seq data. I'm noticing something strange, perhaps someone can help? For a number of sites the VCF is reporting things I cannot replicate from BAMs. How can I recover the reads that contribute to a variant call? Here is an example for 1 site in 1 sample, but I've observed this at many sites/samples:
$ grep 235068463 file.vcf chr1 235068463 . T C 1795.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.530;ClippingRankSum=-0.535;DP=60;FS=7.844;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.401;QD=29.93;ReadPosRankSum=3.557 GT:AD:DP:GQ:PL 0/1:5,55:60:44:1824,0,44
60 reads, 5 T, 55 C.
But loading the bam in IGV, I do not see any T reads. Similarly:
$ samtools view -uh file.md.realn.bam chr1:235068463-235068463 |samtools mpileup - |grep 235068463 [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 chr1 235068463 N 60 cCCccccCCCcccccCcccccccccCCCccCCCCCcCcccccCCCcCcCCccCCCCccCC >[email protected]@>A>[email protected]:@@[email protected]?>A?CBBAAAABA
There are just 60 C's at that location. How do I decide what the genotype here is? C/C or C/T ?
For methodology I'm using gatk/3.2.0. I tried using HC from gatk/3.3.1 and got the same result. The bam and vcf files come from the final two lines:
-2 pass STAR