This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!
Inconsistent Variant Detection in RNA-seq dataset using GATK
Hello, I've been following the best practices for calling variants in RNA-seq data, and I've come across an inconsistency when actually calling variants in my dataset that I can't figure out. Basically, I have a bam file that I've aligned using STAR, and removed duplicates, re-aligned, etc. following the above mentioned pipeline. I then called my variants using the following command:
java -Xms16g -Xmx25g -jar ~/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample_A.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o GATK_sample_a.vcf
I'm only interested in non-synonymous SNPs, so after filtering I end up with 4,675 SNPs in my sample (sample A). Now what I need to do is find SNPs in sample A that are not present in sample B. So I added the -ERC BP_RESOLUTION option to emit variants and non variants to the command above. I also added the -L option to limit my search to sites where I detected SNPs in sample A. Everything else in the command is unchanged:
java -Xms16g -Xmx25g -jar ~/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample_b.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -ERC BP_RESOLUTION -L sites.intervals -o GATK_sample_b.vcf
Here is where the problem is. As a test, I ran the above command on sample A thinking that GATK would identify all the sites as SNPs again. However, there are 7 sites that are no longer called variants.
Here is the original output for the 7 sites in question:
1 62683882 . G A 466.77 . AC=2;AF=1.00;AN=2;DP=11;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=24.45;SOR=2.494 GT:AD:DP:GQ:PL 1/1:0,11:11:33:495,33,0
14 98670790 . G T 6163.77 . AC=2;AF=1.00;AN=2;DP=140;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=62.87;QD=34.91;SOR=1.211 GT:AD:DP:GQ:PL 1/1:0,140:140:99:6192,425,0
14 120959107 . C A 61714.77 . AC=2;AF=1.00;AN=2;DP=1389;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.31;SOR=1.424 GT:AD:DP:GQ:PL 1/1:0,1389:1389:99:61743,4170,0
2 81810156 . C T 2505.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.249;ClippingRankSum=0.195;DP=323;FS=10.007;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.552;QD=7.76;ReadPosRankSum=3.185;SOR=1.326 GT:AD:DP:GQ:PL 0/1:147,88:235:99:2534,0,4547
7 24603765 . A G 988.77 . AC=2;AF=1.00;AN=2;DP=24;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.70;SOR=6.439 GT:AD:DP:GQ:PL 1/1:0,24:24:72:1017,72,0
GL895866.1 23532 . G C 2347.77 . AC=2;AF=1.00;AN=2;DP=54;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.35;SOR=2.876 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2376,169,0
GL893678.1 37589 . T G 823.77 . AC=2;AF=1.00;AN=2;DP=22;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=70.00;QD=33.49;SOR=1.931 GT:AD:DP:GQ:PL 1/1:0,22:22:66:852,66,0
And here is the output for the same 7 sites (using the same bam file) with the -ERC BP_RESOLUTION and -L option:
1 62683882 . G . . . GT:AD:DP:GQ:PL 0/0:0,12:12:0:0,0,0
14 98670790 . G . . . GT:AD:DP:GQ:PL 0/0:0,140:140:0:0,0,0
14 120959107 . C . . . GT:AD:DP:GQ:PL 0/0:1,1386:1387:0:0,0,0
2 81810156 . C . . . GT:AD:DP:GQ:PL 0/0:150,135:285:0:0,0,210
7 24603765 . A . . . GT:AD:DP:GQ:PL 0/0:0,24:24:0:0,0,0
GL895866.1 23532 . G . . . GT:AD:DP:GQ:PL 0/0:0,55:55:0:0,0,0
GL893678.1 37589 . T . . . GT:AD:DP:GQ:PL 0/0:0,25:25:0:0,0,0
The other 4,768 SNPs are recognized as SNPs in both outputs. Does anyone know what could be causing this inconsistency? For my project I need to be able to confidently call a SNP in 1 sample and confidently call the lack of a SNP at the same site in another sample. Please let me know if there's any additional information you need from me. Thank you very much!