If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Inconsistent Variant Detection in RNA-seq dataset using GATK
Hello, I've been following the best practices for calling variants in RNA-seq data, and I've come across an inconsistency when actually calling variants in my dataset that I can't figure out. Basically, I have a bam file that I've aligned using STAR, and removed duplicates, re-aligned, etc. following the above mentioned pipeline. I then called my variants using the following command:
java -Xms16g -Xmx25g -jar ~/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample_A.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o GATK_sample_a.vcf
I'm only interested in non-synonymous SNPs, so after filtering I end up with 4,675 SNPs in my sample (sample A). Now what I need to do is find SNPs in sample A that are not present in sample B. So I added the -ERC BP_RESOLUTION option to emit variants and non variants to the command above. I also added the -L option to limit my search to sites where I detected SNPs in sample A. Everything else in the command is unchanged:
java -Xms16g -Xmx25g -jar ~/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample_b.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -ERC BP_RESOLUTION -L sites.intervals -o GATK_sample_b.vcf
Here is where the problem is. As a test, I ran the above command on sample A thinking that GATK would identify all the sites as SNPs again. However, there are 7 sites that are no longer called variants.
Here is the original output for the 7 sites in question:
1 62683882 . G A 466.77 . AC=2;AF=1.00;AN=2;DP=11;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=24.45;SOR=2.494 GT:AD:DP:GQ:PL 1/1:0,11:11:33:495,33,0
14 98670790 . G T 6163.77 . AC=2;AF=1.00;AN=2;DP=140;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=62.87;QD=34.91;SOR=1.211 GT:AD:DP:GQ:PL 1/1:0,140:140:99:6192,425,0
14 120959107 . C A 61714.77 . AC=2;AF=1.00;AN=2;DP=1389;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.31;SOR=1.424 GT:AD:DP:GQ:PL 1/1:0,1389:1389:99:61743,4170,0
2 81810156 . C T 2505.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.249;ClippingRankSum=0.195;DP=323;FS=10.007;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.552;QD=7.76;ReadPosRankSum=3.185;SOR=1.326 GT:AD:DP:GQ:PL 0/1:147,88:235:99:2534,0,4547
7 24603765 . A G 988.77 . AC=2;AF=1.00;AN=2;DP=24;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.70;SOR=6.439 GT:AD:DP:GQ:PL 1/1:0,24:24:72:1017,72,0
GL895866.1 23532 . G C 2347.77 . AC=2;AF=1.00;AN=2;DP=54;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.35;SOR=2.876 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2376,169,0
GL893678.1 37589 . T G 823.77 . AC=2;AF=1.00;AN=2;DP=22;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=70.00;QD=33.49;SOR=1.931 GT:AD:DP:GQ:PL 1/1:0,22:22:66:852,66,0
And here is the output for the same 7 sites (using the same bam file) with the -ERC BP_RESOLUTION and -L option:
1 62683882 . G . . . GT:AD:DP:GQ:PL 0/0:0,12:12:0:0,0,0
14 98670790 . G . . . GT:AD:DP:GQ:PL 0/0:0,140:140:0:0,0,0
14 120959107 . C . . . GT:AD:DP:GQ:PL 0/0:1,1386:1387:0:0,0,0
2 81810156 . C . . . GT:AD:DP:GQ:PL 0/0:150,135:285:0:0,0,210
7 24603765 . A . . . GT:AD:DP:GQ:PL 0/0:0,24:24:0:0,0,0
GL895866.1 23532 . G . . . GT:AD:DP:GQ:PL 0/0:0,55:55:0:0,0,0
GL893678.1 37589 . T . . . GT:AD:DP:GQ:PL 0/0:0,25:25:0:0,0,0
The other 4,768 SNPs are recognized as SNPs in both outputs. Does anyone know what could be causing this inconsistency? For my project I need to be able to confidently call a SNP in 1 sample and confidently call the lack of a SNP at the same site in another sample. Please let me know if there's any additional information you need from me. Thank you very much!