Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Inconsistent Variant Detection in RNA-seq dataset using GATK

shocker8786shocker8786 United StatesMember

Hello, I've been following the best practices for calling variants in RNA-seq data, and I've come across an inconsistency when actually calling variants in my dataset that I can't figure out. Basically, I have a bam file that I've aligned using STAR, and removed duplicates, re-aligned, etc. following the above mentioned pipeline. I then called my variants using the following command:

java -Xms16g -Xmx25g -jar ~/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample_A.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o GATK_sample_a.vcf

I'm only interested in non-synonymous SNPs, so after filtering I end up with 4,675 SNPs in my sample (sample A). Now what I need to do is find SNPs in sample A that are not present in sample B. So I added the -ERC BP_RESOLUTION option to emit variants and non variants to the command above. I also added the -L option to limit my search to sites where I detected SNPs in sample A. Everything else in the command is unchanged:

java -Xms16g -Xmx25g -jar ~/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample_b.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -ERC BP_RESOLUTION -L sites.intervals -o GATK_sample_b.vcf

Here is where the problem is. As a test, I ran the above command on sample A thinking that GATK would identify all the sites as SNPs again. However, there are 7 sites that are no longer called variants.

Here is the original output for the 7 sites in question:

1 62683882 . G A 466.77 . AC=2;AF=1.00;AN=2;DP=11;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=24.45;SOR=2.494 GT:AD:DP:GQ:PL 1/1:0,11:11:33:495,33,0

14 98670790 . G T 6163.77 . AC=2;AF=1.00;AN=2;DP=140;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=62.87;QD=34.91;SOR=1.211 GT:AD:DP:GQ:PL 1/1:0,140:140:99:6192,425,0

14 120959107 . C A 61714.77 . AC=2;AF=1.00;AN=2;DP=1389;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.31;SOR=1.424 GT:AD:DP:GQ:PL 1/1:0,1389:1389:99:61743,4170,0

2 81810156 . C T 2505.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.249;ClippingRankSum=0.195;DP=323;FS=10.007;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.552;QD=7.76;ReadPosRankSum=3.185;SOR=1.326 GT:AD:DP:GQ:PL 0/1:147,88:235:99:2534,0,4547

7 24603765 . A G 988.77 . AC=2;AF=1.00;AN=2;DP=24;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.70;SOR=6.439 GT:AD:DP:GQ:PL 1/1:0,24:24:72:1017,72,0

GL895866.1 23532 . G C 2347.77 . AC=2;AF=1.00;AN=2;DP=54;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.35;SOR=2.876 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2376,169,0

GL893678.1 37589 . T G 823.77 . AC=2;AF=1.00;AN=2;DP=22;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=70.00;QD=33.49;SOR=1.931 GT:AD:DP:GQ:PL 1/1:0,22:22:66:852,66,0

And here is the output for the same 7 sites (using the same bam file) with the -ERC BP_RESOLUTION and -L option:

1 62683882 . G . . . GT:AD:DP:GQ:PL 0/0:0,12:12:0:0,0,0

14 98670790 . G . . . GT:AD:DP:GQ:PL 0/0:0,140:140:0:0,0,0

14 120959107 . C . . . GT:AD:DP:GQ:PL 0/0:1,1386:1387:0:0,0,0

2 81810156 . C . . . GT:AD:DP:GQ:PL 0/0:150,135:285:0:0,0,210

7 24603765 . A . . . GT:AD:DP:GQ:PL 0/0:0,24:24:0:0,0,0

GL895866.1 23532 . G . . . GT:AD:DP:GQ:PL 0/0:0,55:55:0:0,0,0

GL893678.1 37589 . T . . . GT:AD:DP:GQ:PL 0/0:0,25:25:0:0,0,0

The other 4,768 SNPs are recognized as SNPs in both outputs. Does anyone know what could be causing this inconsistency? For my project I need to be able to confidently call a SNP in 1 sample and confidently call the lack of a SNP at the same site in another sample. Please let me know if there's any additional information you need from me. Thank you very much!

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @shocker8786
    Hi,

    Can you post the original bam file, the bamout file from Haplotype Caller in regular mode, and the bamout file from Haplotype Caller in BP_RESOLUTION mode? Please post the region around 14:98670790. http://gatkforums.broadinstitute.org/discussion/5484/howto-generate-a-bamout-file-showing-how-haplotypecaller-has-remapped-sequence-reads#latest
    I would like to see what the reassembled bam looks like for Haplotype Caller in normal mode versus Haplotype Caller in BP_RESOLUTION mode.

    Also, can you please run Genotype GVCFs on the GVCF, and let me know if anything changes? I don't think anything will, but I just want to make sure.

    Thanks,
    Sheila

  • shocker8786shocker8786 United StatesMember

    Sheila,

    Thanks for your quick reply. Here is the region from the original bam file (the red variant is the site in question):

    The normal haplotypecaller output:

    And haplotypecaller output using BP_RESOLUTION mode:

    Please let me know if these figures are not what you were looking for.

    I also ran the GVCF file from haplotypecaller using BP_RESOLUTION mode through genotypeGVCFs. In addition to the 7 above mentioned sites, another 13 were excluded from the genotypeGVCFs output. Below are the additional 13 sites in question from the GVCF file used as input (from haplotypecaller using BP_RESOLUTION):

    10 38189019 . G A, 20 . DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00 GT:AD:DP:GQ:PL:SB 1/1:0,1,0:1:3:45,3,0,45,3,45:0,0,0,1

    10 70074419 . A G, 25.78 . BaseQRankSum=1.296;ClippingRankSum=1.061;DP=11;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=0.118;ReadPosRankSum=0.118 GT:AD:DP:GQ:PL:SB 0/1:9,2,0:11:54:54,0,212,81,218,298:5,4,1,1

    2 149239838 . A C, 23.79 . BaseQRankSum=0.480;ClippingRankSum=0.898;DP=10;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-0.480;ReadPosRankSum=0.480 GT:AD:DP:GQ:PL:SB 0/1:8,2,0:10:52:52,0,283,77,289,366:7,1,2,0

    4 98693752 . G C, 26.78 . BaseQRankSum=-0.530;ClippingRankSum=-0.177;DP=18;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-0.265;ReadPosRankSum=-0.883 GT:AD:DP:GQ:PL:SB 0/1:9,9,0:18:55:55,0,117,79,140,220:2,7,0,9

    6 89963436 . C T, 27.77 . BaseQRankSum=0.898;ClippingRankSum=0.898;DP=10;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=0.480;ReadPosRankSum=-0.480 GT:AD:DP:GQ:PL:SB 0/1:8,2,0:10:56:56,0,283,81,289,369:2,6,0,2

    6 145367351 . G A, 26.78 . BaseQRankSum=0.096;ClippingRankSum=1.722;DP=23;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-0.096;ReadPosRankSum=-2.200 GT:AD:DP:GQ:PL:SB 0/1:19,3,0:22:55:55,0,691,112,700,812:7,12,0,3

    7 24644738 . C G, 13.95 . BaseQRankSum=2.145;ClippingRankSum=-0.837;DP=53;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.20;MQRankSum=1.172;ReadPosRankSum=-1.202 GT:AD:DP:GQ:PL:SB 0/1:48,5,0:53:42:42,0,1300,186,1315,1501:22,26,3,2

    8 139331335 . G A, 8.44 . BaseQRankSum=-0.698;ClippingRankSum=0.175;DP=24;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=0.087;ReadPosRankSum=-0.436 GT:AD:DP:GQ:PL:SB 0/1:21,3,0:24:36:36,0,716,99,725,824:13,8,2,1

    9 79467548 . A T, 29.77 . BaseQRankSum=-1.377;ClippingRankSum=0.715;DP=11;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=0.000;ReadPosRankSum=0.715 GT:AD:DP:GQ:PL:SB 0/1:8,3,0:11:58:58,0,279,82,288,370:3,5,2,1

    9 128954225 . C T, 26.78 . BaseQRankSum=1.067;ClippingRankSum=0.572;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=1.067;ReadPosRankSum=-0.572 GT:AD:DP:GQ:PL:SB 0/1:6,2,0:8:55:55,0,214,74,220,293:4,2,2,0

    GL894321.1 38682 . G A, 21.80 . BaseQRankSum=0.752;ClippingRankSum=-0.752;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=1.611;ReadPosRankSum=-0.537 GT:AD:DP:GQ:PL:SB 0/1:10,2,0:12:50:50,0,356,81,362,442:4,6,0,2

    GL892407.1 140890 . A G, 24.78 . BaseQRankSum=0.480;ClippingRankSum=0.480;DP=10;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-0.480;ReadPosRankSum=-0.480 GT:AD:DP:GQ:PL:SB 0/1:8,2,0:10:53:53,0,299,78,305,383:8,0,2,0

    JH118861.1 198717 . T C, 20.80 . BaseQRankSum=0.589;ClippingRankSum=-0.825;DP=11;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-0.589;ReadPosRankSum=-0.354 GT:AD:DP:GQ:PL:SB 0/1:9,2,0:11:49:49,0,281,77,287,363:5,4,1,1

    Please let me know if there is anything else you need. Thanks again!

    Issue · Github
    by Sheila

    Issue Number
    185
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • shocker8786shocker8786 United StatesMember

    I re-ran haplotypecaller using -ERC GVCF instead of BP_RESOLUTION, but the result was the same. However, adding the -ip 100 to the command line while using BP_RESOLUTION did restore the variant calls, so it seems I should have no problem using that command going forward. Thank you very much for you help!

  • shocker8786shocker8786 United StatesMember

    Sorry I thought this issue was resolved, but I'm now coming across more inconsistencies. Adding the -ip 100 option did restore the variant calls at the 7 sites in question. However, it also resulted in 3 additional sites that were called SNPs in the original output, also called SNPs using -ERC BP_RESOLUTION on the intervals without the -ip 100, but are no longer recognized as SNPs with the -ip 100 option.

    Here is the original output for these 3 sites:

    1 18702685 . G A 17491.77 . AC=2;AF=1.00;AN=2;DP=454;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.17;SOR=0.948 GT:AD:DP:GQ:PL 1/1:0,454:454:99:17520,1364,0

    5 84082784 . T C 34646.77 . AC=2;AF=1.00;AN=2;DP=532;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.73;SOR=0.850 GT:AD:DP:GQ:PL 1/1:0,518:518:99:34675,2452,0

    5 84082808 . T C 38218.77 . AC=2;AF=1.00;AN=2;DP=888;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=26.66;SOR=1.297 GT:AD:DP:GQ:PL 1/1:0,887:887:99:38247,2726,0

    The output using -ERC BP_RESOLUTION on the intervals:

    1 18702685 . G A, 17503.77 . DP=452;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00 GT:AD:DP:GQ:PL:SB 1/1:0,452,0:452:99:17532,1358,0,17532,1358,17532:0,0,250,202

    5 84082784 . T C, 20254.77 . BaseQRankSum=1.725;ClippingRankSum=1.597;DP=526;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQRankSum=1.387;ReadPosRankSum=1.725 GT:AD:DP:GQ:PL:SB 1/1:1,511,0:512:99:20283,1529,0,20286,1536,20293:1,0,239,272

    5 84082808 . T C, 36005.77 . DP=905;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00 GT:AD:DP:GQ:PL:SB 1/1:0,905,0:905:99:36034,2721,0,36034,2721,36034:0,0,554,351

    And the output using -ERC BP_RESOLUTION on the intervals with the -ip 100 option:

    1 18702685 . G . . . GT:AD:DP:GQ:PL 0/0:0,451:451:0:0,0,0

    5 84082784 . T . . . GT:AD:DP:GQ:PL 0/0:0,517:517:0:0,0,0

    5 84082808 . T . . . GT:AD:DP:GQ:PL 0/0:0,901:901:0:0,0,0

    After discovering this, I decided to run haplotypecaller on the 4,675 original SNP positions (-L option) without the -ERC option. This resulting in 4,665 SNP sites (10 SNPs no longer recognized). I than ran haplotypecaller using the -L option with -ip 100 (again no ERC), and only 4,671 are returned (4 missing). So it seems that although the -ip 100 option restores some SNPs, using the -L option is ultimately causing SNPs to no longer be recognized (even if I up it to -ip 200). Do you have any further suggestions to try and resolve this issue?

    I could run haplotypecaller to call SNPs across the genome in my control and treatment samples to find SNPs in both samples. But is it possible for me to call reference bases in RNAseq data using GATK, given that the -ERC mode is not supported? Maybe I need to come up with another way to verify that a SNP is absent in my control sample, and that the reason it is not called is not due to low coverage/quality at that site.

    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The most likely reason this is happening is that when you provide more sequence context around a site by adding padding, it gives HaplotypeCaller the opportunity to do a more comprehensive reassembly of the region. This is a good thing -- it compensates for the flaws of the original mapper. In some cases this will produce new candidate variants, while in others it will take some away -- which in most cases means the original call was probably an alignment artifact.

  • shocker8786shocker8786 United StatesMember

    Thanks for the explanation, it's starting to make more sense now! I decided that after my initial SNP discovery, I would run GATK haplotypecaller again on my SNP intervals using -ip 100 to give me my final SNP list (4 less SNPs). If I then re-run GATK haplotypecaller using the final SNP intervals with -ip 100 and -ERC BP_RESOLUTION, I receive the same SNP calls. So I think if I use this protocol going forward, I should no longer have any issues. Thanks again for all your help!

Sign In or Register to comment.