Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Uncorrect Strand Bias due to Downsampling (HaplotypeCaller)

Hi GATK team,
Again thanks a lot for the wonderful tools you're offering to the community.

I have recently switched from UnifiedGenotyper to Haplotype Caller (1 sample at a time, DNASeq). I was planning to use the same hard filtering procedure that I was using previously, including the filter of the variants with FS > 60.
However I am facing an issue probably due to the downsampling done by HC.

I should have 5000 reads, but DP is around 500/600 which I understood is due to downsampling (even with -dt NONE). I did understand that it does not impact in the calling itself. However it is annoying me for 2 reasons
1) Calculating frequency of the variant using the AD field is not correct (not based on all reads)
2) I get variants with FS >60 whereas when you look at the entire set of reads, there is absolutely no strand bias.

Example with this variant
chr17 41245466 rs1799949 G A 7441.77 STRAND_BIAS; AC=1;AF=0.500;AN=2;BaseQRankSum=7.576;DB;DP=1042;FS=63.090;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.666;QD=7.14;ReadPosRankSum=-11.896;SOR=5.810 GT:AD:GQ:PL:SB 0/1:575,258:99:7470,0,21182:424,151,254,4

When I observe all reads I have the following counts, well shared on the + and - strands
Allele G : 1389 (874+, 515-)
Allele A : 1445 (886+, 559-)

Could you please tell me how to avoid such an issue ? (By the way, this variant is a true one and should not be filtered out).

Thanks a lot.

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @manon_sourdeix
    Hi,

    Are you looking at the original bam file? Can you please post both the original bam file and bamout file? The bamout file contains the reads after Haplotype Caller's reassembly. In your example, the Strand Bias does have the correct read counts (they add up to the AD values). And, it does look like there is strand bias, because the alt allele is seen mostly on the forward strand but not on the reverse strand. However, I understand your frustration because the counts seem proportional in all the original reads. I think the bamout may help us understand what is going on here.

    -Sheila

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi,
    I am afraid that when I re-ran it to generate the bamout file, the strand bias was absent. Which is an even bigger problem.

    I do hypothethise that the downsampling, by chance may or may not choose reads "correctly" which may lead to a strand bias.
    I am trying to run it many times see if I can get it again somehow.

    This is really a problem for us because we are doing diagnosis so we cannot miss any variant (false positives are ok). Unless I do not filter on Strand Bias, this seems to create an issue. And I am a bit concerned that the QD filtering, by beeing based on the downsampled depth, may cause the same problem as well.

    Any idea/suggestion about that ?

    Thanks
    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Sorry, I am a bit stubborn and wanted to figure out why the FS varies.

    It seems that FS is high when I put -A StrandBiasBySample in the command line (to get the annotation in black in my first post). When I remove it, it seems alright (FS around 2, nothing to worry about).

    I made it run 2 times with the -A paramater --> led to high FS value
    I made it run 2 times without that paramater --> was ok.
    How is this possible ??

    Seems like a bug don't you think ? (Easily resolved from the user part though..).

  • SheilaSheila Broad InstituteMember, Broadie admin

    @manon_sourdeix
    Hi Manon,

    I do not see a difference in FS when I add -A StrandBiasBySample. Please do note, Fisher Strand (FS) and StrandBiasBySample (SB) are different annotations.

    Can you post some VCF records in which the FS is different when using -A and not using -A?

    Thanks,
    Sheila

  • manon_sourdeixmanon_sourdeix FranceMember

    Hey
    It's actually without both annotations = -A StrandBiasBySample and -A FisherStrand that I obtain the following vcf line (to be compared to the one in my original post)

    chr17 41245466 rs1799949 G A 17608.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.166;ClippingRankSum=0.064;DB;DP=1059;FS=1.265;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.361;QD=16.63;ReadPosRankSum=3.070;SOR=0.556 GT:AD:DP:GQ:PL 0/1:511,534:1045:99:17637,0,16633

  • SheilaSheila Broad InstituteMember, Broadie admin

    @manon_sourdeix
    Hi Manon,

    Did you use the same intervals for both results? I am a little confused why the first result gives a much lower AD for the ALT allele than the second result. Can you post the exact commands you ran for each result? Also, are you using the latest version of GATK?

    Thanks,
    Sheila

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi !
    Exactly the same intervals, yes. And yes very last version of GATK.

    But here we go, I think I did not identify the correct culprit ! Look at the last command with nct 4. I guess I should have listened to the recommendations for not using it... Any idea when the parallelization will be possible ?

    Here are the commands and their corresponding outputs

    1) Without -A parameter
    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I ../5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 -bamout 5938_HC_bamout.bam -o 5938_HC_out.vcf

    Output =
    chr17 41245466 rs1799949 G A 17539.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.885;ClippingRankSum=-1.049;DB;DP=1056;FS=2.010;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-1.449;QD=16.61;ReadPosRankSum=-0.356;SOR=0.539 GT:AD:DP:GQ:PL 0/1:518,528:1046:99:17568,0,16685

    2) With -A parameter
    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I ../5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 -A StrandBiasBySample -A FisherStrand -bamout 5938_HC_bamout.bam -o 5938_HC_out.vcf

    Output =
    chr17 41245466 rs1799949 G A 17203.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.165;DB;DP=1058;FS=1.972;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.048;QD=16.26;ReadPosRankSum=-0.587;SOR=0.555 GT:AD:GQ:PL:SB 0/1:512,528:99:17232,0,16472:378,134,374,154

    2) With -A parameter and nct 4
    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I ../5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 -A StrandBiasBySample -A FisherStrand -nct 4 -o 5938_HC_out.vcf

    Output =
    chr17 41245466 rs1799949 G A 8068.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=8.147;DB;DP=1043;FS=51.647;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-1.291;QD=7.74;ReadPosRankSum=-10.722;SOR=4.298 GT:AD:GQ:PL:SB 0/1:576,303:99:8097,0,21822:409,167,290,13

    Thanks a lot for your help though !!

    Issue · Github
    by Sheila

    Issue Number
    1089
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • manon_sourdeixmanon_sourdeix FranceMember

    Ok great.
    Maybe at least not letting the possibility to use nt/nct as for many other GATK tools ?

    Manon

  • SheilaSheila Broad InstituteMember, Broadie admin

    @manon_sourdeix
    Hi Manon,

    Can you do us a favor and run Haplotype Caller on that region with nct 1 and nct 4 three times each? Then, post the results from all 6 runs. We would like to see if there is an issue with downsampling and if what happened before was just by chance.

    Thanks,
    Sheila

  • manon_sourdeixmanon_sourdeix FranceMember

    Sure !
    Although I have something running right now on the computer. I will do that as soon as this is finished and come back to you.

    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Hey
    Sorry for the delay in my tests. Here they are. It seems that the problem comes up when -nct and -A parameters are used together.

    1) With both -A parameter and nct 4
    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-A StrandBiasBySample -A FisherStrand -nct 4** -o 5938_HC_out_nct4_1.vcf

    Output=
    chr17 41245466 rs1799949 G A 8068.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=8.147;DB;DP=1043;FS=51.647;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-1.291;QD=7.74;ReadPosRankSum=-10.722;SOR=4.298 GT:AD:GQ:PL:SB 0/1:576,303:99:8097,0,21822:409,167,290,13

    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-A StrandBiasBySample -A FisherStrand -nct 4** -o 5938_HC_out_nct4_2.vcf

    Output=
    chr17 41245466 rs1799949 G A 7879.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=5.604;DB;DP=1042;FS=53.796;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.109;QD=7.56;ReadPosRankSum=-9.917;SOR=4.233 GT:AD:GQ:PL:SB 0/1:581,297:99:7908,0,22048:414,167,284,13

    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-A StrandBiasBySample -A FisherStrand -nct 4** -o 5938_HC_out_nct4_3.vcf

    Output=
    chr17 41245466 rs1799949 G A 8210.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=8.589;DB;DP=1043;FS=54.591;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.279;QD=7.87;ReadPosRankSum=-11.658;SOR=4.328 GT:AD:GQ:PL:SB 0/1:574,306:99:8239,0,21720:407,167,293,13

    2) With -A parameter only
    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-A StrandBiasBySample -A FisherStrand** -o 5938_HC_out_Aonly_1.vcf

    Output=
    chr17 41245466 rs1799949 G A 17203.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.165;DB;DP=1058;FS=1.972;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.048;QD=16.26;ReadPosRankSum=-0.587;SOR=0.555 GT:AD:GQ:PL:SB 0/1:512,528:99:17232,0,16472:378,134,374,154

    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-A StrandBiasBySample -A FisherStrand** -o 5938_HC_out_Aonly_2.vcf

    Output=
    chr17 41245466 rs1799949 G A 17203.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.165;DB;DP=1058;FS=1.972;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.048;QD=16.26;ReadPosRankSum=-0.587;SOR=0.555 GT:AD:GQ:PL:SB 0/1:512,528:99:17232,0,16472:378,134,374,154

    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-A StrandBiasBySample -A FisherStrand** -o 5938_HC_out_Aonly_3.vcf

    Output=
    chr17 41245466 rs1799949 G A 17203.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.165;DB;DP=1058;FS=1.972;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.048;QD=16.26;ReadPosRankSum=-0.587;SOR=0.555 GT:AD:GQ:PL:SB 0/1:512,528:99:17232,0,16472:378,134,374,154

    3) With nct 4 only
    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-nct 4** -o 5938_HC_out_nctOnly_1.vcf

    Output=
    chr17 41245466 rs1799949 G A 18268.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.469;ClippingRankSum=0.414;DB;DP=1059;FS=0.601;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.184;QD=17.25;ReadPosRankSum=0.865;SOR=0.618 GT:AD:DP:GQ:PL 0/1:496,550:1046:99:18297,0,16009

    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-nct 4** -o 5938_HC_out_nctOnly_2.vcf

    Output=
    chr17 41245466 rs1799949 G A 17397.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.153;ClippingRankSum=0.235;DB;DP=1059;FS=2.034;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.180;QD=16.43;ReadPosRankSum=2.266;SOR=0.530 GT:AD:DP:GQ:PL 0/1:515,530:1045:99:17426,0,16856

    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 5938_HC_originalBam.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 **-nct 4** -o 5938_HC_out_nctOnly_3.vcf

    Output=
    chr17 41245466 rs1799949 G A 17397.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.153;ClippingRankSum=0.235;DB;DP=1059;FS=2.034;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.180;QD=16.43;ReadPosRankSum=2.266;SOR=0.530 GT:AD:DP:GQ:PL 0/1:515,530:1045:99:17426,0,16856

  • SheilaSheila Broad InstituteMember, Broadie admin

    @manon_sourdeix
    Thanks. This is odd. I will get back to you once I hear from the team what might be happening.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie admin

    @manon_sourdeix
    Hi Manon,

    Looks like it is time to submit a bug report! This is certainly very odd behavior. Can you share you files with us, so we can debug locally? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    -Sheila

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi. I just came back from a long holiday. I will have a look at that sometime soon and submit my files.
    Sorry for the delay.

Sign In or Register to comment.