Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Incoherences between Haplotype Caller runs using multiple threads

Hi,

I have been testing my calling pipeline running it several times using the same samples, and I have found inconsistencies/different results when running it using multithreading (-nct 4).

My command is:

java -Xmx20G -Djava.io.tmpdir=${TMPDIR} -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ${ref} \
-I ${input} \
-L ${reg} \
-ERC GVCF \
-nct ${nct} \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o ${name}.raw_variants.annotated.g.vcf \
-A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage -A InbreedingCoeff

When running it using -nct 1, all runs output vcf (.g.vcf in this case) files are completely identical (except for one header line that contains the date and command).

But, when running it with -nct 4, some variants present different values in some descriptors. Even more, different runs using -nct 4, show different values among them:

$ diff nct1.g.vcf vs nct4.g.vcf :
338c338

< 1 10409 . ACCCTAACCCTAACCCTAACCCTAACCCTAAC A, 57.73 . BaseQRankSum=-0.442;DP=87;MLEAC=1,0;MLEAF=0.500,0.00;MQ=36.19;MQRankSum=2.033;ReadPosRankSum=1.536 GT:AD:GQ:PL:SB 0/1:27,4,0:95:95,0,690,172,706,878:5,22,2,2

1 10409 . ACCCTAACCCTAACCCTAACCCTAACCCTAAC A, 57.73 . BaseQRankSum=-0.442;DP=87;MLEAC=1,0;MLEAF=0.500,0.00;MQ=36.19;MQRankSum=2.092;ReadPosRankSum=1.536 GT:AD:GQ:PL:SB 0/1:27,4,0:95:95,0,690,172,706,878:5,22,2,2

378c378

< 1 10492 . C T, 0.04 . BaseQRankSum=-1.364;DP=46;MLEAC=1,0;MLEAF=0.500,0.00;MQ=50.17;MQRankSum=1.124;ReadPosRankSum=-0.502 GT:AD:GQ:PL:SB 0/1:23,3,0:8:8,0,715,77,724,801:3,20,2,1

1 10492 . C T, 0.04 . BaseQRankSum=-1.605;DP=46;MLEAC=1,0;MLEAF=0.500,0.00;MQ=50.17;MQRankSum=1.364;ReadPosRankSum=-0.502 GT:AD:GQ:PL:SB 0/1:23,3,0:8:8,0,715,77,724,801:3,20,2,1

494c494

< 1 13110 . G A, 287.77 . BaseQRankSum=0.572;DP=37;MLEAC=1,0;MLEAF=0.500,0.00;MQ=23.89;MQRankSum=-0.696;ReadPosRankSum=0.418 GT:AD:GQ:PL:SB 0/1:22,15,0:99:316,0,501,381,546,927:0,22,0,15

1 13110 . G A, 287.77 . BaseQRankSum=0.541;DP=37;MLEAC=1,0;MLEAF=0.500,0.00;MQ=23.89;MQRankSum=0.789;ReadPosRankSum=0.418 GT:AD:GQ:PL:SB 0/1:22,15,0:99:316,0,501,381,546,927:0,22,0,15

582c582

< 1 13649 . G C, 32.77 . BaseQRankSum=-1.517;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=23.52;MQRankSum=0.322;ReadPosRankSum=0.322 GT:AD:GQ:PL:SB 0/1:5,3,0:61:61,0,117,76,126,202:4,1,1,2

1 13649 . G C, 32.77 . BaseQRankSum=-1.517;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=23.52;MQRankSum=0.684;ReadPosRankSum=0.322 GT:AD:GQ:PL:SB 0/1:5,3,0:61:61,0,117,76,126,202:4,1,1,2

Or when comparing two runs of -nct 4

diff nct4-1.g.vcf nct4-2.g.vcf

338c338

< 1 10409 . ACCCTAACCCTAACCCTAACCCTAACCCTAAC A, 57.73 . BaseQRankSum=-0.442;DP=87;MLEAC=1,0;MLEAF=0.500,0.00;MQ=36.19;MQRankSum=2.033;ReadPosRankSum=1.536 GT:AD:GQ:PL:SB 0/1:27,4,0:95:95,0,690,172,706,878:5,22,2,2

1 10409 . ACCCTAACCCTAACCCTAACCCTAACCCTAAC A, 57.73 . BaseQRankSum=-0.442;DP=87;MLEAC=1,0;MLEAF=0.500,0.00;MQ=36.19;MQRankSum=2.092;ReadPosRankSum=1.536 GT:AD:GQ:PL:SB 0/1:27,4,0:95:95,0,690,172,706,878:5,22,2,2

378c378

< 1 10492 . C T, 0.04 . BaseQRankSum=-1.525;DP=46;MLEAC=1,0;MLEAF=0.500,0.00;MQ=50.17;MQRankSum=0.482;ReadPosRankSum=-0.502 GT:AD:GQ:PL:SB 0/1:23,3,0:8:8,0,715,77,724,801:3,20,2,1

1 10492 . C T, 0.04 . BaseQRankSum=-1.605;DP=46;MLEAC=1,0;MLEAF=0.500,0.00;MQ=50.17;MQRankSum=1.364;ReadPosRankSum=-0.502 GT:AD:GQ:PL:SB 0/1:23,3,0:8:8,0,715,77,724,801:3,20,2,1

494c494

< 1 13110 . G A, 287.77 . BaseQRankSum=0.510;DP=37;MLEAC=1,0;MLEAF=0.500,0.00;MQ=23.89;MQRankSum=0.139;ReadPosRankSum=0.727 GT:AD:GQ:PL:SB 0/1:22,15,0:99:316,0,501,381,546,927:0,22,0,15

1 13110 . G A, 287.77 . BaseQRankSum=0.541;DP=37;MLEAC=1,0;MLEAF=0.500,0.00;MQ=23.89;MQRankSum=0.789;ReadPosRankSum=0.418 GT:AD:GQ:PL:SB 0/1:22,15,0:99:316,0,501,381,546,927:0,22,0,15

582c582

< 1 13649 . G C, 32.77 . BaseQRankSum=-1.221;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=23.52;MQRankSum=0.322;ReadPosRankSum=0.684 GT:AD:GQ:PL:SB 0/1:5,3,0:61:61,0,117,76,126,202:4,1,1,2

1 13649 . G C, 32.77 . BaseQRankSum=-1.517;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=23.52;MQRankSum=0.684;ReadPosRankSum=0.322 GT:AD:GQ:PL:SB 0/1:5,3,0:61:61,0,117,76,126,202:4,1,1,2

I have tested this also in 3.1-1 and the problem is already there.

Calling the chromosome 1 of a single 40x individual takes 9~10 hours with -nct 1 and 3.5 hours with -nct 4.
But I am not sure if this lack of coherence is worth the speed-up

Txema

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @txemaheredia
    Hi Txema,

    There will be difference between runs when using nct or nt, but the minor differences are nothing to worry about. Have a look at this thread as well: http://gatkforums.broadinstitute.org/discussion/4422/can-nct-nt-option-in-unifiedgenotyper-make-different-vcfs

    -Sheila

  • txemaherediatxemaheredia Member

    Hi Sheila,

    in that message you mention that differences should be in low quality calls. If we define "low quality" as GQ < 30, we can find this:

    Number of variants showing differences between -nct 1 and -nct 4 executions:
    259043

    -nct1 variants with QC < 30
    $ grep "^<" diff-nct1-nct4.txt | awk '{print $11}' | awk -F: -v x=30 '$3 < x' | wc -l
    19589

    -nct4 variants with QC < 30
    $ grep "^>" diff-nct1-nct4.txt | awk '{print $11}' | awk -F: -v x=30 '$3 < x' | wc -l
    19593

    -nct1 variants with QC >= 30
    $ grep "^<" diff-nct1-nct4.txt | awk '{print $11}' | awk -F: -v x=30 '$3 >= x' | wc -l
    239726

    -nct4 variants with QC >= 30
    $ grep "^>" diff-nct1-nct4.txt | awk '{print $11}' | awk -F: -v x=30 '$3 >= x' | wc -l
    239736

    -nct1 variants with QC >= 99
    $ grep "^<" diff-nct1-nct4.txt | awk '{print $11}' | awk -F: -v x=99 '$3 >= x' | wc -l
    213348
    -nct4 variants with QC >= 99
    $ grep "^>" diff-nct1-nct4.txt | awk '{print $11}' | awk -F: -v x=99 '$3 >= x' | wc -l
    213349

    I could even find some variants where some downsampling is applied and they became snps instead of homozygous reference in the g.vcf:

    $ grep 142538605 diff-nct1-nct4.txt
    < 1 142538605 . C G,<NON_REF> 1930.77 . BaseQRankSum=4.269;DP=896;MLEAC=1,0;MLEAF=0.500,0.00;MQ=57.48;MQRankSum=-13.463;ReadPosRankSum=-0.191 GT:AD:GQ:PL:SB 0/1:750,98,0:99:1959,0,32203,4283,32510,36793:358,392,50,48
    > 1 142538605 . C <NON_REF> . . END=142538605 GT:DP:GQ:MIN_DP:PL 0/0:859:0:859:0,0,24816

    $ grep 142538618 diff-nct1-nct4.txt
    < 1 142538618 . T A,<NON_REF> 1798.77 . BaseQRankSum=1.542;DP=1007;MLEAC=1,0;MLEAF=0.500,0.00;MQ=56.78;MQRankSum=-12.580;ReadPosRankSum=-0.845 GT:AD:GQ:PGT:PID:PL:SB 0/1:816,108,0:99:0|1:142538584_T_G:1827,0,33554,4309,33879,38189:391,425,47,61
    > 1 142538618 . T <NON_REF> . . END=142538618 GT:DP:GQ:MIN_DP:PL 0/0:945:0:945:0,0,28931

    $ grep 142538644 diff-nct1-nct4.txt
    < 1 142538644 . T <NON_REF> . . END=142538644 GT:DP:GQ:MIN_DP:PL 0/0:1081:0:1081:0,0,32867
    > 1 142538644 . T C,G,<NON_REF> 0 . BaseQRankSum=-0.730;DP=1047;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=56.35;MQRankSum=-1.140;ReadPosRankSum=0.485 GT:AD:GQ:PL:SB 0/0:795,10,52,0:99:0,1955,36152,107,33792,34113,2459,36143,34282,36633:380,415,31,31

    As well as some variants where the -nct 1 version show a lower Depth of Coverage than the -nct 4 verison (whish supposedly is the one downsampling):

    $ grep 142537799 diff-nct1-nct4.txt
    < 1 142537799 . C G,<NON_REF> 0 . BaseQRankSum=0.483;DP=850;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.62;MQRankSum=-5.791;ReadPosRankSum=-0.468 GT:AD:GQ:PL:SB 0/0:760,55,0:68:0,68,32399,2359,32566,34857:328,432,27,28
    > 1 142537799 . C <NON_REF> . . END=142537799 GT:DP:GQ:MIN_DP:PL 0/0:954:0:954:0,0,23379

    Should I still proceed to run HC with -nct 4 or, as it takes only 4h difference to run the whole chr1, I am better running it with -nct 1?

    Issue · Github
    by Sheila

    Issue Number
    1090
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @txemaheredia
    Hi Txema,

    The difference in low quality calls and downsampled calls is expected. However, I will put in a ticket to examine the higher quality call differences. For the most consistent results, it is best to use -nct 1. I am going to put in a ticket to have this examined. Another user has reported inconsistent results when using -nct 4 as well. http://gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller#latest

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @txemaheredia
    Hi Txema,

    It turns out there is really nothing to worry about here. The differences are minor and well within expectations for downsampling-related changes. However, I may have mislead you about the downsampling that occurs. There is downsampling with nct and without nct, but the seed is different, so different subsets of reads are used. By "the seed" I mean a random number that is generated to determine how downsampling will occur.

    It is important to note that neither set of results is objectively better than the other, so if speed matters to you, it is better to not give up the parallelism. However, if it matters that the runs have absolute internal consistency, you should get rid of parallelism.

    Please also note: There is no indication those 10 calls aren't made at all without nct, just that they're not made with a GQ between 30 and 99. The GQ may just have dropped in one of the runs.

    I hope this helps.

    -Sheila

Sign In or Register to comment.