Why are BaseRecalibrator and PrintReads results not reproducible?

Hi GATK team,
I found that BaseRecalibrator and PrintReads do not give reproducible results.
For example, when I run BaseRecalibrator multiple times on the same input BAM, the output BQSR tables sometimes differ slightly. Here is the 'diff' output for two BQSR tables from repeated runs:

1726c1726
< XXXXXXXXXXXXXXXXXXXXX            15  1               Cycle          M                   19.0000         86407     1167.26
---
> XXXXXXXXXXXXXXXXXXXXX            15  1               Cycle          M                   18.0000         86407     1167.26

PrintReads also produces different results when I run it twice with the same BQSR table and BAM as input. It generates different base qualities:

YYYYYYYYYYYYYYYYYYYYY     69      chrM    1       0       *       =       1       0       GAGGCCTTCTTGTTTCCTGACAGTTCCACATACTGTGCTCCGGCTCCAGC      @[email protected]&D)(&&&CDDBB#2F9E<[email protected]<D?8?7      MC:Z:11S39M     BD:Z:LLOOSRPMMMLLLLBKMLONNKOOKKMMLLOLONPLNQPMPNNQROOOQQ PG:Z:MarkDuplicates     RG:Z:RGID1 BI:Z:NNQPPKLMNPNOONENNPPQPNQQNOOQPORPORRQQTTORRRSSNOQRR AS:i:0  XS:i:0
---
YYYYYYYYYYYYYYYYYYYYY     69      chrM    1       0       *       =       1       0       GAGGCCTTCTTGTTTCCTGACAGTTCCACATACTGTGCTCCGGCTCCAGC      @[email protected]&D)(&&&CDDBB#2F9E<[email protected]<D?8?7      MC:Z:11S39M     BD:Z:LLOOSRPMMMLLLLBKMLONNKOOKKMMLLOLONPLNQPMPNNQROOOQQ PG:Z:MarkDuplicates     RG:Z:RGID1 BI:Z:NNQPPKLMNPNOONENNPPQPNQQNOOQPORPORRQQTTORRRSSNOQRR AS:i:0  XS:i:0

What I want to know is:
1. Why does this happen?
2. If there is some randomness in the algorithm, is there an option to make the results reproducible?

Commands:
java -Xmx4G -Djava.io.tmpdir=java_tmp -jar GenomeAnalysisTK.jar -nct 5 -T BaseRecalibrator -R hg19.fasta -I test.realign.bam -knownSites dbsnp_138.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -o test.realign.recal.table

java -Xmx4G -Djava.io.tmpdir=java_tmp -jar GenomeAnalysisTK.jar -nct 5 -T PrintReads -R hg19.fasta -I test.realign.bam -BQSR test.realign.recal.table -o test.realign.recal.bam

Comments

  • valentin, Cambridge, MA (Member, Dev)

    Multithreaded tool runs may produce slightly different results, because all the threads share the same random number generator and may end up processing the input in a different order. Although I cannot say off the top of my head which part of BaseRecalibrator or PrintReads would be affected by this, I bet that this is the reason.

    The solution is then to give up parallelism. However, if there are already differences in the recalibration table between BaseRecalibrator runs, you might be able to keep PrintReads parallel. If there are none, then you could keep BaseRecalibrator parallel instead. Nevertheless, BaseRecalibrator (BR) is the most likely offender.
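
To illustrate the mechanism described in this reply, here is a toy sketch in Python. It is not GATK code; the read names, the seed, and the `assign_draws` helper are made up for illustration. The idea is that when all workers draw from one shared random number generator, each read gets whichever draw comes next, so the value it receives depends on the order in which reads happen to be processed.

```python
import random


def assign_draws(read_order, seed=42):
    """Give each read the next draw from one shared generator, in arrival order."""
    rng = random.Random(seed)  # a single generator shared by all "threads"
    return {read: rng.random() for read in read_order}


# Two possible arrival orders for the same three reads (hypothetical names).
run_a = assign_draws(["read1", "read2", "read3"])
run_b = assign_draws(["read2", "read1", "read3"])
run_c = assign_draws(["read1", "read2", "read3"])

print(run_a == run_c)  # same order and seed: every read gets the same draw
print(run_a == run_b)  # read1 and read2 swapped places, so their draws swap too
```

With a single thread the processing order is fixed, so the situation is always the `run_a` versus `run_c` case. With several threads the scheduler decides the order, and the `run_a` versus `run_b` case can occur on some runs but not others.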

  • aaronico, China (Member)

    @valentin, thanks for your reply.

    Following your suggestion, I tested 4 times with '-nt 1 -nct 1', and both BaseRecalibrator and PrintReads gave me identical results.

    But I still cannot conclude that multithreading is what makes the results different, because the last time I ran 4 repeats with multithreading, only 1 differed from the other three. I cannot tell what would happen the next time with a single thread.

    So I would still like a definite answer, and to know how to avoid this.


  • valentin, Cambridge, MA (Member, Dev)

    @aaronico

    But I still cannot conclude that multithreading is what makes the results different, because the last time I ran 4 repeats with multithreading, only 1 differed from the other three. I cannot tell what would happen the next time with a single thread.

    That is actually proof that it is due to multithreading: the results do not need to be different every time. Multithreading does not guarantee that every run will differ; rather, you cannot assume that the results will be the same every time.
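
For anyone who wants to check this over more repeats, a small helper like the following could group the outputs of repeated runs by checksum. This is only a sketch in Python: the function names and the file names in the usage note are hypothetical. It compares files byte for byte, so it assumes the outputs contain nothing that legitimately varies between runs, such as a different output path recorded in a header.

```python
import hashlib
from collections import defaultdict


def group_by_digest(paths, chunk_size=1 << 20):
    """Map each SHA-256 digest to the list of files that have that content."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in chunks so that large BAM files do not need to fit in memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        groups[digest.hexdigest()].append(path)
    return dict(groups)


def is_reproducible(paths):
    """Return True only if every file in a non-empty list has identical content."""
    return len(group_by_digest(paths)) == 1
```

For example, `group_by_digest(["run1.recal.table", "run2.recal.table", "run3.recal.table", "run4.recal.table"])` would return two groups in the situation described above, where one of four multithreaded runs differed from the other three. Identical outputs over a handful of runs are consistent with reproducibility but do not prove it, which is the point made in the reply above.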
