Differences between GATK 4.beta.5 vs 4.0.0.0 HaplotypeCaller results

cjm2200cjm2200 nycMember
edited January 17 in Ask the GATK team

Hi!
I'd like to perform short germline variant calling on human DNA-seq samples (separate analysis of WES cohort and PCR-free WGS cohort, both paired end). The plan is to follow GATK best practices of short variant discovery with joint genotyping, starting with single-sample gVCF creation through HaplotypeCaller.

I have analyzed some samples with HaplotypeCaller 4.beta.5, and was unsure whether there had been any fixes between 4.beta.5 and 4.0.0.0 that would necessitate re-running the samples.

To check, I ran the 4.beta.5 and 4.0.0.0 HaplotypeCaller on chr21 of 1000Genomes sample NA11992 link.

I ran the same command, on the same machine, in different conda environments:

4.beta.5

gatk4                     4.0b5                    py27_0    bioconda
picard                    2.16.0                   py27_0    bioconda
setuptools                38.2.4                   py27_0    conda-forge
wheel                     0.30.0                     py_1    conda-forge

4.0.0.0

gatk4                     4.0.0.0                  py27_0    bioconda
picard                    2.17.2                   py27_0    bioconda
setuptools                38.4.0                   py27_0    conda-forge
wheel                     0.30.0                   py27_2    conda-forge

Java

$ java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (Zulu 8.20.0.5-linux64) (build 1.8.0_121-b15)
OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-linux64) (build 25.121-b15, mixed mode)

GATK command

gatk-launch HaplotypeCaller -R $refs/GRCh37.71.nochr.fa -I $data/NA11992.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam -O ${version}.NA11992.21.default.vcf.gz -ERC GVCF -L 21

Results showed 9 differences using diff (10 including the the different ##GATKCommandLine in the gVCF header). Sometimes PL or SB fields change, sometimes non-variant blocks are subdivided differently, and sometimes indel calls change. Three differences are below:

$diff 4.0.0.0.NA11992.21.default.vcf 4.beta.5.NA11992.21.default.vcf
...
36482c36480
< 21    19701769        .       AT      A,<NON_REF>     33.73   .       BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PGT:PID:PL:SB
       0/1:2,3,0:5:66:0|1:19701769_AT_A:71,0,66,77,75,152:0,2,1,2
---
> 21    19701769        .       AT      A,<NON_REF>     33.73   .       BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PL:SB       0/1:2,3,0:5:66:71,0,66,77,76,152:0,2,1,2

36485,36486c36483,36485
< 21    19701776        .       T       <NON_REF>       .       .       END=19701778    GT:DP:GQ:MIN_DP:PL      0/0:6:12:6:0,12,180
< 21    19701779        .       TG      T,<NON_REF>     19.78   .       BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PGT:PID:PL:SB
       0/1:4,2,0:6:57:1|0:19701769_AT_A:57,0,146,69,152,221:1,3,0,2
---
> 21    19701776        .       T       <NON_REF>       .       .       END=19701777    GT:DP:GQ:MIN_DP:PL      0/0:6:12:6:0,12,180
> 21    19701778        .       TTG     T,<NON_REF>     0       .       DP=6;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=19369.00 GT:AD:DP:GQ:PL:SB       0/0:6,0,0:6:18:0,18,203,18,203,203:1,5,0,0
> 21    19701779        .       TG      T,<NON_REF>     3.96    .       BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PL:SB       0/1:4,2,0:6:39:39,0,146,51,152,204:1,3,0,2

54434c54433,54435
< 21    26959938        .       A       <NON_REF>       .       .       END=26959949    GT:DP:GQ:MIN_DP:PL      0/0:9:24:8:0,24,260
---
> 21    26959938        .       A       <NON_REF>       .       .       END=26959943    GT:DP:GQ:MIN_DP:PL      0/0:8:24:8:0,24,260
> 21    26959944        .       C       <NON_REF>       .       .       END=26959944    GT:DP:GQ:MIN_DP:PL      0/0:9:27:9:0,27,275
> 21    26959945        .       A       <NON_REF>       .       .       END=26959949    GT:DP:GQ:MIN_DP:PL      0/0:9:24:9:0,24,360
...

Overall, can I use 4.beta.5 HaplotypeCaller results in downstream analysis, or should results be re-analyzed with 4.0.0.0 HaplotypeCaller? This analysis showed relatively few differences, but I'm still unsure about 4.beta.5 HaplotypeCaller.

Best Answer

Answers

Sign In or Register to comment.