To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Differences between GATK 4.beta.5 vs 4.0.0.0 HaplotypeCaller results

cjm2200cjm2200 nycMember
edited January 17 in Ask the GATK team

Hi!
I'd like to perform short germline variant calling on human DNA-seq samples (separate analysis of WES cohort and PCR-free WGS cohort, both paired end). The plan is to follow GATK best practices of short variant discovery with joint genotyping, starting with single-sample gVCF creation through HaplotypeCaller.

I have analyzed some samples with HaplotypeCaller 4.beta.5, and was unsure whether there had been any fixes between 4.beta.5 and 4.0.0.0 that would necessitate re-running the samples.

To check, I ran the 4.beta.5 and 4.0.0.0 HaplotypeCaller on chr21 of 1000Genomes sample NA11992 link.

I ran the same command, on the same machine, in different conda environments:

4.beta.5

gatk4                     4.0b5                    py27_0    bioconda
picard                    2.16.0                   py27_0    bioconda
setuptools                38.2.4                   py27_0    conda-forge
wheel                     0.30.0                     py_1    conda-forge

4.0.0.0

gatk4                     4.0.0.0                  py27_0    bioconda
picard                    2.17.2                   py27_0    bioconda
setuptools                38.4.0                   py27_0    conda-forge
wheel                     0.30.0                   py27_2    conda-forge

Java

$ java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (Zulu 8.20.0.5-linux64) (build 1.8.0_121-b15)
OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-linux64) (build 25.121-b15, mixed mode)

GATK command

gatk-launch HaplotypeCaller -R $refs/GRCh37.71.nochr.fa -I $data/NA11992.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam -O ${version}.NA11992.21.default.vcf.gz -ERC GVCF -L 21

Results showed 9 differences using diff (10 including the the different ##GATKCommandLine in the gVCF header). Sometimes PL or SB fields change, sometimes non-variant blocks are subdivided differently, and sometimes indel calls change. Three differences are below:

$diff 4.0.0.0.NA11992.21.default.vcf 4.beta.5.NA11992.21.default.vcf
...
36482c36480
< 21    19701769        .       AT      A,<NON_REF>     33.73   .       BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PGT:PID:PL:SB
       0/1:2,3,0:5:66:0|1:19701769_AT_A:71,0,66,77,75,152:0,2,1,2
---
> 21    19701769        .       AT      A,<NON_REF>     33.73   .       BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PL:SB       0/1:2,3,0:5:66:71,0,66,77,76,152:0,2,1,2

36485,36486c36483,36485
< 21    19701776        .       T       <NON_REF>       .       .       END=19701778    GT:DP:GQ:MIN_DP:PL      0/0:6:12:6:0,12,180
< 21    19701779        .       TG      T,<NON_REF>     19.78   .       BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PGT:PID:PL:SB
       0/1:4,2,0:6:57:1|0:19701769_AT_A:57,0,146,69,152,221:1,3,0,2
---
> 21    19701776        .       T       <NON_REF>       .       .       END=19701777    GT:DP:GQ:MIN_DP:PL      0/0:6:12:6:0,12,180
> 21    19701778        .       TTG     T,<NON_REF>     0       .       DP=6;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=19369.00 GT:AD:DP:GQ:PL:SB       0/0:6,0,0:6:18:0,18,203,18,203,203:1,5,0,0
> 21    19701779        .       TG      T,<NON_REF>     3.96    .       BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PL:SB       0/1:4,2,0:6:39:39,0,146,51,152,204:1,3,0,2

54434c54433,54435
< 21    26959938        .       A       <NON_REF>       .       .       END=26959949    GT:DP:GQ:MIN_DP:PL      0/0:9:24:8:0,24,260
---
> 21    26959938        .       A       <NON_REF>       .       .       END=26959943    GT:DP:GQ:MIN_DP:PL      0/0:8:24:8:0,24,260
> 21    26959944        .       C       <NON_REF>       .       .       END=26959944    GT:DP:GQ:MIN_DP:PL      0/0:9:27:9:0,27,275
> 21    26959945        .       A       <NON_REF>       .       .       END=26959949    GT:DP:GQ:MIN_DP:PL      0/0:9:24:9:0,24,360
...

Overall, can I use 4.beta.5 HaplotypeCaller results in downstream analysis, or should results be re-analyzed with 4.0.0.0 HaplotypeCaller? This analysis showed relatively few differences, but I'm still unsure about 4.beta.5 HaplotypeCaller.

Best Answer

Answers

Sign In or Register to comment.