To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Multithreaded HaplotypeCaller gives different GVCF files to single-core run

Hi,

I've been recreating a bioinformatics pipeline for paired WGS data which largely follows the GATK best practices, but I'm having issues with HaplotypeCaller to create a GVCF file for one individual. I'm using GATK 3.4 with java 1.8.0_141. The BAM file was assembled with bwa, and I've marked duplicates, locally realigned around indels and recalibrated the base quality scores.

I understand that it's not recommended to use multiple CPU threads with HC, but if there is no crash the results should be the same. I ran HC as follows:

gatk34 -T HaplotypeCaller -R /home/shared/reference/1000Genome/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -I K1561-4464.raw-reads.header.sorted.nodup.rg.realign.bsqr.bam -o K1561-4464.raw-reads.g.vcf -ERC GVCF --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation GCContent --annotation HaplotypeScore --annotation HomopolymerRun --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation DepthPerAlleleBySample --annotation Coverage --annotation ClippingRankSumTest --annotation DepthPerSampleHC --annotation StrandBiasBySample --dbsnp /home/shared/reference/1000Genome/GRCh38/dbsnp_146.hg38.vcf.gz --excludeAnnotation ChromosomeCounts --excludeAnnotation FisherStrand --excludeAnnotation StrandOddsRatio --excludeAnnotation QualByDepth --GVCFGQBands 10 --GVCFGQBands 20 --GVCFGQBands 30 --GVCFGQBands 40 --GVCFGQBands 60 --GVCFGQBands 80 --standard_min_confidence_threshold_for_calling 0 --interval_set_rule INTERSECTION --read_filter BadCigar --read_filter NotPrimaryAlignment --unsafe LENIENT_VCF_PROCESSING --variant_index_parameter 128000 --variant_index_type LINEAR

I also ran this with -nct 4 and with -nct 16 to generate GVCF1, GVCF2 and GVCF3 respectively. The multi-threaded runs ran without and errors reported. However, the three GVCF files are not all the same. I ran GenotypeConcordance on all pairs of the GVCF files - I can't seem to attach the output files here, but here are the Genotype Concordance Counts for GVCF1 vs GVCF2

#:GATKTable:38:2:%s:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:;
#:GATKTable:GenotypeConcordance_Counts:Per-sample concordance tables: comparison counts
Sample      NO_CALL_NO_CALL  NO_CALL_HOM_REF  NO_CALL_HET  NO_CALL_HOM_VAR  NO_CALL_UNAVAILABLE  NO_CALL_MIXED  HOM_REF_NO_CALL  HOM_REF_HOM_REF  HOM_REF_HET  HOM_REF_HOM_VAR  HOM_REF_UNAVAILABLE  HOM_REF_MIXED  HET_NO_CALL  HET_HOM_REF  HET_HET  HET_HOM_VAR  HET_UNAVAILABLE  HET_MIXED  HOM_VAR_NO_CALL  HOM_VAR_HOM_REF  HOM_VAR_HET  HOM_VAR_HOM_VAR  HOM_VAR_UNAVAILABLE  HOM_VAR_MIXED  UNAVAILABLE_NO_CALL  UNAVAILABLE_HOM_REF  UNAVAILABLE_HET  UNAVAILABLE_HOM_VAR  UNAVAILABLE_UNAVAILABLE  UNAVAILABLE_MIXED  MIXED_NO_CALL  MIXED_HOM_REF  MIXED_HET  MIXED_HOM_VAR  MIXED_UNAVAILABLE  MIXED_MIXED  Mismatching_Alleles
ALL                       0                0            0                0                    0              0                0        244795804          575               25                 2908              0            0          265  3465716           46              325          0                0                3           54          1777987                   10              0                    0                 2705              302                   21                        0                  0              0              0          0              0                  0            0                  697
K1561-4464                0                0            0                0                    0              0                0        244795804          575               25                 2908              0            0          265  3465716           46              325          0                0                3           54          1777987                   10              0                    0                 2705              302                   21                        0                  0              0              0          0              0                  0            0                  697

The other output files were quite similar, with some variants being called differently between the files. I've run ValiateVariants on the three GVCF files, with no failures reported. I'm getting a PairInfoMap error with ValidateSamFile, not sure if this is related.

Is there a reason why the output GVCF files are different? I'm unsure which file to use, or if it will make a difference downstream.

Best Answer

Answers

Sign In or Register to comment.