We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

HaplotypeCaller is calling thousands of variants that don't seem to be there

I'm baffled as to what might be wrong here. I use several different callers and downstream I try to locate each called variant in an mpileup. GATK produces thousands of variants for which my code doesn't find it in the pileup. I took a look at a few of those SNVs in a genome browser, and there are no SNVs present. I turned off all read filtering in the browser, and still, the area where the SNV is called is pretty clean, no reads at all at the called position that differ from the reference. Although suspiciously, 2 bp downstream of the called SNV (which had a depth of 2), there IS an SNV of depth 2, but the Ref and Alt are completely different than the called SNV.

Here is the GATK .vcf entry for that variant:

X 69625646 . A G 40.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.981;ClippingRankSum=0.000;DP=7;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=5.82;ReadPosRankSum=-1.981;SOR=1.802 GT:AD:DP:GQ:PL 0/1:5,2:7:69:69,0,1786

The other four callers are not having this problem, and my program locates almost all of their called variants in the mpileup. This is really strange, I can't think of anything else to look at. Ideas?

I'm pasting the start of the HaplotypeCaller output below, if that is of any help.

Calling SNP/INDEL Germline Variants with GATK Haplotype Caller v3.6 in Single Sample Mode
mkdir -p GRM_VARS
/share/carvajal-archive/PACKAGES/src/java/jdk1.8.0_72/bin/java -Xmx4g -Djava.io.tmpdir=/share/carvajal-archive/tmp -jar /share/carvajal-archive/PACKAGES/src/GATK/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar --analysis_type HaplotypeCaller --reference_sequence /share/carvajal-archive/REFERENCE_DATA/GATK_Bundle/b37/human_g1k_v37_decoy.fasta --intervals TargetRgns_PaddedMerged.bed --input_file MAP/3-CG-103T_recal.bam --genotyping_mode DISCOVERY --standard_min_confidence_threshold_for_emitting 30 --standard_min_confidence_threshold_for_calling 30 --dbsnp /share/carvajal-archive/REFERENCE_DATA/GATK_Bundle/b37/dbsnp_138.b37.vcf --out GRM_VARS/3-CG-103T.gatk.vcf 2>&1 | tee GRM_VARS/3-CG-103T.gatk.vcf.htc_single.log
INFO 15:58:58,694 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:58:58,697 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
INFO 15:58:58,698 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 15:58:58,698 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO 15:58:58,699 HelpFormatter - [Mon Nov 21 15:58:58 PST 2016] Executing on Linux 4.4.0-45-generic amd64
INFO 15:58:58,699 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_72-b15 JdkDeflater
INFO 15:58:58,703 HelpFormatter - Program Args: --analysis_type HaplotypeCaller --reference_sequence /share/carvajal-archive/REFERENCE_DATA/GATK_Bundle/b37/human_g1k_v37_decoy.fasta --intervals TargetRgns_PaddedMerged.bed --input_file MAP/3-CG-103T_recal.bam --genotyping_mode DISCOVERY --standard_min_confidence_threshold_for_emitting 30 --standard_min_confidence_threshold_for_calling 30 --dbsnp /share/carvajal-archive/REFERENCE_DATA/GATK_Bundle/b37/dbsnp_138.b37.vcf --out GRM_VARS/3-CG-103T.gatk.vcf
INFO 15:58:58,710 HelpFormatter - Executing as [email protected] on Linux 4.4.0-45-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_72-b15.
INFO 15:58:58,711 HelpFormatter - Date/Time: 2016/11/21 15:58:58
INFO 15:58:58,711 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:58:58,712 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:58:58,797 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:58:58,993 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 15:58:59,001 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 15:58:59,040 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO 15:58:59,061 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 15:59:05,474 IntervalUtils - Processing 100696230 bp from intervals
INFO 15:59:05,825 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 15:59:11,793 GenomeAnalysisEngine - Done preparing for traversal
INFO 15:59:11,795 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 15:59:11,796 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 15:59:11,798 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output
INFO 15:59:11,853 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
WARN 15:59:11,853 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
INFO 15:59:11,854 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
INFO 15:59:11,923 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
Using AVX accelerated implementation of PairHMM
INFO 15:59:18,442 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 15:59:18,443 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
WARN 15:59:18,685 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper
INFO 15:59:41,802 ProgressMeter - 1:11562056 4.69755576E8 30.0 s 0.0 s 0.8% 65.8 m 65.3 m

Best Answers


Sign In or Register to comment.