Inconsistent results with HaplotypeCaller on haploid organism
Hello GATK team,
I would appreciate some help in understanding how GATK works in GVCF mode on my data.
Here is my data example I'm usign GATK v3.8:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 328-16 983-16 NC_018661.1 859953 . C T > 31035.30 SnpCluster AC=15;AF=0.536;AN=28;BaseQRankSum=-1.134e+00;ClippingRankSum=0.00;DP=7961;FS=3.292;MLEAC=15;MLEAF=0.536;MQ=56.52;MQRankSum=0.00;QD=7.76;ReadPosRankSum=-5.480e-01;SOR=0.486 GT:AD:DP:GQ:PL 0:354,0:354:99:0,106 1:157,157:314:99:361,0
First thing weird is that the variant is in heterocygosis with the highest GQ (99) when we are analyzing an haploid sample, this is different from the explanation given in this post
Second issue appears when we observe this position using IGV in our aligned reads using bwa mem(bam format). Here we see that both samples seem to have this site as AD 50%, but HaplotypeCaller calls it totally different.
This are the parameters we use for one sample:
java -Djava.io.tmpdir=/processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/TMP/ -Xmx10g -jar /opt/g atk/gatk-3.8.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECO LI07_SS_S/REFERENCES/GCF_000299475.1_ASM29947v1_genomic_NoPlasmid.fna -I /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM0 62_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/Alignment/BAM/328-16/328-16.woduplicates.bam -o /processing_Data/bioinformatics/services_and_colaboratio ns/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/variants/328-16.g.vcf -stand_call_conf 30 --em itRefConfidence GVCF -ploidy 1 -S LENIENT -log /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/A NALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/snp_indels.vcf-HaplotypeCaller.log
How is this even possible? (I have infinite-checked that bam files are the same used in IGV and passed to GATK, you never know...)
Could be the effect referred in this thread be somehow affecting the variant calling? Should we use BP_Resolution? Which is the main difference between GVCF and BP_RESOLUTION mode?
Our first idea is select by AD our GVCFs using JEXL expressions but as GVCF has reference blocks with no AD the command fails:
ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![35,47]: 'vc.getGenotype('328-16').getAD().1.floatValue() / vc.getGenotype('328-16').getDP() > 0.90;' attempting to call method on null
I could filter them manually before GenotypeGVCFs but, it is a good practice? As I read in this thread this is not recommended, obviously because we override GATK model which takes a lot more of variables into account...
Any ideas? We are kind of struggling, maybe is something trivial but we can't see it, any help will be much appreciate.
Thanks very much in advance,