Our documentation websites are currently offline due to a data center fire. We do not yet have an ETA for restoring service; we’ll update this message when we know more.

Inconsistent results with HaplotypeCaller on haploid organism

Hello GATK team,

I would appreciate some help in understanding how GATK works in GVCF mode on my data.
Here is my data example I'm usign GATK v3.8:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 328-16 983-16 NC_018661.1 859953 . C T > 31035.30 SnpCluster AC=15;AF=0.536;AN=28;BaseQRankSum=-1.134e+00;ClippingRankSum=0.00;DP=7961;FS=3.292;MLEAC=15;MLEAF=0.536;MQ=56.52;MQRankSum=0.00;QD=7.76;ReadPosRankSum=-5.480e-01;SOR=0.486 GT:AD:DP:GQ:PL 0:354,0:354:99:0,106 1:157,157:314:99:361,0

  • First thing weird is that the variant is in heterocygosis with the highest GQ (99) when we are analyzing an haploid sample, this is different from the explanation given in this post

  • Second issue appears when we observe this position using IGV in our aligned reads using bwa mem(bam format). Here we see that both samples seem to have this site as AD 50%, but HaplotypeCaller calls it totally different.

This are the parameters we use for one sample:
java -Djava.io.tmpdir=/processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/TMP/ -Xmx10g -jar /opt/g atk/gatk-3.8.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECO LI07_SS_S/REFERENCES/GCF_000299475.1_ASM29947v1_genomic_NoPlasmid.fna -I /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM0 62_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/Alignment/BAM/328-16/328-16.woduplicates.bam -o /processing_Data/bioinformatics/services_and_colaboratio ns/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/variants/328-16.g.vcf -stand_call_conf 30 --em itRefConfidence GVCF -ploidy 1 -S LENIENT -log /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/A NALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/snp_indels.vcf-HaplotypeCaller.log

How is this even possible? (I have infinite-checked that bam files are the same used in IGV and passed to GATK, you never know...)

Could be the effect referred in this thread be somehow affecting the variant calling? Should we use BP_Resolution? Which is the main difference between GVCF and BP_RESOLUTION mode?

Our first idea is select by AD our GVCFs using JEXL expressions but as GVCF has reference blocks with no AD the command fails:

ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![35,47]: 'vc.getGenotype('328-16').getAD().1.floatValue() / vc.getGenotype('328-16').getDP() > 0.90;' attempting to call method on null

I could filter them manually before GenotypeGVCFs but, it is a good practice? As I read in this thread this is not recommended, obviously because we override GATK model which takes a lot more of variables into account...
Any ideas? We are kind of struggling, maybe is something trivial but we can't see it, any help will be much appreciate.

Thanks very much in advance,
Best Regards
Sara

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sara_mfdez
    Hi Sara,

    Unfortunately, that is another "quirk" of the haploid model. I suspect even though the ADs are the same, the variant allele reads have higher base qualities, which causes the model to output a higher likelihood of the site being variant.

    You can read more about this here.

    -Sheila

  • Thank you very much for the response. I hope this could be resolved at some point.

    For now I will try to make an AD filter for keep only sites with 0.9 of variant fraction. How can I do this, I try to filter them in the gvcf using JEXL expressions but I get this error due to REF blocks:

    ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![35,47]: 'vc.getGenotype('328-16').getAD().1.floatValue() / vc.getGenotype('328-16').getDP() > 0.90;' attempting to call method on null

    Is there any way to make the filter in all the samples in the vcf using JEXL? Or how can a do this using GATK commands? A little help would be appreciated.

    I see in the post you link me that this user is doing something like this for filtering:
    GT:AD:DP:GQ:PL:FT
    1:49,46:95:99:555,0:G_readfrac0.9

    Maybe it can't be done and I have to parse the vcf manually?

    Thanks in advance
    Regards
    Sara

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited March 22

    Hi @sara_mfdez,

    Sheila is away at a workshop so let me see if I can help.

    For one, please be aware of the way AD is calculated by reading this article. Perhaps a better annotation to filter on is DP or even QD (qual by depth). Please see this article for expression examples to use with VariantFiltration or SelectVariants. Here's an example command:

    java -jar $GATK -T VariantFiltration -R ref/human_g1k_b37_20.fasta -V sandbox/platinum_NA12878_SNP.vcf.gz \
        --filterExpression "QD < 3.4 || FS > 81.0 || MQ < 10.0 || MQ > 100.0 || MQRankSum < -14.5 || ReadPosRankSum < -9.2 || ReadPosRankSum > 8.1 || DP_Orig > 4000 || SOR > 5.7 || QUAL < 10 || QUAL > 160000" \
        --filterName "shlee_filter" \
        -o sandbox/NA12878_SNPs_chr20_hardfilter22.vcf.gz
    

    P.S. You can also specify the filter expression multiply, as shown in this thread.

Sign In or Register to comment.