An Indel was called by HaplotypeCaller, passed SNPFilter, but the AD for alt allele is low

sgoyal (San Carlos, Member)

I am using openjdk-8-jre-headless and GATK version 3.7-0-gcfedb67.
I ran a BAM file through HaplotypeCaller and then VariantFiltration, both from GATK 3.7.

Commands:

java -jar GenomeAnalysisTK_3.7.jar -T HaplotypeCaller -I sample.bam -R hg37.fasta --genotyping_mode DISCOVERY -stand_call_conf 30 -L targets.interval_list -nct 8 -o sample.vcf.gz

java -jar GenomeAnalysisTK_3.7.jar -T VariantFiltration -V sample.vcf.gz -R hg37.fasta --filterName SnpFilter --filterExpression 'QD<2.0||FS>60.0||MQ<40.0||MQRankSum<-12.5||ReadPosRankSum<-8.0' --filterName DPLow --filterExpression 'DP<30' --filterName GQLow --filterExpression 'GQ!=-1&&GQ<30' -o sample.vcf.gz

There were two indels of interest in the VCF output.

INDEL 1
6   137143793   .   C   CCGCGTG 723.73  PASS    AC=1;ACov=365.79;AF=0.500;AN=2;BaseQRankSum=-1.732;ClippingRankSum=0.000;DP=257;ExcessHet=3.0103;FS=16.829;GQ=99;Group=group1;MLEAC=1;MLEAF=0.500;MQ=70.00;MQRankSum=0.000;QD=3.11;ReadPosRankSum=-1.522;SOR=2.200;TCov=55235   GT:AD:DP:GQ:PL  0/1:231,2:233:99:761,0,8880
INDEL 2
6   137143796   .   C   CGGGGGGGGGG 491.73  IndelFilter AC=1;ACov=365.79;AF=0.500;AN=2;BaseQRankSum=-8.303;ClippingRankSum=0.000;DP=302;ExcessHet=3.0103;FS=294.284;GQ=99;Group=group1;MLEAC=1;MLEAF=0.500;MQ=70.00;MQRankSum=0.000;QD=1.78;ReadPosRankSum=-9.180;SOR=7.657;TCov=55235  GT:AD:DP:GQ:PL  0/1:232,45:277:99:529,0,9061

Indel 2 failed the indel filter and indel 1 passed the filter.
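
As a rough cross-check, the annotation values in the two records above can be compared by hand against the hard-filter thresholds. The sketch below assumes the thresholds from the SnpFilter expression in the command above; the records show a filter named IndelFilter, whose exact expression is not given in this post, so that is an assumption. The function names are made up for illustration, and only the annotations used by the expression are copied from the records.

```python
# Sketch: re-evaluate the hard-filter thresholds against INFO annotations.
# Assumption: thresholds are those of the SnpFilter expression in the post;
# the IndelFilter expression that was actually applied is not shown there.

def parse_info(info):
    """Turn 'KEY=VALUE;KEY=VALUE' into a dict (values stay strings)."""
    return dict(field.split("=", 1) for field in info.split(";") if "=" in field)

def failed_checks(info):
    """Return the sub-expressions of the filter that evaluate to true."""
    a = parse_info(info)
    checks = {
        "QD<2.0": float(a["QD"]) < 2.0,
        "FS>60.0": float(a["FS"]) > 60.0,
        "MQ<40.0": float(a["MQ"]) < 40.0,
        "MQRankSum<-12.5": float(a["MQRankSum"]) < -12.5,
        "ReadPosRankSum<-8.0": float(a["ReadPosRankSum"]) < -8.0,
    }
    return [name for name, hit in checks.items() if hit]

# Relevant annotations copied from the two VCF records.
indel1 = "FS=16.829;MQ=70.00;MQRankSum=0.000;QD=3.11;ReadPosRankSum=-1.522"
indel2 = "FS=294.284;MQ=70.00;MQRankSum=0.000;QD=1.78;ReadPosRankSum=-9.180"

print("indel 1 fails:", failed_checks(indel1))
print("indel 2 fails:", failed_checks(indel2))
```

Under that assumption, indel 1 trips none of the thresholds, while indel 2 trips the QD, FS and ReadPosRankSum checks. None of these thresholds looks at the per-sample AD, so a low alt allele depth alone does not cause a filter failure here.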

My question is: why was indel 1 called by HaplotypeCaller, and why does it have quality values high enough to pass VariantFiltration? The indel does not look real, because the allele depth is 2 for the indel and 231 for the ref allele. Is there a reason why something like this would be called?

Thank you in advance for the help.

Answers

  • bshifaw (moon; Member, Broadie, Moderator, admin)

    Hi @sgoyal ,

    Here is a bit of background on the tool from the dev team.

    Genotyping is performed based on the phred-scaled genotype likelihoods (PLs), which for the first indel clearly do support a het call (the second value is 0, and the other two are very high). The AD values include reads which may have been filtered due to very low base quality, and so not included in the genotype likelihood calculation.

    It's hard to judge what is going on based on the info above. Would you mind checking the base qualities at the site of interest? If the qualities of the reads are all good (>10), and you can share the sample bam, we can have a look to hunt down exactly what is going on.
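
To make the PL-versus-AD point concrete, here is a minimal sketch that reads the sample columns of the two records from the question. It assumes a biallelic diploid site, where the PL values are listed in the order 0/0, 0/1, 1/1; the helper names are hypothetical and not part of GATK.

```python
# Sketch: compare what the PLs say with what the AD counts say.
# Assumption: biallelic diploid site, so PL order is 0/0, 0/1, 1/1.

def parse_sample(fmt, sample):
    """Pair FORMAT keys with the sample's values, e.g. {'GT': '0/1', ...}."""
    return dict(zip(fmt.split(":"), sample.split(":")))

def summarize(fmt, sample):
    fields = parse_sample(fmt, sample)
    ad = [int(x) for x in fields["AD"].split(",")]
    pl = [int(x) for x in fields["PL"].split(",")]
    genotypes = ["0/0", "0/1", "1/1"]
    best = genotypes[pl.index(min(pl))]   # PL of 0 marks the most likely genotype
    gap = sorted(pl)[1] - min(pl)         # phred-scaled margin over the runner-up
    alt_fraction = ad[1] / sum(ad)        # share of AD reads supporting the alt
    return best, gap, alt_fraction

FORMAT = "GT:AD:DP:GQ:PL"
indel1 = "0/1:231,2:233:99:761,0,8880"
indel2 = "0/1:232,45:277:99:529,0,9061"

for name, sample in [("indel 1", indel1), ("indel 2", indel2)]:
    best, gap, frac = summarize(FORMAT, sample)
    print(f"{name}: best genotype {best}, PL gap {gap}, alt fraction {frac:.4f}")
```

For indel 1 this gives a het call with a PL gap of 761 (GQ is reported capped at 99), even though only 2 of the 233 AD reads, under 1%, support the insertion. That mismatch is what checking the base qualities at the site should help explain.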
