Haplotype caller filter settings - Samtools called variant is filtered out with HaplotypeCaller

kjprkjpr LondonMember

Hi I am new to using HC. I am just using the HapMap sample NA12878 to validate my pipeline. I ran the following command on a BAM file that I generated. I noticed that certain SNPs were filtered out most of which I can understand the rationale.

java -Xms454m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/home/tx/tmpYdAKAC -jar /usr/local/share/gatk/3.4-46-gbc02625/GenomeAnalysisTK.jar \ -R /usr/local/share/genomes/Hsapiens/GRCh37/seq/GRCh37.fa \ -I /home/INDELanalysis/validationrun/NA12878-sort-11_0_15994663-prep.bam \ --dbsnp /usr/local/share/genomes/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \ -L 11:1628982-1651500 \ -T HaplotypeCaller \ -o /home/INDELanalysis/validationrun/NA12878-11_0_15994663-raw.vcf.gz

I ran the same analysis using Samtools and the following SNP was in the final filtered output

Samtools output 11 1651228 rs71454096 C G 117 PASS AC=1;AN=2;BQB=0.61207;BaseQRankSum=-0.458;DB;DP=21;DP4=11,0,5,3;FS=7.225;GC=71.29;HOB=0.5;HRun=3;ICB=1;MQ=56.97;MQ0=0;MQ0F=0;MQB=0.395294;MQRankSum=-1.091;MQSB=0.798516;QD=5.57;RPB=0.902014;ReadPosRankSum=-1.162;SGB=-0.651104;VDB=0.0103095;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCg/gGg|A53G|237|KRTAP5-5|protein_coding|CODING|ENST00000399676|1|G) GT:DP:PL 0/1:19:150,0,158

However when I run it through HC with the above settings the above variant is filtered out of the analysis. When I ran HC with --emitRefConfidence GVCF I got the following output for this variant:

HC output 11 1651228 rs71454096 C G,<NON_REF> 0 . DB;DP=20;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.22 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:10,0,0:10:34:0|1:1651120_G_T:0,34,442,34,442,442:10,0,0,0

I am a little confused why it got a QUAL score of 0. So looking nearby I also noticed that there were some sizeable INDELS

11 1651198 . GAGGCTGTGGGGGCTGTGGCTCCGGCTGTGC G,<NON_REF> 0 . DP=25;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.38 GT:AD:DP:GQ:PL:SB 0/0:14,0,0:14:37:0,37,585,42,589,594:12,2,0,0 11 1651199 rs71025763 A AGGCTGTGGCTCC,<NON_REF> 0 . DB;DP=18;MLEAC=0,0;MLEAF=0.00,0.00;MQ=57.74 GT:AD:DP:GQ:PL:SB 0/0:6,0,0:6:28:0,28,332,28,332,332:5,1,0,0

However neither of these INDELS were called either. Can anyone shed some light as to why nothing is called in this region?
If I look at the site
11 1651228
in the bam file I can see that most of the calls were made on the forward strand so there is likely to be some strand bias here and in
actual fact we can see that HC called the genotype 0/0 with only 10 reads contributing to this genotype call. I presume the remaining reads were filtered out. However when I look at the pileup for this position I don't understand why over half the reads were filtered

11 1651228 C CCCCGGGGCCCGGCCCGGCGG BBFIFB<FFIIFFFFFFFBBB

Any advice help would be much appreciated.

Answers

  • kjprkjpr LondonMember

    Ok thinking a little more about this call I mentioned that over half the reads were filtered out and it would seem based on the AD value that the only reads that remained were the ones relating to Reference.
    So below I ran the bam file before and after running it through bamout. The top track is from the BAM file before processing it through bamout. You can clearly see a number of alternative G's being called. The bottom track is the bamout output BAM. The reads aligning in this region look a lot different. For a start there are a number of reads carrying large deletions. So I am guessing that once candidate haplotypes have been calculated and the reads were aligned back then a number of the reads in the original BAM became uninformative (see post on uninformative reads https://www.broadinstitute.org/gatk/guide/topic?name=problems ) and a number of reads which didn't even align to this region in the original BAM now align to this region as a result of the added belief that an alternative haplotype exists in this region. Is this interpretation correct?

    image

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @kjpr
    Hi,

    Yes, it looks like this is in an incredibly messy region. I think you are right about the uninformative reads. There seems to be too many different alleles at the site which is confusing Haplotype Caller.

    -Sheila

  • kjprkjpr LondonMember

    Thanks Sheila for replying. So does this mean that the conclusion is that HC is unable to reliably determine any calls in this regions that is to say we are unable to determine that the bases are either REF or ALT? Is there a way in HC to flag these regions as such?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @kjpr
    Hi,

    Because the data is not clean, Haplotype Caller cannot make any definite calls in the region. Unfortunately, there is no way to flag those messy regions in Haplotype Caller. It is up to you to do some QC on your data and find the regions that are callable.

    -Sheila

  • kjprkjpr LondonMember

    OK thank you Sheila. I guess the idea is that the reads show strong evidence for the presence of specific haplotypes but that means that re-alignment of these reads through re-assembly of these haplotypes results in certain messy areas which I will have to go through and assess.

Sign In or Register to comment.