Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

1bp deletion (Sanger confirmed) not called with HaplotypeCaller and UnifiedGenotyper

sanderpajusalusanderpajusalu Tartu, EstoniaMember

Dear GATK Team,

I have encountered a problem while doing a variant calling. I received raw reads from sequencing company and wanted to do some analysis myself. I did everything according to GATK best practices and everything was fine until I looked at VCF file for called variants and I did not find the variant which was reported to me from sequencing company as a disease causing mutation and therefore was the only one I especially searched for. It is a 1bp heterozygous deletion, which can be seen in IGV picture attached to this post. I have tried to change parameters for both HaplotypeCaller and UnifiedGenotyper, but nothing has made them to call this indel.

The last command I used was:

java -jar ~/NGS_programs/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ~/NGS_data/hg19/ucsc.hg19.fasta -I KO002_P.merged.dedup.realn.bam -stand_call_conf 20 -stand_emit_conf 1 -L chr6:157097064-157533913 -o ARID1B/ARID1B_UG_emit1_call20_INDEL.vcf -glm INDEL

I also have tried lowering the -minIndelCnt, -minIndelFrac arguments, and maximized the --max_deletion_fraction while using UG. But nothing helped.

Do you have any ideas what could be the problem?

Thank you in advance!




  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Sander,

    I would recommend you try calling the interval around this indel with HaplotypeCaller and use the -bamOut argument. This will produce a bam file that shows you what HC is seeing after local de novo reassembly of that region.

  • sanderpajusalusanderpajusalu Tartu, EstoniaMember

    Hi Geraldine,

    Thank you for you reply. I used -bamout argument as you suggested and the result was somewhat surprising for me as the bam file in the region that interests me is kind of empty (please find the IGV image attached - the bottom track is the input bam file and the others came from -bamout argument). After I saw the almost empty bam file I ran the -bamout argument also on the whole chromosome 6, but the results are the same at the locus of my interest. I attached also an image from another locus, where indel was called by HaplotypeCaller on the same chromosome.

    Command was: java -jar ~/NGS_programs/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~/NGS_data/hg19/ucsc.hg19.fasta -I KO002_P.merged.dedup.realn.bam -stand_call_conf 20 -stand_emit_conf 1 -L chr6 -o chr6_HC_emit1_call20.vcf -bamout chr6.bam

    I would be grateful for your further recommendations...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Interesting. Well, -bamout is meant for debugging small regions only so in the case of the full chr6 run, I'm not surprised it didn't work properly. But for the other region, that is surprising. Can you tell me what command line you used for the first (small) region? What size interval did you use?

  • KurtKurt Member ✭✭✭

    If there is no variant called in the first place, I think the default behavior of -bamoutput is not write anything out. It will only write bam files for regions that will have a variant call.

  • KurtKurt Member ✭✭✭

    --bamWriterType Type CALLED_HAPLOTYPES How should haplotypes be written to the BAM?

  • sanderpajusalusanderpajusalu Tartu, EstoniaMember

    Now I used -bamout on a very small region:

    java -jar ~/NGS_programs/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~/NGS_data/hg19/ucsc.hg19.fasta -I KO002_P.merged.dedup.realn.bam -stand_call_conf 20 -stand_emit_conf 1 -L chr6:157,150,038-157,151,178 -o chr6:157,150,038-157,151,178_HC_emit1_call20.vcf -bamout chr6:157,150,038-157,151,178.bam

    Result: no called variants in VCF file, and the bam is also empty.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh right, I forgot about that -- thanks @Kurt.

    @sanderpajusalu, you'll also need to add --bamWriterType ALL_POSSIBLE_HAPLOTYPES to your command line. The output may be a little more complicated to parse but will give you some insight into what HC is seeing at this locus. Be sure to read the full instructions for the --bamout argument in the tool doc.

  • sanderpajusalusanderpajusalu Tartu, EstoniaMember

    Thank you both so much!

    The missing argument was my bad. Anyways, now I started to see something after using the ALL_POSSIBLE_HAPLOTYPES. IGV image is attached.

    Command line: java -jar ~/NGS_programs/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~/NGS_data/hg19/ucsc.hg19.fasta -I KO002_P.merged.dedup.realn.bam -stand_call_conf 20 -stand_emit_conf 1 -L chr6:157,150,038-157,151,178 -o chr6:157,150,038-157,151,178_HC_emit1_call20-allpossibleHTs.vcf -bamout chr6:157,150,038-157,151,178-allpossibleHTs.bam --bamWriterType ALL_POSSIBLE_HAPLOTYPES

    When I used CALLED_HAPLOTYPES argument, it showed nothing as expected.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, now we're getting somewhere, good. So the HC sees your indel but dismissing it. Based on the coverage plot I'm going to guess that the fraction of reads with the indel is too low to be "convincing". You say you validated the indel with Sanger?

  • sanderpajusalusanderpajusalu Tartu, EstoniaMember

    Yes, It is validated by Sanger. I even have the Sanger sequencing image, which you can see in attached picture (first row is patient, followed by parental samples). :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hey @sanderpajusalu, sorry to get back to you so late. I'm not sure why HC is dismissing the haplotype with your indel. I will ask @ebanks if he has any ideas that might help.

  • sanderpajusalusanderpajusalu Tartu, EstoniaMember
    edited February 2014

    Thank you Geraldine for dealing with my case. Meanwhile I ran samtools mpileup to call variants with alternative tool and it managed to call the variant and gave my the following variant description:


    chr6 157150401 0 TG T 27.5 0 INDEL;IS=8,0.235294;DP=34;VDB=2.341677e-01;AF1=0.5;AC1=1;DP4=17,7,4,1;MQ=50;FQ=30.5;PV4=1,0.025,1,0.18 GT:PL:GQ 0/1:65,0,255:68

    I am not an expert, but it seems to me that this just might be the case of low concordance between variant callers reported many times in different papers like:

    O'Rawe J, Guangqing S, Wang W, Hu J, Bodily P, Tian L, et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med. 2013;5:28.

    It is of course a major problem when dealing with single exomes/genomes, but probably that is the way it is at the moment.

    GATK is wonderful tool, but as I understand it is not developed to analyze single samples, am I right? Therefore, it is understandable why it misses some variants, especially indels with not so good coverage and mapping quality as was the case for me.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @sanderpajusalu, discordance between calling results are indeed due to caller-specific error modes. The difficult part is knowing which caller is right for which variants!

    The GATK is fully capable of calling variants on single samples, though its ability to use information across samples makes it more powerful when calling multiple samples, especially if there are coverage and mapping quality issues. If you have such issues in your data that may be why HC is dismissing your variant, as it is quite demanding in terms of quality when there is only a single sample to work with.

Sign In or Register to comment.