Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GATK variant calling


I am using this code to call SNPs for DNA seq paired end files aligned using bowtie2:

module load gatk/ picard/2.9.2
picard CreateSequenceDictionary R=ref.fa O=ref.dict
java -jar picard.jar ValidateSamFile \
I=bowtie.sorted.bam \
java -jar picard.jar AddOrReplaceReadGroups \
I=bowtie.sorted.bam \
O=bowtie_output.bam \
RGID=1 \
RGLB=lib1 \
RGPL=illumina \
RGPU=unit1 \
java -jar picard.jar ValidateSamFile \
I=bowtie_output.bam \
samtools index bowtie_output.bam
gatk --java-options "-Xmx4G" HaplotypeCaller -R ref.fa -I bowtie_output.bam -O bowtie_gatk.vcf

I got the vcf file but the vcf file size is very low (almost 10,000 times) when I compared with the vcf file that I got using samtools with the same input using this code:
bcftools mpileup -f ref.fa bowtie.sorted.bam > bowtie_samtools.vcf

I am not sure if this is normal or not?


  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited May 24


    I would need more specific information to answer that question. Would you please share with us some variants that are called by the mpileup piepline but not by Haplotypecaller. IGV snapshots of these variant regions from the two vcfs will help us resolve this issue.

    You can find the variants unique to each vcf file by using this tool: https://vcftools.github.io/man_latest.html#SYNOPSIS. Use the DIFF VCF FILE option.

  • sangusangu Member
    Here is one screenshot attached of the difference in SNPs found in gatk vs samtools
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @sangu

    There can be a few reasons why these variants were not called by Haplotypecaller. In order to dig deeper into this please share with me the bowtie_output.bam, bowtie_gatk.vcf and also please generate a bamout file in Haplotypecaller and share that with me as well.
    You can find instructions to share data here.

  • sangusangu Member
    I will share the data. Is there no error in the command lines that I am using!
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited May 29


    My guess is that most of the variants you don't see with Haplotypecaller is due to insufficient envidence for variants in that region. This could be due to low quality reads etc.
    Here is a document that explains how Haplotypecaller performs Local re-assembly and haplotype determination for variant calling. Also Haplotypecaller has parameters that can be tweaked to be more sensitive to calling variants which can be found in the tool documentation page (warning, this increases the chance of false positives). Another resource to see what Haplotypecaller is doing behind the scenes is this doc "(howto) Generate a "bamout file" showing how HaplotypeCaller has remapped sequence reads".

    Might be worthwhile giving these documents a read before you share the data.

  • sangusangu Member
    I have uploaded the files. Unfortunately, I found to upload single archive file after I did my files. I hope it will be okay this time. My file names are:
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @sangu

    Did you get a chance to look over the documentation, especially the bamout one? That should be helpful in understanding what went wrong.
    If you have already gone over the documentation and verified your results would you please give me an update on it.

  • sangusangu Member

    So I figured out why the .vcf file with samtools is so large than the .vcf file with gatk. But before moving ahead, I still want to make sure vcf file from gatk is also correct. I read the documentation you mentioned earlier. I have compared bam.out file and bowtie.bam file in IGV (attached). But I am not sure if I understood bam.out file is correct or not to make sure gatk is working correct.
  • bshifawbshifaw moonMember, Broadie, Moderator admin

    Hi @sangu,

    The bamout contains "the assembled haplotypes and locally realigned reads" . When comparing the input bam to the bamout bam, there should be less noise in the later due to the higher quality in the local alignment which is what is seen in your image.

    If you want to get a better understanding of bamout you may try running through one of tutorials here which walks you through the IGV steps of exploring a bamout file. Note: this uses GATK3 but this shouldn't stop you from running the IGV steps and getting a good understanding of the concept.

    Mind sharing why samtools has a larger VCF compared to Bowtie2?

  • sangusangu Member
    Hi @bshifaw,

    Thanks! I have gone through the tutorial but I am still not sure if my files are correct.

    And I was missing 'call' command with samtools so it was generating a big file which did not have true variants.
  • bshifawbshifaw moonMember, Broadie, Moderator admin

    You can use ValidateSamFile and ValidateVariants to confirm the produced files are correct.

Sign In or Register to comment.