We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

A mismatch between BAM file and the final SNP sequence.

mcarmelmcarmel TAU israelMember

We performed whole exome sequencing for 7 individuals from the same family (parents and 5 children). The company who has done the test used the GATK software.
In the analysis we looked for unique risk factors variants in the affected children. We found some SNPs that were unique to some of the children and were missing in the parents. In other words, the parents were homozygous for certain SNP (AA) and some of the children were heterozygous for the same SNP (AC). We thought it might be caused by a de novo change.
In order to check this, we examined the BAM files at the SNP location for the parents and the children and found:
Mother: number of reads A-73 (76.8%); C-22 (23.2%). The final SNP received by software was AA.
Father: number of reads A-76 (81.7%); C-17 (18.3%). The final SNP received by software was AA
Child 1: number of reads A-56 (76.7%); C-17 (23.3%). The final SNP received by software was AC
Child 2: number of reads A-45 (68.2%); C-21 (31.8%). The final SNP received by software was AA

I would appreciate if someone can explain to me:
What are the parameters that affect and determine the final decision of the SNP sequence?
Why the final SNP sequence is different although the distribution of the reads is the same for all members of the family (especially the mother and child 1)?
What cause the discrepancy?

Thank you from a new beginner in this interesting area


  • SheilaSheila Broad InstituteMember, Broadie admin



    Welcome to the field!

    Do you know if the company used Haplotype Caller to determine the variants? Have you had the chance to run GATK yourself on the files?

    There are many factors that determine variant calls. The reads must have high quality scores, but the biggest determinant is a reassembly of the regions that have the potential for variation (assuming Haplotype Caller is used). The original bam file you are looking at has the original alignment of the reads to the whole genome. When Haplotype Caller is run, it performs a new local reassembly on the active regions. If you run GATK, you can use the -bamout argument to see the newly reassembled bam file and compare it to the original bam file.

    Please read more about how the reassembly works here: http://www.broadinstitute.org/gatk/guide/article?id=4146


  • SheilaSheila Broad InstituteMember, Broadie admin



    A couple other questions to help us:

    1) What version of GATK did the company use?

    2) What calling parameters did they use?



  • mcarmelmcarmel TAU israelMember

    Hi Sheila,
    Thank you for your reply.
    I have not run the results in the GATK by myself yet, but I plan to do so soon.

    The company used for both snps and indels three different variant callers
    1. GATK's unified genotyper - http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html
    2. Samtools mpileup - http://samtools.sourceforge.net/mpileup.shtml
    3. FreeBayes - https://github.com/ekg/freebayes

    Then they used GATK's variant recalibrator (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html) to recalibrate the results from the variant callers which is then annotated with various external databases.

    In addition we have got a file called *.snp.final.tsv that is the tab delimited variant text file version of this annotated vcf file.

    I don't know what version of GATK and what are the calling parameters the company used.
    I sent a request for details and I'm waiting for reply.

    Thanks a lot!


  • SheilaSheila Broad InstituteMember, Broadie admin


    Hi Miri,

    My previous comments were about Haplotype Caller, not Unified Genotyper.

    It is not in our Best Practices documentation to use Unified Genotyper. Haplotype Caller is now a better variant caller to use.

    Let us know which version and parameters were used. We can give you a better answer then.


  • mcarmelmcarmel TAU israelMember

    Hi Sheila,
    I received a reply from the company:
    They used Genome Analysis Toolkit (GATK) v2.1-12-g2d7797a.
    And the calling parameters were:
    Arguments for Variant Recalibrator for SNPs -l INFO -T VariantRecalibrator -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbSNP.137.vcf -L SeqCap_EZ_Exome_v3_Edge_v2.bed -S LENIENT -rscriptFile intermediate/variant/VC621.snp.recal.plots.R -R human_g1k_v37.fasta --maxGaussians 6 -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -input intermediate/variant/VC621.snp.vcf -recalFile intermediate/variant/VC621.snp.recal -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf --mode SNP -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf -resource:intersect,known=false,training=true,truth=true,prior=20.0 intermediate/variant/VC621.intersect.vcf -tranchesFile intermediate/variant/VC621.snp.recal.tranches

    Arguments for Variant Recalibrator for indels -l INFO -T VariantRecalibrator -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbSNP.137.vcf --TStranche 100.0 --TStranche 99.9 --TStranche 99.0 --TStranche 95.0 --TStranche 90.0 -L SeqCap_EZ_Exome_v3_Edge_v2.bed -S LENIENT -rscriptFile VC621.indel.recal.plots.R -R human_g1k_v37.fasta --maxGaussians 4 -resource:indels,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf -an HaplotypeScore -an ReadPosRankSum -an FS -input intermediate/variant/VC621.indel.vcf -recalFile intermediate/variant/VC621.indel.recal --mode INDEL -resource:intersect,known=false,training=true,truth=true,prior=20.0 intermediate/variant/VC621.intersect.vcf -tranchesFile intermediate/variant/VC621.indel.recal.tranches


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Miri,

    I'm sorry to tell you that is an old version of GATK that we no longer support. And while it's hard to say without seeing the data in detail, the genotype result you posted is probably wrong. This may be due in part to the program version, the choice of using UnifiedGenotyper instead of the more recent HaplotypeCaller, and certain methodology choices. My recommendation would be to redo the analysis with the current version (3.1) and following our best practices more closely.

    The difficulty here is that the company is probably using that old version because it is free, whereas for the current version they would have to buy a license. So the cost is lower, but results are less likely to be accurate.

    Note however that if you are doing academic / not-for-profit work, you can use GATK yourself for free. We provide detailed step-by-step documentation on how to process the data, if you would like to try this or have a student do it.

  • mcarmelmcarmel TAU israelMember

    Hi Geraldine,
    Thank you for your answer; it helps me understand the discrepancies in the results.
    I'm working on downloading the GATK 3.1 and then I will start working on re-run the raw data.
    Thanks a lot

Sign In or Register to comment.