Why do variants differ between the after-BQSR BAM and the after-HaplotypeCaller BAM?

scha36 (S. Korea, Member)

Dear GATK team,

Hi, I followed the Best Practices (GATK 3.7) to call germline variants in my samples, which come from a case-control study of ~500 samples in total.
I ran BQSR, PrintReads, and then HaplotypeCaller as described below:

BQSR
java -jar $GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R $Reference -knownSites $dbSNP138 -knownSites $Mills -knownSites $oneKGindels -nct 8 -I $Output/$1.sort.dup.ir.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o $Output/$1.recal.data.grp -L $Interval -ip 100

PrintReads
java -jar $GATK/GenomeAnalysisTK.jar -T PrintReads -nct 8 -R $Reference -I $Output/$1.sort.dup.ir.bam -BQSR $Output/$1.recal.data.grp -o $Output/$1.sort.dup.ir.BQSR.bam

HaplotypeCaller (HC)
java -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R $Reference -I $Input/$1.sort.dup.ir.BQSR.bam -o $Output/$1.hc.vcf.gz -L chr14:92537200-92537700 -bamout $Output/$1.bamout.bam

When I compared the variants in the after-BQSR BAM with those in the after-HC BAM in the region chr14:92537200-92537700 using IGV, I noticed that the two BAMs look different, especially for indels:

[Figure: IGV screenshot of the two BAMs; the after-BQSR BAM is the top panel]

So I have several questions:
1) Why do the indels differ between the after-BQSR BAM and the after-HC BAM? The indels at chr14:92,537,354 are not in the after-BQSR BAM, but they are in the after-HC BAM. Among my processed samples, some show the same indels in both BAMs, but others show different indels.
2) I noticed that some regions seem to be snapped in the after-HC BAM but not in the after-BQSR BAM. I have no idea why this happened.
3) For some samples, no variants were called anywhere in chr14:92537200-92537700 in the after-HC BAM, even though reads are mapped to the same region in the after-BQSR BAM. How should I interpret this?

I am not sure, but I suspect there is a real possibility of inaccurate variant calls, since the region I am interested in contains several repeat sequences and the variants themselves are repeated indels. Is this right? I don't know what I can do, so I am asking for your help with these issues.

Thanks in advance!

Best regards,
Soojin

Answers

  • scha36 (S. Korea, Member)

    Thank you for your answer, shlee.

    The documentation you linked helped me understand HaplotypeCaller more clearly.
    Also, I didn't know that the bamout file from HaplotypeCaller basically contains only the active regions, although that was not part of our discussion.
    Anyway, now I can guess what happened to my samples.

    On the other hand, may I ask one more question?
    You told me that the region I am interested in, which contains several repeat sequences, is a 'low complexity region'. But I am confused because the indels in this region have different lengths and are not even seen in the after-BQSR BAM (top panel in my figure). How can I have confidence in those indels?
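
A small illustration of why indels inside repeats can be drawn at different positions: when one repeat unit is deleted from a tandem repeat, removing any of the units gives the same sequence, so an aligner and a local reassembly step can represent the same event differently. The sketch below is a minimal Python example of that point only. The reference string and the helper names `delete_at` and `equivalent_deletion_starts` are hypothetical and are not part of GATK.

```python
def delete_at(seq: str, start: int, length: int) -> str:
    """Return seq with `length` bases removed, beginning at 0-based `start`."""
    return seq[:start] + seq[start + length:]


def equivalent_deletion_starts(seq: str, start: int, length: int) -> list[int]:
    """List every start position whose deletion of `length` bases produces
    the same sequence as deleting `length` bases at `start`."""
    target = delete_at(seq, start, length)
    return [
        pos
        for pos in range(len(seq) - length + 1)
        if delete_at(seq, pos, length) == target
    ]


# Hypothetical reference: unique flanks around five CAG repeat units.
reference = "TTGA" + "CAG" * 5 + "GTCC"

# Delete the 3 bases starting at position 7 (the second CAG unit) and
# list all placements that give exactly the same haplotype.
print(equivalent_deletion_starts(reference, 7, 3))
```

With five CAG units, the single-unit deletion can start at any of 13 positions (4 through 16) and still give an identical haplotype, which is one reason the same indel can appear shifted, or absent, when the same reads are aligned in two different ways.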

  • shlee (Cambridge; Member, Broadie, Moderator, admin)

    Hi @scha36,

    Glad to hear the documentation helped to clarify.

    Confidence in variants in low complexity or repeat regions will be low. The difficulty is in deciding whether these variants are true biological variants or artifacts of sequencing by synthesis (SBS), which is prone to the same polymerase slippage that occurs in the cell. The resolution will depend on factors including (i) whether such a sequence is present uniquely in the reference or elsewhere as well (which affects the ability to map the relevant reads accurately), (ii) your read length, your insert length if using paired-end reads, and whether your pipeline considers mate information, and (iii) sequencing depth and library complexity.

    The region that you show above has ten glutamine (Q) amino acids in a row. Do you study neurodegeneration? If GATK workflows are giving lower-confidence calls than you require, I think it would be helpful for you to look into external tools that are designed to resolve short tandem repeats. I am only aware of lobSTR, but I am certain there are many others out there.
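
As a rough way to flag stretches like the glutamine run mentioned above before trusting indel calls there, the following Python sketch scans a DNA string for consecutive glutamine codons (CAG and CAA both encode Q). It is only an illustration: the function name `polyq_runs`, the `min_codons` threshold, and the example sequence are assumptions, the sequence is not the real chr14 locus, and the scan ignores reading frame and strand.

```python
import re

GLUTAMINE_CODONS = ("CAG", "CAA")  # both codons encode glutamine (Q)


def polyq_runs(dna: str, min_codons: int = 5) -> list[tuple[int, int]]:
    """Return (0-based start, codon count) for each run of at least
    `min_codons` back-to-back glutamine codons on the given strand.
    The reading frame is not checked; this only finds the repeat itself."""
    pattern = re.compile("(?:%s){%d,}" % ("|".join(GLUTAMINE_CODONS), min_codons))
    return [
        (match.start(), len(match.group()) // 3)
        for match in pattern.finditer(dna.upper())
    ]


# Hypothetical sequence: ten Q codons (nine CAG, one CAA) between flanks.
example = "ATGGCC" + "CAG" * 6 + "CAA" + "CAG" * 3 + "GGCTTT"
print(polyq_runs(example))
```

For the example string this reports one run of ten codons starting at offset 6. A dedicated short tandem repeat tool does far more than this (for instance, genotyping repeat length from reads), so the sketch is only meant for spotting where such runs sit in a sequence.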

  • scha36 (S. Korea, Member)

    Yes, I'm studying neurodegeneration.
    Following your comment, I will consider using other tools (lobSTR, ...) that are specific to repeat regions for my study, since it is hard to have confidence in the indels with the information I have.
    Your comments really helped me.
    Thank you very much :)

    Best regards,
    Soojin
