Why do the variants differ between the after-BQSR BAM and the after-HaplotypeCaller BAM?
Dear GATK team,
Hi, I have followed the Best Practices (GATK 3.7) to call germline variants in my samples, which come from a case-control study of ~500 samples in total.
I ran BQSR, PrintReads, and then HaplotypeCaller as shown below:
java -jar $GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R $Reference -knownSites $dbSNP138 -knownSites $Mills -knownSites $oneKGindels -nct 8 -I $Output/$1.sort.dup.ir.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o $Output/$1.recal.data.grp -L $Interval -ip 100
java -jar $GATK/GenomeAnalysisTK.jar -T PrintReads -nct 8 -R $Reference -I $Output/$1.sort.dup.ir.bam -BQSR $Output/$1.recal.data.grp -o $Output/$1.sort.dup.ir.BQSR.bam
java -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R $Reference -I $Input/$1.sort.dup.ir.BQSR.bam -o $Output/$1.hc.vcf.gz -L chr14:92537200-92537700 -bamout $Output/$1.bamout.bam
When I compared the after-BQSR BAM with the after-HC BAM (the -bamout output) in the region chr14:92537200-92537700 using IGV, I noticed that the two BAMs look different, especially for indels, like this:
[IGV screenshot comparing the two BAMs; image not preserved]
So I have several questions:
1) Why do the indels differ between the after-BQSR BAM and the after-HC BAM? The indels at chr14:92,537,354 are absent from the after-BQSR BAM but present in the after-HC BAM. Among my processed samples, some show the same indels in both BAMs, while others show different indels.
2) I noticed that some regions seem to be snapped in the after-HC BAM but not in the after-BQSR BAM. I have no idea why this happens.
3) For some samples, no variants were called anywhere in chr14:92537200-92537700 in the after-HC BAM, even though reads are mapped to the same region in the after-BQSR BAM. How should I interpret this?
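In case it helps to put numbers on question 1, here is a minimal sketch of how the two BAMs could be compared outside IGV by counting reads whose CIGAR string contains an insertion or deletion. The helper name count_indel_reads is hypothetical, and the usage lines assume that samtools is installed and that both BAMs are indexed; the file names simply follow the commands above.

```shell
# Hypothetical helper: count reads whose CIGAR contains an insertion (I)
# or deletion (D). Reads SAM records on stdin and prints
# "<reads with indel><TAB><total reads>".
count_indel_reads() {
  awk -F'\t' '
    /^@/        { next }        # skip SAM header lines
                { total++ }     # every remaining line is one alignment record
    $6 ~ /[ID]/ { indel++ }     # column 6 of a SAM record is the CIGAR string
    END         { printf "%d\t%d\n", indel + 0, total + 0 }
  '
}

# Usage sketch (assumes samtools and indexed BAMs):
# samtools view "$Output/$1.sort.dup.ir.BQSR.bam" chr14:92537200-92537700 | count_indel_reads
# samtools view "$Output/$1.bamout.bam" chr14:92537200-92537700 | count_indel_reads
```

Running this on both BAMs for the same sample would show whether the indel-bearing reads only appear in the after-HC BAM, or whether they are already present in the after-BQSR BAM at a lower count.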
I am not sure, but I suspect there is a real possibility of inaccurate variant calls, since the region I am interested in contains several repeat sequences and the variants themselves are repeated indels. Is this right? I don't know what to do next, so I would appreciate any help with these issues.
Thanks in advance!