Analysis Pipeline Discrepancy in SNP Calling and Coverage

Hi, All,

So I am new to GATK so please bear with me... Essentially, I have developed a unix script to analyze the fastq sequencing output for a novel targeting technique. I am only targeting 27 SNPs with a small amplicon size and the coverage is much more than traditional sequencing methods. I want to report the genotype and coverage at each location (even the homozygous reference sites). One major issue that I have witnessed is that at a given SNP in IGV, I have approx 20,000X coverage with perfectly paired reads (paired end) with MQ 60; however, following my analysis pipeline, my VCF reports 10X. I cannot figure out what I am doing wrong! Also, at other SNP sites, the VCF is reporting balanced allele depth (AD) for a given heterozygous genotype (which is correct), but the genotype call (GT) in the VCF reports as 0/0 (homozygous reference). Below is the general script, a screenshot of IGV for the SNP with 20,000X coverage, and screenshot of IGV for the SNP that is reporting as homozygous when it is a true heterozygous. Please help!

Thank you all! You are great!



for i in "$fastqDir/*R1*.fastq"

sample=`basename $i .fastq`
file2="$fastqDir"`echo $sample | sed 's/_R1_/_R2_/'`.fastq

# look for a read-pair
if [ ! -f "$file2" ] # none detected
$file2 = ""

bwa mem -R "@RG\tID:$sample\tSM:$sample\tPL:ILLUMINA\tLB:$sample" -t 10 $refGenome $i $file2 | samtools view -bSh | samtools sort -m 10G -o sample.bam -T Temp
samtools index sample.bam

gatk HaplotypeCaller --arguments_file gatkArgumentsFile.txt --reference $refGenome --input sample.bam --output sample.vcf --intervals SNPCoordinates.bed --emit-ref-confidence BP_RESOLUTION

#gatk HaplotypeCaller --arguments_file gatkArgumentsFile.txt --input sample.bam --output sample.2.vcf --intervals SNPCoordinates.bed --disable-tool-default-read-filters true --emit-ref-confidence BP_RESOLUTION
#bcftools mpileup -q 5 -d 9999999 -f reference.fasta sample.bam | bcftools call -g 10 -a FORMAT/DP -f GQ,GP -m -T SNPCoordinates.bed -o sample.vcf



  • RachelKRachelK Member
    Other screenshot...incorrect GT being called in VCF while AD shows different genotype.
  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited February 1


    Your pipeline seems perfectly legit and for the sake of analysis since you are looking at known SNP sites I would argue against using HaplotypeCaller for these kinds of experiments. For my job here We are also working with similar samples when checking pre and post implantation samples for graft rejection and the best solution is to use a Locus based caller like bcftools, unifiedgenotyper heck even bam-readcount gives the best results for analysis. HaplotypeCaller uses local assembly and realignment to generate variants which may end up showing different numbers compared to your expectations. Also please check the downsampling option for HC. It may be left turned on. I am not saying that HC is not suitable for calling SNPs however when checking SNPs using very short amplicons the number of bases covered around the SNP region may not be enough to generate a viable haplotype reassembly at the depth of your expectation.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hello @RachelK

    Thank you for the detailed summary of the problem, it is much appreciated.

    Here is a discussion thread that seems to indicate that GATK is not fine-tuned for Amplicon calling and @SkyWarrior is correct that possibly using the UnifiedGenotyper from a slightly older version of GATK may retain more information.

    The HaplotypeCaller will remove some reads to do the realign for better variants calling.
    In other words, HaplotypeCaller only uses as many reads as necessary, and it may be inherently downsampling because you are only looking at a short segment of sequence. Does that make sense?

    Another option to look at is low mapping quality? I don't think you have thousands of poorly mapped reads, but you can see a discussion about that here.

Sign In or Register to comment.