We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Calling variants in a small target region

Hi,
This may not directly relate to how GATK works, but I am still asking to get your expert input (I do use GATK for this pipeline).

I am trying to test my variant calling pipeline that I have prepared for my target region of 5Mb for which I need a test dataset where the fastq files and the VCF files are provided so that I can run my pipeline on those fastqs and then compare my VCF to those VCFs. So, I used the two whole exome datasets from Genome In a Bottle data with the following steps:

  1. In order to obtain the fastqs that originated from my target region, I used the bam files published by GIAB and extracted the alignments falling in my target region using samtools

  2. Converted the extracted bams in those regions to paired-end fastqs

  3. Aligned those fastqs to the entire genome using BWA

  4. Called the variants in my region of interest using GATK (with -L option, restricting my analysis to my target region for RealignerTargetCreator and HaplotypeCaller steps only). I don't do BQSR and VQSR.

  5. Compared my variants to the variants from GIAB in the target region.

But I only get 50% variant calls correctly. I call only 50% of the total variants in that region. What could be the reason for this low agreement rate?

Interestingly, when I call the variants on the whole exome for which the fastqs were originally created, I can obtain 96% of variants in the whole-exome region published by GIAB. Moreover, from those variants, when I extract the variants that fall in my target regions and compare it to the corresponding GIAB variants, I can go upto 97%. In other words, when I use the entire whole-exome data, I can call 97% of the variants in my target region, but when I start with the alignments falling in my target region, convert to fastq and then call variants in the target region, I only get 50%. I use the exact same settings of GATK in both cases (except the -L region which is different).

Can you please help me figure out what could be going wrong?

Thanks,

~N

Issue · Github
by Sheila

Issue Number
604
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Best Answer

Answers

  • I realized that due to some reason the reads were not mapping correctly. Moreover, adding another sample to increase depth of coverage helped.
    Thanks!

Sign In or Register to comment.