Unusual False Positives from HC/UG on edges of small, very high-coverage regions

nhvanlienhvanlie Posts: 9Member
edited April 2013 in Ask the GATK team

Hey there,

I'm currently working on a project that uses the MiSeq system to perform very deep sequencing (5000x coverage) on a small set of genes related to a particular cancer type. I have had some odd calls with HaplotypeCaller and UnifiedGenotyper, especially near the edges of the sequenced regions. It seems that often Haplotype calls very large deletions and Unified calls SNPs, e.g. (from the VCFs):

Haplotype Caller:

chr2 212576988 . TATTTTTAATTGTA T 13.95 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.805;ClippingRankSum=-0.916;DP=1000;FS=8.285;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.836;QD=0.00;ReadPosRankSum=0.528 GT:AD:GQ:PL 0/1:308,53:42:42,0,8418

Unified Genotyper:

chr2 212576990 . T C 555.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.991;DP=1000;Dels=0.00;FS=0.000;HaplotypeScore=22.6027;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.073;QD=0.56;ReadPosRankSum=0.164 GT:AD:DP:GQ:PL 0/1:860,130:3743:99:584,0,32767

A screenshot from IGV showing some of the (few) BAM records in the area of these calls is attached.

Many of these false variant calls can be removed through hard filtering, but I was a little surprised to see them called in the first place, because our raw results had been quite good with lower-coverage data. Is this to be expected with high coverage data?

I am also wondering if there is a hard cap after which downsampling occurs. I set -dcov 10000 for all our samples, but the depth reported in the DP tag seemed max out at 1000. Is there another parameter I need to change? In some areas, due to the very small regions being sequenced, I have coverage in excess of 15,000 reads, so downsampling to 1000 might skew the results.

Any guidance is appreciated!
Cheers,
Natascha

igv_variantatedge.png
1262 x 913 - 11K
Post edited by nhvanlie on

Best Answer

Answers

  • nhvanlienhvanlie Posts: 9Member

    Thanks @Geraldine_VdAuwera! With the nightly build the strange calls near the edge of sequenced regions appear to be gone however the DP tag values still seems to max out at 1000. Is this expected behavior?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Great, thanks for confirming it's fixed. The DP maxing out is expected because of downsampling. You can look up what is the default downsampled coverage value for each tool in their corresponding doc pages in the Technical Documentation.

    Geraldine Van der Auwera, PhD

  • nhvanlienhvanlie Posts: 9Member

    Hi @Geraldine_VdAuwera! It seems from the documentation that the defaults for both HaplotypeCaller and UnifiedGenotyper are 250. However this value is much lower than where our DP values maxed out (1000) and the value that we passed to the dcov option (10000). A down sampling value of 250 or even 1000 is too low for our data, is there any way to further increase the DP beyond the default?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    OK, that's weird. We're looking into this; I'll let you know if we need any test files.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    We can't replicate the error on our test files. Could you please make a snippet of your file including a small region affected by this problem, and upload it to our FTP server? Detailed instructions here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • nhvanlienhvanlie Posts: 9Member

    After a bit more testing, I figured out why the values were capped at 1000 - it's because I was also running my VCFs through Variant Annotator, which by default downsamples to 1000. So, even though the raw calls were made with a higher coverage, Variant Annotator changed the DP tag based on its own downsampling. This just means I have to pass on coverage to Variant Annotator, too.

    Sorry for the red herring!

    Cheers,
    Natascha

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Ah, that makes sense now. I'm glad you found the solution!

    Geraldine Van der Auwera, PhD

  • ShobanaSekarShobanaSekar TempePosts: 1Member

    Hi @nhvanlie and @Geraldine_VdAuwera

    I am dealing with high coverage MiSeq data as well; we got > 50,000X coverage and are facing issues identifying variants. I had the duplicates marked after alignment and was about to try -dcov 70000. Any suggestions? Were you able to get results with -dcov 10000?

    Thanks in advance!

    Shobana

Sign In or Register to comment.