The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Did you remember to?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Unusual False Positives from HC/UG on edges of small, very high-coverage regions

nhvanlienhvanlie Member Posts: 9
edited April 2013 in Ask the GATK team

Hey there,

I'm currently working on a project that uses the MiSeq system to perform very deep sequencing (5000x coverage) on a small set of genes related to a particular cancer type. I have had some odd calls with HaplotypeCaller and UnifiedGenotyper, especially near the edges of the sequenced regions. It seems that often Haplotype calls very large deletions and Unified calls SNPs, e.g. (from the VCFs):

Haplotype Caller:

chr2 212576988 . TATTTTTAATTGTA T 13.95 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.805;ClippingRankSum=-0.916;DP=1000;FS=8.285;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.836;QD=0.00;ReadPosRankSum=0.528 GT:AD:GQ:PL 0/1:308,53:42:42,0,8418

Unified Genotyper:

chr2 212576990 . T C 555.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.991;DP=1000;Dels=0.00;FS=0.000;HaplotypeScore=22.6027;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.073;QD=0.56;ReadPosRankSum=0.164 GT:AD:DP:GQ:PL 0/1:860,130:3743:99:584,0,32767

A screenshot from IGV showing some of the (few) BAM records in the area of these calls is attached.

Many of these false variant calls can be removed through hard filtering, but I was a little surprised to see them called in the first place, because our raw results had been quite good with lower-coverage data. Is this to be expected with high coverage data?

I am also wondering if there is a hard cap after which downsampling occurs. I set -dcov 10000 for all our samples, but the depth reported in the DP tag seemed max out at 1000. Is there another parameter I need to change? In some areas, due to the very small regions being sequenced, I have coverage in excess of 15,000 reads, so downsampling to 1000 might skew the results.

Any guidance is appreciated!
Cheers,
Natascha

igv_variantatedge.png
1262 x 913 - 11K

Best Answer

Answers

  • nhvanlienhvanlie Member Posts: 9

    Thanks @Geraldine_VdAuwera! With the nightly build the strange calls near the edge of sequenced regions appear to be gone however the DP tag values still seems to max out at 1000. Is this expected behavior?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,669 admin

    Great, thanks for confirming it's fixed. The DP maxing out is expected because of downsampling. You can look up what is the default downsampled coverage value for each tool in their corresponding doc pages in the Technical Documentation.

    Geraldine Van der Auwera, PhD

  • nhvanlienhvanlie Member Posts: 9

    Hi @Geraldine_VdAuwera! It seems from the documentation that the defaults for both HaplotypeCaller and UnifiedGenotyper are 250. However this value is much lower than where our DP values maxed out (1000) and the value that we passed to the dcov option (10000). A down sampling value of 250 or even 1000 is too low for our data, is there any way to further increase the DP beyond the default?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,669 admin

    OK, that's weird. We're looking into this; I'll let you know if we need any test files.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,669 admin

    We can't replicate the error on our test files. Could you please make a snippet of your file including a small region affected by this problem, and upload it to our FTP server? Detailed instructions here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • nhvanlienhvanlie Member Posts: 9

    After a bit more testing, I figured out why the values were capped at 1000 - it's because I was also running my VCFs through Variant Annotator, which by default downsamples to 1000. So, even though the raw calls were made with a higher coverage, Variant Annotator changed the DP tag based on its own downsampling. This just means I have to pass on coverage to Variant Annotator, too.

    Sorry for the red herring!

    Cheers,
    Natascha

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,669 admin

    Ah, that makes sense now. I'm glad you found the solution!

    Geraldine Van der Auwera, PhD

  • ShobanaSekarShobanaSekar TempeMember Posts: 1

    Hi @nhvanlie and @Geraldine_VdAuwera

    I am dealing with high coverage MiSeq data as well; we got > 50,000X coverage and are facing issues identifying variants. I had the duplicates marked after alignment and was about to try -dcov 70000. Any suggestions? Were you able to get results with -dcov 10000?

    Thanks in advance!

    Shobana

Sign In or Register to comment.