Extremely high depth of coverage

Dear all,
I've run the DepthOfCoverage tool on 263 WGS samples and have found some unusual totals and averages for some regions.
Does this indicate some sort of alignment error, or can I just filter these regions out when calling the genotypes?
I'm attaching an example for one chromosome.


  • shlee, Cambridge (Member, Broadie ✭✭✭✭✭)

    Hi @adri_somavilla,

    Do you have any reason to suspect these high coverage region reads are not normal? For example, do you suspect some of your samples may be contaminated with bacterial sequence?

    "have found some unusual total and averages for some regions."

    Does this observation generalize across all your samples, or does it apply to just some of them? To some extent, these kinds of high-coverage regions are expected in WGS.

    When calling genotypes, will you use a genomic intervals list? If so, do these regions fall within your chosen intervals? Typically, an average coverage figure, e.g. the 30x advertised by sequencing centers, is calculated over a chosen set of intervals. That set excludes regions of Ns, for which no alignment is expected, as well as other regions that typically end up with these types of outlier coverages, e.g. decoy contigs.

    Remember that GATK tools come with a variety of upfront read filters. For example, DepthOfCoverage does not count secondary alignments (NotPrimaryAlignmentFilter) or certain other types of reads, but it does include supplementary alignments and low/zero MAPQ alignments.

    Our callers should be able to handle most situations, including high coverage, that persist after the reads are filtered upfront. For example, HaplotypeCaller filters out low MAPQ and secondary alignments as well as other types of alignments. Then, for regions of high coverage, it downsamples alignments to 500 per sample.

    I hope this is helpful.
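
To make the two points in the reply above concrete (which alignments count toward depth, and why the choice of intervals changes the average), here is a minimal Python sketch. It is not GATK code: the function names, the toy reads and the interval lists are all hypothetical, and the filter models only the one rule spelled out in the post (secondary alignments are dropped; supplementary and low-MAPQ alignments are kept). The flag bits 0x100 and 0x800 are the standard SAM values for secondary and supplementary alignments.

```python
SECONDARY = 0x100      # SAM flag bit: secondary (not primary) alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def counts_toward_depth(flag):
    """Assumption: only secondary alignments are dropped, as the post describes."""
    return not (flag & SECONDARY)


def per_base_depth(reads, contig_length):
    """reads: list of (start, end, flag), 0-based half-open coordinates."""
    depth = [0] * contig_length
    for start, end, flag in reads:
        if counts_toward_depth(flag):
            for pos in range(start, min(end, contig_length)):
                depth[pos] += 1
    return depth


def mean_depth(depth, intervals):
    """Mean depth over the given half-open (start, end) intervals only."""
    total = sum(sum(depth[s:e]) for s, e in intervals)
    n_bases = sum(len(depth[s:e]) for s, e in intervals)
    return total / n_bases if n_bases else 0.0


# Toy data: positions 8-9 stand in for a run of Ns with no alignments.
reads = [
    (0, 4, 0),              # primary alignment: counted
    (2, 6, SUPPLEMENTARY),  # supplementary alignment: counted
    (2, 6, SECONDARY),      # secondary alignment: ignored
    (4, 8, 0),              # primary alignment: counted
]
depth = per_base_depth(reads, contig_length=10)
whole_contig = mean_depth(depth, [(0, 10)])    # includes the empty N run
chosen_intervals = mean_depth(depth, [(0, 8)])  # excludes it
```

The same contig gives a different average depending on whether the empty N run is inside the interval list, which is why averages are usually reported over a chosen set of intervals.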

  • adri_somavilla, Edinburgh (Member)

    Hi @shlee,
    We think these regions might be related to repetitive regions that are not present in the reference genome (I'm working with chicken and there are many gaps). Also, I was processing data from 263 1x coverage samples all together. Anyway, this is still a pilot, and I'm now filtering out bases with maximum mean coverage = (mean cov)/(6*sd).
    Also, I'm calling genotypes genome-wide.
    Thank you.
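
A coverage-based site filter like the one mentioned in this post could be sketched as follows. This is only an illustration under stated assumptions: the function name and the toy per-site means are hypothetical, and the cutoff used here (mean plus k standard deviations, with k = 2) is one common way to define an outlier threshold, not necessarily the formula the poster applied. The cutoff is passed in as a parameter so that any other rule can be substituted.

```python
from statistics import mean, pstdev


def high_coverage_sites(site_means, cutoff):
    """Return indices of sites whose mean coverage exceeds the cutoff."""
    return [i for i, cov in enumerate(site_means) if cov > cutoff]


# Hypothetical per-site mean coverage across samples (about 1x, one pileup).
site_means = [1.0, 1.2, 0.8, 1.1, 0.9, 40.0, 1.0, 1.0]

# Assumed cutoff rule for illustration only: mean + k * standard deviation.
k = 2
cutoff = mean(site_means) + k * pstdev(site_means)

flagged = high_coverage_sites(site_means, cutoff)
kept = [i for i in range(len(site_means)) if i not in flagged]
```

Because a single extreme site inflates both the mean and the standard deviation, a large k can fail to flag anything on small inputs; a median-based cutoff is a common alternative when outliers are severe.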

  • shlee, Cambridge (Member, Broadie ✭✭✭✭✭)

    You may be interested in how other references handle such sequence regions. For example, the human GRCh38 reference genome includes representative repeat arrays to distribute such read sequences, as well as decoy contigs to siphon off sequences that appear in libraries and are likely artifacts (determined empirically, I think), e.g. from cloning plasmid backbones. These decoy contigs in the human reference were contributed by Heng Li. If you BLAST some of these sequences, do you get any hits that make sense?

  • adri_somavilla, Edinburgh (Member)

    Hi @shlee,
    I haven't tried, mainly because we don't really care about these regions at this stage of the project, but I'll keep this in mind. Thanks.
