Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Depth of coverage on an exome

Aled_Jones86Aled_Jones86 Guy's hospital, LondonMember

Hi,

Apologies, I think this has been asked in a number of ways previously (http://gatkforums.broadinstitute.org/gatk/discussion/1831/depth-of-coverage-only-first-gene-summary-output) but I am hoping you can clarify something for me. Apologies also for the length of this question!

I am trying to generate the depth of coverage (% of bases covered @>;20X) for each gene across an exome and I had an issue similar to that in the above link where genes from the refseq file were absent from the depth of coverage output.

My understanding is the --gene-list is a refseq file used to create a list of genes. A list of intervals (-L) are provided in a bed file. For each gene in the refseq file(?) all the intervals within this gene are taken and any overlapping intervals are combined. Only bases within these intervals are used to calculate depth of coverage for that gene. (Are the exon coordinates in the refseq file used at all?).

The intervals override the coordinates within the genelist so that all the bases within the intervals are used to calculate coverage for that gene. so in the below example gene1 would have bases 90-190.

Bedfile
chr1 90 120 gene1
chr1 115 150 gene1
chr1 150 190 gene1
chr1 180 250 gene2
chr1 240 300 gene2

gene info from refseq
gene1 100-180
gene2 200-300

In my case gene 2 was missing. The bedfile I was using was the exome capture design so there were a series of tiled regions.
Therefore my thought was that for gene1 the first interval from the bedfile is taken and then all overlapping intervals are merged regardless of the identifiers PRIOR to any calculation of coverage. This resulted in an interval of 90-300, therefore all the reads within gene 2 are counted towards the depth of coverage for gene1. Intervals are then not reused for gene2, hence the absence of gene2 from the output. (Is that why it's absent as opposed to having a coverage of 0?)

I changed the bedfile so there was no overlap between gene1 and gene2 and gene2 re-appeared in the depth of coverage output.

Therefore I could just create a better bedfile, containing one interval for each exon in each gene so there is no overlap between genes.

However if my above understanding is correct there will be an issue where genes have overlapping exons. eg MUTYH and TOE1.

In the analysis where I was missing "gene2", TOE1 appeared despite being partially covered by an interval assigned to MUTYH. I think my above understanding still stands as TOE1 also has some intervals which does not overlap with the previous gene.
If I were to make a bedfile with an interval for each exon if two exons did overlap these intervals would still be merged and the depth of coverage for gene2 would not include any exons which overlap with the previous gene.

Therefore my current plan of attack is to identify any genes where an exon overlaps with the gene before and perform a separate depth of coverage analysis for these genes. Does this sound sensible or am I missing a really obvious flag (eg not merging intervals)?

I haven't tested this but can I run depth of coverage where the interval list contains intervals for only a small proportion of genes?

I am using version 3.5. I did look into diagnose targets tool as suggested elsewhere but I don't think this provides the % bases covered @ > 20X.

I hope this makes sense! Thank you very much!

Issue · Github
by Sheila

Issue Number
1471
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Best Answer

Answers

  • Aled_Jones86Aled_Jones86 Guy's hospital, LondonMember

    Actually if I were to repeat this with two bedfiles, and using the same refseq file I think it's still going to assign an interval to the wrong gene. eg

    Bedfile
    chr1 195 300 gene 2

    refseq file
    gene1 chr1:100-200
    gene2 chr1:195-300

    Would the interval in the bedfile would be assigned to gene 1 or 2?

    In terms of my above approach I guess I'll have to create a refseq file for each bedfile?

    FYI from my list of 18886 genes 570 are overlapping.

  • Aled_Jones86Aled_Jones86 Guy's hospital, LondonMember
Sign In or Register to comment.