If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Depth of coverage on an exome
Apologies, I think this has been asked in a number of ways previously (http://gatkforums.broadinstitute.org/gatk/discussion/1831/depth-of-coverage-only-first-gene-summary-output) but I am hoping you can clarify something for me. Apologies also for the length of this question!
I am trying to generate the depth of coverage (% of bases covered @>20X) for each gene across an exome and I had an issue similar to that in the above link where genes from the refseq file were absent from the depth of coverage output.
My understanding is the --gene-list is a refseq file used to create a list of genes. A list of intervals (-L) are provided in a bed file. For each gene in the refseq file(?) all the intervals within this gene are taken and any overlapping intervals are combined. Only bases within these intervals are used to calculate depth of coverage for that gene. (Are the exon coordinates in the refseq file used at all?).
The intervals override the coordinates within the genelist so that all the bases within the intervals are used to calculate coverage for that gene. so in the below example gene1 would have bases 90-190.
chr1 90 120 gene1
chr1 115 150 gene1
chr1 150 190 gene1
chr1 180 250 gene2
chr1 240 300 gene2
gene info from refseq
In my case gene 2 was missing. The bedfile I was using was the exome capture design so there were a series of tiled regions.
Therefore my thought was that for gene1 the first interval from the bedfile is taken and then all overlapping intervals are merged regardless of the identifiers PRIOR to any calculation of coverage. This resulted in an interval of 90-300, therefore all the reads within gene 2 are counted towards the depth of coverage for gene1. Intervals are then not reused for gene2, hence the absence of gene2 from the output. (Is that why it's absent as opposed to having a coverage of 0?)
I changed the bedfile so there was no overlap between gene1 and gene2 and gene2 re-appeared in the depth of coverage output.
Therefore I could just create a better bedfile, containing one interval for each exon in each gene so there is no overlap between genes.
However if my above understanding is correct there will be an issue where genes have overlapping exons. eg MUTYH and TOE1.
In the analysis where I was missing "gene2", TOE1 appeared despite being partially covered by an interval assigned to MUTYH. I think my above understanding still stands as TOE1 also has some intervals which does not overlap with the previous gene.
If I were to make a bedfile with an interval for each exon if two exons did overlap these intervals would still be merged and the depth of coverage for gene2 would not include any exons which overlap with the previous gene.
Therefore my current plan of attack is to identify any genes where an exon overlaps with the gene before and perform a separate depth of coverage analysis for these genes. Does this sound sensible or am I missing a really obvious flag (eg not merging intervals)?
I haven't tested this but can I run depth of coverage where the interval list contains intervals for only a small proportion of genes?
I am using version 3.5. I did look into diagnose targets tool as suggested elsewhere but I don't think this provides the % bases covered @ > 20X.
I hope this makes sense! Thank you very much!