It looks like you're new here. If you want to get involved, click one of these buttons!
For a complete, detailed argument reference, refer to the GATK document page here.
DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:
Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.
For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.
DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.
To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument
-geneList /path/to/gene/list.txt
The provided gene list must be of the following format:
585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0,
587 NM_001005224 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F3 cmpl cmpl 0,
587 NM_001005277 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F16 cmpl cmpl 0,
587 NM_001005221 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F29 cmpl cmpl 0,
589 NM_001005224 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F3 cmpl cmpl 0,
589 NM_001005277 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F16 cmpl cmpl 0,
589 NM_001005221 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F29 cmpl cmpl 0,
If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at
/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt
If you supply the -geneList argument, DepthOfCoverage v3.0 will output an additional summary file that looks as follows:
Gene_Name Total_Cvg Avg_Cvg Sample_1_Total_Cvg Sample_1_Avg_Cvg Sample_1_Cvg_Q3 Sample_1_Cvg_Median Sample_1_Cvg_Q1
SORT1 594710 238.27 594710 238.27 165 245 330
NOTCH2 3011542 357.84 3011542 357.84 222 399 >500
LMNA 563183 186.73 563183 186.73 116 187 262
NOS1AP 513031 203.50 513031 203.50 91 191 290
Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.
Comments
If we are not on the broad network, is there still a way to obtain the gene list?
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
No, I'm afraid we do not provide public access to that resource. You'll need to make your own, sorry.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, the link for the GATK DepthOfCoverage Documentation isn't working... Is it possible to post the updated link?
Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hmm, the link is what it should be but the doc page is missing. I will find out why. In the meantime you can get the command line arguments by running the walker with
-h(for "help").Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Is there a description somewhere of the various output files?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Does "Coverage by Gene" calculates coverage only within the intervals specified the intervals files (-L)? For example, if we have a whole exome experiment, and we provide DepthOfCoverage with the captured regions .intervals file, and we run coverage by gene as well, Is the output stats per gene only for regions within the captured regions or it is for all sites of the gene?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •That's correct, using an intervals file with
-Lwill limit the scope of analysis to only the targeted intervals.Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks a lot. I was wondering if there is a way to add gene name or a custom annotation as a column in the sample_interval_summary that gets generated. As an example, say I feed the tool with whole exome captured (.intervals file) regions as well as refseq genes using
-geneList, the sample_interval_summary would show only the coordinate of the regions and no other annotation (such as gene or transcript). Is there a way to get those annotations in the sample_interval_summary file?- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The gene name should be referenced in the summary table as explained here:
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Is there a description available of the columns in the genelist file?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Nevermind! http://genome.ucsc.edu/cgi-bin/hgTables
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •@Geraldine_VdAuwera I use this tool to identify regions in our intervals(ROI) that have at least one base covered less than 30x. But In the sample_interval_summary file, the accuracy of %_above_30 (or %_above_anyDP) is not high enough. Accuracy is 0.001 (0.1%) which means if I have a region with the length of 5000bp, there should be at least 5bp below the DP cutoff for DepthOfCoverage tool to report a coverage percentage below 100.0% (for that DP cutoff). Is there a way to improve this accuracy?
Alternatively, if you can add minimum coverage for each interval (not sure how to get min coverage per interval using current GATK walkers), that will address this accuracy issue to some extent.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there, sorry for the delayed response.
There is no option to directly control the accuracy of this tool. Depending on what you're trying to so, you may want to look at our other diagnostics / BAM processing tools such as DiagnoseTargets.
If by minimum coverage per interval you mean you want to exclude those low-covered intervals from analysis, there is currently no direct way to do that. We prefer to call those intervals then filter out variants that are supported by too few reads.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Along the lines of Sarmadym's November post, if I have BED file with gene intervals and names, is is possible for GATK to output a sample_interval_summary file that includes not only chromosome and coordinates, but the also the gene name itself? I am using '-L path/to/bed.bed' as the interval file. I have already tried using '-geneList path/to/refseq/gene/list.txt' but I'm not seeing the additional summary file that's described above. Thanks for your help.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
We are actually in the process of retiring the DepthOfCoverage tool. It is being replaced by DiagnoseTargets. Please try using DiagnoseTargets, and if there are any functionalities that are missing from the tool, let us know and we will consider adding them.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks - very good to know. DiagnoseTargets seems easy to use and fast, and it presents useful info in an easy-to-understand format. However, I'm still having trouble getting my feature names to appear in the output file. Currently the ID field in that file is blank (a dot). The BED file I'm feeding the program is in this format:
chr_name \t start_coordinate \t end_coordinate \t feature_name \t strand
I've searched and searched GATK documentation, but can't find any clues for how to modify my BED file so that DiagnoseTargets will populate the ID field with feature names. Any ideas are much appreciated.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I am not sure I understand what your question is. Could you provide me with the command line you're using for DiagnoseTargets?
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Sure thing, thanks. I believe DiagnoseTargets is working fine, it's just not populating the "ID" field in the output file.
Here's the command line I'm using: java -jar ~/src/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar \ -T DiagnoseTargets \ -I alignments2/ERR066280_sorted.bam.bam \ -R reference_genome/PlasmoDB-9.2_Pfalciparum3D7_Genome.fasta \ -L pileups/genetest.bed \ -o GATK_DiagnoseTargets/tryagain.txt
This runs for a few seconds (assessing coverage for 6 genes - I have 5000+ I'd eventually like to look at) and produces a 10-column file, tryagain.txt, attached. Also produces a .idx file. See how the ID column of the attached file is blank? I'd like that column to include the names of my genes, which are in the BED file I'm using as input, genetest.bed, which looks like this:
Pf3D7_01_v3 71623 72425 PF3D7_0101200 + Pf3D7_01_v3 74562 75365 PF3D7_0101300 + Pf3D7_01_v3 75981 76808 PF3D7_0101400 - Pf3D7_01_v3 78240 79890 PF3D7_0101500 + Pf3D7_01_v3 81764 83105 PF3D7_0101600 - Pf3D7_01_v3 84790 86151 PF3D7_0101700 -
Col1 is the chr name, 2 and 3 are the coordinates of the gene, 4 is the gene name, and 5 is the strand. If you know of any way I could get the gene names (column for of the above BED) into the ID column of the attached DiagnoseTargets output, I'd be very appreciative.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hmmm, the ID field in the VCF format is designed to represent the variant id in a database, like the mutation id in DBSNP or HAPMAP. Here we are using the VCF to display a region, instead of a variant, and filling the ID field would probably be a frowned upon by the VCF community.
That being said, this is a very easy problem to solve, you can do it yourself with an awk script to match the chr/pos to the id and fill in the blanks if you need. Is this something you (or someone in your team) would be comfortable doing?
From our end, if we were to support BED files with ID's it would have to be a more formal approach. I will look into it and whether or not the ID field would be the best one to provide. We accept not only BED files , but also interval and vcf files in the -L argument, and not all of them will have the ID in the same position.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Ohhh, that's what the ID field is for. Now it makes sense why I couldn't get my gene names to appear there. That would be great if you supported BED files with gene names, but in the mean time you're right - awk is probably the easiest solution. I can implement that for our own purposes. Thanks for your help...
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Document page still not active... help?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @dandrew, as stated earlier we are replacing DepthOfCoverage by DiagnoseTargets. The missing doc is just an unrelated technical side issue but we've decided not to take the time to fix it since the tool is being retired anyway. Sorry for the inconvenience but this is a matter of resource prioritization.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Dandrew, please use DiagnoseTargets instead. We are discontinuing DepthOfCoverage and the help page will be removed.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I have been using the DepthOfCoverage tool a lot recently and have found that it produces a lot of really useful information in the output files that I am using for QC at our genome centre. I plan to continue to use them extensively in the future.
When I went to check some things in the documentation, I saw that you are sadly going to retire this tool, and it will be replaced by DiagnoseTargets. So I am responding to your request for comments on missing functionalities.
I have now tried DiagnoseTargets and DepthOfCoverage on the same datasets, and I can see that there is a lot of functionality missing from DiagnoseTargets.
The output vcf file from DignoseTargets provides some of the same information as the out.sample_interval_summary from DepthOfCoverage but not all, the total experiment coverage, average experiment coverage and total sample coverage are not provided. Additionally the actual values are different – is this due to downsampling in DiagnoseTargets? Here is an example:
out.sample_interval_summary:
Target chr1:762098-762270
total coverage 227719
average coverage 1316.29
Sample1 total cvg 156728
Sample1 mean cvg 905.94
Sample1 granular Q1 >500
Sample1 granular median >500
Sample1 granular Q3 >500
Sample1 % above 15 100
Sample2 total cvg 70991
Sample2 mean cvg 410.35
Sample2 granular Q1 322
Sample2 granular median 435
Sample2 granular Q3 >500
Sample2 % above 15 100
diagnoseTargets.vcf:
QUAL 0CHROM chr1
POS 762098
ID 0
REF T
ALT
FILTER PASS
INFO AVG_INTERVAL_DP=287.96;END=762270
FORMAT AVG_INTERVAL_DP:MED:Q1:Q3
Sample1 175.43:200.00:108.00:247.00
Sample2 112.53:124.00:65.00:163.00
There isn't any replacement for any of the other files from DepthOfCoverage:
out:
total coverage, average sample coverage, coverage per sample at every base position in interval file
out.sample_summary:
total coverage, average coverage, quantiles coverage, %bases above 15 coverage for each sample
out.sample_statistics:
base position counts for x amount (default intevals of 1, 0 to 500 and 500+) of coverage for each sample [For histogram]
out.sample_cumulative_coverage_proportions:
proportion of base positions for number of reads greater than x for each sample
out.sample_cumulative_coverage_counts:
matrix of base position counts for number of samples greater than n, by number of reads greater than x
out.sample_interval_statistics:
matrix of number of intevals for number of samples greater than n, by number of reads greater than x
For me at the moment the out.sample_summary, out.sample_cumulative_coverage_proportions, out and out.sample_interval_summary files are really useful but I can envisage that the other output is useful for other parts of analysis and I may well use them in the future too.
It would be really great if you could add the functionality outlined above to DiagnoseTargets, or better still reverse your decision to retire DepthOfCoverage.
I am also finding the use of vcf style files for non variant data irritating. They require a lot of extra parsing, cannot just be passed on to end users and can potentially take up a lot more space than a tab delimited file.
Thanks for listening to my thoughts!
Anna
- Spam
- Abuse
- Troll
1 • Off Topic Disagree Agree 1Like WTF •Hi Anna,
Thanks for the feedback, this is very useful. We will see what we can do to make DiagnoseTargets better respond to your needs.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine. I agree with everything that Annat has posted.
I use DepthOfCoverage for a large number (a few hundreds) of human exomes from individuals with rare diseases. Two of the most useful outputs of DepthOfCoverage are missing in DiagnoseTargets (from what I can tell): - sample_summary: I like to have a concise summary of mean and median coverage for each sample, as well as %of target bases covered at different thresholds (e.g. 5x, 10x, 20x, 30x, etc) averaged over all targeted bases in the sample. - sample_interval_summary: I like to have the %of target bases covered at different thresholds for EACH target (exon) in the sample. This is often useful because we may have a restricted set of genes in mind. We can check whether any specific exons are missing coverage in one or more samples and might want to follow those up.
The fixed classifier of DiagnoseTargets (e.g. "LOW_COVERAGE") based on a single threshold (e.g. --minimum_coverage in DiagnoseTargets) is much less useful to me. I also agree that the VCF-style output format is awkward and inappropriate.
Also, I have had no trouble running DepthOfCoverage with reasonable memory limits (4 Gb) even on large (100 Gb) multi-sample BAM files by controlling --read_buffer_size. With DiagnoseTargets I have yet to get it to run on the same BAM file even when providing 30 Gb of memory (our cluster kills the job when it exceeds the requested memory usage). This occurs when I specify a whole chromosome as the interval of interest. It would run when I gave shorter intervals.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for the useful feedback, Jeremy. We'll take this into account.
Not sure what's going on with performance but we'll take a look at that too and try to improve it.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I ma trying to use depthofcoverage, to get the coverage over an interval. But I get an error:
ERROR A USER ERROR has occurred (version 1.3-20-g447e9bf):
ERROR MESSAGE: File associated with name /Users/ngs_user/Desktop/Coverage/makeCoverageScript/allexonsT1.bed is malformed: Interval file could not be parsed in any supported format. caused by BED files must be parsed through Tribble; parsing them as intervals through the GATK engine is no longer supported
ERROR ------------------------------------------------------------------------------------------
My interval file has the format: chr1:12011-12058 chr1:12180-12228 chr1:12614-12698 chr1:12976-13053 chr1:13222-13375
It works fine without an interval file or just just one interval (-L chr1:12011-12058). Is the interval file format incorrect? Thanks in advance!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Your intervals look fine, but the version you're using is downright ancient... You need to upgrade to a more recent version; preferably the latest.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, In the new version of GATK there is ReduceReads part. Is there a way to call DepthOfCoverage from the output of ReduceReads reduced.bam file?
Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Yes, you can use DepthOfCoverage with reduced nam files just as you would any other bam. No special treatment necessary.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I would like to calculate the depth of coverage ratio of tumor and blood samples. I am using .coverage.sample_interval_summary output of Depth of Coverage. If I run DepthOfCoverage on reduced.bam files of tumor and blood, the ratio won't be correct right? Because the reads that are not informative in variant calling are removed? I was using BaseQualityRelicalibration output for DepthOfCoverage calculation. Should I continue with BaseQualityRecalibration output?
Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •It depends what settings you used in ReduceReads -- but in principle it is not a problem anyway, if what you want to know is the useful coverage that is informative for calling variants. However, if you are trying to diagnose coverage as a sequencing QC process, then you should do it earlier in your pipeline.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Why is DepthOfCoverage being retired? DiagnoseTargets seems to provide less information in a format that requires additional processing.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Edited the message -- DoC will not be retired immediately, we will wait until DiagnoseTargets satisfies users' needs.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I wish to find out position specific coverage in a bacterial genome. I am not sure how to set the "-L" option to do so.
java -jar GenomeAnalysisTK.jar -R reference.fasta -T DepthOfCoverage -o depth_of_coverage.txt -I aligned_reads.bam -L??
The examples in gatk website show it like "-L chr1:100-200". I tried different combinations but i get an error saying contig does not match. Could you please help me with this option.
The header for SAM file looks like this if you need it.
@HD VN:1.3
@SQ SN:ref000001|sid|576|accn|NC_000962 LN:4411532 M5:e49320ac872e44cb8058de0407772516
@RG ID:1a7146a3a4 SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s2_p0PL:PacBio_RS DT:2012-10-09T21:01:18
@RG ID:d2c759b45d SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s1_p0PL:PacBio_RS DT:2012-10-09T21:01:18
@CO READS:151621
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Ashu,
The format is
-L contig:start-stop. You need to know what is the name of the contig you want to analyze.Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •