The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Using DepthOfCoverage to find out how much sequence data you have

Posts: 71GATK Developer mod
edited August 2013

#### See this announcement regarding our plans for support of DepthOfCoverage and DiagnoseTargets. If you find that there are functionalities missing in either tool, leave us a comment and we will consider adding them.

For a complete, detailed argument reference, refer to the GATK document page here.

### Introduction

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:

• Total, mean, median, and quartiles for each partition type: aggregate
• Total, mean, median, and quartiles for each partition type: for each interval
• A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)
• A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X
• A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
• A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.

For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

### Coverage by Gene

To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
587     NM_001005277    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F16  cmpl    cmpl    0,
587     NM_001005221    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F29  cmpl    cmpl    0,
589     NM_001005224    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F3   cmpl    cmpl    0,
589     NM_001005277    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F16  cmpl    cmpl    0,
589     NM_001005221    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F29  cmpl    cmpl    0,

If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at

/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt

If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:

Gene_Name     Total_Cvg       Avg_Cvg       Sample_1_Total_Cvg    Sample_1_Avg_Cvg    Sample_1_Cvg_Q3       Sample_1_Cvg_Median      Sample_1_Cvg_Q1
SORT1    594710  238.27  594710  238.27  165     245     330
NOTCH2  3011542 357.84  3011542 357.84  222     399     &gt;500
LMNA    563183  186.73  563183  186.73  116     187     262
NOS1AP  513031  203.50  513031  203.50  91      191     290

Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.

Post edited by Geraldine_VdAuwera on
Tagged:

• Posts: 33Member

If we are not on the broad network, is there still a way to obtain the gene list?

Thanks,

MC

Hi there,

No, I'm afraid we do not provide public access to that resource. You'll need to make your own, sorry.

Geraldine Van der Auwera, PhD

• Posts: 1Member

Hi, the link for the GATK DepthOfCoverage Documentation isn't working... Is it possible to post the updated link?

Thanks!

Hmm, the link is what it should be but the doc page is missing. I will find out why. In the meantime you can get the command line arguments by running the walker with -h (for "help").

Geraldine Van der Auwera, PhD

• Posts: 8Member

Is there a description somewhere of the various output files?

• Posts: 5Member

Does "Coverage by Gene" calculates coverage only within the intervals specified the intervals files (-L)? For example, if we have a whole exome experiment, and we provide DepthOfCoverage with the captured regions .intervals file, and we run coverage by gene as well, Is the output stats per gene only for regions within the captured regions or it is for all sites of the gene?

That's correct, using an intervals file with -L will limit the scope of analysis to only the targeted intervals.

Geraldine Van der Auwera, PhD

• Posts: 5Member

@Geraldine_VdAuwera said:
That's correct, using an intervals file with -L will limit the scope of analysis to only the targeted intervals.

Thanks a lot. I was wondering if there is a way to add gene name or a custom annotation as a column in the sample_interval_summary that gets generated. As an example, say I feed the tool with whole exome captured (.intervals file) regions as well as refseq genes using -geneList , the sample_interval_summary would show only the coordinate of the regions and no other annotation (such as gene or transcript). Is there a way to get those annotations in the sample_interval_summary file?

The gene name should be referenced in the summary table as explained here:

The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls.

Geraldine Van der Auwera, PhD

• Posts: 10Member

Is there a description available of the columns in the genelist file?

• Posts: 10Member
edited December 2012

@emoss said:
Is there a description available of the columns in the genelist file?

Nevermind! http://genome.ucsc.edu/cgi-bin/hgTables

Post edited by emoss on
• Posts: 5Member
edited December 2012

@Geraldine_VdAuwera I use this tool to identify regions in our intervals(ROI) that have at least one base covered less than 30x. But In the sample_interval_summary file, the accuracy of %_above_30 (or %_above_anyDP) is not high enough. Accuracy is 0.001 (0.1%) which means if I have a region with the length of 5000bp, there should be at least 5bp below the DP cutoff for DepthOfCoverage tool to report a coverage percentage below 100.0% (for that DP cutoff). Is there a way to improve this accuracy?

Alternatively, if you can add minimum coverage for each interval (not sure how to get min coverage per interval using current GATK walkers), that will address this accuracy issue to some extent.

Hi there, sorry for the delayed response.

There is no option to directly control the accuracy of this tool. Depending on what you're trying to so, you may want to look at our other diagnostics / BAM processing tools such as DiagnoseTargets.

If by minimum coverage per interval you mean you want to exclude those low-covered intervals from analysis, there is currently no direct way to do that. We prefer to call those intervals then filter out variants that are supported by too few reads.

Geraldine Van der Auwera, PhD

• Posts: 8Member

Along the lines of Sarmadym's November post, if I have BED file with gene intervals and names, is is possible for GATK to output a sample_interval_summary file that includes not only chromosome and coordinates, but the also the gene name itself? I am using '-L path/to/bed.bed' as the interval file. I have already tried using '-geneList path/to/refseq/gene/list.txt' but I'm not seeing the additional summary file that's described above. Thanks for your help.

• Posts: 8Member
edited January 2013

Thanks - very good to know. DiagnoseTargets seems easy to use and fast, and it presents useful info in an easy-to-understand format. However, I'm still having trouble getting my feature names to appear in the output file. Currently the ID field in that file is blank (a dot). The BED file I'm feeding the program is in this format:

chr_name \t start_coordinate \t end_coordinate \t feature_name \t strand

I've searched and searched GATK documentation, but can't find any clues for how to modify my BED file so that DiagnoseTargets will populate the ID field with feature names. Any ideas are much appreciated.

Post edited by cparobek on

I am not sure I understand what your question is. Could you provide me with the command line you're using for DiagnoseTargets?

• Posts: 8Member

Sure thing, thanks. I believe DiagnoseTargets is working fine, it's just not populating the "ID" field in the output file.

Here's the command line I'm using:
java -jar ~/src/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar \
-T DiagnoseTargets \
-I alignments2/ERR066280_sorted.bam.bam \
-R reference_genome/PlasmoDB-9.2_Pfalciparum3D7_Genome.fasta \
-L pileups/genetest.bed \
-o GATK_DiagnoseTargets/tryagain.txt

This runs for a few seconds (assessing coverage for 6 genes - I have 5000+ I'd eventually like to look at) and produces a 10-column file, tryagain.txt, attached. Also produces a .idx file. See how the ID column of the attached file is blank? I'd like that column to include the names of my genes, which are in the BED file I'm using as input, genetest.bed, which looks like this:

Pf3D7_01_v3 71623 72425 PF3D7_0101200 +
Pf3D7_01_v3 74562 75365 PF3D7_0101300 +
Pf3D7_01_v3 75981 76808 PF3D7_0101400 -
Pf3D7_01_v3 78240 79890 PF3D7_0101500 +
Pf3D7_01_v3 81764 83105 PF3D7_0101600 -
Pf3D7_01_v3 84790 86151 PF3D7_0101700 -

Col1 is the chr name, 2 and 3 are the coordinates of the gene, 4 is the gene name, and 5 is the strand. If you know of any way I could get the gene names (column for of the above BED) into the ID column of the attached DiagnoseTargets output, I'd be very appreciative.

Hmmm, the ID field in the VCF format is designed to represent the variant id in a database, like the mutation id in DBSNP or HAPMAP. Here we are using the VCF to display a region, instead of a variant, and filling the ID field would probably be a frowned upon by the VCF community.

That being said, this is a very easy problem to solve, you can do it yourself with an awk script to match the chr/pos to the id and fill in the blanks if you need. Is this something you (or someone in your team) would be comfortable doing?

From our end, if we were to support BED files with ID's it would have to be a more formal approach. I will look into it and whether or not the ID field would be the best one to provide. We accept not only BED files , but also interval and vcf files in the -L argument, and not all of them will have the ID in the same position.

• Posts: 8Member

Ohhh, that's what the ID field is for. Now it makes sense why I couldn't get my gene names to appear there. That would be great if you supported BED files with gene names, but in the mean time you're right - awk is probably the easiest solution. I can implement that for our own purposes. Thanks for your help...

• Posts: 21Member

I have been using the DepthOfCoverage tool a lot recently and have found that it produces a lot of really useful information in the output files that I am using for QC at our genome centre. I plan to continue to use them extensively in the future.

When I went to check some things in the documentation, I saw that you are sadly going to retire this tool, and it will be replaced by DiagnoseTargets. So I am responding to your request for comments on missing functionalities.

I have now tried DiagnoseTargets and DepthOfCoverage on the same datasets, and I can see that there is a lot of functionality missing from DiagnoseTargets.

The output vcf file from DignoseTargets provides some of the same information as the out.sample_interval_summary from DepthOfCoverage but not all, the total experiment coverage, average experiment coverage and total sample coverage are not provided. Additionally the actual values are different – is this due to downsampling in DiagnoseTargets? Here is an example:

out.sample_interval_summary:
Target chr1:762098-762270
total coverage 227719
average coverage 1316.29
Sample1 total cvg 156728
Sample1 mean cvg 905.94
Sample1 granular Q1 >500
Sample1 granular median >500
Sample1 granular Q3 >500
Sample1 % above 15 100
Sample2 total cvg 70991
Sample2 mean cvg 410.35
Sample2 granular Q1 322
Sample2 granular median 435
Sample2 granular Q3 >500
Sample2 % above 15 100

diagnoseTargets.vcf:
CHROM chr1
POS 762098
ID 0
REF T
ALT

QUAL 0
FILTER PASS
INFO AVG_INTERVAL_DP=287.96;END=762270
FORMAT AVG_INTERVAL_DP:MED:Q1:Q3
Sample1 175.43:200.00:108.00:247.00
Sample2 112.53:124.00:65.00:163.00

There isn't any replacement for any of the other files from DepthOfCoverage:

out:
total coverage, average sample coverage, coverage per sample at every base position in interval file

out.sample_summary:
total coverage, average coverage, quantiles coverage, %bases above 15 coverage for each sample

out.sample_statistics:
base position counts for x amount (default intevals of 1, 0 to 500 and 500+) of coverage for each sample [For histogram]

out.sample_cumulative_coverage_proportions:
proportion of base positions for number of reads greater than x for each sample

out.sample_cumulative_coverage_counts:
matrix of base position counts for number of samples greater than n, by number of reads greater than x

out.sample_interval_statistics:
matrix of number of intevals for number of samples greater than n, by number of reads greater than x

For me at the moment the out.sample_summary, out.sample_cumulative_coverage_proportions, out and out.sample_interval_summary files are really useful but I can envisage that the other output is useful for other parts of analysis and I may well use them in the future too.

It would be really great if you could add the functionality outlined above to DiagnoseTargets, or better still reverse your decision to retire DepthOfCoverage.

I am also finding the use of vcf style files for non variant data irritating. They require a lot of extra parsing, cannot just be passed on to end users and can potentially take up a lot more space than a tab delimited file.

Thanks for listening to my thoughts!
Anna

Hi Anna,

Thanks for the feedback, this is very useful. We will see what we can do to make DiagnoseTargets better respond to your needs.

Geraldine Van der Auwera, PhD

• Posts: 5Member

Hi Geraldine.
I agree with everything that Annat has posted.

I use DepthOfCoverage for a large number (a few hundreds) of human exomes from individuals with rare diseases.
Two of the most useful outputs of DepthOfCoverage are missing in DiagnoseTargets (from what I can tell):
- sample_summary: I like to have a concise summary of mean and median coverage for each sample, as well as %of target bases covered at different thresholds (e.g. 5x, 10x, 20x, 30x, etc) averaged over all targeted bases in the sample.
- sample_interval_summary: I like to have the %of target bases covered at different thresholds for EACH target (exon) in the sample. This is often useful because we may have a restricted set of genes in mind. We can check whether any specific exons are missing coverage in one or more samples and might want to follow those up.

The fixed classifier of DiagnoseTargets (e.g. "LOW_COVERAGE") based on a single threshold (e.g. --minimum_coverage in DiagnoseTargets) is much less useful to me.
I also agree that the VCF-style output format is awkward and inappropriate.

Also, I have had no trouble running DepthOfCoverage with reasonable memory limits (4 Gb) even on large (100 Gb) multi-sample BAM files by controlling --read_buffer_size. With DiagnoseTargets I have yet to get it to run on the same BAM file even when providing 30 Gb of memory (our cluster kills the job when it exceeds the requested memory usage). This occurs when I specify a whole chromosome as the interval of interest. It would run when I gave shorter intervals.

Thanks for the useful feedback, Jeremy. We'll take this into account.

Not sure what's going on with performance but we'll take a look at that too and try to improve it.

Geraldine Van der Auwera, PhD

• Posts: 1Member

Hi,
I ma trying to use depthofcoverage, to get the coverage over an interval. But I get an error:

##### ERROR ------------------------------------------------------------------------------------------

My interval file has the format:
chr1:12011-12058
chr1:12180-12228
chr1:12614-12698
chr1:12976-13053
chr1:13222-13375

It works fine without an interval file or just just one interval (-L chr1:12011-12058). Is the interval file format incorrect?

edited February 2013

Your intervals look fine, but the version you're using is downright ancient... You need to upgrade to a more recent version; preferably the latest.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

• Posts: 2Member

Hi, In the new version of GATK there is ReduceReads part. Is there a way to call DepthOfCoverage from the output of ReduceReads reduced.bam file?

Thanks!

Yes, you can use DepthOfCoverage with reduced nam files just as you would any other bam. No special treatment necessary.

Geraldine Van der Auwera, PhD

• Posts: 2Member

I would like to calculate the depth of coverage ratio of tumor and blood samples. I am using .coverage.sample_interval_summary output of Depth of Coverage. If I run DepthOfCoverage on reduced.bam files of tumor and blood, the ratio won't be correct right? Because the reads that are not informative in variant calling are removed? I was using BaseQualityRelicalibration output for DepthOfCoverage calculation. Should I continue with BaseQualityRecalibration output?

Thanks!

It depends what settings you used in ReduceReads -- but in principle it is not a problem anyway, if what you want to know is the useful coverage that is informative for calling variants. However, if you are trying to diagnose coverage as a sequencing QC process, then you should do it earlier in your pipeline.

Geraldine Van der Auwera, PhD

• Posts: 31Member

Why is DepthOfCoverage being retired? DiagnoseTargets seems to provide less information in a format that requires additional processing.

Edited the message -- DoC will not be retired immediately, we will wait until DiagnoseTargets satisfies users' needs.

Geraldine Van der Auwera, PhD

• Posts: 21Member
edited April 2013

Hi, I wish to find out position specific coverage in a bacterial genome. I am not sure how to set the "-L" option to do so.

java -jar GenomeAnalysisTK.jar -R reference.fasta -T DepthOfCoverage -o depth_of_coverage.txt -I aligned_reads.bam -L??

The examples in gatk website show it like "-L chr1:100-200". I tried different combinations but i get an error saying contig does not match. Could you please help me with this option.

The header for SAM file looks like this if you need it.

@HD VN:1.3

@SQ SN:ref000001|sid|576|accn|NC_000962 LN:4411532 M5:e49320ac872e44cb8058de0407772516

@RG ID:1a7146a3a4 SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s2_p0PL:PacBio_RS DT:2012-10-09T21:01:18

@RG ID:d2c759b45d SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s1_p0PL:PacBio_RS DT:2012-10-09T21:01:18

Post edited by Ashu on

Hi Ashu,

The format is -L contig:start-stop. You need to know what is the name of the contig you want to analyze.

Geraldine Van der Auwera, PhD

• Posts: 228Member

what's the different between file *.interval_list and *.intervals?

@blueskypy, those are used interchangeably as extension names for interval list files; there is no inherent difference.

Geraldine Van der Auwera, PhD

• Posts: 228Member

what's the meaning of the 1st column in the geneList file?

The UCSC table browser gives a full header for the columns. The first column is called #bin. For more details, you should look it up on their website.

Geraldine Van der Auwera, PhD

• Posts: 228Member

Took a while to figure out the format of the -geneList file. In case I'm not the only dummy here, the format is described at http://genome.ucsc.edu/FAQ/FAQformat.html#format9

Thanks for sharing your findings (especially direct links!), I'm sure this will be useful to others.

Geraldine Van der Auwera, PhD

• Posts: 228Member
edited July 2013

I use the following command;

java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T DepthOfCoverage \ -R$refGenome -I H06.recal.bam -I H09.recal.bam -I H10.recal.bam \
-geneList HLA.genePred -rf BadCigar -o HLA

but some of the output files are empty, why?

$ls -al HLA* -rw-r--r-- 1 user01 root 71394299565 Jul 3 05:34 HLA -rw-r--r-- 1 user01 root 0 Jul 3 05:34 HLA.sample_interval_statistics -rw-r--r-- 1 user01 root 0 Jul 3 05:34 HLA.sample_gene_summary -rw-r--r-- 1 user01 root 11462 Jul 3 05:34 HLA.sample_cumulative_coverage_proportions -rw-r--r-- 1 user01 root 266 Jul 3 05:34 HLA.sample_summary -rw-r--r-- 1 user01 root 17963 Jul 3 05:34 HLA.sample_statistics -rw-r--r-- 1 user01 root 0 Jul 3 05:34 HLA.sample_interval_summary -rw-r--r-- 1 user01 root 16574 Jul 3 05:34 HLA.sample_cumulative_coverage_counts -rw-r--r-- 1 user01 root 5367 Jul 1 22:47 HLA.genePred Post edited by blueskypy on • Posts: 228Member And here are a few lines in the HLA.genePred 834 NM_020056 chr6 + 32709162 32714664 32709220 32714171 5 32709162,32712935,32713567,32714016,32714358, 32709302,32713184,32713849,32714191,32714664, 0 HLA-DQA2 cmpl cmpl 0,1,1,1,-1, 834 NM_001198858 chr6 - 32723874 32731330 32724229 32731247 5 32723874,32725057,32726626,32729436,32731150, 32724243,32725081,32726908,32729703,32731330, 0 HLA-DQB2 cmpl cmpl 1,1,1,1,0, 823 NM_005514 chr6 - 31321648 31324989 31322259 31324935 8 31321648,31322255,31322409,31322883,31323093,31323943,31324464,31324862, 31322073,31322303,31322442,31323000,31323369,31324219,31324734,31324989, 0 HLA-B cmpl cmpl -1,1,1,1,1,1,1,0, • Posts: 7,110Administrator, GATK Developer admin @blueskypy, can you tell me what version you are using? Geraldine Van der Auwera, PhD • Posts: 228Member hi, Geraldine, Thanks so much for the help! I'm using GATK-2.5-2 • Posts: 7,110Administrator, GATK Developer admin Actually, it's not version dependent -- it's because you didn't specify intervals. If you specify an interval list (using -L) the files will get populated. Geraldine Van der Auwera, PhD • Posts: 228Member hi, Geraldine, Thanks for the help! I thought the -L isn't necessary if -geneList is provided since the interval info is in the geneList file. So if I extract the exonStarts and exonEnds from the geneList file to the interval file, will DepthOfCoverage map the intervals to gene names and produce the .sample_gene_summary file? • Posts: 228Member btw, since the -geneList file contains intevals for both CDS and exons, does DepthOfCoverage compute the coverage depth of genes based on CDS or exons? • Posts: 228Member I use the following command: java -Xmx4g -jar$gatkDir/GenomeAnalysisTK.jar -T DepthOfCoverage -R $refGenome -l DEBUG \ -I H06.recal.bam -I H09.recal.bam -I H10.recal.bam -geneList HLA.genePred -L HLA.intervals -rf BadCigar -o HLA And got the following warning on every genes in the geneList file HLA.genePred: WARN 16:58:32,227 Utils - ******************************************************************************** WARN 16:58:32,227 Utils - * WARNING: WARN 16:58:32,227 Utils - * WARN 16:58:32,227 Utils - * RefSeq file is potentially incorrect, as some transcripts or exons WARN 16:58:32,227 Utils - * have a negative length (chr6) WARN 16:58:32,228 Utils - ******************************************************************************** However, there is no negative length in the entry of genes, here is one example: 833 NM_002122 chr6 + 32605182 32611429 32605235 32610541 5 32605182,32609086,32609748,32610386,32610728, 32605317,32609335,32610030,32610561,32611429, 0 HLA-DQA1 cmpl cmpl 0,1,1,1,-1, Could someone help? Thanks a lot! • Posts: 7,110Administrator, GATK Developer admin Not sure what's going on here. Have you tried testing on a subset and checking whether the results look reasonable? Geraldine Van der Auwera, PhD • Posts: 59Member Is the coverage that we get from depth of coverage for the whole genome equal to the one we get from Lander/Waterman equation as given on http://res.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf Or is it average read depth? • Posts: 7,110Administrator, GATK Developer admin DepthOfCoverage gives you the straight-up average depth. Geraldine Van der Auwera, PhD • BroadPosts: 1Member Has anybody successfully used the DepthofCoverage tool to find average coverage per gene for RNA-Seq data? I've tried running the following... java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T DepthOfCoverage -o test.txt -I input.bam -geneList refGene.sorted.txt -L refGene.sorted.intervals -U ALLOW_N_CIGAR_READS ... but the summary file is empty. Any suggestions? • Posts: 7,110Administrator, GATK Developer admin This may be an issue of how your intervals are set up. Have a look at this question & answer for more details on how it should be. If that's not the issue, have you checked if it works for you with regular DNA data? This to rule out the N-containing cigars as a cause for the problem. Geraldine Van der Auwera, PhD • Posts: 33Member Hi, I'm trying to run depthofcoverage on 40 reduced exome bams (one sample per bam) and the processing time per 1M reads seems to be getting higher and higher such that it seems that it will take weeks for this to finish. I've limited the scope of the analysis using -L argument, but I was wondering has anyone experienced something like this before? Thanks, MC Thanks, MC • Posts: 7,110Administrator, GATK Developer admin Hi @chongm, Sounds like the program got stuck in a region of excessive coverage, perhaps. Are you using downsampling? And are you running the latest version? Geraldine Van der Auwera, PhD • Posts: 33Member Hi Geraldine, I can't run the latest version of GATK unfortunately because we are still stuck on GATK1.6. Downsampling helps but it is still very slow. Thanks, MC • Posts: 33Member Hi Everyone, I've used GATK DOC to calculate coverage for some exonic intervals of two different genes. I'm noticing that for some exons the sample_mean_coverage value from the .sample_interval_summary file appears to be exactly 127.00 for all samples, which is almost impossible just by chance. Has anyone else experienced anything like this? I'm using GATK version 2.5.2 . Is this a version bug or am I doing something wrong here? Commands: java -Xmx4g -jar$gatk \
-T DepthOfCoverage \
-R $ref2 \ -I$data/interstroke_AMY1A_LPA.list \
-L $LPA \ -o${data}/interstroke_coverage_LPA

Thanks,

MC

Hi @chongm,

Are you using DoC on reduced bams, by any chance? In version 2.5 ReduceReads applied downsampling to 127...

Geraldine Van der Auwera, PhD

• Posts: 33Member

You are absolutely right! Sorry I didn't realize that downsampling was defaulted for reduce reads.

Thanks,

Mike Chong

Thanks,

MC

• Posts: 7Member
edited December 2013

Hello,

In this post you say that two of the outputs of DepthOfCoverage are:

a) A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)

b) A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

Which, from my understanding, correspond to:

a) _cumulative_coverage_counts

b) _cumulative_coverage_proportions

Correct?

If yes, my questions are:

1) in _cumulative_coverage_counts the cumulative number of bases covered at 1X, for example, decreases with sample size. Is this because it is increasing the minimum coverage too? For example, for one sample it looks for bases with a minimum of 1X coverage, for 2 samples it looks for bases with a minimum of 2X coverage (one read from each sample), and so on and so forth.

2) still in _cumulative_coverage_counts, is the sample that covers more bases chosen first and then it goes on adding the second sample that covers more bases and so on?

3) in _cumulative_coverage_proportions the first column is not the number of samples (NSamples_1, etc.), but samples names. So, my understanding is not that it is showing me the proportion of loci with ≥18x coverage for ≥15 libraries, but that it is showing me the proportion of loci with ≥18x coverage for each sample separately. Is this correct?

Thank you,
ISP.

Post edited by isp2 on

Hi @isp2,

No, I think what you're referring to are the tables in the interval_statistics and gene_statistics output files. The cumulative_coverage output files are histogram outputs that the article refers to.

Geraldine Van der Auwera, PhD

• UKPosts: 10Member

Hi Geraldine,
I was wondering what filters you use to measure genome coverage for WGS Human data in a production environment? Such as base quality, mapping quality, etc.
I could not find Best Practices for that on the forum.
Thank you.

Hi @ekanterakis‌,

We currently don't provide detailed recommendations for doing production sequencing QC; at the Broad this is done upstream of us by a specialized group in the genome sequencing platform, which is responsible for production-level sequencing. At the moment we have no plans to develop documentation on this but if that changes, I'll let you know.

Geraldine Van der Auwera, PhD

• Los Angeles, CAPosts: 4Member

I have 24 exome samples that I am trying to get coverage information for. These have been run over two flow cells, and I think the depth of coverage that I'm getting from this tool is much lower than I believe it should be (based on number of reads and depth of coverage calculated from the first flow cell alone). Should I be looking at the mean that is output from DepthofCoverage? Is there a way to get information for the percentage of intervals that have >30X coverage or >40X coverage, etc? Thank you.

Hi @lmelee,

The DepthOfCoverage tool emits various statistics per locus and per interval, including histograms that you can use for that purpose.

Geraldine Van der Auwera, PhD

• Posts: 8Member

Does DepthOfCoverage rely on correct values in the "bin" column (i.e. the first column) of the UCSC table?

I'd like to use DepthOfCoverage to analyze gene-by-gene sequencing coverage for an organism that is not hosted by UCSC's genome browser. Our organism has a GFF file, which I've modified to make a -geneList table like the one you've provided above. For the bin column, I've tried a variety of values (for example: 585, NA, 1). Regardless of the "bin" value, when I run DepthOfCoverage using this handmade file, it outputs by-chromosome statistics that seem reasonable but by-gene statistics that are totally off. In the "sample_gene_summary" table, instead of reporting coverage for each of the ~5000 genes in our organism, the program reports coverage for the first gene on each chromosome, but instead of reporting the coverage for that gene, it actually reports the coverage for that gene's chromosome.

I'm trying to troubleshoot what could be wrong with my handmade -geneList file, and wondering if the "bin" column could be the culprit. I've attached the first few lines of my -geneList file, if that helps.

Also, I'm using GATK 3.2 with the following command: java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ref.fasta -I sample1.realigned.bam -o sample1.coverage -geneList test8.txt -L chromosome.intervals.

Thanks for any help!

Hi @cparobek‌,

Sorry for the late reply. I'm not sure what could be wrong as I haven't run DoC myself in ages, and am too short on time to give it a go right now. But I think I remember someone mentioning an unintuitive interaction between the -L intervals and the genelist. Can you do a run with just the gene list (not the -L) and let me know what happens?

Geraldine Van der Auwera, PhD

• Posts: 183Member ✭✭✭

@cparobek

the *sample_gene_summary is only where the -L intervals intersects with -geneList

Hah, thanks @Kurt. I must have erased that knowledge from my brain to make space for something new since then

Geraldine Van der Auwera, PhD

• Posts: 183Member ✭✭✭

I purge my "mind palace" hourly nowadays it seems @Geraldine_VdAuwera

Ooh, your "mind palace", is it? That sounds awfully grandiose ))

Geraldine Van der Auwera, PhD

• Posts: 183Member ✭✭✭

BBC's SHERLOCK reference

Been wanting to use that one for awhile now

Hah, I should have recognized that! Great show. Glad you got to use that reference. You're welcome

Geraldine Van der Auwera, PhD

• Posts: 8Member

Thank you both! Like @Kurt suggested, I ended up providing an interval file where the regions fall inside the regions provided in the RefSeq file. Can also make the start/stop positions in the interval file be the same as in the RefSeq file. Also, to answer my question about the bins (RefSeq file column 1), I am using NAs in that column and it seems to be performing just fine.

• United StatesPosts: 4Member

Thanks for a great tutorial. If I am interested in calculating average coverage for each chromosome/plasmid is the best approach to use the argument: -geneList /path/to/gene/list.txt
Should the list.txt include a -geneList /path/to/gene/list.txt of all chromosome names and do I also need to include interval positions which span the length of the entire chromosome?

Thank you!

Hi,

An interval list specifies which parts of the dataset to analyze, whereas a gene list specifies intervals for which results should be aggregated.

For average coverage for each chromosome/plasmid, you can simply use -L with a list of the chromosomes and plasmids.

-Sheila

• United StatesPosts: 4Member

Fantastic this works nicely, thank you for the quick response!

• hkPosts: 10Member

hi Geraldine,
i have the target region coordiate information like this:chr1 10137431 10137716;
i have got the refseq fiile.
chr1 2323213 2336885 NM_007033 0 + 2327229 2334563 0 7 184,88,105,100,79,136,2412, 0,4009,5341,....
i think this is intervel file,right?

so my script is :
java -Xmx8g -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /media/Analysis/gatk_resource/ucsc.hg19.fasta --calculateCoverageOverGenes:REFSEQ /media/Analysis/annovar20140618/humandb/hg19_refGene.txt --outputFormat table --omitDepthOutputAtEachBase -o Coverage_summary -I ./realigned.recal.bam -L geneTrack.refSeq.bed
error message is

##### ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature chr11:89057522-89223909 followed later by chr5:154198052-154230213 and then chr11:319673-320914, for input source: /media/Analysis/annovar20140618/humandb/hg19_refGene.txt

thanks.

• Posts: 13Member
edited January 8

So sort the file to get the chromosomes in order: chr1,chr2 etc..

Post edited by Zaag on