Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Questions about DepthOfCoverage

This discussion was created from comments split from: Using DepthOfCoverage to find out how much sequence data you have.
Comments
If we are not on the broad network, is there still a way to obtain the gene list?
Hi there,
No, I'm afraid we do not provide public access to that resource. You'll need to make your own, sorry.
Hi, the link for the GATK DepthOfCoverage Documentation isn't working... Is it possible to post the updated link?
Thanks!
Hmm, the link is what it should be but the doc page is missing. I will find out why. In the meantime you can get the command line arguments by running the walker with
-h
(for "help").Is there a description somewhere of the various output files?
Does "Coverage by Gene" calculates coverage only within the intervals specified the intervals files (-L)? For example, if we have a whole exome experiment, and we provide DepthOfCoverage with the captured regions .intervals file, and we run coverage by gene as well, Is the output stats per gene only for regions within the captured regions or it is for all sites of the gene?
That's correct, using an intervals file with
-L
will limit the scope of analysis to only the targeted intervals.Thanks a lot. I was wondering if there is a way to add gene name or a custom annotation as a column in the sample_interval_summary that gets generated. As an example, say I feed the tool with whole exome captured (.intervals file) regions as well as refseq genes using
-geneList
, the sample_interval_summary would show only the coordinate of the regions and no other annotation (such as gene or transcript). Is there a way to get those annotations in the sample_interval_summary file?The gene name should be referenced in the summary table as explained here:
Is there a description available of the columns in the genelist file?
Nevermind! http://genome.ucsc.edu/cgi-bin/hgTables
@Geraldine_VdAuwera I use this tool to identify regions in our intervals(ROI) that have at least one base covered less than 30x. But In the sample_interval_summary file, the accuracy of %_above_30 (or %_above_anyDP) is not high enough. Accuracy is 0.001 (0.1%) which means if I have a region with the length of 5000bp, there should be at least 5bp below the DP cutoff for DepthOfCoverage tool to report a coverage percentage below 100.0% (for that DP cutoff). Is there a way to improve this accuracy?
Alternatively, if you can add minimum coverage for each interval (not sure how to get min coverage per interval using current GATK walkers), that will address this accuracy issue to some extent.
Hi there, sorry for the delayed response.
There is no option to directly control the accuracy of this tool. Depending on what you're trying to so, you may want to look at our other diagnostics / BAM processing tools such as DiagnoseTargets.
If by minimum coverage per interval you mean you want to exclude those low-covered intervals from analysis, there is currently no direct way to do that. We prefer to call those intervals then filter out variants that are supported by too few reads.
Along the lines of Sarmadym's November post, if I have BED file with gene intervals and names, is is possible for GATK to output a sample_interval_summary file that includes not only chromosome and coordinates, but the also the gene name itself? I am using '-L path/to/bed.bed' as the interval file. I have already tried using '-geneList path/to/refseq/gene/list.txt' but I'm not seeing the additional summary file that's described above. Thanks for your help.
Thanks - very good to know. DiagnoseTargets seems easy to use and fast, and it presents useful info in an easy-to-understand format. However, I'm still having trouble getting my feature names to appear in the output file. Currently the ID field in that file is blank (a dot). The BED file I'm feeding the program is in this format:
chr_name \t start_coordinate \t end_coordinate \t feature_name \t strand
I've searched and searched GATK documentation, but can't find any clues for how to modify my BED file so that DiagnoseTargets will populate the ID field with feature names. Any ideas are much appreciated.
I am not sure I understand what your question is. Could you provide me with the command line you're using for DiagnoseTargets?
Sure thing, thanks. I believe DiagnoseTargets is working fine, it's just not populating the "ID" field in the output file.
Here's the command line I'm using:
java -jar ~/src/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar \
-T DiagnoseTargets \
-I alignments2/ERR066280_sorted.bam.bam \
-R reference_genome/PlasmoDB-9.2_Pfalciparum3D7_Genome.fasta \
-L pileups/genetest.bed \
-o GATK_DiagnoseTargets/tryagain.txt
This runs for a few seconds (assessing coverage for 6 genes - I have 5000+ I'd eventually like to look at) and produces a 10-column file, tryagain.txt, attached. Also produces a .idx file. See how the ID column of the attached file is blank? I'd like that column to include the names of my genes, which are in the BED file I'm using as input, genetest.bed, which looks like this:
Pf3D7_01_v3 71623 72425 PF3D7_0101200 +
Pf3D7_01_v3 74562 75365 PF3D7_0101300 +
Pf3D7_01_v3 75981 76808 PF3D7_0101400 -
Pf3D7_01_v3 78240 79890 PF3D7_0101500 +
Pf3D7_01_v3 81764 83105 PF3D7_0101600 -
Pf3D7_01_v3 84790 86151 PF3D7_0101700 -
Col1 is the chr name, 2 and 3 are the coordinates of the gene, 4 is the gene name, and 5 is the strand. If you know of any way I could get the gene names (column for of the above BED) into the ID column of the attached DiagnoseTargets output, I'd be very appreciative.
Hmmm, the ID field in the VCF format is designed to represent the variant id in a database, like the mutation id in DBSNP or HAPMAP. Here we are using the VCF to display a region, instead of a variant, and filling the ID field would probably be a frowned upon by the VCF community.
That being said, this is a very easy problem to solve, you can do it yourself with an awk script to match the chr/pos to the id and fill in the blanks if you need. Is this something you (or someone in your team) would be comfortable doing?
From our end, if we were to support BED files with ID's it would have to be a more formal approach. I will look into it and whether or not the ID field would be the best one to provide. We accept not only BED files , but also interval and vcf files in the -L argument, and not all of them will have the ID in the same position.
Ohhh, that's what the ID field is for. Now it makes sense why I couldn't get my gene names to appear there. That would be great if you supported BED files with gene names, but in the mean time you're right - awk is probably the easiest solution. I can implement that for our own purposes. Thanks for your help...
I have been using the DepthOfCoverage tool a lot recently and have found that it produces a lot of really useful information in the output files that I am using for QC at our genome centre. I plan to continue to use them extensively in the future.
When I went to check some things in the documentation, I saw that you are sadly going to retire this tool, and it will be replaced by DiagnoseTargets. So I am responding to your request for comments on missing functionalities.
I have now tried DiagnoseTargets and DepthOfCoverage on the same datasets, and I can see that there is a lot of functionality missing from DiagnoseTargets.
The output vcf file from DignoseTargets provides some of the same information as the out.sample_interval_summary from DepthOfCoverage but not all, the total experiment coverage, average experiment coverage and total sample coverage are not provided. Additionally the actual values are different – is this due to downsampling in DiagnoseTargets? Here is an example:
out.sample_interval_summary:
Target chr1:762098-762270
total coverage 227719
average coverage 1316.29
Sample1 total cvg 156728
Sample1 mean cvg 905.94
Sample1 granular Q1 >500
Sample1 granular median >500
Sample1 granular Q3 >500
Sample1 % above 15 100
Sample2 total cvg 70991
Sample2 mean cvg 410.35
Sample2 granular Q1 322
Sample2 granular median 435
Sample2 granular Q3 >500
Sample2 % above 15 100
diagnoseTargets.vcf:
CHROM chr1
POS 762098
ID 0
REF T
ALT
QUAL 0
FILTER PASS
INFO AVG_INTERVAL_DP=287.96;END=762270
FORMAT AVG_INTERVAL_DP:MED:Q1:Q3
Sample1 175.43:200.00:108.00:247.00
Sample2 112.53:124.00:65.00:163.00
There isn't any replacement for any of the other files from DepthOfCoverage:
out:
total coverage, average sample coverage, coverage per sample at every base position in interval file
out.sample_summary:
total coverage, average coverage, quantiles coverage, %bases above 15 coverage for each sample
out.sample_statistics:
base position counts for x amount (default intevals of 1, 0 to 500 and 500+) of coverage for each sample [For histogram]
out.sample_cumulative_coverage_proportions:
proportion of base positions for number of reads greater than x for each sample
out.sample_cumulative_coverage_counts:
matrix of base position counts for number of samples greater than n, by number of reads greater than x
out.sample_interval_statistics:
matrix of number of intevals for number of samples greater than n, by number of reads greater than x
For me at the moment the out.sample_summary, out.sample_cumulative_coverage_proportions, out and out.sample_interval_summary files are really useful but I can envisage that the other output is useful for other parts of analysis and I may well use them in the future too.
It would be really great if you could add the functionality outlined above to DiagnoseTargets, or better still reverse your decision to retire DepthOfCoverage.
I am also finding the use of vcf style files for non variant data irritating. They require a lot of extra parsing, cannot just be passed on to end users and can potentially take up a lot more space than a tab delimited file.
Thanks for listening to my thoughts!
Anna
Hi Anna,
Thanks for the feedback, this is very useful. We will see what we can do to make DiagnoseTargets better respond to your needs.
Hi Geraldine.
I agree with everything that Annat has posted.
I use DepthOfCoverage for a large number (a few hundreds) of human exomes from individuals with rare diseases.
Two of the most useful outputs of DepthOfCoverage are missing in DiagnoseTargets (from what I can tell):
The fixed classifier of DiagnoseTargets (e.g. "LOW_COVERAGE") based on a single threshold (e.g. --minimum_coverage in DiagnoseTargets) is much less useful to me.
I also agree that the VCF-style output format is awkward and inappropriate.
Also, I have had no trouble running DepthOfCoverage with reasonable memory limits (4 Gb) even on large (100 Gb) multi-sample BAM files by controlling --read_buffer_size. With DiagnoseTargets I have yet to get it to run on the same BAM file even when providing 30 Gb of memory (our cluster kills the job when it exceeds the requested memory usage). This occurs when I specify a whole chromosome as the interval of interest. It would run when I gave shorter intervals.
Thanks for the useful feedback, Jeremy. We'll take this into account.
Not sure what's going on with performance but we'll take a look at that too and try to improve it.
Hi,
I ma trying to use depthofcoverage, to get the coverage over an interval. But I get an error:
ERROR A USER ERROR has occurred (version 1.3-20-g447e9bf):
ERROR MESSAGE: File associated with name /Users/ngs_user/Desktop/Coverage/makeCoverageScript/allexonsT1.bed is malformed: Interval file could not be parsed in any supported format. caused by BED files must be parsed through Tribble; parsing them as intervals through the GATK engine is no longer supported
ERROR ------------------------------------------------------------------------------------------
My interval file has the format:
chr1:12011-12058
chr1:12180-12228
chr1:12614-12698
chr1:12976-13053
chr1:13222-13375
It works fine without an interval file or just just one interval (-L chr1:12011-12058). Is the interval file format incorrect?
Thanks in advance!
Your intervals look fine, but the version you're using is downright ancient... You need to upgrade to a more recent version; preferably the latest.
Hi, In the new version of GATK there is ReduceReads part. Is there a way to call DepthOfCoverage from the output of ReduceReads reduced.bam file?
Thanks!
Yes, you can use DepthOfCoverage with reduced nam files just as you would any other bam. No special treatment necessary.
I would like to calculate the depth of coverage ratio of tumor and blood samples. I am using .coverage.sample_interval_summary output of Depth of Coverage. If I run DepthOfCoverage on reduced.bam files of tumor and blood, the ratio won't be correct right? Because the reads that are not informative in variant calling are removed? I was using BaseQualityRelicalibration output for DepthOfCoverage calculation. Should I continue with BaseQualityRecalibration output?
Thanks!
It depends what settings you used in ReduceReads -- but in principle it is not a problem anyway, if what you want to know is the useful coverage that is informative for calling variants. However, if you are trying to diagnose coverage as a sequencing QC process, then you should do it earlier in your pipeline.
Why is DepthOfCoverage being retired? DiagnoseTargets seems to provide less information in a format that requires additional processing.
Edited the message -- DoC will not be retired immediately, we will wait until DiagnoseTargets satisfies users' needs.
Hi, I wish to find out position specific coverage in a bacterial genome. I am not sure how to set the "-L" option to do so.
java -jar GenomeAnalysisTK.jar -R reference.fasta -T DepthOfCoverage -o depth_of_coverage.txt -I aligned_reads.bam -L??
The examples in gatk website show it like "-L chr1:100-200". I tried different combinations but i get an error saying contig does not match. Could you please help me with this option.
The header for SAM file looks like this if you need it.
@HD VN:1.3
@SQ SN:ref000001|sid|576|accn|NC_000962 LN:4411532 M5:e49320ac872e44cb8058de0407772516
@RG ID:1a7146a3a4 SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s2_p0PL:PacBio_RS DT:2012-10-09T21:01:18
@RG ID:d2c759b45d SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s1_p0PL:PacBio_RS DT:2012-10-09T21:01:18
@CO READS:151621
Hi Ashu,
The format is
-L contig:start-stop
. You need to know what is the name of the contig you want to analyze.what's the different between file *.interval_list and *.intervals?
@blueskypy, those are used interchangeably as extension names for interval list files; there is no inherent difference.
what's the meaning of the 1st column in the geneList file?
The UCSC table browser gives a full header for the columns. The first column is called #bin. For more details, you should look it up on their website.
Took a while to figure out the format of the -geneList file. In case I'm not the only dummy here, the format is described at http://genome.ucsc.edu/FAQ/FAQformat.html#format9
And the refGene file can be downloaded at ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
Thanks for sharing your findings (especially direct links!), I'm sure this will be useful to others.
I use the following command;
java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T DepthOfCoverage \ -R $refGenome -I H06.recal.bam -I H09.recal.bam -I H10.recal.bam \ -geneList HLA.genePred -rf BadCigar -o HLA
but some of the output files are empty, why?
And here are a few lines in the HLA.genePred
@blueskypy, can you tell me what version you are using?
hi, Geraldine,
Thanks so much for the help! I'm using GATK-2.5-2
Actually, it's not version dependent -- it's because you didn't specify intervals. If you specify an interval list (using
-L
) the files will get populated.hi, Geraldine,
Thanks for the help! I thought the -L isn't necessary if -geneList is provided since the interval info is in the geneList file. So if I extract the exonStarts and exonEnds from the geneList file to the interval file, will DepthOfCoverage map the intervals to gene names and produce the .sample_gene_summary file?
btw, since the -geneList file contains intevals for both CDS and exons, does DepthOfCoverage compute the coverage depth of genes based on CDS or exons?
I use the following command:
java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T DepthOfCoverage -R $refGenome -l DEBUG \ -I H06.recal.bam -I H09.recal.bam -I H10.recal.bam -geneList HLA.genePred -L HLA.intervals -rf BadCigar -o HLA
And got the following warning on every genes in the geneList file HLA.genePred:
However, there is no negative length in the entry of genes, here is one example:
833 NM_002122 chr6 + 32605182 32611429 32605235 32610541 5 32605182,32609086,32609748,32610386,32610728, 32605317,32609335,32610030,32610561,32611429, 0 HLA-DQA1 cmpl cmpl 0,1,1,1,-1,
Could someone help? Thanks a lot!
Not sure what's going on here. Have you tried testing on a subset and checking whether the results look reasonable?
Is the coverage that we get from depth of coverage for the whole genome equal to the one we get from Lander/Waterman equation as given on http://res.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
Or is it average read depth?
DepthOfCoverage gives you the straight-up average depth.
Has anybody successfully used the DepthofCoverage tool to find average coverage per gene for RNA-Seq data?
I've tried running the following...
java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T DepthOfCoverage -o test.txt -I input.bam -geneList refGene.sorted.txt -L refGene.sorted.intervals -U ALLOW_N_CIGAR_READS
... but the summary file is empty. Any suggestions?
This may be an issue of how your intervals are set up. Have a look at this question & answer for more details on how it should be.
If that's not the issue, have you checked if it works for you with regular DNA data? This to rule out the N-containing cigars as a cause for the problem.
Hi,
I'm trying to run depthofcoverage on 40 reduced exome bams (one sample per bam) and the processing time per 1M reads seems to be getting higher and higher such that it seems that it will take weeks for this to finish. I've limited the scope of the analysis using -L argument, but I was wondering has anyone experienced something like this before?
Thanks,
MC
Hi @chongm,
Sounds like the program got stuck in a region of excessive coverage, perhaps. Are you using downsampling? And are you running the latest version?
Hi Geraldine, I can't run the latest version of GATK unfortunately because we are still stuck on GATK1.6. Downsampling helps but it is still very slow.
Hi Everyone,
I've used GATK DOC to calculate coverage for some exonic intervals of two different genes. I'm noticing that for some exons the sample_mean_coverage value from the .sample_interval_summary file appears to be exactly 127.00 for all samples, which is almost impossible just by chance. Has anyone else experienced anything like this? I'm using GATK version 2.5.2 . Is this a version bug or am I doing something wrong here?
Commands:
java -Xmx4g -jar $gatk \ -T DepthOfCoverage \ -R $ref2 \ -I $data/interstroke_AMY1A_LPA.list \ -L $LPA \ -o ${data}/interstroke_coverage_LPA
Hi @chongm,
Are you using DoC on reduced bams, by any chance? In version 2.5 ReduceReads applied downsampling to 127...
Hi @Geraldine_VdAuwera
You are absolutely right! Sorry I didn't realize that downsampling was defaulted for reduce reads.
Thanks,
Mike Chong
Hello,
In this post you say that two of the outputs of DepthOfCoverage are:
a) A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
b) A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)
Which, from my understanding, correspond to:
a) _cumulative_coverage_counts
b) _cumulative_coverage_proportions
Correct?
If yes, my questions are:
1) in _cumulative_coverage_counts the cumulative number of bases covered at 1X, for example, decreases with sample size. Is this because it is increasing the minimum coverage too? For example, for one sample it looks for bases with a minimum of 1X coverage, for 2 samples it looks for bases with a minimum of 2X coverage (one read from each sample), and so on and so forth.
2) still in _cumulative_coverage_counts, is the sample that covers more bases chosen first and then it goes on adding the second sample that covers more bases and so on?
3) in _cumulative_coverage_proportions the first column is not the number of samples (NSamples_1, etc.), but samples names. So, my understanding is not that it is showing me the proportion of loci with ≥18x coverage for ≥15 libraries, but that it is showing me the proportion of loci with ≥18x coverage for each sample separately. Is this correct?
Thank you,
ISP.
Hi @isp2,
No, I think what you're referring to are the tables in the interval_statistics and gene_statistics output files. The cumulative_coverage output files are histogram outputs that the article refers to.
Hi Geraldine,
I was wondering what filters you use to measure genome coverage for WGS Human data in a production environment? Such as base quality, mapping quality, etc.
I could not find Best Practices for that on the forum.
Thank you.
Hi @ekanterakis,
We currently don't provide detailed recommendations for doing production sequencing QC; at the Broad this is done upstream of us by a specialized group in the genome sequencing platform, which is responsible for production-level sequencing. At the moment we have no plans to develop documentation on this but if that changes, I'll let you know.
I have 24 exome samples that I am trying to get coverage information for. These have been run over two flow cells, and I think the depth of coverage that I'm getting from this tool is much lower than I believe it should be (based on number of reads and depth of coverage calculated from the first flow cell alone). Should I be looking at the mean that is output from DepthofCoverage? Is there a way to get information for the percentage of intervals that have >30X coverage or >40X coverage, etc? Thank you.
Hi @lmelee,
The DepthOfCoverage tool emits various statistics per locus and per interval, including histograms that you can use for that purpose.
Does DepthOfCoverage rely on correct values in the "bin" column (i.e. the first column) of the UCSC table?
I'd like to use DepthOfCoverage to analyze gene-by-gene sequencing coverage for an organism that is not hosted by UCSC's genome browser. Our organism has a GFF file, which I've modified to make a
-geneList
table like the one you've provided above. For the bin column, I've tried a variety of values (for example: 585, NA, 1). Regardless of the "bin" value, when I run DepthOfCoverage using this handmade file, it outputs by-chromosome statistics that seem reasonable but by-gene statistics that are totally off. In the "sample_gene_summary" table, instead of reporting coverage for each of the ~5000 genes in our organism, the program reports coverage for the first gene on each chromosome, but instead of reporting the coverage for that gene, it actually reports the coverage for that gene's chromosome.I'm trying to troubleshoot what could be wrong with my handmade
-geneList
file, and wondering if the "bin" column could be the culprit. I've attached the first few lines of my-geneList
file, if that helps.Also, I'm using GATK 3.2 with the following command:
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ref.fasta -I sample1.realigned.bam -o sample1.coverage -geneList test8.txt -L chromosome.intervals
.Thanks for any help!
Hi @cparobek,
Sorry for the late reply. I'm not sure what could be wrong as I haven't run DoC myself in ages, and am too short on time to give it a go right now. But I think I remember someone mentioning an unintuitive interaction between the -L intervals and the genelist. Can you do a run with just the gene list (not the -L) and let me know what happens?
@Geraldine_VdAuwera
@cparobek
the *sample_gene_summary is only where the -L intervals intersects with -geneList
http://gatkforums.broadinstitute.org/discussion/comment/2016/#Comment_2016
Hah, thanks @Kurt. I must have erased that knowledge from my brain to make space for something new since then
Ooh, your "mind palace", is it? That sounds awfully grandiose
))
BBC's SHERLOCK reference
Been wanting to use that one for awhile now
Hah, I should have recognized that! Great show. Glad you got to use that reference. You're welcome
Thank you both! Like @Kurt suggested, I ended up providing an interval file where the regions fall inside the regions provided in the RefSeq file. Can also make the start/stop positions in the interval file be the same as in the RefSeq file. Also, to answer my question about the bins (RefSeq file column 1), I am using NAs in that column and it seems to be performing just fine.
Thanks for a great tutorial. If I am interested in calculating average coverage for each chromosome/plasmid is the best approach to use the argument: -geneList /path/to/gene/list.txt
Should the list.txt include a -geneList /path/to/gene/list.txt of all chromosome names and do I also need to include interval positions which span the length of the entire chromosome?
Thank you!
@Katie
Hi,
An interval list specifies which parts of the dataset to analyze, whereas a gene list specifies intervals for which results should be aggregated.
For average coverage for each chromosome/plasmid, you can simply use -L with a list of the chromosomes and plasmids.
-Sheila
Fantastic this works nicely, thank you for the quick response!
hi Geraldine,
i have the target region coordiate information like this:chr1 10137431 10137716;
With this guide https://www.broadinstitute.org/gatk/guide/article?id=1329
i have got the refseq fiile.
chr1 2323213 2336885 NM_007033 0 + 2327229 2334563 0 7 184,88,105,100,79,136,2412, 0,4009,5341,....
i think this is intervel file,right?
For the genelist file , we can download from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/,right?
so my script is :
java -Xmx8g -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /media/Analysis/gatk_resource/ucsc.hg19.fasta --calculateCoverageOverGenes:REFSEQ /media/Analysis/annovar20140618/humandb/hg19_refGene.txt --outputFormat table --omitDepthOutputAtEachBase -o Coverage_summary -I ./realigned.recal.bam -L geneTrack.refSeq.bed
error message is
ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature chr11:89057522-89223909 followed later by chr5:154198052-154230213 and then chr11:319673-320914, for input source: /media/Analysis/annovar20140618/humandb/hg19_refGene.txt
thanks.
So sort the file to get the chromosomes in order: chr1,chr2 etc..
I was wondering if we can get coverage of exons if we give only the -i bam whose exons we are interested in, and -L exonsTargets.bed or .intervals without any genes list, so if we are not interested in annotation.
Is genesList required parameter for running -T DepthOfCoverage ?
Moreover the intervals list should be like "contig:start-end" or can it be in bed-like file : chr1 start stop ?
Hello Sheila, I am working on a similar command of DepthOfCoverage for my internship project and I am stuck !
Can you help me ?
I would like to find the coverage of the exons(as an average value of each exon interval) from an input bam, giving a exonlist.bed in format of columns : [chr start stop ] ,
as -L parameter.
I don't know if it is stupid to give it without genes-I haven't worked before on DepthOfcov- but practically I don't need the gene names/intervals.
Would it work and give what I want a command like the following for example:
java -Xmx4g -jar Programs/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T DepthOfCoverage
-I 39_sorted.bam -o 39_DepthofCovbyExon -L exons_intervals.bed -rf BadCigar
Is there any other parameter I can define for doing that better ?
Thanks in advance,
@mariagr The genes are not necessary, it's just an option if you want that information. If not you can go ahead with the command you posted, that should work just fine.
Hello again,many Thanks!
I did so, and it worked fine for a small bam of chr22 reads, but now I faced a problem I think, given the message I see at the end of the execution: For each bam (of 1 chr) I get :
MicroScheduler - -> 458 reads (0.02% of total) failing NotPrimaryAlignmentFilter
MicroScheduler - -> 3344 reads (0.15% of total) failing UnmappedReadFilter
Nb of reads is different but always on these two filters there are reads failing filtering.
I have already applied BWA,sorting with samtools on the whole genome bam, prepared the fasta (fai & dict) and then I do
samtools view -h -b [chr] > mychr_reads.bam
samtools index mychr_reads.bam
so I ran DepthOfCoverage like the command mentioned in my last post, on the smaller bam by chromosome.
Can this cause the problem?
Otherwise how can I apply these filters for overcoming these failure messages ?
Please guide me through any documentation I need to read for this,too, because I didn't find anything by googling the error and I ll have to run DepthOfCov for lots and lots of bam in a bash script so I have to be sure I won't get any errors or have bad quality output.
Best Regards,
Maria
@mariagr, this is not a problem; it is expected to have some reads that fail these filters. The coverage tool cannot use unmapped reads so they are ignored for this analysis. The filtering summary is not a error message. If you see it, it means your jobs ran successfully.
aaa, ok ! So I can ignore it ! I was afraid I would need more correction commands which take time !!
Merci :-)
Hello,
there is a detail I didn't consider and I realise now ! I 'm not sure if I should re-run everything or no.
I used sorted bam files for Depth of Coverage but I didn't run MarkDuplicates ! How can I verify from DepthOfCov output that I have the correct output? Is it absolutely necessary to have done it before for calculating coverage ? Will there be a bias in coverage results ? How can I verify it ?From statistics files maybe ? Or would I have got an error message concerning duplicates if that was a problem?
@mariagr if you did not mark duplicates, the analysis cannot differentiate between duplicate and non duplicate read, so it cannot estimate the useable amount o coverage you have. I would strongly recommend redoing the analysis after marking duplicates.
@Geraldine_VdAuwera Hi, I have a question about mm10.refgene from UCSC. The one I got based on your description, do not contain chrM and when I ran the Depthofcoverage command, I have this error: mm10.refgene.txt and reference genome are not compatible. The only difference between them is chrM is not exist in refseq gene list for mm10. Can you please help me here. Many thanks in advance. rahel
@Rash
Hi,
It sounds like you got the incorrect refgene list. You should go back and ensure you have the correct one. This should not happen if you specified the exact same reference your BAM file is aligned to.
Check whether the mm10 reference provided by UCSC is the exact same one you used to align your reads to. If not, see if your reference is actually a choice on the UCSC download website.
-Sheila
@Sheila , It works now, I needed to sort the refgen file based on the reference genome and then it worked and it was not related to chrM.
Many thanks for your reply, Rahel
@Rash
Hi Rahel
Thanks for sharing your solution!
-Sheila
I am admittedly using an older version of GATK still, 3.4-46, but when running this tool I have set a number of -ct arguments, at 100, 500, 1000, 2000, and 5000. The resulting 'sample_summary' file reports the same % latter four thresholds. Example.
sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_100 %_bases_above_500 %_bases_above_1000 %_bases_above_2000 %_bases_above_5000 0346 209898124 1238.61 500 500 327 93.0 63.7 63.7 63.7 63.7
This can not be true, and especially not for all samples. Was this perhaps a bug that has been resolved in later versions or am I doing something wrong? Thank you.
-bwubb
User error.
@bwubb
Hi bwubb,
It would be nice if you share what exactly the user error was so other users can benefit
-Sheila
P.S. Glad to hear you solved it on your own!
Hi,
I am not sure to have to set (-pt)option. I have multiple files(different library, run, and lane) of the same sample and after mapping these files I merged it in marked duplicate step. I followed the GATK best practices till printsReads and got the process single bam file.
Although I provided the same (one-merged) BAM file for coverage analysis, the default of the software was to break down the summary between all the initial files. Now I’ve identified the option partition (-pt). So I used it but still, it gives the same kind of results. I want to see the just one file coverage information.
Here are my commands-
java -Xmx20g -jar GenomeAnalysisTK.jar \
-T DepthOfCoverage \
-I Sample1.dedup.realigned.recal_reads.bam \
-R human_g1k_v37_decoy.fasta \
-L exome_targetedregions.bed \
-pt sample \
-omitBaseOutput \
--summaryCoverageThreshold 30 \
-o coverage_stats/Sample1.coverage.stats;)
And created summary output file is -
sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_30
Sample1_L1_R3a 305086414 4.91 8 5 2 0.2
Sample1_L1_R2 581390401 9.36 15 8 3 3.9
Sample1_L2_R4 303865489 4.89 8 5 2 0.2
Sample1_L1_R1 1824546268 29.39 43 24 11 40.3
Sample1_L1_R3 28066614 0.45 1 1 1 0.0
Total 3042955186 49.01 N/A N/A N/A
But I am only looking only one value for my this sample, something like
Sample1 total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_30
Combined value of all the library(L1/L2) and Runs(R1/R2/R3/R3a).
Please help me.
@priyatama
Hi,
It looks like you have different sample names. You should have the same identifier for the SM tag in the BAM file if the reads come from the same sample. Have a look at this document for more information.
-sheila
Thanks, Sheila.