Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Questions about DepthOfCoverage

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
This discussion was created from comments split from: Using DepthOfCoverage to find out how much sequence data you have.

Comments

  • chongmchongm Member

    If we are not on the broad network, is there still a way to obtain the gene list?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    No, I'm afraid we do not provide public access to that resource. You'll need to make your own, sorry.

  • pepe13pepe13 AustraliaMember

    Hi, the link for the GATK DepthOfCoverage Documentation isn't working... Is it possible to post the updated link?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, the link is what it should be but the doc page is missing. I will find out why. In the meantime you can get the command line arguments by running the walker with -h (for "help").

  • artitandonartitandon Member ✭✭

    Is there a description somewhere of the various output files?

  • sarmadymsarmadym Member

    Does "Coverage by Gene" calculates coverage only within the intervals specified the intervals files (-L)? For example, if we have a whole exome experiment, and we provide DepthOfCoverage with the captured regions .intervals file, and we run coverage by gene as well, Is the output stats per gene only for regions within the captured regions or it is for all sites of the gene?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    That's correct, using an intervals file with -L will limit the scope of analysis to only the targeted intervals.

  • sarmadymsarmadym Member

    @Geraldine_VdAuwera said:
    That's correct, using an intervals file with -L will limit the scope of analysis to only the targeted intervals.

    Thanks a lot. I was wondering if there is a way to add gene name or a custom annotation as a column in the sample_interval_summary that gets generated. As an example, say I feed the tool with whole exome captured (.intervals file) regions as well as refseq genes using -geneList , the sample_interval_summary would show only the coordinate of the regions and no other annotation (such as gene or transcript). Is there a way to get those annotations in the sample_interval_summary file?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The gene name should be referenced in the summary table as explained here:

    The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls.

  • emossemoss Member

    Is there a description available of the columns in the genelist file?

  • emossemoss Member
    edited December 2012

    @emoss said:
    Is there a description available of the columns in the genelist file?

    Nevermind! http://genome.ucsc.edu/cgi-bin/hgTables

  • sarmadymsarmadym Member
    edited December 2012

    @Geraldine_VdAuwera I use this tool to identify regions in our intervals(ROI) that have at least one base covered less than 30x. But In the sample_interval_summary file, the accuracy of %_above_30 (or %_above_anyDP) is not high enough. Accuracy is 0.001 (0.1%) which means if I have a region with the length of 5000bp, there should be at least 5bp below the DP cutoff for DepthOfCoverage tool to report a coverage percentage below 100.0% (for that DP cutoff). Is there a way to improve this accuracy?

    Alternatively, if you can add minimum coverage for each interval (not sure how to get min coverage per interval using current GATK walkers), that will address this accuracy issue to some extent.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there, sorry for the delayed response.

    There is no option to directly control the accuracy of this tool. Depending on what you're trying to so, you may want to look at our other diagnostics / BAM processing tools such as DiagnoseTargets.

    If by minimum coverage per interval you mean you want to exclude those low-covered intervals from analysis, there is currently no direct way to do that. We prefer to call those intervals then filter out variants that are supported by too few reads.

  • cparobekcparobek Member

    Along the lines of Sarmadym's November post, if I have BED file with gene intervals and names, is is possible for GATK to output a sample_interval_summary file that includes not only chromosome and coordinates, but the also the gene name itself? I am using '-L path/to/bed.bed' as the interval file. I have already tried using '-geneList path/to/refseq/gene/list.txt' but I'm not seeing the additional summary file that's described above. Thanks for your help.

  • cparobekcparobek Member
    edited January 2013

    Thanks - very good to know. DiagnoseTargets seems easy to use and fast, and it presents useful info in an easy-to-understand format. However, I'm still having trouble getting my feature names to appear in the output file. Currently the ID field in that file is blank (a dot). The BED file I'm feeding the program is in this format:

    chr_name \t start_coordinate \t end_coordinate \t feature_name \t strand

    I've searched and searched GATK documentation, but can't find any clues for how to modify my BED file so that DiagnoseTargets will populate the ID field with feature names. Any ideas are much appreciated.

  • CarneiroCarneiro Charlestown, MAMember admin

    I am not sure I understand what your question is. Could you provide me with the command line you're using for DiagnoseTargets?

  • cparobekcparobek Member

    Sure thing, thanks. I believe DiagnoseTargets is working fine, it's just not populating the "ID" field in the output file.

    Here's the command line I'm using:
    java -jar ~/src/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar \
    -T DiagnoseTargets \
    -I alignments2/ERR066280_sorted.bam.bam \
    -R reference_genome/PlasmoDB-9.2_Pfalciparum3D7_Genome.fasta \
    -L pileups/genetest.bed \
    -o GATK_DiagnoseTargets/tryagain.txt

    This runs for a few seconds (assessing coverage for 6 genes - I have 5000+ I'd eventually like to look at) and produces a 10-column file, tryagain.txt, attached. Also produces a .idx file. See how the ID column of the attached file is blank? I'd like that column to include the names of my genes, which are in the BED file I'm using as input, genetest.bed, which looks like this:

    Pf3D7_01_v3 71623 72425 PF3D7_0101200 +
    Pf3D7_01_v3 74562 75365 PF3D7_0101300 +
    Pf3D7_01_v3 75981 76808 PF3D7_0101400 -
    Pf3D7_01_v3 78240 79890 PF3D7_0101500 +
    Pf3D7_01_v3 81764 83105 PF3D7_0101600 -
    Pf3D7_01_v3 84790 86151 PF3D7_0101700 -

    Col1 is the chr name, 2 and 3 are the coordinates of the gene, 4 is the gene name, and 5 is the strand. If you know of any way I could get the gene names (column for of the above BED) into the ID column of the attached DiagnoseTargets output, I'd be very appreciative.

  • CarneiroCarneiro Charlestown, MAMember admin

    Hmmm, the ID field in the VCF format is designed to represent the variant id in a database, like the mutation id in DBSNP or HAPMAP. Here we are using the VCF to display a region, instead of a variant, and filling the ID field would probably be a frowned upon by the VCF community.

    That being said, this is a very easy problem to solve, you can do it yourself with an awk script to match the chr/pos to the id and fill in the blanks if you need. Is this something you (or someone in your team) would be comfortable doing?

    From our end, if we were to support BED files with ID's it would have to be a more formal approach. I will look into it and whether or not the ID field would be the best one to provide. We accept not only BED files , but also interval and vcf files in the -L argument, and not all of them will have the ID in the same position.

  • cparobekcparobek Member

    Ohhh, that's what the ID field is for. Now it makes sense why I couldn't get my gene names to appear there. That would be great if you supported BED files with gene names, but in the mean time you're right - awk is probably the easiest solution. I can implement that for our own purposes. Thanks for your help...

  • annatannat Member

    I have been using the DepthOfCoverage tool a lot recently and have found that it produces a lot of really useful information in the output files that I am using for QC at our genome centre. I plan to continue to use them extensively in the future.

    When I went to check some things in the documentation, I saw that you are sadly going to retire this tool, and it will be replaced by DiagnoseTargets. So I am responding to your request for comments on missing functionalities.

    I have now tried DiagnoseTargets and DepthOfCoverage on the same datasets, and I can see that there is a lot of functionality missing from DiagnoseTargets.

    The output vcf file from DignoseTargets provides some of the same information as the out.sample_interval_summary from DepthOfCoverage but not all, the total experiment coverage, average experiment coverage and total sample coverage are not provided. Additionally the actual values are different – is this due to downsampling in DiagnoseTargets? Here is an example:

    out.sample_interval_summary:
    Target chr1:762098-762270
    total coverage 227719
    average coverage 1316.29
    Sample1 total cvg 156728
    Sample1 mean cvg 905.94
    Sample1 granular Q1 >500
    Sample1 granular median >500
    Sample1 granular Q3 >500
    Sample1 % above 15 100
    Sample2 total cvg 70991
    Sample2 mean cvg 410.35
    Sample2 granular Q1 322
    Sample2 granular median 435
    Sample2 granular Q3 >500
    Sample2 % above 15 100

    diagnoseTargets.vcf:
    CHROM chr1
    POS 762098
    ID 0
    REF T
    ALT


    QUAL 0
    FILTER PASS
    INFO AVG_INTERVAL_DP=287.96;END=762270
    FORMAT AVG_INTERVAL_DP:MED:Q1:Q3
    Sample1 175.43:200.00:108.00:247.00
    Sample2 112.53:124.00:65.00:163.00

    There isn't any replacement for any of the other files from DepthOfCoverage:

    out:
    total coverage, average sample coverage, coverage per sample at every base position in interval file

    out.sample_summary:
    total coverage, average coverage, quantiles coverage, %bases above 15 coverage for each sample

    out.sample_statistics:
    base position counts for x amount (default intevals of 1, 0 to 500 and 500+) of coverage for each sample [For histogram]

    out.sample_cumulative_coverage_proportions:
    proportion of base positions for number of reads greater than x for each sample

    out.sample_cumulative_coverage_counts:
    matrix of base position counts for number of samples greater than n, by number of reads greater than x

    out.sample_interval_statistics:
    matrix of number of intevals for number of samples greater than n, by number of reads greater than x

    For me at the moment the out.sample_summary, out.sample_cumulative_coverage_proportions, out and out.sample_interval_summary files are really useful but I can envisage that the other output is useful for other parts of analysis and I may well use them in the future too.

    It would be really great if you could add the functionality outlined above to DiagnoseTargets, or better still reverse your decision to retire DepthOfCoverage.

    I am also finding the use of vcf style files for non variant data irritating. They require a lot of extra parsing, cannot just be passed on to end users and can potentially take up a lot more space than a tab delimited file.

    Thanks for listening to my thoughts!
    Anna

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Anna,

    Thanks for the feedback, this is very useful. We will see what we can do to make DiagnoseTargets better respond to your needs.

  • Jeremy37Jeremy37 Member

    Hi Geraldine.
    I agree with everything that Annat has posted.

    I use DepthOfCoverage for a large number (a few hundreds) of human exomes from individuals with rare diseases.
    Two of the most useful outputs of DepthOfCoverage are missing in DiagnoseTargets (from what I can tell):

    • sample_summary: I like to have a concise summary of mean and median coverage for each sample, as well as %of target bases covered at different thresholds (e.g. 5x, 10x, 20x, 30x, etc) averaged over all targeted bases in the sample.
    • sample_interval_summary: I like to have the %of target bases covered at different thresholds for EACH target (exon) in the sample. This is often useful because we may have a restricted set of genes in mind. We can check whether any specific exons are missing coverage in one or more samples and might want to follow those up.

    The fixed classifier of DiagnoseTargets (e.g. "LOW_COVERAGE") based on a single threshold (e.g. --minimum_coverage in DiagnoseTargets) is much less useful to me.
    I also agree that the VCF-style output format is awkward and inappropriate.

    Also, I have had no trouble running DepthOfCoverage with reasonable memory limits (4 Gb) even on large (100 Gb) multi-sample BAM files by controlling --read_buffer_size. With DiagnoseTargets I have yet to get it to run on the same BAM file even when providing 30 Gb of memory (our cluster kills the job when it exceeds the requested memory usage). This occurs when I specify a whole chromosome as the interval of interest. It would run when I gave shorter intervals.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks for the useful feedback, Jeremy. We'll take this into account.

    Not sure what's going on with performance but we'll take a look at that too and try to improve it.

  • MamtaMamta Member

    Hi,
    I ma trying to use depthofcoverage, to get the coverage over an interval. But I get an error:

    ERROR A USER ERROR has occurred (version 1.3-20-g447e9bf):
    ERROR MESSAGE: File associated with name /Users/ngs_user/Desktop/Coverage/makeCoverageScript/allexonsT1.bed is malformed: Interval file could not be parsed in any supported format. caused by BED files must be parsed through Tribble; parsing them as intervals through the GATK engine is no longer supported
    ERROR ------------------------------------------------------------------------------------------

    My interval file has the format:
    chr1:12011-12058
    chr1:12180-12228
    chr1:12614-12698
    chr1:12976-13053
    chr1:13222-13375

    It works fine without an interval file or just just one interval (-L chr1:12011-12058). Is the interval file format incorrect?
    Thanks in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    edited February 2013

    Your intervals look fine, but the version you're using is downright ancient... You need to upgrade to a more recent version; preferably the latest.

  • asease Member

    Hi, In the new version of GATK there is ReduceReads part. Is there a way to call DepthOfCoverage from the output of ReduceReads reduced.bam file?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes, you can use DepthOfCoverage with reduced nam files just as you would any other bam. No special treatment necessary.

  • asease Member

    I would like to calculate the depth of coverage ratio of tumor and blood samples. I am using .coverage.sample_interval_summary output of Depth of Coverage. If I run DepthOfCoverage on reduced.bam files of tumor and blood, the ratio won't be correct right? Because the reads that are not informative in variant calling are removed? I was using BaseQualityRelicalibration output for DepthOfCoverage calculation. Should I continue with BaseQualityRecalibration output?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    It depends what settings you used in ReduceReads -- but in principle it is not a problem anyway, if what you want to know is the useful coverage that is informative for calling variants. However, if you are trying to diagnose coverage as a sequencing QC process, then you should do it earlier in your pipeline.

  • igorigor New YorkMember ✭✭

    Why is DepthOfCoverage being retired? DiagnoseTargets seems to provide less information in a format that requires additional processing.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Edited the message -- DoC will not be retired immediately, we will wait until DiagnoseTargets satisfies users' needs.

  • AshuAshu Member
    edited April 2013

    Hi, I wish to find out position specific coverage in a bacterial genome. I am not sure how to set the "-L" option to do so.

    java -jar GenomeAnalysisTK.jar -R reference.fasta -T DepthOfCoverage -o depth_of_coverage.txt -I aligned_reads.bam -L??

    The examples in gatk website show it like "-L chr1:100-200". I tried different combinations but i get an error saying contig does not match. Could you please help me with this option.

    The header for SAM file looks like this if you need it.

    @HD VN:1.3

    @SQ SN:ref000001|sid|576|accn|NC_000962 LN:4411532 M5:e49320ac872e44cb8058de0407772516

    @RG ID:1a7146a3a4 SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s2_p0PL:PacBio_RS DT:2012-10-09T21:01:18

    @RG ID:d2c759b45d SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s1_p0PL:PacBio_RS DT:2012-10-09T21:01:18

    @CO READS:151621

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Ashu,

    The format is -L contig:start-stop. You need to know what is the name of the contig you want to analyze.

  • blueskypyblueskypy Member ✭✭

    what's the different between file *.interval_list and *.intervals?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @blueskypy, those are used interchangeably as extension names for interval list files; there is no inherent difference.

  • blueskypyblueskypy Member ✭✭

    what's the meaning of the 1st column in the geneList file?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The UCSC table browser gives a full header for the columns. The first column is called #bin. For more details, you should look it up on their website.

  • blueskypyblueskypy Member ✭✭

    Took a while to figure out the format of the -geneList file. In case I'm not the only dummy here, the format is described at http://genome.ucsc.edu/FAQ/FAQformat.html#format9
    And the refGene file can be downloaded at ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks for sharing your findings (especially direct links!), I'm sure this will be useful to others.

  • blueskypyblueskypy Member ✭✭
    edited July 2013

    I use the following command;

    java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T DepthOfCoverage \ -R $refGenome -I H06.recal.bam -I H09.recal.bam -I H10.recal.bam \ -geneList HLA.genePred -rf BadCigar -o HLA

    but some of the output files are empty, why?

    $ ls -al HLA*
    -rw-r--r-- 1 user01 root 71394299565 Jul  3 05:34 HLA
    -rw-r--r-- 1 user01 root           0 Jul  3 05:34 HLA.sample_interval_statistics
    -rw-r--r-- 1 user01 root           0 Jul  3 05:34 HLA.sample_gene_summary
    -rw-r--r-- 1 user01 root       11462 Jul  3 05:34 HLA.sample_cumulative_coverage_proportions
    -rw-r--r-- 1 user01 root         266 Jul  3 05:34 HLA.sample_summary
    -rw-r--r-- 1 user01 root       17963 Jul  3 05:34 HLA.sample_statistics
    -rw-r--r-- 1 user01 root           0 Jul  3 05:34 HLA.sample_interval_summary
    -rw-r--r-- 1 user01 root       16574 Jul  3 05:34 HLA.sample_cumulative_coverage_counts
    -rw-r--r-- 1 user01 root        5367 Jul  1 22:47 HLA.genePred
    
  • blueskypyblueskypy Member ✭✭

    And here are a few lines in the HLA.genePred

    834 NM_020056   chr6    +   32709162    32714664    32709220    32714171    5   32709162,32712935,32713567,32714016,32714358,   32709302,32713184,32713849,32714191,32714664,   0   HLA-DQA2    cmpl    cmpl    0,1,1,1,-1,
    834 NM_001198858    chr6    -   32723874    32731330    32724229    32731247    5   32723874,32725057,32726626,32729436,32731150,   32724243,32725081,32726908,32729703,32731330,   0   HLA-DQB2    cmpl    cmpl    1,1,1,1,0,
    823 NM_005514   chr6    -   31321648    31324989    31322259    31324935    8   31321648,31322255,31322409,31322883,31323093,31323943,31324464,31324862,    31322073,31322303,31322442,31323000,31323369,31324219,31324734,31324989,    0   HLA-B   cmpl    cmpl    -1,1,1,1,1,1,1,0,
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @blueskypy, can you tell me what version you are using?

  • blueskypyblueskypy Member ✭✭

    hi, Geraldine,
    Thanks so much for the help! I'm using GATK-2.5-2

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Actually, it's not version dependent -- it's because you didn't specify intervals. If you specify an interval list (using -L) the files will get populated.

  • blueskypyblueskypy Member ✭✭

    hi, Geraldine,
    Thanks for the help! I thought the -L isn't necessary if -geneList is provided since the interval info is in the geneList file. So if I extract the exonStarts and exonEnds from the geneList file to the interval file, will DepthOfCoverage map the intervals to gene names and produce the .sample_gene_summary file?

  • blueskypyblueskypy Member ✭✭

    btw, since the -geneList file contains intevals for both CDS and exons, does DepthOfCoverage compute the coverage depth of genes based on CDS or exons?

  • blueskypyblueskypy Member ✭✭

    I use the following command:

    java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T DepthOfCoverage -R $refGenome -l DEBUG \ -I H06.recal.bam -I H09.recal.bam -I H10.recal.bam -geneList HLA.genePred -L HLA.intervals -rf BadCigar -o HLA

    And got the following warning on every genes in the geneList file HLA.genePred:

    WARN  16:58:32,227 Utils - ********************************************************************************
    WARN  16:58:32,227 Utils - * WARNING:
    WARN  16:58:32,227 Utils - *
    WARN  16:58:32,227 Utils - * RefSeq file is potentially incorrect, as some transcripts or exons
    WARN  16:58:32,227 Utils - * have a negative length (chr6)
    WARN  16:58:32,228 Utils - ********************************************************************************
    

    However, there is no negative length in the entry of genes, here is one example:

    833 NM_002122 chr6 + 32605182 32611429 32605235 32610541 5 32605182,32609086,32609748,32610386,32610728, 32605317,32609335,32610030,32610561,32611429, 0 HLA-DQA1 cmpl cmpl 0,1,1,1,-1,

    Could someone help? Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Not sure what's going on here. Have you tried testing on a subset and checking whether the results look reasonable?

  • smk_84smk_84 Member

    Is the coverage that we get from depth of coverage for the whole genome equal to the one we get from Lander/Waterman equation as given on http://res.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
    Or is it average read depth?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    DepthOfCoverage gives you the straight-up average depth.

  • CMBielskiCMBielski BroadMember

    Has anybody successfully used the DepthofCoverage tool to find average coverage per gene for RNA-Seq data?

    I've tried running the following...

    java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T DepthOfCoverage -o test.txt -I input.bam -geneList refGene.sorted.txt -L refGene.sorted.intervals -U ALLOW_N_CIGAR_READS

    ... but the summary file is empty. Any suggestions?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    This may be an issue of how your intervals are set up. Have a look at this question & answer for more details on how it should be.

    If that's not the issue, have you checked if it works for you with regular DNA data? This to rule out the N-containing cigars as a cause for the problem.

  • chongmchongm Member

    Hi,

    I'm trying to run depthofcoverage on 40 reduced exome bams (one sample per bam) and the processing time per 1M reads seems to be getting higher and higher such that it seems that it will take weeks for this to finish. I've limited the scope of the analysis using -L argument, but I was wondering has anyone experienced something like this before?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @chongm,

    Sounds like the program got stuck in a region of excessive coverage, perhaps. Are you using downsampling? And are you running the latest version?

  • chongmchongm Member

    Hi Geraldine, I can't run the latest version of GATK unfortunately because we are still stuck on GATK1.6. Downsampling helps but it is still very slow.

  • chongmchongm Member

    Hi Everyone,

    I've used GATK DOC to calculate coverage for some exonic intervals of two different genes. I'm noticing that for some exons the sample_mean_coverage value from the .sample_interval_summary file appears to be exactly 127.00 for all samples, which is almost impossible just by chance. Has anyone else experienced anything like this? I'm using GATK version 2.5.2 . Is this a version bug or am I doing something wrong here?

    Commands:

    java -Xmx4g -jar $gatk \ -T DepthOfCoverage \ -R $ref2 \ -I $data/interstroke_AMY1A_LPA.list \ -L $LPA \ -o ${data}/interstroke_coverage_LPA

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @chongm,

    Are you using DoC on reduced bams, by any chance? In version 2.5 ReduceReads applied downsampling to 127...

  • chongmchongm Member

    Hi @Geraldine_VdAuwera

    You are absolutely right! Sorry I didn't realize that downsampling was defaulted for reduce reads.

    Thanks,

    Mike Chong

  • isp2isp2 Member
    edited December 2013

    Hello,

    In this post you say that two of the outputs of DepthOfCoverage are:

    a) A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)

    b) A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

    Which, from my understanding, correspond to:

    a) _cumulative_coverage_counts

    b) _cumulative_coverage_proportions

    Correct?

    If yes, my questions are:

    1) in _cumulative_coverage_counts the cumulative number of bases covered at 1X, for example, decreases with sample size. Is this because it is increasing the minimum coverage too? For example, for one sample it looks for bases with a minimum of 1X coverage, for 2 samples it looks for bases with a minimum of 2X coverage (one read from each sample), and so on and so forth.

    2) still in _cumulative_coverage_counts, is the sample that covers more bases chosen first and then it goes on adding the second sample that covers more bases and so on?

    3) in _cumulative_coverage_proportions the first column is not the number of samples (NSamples_1, etc.), but samples names. So, my understanding is not that it is showing me the proportion of loci with ≥18x coverage for ≥15 libraries, but that it is showing me the proportion of loci with ≥18x coverage for each sample separately. Is this correct?

    Thank you,
    ISP.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @isp2,

    No, I think what you're referring to are the tables in the interval_statistics and gene_statistics output files. The cumulative_coverage output files are histogram outputs that the article refers to.

  • ekanterakisekanterakis UKMember

    Hi Geraldine,
    I was wondering what filters you use to measure genome coverage for WGS Human data in a production environment? Such as base quality, mapping quality, etc.
    I could not find Best Practices for that on the forum.
    Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @ekanterakis‌,

    We currently don't provide detailed recommendations for doing production sequencing QC; at the Broad this is done upstream of us by a specialized group in the genome sequencing platform, which is responsible for production-level sequencing. At the moment we have no plans to develop documentation on this but if that changes, I'll let you know.

  • lmeleelmelee Los Angeles, CAMember

    I have 24 exome samples that I am trying to get coverage information for. These have been run over two flow cells, and I think the depth of coverage that I'm getting from this tool is much lower than I believe it should be (based on number of reads and depth of coverage calculated from the first flow cell alone). Should I be looking at the mean that is output from DepthofCoverage? Is there a way to get information for the percentage of intervals that have >30X coverage or >40X coverage, etc? Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @lmelee,

    The DepthOfCoverage tool emits various statistics per locus and per interval, including histograms that you can use for that purpose.

  • Does DepthOfCoverage rely on correct values in the "bin" column (i.e. the first column) of the UCSC table?

    I'd like to use DepthOfCoverage to analyze gene-by-gene sequencing coverage for an organism that is not hosted by UCSC's genome browser. Our organism has a GFF file, which I've modified to make a -geneList table like the one you've provided above. For the bin column, I've tried a variety of values (for example: 585, NA, 1). Regardless of the "bin" value, when I run DepthOfCoverage using this handmade file, it outputs by-chromosome statistics that seem reasonable but by-gene statistics that are totally off. In the "sample_gene_summary" table, instead of reporting coverage for each of the ~5000 genes in our organism, the program reports coverage for the first gene on each chromosome, but instead of reporting the coverage for that gene, it actually reports the coverage for that gene's chromosome.

    I'm trying to troubleshoot what could be wrong with my handmade -geneList file, and wondering if the "bin" column could be the culprit. I've attached the first few lines of my -geneList file, if that helps.

    Also, I'm using GATK 3.2 with the following command: java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ref.fasta -I sample1.realigned.bam -o sample1.coverage -geneList test8.txt -L chromosome.intervals.

    Thanks for any help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @cparobek‌,

    Sorry for the late reply. I'm not sure what could be wrong as I haven't run DoC myself in ages, and am too short on time to give it a go right now. But I think I remember someone mentioning an unintuitive interaction between the -L intervals and the genelist. Can you do a run with just the gene list (not the -L) and let me know what happens?

  • KurtKurt Member ✭✭✭

    @Geraldine_VdAuwera‌

    @cparobek

    the *sample_gene_summary is only where the -L intervals intersects with -geneList

    http://gatkforums.broadinstitute.org/discussion/comment/2016/#Comment_2016

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hah, thanks @Kurt. I must have erased that knowledge from my brain to make space for something new since then :)

  • KurtKurt Member ✭✭✭

    :) I purge my "mind palace" hourly nowadays it seems @Geraldine_VdAuwera :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ooh, your "mind palace", is it? That sounds awfully grandiose :)))

  • KurtKurt Member ✭✭✭

    BBC's SHERLOCK reference ;)

    Been wanting to use that one for awhile now ;)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hah, I should have recognized that! Great show. Glad you got to use that reference. You're welcome ;)

  • cparobekcparobek Member

    Thank you both! Like @Kurt suggested, I ended up providing an interval file where the regions fall inside the regions provided in the RefSeq file. Can also make the start/stop positions in the interval file be the same as in the RefSeq file. Also, to answer my question about the bins (RefSeq file column 1), I am using NAs in that column and it seems to be performing just fine.

  • KatieKatie United StatesMember ✭✭

    Thanks for a great tutorial. If I am interested in calculating average coverage for each chromosome/plasmid is the best approach to use the argument: -geneList /path/to/gene/list.txt
    Should the list.txt include a -geneList /path/to/gene/list.txt of all chromosome names and do I also need to include interval positions which span the length of the entire chromosome?

    Thank you!

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Katie‌

    Hi,

    An interval list specifies which parts of the dataset to analyze, whereas a gene list specifies intervals for which results should be aggregated.

    For average coverage for each chromosome/plasmid, you can simply use -L with a list of the chromosomes and plasmids.

    -Sheila

  • KatieKatie United StatesMember ✭✭

    Fantastic this works nicely, thank you for the quick response!

  • yuegeorgeyuegeorge hkMember

    hi Geraldine,
    i have the target region coordiate information like this:chr1 10137431 10137716;
    With this guide https://www.broadinstitute.org/gatk/guide/article?id=1329
    i have got the refseq fiile.
    chr1 2323213 2336885 NM_007033 0 + 2327229 2334563 0 7 184,88,105,100,79,136,2412, 0,4009,5341,....
    i think this is intervel file,right?

    For the genelist file , we can download from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/,right?
    so my script is :
    java -Xmx8g -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /media/Analysis/gatk_resource/ucsc.hg19.fasta --calculateCoverageOverGenes:REFSEQ /media/Analysis/annovar20140618/humandb/hg19_refGene.txt --outputFormat table --omitDepthOutputAtEachBase -o Coverage_summary -I ./realigned.recal.bam -L geneTrack.refSeq.bed
    error message is

    ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature chr11:89057522-89223909 followed later by chr5:154198052-154230213 and then chr11:319673-320914, for input source: /media/Analysis/annovar20140618/humandb/hg19_refGene.txt

    thanks.

  • ZaagZaag Member ✭✭
    edited January 2015

    So sort the file to get the chromosomes in order: chr1,chr2 etc..

  • mariagrmariagr ZMBHMember
    edited April 2015

    @Geraldine_VdAuwera said:
    The gene name should be referenced in the summary table as explained here:

    The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls.

    I was wondering if we can get coverage of exons if we give only the -i bam whose exons we are interested in, and -L exonsTargets.bed or .intervals without any genes list, so if we are not interested in annotation.
    Is genesList required parameter for running -T DepthOfCoverage ?
    Moreover the intervals list should be like "contig:start-end" or can it be in bed-like file : chr1 start stop ?

  • mariagrmariagr ZMBHMember
    edited April 2015

    @Sheila said:
    Hi,
    An interval list specifies which parts of the dataset to analyze, whereas a gene list specifies intervals for which results should be aggregated.
    For average coverage for each chromosome/plasmid, you can simply use -L with a list of the chromosomes and plasmids.
    -Sheila

    Hello Sheila, I am working on a similar command of DepthOfCoverage for my internship project and I am stuck !
    Can you help me ?
    I would like to find the coverage of the exons(as an average value of each exon interval) from an input bam, giving a exonlist.bed in format of columns : [chr start stop ] ,
    as -L parameter.
    I don't know if it is stupid to give it without genes-I haven't worked before on DepthOfcov- but practically I don't need the gene names/intervals.
    Would it work and give what I want a command like the following for example:
    java -Xmx4g -jar Programs/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T DepthOfCoverage
    -I 39_sorted.bam -o 39_DepthofCovbyExon -L exons_intervals.bed -rf BadCigar

    Is there any other parameter I can define for doing that better ?
    Thanks in advance,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @mariagr The genes are not necessary, it's just an option if you want that information. If not you can go ahead with the command you posted, that should work just fine.

  • mariagrmariagr ZMBHMember
    edited May 2015

    Hello again,many Thanks!
    I did so, and it worked fine for a small bam of chr22 reads, but now I faced a problem I think, given the message I see at the end of the execution: For each bam (of 1 chr) I get :
    MicroScheduler - -> 458 reads (0.02% of total) failing NotPrimaryAlignmentFilter
    MicroScheduler - -> 3344 reads (0.15% of total) failing UnmappedReadFilter

    Nb of reads is different but always on these two filters there are reads failing filtering.
    I have already applied BWA,sorting with samtools on the whole genome bam, prepared the fasta (fai & dict) and then I do
    samtools view -h -b [chr] > mychr_reads.bam
    samtools index mychr_reads.bam

    so I ran DepthOfCoverage like the command mentioned in my last post, on the smaller bam by chromosome.
    Can this cause the problem?
    Otherwise how can I apply these filters for overcoming these failure messages ?

    Please guide me through any documentation I need to read for this,too, because I didn't find anything by googling the error and I ll have to run DepthOfCov for lots and lots of bam in a bash script so I have to be sure I won't get any errors or have bad quality output.
    Best Regards,
    Maria

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @mariagr, this is not a problem; it is expected to have some reads that fail these filters. The coverage tool cannot use unmapped reads so they are ignored for this analysis. The filtering summary is not a error message. If you see it, it means your jobs ran successfully.

  • mariagrmariagr ZMBHMember

    aaa, ok ! So I can ignore it ! I was afraid I would need more correction commands which take time !!
    Merci :-)

  • mariagrmariagr ZMBHMember

    Hello,
    there is a detail I didn't consider and I realise now ! I 'm not sure if I should re-run everything or no.
    I used sorted bam files for Depth of Coverage but I didn't run MarkDuplicates ! How can I verify from DepthOfCov output that I have the correct output? Is it absolutely necessary to have done it before for calculating coverage ? Will there be a bias in coverage results ? How can I verify it ?From statistics files maybe ? Or would I have got an error message concerning duplicates if that was a problem?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @mariagr if you did not mark duplicates, the analysis cannot differentiate between duplicate and non duplicate read, so it cannot estimate the useable amount o coverage you have. I would strongly recommend redoing the analysis after marking duplicates.

  • RashRash Member

    @Geraldine_VdAuwera Hi, I have a question about mm10.refgene from UCSC. The one I got based on your description, do not contain chrM and when I ran the Depthofcoverage command, I have this error: mm10.refgene.txt and reference genome are not compatible. The only difference between them is chrM is not exist in refseq gene list for mm10. Can you please help me here. Many thanks in advance. rahel

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Rash
    Hi,

    It sounds like you got the incorrect refgene list. You should go back and ensure you have the correct one. This should not happen if you specified the exact same reference your BAM file is aligned to.

    Check whether the mm10 reference provided by UCSC is the exact same one you used to align your reads to. If not, see if your reference is actually a choice on the UCSC download website.

    -Sheila

  • RashRash Member

    @Sheila , It works now, I needed to sort the refgen file based on the reference genome and then it worked and it was not related to chrM.
    Many thanks for your reply, Rahel

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Rash
    Hi Rahel

    Thanks for sharing your solution! :smile:

    -Sheila

  • bwubbbwubb Member ✭✭

    I am admittedly using an older version of GATK still, 3.4-46, but when running this tool I have set a number of -ct arguments, at 100, 500, 1000, 2000, and 5000. The resulting 'sample_summary' file reports the same % latter four thresholds. Example.

    sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_100 %_bases_above_500 %_bases_above_1000 %_bases_above_2000 %_bases_above_5000 0346 209898124 1238.61 500 500 327 93.0 63.7 63.7 63.7 63.7

    This can not be true, and especially not for all samples. Was this perhaps a bug that has been resolved in later versions or am I doing something wrong? Thank you.

    -bwubb

  • bwubbbwubb Member ✭✭
  • SheilaSheila Broad InstituteMember, Broadie admin

    @bwubb
    Hi bwubb,

    It would be nice if you share what exactly the user error was so other users can benefit :smile:

    -Sheila

    P.S. Glad to hear you solved it on your own!

  • priyatamapriyatama CAMember

    Hi,
    I am not sure to have to set (-pt)option. I have multiple files(different library, run, and lane) of the same sample and after mapping these files I merged it in marked duplicate step. I followed the GATK best practices till printsReads and got the process single bam file.
    Although I provided the same (one-merged) BAM file for coverage analysis, the default of the software was to break down the summary between all the initial files. Now I’ve identified the option partition (-pt). So I used it but still, it gives the same kind of results. I want to see the just one file coverage information.
    Here are my commands-

    java -Xmx20g -jar GenomeAnalysisTK.jar \
    -T DepthOfCoverage \
    -I Sample1.dedup.realigned.recal_reads.bam \
    -R human_g1k_v37_decoy.fasta \
    -L exome_targetedregions.bed \
    -pt sample \
    -omitBaseOutput \
    --summaryCoverageThreshold 30 \
    -o coverage_stats/Sample1.coverage.stats;)

    And created summary output file is -

    sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_30
    Sample1_L1_R3a 305086414 4.91 8 5 2 0.2
    Sample1_L1_R2 581390401 9.36 15 8 3 3.9
    Sample1_L2_R4 303865489 4.89 8 5 2 0.2
    Sample1_L1_R1 1824546268 29.39 43 24 11 40.3
    Sample1_L1_R3 28066614 0.45 1 1 1 0.0
    Total 3042955186 49.01 N/A N/A N/A

    But I am only looking only one value for my this sample, something like
    Sample1 total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_30

    Combined value of all the library(L1/L2) and Runs(R1/R2/R3/R3a).

    Please help me.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @priyatama
    Hi,

    It looks like you have different sample names. You should have the same identifier for the SM tag in the BAM file if the reads come from the same sample. Have a look at this document for more information.

    -sheila

  • priyatamapriyatama CAMember

    Thanks, Sheila.

Sign In or Register to comment.