Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Using DepthOfCoverage to find out how much sequence data you have

delangeldelangel Posts: 71GATK Developer mod
edited August 2013 in Methods and Workflows

See this announcement regarding our plans for support of DepthOfCoverage and DiagnoseTargets. If you find that there are functionalities missing in either tool, leave us a comment and we will consider adding them.

For a complete, detailed argument reference, refer to the GATK document page here.

Introduction

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:

  • Total, mean, median, and quartiles for each partition type: aggregate
  • Total, mean, median, and quartiles for each partition type: for each interval
  • A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)
  • A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X
  • A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
  • A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.

For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

Coverage by Gene

To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
587     NM_001005277    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F16  cmpl    cmpl    0,
587     NM_001005221    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F29  cmpl    cmpl    0,
589     NM_001005224    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F3   cmpl    cmpl    0,
589     NM_001005277    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F16  cmpl    cmpl    0,
589     NM_001005221    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F29  cmpl    cmpl    0,

If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at

/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt

If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:

Gene_Name     Total_Cvg       Avg_Cvg       Sample_1_Total_Cvg    Sample_1_Avg_Cvg    Sample_1_Cvg_Q3       Sample_1_Cvg_Median      Sample_1_Cvg_Q1
SORT1    594710  238.27  594710  238.27  165     245     330
NOTCH2  3011542 357.84  3011542 357.84  222     399     >500
LMNA    563183  186.73  563183  186.73  116     187     262
NOS1AP  513031  203.50  513031  203.50  91      191     290

Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.

Post edited by Geraldine_VdAuwera on

Comments

  • chongmchongm Posts: 33Member

    If we are not on the broad network, is there still a way to obtain the gene list?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi there,

    No, I'm afraid we do not provide public access to that resource. You'll need to make your own, sorry.

    Geraldine Van der Auwera, PhD

  • pepe13pepe13 Posts: 1Member

    Hi, the link for the GATK DepthOfCoverage Documentation isn't working... Is it possible to post the updated link?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hmm, the link is what it should be but the doc page is missing. I will find out why. In the meantime you can get the command line arguments by running the walker with -h (for "help").

    Geraldine Van der Auwera, PhD

  • artitandonartitandon Posts: 8Member

    Is there a description somewhere of the various output files?

  • sarmadymsarmadym Posts: 5Member

    Does "Coverage by Gene" calculates coverage only within the intervals specified the intervals files (-L)? For example, if we have a whole exome experiment, and we provide DepthOfCoverage with the captured regions .intervals file, and we run coverage by gene as well, Is the output stats per gene only for regions within the captured regions or it is for all sites of the gene?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    That's correct, using an intervals file with -L will limit the scope of analysis to only the targeted intervals.

    Geraldine Van der Auwera, PhD

  • sarmadymsarmadym Posts: 5Member

    @Geraldine_VdAuwera said: That's correct, using an intervals file with -L will limit the scope of analysis to only the targeted intervals.

    Thanks a lot. I was wondering if there is a way to add gene name or a custom annotation as a column in the sample_interval_summary that gets generated. As an example, say I feed the tool with whole exome captured (.intervals file) regions as well as refseq genes using -geneList , the sample_interval_summary would show only the coordinate of the regions and no other annotation (such as gene or transcript). Is there a way to get those annotations in the sample_interval_summary file?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    The gene name should be referenced in the summary table as explained here:

    The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls.

    Geraldine Van der Auwera, PhD

  • emossemoss Posts: 10Member

    Is there a description available of the columns in the genelist file?

  • emossemoss Posts: 10Member
    edited December 2012

    @emoss said: Is there a description available of the columns in the genelist file?

    Nevermind! http://genome.ucsc.edu/cgi-bin/hgTables

    Post edited by emoss on
  • sarmadymsarmadym Posts: 5Member
    edited December 2012

    @Geraldine_VdAuwera I use this tool to identify regions in our intervals(ROI) that have at least one base covered less than 30x. But In the sample_interval_summary file, the accuracy of %_above_30 (or %_above_anyDP) is not high enough. Accuracy is 0.001 (0.1%) which means if I have a region with the length of 5000bp, there should be at least 5bp below the DP cutoff for DepthOfCoverage tool to report a coverage percentage below 100.0% (for that DP cutoff). Is there a way to improve this accuracy?

    Alternatively, if you can add minimum coverage for each interval (not sure how to get min coverage per interval using current GATK walkers), that will address this accuracy issue to some extent.

    Post edited by sarmadym on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi there, sorry for the delayed response.

    There is no option to directly control the accuracy of this tool. Depending on what you're trying to so, you may want to look at our other diagnostics / BAM processing tools such as DiagnoseTargets.

    If by minimum coverage per interval you mean you want to exclude those low-covered intervals from analysis, there is currently no direct way to do that. We prefer to call those intervals then filter out variants that are supported by too few reads.

    Geraldine Van der Auwera, PhD

  • cparobekcparobek Posts: 4Member

    Along the lines of Sarmadym's November post, if I have BED file with gene intervals and names, is is possible for GATK to output a sample_interval_summary file that includes not only chromosome and coordinates, but the also the gene name itself? I am using '-L path/to/bed.bed' as the interval file. I have already tried using '-geneList path/to/refseq/gene/list.txt' but I'm not seeing the additional summary file that's described above. Thanks for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi there,

    We are actually in the process of retiring the DepthOfCoverage tool. It is being replaced by DiagnoseTargets. Please try using DiagnoseTargets, and if there are any functionalities that are missing from the tool, let us know and we will consider adding them.

    Geraldine Van der Auwera, PhD

  • cparobekcparobek Posts: 4Member
    edited January 2013

    Thanks - very good to know. DiagnoseTargets seems easy to use and fast, and it presents useful info in an easy-to-understand format. However, I'm still having trouble getting my feature names to appear in the output file. Currently the ID field in that file is blank (a dot). The BED file I'm feeding the program is in this format:

    chr_name \t start_coordinate \t end_coordinate \t feature_name \t strand

    I've searched and searched GATK documentation, but can't find any clues for how to modify my BED file so that DiagnoseTargets will populate the ID field with feature names. Any ideas are much appreciated.

    Post edited by cparobek on
  • CarneiroCarneiro Posts: 275Administrator, GATK Developer admin

    I am not sure I understand what your question is. Could you provide me with the command line you're using for DiagnoseTargets?

  • cparobekcparobek Posts: 4Member

    Sure thing, thanks. I believe DiagnoseTargets is working fine, it's just not populating the "ID" field in the output file.

    Here's the command line I'm using: java -jar ~/src/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar \ -T DiagnoseTargets \ -I alignments2/ERR066280_sorted.bam.bam \ -R reference_genome/PlasmoDB-9.2_Pfalciparum3D7_Genome.fasta \ -L pileups/genetest.bed \ -o GATK_DiagnoseTargets/tryagain.txt

    This runs for a few seconds (assessing coverage for 6 genes - I have 5000+ I'd eventually like to look at) and produces a 10-column file, tryagain.txt, attached. Also produces a .idx file. See how the ID column of the attached file is blank? I'd like that column to include the names of my genes, which are in the BED file I'm using as input, genetest.bed, which looks like this:

    Pf3D7_01_v3 71623 72425 PF3D7_0101200 + Pf3D7_01_v3 74562 75365 PF3D7_0101300 + Pf3D7_01_v3 75981 76808 PF3D7_0101400 - Pf3D7_01_v3 78240 79890 PF3D7_0101500 + Pf3D7_01_v3 81764 83105 PF3D7_0101600 - Pf3D7_01_v3 84790 86151 PF3D7_0101700 -

    Col1 is the chr name, 2 and 3 are the coordinates of the gene, 4 is the gene name, and 5 is the strand. If you know of any way I could get the gene names (column for of the above BED) into the ID column of the attached DiagnoseTargets output, I'd be very appreciative.

    txt
    txt
    tryagain.txt
    5K
  • CarneiroCarneiro Posts: 275Administrator, GATK Developer admin

    Hmmm, the ID field in the VCF format is designed to represent the variant id in a database, like the mutation id in DBSNP or HAPMAP. Here we are using the VCF to display a region, instead of a variant, and filling the ID field would probably be a frowned upon by the VCF community.

    That being said, this is a very easy problem to solve, you can do it yourself with an awk script to match the chr/pos to the id and fill in the blanks if you need. Is this something you (or someone in your team) would be comfortable doing?

    From our end, if we were to support BED files with ID's it would have to be a more formal approach. I will look into it and whether or not the ID field would be the best one to provide. We accept not only BED files , but also interval and vcf files in the -L argument, and not all of them will have the ID in the same position.

  • cparobekcparobek Posts: 4Member

    Ohhh, that's what the ID field is for. Now it makes sense why I couldn't get my gene names to appear there. That would be great if you supported BED files with gene names, but in the mean time you're right - awk is probably the easiest solution. I can implement that for our own purposes. Thanks for your help...

  • dandrewdandrew Posts: 1Member

    Document page still not active... help?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi @dandrew, as stated earlier we are replacing DepthOfCoverage by DiagnoseTargets. The missing doc is just an unrelated technical side issue but we've decided not to take the time to fix it since the tool is being retired anyway. Sorry for the inconvenience but this is a matter of resource prioritization.

    Geraldine Van der Auwera, PhD

  • CarneiroCarneiro Posts: 275Administrator, GATK Developer admin

    Dandrew, please use DiagnoseTargets instead. We are discontinuing DepthOfCoverage and the help page will be removed.

  • annatannat Posts: 7Member

    I have been using the DepthOfCoverage tool a lot recently and have found that it produces a lot of really useful information in the output files that I am using for QC at our genome centre. I plan to continue to use them extensively in the future.

    When I went to check some things in the documentation, I saw that you are sadly going to retire this tool, and it will be replaced by DiagnoseTargets. So I am responding to your request for comments on missing functionalities.

    I have now tried DiagnoseTargets and DepthOfCoverage on the same datasets, and I can see that there is a lot of functionality missing from DiagnoseTargets.

    The output vcf file from DignoseTargets provides some of the same information as the out.sample_interval_summary from DepthOfCoverage but not all, the total experiment coverage, average experiment coverage and total sample coverage are not provided. Additionally the actual values are different – is this due to downsampling in DiagnoseTargets? Here is an example:

    out.sample_interval_summary:
    Target chr1:762098-762270
    total coverage 227719
    average coverage 1316.29
    Sample1 total cvg 156728
    Sample1 mean cvg 905.94
    Sample1 granular Q1 >500
    Sample1 granular median >500
    Sample1 granular Q3 >500
    Sample1 % above 15 100
    Sample2 total cvg 70991
    Sample2 mean cvg 410.35
    Sample2 granular Q1 322
    Sample2 granular median 435
    Sample2 granular Q3 >500
    Sample2 % above 15 100

    diagnoseTargets.vcf:
    CHROM chr1
    POS 762098
    ID 0
    REF T
    ALT

    QUAL 0
    FILTER PASS
    INFO AVG_INTERVAL_DP=287.96;END=762270
    FORMAT AVG_INTERVAL_DP:MED:Q1:Q3
    Sample1 175.43:200.00:108.00:247.00
    Sample2 112.53:124.00:65.00:163.00

    There isn't any replacement for any of the other files from DepthOfCoverage:

    out:
    total coverage, average sample coverage, coverage per sample at every base position in interval file

    out.sample_summary:
    total coverage, average coverage, quantiles coverage, %bases above 15 coverage for each sample

    out.sample_statistics:
    base position counts for x amount (default intevals of 1, 0 to 500 and 500+) of coverage for each sample [For histogram]

    out.sample_cumulative_coverage_proportions:
    proportion of base positions for number of reads greater than x for each sample

    out.sample_cumulative_coverage_counts:
    matrix of base position counts for number of samples greater than n, by number of reads greater than x

    out.sample_interval_statistics:
    matrix of number of intevals for number of samples greater than n, by number of reads greater than x

    For me at the moment the out.sample_summary, out.sample_cumulative_coverage_proportions, out and out.sample_interval_summary files are really useful but I can envisage that the other output is useful for other parts of analysis and I may well use them in the future too.

    It would be really great if you could add the functionality outlined above to DiagnoseTargets, or better still reverse your decision to retire DepthOfCoverage.

    I am also finding the use of vcf style files for non variant data irritating. They require a lot of extra parsing, cannot just be passed on to end users and can potentially take up a lot more space than a tab delimited file.

    Thanks for listening to my thoughts!
    Anna

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi Anna,

    Thanks for the feedback, this is very useful. We will see what we can do to make DiagnoseTargets better respond to your needs.

    Geraldine Van der Auwera, PhD

  • Jeremy37Jeremy37 Posts: 5Member

    Hi Geraldine. I agree with everything that Annat has posted.

    I use DepthOfCoverage for a large number (a few hundreds) of human exomes from individuals with rare diseases. Two of the most useful outputs of DepthOfCoverage are missing in DiagnoseTargets (from what I can tell): - sample_summary: I like to have a concise summary of mean and median coverage for each sample, as well as %of target bases covered at different thresholds (e.g. 5x, 10x, 20x, 30x, etc) averaged over all targeted bases in the sample. - sample_interval_summary: I like to have the %of target bases covered at different thresholds for EACH target (exon) in the sample. This is often useful because we may have a restricted set of genes in mind. We can check whether any specific exons are missing coverage in one or more samples and might want to follow those up.

    The fixed classifier of DiagnoseTargets (e.g. "LOW_COVERAGE") based on a single threshold (e.g. --minimum_coverage in DiagnoseTargets) is much less useful to me. I also agree that the VCF-style output format is awkward and inappropriate.

    Also, I have had no trouble running DepthOfCoverage with reasonable memory limits (4 Gb) even on large (100 Gb) multi-sample BAM files by controlling --read_buffer_size. With DiagnoseTargets I have yet to get it to run on the same BAM file even when providing 30 Gb of memory (our cluster kills the job when it exceeds the requested memory usage). This occurs when I specify a whole chromosome as the interval of interest. It would run when I gave shorter intervals.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Thanks for the useful feedback, Jeremy. We'll take this into account.

    Not sure what's going on with performance but we'll take a look at that too and try to improve it.

    Geraldine Van der Auwera, PhD

  • MamtaMamta Posts: 1Member

    Hi, I ma trying to use depthofcoverage, to get the coverage over an interval. But I get an error:

    ERROR A USER ERROR has occurred (version 1.3-20-g447e9bf):
    ERROR MESSAGE: File associated with name /Users/ngs_user/Desktop/Coverage/makeCoverageScript/allexonsT1.bed is malformed: Interval file could not be parsed in any supported format. caused by BED files must be parsed through Tribble; parsing them as intervals through the GATK engine is no longer supported
    ERROR ------------------------------------------------------------------------------------------

    My interval file has the format: chr1:12011-12058 chr1:12180-12228 chr1:12614-12698 chr1:12976-13053 chr1:13222-13375

    It works fine without an interval file or just just one interval (-L chr1:12011-12058). Is the interval file format incorrect? Thanks in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin
    edited February 2013

    Your intervals look fine, but the version you're using is downright ancient... You need to upgrade to a more recent version; preferably the latest.

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • asease Posts: 2Member

    Hi, In the new version of GATK there is ReduceReads part. Is there a way to call DepthOfCoverage from the output of ReduceReads reduced.bam file?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Yes, you can use DepthOfCoverage with reduced nam files just as you would any other bam. No special treatment necessary.

    Geraldine Van der Auwera, PhD

  • asease Posts: 2Member

    I would like to calculate the depth of coverage ratio of tumor and blood samples. I am using .coverage.sample_interval_summary output of Depth of Coverage. If I run DepthOfCoverage on reduced.bam files of tumor and blood, the ratio won't be correct right? Because the reads that are not informative in variant calling are removed? I was using BaseQualityRelicalibration output for DepthOfCoverage calculation. Should I continue with BaseQualityRecalibration output?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    It depends what settings you used in ReduceReads -- but in principle it is not a problem anyway, if what you want to know is the useful coverage that is informative for calling variants. However, if you are trying to diagnose coverage as a sequencing QC process, then you should do it earlier in your pipeline.

    Geraldine Van der Auwera, PhD

  • igorigor Posts: 31Member

    Why is DepthOfCoverage being retired? DiagnoseTargets seems to provide less information in a format that requires additional processing.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Edited the message -- DoC will not be retired immediately, we will wait until DiagnoseTargets satisfies users' needs.

    Geraldine Van der Auwera, PhD

  • AshuAshu Posts: 21Member
    edited April 2013

    Hi, I wish to find out position specific coverage in a bacterial genome. I am not sure how to set the "-L" option to do so.

    java -jar GenomeAnalysisTK.jar -R reference.fasta -T DepthOfCoverage -o depth_of_coverage.txt -I aligned_reads.bam -L??

    The examples in gatk website show it like "-L chr1:100-200". I tried different combinations but i get an error saying contig does not match. Could you please help me with this option.

    The header for SAM file looks like this if you need it.

    @HD VN:1.3

    @SQ SN:ref000001|sid|576|accn|NC_000962 LN:4411532 M5:e49320ac872e44cb8058de0407772516

    @RG ID:1a7146a3a4 SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s2_p0PL:PacBio_RS DT:2012-10-09T21:01:18

    @RG ID:d2c759b45d SM:c100389932550000001523034410251241 PU:m121009_210118_42152_c100389932550000001523034410251241_s1_p0PL:PacBio_RS DT:2012-10-09T21:01:18

    @CO READS:151621

    Post edited by Ashu on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi Ashu,

    The format is -L contig:start-stop. You need to know what is the name of the contig you want to analyze.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    what's the different between file *.interval_list and *.intervals?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    @blueskypy, those are used interchangeably as extension names for interval list files; there is no inherent difference.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    what's the meaning of the 1st column in the geneList file?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    The UCSC table browser gives a full header for the columns. The first column is called #bin. For more details, you should look it up on their website.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    Took a while to figure out the format of the -geneList file. In case I'm not the only dummy here, the format is described at http://genome.ucsc.edu/FAQ/FAQformat.html#format9 And the refGene file can be downloaded at ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Thanks for sharing your findings (especially direct links!), I'm sure this will be useful to others.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member
    edited July 2013

    I use the following command;

    java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T DepthOfCoverage \
        -R $refGenome -I H06.recal.bam -I H09.recal.bam -I H10.recal.bam \
        -geneList HLA.genePred -rf BadCigar -o HLA

    but some of the output files are empty, why?

    $ ls -al HLA*
    -rw-r--r-- 1 user01 root 71394299565 Jul  3 05:34 HLA
    -rw-r--r-- 1 user01 root           0 Jul  3 05:34 HLA.sample_interval_statistics
    -rw-r--r-- 1 user01 root           0 Jul  3 05:34 HLA.sample_gene_summary
    -rw-r--r-- 1 user01 root       11462 Jul  3 05:34 HLA.sample_cumulative_coverage_proportions
    -rw-r--r-- 1 user01 root         266 Jul  3 05:34 HLA.sample_summary
    -rw-r--r-- 1 user01 root       17963 Jul  3 05:34 HLA.sample_statistics
    -rw-r--r-- 1 user01 root           0 Jul  3 05:34 HLA.sample_interval_summary
    -rw-r--r-- 1 user01 root       16574 Jul  3 05:34 HLA.sample_cumulative_coverage_counts
    -rw-r--r-- 1 user01 root        5367 Jul  1 22:47 HLA.genePred
    
    Post edited by blueskypy on
  • blueskypyblueskypy Posts: 194Member

    And here are a few lines in the HLA.genePred

    834 NM_020056   chr6    +   32709162    32714664    32709220    32714171    5   32709162,32712935,32713567,32714016,32714358,   32709302,32713184,32713849,32714191,32714664,   0   HLA-DQA2    cmpl    cmpl    0,1,1,1,-1,
    834 NM_001198858    chr6    -   32723874    32731330    32724229    32731247    5   32723874,32725057,32726626,32729436,32731150,   32724243,32725081,32726908,32729703,32731330,   0   HLA-DQB2    cmpl    cmpl    1,1,1,1,0,
    823 NM_005514   chr6    -   31321648    31324989    31322259    31324935    8   31321648,31322255,31322409,31322883,31323093,31323943,31324464,31324862,    31322073,31322303,31322442,31323000,31323369,31324219,31324734,31324989,    0   HLA-B   cmpl    cmpl    -1,1,1,1,1,1,1,0,
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    @blueskypy, can you tell me what version you are using?

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    hi, Geraldine, Thanks so much for the help! I'm using GATK-2.5-2

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Actually, it's not version dependent -- it's because you didn't specify intervals. If you specify an interval list (using -L) the files will get populated.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    hi, Geraldine, Thanks for the help! I thought the -L isn't necessary if -geneList is provided since the interval info is in the geneList file. So if I extract the exonStarts and exonEnds from the geneList file to the interval file, will DepthOfCoverage map the intervals to gene names and produce the .sample_gene_summary file?

  • blueskypyblueskypy Posts: 194Member

    btw, since the -geneList file contains intevals for both CDS and exons, does DepthOfCoverage compute the coverage depth of genes based on CDS or exons?

  • blueskypyblueskypy Posts: 194Member

    I use the following command:

    java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T DepthOfCoverage -R $refGenome -l DEBUG \
    -I H06.recal.bam -I H09.recal.bam -I H10.recal.bam -geneList HLA.genePred -L HLA.intervals -rf BadCigar -o HLA

    And got the following warning on every genes in the geneList file HLA.genePred:

    WARN  16:58:32,227 Utils - ********************************************************************************
    WARN  16:58:32,227 Utils - * WARNING:
    WARN  16:58:32,227 Utils - *
    WARN  16:58:32,227 Utils - * RefSeq file is potentially incorrect, as some transcripts or exons
    WARN  16:58:32,227 Utils - * have a negative length (chr6)
    WARN  16:58:32,228 Utils - ********************************************************************************
    

    However, there is no negative length in the entry of genes, here is one example:

    833 NM_002122 chr6 + 32605182 32611429 32605235 32610541 5 32605182,32609086,32609748,32610386,32610728, 32605317,32609335,32610030,32610561,32611429, 0 HLA-DQA1 cmpl cmpl 0,1,1,1,-1,

    Could someone help? Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Not sure what's going on here. Have you tried testing on a subset and checking whether the results look reasonable?

    Geraldine Van der Auwera, PhD

  • smk_84smk_84 Posts: 59Member

    Is the coverage that we get from depth of coverage for the whole genome equal to the one we get from Lander/Waterman equation as given on http://res.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf Or is it average read depth?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    DepthOfCoverage gives you the straight-up average depth.

    Geraldine Van der Auwera, PhD

  • CMBielskiCMBielski BroadPosts: 1Member

    Has anybody successfully used the DepthofCoverage tool to find average coverage per gene for RNA-Seq data?

    I've tried running the following...

    java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T DepthOfCoverage -o test.txt -I input.bam -geneList refGene.sorted.txt -L refGene.sorted.intervals -U ALLOW_N_CIGAR_READS

    ... but the summary file is empty. Any suggestions?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    This may be an issue of how your intervals are set up. Have a look at this question & answer for more details on how it should be.

    If that's not the issue, have you checked if it works for you with regular DNA data? This to rule out the N-containing cigars as a cause for the problem.

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    Hi,

    I'm trying to run depthofcoverage on 40 reduced exome bams (one sample per bam) and the processing time per 1M reads seems to be getting higher and higher such that it seems that it will take weeks for this to finish. I've limited the scope of the analysis using -L argument, but I was wondering has anyone experienced something like this before?

    Thanks,

    MC

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi @chongm,

    Sounds like the program got stuck in a region of excessive coverage, perhaps. Are you using downsampling? And are you running the latest version?

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    Hi Geraldine, I can't run the latest version of GATK unfortunately because we are still stuck on GATK1.6. Downsampling helps but it is still very slow.

    Thanks,

    MC

  • chongmchongm Posts: 33Member

    Hi Everyone,

    I've used GATK DOC to calculate coverage for some exonic intervals of two different genes. I'm noticing that for some exons the sample_mean_coverage value from the .sample_interval_summary file appears to be exactly 127.00 for all samples, which is almost impossible just by chance. Has anyone else experienced anything like this? I'm using GATK version 2.5.2 . Is this a version bug or am I doing something wrong here?

    Commands:

    java -Xmx4g -jar $gatk \
    -T DepthOfCoverage \
    -R $ref2 \
    -I $data/interstroke_AMY1A_LPA.list \
    -L $LPA \
    -o ${data}/interstroke_coverage_LPA

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi @chongm,

    Are you using DoC on reduced bams, by any chance? In version 2.5 ReduceReads applied downsampling to 127...

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    Hi @Geraldine_VdAuwera

    You are absolutely right! Sorry I didn't realize that downsampling was defaulted for reduce reads.

    Thanks,

    Mike Chong

    Thanks,

    MC

  • isp2isp2 Posts: 7Member
    edited December 2013

    Hello,

    In this post you say that two of the outputs of DepthOfCoverage are:

    a) A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)

    b) A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

    Which, from my understanding, correspond to:

    a) _cumulative_coverage_counts

    b) _cumulative_coverage_proportions

    Correct?

    If yes, my questions are:

    1) in _cumulative_coverage_counts the cumulative number of bases covered at 1X, for example, decreases with sample size. Is this because it is increasing the minimum coverage too? For example, for one sample it looks for bases with a minimum of 1X coverage, for 2 samples it looks for bases with a minimum of 2X coverage (one read from each sample), and so on and so forth.

    2) still in _cumulative_coverage_counts, is the sample that covers more bases chosen first and then it goes on adding the second sample that covers more bases and so on?

    3) in _cumulative_coverage_proportions the first column is not the number of samples (NSamples_1, etc.), but samples names. So, my understanding is not that it is showing me the proportion of loci with ≥18x coverage for ≥15 libraries, but that it is showing me the proportion of loci with ≥18x coverage for each sample separately. Is this correct?

    Thank you, ISP.

    Post edited by isp2 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi @isp2,

    No, I think what you're referring to are the tables in the interval_statistics and gene_statistics output files. The cumulative_coverage output files are histogram outputs that the article refers to.

    Geraldine Van der Auwera, PhD

  • ekanterakisekanterakis UKPosts: 1Member

    Hi Geraldine, I was wondering what filters you use to measure genome coverage for WGS Human data in a production environment? Such as base quality, mapping quality, etc. I could not find Best Practices for that on the forum. Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi @ekanterakis‌,

    We currently don't provide detailed recommendations for doing production sequencing QC; at the Broad this is done upstream of us by a specialized group in the genome sequencing platform, which is responsible for production-level sequencing. At the moment we have no plans to develop documentation on this but if that changes, I'll let you know.

    Geraldine Van der Auwera, PhD

  • lmeleelmelee Los Angeles, CAPosts: 4Member

    I have 24 exome samples that I am trying to get coverage information for. These have been run over two flow cells, and I think the depth of coverage that I'm getting from this tool is much lower than I believe it should be (based on number of reads and depth of coverage calculated from the first flow cell alone). Should I be looking at the mean that is output from DepthofCoverage? Is there a way to get information for the percentage of intervals that have >30X coverage or >40X coverage, etc? Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,990Administrator, GATK Developer admin

    Hi @lmelee,

    The DepthOfCoverage tool emits various statistics per locus and per interval, including histograms that you can use for that purpose.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.