DepthOfCoverage

moran · The Broad · Posts: 17 · Member

Hello,

I'm trying to calculate depth of coverage for entire contigs for multiple samples. I ran the following command:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
        -R Ecoli/Ecoli.allSubTypes.fasta \
        -T DepthOfCoverage \
        -o ../Ecoli/all.tmpOut \
        -I Ecoli/bamlist.list \
        -geneList Ecoli/Ecoli.refSeq

For the gene list, I tried to generate a refSeq file with one line per contig.

I expected the output to be a matrix with the contigs as rows and the samples as columns.

Instead I got a file that looks like this:

    Locus                            Total_Depth  Average_Depth_sample  Depth_for_sample1
    gi|312944605|gb|CP001855.1|:1    0            0.00                  0
    gi|312944605|gb|CP001855.1|:2    0            0.00                  0
    gi|312944605|gb|CP001855.1|:3    0            0.00                  0
    gi|312944605|gb|CP001855.1|:4    0            0.00                  0
    gi|312944605|gb|CP001855.1|:5    0            0.00                  0
    gi|312944605|gb|CP001855.1|:6    0            0.00                  0
    gi|312944605|gb|CP001855.1|:7    0            0.00                  0
    gi|312944605|gb|CP001855.1|:8    0            0.00                  0
    gi|312944605|gb|CP001855.1|:9    0            0.00                  0

where each base is a row, right?

What am I doing wrong?

Thanks! Moran.

Answers

  • moran · The Broad · Posts: 17 · Member

    Thanks for the very fast answer! Do you also have an example of a BAM list file? From the outputs, I think it's treating all my BAMs as a single file...

    Thanks again!

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    The GATK always processes the content of all bams in a bam list together as if the data came from a single file. I do believe DoC reports results partitioned by sample by default, but they will all be in a single file per output type. They should be identified by sample in the summary table. If that's not the case, can you please post a few lines from the table so I can see what you're getting?

    Geraldine Van der Auwera, PhD
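
For reference, a BAM list passed to -I is a plain text file with a .list extension and one BAM path per line. Per-sample partitioning relies on the sample name (SM) in each BAM's @RG header lines, so it can help to check which sample names the headers actually declare. The following is a minimal sketch, assuming the header text has already been obtained (for example with `samtools view -H`); the function name `samples_from_header` is hypothetical and not part of GATK.

```python
def samples_from_header(header_text):
    """Return the sorted, de-duplicated sample names (SM) found in @RG lines.

    `header_text` is assumed to be the SAM header as plain text,
    e.g. the output of `samtools view -H sample.bam`.
    """
    samples = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:value pairs after the record type.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                samples.add(field[3:])
    return sorted(samples)


if __name__ == "__main__":
    example = "@HD\tVN:1.4\n@RG\tID:lane1\tSM:sampleA\n@RG\tID:lane2\tSM:sampleB\n"
    print(samples_from_header(example))
```

If several BAMs report the same SM value, or none at all, their data would be expected to end up in the same per-sample column.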

  • moran · The Broad · Posts: 17 · Member

    I've fixed the sample tag in the header, and it works great now.

    But now I have a question about the content. In the mean coverage statistics, is the value normalized by the total number of mapped reads for each sample?

    Also, can I define additional statistics to be calculated per interval per sample? For example, the percent of the interval covered.

    Thanks for all the prompt replies!

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Ah, good to hear.

    As I recall, DoC doesn't do any normalization; if it did, you'd get different values depending on whether you ran samples alone or together, which would be bad in my opinion.

    All the statistics that can currently be calculated are listed in the technical doc for the tool. If you are interested in statistics that are not available, you can always modify the tool yourself; we are always happy to look at a patch to include user contributions in the codebase.

    That said, you may want to check out DiagnoseTargets first, which provides a lot of statistics about intervals that DoC doesn't. Maybe it will have what you want.

    Geraldine Van der Auwera, PhD
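
Since the reported depths are not normalized, both quantities asked about above can be derived after the fact. The following is a minimal sketch, not a DoC feature: the function names are hypothetical, the mapped-read count is assumed to come from elsewhere (for example `samtools flagstat`), and scaling to "per million mapped reads" is just one possible convention.

```python
def normalized_mean_depth(mean_depth, mapped_reads, scale=1_000_000):
    """Scale a mean depth to depth per `scale` mapped reads."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return mean_depth * scale / mapped_reads


def percent_covered(depths, min_depth=1):
    """Percent of positions whose depth is at least `min_depth`.

    `depths` is a list of per-base depths for one interval and one sample.
    """
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


if __name__ == "__main__":
    print(normalized_mean_depth(30.0, 2_000_000))
    print(percent_covered([0, 0, 3, 5]))
```

Normalizing outside the tool keeps the raw depths identical whether samples are run alone or together.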

  • moran · The Broad · Posts: 17 · Member

    Great, will check it now.

    One more question: Is there a way to define an interval that contains multiple contigs? I'm working with bacteria, and I have many contigs per genome, and I would like to summarize this per genome. (I've aligned my reads to a reference sequence that contains multiple genomes concatenated).

    Thanks!

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Not directly, no. You'll need to calculate that from the per-contig summary table. I would recommend writing a script to process the table.

    Geraldine Van der Auwera, PhD
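
One way such a script could look is sketched below. It is an illustration under assumptions, not an official recipe: it works from the per-locus output format shown in the original question (a `Locus` column of `contig:position` followed by `Depth_for_<sample>` columns) rather than from the per-contig summary table, and the `contig_to_genome` mapping and the function name `per_genome_mean_depth` are hypothetical. The user would have to supply the mapping.

```python
from collections import defaultdict


def per_genome_mean_depth(lines, contig_to_genome):
    """Average per-base depth by genome and sample.

    `lines` is an iterable of text lines from the per-locus output,
    header first. `contig_to_genome` maps contig names to genome names
    (user-supplied); unmapped contigs are reported under their own name.
    Returns {genome: {sample: mean_depth}}.
    """
    rows = iter(lines)
    header = next(rows).split()
    prefix = "Depth_for_"
    # Column index -> sample name, for every per-sample depth column.
    sample_cols = {
        i: name[len(prefix):]
        for i, name in enumerate(header)
        if name.startswith(prefix)
    }
    totals = defaultdict(lambda: defaultdict(int))
    n_bases = defaultdict(int)
    for line in rows:
        fields = line.split()
        if not fields:
            continue
        # Split on the last ':' so contig names may contain other characters.
        contig, _, _position = fields[0].rpartition(":")
        genome = contig_to_genome.get(contig, contig)
        n_bases[genome] += 1
        for i, sample in sample_cols.items():
            totals[genome][sample] += int(fields[i])
    return {
        genome: {s: totals[genome][s] / n_bases[genome] for s in sample_cols.values()}
        for genome in n_bases
    }
```

The result has genomes as rows and samples as columns, which is the matrix shape asked for at the top of the thread. Because it sums over bases, each contig is effectively weighted by its length.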
