The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

# DepthOfCoverage summary reports too many bases

BaltimoreMember

I'm a bit confused about some of the output of GATK's DepthOfCoverage script.
In the *.GATK.covdepth.sample_summary, what does the "total" column signify?
Is it the total number of base pairs in the data? Because my raw read data has ~37.2 billion bp. However, GATK reports over 92 billion bp.
Does the 'total' mean something else, or will I have to rerun the DepthofCoverage script?

Tagged:

Hi there,

The "total" column does indeed mean total coverage (number of bases) in the dataset. Can you tell me how you calculated the ~37.2 B number?

• BaltimoreMember

It was using a custom script that simply counts the bases in a .fasta file. I've checked on several files of known size, and my count was confirmed by someone else who had used the data for a previous project. Also, the data contains ~369.6 million reads, each 101bp long, so 101 * 369.9 million = 37.2 billion bp, so I'm pretty sure GATK is reporting more bases than there are.

• BaltimoreMember

Hi,

Like RaphaelK, I have found that GATK's DepthOfCoverage script reported many more total bases than expected from my raw read counts:

36 billion bases - while DepthOfCoverage reports 135 billion
34 billion bases - while DepthOfCoverage reports 140 billion
37 billion bases - while DepthOfCoverage reports 147 billion
39 billion bases - while DepthOfCoverage reports 140 billion
38 billion bases - while DepthOfCoverage reports 154 billion
37 billion bases - while DepthOfCoverage reports 145 billion

I too would like to understand what the 'total bases' column signifies.

Thanks.

James

Hi there,

Sorry for dropping @RaphaelK's original question. Which version are you using? Does this still happen with the latest version (2.7)?

• BaltimoreMember

Hi,

Thanks for the prompt reply! I will give version 2.7 a try - I am pretty sure that I used 2.2-5!

James

Ooh, that's way old (by our crazy standards )

• BaltimoreMember

I ran the new version of GATK on my data, and I got the exact same results. Also, on a different, smaller, data set, DepthOfCoverage reported a total of 2.65 billion bp instead of 1.86 billion. Seems like the size of the error depends on how much data you have.

• Raphael

Hi @RaphaelK, it sounds like we have a bug; thanks for reporting it. Could you please generate a bam snippet that reproduces the error and upload it to our FTP server so we can debug this locally? Detailed instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

• BaltimoreMember

Sorry it took so long to get back, but my results just came in. Like Raphael, DepthOfCoverage for GATK2.7-2 reported the same results as the prior version. I will report it as noted above.
James

• BaltimoreMember

I have a question regarding creation of a snippet of the .bam file. There was no detectable problematic region. The DepthOfCoverage script ran to completion without incident. Do I include the whole 101GB .bam file in the zipped directory along with the other necessary files?
James

Hi @JamesM,

We would prefer a smaller file. Ideally what we would want is a small region where you have ascertained the number of bases by a different method and get a result that is different from what DepthOfCoverage returns.

• BaltimoreMember

I was unable to reproduce the error with small snippets of .bam file, but a 1GB file produced the error. I uploaded the necessary stuff on the ftp server, at 2013_10_01_basepair_counting_error.zip.

Thanks for the files, I'll try to look at this later today.

• BaltimoreMember

Hi Geraldine,
I have tried to create a reduced .bam file with no success. Can you point me to some advice on modifying, editing .bam files to produce the necessary file?
James

Hi James,

Can you tell me more about what you've tried and what were the results?

• BaltimoreMember

Hi Geraldine,
I am struggling with putting together a smaller version of my .bam file.
James

$samtools view -H TF_1.coordSorted.readheadersadded.bam > header$ samtools view TF_1.coordSorted.readheadersadded.bam > align

Stitch the headers together with a truncated version of the align file to create crapple.sam.

samtools view -bS crapple.sam | samtools sort - crapple_sorted [bam_header_read] EOF marker is absent. The input is probably truncated. [samopen] SAM header is present: 271591 sequences. @CCDDD>; XA:i:0 MD:Z:101 RG:Z:1 NM:i:0 XS:A:-' is recognized as '*'. Parse error at line 271594: invalid CIGAR character • Member You could probably use PrintReads in GATK or DownsampleSam in Picard to create a smaller bam. If you think the issue is size (> 1Gb as opposed to region) I think you would set "dfrac" in PrintReads or "P" in DownsampleSam to 0.01...If your BAM file is 101 Gb then 0.01 should give you about a 1 GB file. • Cambridge, MAMember, Administrator, Broadie Hi @JamesM, As Kurt suggests, you can use PrintReads to downsample your data (with -dfrac or -dcov depending on whether you want to keep the same coverage distribution or not) and/or extract selected regions (with -L`). • BaltimoreMember Thanks Kurt and Geraldine - I'll give it a try. • BaltimoreMember Hi, I think I figured out what is going on here. Example: I have a file that consists of 345,229,470 Illumina reads. Multiply that by the read length (101) and you get 34,868,176,470 bases. This agrees with the numbers from the sequencing delivery report. However, the GATK reported 140,772,926,249 bases in the depth of coverage report. So, why the discrepancy? GATK is using the .bam file which includes all aligned reads. Reads may be mapped multiple times and thus, are not necessarily unique. In fact - samtools view -c my_file.bam - reports 1,393,791,349 aligned reads in the .bam file.

Multiply 1,393,791,349 by the read length and you get the same number of bases reported by the GATK, i.e. 140,772,926,249.

So please correct me if I am wrong, but there is still an issue here as the depth of coverage report metrics are being calculated for reads that are mapping multiple times.

James

Hi James,

Unfortunately that explanation cannot hold because like most if not all GATK walkers, DoC applies the NotPrimaryAlignment filter before even looking at the data, so it can't possibly be counting multiple mappings. Have you been able to generate a file snippet at all?

• BaltimoreMember

Hi Geraldine,

James

Thanks James, I'll have a look at this asap.

• BaltimoreMember

Thanks Geraldine! MUCH appreciated.

James

• BergenMember

Hi,
I could also observe this in our data (GATK from 9/07/2013). I was wondering what is the status of this issue? I suppose it could have impact on other coverage statistics, e.g. % of positions covered to >= coverage threshold. This is where I saw discrepancy with stats provided by the sequencing center as well.

Ah, I'm afraid this has been gathering dust, sorry. We've just had a lot of other higher-priority items to deal with. Right now is a bad time still, but remind me after March 17th and I promise I will get to the bottom of this.

• BaltimoreMember

Hi Geraldine,
When you get a chance - I'm still very interested in knowing what is going on here.
James

Hi James, I hate to keep you waiting but I'm still swamped. I promise I'll get around to it though. Feel free to ping me until I do.

• BaltimoreMember

Geraldine,
Understood.
James

• BaltimoreMember

Hi Geraldine,

'Bump'

James

Noted -- getting closer!

• BaltimoreMember

Hi Geraldine,

I'm still interested in figuring out what is going on here.

James

Getting closer on the issue list...

• Member

Hi Geraldine,

Is there any further update on this issue?

Sanjeev

Not yet, but we're ticking issues off the priority list. We'll get to it.

• BaltimoreMember

Hi Geraldine,

Any progress on this?

Thanks.

James

Hi James,

Yes! Coverage analysis is finally on the support map. @Sheila‌ has taken on your bug report in the context of a new project to provide better support and documentation for coverage analysis. She will be in touch with you with any updates.

• BaltimoreMember

Fantastic! Thank you.

James

Hi,

I apologize for taking so long, but the good news is that there is no issue in GATK version 3.3. I checked both of your files and the numbers are correct.

I used samtools flagstat to check the number of unfiltered reads. I then used GATK CountReads with the filters applied in DepthOfCoverage to get the number of reads after filtering. Multiplying the number of reads after filtering by 101 gives the number of bases which matches the output of DepthOfCoverage.

I also used Qualimap's Bam QC tool to verify the number of reads in both of your files. You can read mode about Qualimpap's tool here: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html#cmdline-counts

-Sheila

Hi,

This issue seems to be resolved in GATK version 3.3
Please have a look at my above response to RaphaelK and JamesM.

-Sheila

• BaltimoreMember

Hi Sheila,

Fantastic! I will give GATK 3.3 a try.

James