The current GATK version is 3.4-46

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# DepthOfCoverage summary reports too many bases

BaltimorePosts: 4Member

I'm a bit confused about some of the output of GATK's DepthOfCoverage script.
In the *.GATK.covdepth.sample_summary, what does the "total" column signify?
Is it the total number of base pairs in the data? Because my raw read data has ~37.2 billion bp. However, GATK reports over 92 billion bp.
Does the 'total' mean something else, or will I have to rerun the DepthofCoverage script?

Tagged:

Hi there,

The "total" column does indeed mean total coverage (number of bases) in the dataset. Can you tell me how you calculated the ~37.2 B number?

Geraldine Van der Auwera, PhD

• BaltimorePosts: 4Member

It was using a custom script that simply counts the bases in a .fasta file. I've checked on several files of known size, and my count was confirmed by someone else who had used the data for a previous project. Also, the data contains ~369.6 million reads, each 101bp long, so 101 * 369.9 million = 37.2 billion bp, so I'm pretty sure GATK is reporting more bases than there are.

• BaltimorePosts: 17Member

Hi,

Like RaphaelK, I have found that GATK's DepthOfCoverage script reported many more total bases than expected from my raw read counts:

36 billion bases - while DepthOfCoverage reports 135 billion
34 billion bases - while DepthOfCoverage reports 140 billion
37 billion bases - while DepthOfCoverage reports 147 billion
39 billion bases - while DepthOfCoverage reports 140 billion
38 billion bases - while DepthOfCoverage reports 154 billion
37 billion bases - while DepthOfCoverage reports 145 billion

I too would like to understand what the 'total bases' column signifies.

Thanks.

James

Hi there,

Sorry for dropping @RaphaelK's original question. Which version are you using? Does this still happen with the latest version (2.7)?

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Hi,

Thanks for the prompt reply! I will give version 2.7 a try - I am pretty sure that I used 2.2-5!

James

Ooh, that's way old (by our crazy standards )

Geraldine Van der Auwera, PhD

• BaltimorePosts: 4Member

I ran the new version of GATK on my data, and I got the exact same results. Also, on a different, smaller, data set, DepthOfCoverage reported a total of 2.65 billion bp instead of 1.86 billion. Seems like the size of the error depends on how much data you have.

• Raphael

Hi @RaphaelK, it sounds like we have a bug; thanks for reporting it. Could you please generate a bam snippet that reproduces the error and upload it to our FTP server so we can debug this locally? Detailed instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Sorry it took so long to get back, but my results just came in. Like Raphael, DepthOfCoverage for GATK2.7-2 reported the same results as the prior version. I will report it as noted above.
James

• BaltimorePosts: 17Member

I have a question regarding creation of a snippet of the .bam file. There was no detectable problematic region. The DepthOfCoverage script ran to completion without incident. Do I include the whole 101GB .bam file in the zipped directory along with the other necessary files?
James

Hi @JamesM,

We would prefer a smaller file. Ideally what we would want is a small region where you have ascertained the number of bases by a different method and get a result that is different from what DepthOfCoverage returns.

Geraldine Van der Auwera, PhD

• BaltimorePosts: 4Member

I was unable to reproduce the error with small snippets of .bam file, but a 1GB file produced the error. I uploaded the necessary stuff on the ftp server, at 2013_10_01_basepair_counting_error.zip.

Thanks for the files, I'll try to look at this later today.

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Hi Geraldine,
I have tried to create a reduced .bam file with no success. Can you point me to some advice on modifying, editing .bam files to produce the necessary file?
James

Hi James,

Can you tell me more about what you've tried and what were the results?

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Hi Geraldine,
I am struggling with putting together a smaller version of my .bam file.
James

$samtools view -H TF_1.coordSorted.readheadersadded.bam > header$ samtools view TF_1.coordSorted.readheadersadded.bam > align

Stitch the headers together with a truncated version of the align file to create crapple.sam.

samtools view -bS crapple.sam | samtools sort - crapple_sorted [bam_header_read] EOF marker is absent. The input is probably truncated. [samopen] SAM header is present: 271591 sequences. @CCDDD>; XA:i:0 MD:Z:101 RG:Z:1 NM:i:0 XS:A:-' is recognized as '*'. Parse error at line 271594: invalid CIGAR character • Posts: 225Member ✭✭✭ You could probably use PrintReads in GATK or DownsampleSam in Picard to create a smaller bam. If you think the issue is size (> 1Gb as opposed to region) I think you would set "dfrac" in PrintReads or "P" in DownsampleSam to 0.01...If your BAM file is 101 Gb then 0.01 should give you about a 1 GB file. • Posts: 8,428Administrator, GATK Dev admin Hi @JamesM, As Kurt suggests, you can use PrintReads to downsample your data (with -dfrac or -dcov depending on whether you want to keep the same coverage distribution or not) and/or extract selected regions (with -L). Geraldine Van der Auwera, PhD • BaltimorePosts: 17Member Thanks Kurt and Geraldine - I'll give it a try. • BaltimorePosts: 17Member Hi, I think I figured out what is going on here. Example: I have a file that consists of 345,229,470 Illumina reads. Multiply that by the read length (101) and you get 34,868,176,470 bases. This agrees with the numbers from the sequencing delivery report. However, the GATK reported 140,772,926,249 bases in the depth of coverage report. So, why the discrepancy? GATK is using the .bam file which includes all aligned reads. Reads may be mapped multiple times and thus, are not necessarily unique. In fact - samtools view -c my_file.bam - reports 1,393,791,349 aligned reads in the .bam file.

Multiply 1,393,791,349 by the read length and you get the same number of bases reported by the GATK, i.e. 140,772,926,249.

So please correct me if I am wrong, but there is still an issue here as the depth of coverage report metrics are being calculated for reads that are mapping multiple times.

James

Hi James,

Unfortunately that explanation cannot hold because like most if not all GATK walkers, DoC applies the NotPrimaryAlignment filter before even looking at the data, so it can't possibly be counting multiple mappings. Have you been able to generate a file snippet at all?

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Hi Geraldine,

I just uploaded 'gatk_DeothOdCoverage.zip'.

James

Thanks James, I'll have a look at this asap.

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Thanks Geraldine! MUCH appreciated.

James

• BergenPosts: 24Member ✭✭

Hi,
I could also observe this in our data (GATK from 9/07/2013). I was wondering what is the status of this issue? I suppose it could have impact on other coverage statistics, e.g. % of positions covered to >= coverage threshold. This is where I saw discrepancy with stats provided by the sequencing center as well.

Ah, I'm afraid this has been gathering dust, sorry. We've just had a lot of other higher-priority items to deal with. Right now is a bad time still, but remind me after March 17th and I promise I will get to the bottom of this.

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Hi Geraldine,
When you get a chance - I'm still very interested in knowing what is going on here.
James

Hi James, I hate to keep you waiting but I'm still swamped. I promise I'll get around to it though. Feel free to ping me until I do.

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Geraldine,
Understood.
James

• BaltimorePosts: 17Member

Hi Geraldine,

'Bump'

James

Noted -- getting closer!

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Hi Geraldine,

I'm still interested in figuring out what is going on here.

James

Getting closer on the issue list...

Geraldine Van der Auwera, PhD

• Posts: 18Member

Hi Geraldine,

Is there any further update on this issue?

Sanjeev

Not yet, but we're ticking issues off the priority list. We'll get to it.

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Hi Geraldine,

Any progress on this?

Thanks.

James

Hi James,

Yes! Coverage analysis is finally on the support map. @Sheila‌ has taken on your bug report in the context of a new project to provide better support and documentation for coverage analysis. She will be in touch with you with any updates.

Geraldine Van der Auwera, PhD

• BaltimorePosts: 17Member

Fantastic! Thank you.

James

• Broad InstitutePosts: 2,192Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

Hi,

I apologize for taking so long, but the good news is that there is no issue in GATK version 3.3. I checked both of your files and the numbers are correct.

I used samtools flagstat to check the number of unfiltered reads. I then used GATK CountReads with the filters applied in DepthOfCoverage to get the number of reads after filtering. Multiplying the number of reads after filtering by 101 gives the number of bases which matches the output of DepthOfCoverage.

I also used Qualimap's Bam QC tool to verify the number of reads in both of your files. You can read mode about Qualimpap's tool here: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html#cmdline-counts

-Sheila

• Broad InstitutePosts: 2,192Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

Hi,

This issue seems to be resolved in GATK version 3.3
Please have a look at my above response to RaphaelK and JamesM.

-Sheila

• BaltimorePosts: 17Member

Hi Sheila,

Fantastic! I will give GATK 3.3 a try.

James