DepthOfCoverage interval_summary and interval_statistics
Hi GATK team,
I ran into a problem when I tried to analyze the GATK DepthOfCoverage (DoC) results to find the fraction of intervals with coverage >=100X in my targeted sequencing data.
When I checked the XXX.sample_interval_statistics file with awk -F"\t" '{print $102}' XXX.sample_interval_statistics, it gave me the following result:
depth>=100
372
But when I checked the interval_summary file with
tail -n +2 XXX.sample_interval_summary | awk -F"\t" '$3>=100' | wc -l
it gave me
387
which is different from the previous 372. Does this mean that the third column in XXX.sample_interval_summary doesn't represent the usual "X coverage"? Am I misunderstanding something here?
Thank you very much.
bless~
XL
Best Answer

Geraldine_VdAuwera Cambridge, MA admin
@liuxingliang The average coverage for an interval is the sum of coverage values per position, divided by the number of positions in the interval. Any other approaches would have to assume unrealistic things like perfectly even distribution.
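To make the definition above concrete, here is a minimal sketch in Python. It is not how DepthOfCoverage is implemented; the function name, the (start, end) read tuples, and the coordinates are all hypothetical, and it ignores any filtering the real tool applies. It only shows that the total is a sum of per-position depths, so reads that hang over the edge of an interval contribute only their overlapping bases.

```python
def average_coverage(reads, interval_start, interval_end):
    """Return (total_coverage, average_coverage) for a closed interval.

    reads: list of (start, end) closed coordinates of aligned reads.
    This is a hypothetical input format chosen for illustration only.
    """
    length = interval_end - interval_start + 1
    depth = [0] * length  # one depth counter per position in the interval
    for read_start, read_end in reads:
        # Only the part of the read that overlaps the interval counts.
        lo = max(read_start, interval_start)
        hi = min(read_end, interval_end)
        for pos in range(lo, hi + 1):
            depth[pos - interval_start] += 1
    total = sum(depth)  # sum of coverage values per position
    return total, total / length


# Three 150 bp reads around a 200 bp interval (positions 1001..1200).
# Two of the reads only partly overlap the interval.
reads = [(901, 1050), (1001, 1150), (1151, 1300)]
total, mean = average_coverage(reads, 1001, 1200)
print(total, mean)   # 250 1.25
print(total % 150)   # 100, so the total is not a multiple of the read length
```

In this toy case three reads overlap the interval, but the total coverage is 250 rather than 3 * 150 = 450, which is why dividing the total coverage by the read length does not have to give an integer.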
Answers
@liuxingliang
Hi XL,
The third column in XXX.sample_interval_summary is average_coverage. The average coverage is different from coverage >= 100, so the two output numbers you get may not be the same.
Average coverage gives you the average number of bases seen over all loci in each of your samples. The coverage >= 100 gives you the number of samples that have greater than or equal to 100 bases at least at one position.
So, in your case, 372 of your samples have a depth of greater than or equal to 100 in at least one of the positions. But, 387 of your samples have an average coverage of greater than or equal to 100.
I hope this helps.
Sheila
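As a side note, counting intervals by column name rather than by column number can avoid picking the wrong column. The following is only a sketch in Python: the function name is made up, and the header names and rows in the example are assumptions for illustration (only average_coverage as the third column comes from this thread), so check them against the header of your own file.

```python
import csv
import io


def count_intervals_at_least(handle, column, threshold):
    """Count rows of a tab-delimited table whose `column` value is >= threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return sum(1 for row in reader if float(row[column]) >= threshold)


# Made-up stand-in for an interval_summary file; real files have more columns.
example = io.StringIO(
    "Target\ttotal_coverage\taverage_coverage\n"
    "chr1:100-199\t12000\t120.0\n"
    "chr1:300-399\t9950\t99.5\n"
    "chr1:500-599\t10000\t100.0\n"
)
print(count_intervals_at_least(example, "average_coverage", 100))  # 2
```

For a real file you would pass an open file handle, for example open("XXX.sample_interval_summary", newline=""), instead of the io.StringIO object.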
@Sheila
Hi Sheila,
Thank you for your quick response. Based on my understanding of your answer, I have two questions.
No.1
In your explanation:
"Average coverage gives you the average number of bases seen over all loci in each of your samples. The coverage >= 100 gives you the number of samples that have greater than or equal to 100 bases at least at one position."
should "sample" be "interval"? Am I right?
No.2
Let me confirm my understanding of depth at one position versus average coverage. This is really a definition question; I am a newbie and would like an answer from an expert, so sorry for the trouble.
The average coverage of one interval is calculated as: the number of reads covering the interval * read length / the number of bases in that interval. This is what we usually call X coverage. Am I right?
When it comes to XXX.sample_interval_statistics, the depth at one position is just the number of reads covering that position, with no need to consider the read length or interval length. Am I right?
Thank you again.
bless~
XL
@Sheila
Hi Sheila,
I think the average coverage is not what I originally thought; it is not the X coverage. Based on my previous understanding, the average coverage is the total coverage divided by the interval length, and the total coverage should be: the number of reads covering this interval * read length. However, when I divided the total coverage by our read length (150, the same for all reads), the result was not an integer. What's wrong here? How is the total coverage calculated for one interval?
Thank you.
bless~
XL