What does the output of DepthOfCoverage mean?
I have looked for a good discussion on how to calculate the average coverage of exome sequencing data after alignment. DepthOfCoverage seems to be a good tool for this, but I am unable to understand what all of its output means.
My aim is to calculate the average coverage (x), or summary statistics of the depth of coverage, for 7 exome sequencing samples after alignment.
To do that, I followed these steps:
I created a file called input_bam.list that lists the BAM files with their paths.
We have BED files with the regions and chromosomes, in this format:
chr start stop name
I also created a refGene file using the UCSC Table Browser
http://genome.ucsc.edu/cgi-bin/hgTables?command=start with the regions defined by the BED file,
and sorted the file using the following command:
sort -nk3 -nk5 hgTables.txt > genes_refgene_sorted.txt
After executing the following command:
java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed -geneList genes_refgene_sorted.txt -dt NONE
I got this error:
MESSAGE: Input file must have contiguous chromosomes. Saw feature chr22:19510547-19512860 followed later by chr18:19993564-19997878 and then chr22:22113947-22221970, for input source: Desktop/genes_refgene_sorted.txt
Please suggest whether I should sort the file with a different command.
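A note on the error: the three start positions it quotes are in ascending order while the chromosomes alternate, which is consistent with `sort -nk3` treating names such as chr22 and chr18 as the same numeric value and ordering by start only. A minimal sketch of an alternative follows. It assumes the table is tab-delimited with the chromosome in column 3 and the start in column 5 (as the sort keys above imply), and that contigs should follow the order of the reference's `.fai` index. The function names and the sample rows are illustrative, not part of GATK:

```python
def read_contig_order(fai_path):
    """Contig names in reference order: the first column of a samtools .fai index."""
    with open(fai_path) as handle:
        return [line.split("\t")[0] for line in handle if line.strip()]


def sort_by_reference(rows, contig_order, chrom_col=2, start_col=4):
    """Sort already-split rows by reference contig order, then by numeric start."""
    rank = {name: i for i, name in enumerate(contig_order)}
    # Rows on contigs that are absent from the reference are dropped.
    known = [r for r in rows if r[chrom_col] in rank]
    return sorted(known, key=lambda r: (rank[r[chrom_col]], int(r[start_col])))


# Illustrative rows only (bin, name, chrom, strand, txStart); the starts are the
# three quoted in the error message and the gene names are placeholders.
rows = [
    ["0", "geneA", "chr22", "+", "19510547"],
    ["0", "geneB", "chr18", "+", "19993564"],
    ["0", "geneC", "chr22", "+", "22113947"],
]
ordered = sort_by_reference(rows, ["chr18", "chr22"])
print([r[2] for r in ordered])
```

For a real file, the contig order would come from `read_contig_order` applied to the `.fai` file next to the reference FASTA, and any header line beginning with `#` would need to be set aside before sorting.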
If I run the command without the refGene file:
java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed
I get the following output files
I don't understand which output file is best for answering my question about depth.
In the last output file, file_base_name_withbedfile.sample_summary,
the output looks like this:
sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_15
test 1162396121 1775.69 500 500 343 91.7
Total 1162396121 1775.69 N/A N/A N/A
I don't understand what to make of it, or why there are N/A values.
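Going by the column names, the `mean` column looks like the per-sample mean depth over the requested intervals. A small sketch for pulling that value out per sample follows. It assumes whitespace-separated fields as in the rows shown, and it skips the Total row, which carries no quartile values in this output. The function name is hypothetical:

```python
def read_sample_means(lines):
    """Return {sample_id: mean depth} from sample_summary lines, skipping the Total row."""
    header = lines[0].split()
    mean_idx = header.index("mean")
    means = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields or fields[0] == "Total":
            continue
        means[fields[0]] = float(fields[mean_idx])
    return means


# The three lines of the sample_summary file shown above.
summary = [
    "sample_id total mean granular_third_quartile granular_median "
    "granular_first_quartile %_bases_above_15",
    "test 1162396121 1775.69 500 500 343 91.7",
    "Total 1162396121 1775.69 N/A N/A N/A",
]
means = read_sample_means(summary)
print(means)
```

With 7 samples, the same function would return one entry per sample, provided each BAM carries its own sample name.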
In the file file_base_name_withbedfile.sample_interval_summary,
the output looks like the following. I don't understand what to make of it, apart from the total coverage over the 3 BAM files for each location. I take that to mean there are a total of 6638920 reads (or nucleotides) across the 3 BAM files (for example) at that particular location. What do the test_granular_Q values mean? Which column should I use to calculate the average x coverage, so that I can state that after alignment the exomes have x coverage?
Target total_coverage average_coverage test_total_cvg test_mean_cvg test_granular_Q1 test_granular_median test_granular_Q3 test_%_above_15
chr1:1716462-1719040 6638920 2574.22 6638920 2574.22 >500 >500 >500 100.0
chr1:1719110-1720851 4192130 2406.50 4192130 2406.50 >500 >500 >500 91.8
chr1:1721604-1722165 1011309 1799.48 1011309 1799.48 >500 >500 >500 99.3
chr1:1724574-1725729 3912540 3384.55 3912540 3384.55 >500 >500 >500 99.9
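One relationship can be checked from the rows themselves: in all four, average_coverage equals total_coverage divided by the interval length (for example 6638920 / 2579 ≈ 2574.22), which suggests that total_coverage counts aligned bases rather than reads. The sketch below reproduces that check and computes a length-weighted mean over the intervals. It assumes targets are written as `chr:start-end` with both ends inclusive. The helper names are illustrative:

```python
def interval_length(target):
    """Length in bases of a 'chr:start-end' target, treating both ends as inclusive."""
    _, span = target.split(":")
    start, end = span.split("-")
    return int(end) - int(start) + 1


def weighted_mean(rows):
    """Length-weighted mean depth over (target, total_coverage) pairs."""
    total_bases = sum(cov for _, cov in rows)
    total_length = sum(interval_length(target) for target, _ in rows)
    return total_bases / total_length


# Target and total_coverage from the four rows shown above.
rows = [
    ("chr1:1716462-1719040", 6638920),
    ("chr1:1719110-1720851", 4192130),
    ("chr1:1721604-1722165", 1011309),
    ("chr1:1724574-1725729", 3912540),
]
for target, cov in rows:
    # Should match the average_coverage column for that row.
    print(target, round(cov / interval_length(target), 2))
print(round(weighted_mean(rows), 2))
```

Weighting by interval length keeps short targets from counting as much as long ones, which a plain average of the average_coverage column would not do.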
If this is a redundant question, could anyone direct me to the right discussion for understanding the output?
Thanks in advance.