Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Exon output Depth of Coverage

Martin1Martin1 Posts: 6Member

Hello,

I'm runng DOC on my exome data and the output is hard to connect. This is the command i'm using:
java -Xmx2g -jar ~/programs/GenomeAnalysisTK-2.2-2-gf44cc4e/GenomeAnalysisTK.jar -T DepthOfCoverage -R ~/dependencies/reference/hg19.fasta -o doc -I doctest.bam -geneList:REFSEQ ../refSeqGenes.txt -L ~/dependencies/bedfiles/coverage/500genes.exons.refGene.sort.bed -ct 1 -ct 5 -ct 10 -ct 20 -ct 50

The refseqfile looks like this:
75 NM_003036 chr1 + 2160133 2241652 2160205 2238204 7 2160133,2234416,2234723,2235278,2235731,2237458,2238015, 2161174,2234542,2234839,2235541,2236024,2237689,2241652, 0SKI cmpl cmpl 0,0,0,2,1,0,0,
637 NM_001195563 chr1 + 6845383 6932107 6845590 6931825 4 6845383,6880240,6885151,6931816, 6845635,6880310,6885270,6932107, 0 CAMTA1 cmpl cmpl 0,0,1,0,
676 NM_000302 chr1 + 11994723 12035599 11994836 12034865 19 11994723,12008032,12009829,12010413,12012679,12014886,12016973,12017898,12018572,12020702,12023588,12024231,12024700,12025536,12026307,12027043,12030726,12032928,12034709, 11994912,12008124,12009963,12010577,12012792,12014950,12017071,12018000,12018704,12020824,12023693,12024357,12024842,12025650,12026373,12027148,12030873,12033054,12035599, 0 PLOD1 cmpl cmpl 0,1,0,2,1,0,1,0,0,0,2,2,2,0,0,0,0,0,0,

The exon bed file looks like this:
chr1 2160205 2161174 NM_003036_cds_0_0_chr1_2160206_f 0 +
chr1 2234416 2234542 NM_003036_cds_1_0_chr1_2234417_f 0 +
chr1 2234723 2234839 NM_003036_cds_2_0_chr1_2234724_f 0 +
chr1 2235278 2235541 NM_003036_cds_3_0_chr1_2235279_f 0 +
chr1 2235731 2236024 NM_003036_cds_4_0_chr1_2235732_f 0 +
chr1 2237458 2237689 NM_003036_cds_5_0_chr1_2237459_f 0 +
chr1 2238015 2238204 NM_003036_cds_6_0_chr1_2238016_f 0 +
chr1 6845590 6845635 NM_001195563_cds_0_0_chr1_6845591_f 0 +
chr1 6845590 6845635 NM_015215_cds_0_0_chr1_6845591_f 0 +
chr1 6880240 6880310 NM_001195563_cds_1_0_chr1_6880241_f 0 +
chr1 6880240 6880310 NM_015215_cds_1_0_chr1_6880241_f 0 +
chr1 6885151 6885270 NM_001195563_cds_2_0_chr1_6885152_f 0 +
chr1 6885151 6885270 NM_015215_cds_2_0_chr1_6885152_f 0 +
chr1 6931816 6931825 NM_001195563_cds_3_0_chr1_6931817_f 0 +
chr1 7151363 7151431 NM_015215_cds_3_0_chr1_7151364_f 0 +

The gene_summary file shows coverage on all genes:
SKI 73147 33.45 73147 33.45 21 32 45 100.0 99.8 93.4 75.8 16.6
CAMTA1 8158 33.57 8158 33.57 25 33 47 100.0 100.0 100.0 81.5 17.3
UNKNOWN 5312955 54.97 5312955 54.97 32 49 74 99.7 98.2 96.2 87.9 48.3
PLOD1 85711 39.24 85711 39.24 26 34 48 100.0 99.5 97.3 84.8 22.8
HSPG2 407112 30.90 407112 30.90 19 29 41 99.5 98.0 91.8 72.3 13.6

and the interval_summery file shows all exons:
chr1:2160206-2161174 31939 32.96 31939 32.96 18 32 44 100.0 100.0 93.3 70.4 16.4
chr1:2234417-2234542 7165 56.87 7165 56.87 41 60 71 100.0 100.0 100.0 100.0 62.7
chr1:2234724-2234839 4946 42.64 4946 42.64 42 43 45 100.0 100.0 100.0 100.0 0.0
chr1:2235279-2235541 9542 36.28 9542 36.28 30 38 46 100.0 100.0 100.0 100.0 10.3
chr1:2235732-2236024 7763 26.49 7763 26.49 24 28 32 100.0 100.0 100.0 90.8 0.0
chr1:2237459-2237689 9711 42.04 9711 42.04 34 49 53 100.0 100.0 100.0 88.3 42.9
chr1:2238016-2238204 2081 11.01 2081 11.01 9 13 15 100.0 97.9 57.7 0.0 0.0
chr1:6845591-6845635 655 14.56 655 14.56 14 16 17 100.0 100.0 100.0 0.0 0.0

This information is used to verify that a certain variant was covered good, not only on the position of the variant but also over the rest of the gene it is in. This is a bit hard to see in the interval_summery file because I can't see the exon or genenames in there and I can't see this in the gene_summery file because i mis individual exon information. This means crossrefferencing the bedfile and interval_summery with ugly scripting or creating a spoofed refseq file per exon. Is there an option to format the output where the coverage of each exon is shown with the exon and/or gene name (from the info in the bed or refseq file)?

e.g. something like this:
chr1:2160206-2161174 NM_003036_cds_0_0_chr1_2160206_f SKI 31939 32.96 31939 32.96 18 32 44 100.0 100.0 93.3 70.4 16.4

Regards,

Martin

Tagged:

Best Answer

Answers

  • Martin1Martin1 Posts: 6Member

    Thanks for your respons Geraldine!

    I've found a way to do what I wanted by cheating the genelist option. using awk I converted my exon bedfile to refseq format with the following command:
    awk -F $'\t' 'BEGIN {OFS = FS} {split($4,genename,"_"); print NR,$4,$1,$6,$2,$3,$2,$3,"1",$2,$3,$5,genename[1]"_"genename[2]" cds_"genename[4],"cmpl","cmpl","0"}' exons.bed
    where the exons.bed file looks like:
    chr1 2160205 2161174 NM_003036_cds_0_0_chr1_2160206_f 0 +
    chr1 2234416 2234542 NM_003036_cds_1_0_chr1_2234417_f 0 +

    This will give a refseq table file which looks like this:
    1 NM_003036_cds_0_0_chr1_2160206_f chr1 + 2160205 2161174 2160205 2161174 1 2160205 2161174 0 NM_003036 cds_0 cmpl cmpl 0
    2 NM_003036_cds_1_0_chr1_2234417_f chr1 + 2234416 2234542 2234416 2234542 1 2234416 2234542 0 NM_003036 cds_1 cmpl cmpl 0
    3 NM_003036_cds_2_0_chr1_2234724_f chr1 + 2234723 2234839 2234723 2234839 1 2234723 2234839 0 NM_003036 cds_2 cmpl cmpl 0
    4 NM_003036_cds_3_0_chr1_2235279_f chr1 + 2235278 2235541 2235278 2235541 1 2235278 2235541 0 NM_003036 cds_3 cmpl cmpl 0
    5 NM_003036_cds_4_0_chr1_2235732_f chr1 + 2235731 2236024 2235731 2236024 1 2235731 2236024 0 NM_003036 cds_4 cmpl cmpl 0
    6 NM_003036_cds_5_0_chr1_2237459_f chr1 + 2237458 2237689 2237458 2237689 1 2237458 2237689 0 NM_003036 cds_5 cmpl cmpl 0
    7 NM_003036_cds_6_0_chr1_2238016_f chr1 + 2238015 2238204 2238015 2238204 1 2238015 2238204 0 NM_003036 cds_6 cmpl cmpl 0
    8 NM_001195563_cds_0_0_chr1_6845591_f chr1 + 6845590 6845635 6845590 6845635 1 6845590 6845635 0 NM_001195563 cds_0 cmpl cmpl 0

    using the -geneList option of DoC, i get this output in my file (and --outputFormat csv):
    NM_003036 cds_0,31939,32.96,31939,32.96,18,32,44,100.0,100.0,93.3,70.4,16.4
    NM_003036 cds_1,7165,56.87,7165,56.87,41,60,71,100.0,100.0,100.0,100.0,62.7
    NM_003036 cds_2,4946,42.64,4946,42.64,42,43,45,100.0,100.0,100.0,100.0,0.0
    NM_003036 cds_3,9542,36.28,9542,36.28,30,38,46,100.0,100.0,100.0,100.0,10.3
    NM_003036 cds_4,7763,26.49,7763,26.49,24,28,32,100.0,100.0,100.0,90.8,0.0
    NM_003036 cds_5,9711,42.04,9711,42.04,34,49,53,100.0,100.0,100.0,88.3,42.9
    NM_003036 cds_6,2081,11.01,2081,11.01,9,13,15,100.0,97.9,57.7,0.0,0.0
    NM_001195563 cds_0,655,14.56,655,14.56,14,16,17,100.0,100.0,100.0,0.0,0.0
    NM_001195563 cds_1,1854,26.49,1854,26.49,26,27,29,100.0,100.0,100.0,100.0,0.0

    In this result you can see that the value of the name2 field in the refseq file is printed as the first column, so now i can lookup a gene and or exon based on the NM_n name and look at the coverage.

    Martin

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,819Administrator, GATK Developer admin

    That looks good, thanks for sharing your solution, Martin.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.