Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

Exon output Depth of Coverage

Martin1Martin1 Posts: 6Member

Hello,

I'm runng DOC on my exome data and the output is hard to connect. This is the command i'm using: java -Xmx2g -jar ~/programs/GenomeAnalysisTK-2.2-2-gf44cc4e/GenomeAnalysisTK.jar -T DepthOfCoverage -R ~/dependencies/reference/hg19.fasta -o doc -I doctest.bam -geneList:REFSEQ ../refSeqGenes.txt -L ~/dependencies/bedfiles/coverage/500genes.exons.refGene.sort.bed -ct 1 -ct 5 -ct 10 -ct 20 -ct 50

The refseqfile looks like this:

75  NM_003036   chr1    +   2160133 2241652 2160205 2238204 7   2160133,2234416,2234723,2235278,2235731,2237458,2238015,    2161174,2234542,2234839,2235541,2236024,2237689,2241652,    0SKI    cmpl    cmpl    0,0,0,2,1,0,0,
637 NM_001195563    chr1    +   6845383 6932107 6845590 6931825 4   6845383,6880240,6885151,6931816,    6845635,6880310,6885270,6932107,    0   CAMTA1  cmpl    cmpl    0,0,1,0,
676 NM_000302   chr1    +   11994723    12035599    11994836    12034865    19  11994723,12008032,12009829,12010413,12012679,12014886,12016973,12017898,12018572,12020702,12023588,12024231,12024700,12025536,12026307,12027043,12030726,12032928,12034709, 11994912,12008124,12009963,12010577,12012792,12014950,12017071,12018000,12018704,12020824,12023693,12024357,12024842,12025650,12026373,12027148,12030873,12033054,12035599, 0   PLOD1   cmpl    cmpl    0,1,0,2,1,0,1,0,0,0,2,2,2,0,0,0,0,0,0,

The exon bed file looks like this:

chr1    2160205 2161174 NM_003036_cds_0_0_chr1_2160206_f    0   +
chr1    2234416 2234542 NM_003036_cds_1_0_chr1_2234417_f    0   +
chr1    2234723 2234839 NM_003036_cds_2_0_chr1_2234724_f    0   +
chr1    2235278 2235541 NM_003036_cds_3_0_chr1_2235279_f    0   +
chr1    2235731 2236024 NM_003036_cds_4_0_chr1_2235732_f    0   +
chr1    2237458 2237689 NM_003036_cds_5_0_chr1_2237459_f    0   +
chr1    2238015 2238204 NM_003036_cds_6_0_chr1_2238016_f    0   +
chr1    6845590 6845635 NM_001195563_cds_0_0_chr1_6845591_f 0   +
chr1    6845590 6845635 NM_015215_cds_0_0_chr1_6845591_f    0   +
chr1    6880240 6880310 NM_001195563_cds_1_0_chr1_6880241_f 0   +
chr1    6880240 6880310 NM_015215_cds_1_0_chr1_6880241_f    0   +
chr1    6885151 6885270 NM_001195563_cds_2_0_chr1_6885152_f 0   +
chr1    6885151 6885270 NM_015215_cds_2_0_chr1_6885152_f    0   +
chr1    6931816 6931825 NM_001195563_cds_3_0_chr1_6931817_f 0   +
chr1    7151363 7151431 NM_015215_cds_3_0_chr1_7151364_f    0   +

The gene_summary file shows coverage on all genes:

SKI 73147   33.45   73147   33.45   21  32  45  100.0   99.8    93.4    75.8    16.6
CAMTA1  8158    33.57   8158    33.57   25  33  47  100.0   100.0   100.0   81.5    17.3
UNKNOWN 5312955 54.97   5312955 54.97   32  49  74  99.7    98.2    96.2    87.9    48.3
PLOD1   85711   39.24   85711   39.24   26  34  48  100.0   99.5    97.3    84.8    22.8
HSPG2   407112  30.90   407112  30.90   19  29  41  99.5    98.0    91.8    72.3    13.6

and the interval_summery file shows all exons:

chr1:2160206-2161174    31939   32.96   31939   32.96   18  32  44  100.0   100.0   93.3    70.4    16.4
chr1:2234417-2234542    7165    56.87   7165    56.87   41  60  71  100.0   100.0   100.0   100.0   62.7
chr1:2234724-2234839    4946    42.64   4946    42.64   42  43  45  100.0   100.0   100.0   100.0   0.0
chr1:2235279-2235541    9542    36.28   9542    36.28   30  38  46  100.0   100.0   100.0   100.0   10.3
chr1:2235732-2236024    7763    26.49   7763    26.49   24  28  32  100.0   100.0   100.0   90.8    0.0
chr1:2237459-2237689    9711    42.04   9711    42.04   34  49  53  100.0   100.0   100.0   88.3    42.9
chr1:2238016-2238204    2081    11.01   2081    11.01   9   13  15  100.0   97.9    57.7    0.0 0.0
chr1:6845591-6845635    655 14.56   655 14.56   14  16  17  100.0   100.0   100.0   0.0 0.0

This information is used to verify that a certain variant was covered good, not only on the position of the variant but also over the rest of the gene it is in. This is a bit hard to see in the interval_summery file because I can't see the exon or genenames in there and I can't see this in the gene_summery file because i mis individual exon information. This means crossrefferencing the bedfile and interval_summery with ugly scripting or creating a spoofed refseq file per exon. Is there an option to format the output where the coverage of each exon is shown with the exon and/or gene name (from the info in the bed or refseq file)?

e.g. something like this: chr1:2160206-2161174 NM_003036_cds_0_0_chr1_2160206_f SKI 31939 32.96 31939 32.96 18 32 44 100.0 100.0 93.3 70.4 16.4

Regards,

Martin

Tagged:

Best Answer

Answers

  • Martin1Martin1 Posts: 6Member

    Thanks for your respons Geraldine!

    I've found a way to do what I wanted by cheating the genelist option. using awk I converted my exon bedfile to refseq format with the following command: awk -F $'\t' 'BEGIN {OFS = FS} {split($4,genename,"_"); print NR,$4,$1,$6,$2,$3,$2,$3,"1",$2,$3,$5,genename[1]"_"genename[2]" cds_"genename[4],"cmpl","cmpl","0"}' exons.bed where the exons.bed file looks like:

    chr1    2160205 2161174 NM_003036_cds_0_0_chr1_2160206_f        0       +
    chr1    2234416 2234542 NM_003036_cds_1_0_chr1_2234417_f        0       +

    This will give a refseq table file which looks like this:

    1       NM_003036_cds_0_0_chr1_2160206_f        chr1    +       2160205 2161174 2160205 2161174 1       2160205 2161174 0       NM_003036 cds_0 cmpl    cmpl    0
    2       NM_003036_cds_1_0_chr1_2234417_f        chr1    +       2234416 2234542 2234416 2234542 1       2234416 2234542 0       NM_003036 cds_1 cmpl    cmpl    0
    3       NM_003036_cds_2_0_chr1_2234724_f        chr1    +       2234723 2234839 2234723 2234839 1       2234723 2234839 0       NM_003036 cds_2 cmpl    cmpl    0
    4       NM_003036_cds_3_0_chr1_2235279_f        chr1    +       2235278 2235541 2235278 2235541 1       2235278 2235541 0       NM_003036 cds_3 cmpl    cmpl    0
    5       NM_003036_cds_4_0_chr1_2235732_f        chr1    +       2235731 2236024 2235731 2236024 1       2235731 2236024 0       NM_003036 cds_4 cmpl    cmpl    0
    6       NM_003036_cds_5_0_chr1_2237459_f        chr1    +       2237458 2237689 2237458 2237689 1       2237458 2237689 0       NM_003036 cds_5 cmpl    cmpl    0
    7       NM_003036_cds_6_0_chr1_2238016_f        chr1    +       2238015 2238204 2238015 2238204 1       2238015 2238204 0       NM_003036 cds_6 cmpl    cmpl    0
    8       NM_001195563_cds_0_0_chr1_6845591_f     chr1    +       6845590 6845635 6845590 6845635 1       6845590 6845635 0       NM_001195563 cds_0      cmpl    cmpl    0

    using the -geneList option of DoC, i get this output in my file (and --outputFormat csv):

    NM_003036 cds_0,31939,32.96,31939,32.96,18,32,44,100.0,100.0,93.3,70.4,16.4
    NM_003036 cds_1,7165,56.87,7165,56.87,41,60,71,100.0,100.0,100.0,100.0,62.7
    NM_003036 cds_2,4946,42.64,4946,42.64,42,43,45,100.0,100.0,100.0,100.0,0.0
    NM_003036 cds_3,9542,36.28,9542,36.28,30,38,46,100.0,100.0,100.0,100.0,10.3
    NM_003036 cds_4,7763,26.49,7763,26.49,24,28,32,100.0,100.0,100.0,90.8,0.0
    NM_003036 cds_5,9711,42.04,9711,42.04,34,49,53,100.0,100.0,100.0,88.3,42.9
    NM_003036 cds_6,2081,11.01,2081,11.01,9,13,15,100.0,97.9,57.7,0.0,0.0
    NM_001195563 cds_0,655,14.56,655,14.56,14,16,17,100.0,100.0,100.0,0.0,0.0
    NM_001195563 cds_1,1854,26.49,1854,26.49,26,27,29,100.0,100.0,100.0,100.0,0.0

    In this result you can see that the value of the name2 field in the refseq file is printed as the first column, so now i can lookup a gene and or exon based on the NM_n name and look at the coverage.

    Martin

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,295Administrator, GATK Developer admin

    That looks good, thanks for sharing your solution, Martin.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.