How to generate RefSeq ROD

Dear team,

I would like to include a RefSeq ROD file in order to get the coverage per gene using the GATK coverage tool (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html). However, it is not clear to me how to generate such a file, since I cannot find the right documentation. All the links that should point to this information seem to be (incorrectly?) redirected to the main GATK homepage. For example, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_refseq_RefSeqCodec.html has a link to http://www.broadinstitute.org/gsa/wiki/index.php/RefSeq, which is redirected to http://www.broadinstitute.org/gatk/.
Can someone point me to the correct docs? Other resources I found also point to the same wiki, which I can't find at the moment.

Kind regards,
JJ

Answers

  • mxqian Member

    Hi,
    When using DepthOfCoverage, I got the very strange error message below. The program runs fine up to
    INFO 17:02:30,037 ProgressMeter - chrY:59358254 2.28041336E8 2.2 h 34.0 s 100.0% 2.2 h 0.0 s
    INFO 17:02:30,773 GATKRunReport - Uploaded run statistics report to AWS S3.

    The "per locus coverage" result is generated, but the other summaries are not. I counted the exons and the exon frameshifts, and both numbers are exactly 131. This transcript surely has too many exons.

    ##### ERROR MESSAGE: Unknown file is malformed: Data format error: numbers of exons and exon frameshifts differ for line=26 NM_001278267 chr1 + 144146810 146467744 144158383 146466121 131 144146810,144148789,144149726,144150981,144151518,144153012,144156971,144158177,144158378,144158870,144164518,144179473,144180355,144181063,144181950,144182695,144183587,144184251,144185133,144185827,144186714,144187459,144188351,144189011,144189893,144190581,144191468,144192205,144193097,144193759,144194641,144195343,144196230,144196975,144197867,144201704,144202586,144203298,144204185,144204924,144205816,144216027,144216909,144217611,144218498,144219235,144220127,144220786,144221668,144222378,144223265,144224002,144824704,145313333,145314215,145314903,145315790,145316511,145317405,145318053,145318935,145319623,145320510,145330682,145331574,145332222,145333104,145333796,145334683,145335388,145336280,145336928,145337810,145338516,145339403,145340108,145341000,145341648,145342530,145343240,145344127,145344832,145345724,145346372,145347254,145347964,145348851,145349562,145350454,145351102,145351984,145352684,145353571,145354294,145355186,145355836,145362978,145363678,145364565,146420018,146420910,146421558,146422440,146423162,146424049,146424783,146425675,146426323,146427205,146427927,146428814,146435853,146436735,146437457,146438344,146443692,146444574,146445298,146446923,146447725,146448373,146454019,146454737,146455624,146456358,146462010,146462657,146463539,146464263,146465150,146465877, 
144147021,144148892,144149941,144151054,144151724,144153064,144157135,144158252,144158391,144159043,144164570,144179646,144180407,144181236,144182059,144182868,144183639,144184424,144185185,144186000,144186823,144187632,144188403,144189184,144189945,144190754,144191577,144192378,144193149,144193932,144194693,144195516,144196339,144197148,144197919,144201877,144202638,144203471,144204294,144205097,144205868,144216200,144216961,144217784,144218607,144219408,144220179,144220959,144221720,144222551,144223374,144224175,144824756,145313506,145314267,145315076,145315899,145316684,145317457,145318226,145318987,145319796,145320619,145330855,145331626,145332395,145333156,145333969,145334792,145335561,145336332,145337101,145337862,145338689,145339512,145340281,145341052,145341821,145342582,145343413,145344236,145345005,145345776,145346545,145347306,145348137,145348960,145349735,145350506,145351275,145352036,145352857,145353680,145354467,145355238,145356009,145363030,145363851,145364674,146420191,146420962,146421731,146422492,146423335,146424158,146424956,146425727,146426496,146427257,146428100,146428923,146436026,146436787,146437630,146438453,146443865,146444626,146445388,146447006,146447777,146448546,146454071,146454910,146455733,146456531,146462062,146462830,146463591,146464436,146465259,146467744, 0 NBPF20 cmpl cmpl -1,-1,-1,-1,-1,-1,-1,-1,0,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,2,1,2,1,2,1

  • Geraldine_VdAuwera Cambridge, MA; Member, Administrator, Broadie admin

    @mxqian It sounds like your input file has a problem. Unfortunately we cannot help you with that. You will need to fix the difference or skip this locus.

  • wxwang Member

    Hello, I am having a similar problem to @mxqian's and hope to follow up.

    Below is the error message I got with the input I used for -geneList and -L (both attached as test.list and GRCh38_mut.genelist_gene.list, respectively):

    ERROR MESSAGE: Unknown file is malformed: Data format error: numbers of exons and exon frameshifts differ for line=500 uc001ssx.4 12 + 65824130 65966295 65825270 65963292 5 65824130,65828000,65838518,65951382,65963244, 65825381,65828087,65838569,65951415,65966295, P52926 ENST00000403681.6 cmpl cmpl 0,

    I am aware that I am supposed to fix something in my input file. But after confirming that the exon number (column 9: 5) and the exon intervals (columns 10 and 11) in my input file match, and comparing the format of my input files with your online example, the only thing that seems problematic to me is the last column. Could you please advise on the definition of this column? Also, for reference, the definitions of the 1st, the 2nd and the 3rd-to-last columns?

    Also, I am using Ensembl references and generated my input for -geneList with the UCSC browser accordingly. Do I need to specify it as you did in an example with RefSeq input: -[arg]:REFSEQ /path/to/refSeq?

    Thanks and I apologize if I missed something in your online notes.

  • Sheila Broad Institute; Member, Broadie, Moderator admin

    @wxwang
    Hi,

    Can you please post the exact command and version you are using?

    Thanks,
    Sheila

  • Hi @Sheila,

    Please see below for the command I used, where
    test.list=test.txt
    GRCh38_mut.genelist_gene.list=GRCh38_mut.genelist_gene_txt

    java -jar gatk_3.8.1/GenomeAnalysisTK.jar \
    -T DepthOfCoverage \
    -R Homo_sapiens.GRCh38.dna.primary_assembly_wERCC.fa \
    -I dedup.RG.sortedByCoord.bam \
    -o DCG_RG \
    -geneList test.list \
    -L GRCh38_mut.genelist_gene.list \
    -U ALLOW_N_CIGAR_READS

    Here is the version I got:
    java -jar GenomeAnalysisTK.jar -T DepthOfCoverage
    INFO 10:16:30,907 HelpFormatter - ------------------------------------------------------------------------------------
    INFO 10:16:30,990 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50

    Thanks!

  • Sheila Broad Institute; Member, Broadie, Moderator admin

    @wxwang
    Hi,

    I will ask my teammate to have a look. It may take some time, as she is on vacation now. In the meantime, perhaps you can try asking in other forums, as we do not really provide support for this.

    -Sheila

  • shlee Cambridge; Member, Broadie ✭✭✭✭✭

    Hi @wxwang,

    I am back today from vacation and will get to your question shortly. Thanks for your patience.

  • shlee Cambridge; Member, Broadie ✭✭✭✭✭
    edited August 2018

    Hi @wxwang,

    As you say

    I am aware that I am supposed to fix something in my input file....the only thing that seems problematic to me is the last column. Could you please advise on the definition of this column? Also, for reference, the definitions of the 1st, the 2nd and the 3rd-to-last columns?

    It is indeed the REFSEQ file that is causing the error. For reference, the error is caused by lines 152-153 at https://github.com/broadgsa/gatk/blob/3.8-1/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/codecs/refseq/RefSeqCodec.java#L152-L153.

    I've been learning Java (in my free time). Reading the code, it expects the number of items in columns 9 and 15 (counting columns from 0) to match exactly. Your column 15 contains only a single value (0) on each line of test.txt, as shown in the ERROR message you posted above.

    [0-7]
    500 uc001ssx.4 12 + 65824130 65966295 65825270 65963292 
    
    [8]
    5 
    [9]
    65824130,65828000,65838518,65951382,65963244, 
    [10]
    65825381,65828087,65838569,65951415,65966295, 
    
    [11-14]
    P52926 ENST00000403681.6 cmpl cmpl 
    
    [15]
    0,
    

    The code requires that columns 9, 10 and 15 each contain as many comma-separated items as the number given in column 8 (here, 5 each). What you can do is pad column 15 with 0's to meet that requirement (assuming the actual frame offset is zero or unimportant to your analysis).
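
    The padding described above could be scripted along these lines. This is only a minimal sketch, not part of GATK: the function names `split_items` and `pad_exon_frames` are made up for illustration, and it assumes the file is tab-delimited with the 0-based column layout shown in the breakdown. Padding with 0 is only reasonable if the real frame offsets do not matter to your analysis.

```python
def split_items(field):
    """Split a comma-separated field, ignoring the trailing comma."""
    return [item for item in field.split(",") if item]


def pad_exon_frames(line, fill="0"):
    """Pad column 15 (0-based) so it has as many items as column 8 says.

    Assumes a tab-delimited line with at least 16 columns. Padding with
    `fill` is only sensible when the real frame offsets are unimportant.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 16:
        raise ValueError(f"expected at least 16 columns, got {len(fields)}")

    exon_count = int(fields[8])
    # Columns 9 and 10 (exon starts and ends) must already be consistent.
    for index in (9, 10):
        found = len(split_items(fields[index]))
        if found != exon_count:
            raise ValueError(f"column {index} has {found} items, expected {exon_count}")

    frames = split_items(fields[15])
    if len(frames) > exon_count:
        raise ValueError(f"column 15 has {len(frames)} items, expected {exon_count}")

    frames += [fill] * (exon_count - len(frames))
    fields[15] = ",".join(frames) + ","  # keep the trailing comma
    return "\t".join(fields)


# The line from the error message above, rebuilt with tabs.
example = "\t".join([
    "500", "uc001ssx.4", "12", "+", "65824130", "65966295", "65825270", "65963292",
    "5",
    "65824130,65828000,65838518,65951382,65963244,",
    "65825381,65828087,65838569,65951415,65966295,",
    "P52926", "ENST00000403681.6", "cmpl", "cmpl",
    "0,",
])
print(pad_exon_frames(example).split("\t")[15])  # prints 0,0,0,0,0,
```

    Applying the function to every line of the -geneList file and writing the result to a new file would leave the other columns untouched.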

    You asked what some of these columns refer to and here is what I have for you:

    [1]: transcript ID (feature.setTranscript_id)
    [2]: contig name (contig_name), e.g. chromosome 12
    [15]: Exon frame offsets {0,1,2} (eframes). The frame offset relates to the 3-base amino-acid codon.

    I'm not sure what the first column ([0]) of the file refers to. You can read more about your file's format at http://genome.ucsc.edu/FAQ/FAQformat#format9, under Gene Predictions (Extended).
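
    To check which value sits at which index in your own file, something like the following may help. It is a rough sketch that assumes tab-delimited input; the labels cover only the columns identified in this thread, and `KNOWN_COLUMNS` and `describe_columns` are illustrative names, not GATK or UCSC identifiers.

```python
# 0-based indices named in this thread; all other columns stay unlabeled.
KNOWN_COLUMNS = {
    1: "transcript ID",
    2: "contig name",
    8: "exon count",
    9: "exon starts",
    10: "exon ends",
    15: "exon frame offsets",
}


def describe_columns(line):
    """Return (index, label, value) tuples for one tab-delimited line."""
    fields = line.rstrip("\n").split("\t")
    return [(i, KNOWN_COLUMNS.get(i, ""), value) for i, value in enumerate(fields)]


example = "\t".join([
    "500", "uc001ssx.4", "12", "+", "65824130", "65966295", "65825270", "65963292",
    "5",
    "65824130,65828000,65838518,65951382,65963244,",
    "65825381,65828087,65838569,65951415,65966295,",
    "P52926", "ENST00000403681.6", "cmpl", "cmpl",
    "0,",
])
for index, label, value in describe_columns(example):
    print(f"[{index}] {label}: {value}")
```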

    Besides fixing the file format, here are some other solutions to consider:

    • SOLUTION 2: This may seem a funny thing to do, but so long as the frames do not matter to your downstream analyses, an alternative to consider is removing the use of eframes from the code (at least four instances on this particular code page). I'm not familiar with how the frame offset is used by GATK/Picard tools, so how this removal may impact the science is something you will have to tell me. We will definitely take this into consideration when we add the tool/functionality to GATK4. To be clear, DepthOfCoverage functionality will be in GATK4, possibly under a different name or as features of another tool, as discussed in https://github.com/broadinstitute/gatk/issues/4551.

    • SOLUTION 3: Use a different file format. I see DepthOfCoverage takes several. At a glance, converting to BED or BEDTABLE would be the first I'd investigate.

    Available Reference Ordered Data types:
             Name        FeatureType   Documentation
             BCF2     VariantContext   (this is an external codec and is not documented within GATK)
           BEAGLE      BeagleFeature   (this is an external codec and is not documented within GATK)
              BED         BEDFeature   (this is an external codec and is not documented within GATK)
         BEDTABLE       TableFeature   (this is an external codec and is not documented within GATK)
    EXAMPLEBINARY            Feature   (this is an external codec and is not documented within GATK)
        RAWHAPMAP   RawHapMapFeature   (this is an external codec and is not documented within GATK)
           REFSEQ      RefSeqFeature   (this is an external codec and is not documented within GATK)
        SAMPILEUP   SAMPileupFeature   (this is an external codec and is not documented within GATK)
          SAMREAD     SAMReadFeature   (this is an external codec and is not documented within GATK)
            TABLE       TableFeature   (this is an external codec and is not documented within GATK)
              VCF     VariantContext   (this is an external codec and is not documented within GATK)
             VCF3     VariantContext   (this is an external codec and is not documented within GATK)
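
    As a possible starting point for SOLUTION 3, here is a rough sketch that writes one BED6 line per exon from a gene-prediction line. It assumes tab-delimited input with the 0-based column layout discussed above, and that the exon starts and ends are already 0-based, half-open coordinates as BED expects. `genepred_to_bed` is a hypothetical helper name, and whether DepthOfCoverage accepts the result for -geneList has not been checked here.

```python
def genepred_to_bed(line):
    """Return one BED6 line per exon for a tab-delimited gene-prediction line.

    Assumes columns (0-based): 1 = transcript ID, 2 = contig, 3 = strand,
    9 = exon starts, 10 = exon ends, with coordinates usable as-is in BED.
    """
    fields = line.rstrip("\n").split("\t")
    name, chrom, strand = fields[1], fields[2], fields[3]
    starts = [s for s in fields[9].split(",") if s]
    ends = [e for e in fields[10].split(",") if e]
    if len(starts) != len(ends):
        raise ValueError(f"{name}: {len(starts)} exon starts but {len(ends)} exon ends")
    # BED6 columns: chrom, start, end, name, score, strand (score set to 0).
    return ["\t".join([chrom, s, e, name, "0", strand]) for s, e in zip(starts, ends)]


example = "\t".join([
    "500", "uc001ssx.4", "12", "+", "65824130", "65966295", "65825270", "65963292",
    "5",
    "65824130,65828000,65838518,65951382,65963244,",
    "65825381,65828087,65838569,65951415,65966295,",
    "P52926", "ENST00000403681.6", "cmpl", "cmpl",
    "0,",
])
for bed_line in genepred_to_bed(example):
    print(bed_line)
```

    The exon frame column is not used at all here, so the mismatch that triggers the error does not matter for this conversion.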
    

    I hope this is helpful to you.

    P.S. Please let us know if, after following the linked instructions at https://software.broadinstitute.org/gatk/documentation/article.php?id=1329, the RefSeq file that you get still has an incorrect last column.
