Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

Where can I get a gene list in RefSeq format?

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
edited October 2016 in Frequently Asked Questions

1. About the RefSeq Format

From the NCBI RefSeq website

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

2. In the GATK

The GATK uses RefSeq in a variety of walkers, from indel calling to variant annotations. There are many file format flavors of ReqSeq; we've chosen to use the table dump available from the UCSC genome table browser.

3. Generating RefSeq files

Go to the UCSC genome table browser. There are many output options, here are the changes that you'll need to make:

clade:    Mammal
genome:   Human
assembly: ''choose the appropriate assembly for the reference you're using''
group:    Genes abd Gene Prediction Tracks
track:    RefSeq Genes
table:    refGene
region:   ''choose the genome option''

Choose a good output filename, something like geneTrack.refSeq, and click the get output button. You now have your initial RefSeq file, which will not be sorted, and will contain non-standard contigs. To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order.

4. Running with the GATK

You can provide your RefSeq file to the GATK like you would for any other ROD command line argument. The line would look like the following:

-[arg]:REFSEQ /path/to/refSeq

Using the filename from above.

Warning:

The GATK automatically adjusts the start and stop position of the records from zero-based half-open intervals (UCSC standard) to one-based closed intervals.

For example:

The first 19 bases in Chromosome one:
Chr1:0-19 (UCSC system)
Chr1:1-19 (GATK)

All of the GATK output is also in this format, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.

Post edited by Geraldine_VdAuwera on
Tagged:

Comments

  • vivekdas_1987vivekdas_1987 MilanMember

    I have some doubts, I have been using the GATK 2.3 for last one and half months and formulated a pipeline to analyze the exome datasets for our samples, I am having 2 tumor samples and its 4 IPSC lines which I am after reconsidering the Best Practices and being the first time user did it on single samples and then proceeded with annotation. I am indeed grateful to the developers of the tool and the people who have been constantly upgrading the document keeping in line with the latest findings and incorporating them in the forums and the docs page. Some things which I missed out while I was developing the pipeline. I would like to perform is that I want to find out the DepthofCoverage of my aligned bam files , this would help me getting a statistics of number of bases being read each time. Please correct me if am wrong. From the posts I read this can be be done. However I see have to generate a refSeq file from the UCSC genome browser. Is this not the hg19 reference genome fasta file which I have been using for the analysis or do I have to generate another one? The command states

    java -Xmx2g -jar GenomeAnalysisTK.jar \
    -R ref.fasta \
    -T DepthOfCoverage \
    -o file_name_base \
    -I input_bams.list
    [-geneList refSeq.sorted.txt] \
    [-pt readgroup] \
    [-ct 4 -ct 6 -ct 10] \
    [-L my_capture_genes.interval_list]

    I am a bit confused how to generate the -geneList refSeq.sorted.txt , this can be generate from the UCSC right with the instructions provided above? but then the output format is in bed format right so what should I do there? also the my_capture_genes.interval_list. Can you please guide me about this? Is this some customized interval list? Or if I say that I have the below data with me then can I generate the coverage statistics?

    java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T DepthOfCoverage -o output_file_base -I SRR062634.sorted.bam
    What is the input bam list referring to. If I want to do for single samples every time, I have not encountered much problem with your toll and the sailing have been quite smooth with your guidance but here am a bit confused. It would be nice if you can guide me through.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @vivekdas_1987,

    You don't necessarily have to provide a RefSeq file to run DepthOfCoverage, that's only needed if you want to get coverage metrics aggregated by gene. What you do have to provide, since you're working with exome data, is the list of capture intervals that was used to generate your data. In fact, you should also be using that interval list in other steps of the Best Practices (especially base recalibration and variant calling). If you do not have this list, you should contact the lab that generated the sequence data and ask them what exome intervals were used and where you can get a copy of the intervals file.

  • vivekdas_1987vivekdas_1987 MilanMember

    Yes I have that list of intervals its a bed file and the version is V4 sure select interval list. This is the file SureSelect_XT_Human_All_Exon_V4.bed. Please let me know if this is correct or not? Then if I reframe the code it should look like this below:

    java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T DepthOfCoverage -o output_file_base -I SRR062634.sorted.bam -L SureSelect_XT_Human_All_Exon_V4.bed

    Please let me know if this looks correct or not? I hope this would serve the purpose now?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes, that looks correct.

  • vivekdas_1987vivekdas_1987 MilanMember

    Thanks I have prepared the scripts for all my 6 samples and now gave it for run. Once It finishes will update here about the run and if I face any problem in understanding any features. Once again thanks a lot for all the help and the advices.

  • rglowerglowe AustraliaMember

    I would like to know the recommended method to make a refseq file in the "UCSC table dump" format for a genome that isn't on the UCSC genome browser. My genome of interest isn't on the UCSC browser and my initial searches haven't yet found a specification for that table layout. Can I use a GTF or GFF to supply genes to CalculateDepthOfCoverage, for example? I guess I could calc coverage per gene afterwards...
    regards,
    Rohan

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @rglowe I'm not aware of a formal spec sheet, so my recommendation would be to pick an organism that is in the UCSC db, do a table dump, and then use it as a template to build yours. Primitive but it should work.

    If there is a kind soul out there who wants to be helpful, do that and derive a spec sheet from it that we can share with the world, we're happy to add it to the docs.

  • pdexheimerpdexheimer Member ✭✭✭✭

    I believe that the "spec", such as it is, is provided by the .sql files associated with each table. I personally find it easier to use the "describe table schema" button in the UCSC Table Browser

  • rglowerglowe AustraliaMember

    Thanks for the advice. I've had a look at the table for Human and I might be able to replicate it with my rudimentary parsing skills. Hopefully I can leave out the 'exonFrames' column otherwise I'll have to generate that data.

    One further question: in the original post you mention
    "To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order."

    I am guessing this only applies to Human? I would be using different naming convention and don't have karyotype information.

    Thanks again.

  • rglowerglowe AustraliaMember

    I found a short description of the Refseq table format in the document describing RefSeqCodec.
    https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_utils_codecs_refseq_RefSeqCodec.php

  • BiocybermanBiocyberman Member
    edited October 2016

    @Geraldine_VdAuwera
    A technical problem: I could not find the sortByRef.pl anywhere.

    Update (after 2 mins): I found it in gatk/public/perl directory of my local github repo. But still can't find it online. You may need to update the link to the file.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    We deprecated that script some time ago, and removed it from the codebase. We now recommend using Picard SortVcfs for this purpose.
  • Good to know. Then it is also a good idea to update the tutorial to reflex the change.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, fair point, will do.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Actually, my bad -- SortVcf doesn't run on refseq files; I was thinking of the VCF sorting use case. We deprecated the perl script thinking that all use cases were covered but it looks like we didn't account for the refseq case. That does mean we no longer provide an official recommendation for generating properly sorted refseq files. I'll look into options for fixing that (aside from bringing back the perl script, which we'd rather avoid) but in the meantimes we'd be happy to take recommendations from the community.

  • Well, I am not sure why the old Perl file is deprecated. I tried it and it does not handle every thing completely and produce in compatible output to GATK. I wrote a short Linux-based, bash script to do the job. With the ucsc.refseq.txt prepared the way you described in the tutorial, the script's output can be run with GATK's DepthOfCoverage.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @Biocyberman Thanks for sharing your solution! As far as I can remember the Perl script broke and no one wanted to put in the effort to fix it.

  • mglclinicalmglclinical USAMember

    Dear GATK Team,

    On UCSC's Table Browser website, the parameter "output format" has the following 8 options :

    all fields from selected table
    selected fields from primary and related tables
    sequence
    GTF- gene transfer format
    CDS FASTA alignment from multiple alignment
    BED - browser extensible data
    custom track
    hyperlinks to Genome Browser

    Could you please specify which "output format" should I select in order to make that output file to be given as input to my -geneList parameter of the DepthOfCoverage tool ?

    Thanks,
    mglclinical

  • mglclinicalmglclinical USAMember
    edited October 2016

    Based on the format posted on this link : http://gatkforums.broadinstitute.org/gatk/discussion/40/performing-sequence-coverage-analysis

    I suspect that the answer is "all fields from selected table" .

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited October 2016

    Hi @mglclinical, the GTF format sounds familiar but I'd have to double-check for this specific tool what this is used for and if it is appropriate. Can you try this and let us know if your output is as expected?

    P.S. 10/6: Turns out we have an article related to this here. Also, check out this thread and this slightly digressive thread.

    Post edited by shlee on
  • I can fix the Perl script if you prefer. But it can be only in December when I have a little free time.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Thanks, but I think we'd do better to make a proper GATK tool that can sort simple non-vcf reference-ordered data formats, rather than rely on external scripts. I'll see if we can get a new dev to take it on as an onboarding exercise.

    Issue · Github
    by Sheila

    Issue Number
    2197
    State
    open
    Last Updated
  • mglclinicalmglclinical USAMember

    Hi @shlee

    I have downloaded the refseq file with the output format "all fields from selected table" and that worked for me, it is the same format that is specified in this link http://gatkforums.broadinstitute.org/gatk/discussion/40/performing-sequence-coverage-analysis . After downloading the file I have processed it (sorted, removed unnecessary contigs) in order to make it work with DepthOfCoverage tool

    however, I did not try the GTF format.

    Thanks,
    mglclinical

  • mglclinicalmglclinical USAMember

    Hi @Geraldine_VdAuwera

    I believe I have successfully used the DepthOfCoverage tool(GATK 3.5) for calculating gene level coverage by using the -geneList parameter .

    I have performed some manual QC on the gene level coverage(example human genes : ABCD1, LRRC8C, and GNG10) and compared those values to the "_mean_cvg" column values in the ".sample_gene_summary" file to make sure that the gene level depth of coverage is accurately being calculated by DepthOfCoverage tool.

    I observed that my manual calculations(total_coverage & mean_coverage) for genes ABCD1 and LRRC8C matched identically against the calculations from DepthOfCoverage tool. Where as for gene GNG10 they did not match.

    My total_coverage calculation for GNG10 was ~10,000, where as DepthOfCoverage's calculation of total_coverage for GNG10 is 2623, which is a gross under representation. I went back and looked into my calculations and observed that 1st exon of GNG10 has a total_coverage of 2623 and that matched the over all DepthOfCoverage's calculation for gene GNG10. And this GNG10 has multiple transcripts(NM_001017998, NM_001198664) in my refseq file. My other example genes ABCD1 and LRRC8C that I considered for my manual QC have single transcripts in my geneList file. So, it seems to suggest to me that coverage for genes with multiple transcripts in the geneList file are being be underrepresented by the DepthOfCoverage tool.

    I am not sure if this is a bug in the DoC tool, and please let me know if you have any advice on this.

    Thanks,
    mglclinical

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mglclinical
    Hi mglclinical,

    Ah, yes. Only the first transcript is represented in the output of DepthOfCoverage.

    -Sheila

  • mglclinicalmglclinical USAMember
    edited October 2016

    Hi @Sheila

    GNG10 gene has at least 2 CDS exons in humans.

    DepthOfCoverage tool calculated the coverage based on the reads in 1st exon region, and it completely ignored the reads in the 2nd exon region , hence I believe it is a bug in the software(DepthOfCoverage)

    For exon1 : total_coverage & mean_coverage are ~2000 and ~17
    For exon2 : total_coverage & mean_coverage are ~9000 and ~77

    So, for overall gene_level calculation the total_coverage & mean_coverage should be around ~11000 and ~50.

    But the DepthOfCoverage tool's gene_level calculation of total_coverage & mean_coverage is 2623 and 16.39

    Hence I think the DepthOfCoverage tool has considered only the 1st exon and ignored the 2nd exon for gene level coverage calculation

    Thanks,
    mglclinical

    Post edited by mglclinical on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mglclinical
    Hi mglclinical,

    Would you be able to submit a bug report? Instructions are here.

    Thanks,
    Sheila

  • mglclinicalmglclinical USAMember

    Hi @Sheila , I will submit a bug report, Thanks mglclinical

  • Hello! I am a plant guy. There is no RefSeq of Arabidopsis in UCSC genome table browser. Where can I get a gene list of Arabidopsis in the RefSeq format?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @zhiyewang1
    Hi,

    No need to post twice. We are helping here.

    -Sheila

  • mLalondemLalonde Member
    edited July 3

    Hi,
    I think the "track" and "table" and "region" options in UCSC have changed. I think this article should be slated for a re-write, and maybe could include a description (or downloadable example) of what the output should look like?
    In any case, this tutorial is no longer up-to-date b/c of changes to the UCSC table browser UI.
    Cheers,
    Matt

    Edit: I just read other topics that indicate tools such as DepthOfCoverage are not yet ported to GATK4, so I'd imagine the documentation is even further down the list of priorities. In that case, I just want to say that I appreciate all your hard work, and I look forward to GATK4 being fully built out and documented.

    Post edited by mLalonde on

    Issue · Github
    by shlee

    Issue Number
    3125
    State
    open
    Last Updated
  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @mLalonde,

    Thanks for letting us know that the options in UCSC have changed. I've let the team know we should consider updating this document.

Sign In or Register to comment.