Where can I get a gene list in RefSeq format?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,751Administrator, GATK Dev admin
edited May 16 in FAQs

1. About the RefSeq Format

From the NCBI RefSeq website

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

2. In the GATK

The GATK uses RefSeq in a variety of walkers, from indel calling to variant annotations. There are many file format flavors of ReqSeq; we've chosen to use the table dump available from the UCSC genome table browser.

3. Generating RefSeq files

Go to the UCSC genome table browser. There are many output options, here are the changes that you'll need to make:

clade:    Mammal
genome:   Human
assembly: ''choose the appropriate assembly for the reference you're using''
group:    Genes abd Gene Prediction Tracks
track:    RefSeq Genes
table:    refGene
region:   ''choose the genome option''

Choose a good output filename, something like geneTrack.refSeq, and click the get output button. You now have your initial RefSeq file, which will not be sorted, and will contain non-standard contigs. To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order. This can be done with a combination of grep, sort, and a script called sortByRef.pl that is available here.

4. Running with the GATK

You can provide your RefSeq file to the GATK like you would for any other ROD command line argument. The line would look like the following:

-[arg]:REFSEQ /path/to/refSeq

Using the filename from above.


The GATK automatically adjusts the start and stop position of the records from zero-based half-open intervals (UCSC standard) to one-based closed intervals.

For example:

The first 19 bases in Chromosome one:
Chr1:0-19 (UCSC system)
Chr1:1-19 (GATK)

All of the GATK output is also in this format, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD



  • vivekdas_1987vivekdas_1987 MilanPosts: 33Member

    I have some doubts, I have been using the GATK 2.3 for last one and half months and formulated a pipeline to analyze the exome datasets for our samples, I am having 2 tumor samples and its 4 IPSC lines which I am after reconsidering the Best Practices and being the first time user did it on single samples and then proceeded with annotation. I am indeed grateful to the developers of the tool and the people who have been constantly upgrading the document keeping in line with the latest findings and incorporating them in the forums and the docs page. Some things which I missed out while I was developing the pipeline. I would like to perform is that I want to find out the DepthofCoverage of my aligned bam files , this would help me getting a statistics of number of bases being read each time. Please correct me if am wrong. From the posts I read this can be be done. However I see have to generate a refSeq file from the UCSC genome browser. Is this not the hg19 reference genome fasta file which I have been using for the analysis or do I have to generate another one? The command states

    java -Xmx2g -jar GenomeAnalysisTK.jar \
    -R ref.fasta \
    -T DepthOfCoverage \
    -o file_name_base \
    -I input_bams.list
    [-geneList refSeq.sorted.txt] \
    [-pt readgroup] \
    [-ct 4 -ct 6 -ct 10] \
    [-L my_capture_genes.interval_list]

    I am a bit confused how to generate the -geneList refSeq.sorted.txt , this can be generate from the UCSC right with the instructions provided above? but then the output format is in bed format right so what should I do there? also the my_capture_genes.interval_list. Can you please guide me about this? Is this some customized interval list? Or if I say that I have the below data with me then can I generate the coverage statistics?

    java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T DepthOfCoverage -o output_file_base -I SRR062634.sorted.bam
    What is the input bam list referring to. If I want to do for single samples every time, I have not encountered much problem with your toll and the sailing have been quite smooth with your guidance but here am a bit confused. It would be nice if you can guide me through.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,751Administrator, GATK Dev admin

    Hi @vivekdas_1987,

    You don't necessarily have to provide a RefSeq file to run DepthOfCoverage, that's only needed if you want to get coverage metrics aggregated by gene. What you do have to provide, since you're working with exome data, is the list of capture intervals that was used to generate your data. In fact, you should also be using that interval list in other steps of the Best Practices (especially base recalibration and variant calling). If you do not have this list, you should contact the lab that generated the sequence data and ask them what exome intervals were used and where you can get a copy of the intervals file.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 33Member

    Yes I have that list of intervals its a bed file and the version is V4 sure select interval list. This is the file SureSelect_XT_Human_All_Exon_V4.bed. Please let me know if this is correct or not? Then if I reframe the code it should look like this below:

    java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T DepthOfCoverage -o output_file_base -I SRR062634.sorted.bam -L SureSelect_XT_Human_All_Exon_V4.bed

    Please let me know if this looks correct or not? I hope this would serve the purpose now?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,751Administrator, GATK Dev admin

    Yes, that looks correct.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 33Member

    Thanks I have prepared the scripts for all my 6 samples and now gave it for run. Once It finishes will update here about the run and if I face any problem in understanding any features. Once again thanks a lot for all the help and the advices.

  • rglowerglowe AustraliaPosts: 3Member

    I would like to know the recommended method to make a refseq file in the "UCSC table dump" format for a genome that isn't on the UCSC genome browser. My genome of interest isn't on the UCSC browser and my initial searches haven't yet found a specification for that table layout. Can I use a GTF or GFF to supply genes to CalculateDepthOfCoverage, for example? I guess I could calc coverage per gene afterwards...

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,751Administrator, GATK Dev admin

    @rglowe I'm not aware of a formal spec sheet, so my recommendation would be to pick an organism that is in the UCSC db, do a table dump, and then use it as a template to build yours. Primitive but it should work.

    If there is a kind soul out there who wants to be helpful, do that and derive a spec sheet from it that we can share with the world, we're happy to add it to the docs.

    Geraldine Van der Auwera, PhD

  • pdexheimerpdexheimer Posts: 512Member, GATK Dev, DSDE Dev mod

    I believe that the "spec", such as it is, is provided by the .sql files associated with each table. I personally find it easier to use the "describe table schema" button in the UCSC Table Browser

  • rglowerglowe AustraliaPosts: 3Member

    Thanks for the advice. I've had a look at the table for Human and I might be able to replicate it with my rudimentary parsing skills. Hopefully I can leave out the 'exonFrames' column otherwise I'll have to generate that data.

    One further question: in the original post you mention
    "To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order."

    I am guessing this only applies to Human? I would be using different naming convention and don't have karyotype information.

    Thanks again.

  • rglowerglowe AustraliaPosts: 3Member

    I found a short description of the Refseq table format in the document describing RefSeqCodec.

Sign In or Register to comment.