RefSeq - gene list format

From the NCBI RefSeq website

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

Several GATK tools accept a RefSeq-formatted gene list. RefSeq data comes in many file format flavors; GATK uses the table dump format produced by the UCSC genome table browser.

Generating RefSeq files

Go to the UCSC genome table browser. There are many output options; here are the settings we care about:

assembly: [be sure to pick the same [reference genome build](https://software.broadinstitute.org/gatk/documentation/article.php?id=11011) that you're working with]
group:    Genes and Gene Prediction Tracks
track:    RefSeq Genes
table:    refGene

Choose a good output filename, something like my_organism.geneTrack.refSeq, and click the get output button. You now have your initial RefSeq file, which may not be sorted and may contain non-standard contigs. You may need to re-sort the file and remove contigs so that the contigs it contains, and their order, exactly match the sequence dictionary of the reference build you're working with.
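
The re-sorting and contig filtering described above could be scripted roughly as follows. This is a minimal sketch, not an official GATK utility. It assumes the table dump is tab-separated with a header line starting with `#`, that the contig name is in the third column and the transcript start in the fifth (the usual refGene layout; check your own file), and that the sequence dictionary is a SAM-style `.dict` file with `@SQ` lines carrying `SN:` tags. The function names are hypothetical.

```python
def contig_order_from_dict(dict_lines):
    """Map each contig name to its rank, read from the @SQ lines of a sequence dictionary."""
    order = {}
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                order[field[3:]] = len(order)
    return order


def sort_refseq(refseq_lines, order, chrom_col=2, start_col=4):
    """Drop records on contigs missing from `order`, then sort by contig rank and start.

    chrom_col and start_col are assumptions about the refGene column layout.
    """
    header, records = [], []
    for line in refseq_lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            header.append(line)  # keep the header line(s) at the top
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[chrom_col]
        if chrom in order:
            records.append((order[chrom], int(fields[start_col]), line))
    records.sort(key=lambda rec: (rec[0], rec[1]))
    return header + [rec[2] for rec in records]
```

To use it, read the `.dict` file and the table dump with `open(...).readlines()`, pass them through the two functions, and write the returned lines to a new file. Whether sorting by transcript start within each contig is sufficient for a given tool is an assumption worth verifying against that tool's documentation.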

Warning

The GATK automatically adjusts the start and stop positions of the records from zero-based half-open intervals (the UCSC standard) to one-based closed intervals.

For example:

The first 19 bases in Chromosome one:
Chr1:0-19 (UCSC system)
Chr1:1-19 (GATK)

All GATK output also uses one-based closed intervals, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.
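
If you do process these files with your own scripts, the conversion between the two coordinate systems is a one-position shift of the start coordinate only. The helper names below are hypothetical and are shown purely to illustrate the arithmetic:

```python
def ucsc_to_one_based(start, end):
    """Convert a zero-based half-open interval (UCSC) to a one-based closed interval."""
    return start + 1, end


def one_based_to_ucsc(start, end):
    """Convert a one-based closed interval back to zero-based half-open (UCSC)."""
    return start - 1, end


# The first 19 bases of chromosome 1: UCSC 0-19 corresponds to 1-19 in one-based closed form.
gatk_start, gatk_end = ucsc_to_one_based(0, 19)
```

The end coordinate is unchanged because a half-open end and a closed end differ by the same one position that the start shifts by; the interval length is `end - start` in the UCSC system and `end - start + 1` in the one-based closed system.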
