Where can I get a gene list in RefSeq format?
1. About the RefSeq Format
From the NCBI RefSeq website:
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.
2. In the GATK
The GATK uses RefSeq in a variety of walkers, from indel calling to variant annotations. There are many file format flavors of RefSeq; we've chosen to use the table dump available from the UCSC genome table browser.
3. Generating RefSeq files
Go to the UCSC genome table browser. There are many output options; these are the settings you'll need to change:
clade: Mammal
genome: Human
assembly: (choose the appropriate assembly for the reference you're using)
group: Genes and Gene Prediction Tracks
track: RefSeq Genes
table: refGene
region: (choose the genome option)
Choose a good output filename, something like geneTrack.refSeq, and click the get output button. You now have your initial RefSeq file, which will not be sorted and will contain non-standard contigs. To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order.
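The cleanup step described above can be sketched in Python. This is a minimal illustration, not an official GATK utility. It assumes the table browser dump is tab-separated, that header lines start with "#", that the contig name is in the third column and the transcript start in the fifth (the usual refGene layout when the leading bin column is present), and that contigs may carry a "chr" prefix. The function names and the contig order are assumptions; the order should match the sequence dictionary of the reference you are using.

```python
# Contig order assumed from the text (1-22, X, Y, MT); adjust it to match
# the sequence dictionary of your reference.
STANDARD_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]
RANK = {name: i for i, name in enumerate(STANDARD_ORDER)}


def normalize_contig(chrom):
    """Map 'chr1' -> '1' and 'chrM' -> 'MT'. Used for ranking only."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return "MT" if name == "M" else name


def filter_and_sort(lines, chrom_col=2, start_col=4):
    """Drop non-standard contigs and sort records by contig, then start.

    Column indexes are assumptions about the dump layout; header lines
    (starting with '#') are kept at the top unchanged.
    """
    header, records = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        contig = normalize_contig(fields[chrom_col])
        if contig in RANK:  # anything else is a non-standard contig
            records.append((RANK[contig], int(fields[start_col]), line))
    records.sort(key=lambda rec: (rec[0], rec[1]))
    return header + [rec[2] for rec in records]


def rewrite_file(in_path, out_path):
    """Read a raw dump and write the filtered, sorted version."""
    with open(in_path) as src:
        cleaned = filter_and_sort(src)
    with open(out_path, "w") as dst:
        dst.writelines(cleaned)
```

For example, rewrite_file("geneTrack.refSeq", "geneTrack.sorted.refSeq") would write a cleaned copy next to the original; the output filename here is only a placeholder.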
4. Running with the GATK
You can provide your RefSeq file to the GATK like you would for any other ROD command line argument. The line would look like the following:
Using the filename from above.
The GATK automatically adjusts the start and stop position of the records from zero-based half-open intervals (UCSC standard) to one-based closed intervals.
The first 19 bases in Chromosome one:
Chr1:0-19 (UCSC system)
Chr1:1-19 (GATK)
All of the GATK output is also in this format, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.
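For scripts that move records between the two conventions, the adjustment amounts to shifting the start position by one while leaving the end position unchanged. The sketch below illustrates this; the function names are hypothetical and are not part of the GATK.

```python
def ucsc_to_one_based(start, end):
    """Convert a zero-based half-open interval to one-based closed."""
    return start + 1, end


def one_based_to_ucsc(start, end):
    """Convert a one-based closed interval to zero-based half-open."""
    return start - 1, end


# The first 19 bases of chromosome 1, as in the example above.
gatk_start, gatk_end = ucsc_to_one_based(0, 19)
print(f"Chr1:{gatk_start}-{gatk_end}")
```

Both forms describe the same number of bases: end - start in the UCSC system and end - start + 1 in the one-based closed system.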