The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Where can I get a gene list in RefSeq format?

edited October 2016 in FAQs

### 1. About the RefSeq Format

From the NCBI RefSeq website

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

### 2. In the GATK

The GATK uses RefSeq in a variety of walkers, from indel calling to variant annotations. There are many file format flavors of ReqSeq; we've chosen to use the table dump available from the UCSC genome table browser.

### 3. Generating RefSeq files

Go to the UCSC genome table browser. There are many output options, here are the changes that you'll need to make:

clade:    Mammal
genome:   Human
assembly: ''choose the appropriate assembly for the reference you're using''
group:    Genes abd Gene Prediction Tracks
track:    RefSeq Genes
table:    refGene
region:   ''choose the genome option''


Choose a good output filename, something like geneTrack.refSeq, and click the get output button. You now have your initial RefSeq file, which will not be sorted, and will contain non-standard contigs. To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order.

### 4. Running with the GATK

You can provide your RefSeq file to the GATK like you would for any other ROD command line argument. The line would look like the following:

-[arg]:REFSEQ /path/to/refSeq


Using the filename from above.

#### Warning:

The GATK automatically adjusts the start and stop position of the records from zero-based half-open intervals (UCSC standard) to one-based closed intervals.

For example:

The first 19 bases in Chromosome one:
Chr1:0-19 (UCSC system)
Chr1:1-19 (GATK)


All of the GATK output is also in this format, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.

Geraldine Van der Auwera, PhD

Post edited by Geraldine_VdAuwera on
Tagged:

• MilanMember Posts: 33

I have some doubts, I have been using the GATK 2.3 for last one and half months and formulated a pipeline to analyze the exome datasets for our samples, I am having 2 tumor samples and its 4 IPSC lines which I am after reconsidering the Best Practices and being the first time user did it on single samples and then proceeded with annotation. I am indeed grateful to the developers of the tool and the people who have been constantly upgrading the document keeping in line with the latest findings and incorporating them in the forums and the docs page. Some things which I missed out while I was developing the pipeline. I would like to perform is that I want to find out the DepthofCoverage of my aligned bam files , this would help me getting a statistics of number of bases being read each time. Please correct me if am wrong. From the posts I read this can be be done. However I see have to generate a refSeq file from the UCSC genome browser. Is this not the hg19 reference genome fasta file which I have been using for the analysis or do I have to generate another one? The command states

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T DepthOfCoverage \
-o file_name_base \
-I input_bams.list
[-geneList refSeq.sorted.txt] \
[-ct 4 -ct 6 -ct 10] \
[-L my_capture_genes.interval_list]

I am a bit confused how to generate the -geneList refSeq.sorted.txt , this can be generate from the UCSC right with the instructions provided above? but then the output format is in bed format right so what should I do there? also the my_capture_genes.interval_list. Can you please guide me about this? Is this some customized interval list? Or if I say that I have the below data with me then can I generate the coverage statistics?

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T DepthOfCoverage -o output_file_base -I SRR062634.sorted.bam
What is the input bam list referring to. If I want to do for single samples every time, I have not encountered much problem with your toll and the sailing have been quite smooth with your guidance but here am a bit confused. It would be nice if you can guide me through.

You don't necessarily have to provide a RefSeq file to run DepthOfCoverage, that's only needed if you want to get coverage metrics aggregated by gene. What you do have to provide, since you're working with exome data, is the list of capture intervals that was used to generate your data. In fact, you should also be using that interval list in other steps of the Best Practices (especially base recalibration and variant calling). If you do not have this list, you should contact the lab that generated the sequence data and ask them what exome intervals were used and where you can get a copy of the intervals file.

Geraldine Van der Auwera, PhD

• MilanMember Posts: 33

Yes I have that list of intervals its a bed file and the version is V4 sure select interval list. This is the file SureSelect_XT_Human_All_Exon_V4.bed. Please let me know if this is correct or not? Then if I reframe the code it should look like this below:

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T DepthOfCoverage -o output_file_base -I SRR062634.sorted.bam -L SureSelect_XT_Human_All_Exon_V4.bed

Please let me know if this looks correct or not? I hope this would serve the purpose now?

Yes, that looks correct.

Geraldine Van der Auwera, PhD

• MilanMember Posts: 33

Thanks I have prepared the scripts for all my 6 samples and now gave it for run. Once It finishes will update here about the run and if I face any problem in understanding any features. Once again thanks a lot for all the help and the advices.

• AustraliaMember Posts: 3

I would like to know the recommended method to make a refseq file in the "UCSC table dump" format for a genome that isn't on the UCSC genome browser. My genome of interest isn't on the UCSC browser and my initial searches haven't yet found a specification for that table layout. Can I use a GTF or GFF to supply genes to CalculateDepthOfCoverage, for example? I guess I could calc coverage per gene afterwards...
regards,
Rohan

@rglowe I'm not aware of a formal spec sheet, so my recommendation would be to pick an organism that is in the UCSC db, do a table dump, and then use it as a template to build yours. Primitive but it should work.

If there is a kind soul out there who wants to be helpful, do that and derive a spec sheet from it that we can share with the world, we're happy to add it to the docs.

Geraldine Van der Auwera, PhD

• Member, Dev Posts: 544 ✭✭✭✭

I believe that the "spec", such as it is, is provided by the .sql files associated with each table. I personally find it easier to use the "describe table schema" button in the UCSC Table Browser

• AustraliaMember Posts: 3

Thanks for the advice. I've had a look at the table for Human and I might be able to replicate it with my rudimentary parsing skills. Hopefully I can leave out the 'exonFrames' column otherwise I'll have to generate that data.

One further question: in the original post you mention
"To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order."

I am guessing this only applies to Human? I would be using different naming convention and don't have karyotype information.

Thanks again.

• AustraliaMember Posts: 3

I found a short description of the Refseq table format in the document describing RefSeqCodec.

• Member Posts: 35
edited October 2016

@Geraldine_VdAuwera
A technical problem: I could not find the sortByRef.pl anywhere.

Update (after 2 mins): I found it in gatk/public/perl directory of my local github repo. But still can't find it online. You may need to update the link to the file.

We deprecated that script some time ago, and removed it from the codebase. We now recommend using Picard SortVcfs for this purpose.

Geraldine Van der Auwera, PhD

• Member Posts: 35

Good to know. Then it is also a good idea to update the tutorial to reflex the change.

Ah, fair point, will do.

Geraldine Van der Auwera, PhD

Actually, my bad -- SortVcf doesn't run on refseq files; I was thinking of the VCF sorting use case. We deprecated the perl script thinking that all use cases were covered but it looks like we didn't account for the refseq case. That does mean we no longer provide an official recommendation for generating properly sorted refseq files. I'll look into options for fixing that (aside from bringing back the perl script, which we'd rather avoid) but in the meantimes we'd be happy to take recommendations from the community.

Geraldine Van der Auwera, PhD

• Member Posts: 35

Well, I am not sure why the old Perl file is deprecated. I tried it and it does not handle every thing completely and produce in compatible output to GATK. I wrote a short Linux-based, bash script to do the job. With the ucsc.refseq.txt prepared the way you described in the tutorial, the script's output can be run with GATK's DepthOfCoverage.

@Biocyberman Thanks for sharing your solution! As far as I can remember the Perl script broke and no one wanted to put in the effort to fix it.

Geraldine Van der Auwera, PhD

• USAMember Posts: 95

Dear GATK Team,

On UCSC's Table Browser website, the parameter "output format" has the following 8 options :

all fields from selected table
selected fields from primary and related tables
sequence
GTF- gene transfer format
CDS FASTA alignment from multiple alignment
BED - browser extensible data
custom track

Could you please specify which "output format" should I select in order to make that output file to be given as input to my -geneList parameter of the DepthOfCoverage tool ?

Thanks,
mglclinical

• USAMember Posts: 95
edited October 2016

I suspect that the answer is "all fields from selected table" .

edited October 2016

Hi @mglclinical, the GTF format sounds familiar but I'd have to double-check for this specific tool what this is used for and if it is appropriate. Can you try this and let us know if your output is as expected?

P.S. 10/6`: Turns out we have an article related to this here. Also, check out this thread and this slightly digressive thread.

Post edited by shlee on
• Member Posts: 35

I can fix the Perl script if you prefer. But it can be only in December when I have a little free time.

Thanks, but I think we'd do better to make a proper GATK tool that can sort simple non-vcf reference-ordered data formats, rather than rely on external scripts. I'll see if we can get a new dev to take it on as an onboarding exercise.

Geraldine Van der Auwera, PhD

#### Issue · Github October 2016 by Sheila

Issue Number
2197
State
open
Last Updated
• USAMember Posts: 95

Hi @shlee

I have downloaded the refseq file with the output format "all fields from selected table" and that worked for me, it is the same format that is specified in this link http://gatkforums.broadinstitute.org/gatk/discussion/40/performing-sequence-coverage-analysis . After downloading the file I have processed it (sorted, removed unnecessary contigs) in order to make it work with DepthOfCoverage tool

however, I did not try the GTF format.

Thanks,
mglclinical

• USAMember Posts: 95

I believe I have successfully used the DepthOfCoverage tool(GATK 3.5) for calculating gene level coverage by using the -geneList parameter .

I have performed some manual QC on the gene level coverage(example human genes : ABCD1, LRRC8C, and GNG10) and compared those values to the "_mean_cvg" column values in the ".sample_gene_summary" file to make sure that the gene level depth of coverage is accurately being calculated by DepthOfCoverage tool.

I observed that my manual calculations(total_coverage & mean_coverage) for genes ABCD1 and LRRC8C matched identically against the calculations from DepthOfCoverage tool. Where as for gene GNG10 they did not match.

My total_coverage calculation for GNG10 was ~10,000, where as DepthOfCoverage's calculation of total_coverage for GNG10 is 2623, which is a gross under representation. I went back and looked into my calculations and observed that 1st exon of GNG10 has a total_coverage of 2623 and that matched the over all DepthOfCoverage's calculation for gene GNG10. And this GNG10 has multiple transcripts(NM_001017998, NM_001198664) in my refseq file. My other example genes ABCD1 and LRRC8C that I considered for my manual QC have single transcripts in my geneList file. So, it seems to suggest to me that coverage for genes with multiple transcripts in the geneList file are being be underrepresented by the DepthOfCoverage tool.

I am not sure if this is a bug in the DoC tool, and please let me know if you have any advice on this.

Thanks,
mglclinical

@mglclinical
Hi mglclinical,

Ah, yes. Only the first transcript is represented in the output of DepthOfCoverage.

-Sheila

• USAMember Posts: 95
edited October 2016

Hi @Sheila

GNG10 gene has at least 2 CDS exons in humans.

DepthOfCoverage tool calculated the coverage based on the reads in 1st exon region, and it completely ignored the reads in the 2nd exon region , hence I believe it is a bug in the software(DepthOfCoverage)

For exon1 : total_coverage & mean_coverage are ~2000 and ~17
For exon2 : total_coverage & mean_coverage are ~9000 and ~77

So, for overall gene_level calculation the total_coverage & mean_coverage should be around ~11000 and ~50.

But the DepthOfCoverage tool's gene_level calculation of total_coverage & mean_coverage is 2623 and 16.39

Hence I think the DepthOfCoverage tool has considered only the 1st exon and ignored the 2nd exon for gene level coverage calculation

Thanks,
mglclinical

Post edited by mglclinical on

@mglclinical
Hi mglclinical,

Would you be able to submit a bug report? Instructions are here.

Thanks,
Sheila

• USAMember Posts: 95

Hi @Sheila , I will submit a bug report, Thanks mglclinical