The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# What input files does the GATK accept / require?

edited July 30 in FAQs

All analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:

• Reference genome sequence
• Intervals of interest
• Reference-ordered data

This article describes the corresponding file formats that are acceptable for use with the GATK.

### 1. Reference Genome Sequence

The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. All the standard IUPAC bases are accepted, but keep in mind that non-standard bases (i.e. other than ACGT, such as W for example) will be ignored (i.e. those positions in the genome will be skipped).

Some users have reported having issues with reference files that have been stored or modified on Windows filesystems. The issues manifest as "10" characters (corresponding to encoded newlines) inserted in the sequence, which cause the GATK to quit with an error. If you encounter this issue, you will need to re-download a valid master copy of the reference file, or clean it up yourself.

Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.

#### Important note about human genome reference versions

If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.

Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.

The only input format for sequence reads that the GATK itself supports is the [Sequence Alignment/Map (SAM)] format. See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.

If you don't find the information you need in this section, please see our FAQs on BAM files.

If you are starting out your pipeline with raw reads (typically in FASTQ format) you'll need to make sure that when you map those reads to the reference and produce a BAM file, the resulting BAM file is fully compliant with the GATK requirements. See the Best Practices documentation for detailed instructions on how to do this.

In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:

• The file must be binary (with .bam file extension).
• The file must be indexed.
• The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).
• The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).
• Each read in the file must be associated with exactly one read group.

Below is an example well-formed SAM field header and fields (with @SQ dictionary truncated to show only the first two chromosomes for brevity):

@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:249250621    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ     SN:2    LN:243199373    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:a0d9851da00400dec1098a9255ac712e
@RG     ID:ERR000162    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR000252    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001684    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001685    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@PG     ID:GATK TableRecalibration      VN:v2.2.16      CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau
@PG     ID:bwa  VN:0.5.5
ERR001685.4315085       16      1       9997    25      35M     *       0       0       CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT     ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@?     XT:A:U  XN:i:4    X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  RG:Z:ERR001685  NM:i:6  MD:Z:0N0N0N0N1A0A28     OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???>
ERR001689.1165834       117     1       9997    0       *       =       9997    0       CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG     >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@     RG:Z:ERR001689    OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>???????
ERR001689.1165834       185     1       9997    25      35M     =       9997    0       CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT     758A:?>>8?=@@>>?;4<>=??@@==??@?==?8     XT:A:U  XN:i:4    SM:i:25 AM:i:0  X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  RG:Z:ERR001689  NM:i:6  MD:Z:0N0N0N0N1A0A28     OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>>
ERR001688.2681347       117     1       9998    0       *       =       9998    0       CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG     5@BA@A6B???A?B??>B@B??>B@B??>BAB???     RG:Z:ERR001688    OQ:Z:=>>>><4><<?><??????????????????????


#### Note about fixing BAM files with alternative sortings

The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use ReorderSam.

### 3. Intervals of interest

If you don't find the information you need in this section, please see our FAQs on interval lists.

The GATK accept interval files for processing subsets of the genome in Picard-style interval lists. These files typically have an extension such as '.list' or more explicitly,.interval_list, and look like this:

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6
...


consisting of aSAM-file-like sequence dictionary (the header), and targets in the form of <chr> <start> <stop> + <target_name>. These interval lists are tab-delimited. They are also 1-based (first position in the genome is position 1, not position 0). The easiest way to create such a file is to combine your reference file's sequence dictionary (the file stored alongside the reference fasta file with the .dict extension) and your intervals into one file.

You can also specify a list of intervals formatted as <chr>:<start>-<stop> (one interval per line). No sequence dictionary is necessary. This file format also uses 1-based coordinates.

Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.

### 4. Reference Ordered Data (ROD) file formats

The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:

-argumentName:name,type file


Where name is the name in the GATK tool (like "eval" in VariantEval), type is the type of the file, such as VCF or dbSNP, and file is the path to the file containing the ROD data.

The GATK supports several common file formats for reading ROD data:

• VCF : VCF type, the recommended format for representing variant loci and genotype calls. The GATK will only process valid VCF files; VCFTools provides the official VCF validator. See here for a useful poster detailing the VCF specification.
• UCSC formated dbSNP : dbSNP type, UCSC dbSNP database output
• BED : BED type, a general purpose format for representing genomic interval data, useful for masks and other interval outputs. Please note that the bed format is 0-based while most other formats are 1-based.

Note that we no longer support the PED format. See here for converting .ped files to VCF.

If you need additional information on VCF files, please see our FAQs on VCF files here and here.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• Posts: 11Member

Can you provide a link for "Preparing the essential GATK input files: the reference genome?"

Geraldine Van der Auwera, PhD

• Posts: 11Member

It's also missing under Guide -> Introductory Materials -> What input files does the GATK accept?

Thanks for the quick response!

That's actually the same article (the Guide "pulls" articles from the forum), but it's a cached version that hasn't been refreshed yet -- will be soon though.

I'm here to help :)

Geraldine Van der Auwera, PhD

• Posts: 103Member

Hi, Geraldine:

the document as above mentioned: The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT. The order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. However, I noticed that one of training data at VQSR step: dbsnp_135.hg19.vcf has chrM at beginning, and so I run into issue at VQSR step. When I changed my reference chromosome order as chrM, chr1, chr2,…chr10,chr11,…….chr22,chrX, chrY, then it works fine. Could you comment on this behavior?

Thanks,

Mike

Hi Mike,

The order given in the document refers to the ordering we use preferentially, which corresponds to the b37 version of the human reference genome. The version of dbsnp you used is ordered to suit the UCSC-style reference genome, hg19 (which you can recognize in the filename).

You should always choose your reference version very carefully at the start of your analysis. If your data is unaligned, you can usually choose whatever you feel like. But once your data is already aligned to a reference, you should pick the corresponding reference, and stick with it throughout your analysis. That means that when it comes to using dbsnp, for example, you have to use the right version that matches your reference (again, the filename will tell you which reference it corresponds to). We provide both versions in our resource bundle.

If you don't do this, or try to mix & match halfway through, you're gonna have a bad time.

Geraldine Van der Auwera, PhD

• Posts: 103Member

Thanks for the input! Mike

• Posts: 9Member

When building picard-style Interval Lists, must my targeted intervals be sorted according to the chromosome which they target? Or can I list the targeted intervals in any order?

Off the top of my head I would say that they don't have to be ordered, but I'll have to check. You can do a very simple test by feeding in a few unordered intervals; you'll see quickly if the program chokes or not...

Geraldine Van der Auwera, PhD

• Posts: 679GATK Developer mod

That's right - the GATK will sort intervals on the fly.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 2Member

Do you have a link pointing to this::: See [here] for converting .ped files to VCF

Geraldine Van der Auwera, PhD

• Posts: 22Member

I have a question about the chromosome ordering in the reference file. When I go into your bundle (/bundle/1.5/hg19) and look at the ucsc.hg19.fasta file located there I find that it begins with >chrM and not >chr1. On this page it mentions that our reference file should go from 1-22, X, Y, M. I have assembled my own genome.fa from the ensemble chromosome files that follows your suggested order and am creating an index file, but I wonder whether I should just be using the available ensemble genome.fa and precomputed BWA index file that follows the same order of your ucsc.hg19.fasta? Do you have any suggestions to offer on these two options? Thanks in advance for the assistance!

• Posts: 679GATK Developer mod

I fixed the text above to differentiate between b3x and hg1x references.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 103Member
edited November 2012

Hi,

what if my reference has some contigs like

>chr4_gl000194_random
>chr4_gl000193_random
>chr9_gl000200_random


which is not in your dbSNP file, does it matter how they are ordered as long as put them all at the end of the reference file after all the regular chromosomes such as chrM, chr1, chr2, chr21, chr22, chrX, chrY?

Any suggestions?

Thanks!

Mike

Post edited by Geraldine_VdAuwera on

That should be fine, Mike.

Geraldine Van der Auwera, PhD

• Posts: 103Member

Dear Geraldine_VdAuwera:

Mike

• Posts: 1Member

Hi Geraldine,

FYI, the link to VCF poster specifications is missing (and I was quite curious about it, I must say ;) )

best Andrea

Hi Andrea, thanks for pointing this out. It seems we have a couple of missing links there, sorry about that. I'll try to restore them asap; in the meantime the best resources for learning about VCFs are at the 1000 Genomes project website and the VCFTools website, which I've added links for in that paragraph.

Geraldine Van der Auwera, PhD

• Posts: 29Member

I was trying to use the Select Variants to filter variants over a specified interval list. I understand that the ".interval_list" is picard interval style. Usually I use the header from the bam as the header for this file. If I do not have the bam, is there a way to create .interval_list?

Thank you, Teja

edited February 2013

Hi Teja,

The intervals list doesn't need a header. Just write your intervals, one per line, in the following format:
1:10000-20000

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

• Posts: 4Member

Hi Geraldine

I'm just about to start using GATK, but you say the reference must only contain the standard bases, ATCG. What about N's?

Our reference is full of iupac codes, but if N's are ok I will just swap them all for N's and see how we go.

thanks

Gareth

N is totally fine.

• Posts: 4Member

Great thanks

• Posts: 7Member

Hi, You mention that "no non-standard bases (W, for example) are tolerated": why does the GATK bundle b37 fasta file contains R and M bases in chr 3 (human_g1k_v37.fasta.gz file) if this is mentioned as an issue?

Can that influence the analysis somehow?

Thank you!

Hi @igcocole,

This is a case of outdated doc -- the GATK now handles all IUPAC bases, though by "handles" I mean "ignores" (whereas the previous behavior was to blow up). I will correct the doc, thanks for pointing this out.

Geraldine Van der Auwera, PhD

• Posts: 1Member

Hi, Geraldine I use newest hg19 from UCSC as reference, there are contigs like chr6_apd_hap1.fa, chr8_gl000197_random.fa and unplaced contigs like chrUn_gl000221.fa ... what order should these contigs be placed? or just put them all at the end of the reference file after all the regular ?

Hi @drunkcoder,

At the most basic level, the important thing is to have your bam sorted in the same order as the reference you use. The convention we use is to sort the reference lexicographically.

In practice however -- keep in mind that if you're using the latest version of the reference, you may not be able to use the resource files we provide (snps, known indels etc). If the new version just has some additional contigs then it may be compatible as long as the extra contigs are placed at the end. But if the reference has been patched in the regular contigs then our resources might not be compatible.

Geraldine Van der Auwera, PhD

• Posts: 4Member

I had the error about the non-iupac "10" character, but i have to start with bam files that were aligned to another reference file and thus can't use the ones in the gatk bundle. You said that we could clean up the reference file ourselves... how could I do that? Thanks in advance!

The simplest way is to write a script that will recognize and remove the bad characters. We do not provide guidance on how to do that, since it's a general programming question that is not related to the GATK, but perhaps someone in the community might be able to share some advice. Otherwise I would recommend asking your IT department or a colleague with relevant experience.

Geraldine Van der Auwera, PhD

• Posts: 4Member

I talked to some people with more programming experience than me, and they can't figure out how to clean up the reference file. I had already tried common methods of converting windows to unix files (e.g. dos2unix) [although my reference files had never been opened by a windows program].

Since I can't realign my files to the GATK bundle reference file, I guess I'm going to have to go to an older version of GATK to bypass this error, although I know that's not ideal.

Just as a suggestion, could you look at changing the error message so that it indicates exactly which file has the non-iupac base (i.e. the fasta file, the index file, or the dictionary file) and which line the non-iupac base is found on? That would give us a place to start in cleaning up the file.

Hi Jilin,

The reference file is just a text file, so you don't really need to convert anything. The general idea is to run a simple character substitution function to remove the bad characters.

This problem is not going to go away if you use an older version of GATK -- the program is always going to choke on the bad characters.

If you're working with standard human data you could just revert your bams, remap them to a clean reference and proceed normally.

Finally, the error message does state what file the problem is in. If you post the full error message you got I can help you find where this is written.

Geraldine Van der Auwera, PhD

• Posts: 4Member

Geraldine,

Your help would be appreciated! I can easily substitute characters with sed, but I'm not sure what characters I'm looking for. I searched for '10,' but the only 10s that come up are for chromosome number (and bp position in the index file).

The older version of GATK did actually run on my reference file... so maybe I'm interpreting this error wrong?

How long would it take to revert bams and remap them to the ref file from the GATK bundle?

The full error message was as follows:

##### ERROR ------------------------------------------------------------------------------------------

The command was: java -jar GenomeAnalysisTK.jar -R hg19.genome.new.fa -T UnifiedGenotyper -I bamlist.list -L chr1.bed -o chr1.vcf -U

Hmm, which older version did you use?

I have to correct what I said earlier -- simple character substitution isn't going to work since the problem is in the file encoding, not content -- my mistake, I was thinking of a different problem.

I haven't ever needed to do this myself so I can't give you exact instructions, but this should be a good starting point: http://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets

The error message does state that it's your reference file which has a problem:

We encountered a non-standard non-IUPAC base in the provided reference: '10'

Reverting and remapping can take quite a while if you have a lot of data. If you figure out how to solve this encoding problem, you will be able to deal with it quickly if it happens again. I recommend trying to fix your reference. But it's your choice.

Geraldine Van der Auwera, PhD

• Posts: 4Member

Thanks for the link, I will check it out. The older version is GATK 1.4.

Yes, I new remapping would take time that we really don't have for this project unfortunately, so I will press on with the older version if I can't fix the reference file!

Thanks.

It's up to you, but I have to caution you that using version 1.4 is a really bad idea. The software has improved enormously since then and I can guarantee you would have much better results with the latest version.

Geraldine Van der Auwera, PhD

• Posts: 2Member

Dear Geraldine, I downloaded the human_g1k_v37 from Broad bundle and I created the database for gsnap aligner. After obtaining my .bam (sorted and indexd and after duplicate removal) GATK returns me a chromosome order error. it is referreder only to non canonical (chromosome like GL000207.1, GL000226.1). How can I order my bam file? Is the human_g1k_v37 sorted and indexed with samtools? Usually I use samtools to sort and index my bam. Thanks in advance, Francesco

• Posts: 2Member

Hi,

I've noticed that the intervals given in the .interval_list file are grouped when they overlap or when the start of one interval is contiguous to the end of another one. So I have 200 lines in my interval_list file, but I get only 100 in the sample_interval_summary. Is there a way to avoid this behaviour to get the 200 intervals separately in the sample_interval_summary file?

Olivia

Hi @frankof,

The reference we provide in our bundle is sorted and indexed according to the rules laid out in our documentation, you shouldn't need to do any additional sorting if you followed the instructions we provide. Can you post the text of the error you got?

Geraldine Van der Auwera, PhD

Hi @oliviajm,

No, the interval merging behavior cannot be overridden, sorry.

Geraldine Van der Auwera, PhD

• Posts: 2Member

Hi Geraldine. Thank for your prompt response the error is: ##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.

##### ERROR ------------------------------------------------------------------------------------------

Now I'm trying to reorder my bam with picard RerderSam.jar. I hope it will be the right way. What do you suggest? Many thanks, Francesco

Hmm, I'm not sure why the sorting order would be different if you mapped your reads to that reference, but using ReorderSAM should fix it, yes.

Geraldine Van der Auwera, PhD

• Posts: 36Member

Can anyone tell me whether the human_g1k_b37 reference is masked? I can't seem to find out from 1000 genomes...

Kath

• Posts: 35Member

Hi, there. I'm using SelectVariants to get the concordance of the variants got from GATK UnifiedGenotyper and SAMtools mpileup, but GATK kept complaining "Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:

##### ERROR Name FeatureType Documentation

I have rerun UnifiedGenotyper and use the newly generated vcf file, but got the same error, I don't know why, can anyone give some ideas?

• Posts: 35Member

I have validate my vcf file, nothing seems wrong, I wonder how this happens ?

Geraldine Van der Auwera, PhD

• Posts: 35Member

I used GATK version 2.4-9. Here is my command line: for generating variant file,

java -jar /home/Software/GenomeAnalysisTK-2.4-9/GenomeAnalysisTK.jar
-R ref.fa
-T UnifiedGenotyper
-I sample.recal.bam
-o INDEL.vcf
--dbsnp dbSNP.vcf
-glm INDEL
-rf BadCigar
and for SelectVariants,
java  -jar /home/Software/GenomeAnalysisTK-2.4-9/GenomeAnalysisTK.jar
-R ref.fa
-T SelectVariants
--variant sample.gatk.raw.vcf
--concordance sample.samtools.raw.vcf
-o sample.concordance.raw.vcf
I just don't know what's wrong...

Hi Jack,

I'm not sure, the GATK should recognize automatically that your files are VCF. You can try to force recognition by writing --variant:VCF sample.gatk.raw.vcf for example.

Are you using an older version of GATK? Can you tell from the error message exactly which file it is complaining about?

Geraldine Van der Auwera, PhD

• Posts: 335Member, GSA Collaborator ✭✭✭

I would concentrate on the samtools VCF. This is just a hypothesis, but it wouldn't surprise me greatly if it were a valid VCF version 4 file, and GATK requires 4.1 - or some similar situation

• Posts: 35Member
edited August 2013

I would concentrate on the samtools VCF.

I think I find the problem, it's the problem of samtools, now I have fixed it after rerunning, with a few arguments modified.

Thank you, Geraldine and pdexheimer!

Post edited by Geraldine_VdAuwera on
• Posts: 10Member

If anyone is still interested in it, I had the same problem and tried different aproaches to fix this:

@jlin said: Geraldine,

Your help would be appreciated! I can easily substitute characters with sed, but I'm not sure what characters I'm looking for. I searched for '10,' but the only 10s that come up are for chromosome number (and bp position in the index file).

The older version of GATK did actually run on my reference file... so maybe I'm interpreting this error wrong?

How long would it take to revert bams and remap them to the ref file from the GATK bundle?

The full error message was as follows:

##### ERROR ------------------------------------------------------------------------------------------

The command was: java -jar GenomeAnalysisTK.jar -R hg19.genome.new.fa -T UnifiedGenotyper -I bamlist.list -L chr1.bed -o chr1.vcf -U

Finally, I just imported the reference in byopython as fasta, and then exported it again, which fixed the problem.

#!/usr/bin/env python

from Bio import SeqIO

handle = open("reference.fa", "rU")

input_seq_iterator = SeqIO.parse(handle,"fasta")

output_handle = open("reference_clean.fa", "w")

SeqIO.write(input_seq_iterator, output_handle, "fasta")

output_handle.close() 

Best,

David

Thanks for reporting your solution, David -- and three cheers for Biopython!

Geraldine Van der Auwera, PhD

• qingdaoPosts: 2Member

Hi Geraldine! I'm new here to study gatk, and I'm very appreciated for your work. I have a question that I want to do snp calling for Marsupenaeus japonicus, and I have some NGS sequences, but I can't find appropriate reference sequences in the hige, then what should I do for this step? May I use the sequences in my hand itself as reference sequence? If not, is gatk the right software for me? Thanks a lot for your patient and labor, have a good day!

Hi @longyang,

Assuming you have several genomes from different individuals, you can do de novo assembly of one of them and use that as a reference to call variants on the others with GATK. Good luck!

Geraldine Van der Auwera, PhD

• qingdaoPosts: 2Member
edited December 2013

Post edited by longyang on
• DEPosts: 23Member

Hi @Geraldine_VdAuwera‌ (or whoever else is around!),

Just a quick question about ROD formats for known SNP sites for VQSR. I'm trying a set of SNPchip data and the genotype calls are in the following csv format:

JAX00000001,A,T,chr1,3013441,T,T,T,T,T,T,A,A,A,A,T,T,T,A,A,T,T,T,T,T,T,A,A,A,A,A,A,A,A,A,A,T,T,T,A,A,A,T,A,T,T,T,T,T,A,A,A,A,T,T,T,A,A,T,T,N,A,T,A,A,A,T,A,A,T,T,T,T,T,A,T,A,A,A,A,T,T,A,A,A,A,A,T,A,A,A,T,T,T,T,T,T,T,A,T,A,A,A,T,T,T,T,T,T,T,N,T,T,T,T,T,T,N,T,A,A,N,A,N,T,T,A,T,N,T,A,T,T,T,T,N,N,T,T,T,A,T,A,T,T,N,N,T,T,T,A,T,T,A,T,A,N,T,T,T,T,T,T,T,T,A,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,N,T,A,N,N,T,T,A,T,T,T,T,T,T,T,T Where col1=JAX SNP ID, col2= reference allele, col3 = alt allele, col4 = chrom, col5 = base position, and the following ~100 columns are per-mouse strain genotype calls.

My question is: which ROD format should I use?

My initial instinct would be to try and mangle the data into vcf format, but vcfs contain a lot of additional quality data that I just don't have.

So I guess I should be using the "UCSC formatted dbSNP" instead, but I just wanted to check that this refers to the format described on the UCSC wiki here: http://genomewiki.ucsc.edu/index.php/DbSNP_Track_Notes#Subset_of_NCBI_fields_used_to_build_snpNNN_track

Finally, do all of those UCSC dbSNP columns need to be populated, or will VariantRecalibrator be happy with what I have (the columns up to and including "observed")?

Thanks very much!

@mfletcher, I'm not sure as I've never used UCSC dbsnp myself, but this loooks like it should work. If it doesn't, VQSR will let you know quickly (and in more detail than you really want) that it's not happy. If so let me know and we'll try to figure it out.

Geraldine Van der Auwera, PhD

• Posts: 335Member, GSA Collaborator ✭✭✭

@mfletcher - UCSC dbSNP format is a "sites-only" format, you won't be able to get the per-strain genotypes into it. Of course, VQSR doesn't care about the calls, so that's alright.

It's pretty easy to make a minimal VCF, though - the rule of thumb is that any data you don't have can be replaced with a ".". So the VCF line for your example might be

chr1  3013441  JAX00000001  A  T  .  PASS  .


This is a sites-only record, and has no value for QUAL or for the INFO annotations. I would recommend converting to VCF rather than the UCSC record because support for VCF format will probably be much more stable over time

• DEPosts: 23Member
edited June 2

Hi @Geraldine_VdAuwera‌, yes you're totally correct - as always the error message had way more detail than I really wanted...!

##### ERROR ------------------------------------------------------------------------------------------

`

Interestingly there's no UCSC dbSNP option there... Suggestions? (Two comments: I commented out the header line of this file with a #, and saved it as a tab-delimited *txt file)

Thanks again!

Post edited by mfletcher on

Ah, you're going to need to convert your file upfront, as it seems the VQSR isn't capable of reading in non-VCF/BCF formats natively. Have a look at the VariantsToVCF tool: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_VariantsToVCF.html

Geraldine Van der Auwera, PhD

• RomaPosts: 1Member

Hi @Geraldine_VdAuwera‌ , Sorry but I don't understand very well this "Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.". It means that I have to add (+1) to the start column possitions of my bed file or (-1) to the end column ?

Thank you!