Member Posts: 14
edited October 2012

Hello, I have a newly sequenced genome and several samples from this species. I would like to follow the Best Practices, but I don't have a dbSNP or anything similar. Could I use the variants from my own samples as a dbSNP? For example, could I take the variants that are shared by all my samples and use them as a dbSNP?

Thanks!

Hi there, this is addressed in the FAQs section of the Guide.

Geraldine Van der Auwera, PhD

• Member Posts: 14
edited October 2012

Hi! If I am not mistaken, I am following the FAQ section "What VQSR training sets / arguments should I use for my specific project?" This is my command line:

java -Xms512m -Xmx8G -jar GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fasta -input calling.vcf -recalFile output.recal -tranchesFile  output.tranches -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP -mode SNP


The command returns this error:

ERROR MESSAGE: Invalid command line: No training set found! Please provide sets of known polymorphic loci marked with the training=true ROD binding tag. For example, -resource:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf


So, what should I use as a training set?

Thanks!

You have the right article, but you only passed half of the command: the "common base commandline". In addition to that, you also need to pass the part indicated like this:

[SPECIFY TRUTH AND TRAINING SETS] \
[SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
[SPECIFY WHICH CLASS OF VARIATION TO MODEL] \


What you specify there is indicated in the next part of the document.

Geraldine Van der Auwera, PhD

• Member Posts: 14

OK, but as I understand it, I don't have a training dataset, right? So should I only fill in [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] and [SPECIFY WHICH CLASS OF VARIATION TO MODEL]? Is there any information in the FAQ about this point?

Thanks!

No, you have to give the tool truth/training sets; that is not optional. The thing is, the training/truth sets aren't supposed to come from your project. They are external sets for which we know the accuracy. The article says what to use, right under the base commandline box. For example, for SNPs called on a WGS dataset, you have:

> Whole genome shotgun experiments
>
> SNP specific recommendations
>
> For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. These datasets are available in the GATK resource bundle. Arguments for VariantRecalibrator command:

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP \
-mode SNP \


We provide very specific recommendations; you just have to pick the ones that fit, depending on whether your data is WGS or exome and whether you are looking at SNPs or indels.

Geraldine Van der Auwera, PhD
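
For readers who hit the same "No training set found!" error, here is a minimal sketch of how the two halves fit together: the base command line from the earlier post, followed by the truth/training resources, annotations, and mode quoted above. The function name `vqsr_snp_command` is made up for illustration, and it only prints the command so it can be inspected before running. Note that the HapMap, Omni, and dbSNP files are human (b37) resources, so this applies to human WGS data only.

```shell
# Dry run: print the complete VariantRecalibrator command without executing it.
# Usage: vqsr_snp_command REFERENCE.fasta INPUT.vcf
# (hypothetical helper; file names are the ones quoted in this thread)
vqsr_snp_command() {
  vqsr_ref=$1
  vqsr_input=$2
  # Unquoted here-document: "\\" prints a single backslash at each line end.
  cat <<EOF
java -Xms512m -Xmx8G -jar GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar \\
  -T VariantRecalibrator \\
  -R $vqsr_ref \\
  -input $vqsr_input \\
  -recalFile output.recal \\
  -tranchesFile output.tranches \\
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \\
  -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \\
  -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \\
  -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP \\
  -mode SNP
EOF
}

# Example: show the command for the files used earlier in the thread.
vqsr_snp_command reference.fasta calling.vcf
```

The first six lines of the printed command are the "common base commandline"; the three `-resource` lines, the `-an` line, and `-mode SNP` fill in the three bracketed placeholders in that order.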

• Member Posts: 14

OK, so I think I can't do anything, because I don't have a training/truth set from an external project. The problem is that I have a new genome, and it is not human.

Thanks!

Oh, I didn't realize you were working with a non-human genome, sorry. It is possible to generate your own training/truth sets using very high-confidence subsets of your initial calls (similar to what you may already have done to get a set of -knowns for base recalibration), but it is a fairly complicated process. You may be better off with hard filtering for now, or try asking in "Ask the Community" what people normally do for non-human organisms.

Geraldine Van der Auwera, PhD
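
One possible first step toward such a home-made training set, sketched under stated assumptions: keep only the calls with a very high QUAL that have not failed any filter. The helper name `high_confidence_subset` and the cutoff passed to it are hypothetical. This is plain awk on the VCF text, not a GATK tool, and it is only a starting point rather than the full procedure, which as noted above is more involved.

```shell
# Keep header lines plus records whose QUAL (column 6) is at or above a cutoff
# and whose FILTER (column 7) is PASS or "." (no filter applied yet).
# Usage: high_confidence_subset MIN_QUAL < raw.vcf > highconf.vcf
# (hypothetical helper; the cutoff is an assumption to tune per dataset)
high_confidence_subset() {
  awk -F '\t' -v min_qual="$1" '
    /^#/ { print; next }                 # header lines pass through unchanged
    $6 == "." { next }                   # skip records with no QUAL recorded
    $6 + 0 < min_qual + 0 { next }       # skip records below the cutoff
    $7 == "PASS" || $7 == "." { print }  # keep records that failed no filter
  '
}
```

For example, `high_confidence_subset 500 < calls.raw.vcf > calls.highconf.vcf` would write a stricter subset, where both file names and the value 500 are placeholders. The result could then be supplied with a `-resource:...,training=true,...` binding of the form shown in the error message earlier in the thread.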

• Member Posts: 5
edited November 2012

Hi Geraldine, I am working with yeast and I am at the VariantRecalibrator step. As I don't have a truth data set, I want to filter my initial round of raw SNPs to keep only the highest-quality ones, as you suggest. Do you have any suggestions for the filtering parameters?

I am treating each strain as a different organism, so I have good coverage (80X) but only one lane.

I tried with:

java -Xmx4g -jar GenomeAnalysisTK.jar -R S288c.fasta -T VariantFiltration --variant $1.raw.vcf --filterExpression "QD < 2.0 || MQ < 45.0 || FS > 60.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "hardtovalidate" -o $1.filt.vcf

Afterwards I would remove the SNPs flagged LowQual and hardtovalidate. Does that make sense? Thanks for your help!
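
A minimal sketch of that last removal step, assuming the usual VCF layout in which FILTER is the seventh tab-separated column. Failing records carry the filter name in that column, so dropping everything that is not PASS (or unfiltered, `.`) removes both the LowQual and the hardtovalidate calls. `drop_filtered` is a hypothetical helper written in plain awk, not part of GATK.

```shell
# Keep header lines and only those records that failed no filter.
# Usage: drop_filtered < sample.filt.vcf > sample.pass.vcf
# (hypothetical helper; assumes FILTER is column 7, as in standard VCF)
drop_filtered() {
  awk -F '\t' '
    /^#/ { print; next }                 # header lines pass through unchanged
    $7 == "PASS" || $7 == "." { print }  # drop LowQual, hardtovalidate, and any other failed filter
  '
}
```

Records that failed several filters at once (for example `LowQual;hardtovalidate`) are dropped as well, because their FILTER value is neither `PASS` nor `.`.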

Geraldine Van der Auwera, PhD