The current GATK version is 3.4-46


Posts: 14Member
edited October 2012

Hello, I have a newly sequenced genome with several samples for this species. I would like to follow the Best Practices, but I don't have a dbSNP or anything similar. Could I use the variants from my own samples as a dbSNP? For example, could I take the variants that coincide in all my samples and use them as a dbSNP?

Thanks!



Hi there, this is addressed in the FAQs section of the Guide.

Geraldine Van der Auwera, PhD

• Posts: 14Member
edited October 2012

Hi! If I am not mistaken, I am following the FAQ section "What VQSR training sets / arguments should I use for my specific project?" This is my command line:

java -Xms512m -Xmx8G -jar GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R reference.fasta \
-input calling.vcf \
-recalFile output.recal \
-tranchesFile output.tranches \
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP \
-mode SNP


The output of this command is:

ERROR MESSAGE: Invalid command line: No training set found! Please provide sets of known polymorphic loci marked with the training=true ROD binding tag. For example, -resource:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf


So, what should I use as a training set?

Thanks!


You have the right article, but you only passed half of the command, the "common base commandline". In addition to that, you also need to pass the part indicated like this:

[SPECIFY TRUTH AND TRAINING SETS] \
[SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
[SPECIFY WHICH CLASS OF VARIATION TO MODEL] \


What you specify there is indicated in the next part of the document.

Geraldine Van der Auwera, PhD

• Posts: 14Member

OK, but as I understand it, I don't have a training dataset, right? So I should just write something for [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] and [SPECIFY WHICH CLASS OF VARIATION TO MODEL]. Is there any information in the FAQ about this point?

Thanks!

No, you have to give the tool truth/training sets; that is not optional. The thing is, the training/truth sets aren't supposed to come from your project. They are external sets for which we know what the accuracy is. The article says what to use, right under the base commandline box. For example, for SNPs called on a WGS dataset, you have:

Whole genome shotgun experiments

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. These datasets are available in the GATK resource bundle. Arguments for VariantRecalibrator command:

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP \
-mode SNP \


We provide very specific recommendations; you just have to pick the ones that fit, depending on whether your data is WGS or exome and whether you are looking at SNPs or indels.
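
Put together with the common base command line, the pieces quoted above would give something like the sketch below. It only assembles and prints the command (a dry run) so it can be checked before running anything. `reference.fasta`, `calling.vcf` and the output names come from the command posted earlier in this thread, the jar path is a placeholder, and the resource files are the human b37 ones, so this applies to human WGS SNP calls only.

```shell
# Dry-run sketch: build the full VariantRecalibrator call and print it.
# Assumptions: placeholder jar path; human b37 resource files in the working directory.
build_vqsr_cmd() {
  # Common base command line
  set -- java -Xmx8G -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R reference.fasta -input calling.vcf \
    -recalFile output.recal -tranchesFile output.tranches
  # [SPECIFY TRUTH AND TRAINING SETS]
  set -- "$@" -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf
  set -- "$@" -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf
  set -- "$@" -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf
  # [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING]
  for an in QD HaplotypeScore MQRankSum ReadPosRankSum FS MQ InbreedingCoeff DP; do
    set -- "$@" -an "$an"
  done
  # [SPECIFY WHICH CLASS OF VARIATION TO MODEL]
  set -- "$@" -mode SNP
  # Print the assembled command on one line instead of executing it
  echo "$*"
}

build_vqsr_cmd
```

Once the printed command looks right, it can be pasted into the terminal or the final `echo "$*"` can be swapped for `"$@"` to execute it.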


Geraldine Van der Auwera, PhD

• Posts: 14Member

OK, so I think I can't do anything, because I don't have a training/truth set coming from an external project. The problem is that I have a new genome, and it is not human.

Thanks!

Oh, I didn't realize you were working with non-human genomes, sorry. It is possible to generate your own training/truth sets using very high-confidence subsets of your initial calls (similar to what you may already have done to get a set of -knownSites for base recalibration), but it is a fairly complicated process. You may be better off with hard filtering for now, or try asking in "Ask the Community" what people normally do for non-human organisms.
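
For what it's worth, the mechanics of the "high-confidence subset" idea could look roughly like the dry-run sketch below. It only prints the two GATK commands; it does not run them. Everything specific in it is an assumption for illustration and not a recommendation: the thresholds in the `-select` expression, the resource label `bootstrap`, `prior=10.0`, the reduced annotation list, and the file names. Choosing thresholds that give a trustworthy training set is the complicated part and is not covered here.

```shell
# Dry-run sketch of bootstrapping a training/truth set from your own calls.
# All thresholds, labels, priors and file names below are illustrative assumptions.

# Step 1: keep only unfiltered SNPs that pass a strict, hypothetical confidence cut.
select_cmd() {
  echo java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R reference.fasta --variant calling.vcf \
    -selectType SNP --excludeFiltered \
    -select "'QD > 20.0 && MQ > 50.0'" \
    -o highconf.vcf
}

# Step 2: hand that subset to VariantRecalibrator as both training and truth set.
recal_cmd() {
  echo java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R reference.fasta -input calling.vcf \
    -resource:bootstrap,known=false,training=true,truth=true,prior=10.0 highconf.vcf \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP \
    -mode SNP \
    -recalFile output.recal -tranchesFile output.tranches
}

select_cmd
recal_cmd
```

Note that selecting the training sites on QD and MQ and then modeling on the same annotations is a simplification; it keeps the example short but will bias the model toward those cuts.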

Geraldine Van der Auwera, PhD

• Posts: 5Member
edited November 2012

Hi Geraldine, I am working with yeast and I am at the VariantRecalibrator step. As I don't have a truth data set, I want to filter my initial round of raw SNPs in order to keep only the highest-quality ones, as you suggest. I was wondering if you have any suggestions about the filtering parameters.

I am treating each strain as a different organism, so I have good coverage (80X) but only one lane.

I tried with:

java -Xmx4g -jar GenomeAnalysisTK.jar -R S288c.fasta -T VariantFiltration --variant $1.raw.vcf --filterExpression "QD < 2.0 || MQ < 45.0 || FS > 60.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "hardtovalidate" -o $1.filt.vcf


and then to remove the LowQual and hardtovalidate SNPs afterwards. Does that make sense? Thanks for your help!
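
For the removal step: VariantFiltration leaves failing records in the VCF and writes the filter name into the FILTER column, so dropping them is a separate pass. A minimal sketch, assuming a tab-delimited VCF with FILTER in the 7th column; the helper names `keep_pass` and `toy_vcf` and the toy records are made up for illustration. Within GATK, SelectVariants with `--excludeFiltered` should do the same job.

```shell
# Keep header lines plus records whose FILTER column (7th field) is PASS or "." (unfiltered).
keep_pass() {
  awk -F '\t' '/^#/ || $7 == "PASS" || $7 == "."'
}

# Toy input: one passing record and two records flagged by a filter.
toy_vcf() {
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chrI\t100\t.\tA\tG\t900\tPASS\tQD=25.0\n'
  printf 'chrI\t200\t.\tC\tT\t12\tLowQual\tQD=1.1\n'
  printf 'chrI\t300\t.\tG\tA\t80\thardtovalidate\tQD=1.5\n'
}

# Only the header and the record at position 100 survive.
toy_vcf | keep_pass
```

On real data this would be used as `keep_pass < sample.filt.vcf > sample.pass.vcf`, with the file names adjusted to your own.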
