
Question about recalibration

ralonso Posts: 14 Member
edited October 2012 in Ask the GATK team

Hello, I have a newly sequenced genome with some samples for this species. I would like to follow the Best Practices, but I don't have a dbSNP or anything similar. Could I use the variants from my own samples as a dbSNP instead? For example, take the variants that coincide in all my samples and use them as a dbSNP?

Thanks!


Best Answer

  • Geraldine_VdAuwera Posts: 5,819 admin
    edited October 2012 Answer ✓

    No, you have to give the tool truth/training sets; that is not optional. The thing is, the training/truth sets aren't supposed to come from your project. They are external sets whose accuracy is known. The article says what to use, right under the base command line box. For example, for SNPs called on a WGS dataset, you have:

    Whole genome shotgun experiments

    SNP specific recommendations

    For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. These datasets are available in the GATK resource bundle. Arguments for VariantRecalibrator command:

    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP \
    -mode SNP \

    We provide very specific recommendations; you just have to pick the ones that fit, depending on whether your data is WGS or exome and whether you are looking at SNPs or indels.


    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuwera Posts: 5,819 Administrator, GATK Developer admin

    Hi there, this is addressed in the FAQs section of the Guide.

    Geraldine Van der Auwera, PhD

  • ralonso Posts: 14 Member
    edited October 2012

    Hi! If I am not mistaken, I am trying to follow the FAQ section "What VQSR training sets / arguments should I use for my specific project?" This is my command line:

    java -Xms512m -Xmx8G -jar GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fasta -input calling.vcf -recalFile output.recal -tranchesFile output.tranches -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP -mode SNP

    The output of this command is:

    ERROR MESSAGE: Invalid command line: No training set found! Please provide sets of known polymorphic loci marked with the training=true ROD binding tag. For example, -resource:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf

    So, what should I use as a training set?

    Thanks!

  • Geraldine_VdAuwera Posts: 5,819 Administrator, GATK Developer admin

    You have the right article, but you only passed half of the command, the "common base commandline". In addition to that, you also need to pass the part indicated like this:

    [SPECIFY TRUTH AND TRAINING SETS] \
    [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
    [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

    What you specify there is indicated in the next part of the document.

    Geraldine Van der Auwera, PhD
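
To make the "base command plus three bracketed parts" structure concrete, here is a minimal Python sketch that assembles the full argument list. It only reuses the base command posted above and the human WGS SNP arguments quoted in the accepted answer; the helper names `build_command` and `has_training_set` are made up for illustration and are not part of GATK. The human resource files shown would not suit a non-human genome.

```python
import shlex

# Common base command line, copied from the command posted in this thread.
BASE = [
    "java", "-Xms512m", "-Xmx8G",
    "-jar", "GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar",
    "-T", "VariantRecalibrator",
    "-R", "reference.fasta",
    "-input", "calling.vcf",
    "-recalFile", "output.recal",
    "-tranchesFile", "output.tranches",
]

# [SPECIFY TRUTH AND TRAINING SETS]: human b37 resources from the accepted answer.
RESOURCES = [
    ("hapmap,known=false,training=true,truth=true,prior=15.0", "hapmap_3.3.b37.sites.vcf"),
    ("omni,known=false,training=true,truth=false,prior=12.0", "1000G_omni2.5.b37.sites.vcf"),
    ("dbsnp,known=true,training=false,truth=false,prior=6.0", "dbsnp_135.b37.vcf"),
]

# [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING]
ANNOTATIONS = ["QD", "HaplotypeScore", "MQRankSum", "ReadPosRankSum",
               "FS", "MQ", "InbreedingCoeff", "DP"]

# [SPECIFY WHICH CLASS OF VARIATION TO MODEL]
MODE = "SNP"


def build_command(base, resources, annotations, mode):
    """Hypothetical helper: append the three bracketed parts to the base command."""
    args = list(base)
    for tags, path in resources:
        args += ["-resource:" + tags, path]
    for name in annotations:
        args += ["-an", name]
    args += ["-mode", mode]
    return args


def has_training_set(args):
    """Hypothetical check: is there a -resource binding tagged training=true?"""
    for arg in args:
        if arg.startswith("-resource:"):
            tags = arg.split(":", 1)[1].split(",")
            if "training=true" in tags:
                return True
    return False


full = build_command(BASE, RESOURCES, ANNOTATIONS, MODE)
print(" ".join(shlex.quote(a) for a in full))
print("base alone has a training set:", has_training_set(BASE))
print("full command has a training set:", has_training_set(full))
```

The base command on its own has no `training=true` binding, which is the situation the "No training set found!" error describes.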

  • ralonso Posts: 14 Member

    Ok, but as I understand it, I don't have a training dataset, right? So I should write something for [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] and [SPECIFY WHICH CLASS OF VARIATION TO MODEL]. Is there any information in the FAQ about this point?

    Thanks!

  • ralonso Posts: 14 Member

    Ok, so I think I can't do anything, because I don't have a training/truth set coming from an external project. The problem is that I have a new genome, and it is not human.

    Thanks!

  • Geraldine_VdAuwera Posts: 5,819 Administrator, GATK Developer admin

    Oh, I didn't realize you were working with non-human genomes, sorry. It is possible to generate your own training/truth sets using very high-confidence subsets of your initial calls (similar to what you may already have done to get a set of -knowns for base recalibration), but it is a fairly complicated process. You may be better off with hard filtering for now, or try asking in "Ask the Community" what people normally do for non-human organisms.

    Geraldine Van der Auwera, PhD
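
As a rough illustration of what "very high-confidence subsets of your initial calls" could look like, the Python sketch below keeps only VCF records that have a high QUAL, carry no filter flag, and show a called non-reference genotype in every sample (the "variants that coincide in all my samples" idea from the original question). The criteria, the `min_qual=100.0` threshold, the function names, and the demo records are all assumptions made for illustration; this is not the procedure the GATK team uses, and the answer above notes that the real process is fairly complicated.

```python
def is_high_confidence(line, min_qual=100.0):
    """Hypothetical rule: high QUAL, unfiltered, non-reference call in every sample."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # no sample columns to inspect
    qual, filt, fmt = fields[5], fields[6], fields[8]
    if qual == "." or float(qual) < min_qual:
        return False
    if filt not in ("PASS", "."):
        return False
    keys = fmt.split(":")
    if "GT" not in keys:
        return False
    gt_index = keys.index("GT")
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        alleles = genotype.replace("|", "/").split("/")
        # Reject no-calls and homozygous-reference genotypes.
        if "." in alleles or all(a == "0" for a in alleles):
            return False
    return True


def select_training_sites(vcf_lines, min_qual=100.0):
    """Keep header lines plus the records that pass is_high_confidence."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#") or is_high_confidence(line, min_qual):
            kept.append(line)
    return kept


# Made-up records for demonstration only.
demo = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n",
    "chr1\t100\t.\tA\tG\t250.0\t.\tQD=20.1\tGT\t0/1\t1/1\n",  # kept
    "chr1\t200\t.\tC\tT\t30.0\t.\tQD=1.5\tGT\t0/1\t0/1\n",    # QUAL too low
    "chr1\t300\t.\tG\tA\t400.0\t.\tQD=25.0\tGT\t0/0\t1/1\n",  # not variant in s1
]

for line in select_training_sites(demo):
    print(line, end="")
```

A file produced this way could then be supplied through a `-resource` binding tagged `training=true`, in the form the error message earlier in the thread describes. Which prior to assign to such a set is not covered here.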

  • lberna Posts: 5 Member
    edited November 2012

    Hi Geraldine, I am working with yeast and I am doing the VariantRecalibrator step. As I don't have a truth data set, I want to "filter" my initial round of raw SNPs in order to keep the highest-quality SNPs, as you say. I was wondering if you have any suggestions about the filtering parameters...

    I am treating each strain as a different organism, so I have good coverage (80X) but only one lane.

    I tried with:

    java -Xmx4g -jar GenomeAnalysisTK.jar -R S288c.fasta -T VariantFiltration --variant $1.raw.vcf --filterExpression "QD<2.0 || MQ<45.0 || FS>60 || MQRankSum<-12.5 || ReadPosRankSum<-8.0" --filterName "hardtovalidate" -o $1.filt.vcf

    and then remove the LowQual and hardtovalidate SNPs afterwards. Does that make sense? Thanks for your help!
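
To see what the thresholds in that filter expression would flag, here is a small Python sketch that applies the same five cutoffs to the INFO column of a VCF record. It is not GATK code: `parse_info`, `RULES`, and `failed_rules` are hypothetical names, the INFO strings are invented, and the sketch assumes that a missing annotation simply does not trigger its rule. How VariantFiltration itself treats a missing annotation inside a combined expression is not covered in this thread, so check the tool's documentation.

```python
def parse_info(info):
    """Parse numeric key=value pairs from a VCF INFO string."""
    values = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            try:
                values[key] = float(value)
            except ValueError:
                pass  # skip non-numeric or multi-valued entries
    return values


# Thresholds copied from the --filterExpression in the post above.
RULES = [
    ("QD", lambda v: v < 2.0),
    ("MQ", lambda v: v < 45.0),
    ("FS", lambda v: v > 60.0),
    ("MQRankSum", lambda v: v < -12.5),
    ("ReadPosRankSum", lambda v: v < -8.0),
]


def failed_rules(info):
    """Return the annotations whose threshold is violated.

    Assumption: an annotation absent from INFO does not trigger its rule.
    """
    values = parse_info(info)
    return [name for name, fails in RULES
            if name in values and fails(values[name])]


# Invented INFO strings for demonstration only.
examples = [
    "QD=25.3;MQ=59.8;FS=1.2;MQRankSum=0.4;ReadPosRankSum=-0.3",
    "QD=1.4;MQ=38.0;FS=75.0",
    "QD=12.0;MQ=60.0;FS=3.1",
]

for info in examples:
    failed = failed_rules(info)
    label = "hardtovalidate" if failed else "PASS"
    print(label, failed, info)
```

Listing which individual rule fired, rather than a single combined label, makes it easier to judge whether any one threshold is removing most of the calls.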
