How to run Variant Quality score relcalibration on non-human dataset

Hi

I am working on Soybean plant dataset for variant and genotype analysis. As soybean is not very well annotated as Human's I was wondering how can I go forward in doing the vqsr analysis. Kindly detail me as what all I can do in order to finish analysis. I did not use any annotations while using base recalibration. I was wondering if I can go ahead in the same way for vqsr.

Thanks

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Please read the Best Practices documentation and other docs regarding VQSR and non-human datasets. We are happy to answer any specific questions that you may still have after reading these documents.

  • smk_84smk_84 Member
    edited April 2013

    Can you point me to some of these documents which talk about using vqsr on non-human/non-model organism datasets where indel and other annotations are not available .

    thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    There is a section on dealing with non-human data in the VQSR method article here:
    http://www.broadinstitute.org/gatk/guide/article?id=39

    However we also discussed this topic at our recent GATK workshop; you can find the videos of the talks here:
    http://www.broadinstitute.org/gatk/guide/events?id=2038

  • smk_84smk_84 Member

    on this page http://www.broadinstitute.org/gatk/guide/article?id=39
    there is a question
    Does the VQSR work with non-human variant calls?

    SO in the first r ound I have run till unified genotyper should I take the output of unified genotyper and use it as known sites and run it again and then use VQSR or what should be the progression of steps.

    thanks
    Saad

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    You can take the variants called with the highest confidence by the UnifiedGenotyper and use those as truth set for VQSR, yes. You will probably need to experiment with the filtering settings to find what subset works best as truth set. Please understand that determining what is a good set of variants for the truth set for a given organism of study can be an analyst's research project in itself, and we cannot give you a quick solution to solve this.

  • smk_84smk_84 Member
    edited April 2013

    In the answer to the question "Does the VQSR work with non-human variant calls?" there is a suggestion that we can use different tools for variant calling and use consensus snps as input for another round of variant calling and VQSR. So I am assuming that these should be the steps :

    1. Call variants using unified genotyper (GATK) and other variant calling tools.
    2. Use the common subset as confident variants and common subset as confident indels
    3. Call the unified genotyper again giving it confident indels and variant information
    4. Follow the rest of GATK pipeline as described in the best practices.

    What other variant calling tools do you recommend. I have tried maq and SOAPsnp but I guess they are very different from GATK. One other package is SAMtools but its very similar in methodology to GATK although it might be a little quicker.

    Kindly let me know.

    Thanks

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I think you may be confusing two things here. The iterative rounds of calling are recommended for developing a set of known variants in order to to BQSR, i.e. base recalibration.

    For VQSR, it is not necessary to do several rounds of calling, because known variants (as in dbsnp) aren't used to evaluate call confidence, they are only used to populate the rsID field in the variant records. But you can do several rounds of recalibration in order to refine your truth set. The use of other callers can also be used to refine the truth set, based on the assumption that the intersection of the callsets contains high-confidence variants. Note that this assumption is somewhat flawed because it ignores the possibility that the different callers may be subject to similar error modes -- but absent any good alternatives, this is a fairly standard approach.

  • smk_84smk_84 Member
    edited April 2013

    So how exactly is the workflow going to look like if I am only using GATK and no other tools for the analysis.
    Will it look like as shown in attached image

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes, that's right. That is the basic illustration of our Best Practices.

  • smk_84smk_84 Member
    edited April 2013

    Hmm that's gonna take a lot of disk space running things multiple times. So I only have to run bqsr several times until I get the blue line converging with the black line as shown in plots and use those set of SNP's for which I get this kind of plot.

    http://www.flickr.com/photos/[email protected]/8672132569/in/photostream

    Is that right ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    That's right.

  • smk_84smk_84 Member

    If I am doing variant calling on a single sample can I still use VQSR?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    It depends how many variants you have. It might be possible if you have a lot of variants for that one sample, but there's no guarantee. When you try to run, the recalibrator will tell you if you do not have enough variants.

  • smk_84smk_84 Member

    What should I do if I don't have enough variants and want to the GATK analysis on single samples only. Also I had a question about
    the -resource tag. Since for my data I don't have hapmap/dbsnp information and I would only be using the high-confidence snp from the first run as training set and no known or truth data, what should I set my prior to be.

    from the workshop presentation

    -resource:hapmap,known=false,training=true,truth=true,prior=15.0
    hapmap_3.3.b37.sites.vcf
    -resource:omni,known=false,training=true,truth=false,prior=12.0
    1000G_omni2.5.b37.sites.vcf
    -resource:dbsnp,known=true,training=false,truth=false,prior=6.0
    dbsnp_137.b37.v

    #
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Based on what you're telling us, I don't think VQSR is going to work for you. You should look at hard-filtering instead (see the Best Practices docs, very last section).

  • smk_84smk_84 Member

    You mean

    "Recommendations for very small data sets (in terms of both number of samples or size of targeted regions)"

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes that's what I mean.

  • smk_84smk_84 Member

    Sorry to ask one more question if we do take many samples together and only have a training data and no known or truth data as resource how are we going to assign the prior values to be.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Unfortunately, if you do not have a truth data resource, you will not be able to use VQSR. You should do hard-filtering instead. That does not require any prior likelihood values.

  • smk_84smk_84 Member

    What kind of dataset is enough to qualify as a truth/known dataset and how much is enough e.g. in my case if I have dataset of 100 genomes can I use it as a truth data and how do I assign prior values to it?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    If you want to create your own truth set, data from 100 genomes may be enough, yes. You will need to produce a very high-quality, highly specific callset based on that dataset. There is no set rule to assign prior values -- you will have to experiment with different values and evaluate the results. This is a complex analytical task which takes time and expertise.

  • smk_84smk_84 Member
    edited May 2013

    Hard filtering

    Hi I am trying following hard filtering and getting the following error

    java -Xmx2g -jar /u1/tools/public/GenomeAnalysisTK_2.4/GenomeAnalysisTK.jar -R ../GATK_analysis/Soybean_ref_genome.fasta -T VariantFiltration -o filtered_SRS079352.vcf --variant SRS079352_out.vcf --filterExpression "QUAL > 267.77 && DP > 12","HaplotypeScore > 3.11795","QD < 20.48 && GQ < 15","ReadPosRankSum < -8.0"

    following is the error

    `
    INFO 16:16:55,332 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:16:55,334 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-3-g2a7af43, Compiled 2013/02/27 12:18:19
    INFO 16:16:55,335 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 16:16:55,335 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 16:16:55,339 HelpFormatter - Program Args: -R ../GATK_analysis/Soybean_ref_genome.fasta -T VariantFiltration -o filtered_SRS079352.vcf --variant SRS079352_out.vcf --filterExpression QUAL > 267.77 && DP > 12,HaplotypeScore > 3.11795,QD < 20.48 && GQ < 15,ReadPosRankSum < -8.0
    INFO 16:16:55,339 HelpFormatter - Date/Time: 2013/05/07 16:16:55
    INFO 16:16:55,339 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:16:55,340 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:16:55,349 ArgumentTypeDescriptor - Dynamically determined type of SRS079352_out.vcf to be VCF
    INFO 16:16:55,920 GenomeAnalysisEngine - Strictness is SILENT
    INFO 16:16:56,158 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 16:16:56,174 RMDTrackBuilder - Loading Tribble index from disk for file SRS079352_out.vcf
    INFO 16:16:56,472 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files
    INFO 16:16:56,484 GenomeAnalysisEngine - Done creating shard strategy
    INFO 16:16:56,484 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 16:16:56,485 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 16:16:57,330 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.IllegalArgumentException: Inconsistent number of provided filter names and expressions: names=[] exps=[QUAL > 267.77 && DP > 12,HaplotypeScore > 3.11795,QD < 20.48 && GQ < 15,ReadPosRankSum < -8.0]
    at org.broadinstitute.variant.variantcontext.VariantContextUtils.initializeMatchExps(VariantContextUtils.java:200)
    at org.broadinstitute.variant.variantcontext.VariantContextUtils.initializeMatchExps(VariantContextUtils.java:211)
    at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.initialize(VariantFiltration.java:202)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:84)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:283)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 2.4-3-g2a7af43):
    ERROR
    ERROR Please visit the wiki to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Inconsistent number of provided filter names and expressions: names=[] exps=[QUAL > 267.77 && DP > 12,HaplotypeScore > 3.11795,QD < 20.48 && GQ < 15,ReadPosRankSum < -8.0]
    ERROR ------------------------------------------------------------------------------------------

    `

    According to JEXL expressions we can use comma seperated expressions. Even if I try to use a single value say QD < 2.0 I get the same error.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The problem is because you need to provide filter names to correspond to each filter expression. You didn't provide any filter names in the command you posted.

Sign In or Register to comment.