
Old version of Best Practices from GATK 2.0 [RETIRED]



  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Good, I'm here to help :)

    The simplest way to bootstrap the set of known sites is to call variants on all of your data together. You can then use the same resulting set for all the samples/lanes. Some of the variants will be present in some samples but not in others, of course, but that's okay because the effects will get evened out inside the model. The common core should be big enough relative to the number of sample-specific variants (unless your organism is completely weird and violates all our assumptions).

  • sanne, Norway (Member)

    Thanks a million Geraldine! I think I should be alright now:) Will get back to you if I run into new issues with the SNP calling and filtering but hopefully it will go relatively smoothly...

  • oussama_benhrif, maroc (Member)

    Hello,
    Do I use an R package, or the GATK tools directly from the command line? Which is the more efficient route?


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    @oussama_benhrif We provide a pipelining tool called Queue (see Downloads page) to build efficient pipelines. But you can also run GATK directly from the command line or from other pipelining frameworks, whatever you prefer.

  • robertb, toronto (Member)

    @Geraldine_VdAuwera said:
    Hi Tristan,

    ApplyRecalibration is the final step of recalibration. What you're seeing is normal -- it is a very simple tool; given a minimum VQSLOD (say, 99.5%) specified by the user, ApplyRecalibration goes through the whole file, marking variants as 'pass' or 'fail' based on whether their VQSLODs are above or below that threshold. No lines are added or taken out of your VCF.

    There are additional steps that can be done with GATK tools to examine the quality of your callset, before moving on to the actual analyses you want to perform on it. See for example the documentation of the VariantEval tool. Based on the results, you may want to go back to the recalibration step and apply a different cutoff to your callset (i.e. repeat ApplyRecalibration with a different VQSLOD value). This is not included in the Best Practices because it depends more on your experiment, dataset and results, whereas the Best Practices constitute a set of core principles that should apply to all experiments.

    Glad you like the new site!
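
    The marking behaviour described in the quoted reply can be sketched in a few lines of Python. This is only an illustration, not GATK's code: the function name `mark_by_vqslod`, the dictionary layout, and the `LowVQSLOD` filter label are all hypothetical, and treating a score exactly equal to the cutoff as a pass is an assumption. How the real tool arrives at its cutoff is not modelled here.

```python
def mark_by_vqslod(records, min_vqslod):
    """Label each record PASS or LowVQSLOD; never add or drop a record.

    `records` is a list of dicts with a numeric "VQSLOD" key (a hypothetical
    stand-in for parsed VCF lines). The input list is left unmodified.
    """
    marked = []
    for rec in records:
        # Assumption: a score equal to the cutoff counts as a pass.
        status = "PASS" if rec["VQSLOD"] >= min_vqslod else "LowVQSLOD"
        marked.append({**rec, "FILTER": status})
    return marked


if __name__ == "__main__":
    calls = [
        {"ID": "site_a", "VQSLOD": 4.2},
        {"ID": "site_b", "VQSLOD": -1.3},
    ]
    for rec in mark_by_vqslod(calls, min_vqslod=2.0):
        print(rec["ID"], rec["FILTER"])
```

    The point to notice is that the output has exactly as many records as the input; only the filter label changes, so re-running with a different cutoff is cheap.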

    Hi Geraldine,

    I have a quick question, thanks.
    I've got 140 genomes sequenced to 25-30X coverage and have followed the best practices as outlined here.
    Some people would like to know whether any hard filtering is worth pursuing after this step. I have the same question.
    If what I understand is correct, then ApplyRecalibration merely tells you whether a VARIANT is real or not. I interpret this to mean that a pass on the VQSLOD value simply means that at least one of your samples likely has the variant.
    What if the call is made for more than one sample? What I want to know is which samples have which variants. It's not enough to simply say that a variant does exist, somewhere, in my samples. Hence, there seems to be a case for hard filtering the individual sample calls as present in the genotype fields of the VCF. Does this make sense?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Yep, you want to look at genotype quality per-sample. Check out the genotype refinement workflow docs in the methods section.
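
    As a rough illustration of what looking at genotype quality per sample can mean in practice, here is a minimal Python sketch of a plain hard filter on the GQ field. It is not the genotype refinement workflow mentioned above. The function name `mask_low_gq`, the example line, and the threshold are hypothetical; it assumes a tab-separated VCF data line with GT listed first in the FORMAT column, and it treats a missing GQ as failing the filter.

```python
def mask_low_gq(vcf_line, min_gq):
    """Set a sample's genotype to no-call ("./.") when its GQ is below min_gq.

    The site itself is kept; only individual sample calls are masked.
    Assumption: a missing GQ (".") is treated as failing the filter.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")  # FORMAT column, e.g. GT:GQ
    if "GT" not in keys or "GQ" not in keys:
        return "\t".join(fields)  # nothing to filter on
    gt_i, gq_i = keys.index("GT"), keys.index("GQ")
    for col in range(9, len(fields)):  # one column per sample
        values = fields[col].split(":")
        # Trailing sample fields may be omitted, so guard the index.
        gq = values[gq_i] if gq_i < len(values) else "."
        if gq == "." or int(gq) < min_gq:
            values[gt_i] = "./."
            fields[col] = ":".join(values)
    return "\t".join(fields)


if __name__ == "__main__":
    line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t0/1:99\t0/1:12\t0/0:45"
    print(mask_low_gq(line, min_gq=20))
```

    With the example line and a threshold of 20, only the second sample's call is masked; the site stays in the file with its PASS label, which matches the distinction between "the variant exists" and "this sample carries it".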

  • JahnDavik, Bioforsk (Member)

    Hi, I am trying to get started with GATK to call SNPs in my 350 or so individuals, for which I have GBS data. I have run the following command:
    java -jar /usr/local/bin/gatk/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R /the/reference/genome/ \
    -I GT1.sorted.bam \
    -o GT1.vcf

    and it seems to work. At least I get no error message(s). However, I am totally lost on how to proceed to a multiple genotypes situation. I am looking to generate a .vcf containing information on all genotypes. Having that file in hand, I'd be happier.
