Variant recalibration issue

Hi,
I am feeling a bit lost and I think I need an external opinion, so here I am. I have a cohort of 25 samples of non-human whole genome sequencing data (21 HiSeq + 4 MiSeq). I also have a dbSNP-like VCF file, which I built myself from SNP data publicly available from multiple studies on different strains of the organism I am working with. There are two issues: first, the dbSNP-like VCF file lacks indel data; second, the degree of validation of that SNP call set is unknown, so its confidence has to be assumed to be low. Knowing that, I think I should proceed roughly as follows:

1- VariantFiltration on my raw_whole_cohort.vcf (GenotypeGVCFs output) with parameters stringent enough to generate both a SNP call set (hapmap_like.vcf) and an indel call set (mills_like.vcf) that can both be treated as high-confidence in the next steps.
2- VariantRecalibrator -mode SNP -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_like.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbSNP-like.vcf
3- VariantRecalibrator -mode INDEL -resource:mills,known=true,training=true,truth=true,prior=12.0 mills_like.vcf
4- ApplyRecalibration -input raw_whole_cohort.vcf -mode SNP -o whole_cohort_snp_recalibrated_indel_raw.vcf
5- ApplyRecalibration -input whole_cohort_snp_recalibrated_indel_raw.vcf -mode INDEL -o whole_cohort_snp_recalibrated_indel_recalibrated.vcf

So I would like to ask: where am I going wrong (short version)? Can I do the above steps in practice, and is it reasonable to proceed that way (long version)?

Thanks for help.

Ahmed
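
The VariantRecalibrator calls in steps 2 and 3 are abbreviated. As a rough sketch, assuming GATK 3 syntax, a fuller SNP-mode command line could be assembled as below. The resource lines are taken from step 2; the reference name, jar path, output file prefix, annotation list, and the lowered --maxGaussians value are illustrative assumptions, not values from the post.

```python
import shlex


def variant_recalibrator_cmd(mode, input_vcf, resources, annotations,
                             reference="reference.fasta",   # hypothetical path
                             jar="GenomeAnalysisTK.jar",    # hypothetical path
                             max_gaussians=4):              # assumed, for a small cohort
    """Assemble a GATK 3 VariantRecalibrator command as a list of arguments.

    resources: list of (name, known, training, truth, prior, vcf_path) tuples.
    """
    cmd = ["java", "-jar", jar, "-T", "VariantRecalibrator",
           "-R", reference, "-input", input_vcf, "-mode", mode]
    for name, known, training, truth, prior, vcf in resources:
        flags = "known={},training={},truth={},prior={}".format(
            str(known).lower(), str(training).lower(), str(truth).lower(), prior)
        cmd += ["-resource:{},{}".format(name, flags), vcf]
    for annotation in annotations:
        cmd += ["-an", annotation]
    prefix = "cohort_{}".format(mode.lower())  # hypothetical output prefix
    cmd += ["--maxGaussians", str(max_gaussians),
            "-recalFile", prefix + ".recal",
            "-tranchesFile", prefix + ".tranches"]
    return cmd


snp_cmd = variant_recalibrator_cmd(
    "SNP", "raw_whole_cohort.vcf",
    resources=[("hapmap", False, True, True, 15.0, "hapmap_like.vcf"),
               ("dbsnp", True, False, False, 2.0, "dbSNP-like.vcf")],
    annotations=["QD", "FS", "MQRankSum", "ReadPosRankSum"])  # assumed annotations

# Print the command for inspection; nothing is executed here.
print(" ".join(shlex.quote(arg) for arg in snp_cmd))
```

The .recal and .tranches files named here are what the ApplyRecalibration calls in steps 4 and 5 would then take through -recalFile and -tranchesFile.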

Comments

  • tommycarstensen (United Kingdom, Member)

    @ahmed_chakroun Hi Ahmed. Your questions should be answered in full here:

    http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr
    

    I hope that helps. Otherwise, I know two ladies who might come to your rescue if you are still stuck.

  • ahmed_chakroun (Tunisia, Member)

    Hi,

    Thank you for your reply, Tommy. I read the document you recommended and still can't discern a clear path forward, so I will try to reformulate my question more precisely. I used VariantFiltration to hard-filter my raw variant calls according to the filtering recommendations in https://www.broadinstitute.org/gatk/guide/article?id=3225 and ended up with two files:

    1- all_genotypes_filtered_snps.vcf
    2- all_genotypes_filtered_indels.vcf

    My question is: may I use these two hard-filtered call sets as training/truth resources to run VQSR? If so, could someone give me some insight into the prior likelihood I should assign to such data? The aim here is to stick as closely as possible to the best practices.

    Thanks for help.

    Ahmed.
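
One detail worth checking before reusing the hard-filtered files as resources: VariantFiltration annotates the FILTER column rather than removing records, so the two files still contain the failing sites. A minimal sketch, assuming a plain-text tab-delimited VCF, of keeping only the PASS records is shown below; the function name and the toy records are hypothetical. In GATK 3, SelectVariants with --excludeFiltered should give a similar result.

```python
def passing_records(vcf_lines):
    """Yield header lines plus the records whose FILTER column is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) of a VCF record is FILTER.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Toy records for illustration only (not real data).
toy_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t812.5\tPASS\tQD=25.1\n",
    "chr1\t250\t.\tC\tT\t31.2\tLowQD\tQD=1.3\n",
    "chr1\t400\t.\tG\tA\t640.0\tPASS\tQD=19.8\n",
]

kept = list(passing_records(toy_vcf))
print("".join(kept), end="")
```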

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi Ahmed,

    In practice, yes, that is how you could proceed. The difficulty, as you are clearly aware, is that it is important to estimate appropriately the confidence you can have in your known variants, but this is difficult to do without external resources to draw from. The resources that we make available for human work were produced in part thanks to orthogonal technologies such as gene chips to validate the variants. Without that, it is a lot harder to estimate the reliability of a set of known variants. You can of course apply stringent filters, but the more stringent the filters, the fewer variants you will have available to build a model with, and VQSR is a data-hungry beast.

    So in short, we can't give you any useful guidance except to recommend experimenting with various settings and evaluating results of these experiments relative to each other. But be aware that this is an analysis-heavy process, and we can't guarantee a happy outcome. On the bright side, if you produce such a resource for your organism/field, you will be adored by many :)
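
One possible way to compare such experiments relative to each other is to compute a summary metric on the PASS SNPs of each recalibrated call set. The sketch below uses the transition/transversion (Ti/Tv) ratio; this choice of metric is an assumption on my part rather than something recommended in this thread, and the expected value differs between organisms, so only the relative comparison between runs is meaningful. The function name and toy records are hypothetical.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def titv_ratio(vcf_lines):
    """Return Ti/Tv over PASS biallelic SNP records, or None if no transversions."""
    ti = tv = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt, flt = fields[3].upper(), fields[4].upper(), fields[6]
        # Skip filtered sites, indels, and multi-allelic records.
        if flt != "PASS" or ref not in BASES or alt not in BASES:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None


# Toy records standing in for the output of one experiment (not real data).
run_a = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",      # transition
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\n",      # transition
    "chr1\t300\t.\tA\tC\t50\tPASS\t.\n",      # transversion
    "chr1\t400\t.\tG\tT\t50\tLowQual\t.\n",   # filtered, ignored
    "chr1\t500\t.\tAT\tA\t50\tPASS\t.\n",     # indel, ignored
]

print(titv_ratio(run_a))
```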

  • everestial007 (Greensboro, Member)

    Hi Ahmed,
    I am dealing with this step as well. Right now I am still preparing the hard-filtered SNPs and indels. I would like to know how your VQSR process went using the hard-filtered SNPs and indels. Let me know if you have any advice.

    Thanks,
