What is the purpose of multiple True Sites in VQSR

I have 3 questions:

1- What is the exact purpose of having both HapMap and Omni true sites in VQSR, versus just one?
2- If I want to restrict variant calling to my custom list of positions, which of the 4 input resources do I reduce to that list?
3- If I want to introduce my own custom true sites resource, can I replace HapMap with mine and assign its variants a prior likelihood of, say, Q20?

Thanks

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    1. We use them differently -- see the documentation on VQSR resources (truth vs training etc)
    2. Don't touch the input resources -- use the -L argument to pass in a list of intervals (which can be a VCF of sites)
    3. Yes (or you can add it to HapMap -- it's not an either/or thing)
  • dilawerkh4 (Member)
    edited August 2017

    @Geraldine_VdAuwera said:
    1. We use them differently -- see the documentation on VQSR resources (truth vs training etc)
    2. Don't touch the input resources -- use the -L argument to pass in a list of intervals (which can be a VCF of sites)
    3. Yes (or you can add it to HapMap -- it's not an either/or thing)

    Thanks Geraldine, I do make it a point to check documentation before asking, but your documentation identifies both HapMap and Omni as Truth & Training:

    " True sites training resource: HapMap
    This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q15 (96.84%).

    True sites training resource: Omni
    This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). "
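
To make the prior values concrete, here is a minimal sketch of the standard Phred conversion, which is where the "Q15 (96.84%)" figure in the quoted documentation comes from. The function name is made up for illustration; Q20 is the value proposed in the original question for a custom resource.

```python
def phred_to_probability(q: float) -> float:
    """Convert a Phred-scaled prior Q into the probability that a site is true.

    Standard Phred scaling: P(error) = 10 ** (-Q / 10).
    """
    return 1.0 - 10.0 ** (-q / 10.0)


# Q15 is the HapMap prior quoted above; Q20 is the prior proposed in the question.
for q in (15, 20):
    print(f"Q{q}: {phred_to_probability(q):.2%}")
# Q15: 96.84%
# Q20: 99.00%
```

So assigning Q20 to a custom resource would claim its sites are more trustworthy than the quoted HapMap prior does.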

    Perhaps you are using two so that you can cover more positions (just guessing)?

    2- Do you mean a text file with 1 position per row?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    Accepted Answer

    Hm, I must have misremembered; for some reason I thought we only used HapMap as truth. But yes, using both gives us more true sites to work with.

    For the intervals, several formats are accepted, including a simple text file with one interval per line. VCF is also a valid option; in that case each record's position is used as an interval.
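
For anyone who prefers the plain-text route, the sketch below shows roughly what "one interval per line" could look like when derived from a sites VCF. It assumes the `chrom:start-stop` interval style and uses only each record's POS, as described above. The function name and the example records are made up, and since -L can take the VCF directly, this step is optional.

```python
from typing import Iterable, List


def vcf_sites_to_intervals(vcf_lines: Iterable[str]) -> List[str]:
    """Turn VCF records into single-position intervals, one string per record."""
    intervals = []
    for line in vcf_lines:
        # Skip blank lines and all header lines ("##..." and "#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        intervals.append(f"{chrom}:{pos}-{pos}")
    return intervals


# Hypothetical two-record sites VCF, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t10001\t.\tA\tG\t.\t.\t.",
    "20\t10250\t.\tC\tT\t.\t.\t.",
]
print("\n".join(vcf_sites_to_intervals(example)))
```

The printed lines could be saved to a text file and passed with -L, though passing the VCF itself should give the same set of positions.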

  • Hi Geraldine,

    What are the acceptable formats for truth sites files (see above) for GATK? You mentioned that I can add my samples to HapMap or 1000G, but I noticed that the truth files from the resource bundle are .tbi indexed, and if you look inside one of those files you will not see genotype columns for multiple samples; all you will see is 2 genotype columns. It looks like the various samples have somehow been condensed into 2 columns, with the genotypes of the contributing samples linked to this condensed/merged sample, if you will, via allele frequency notation.

    So, do you think that if I replace one of those truth files with my VCF that has multiple samples, with each sample having its own 2 genotype columns, it will work? Or is there a way to incorporate my VCF samples into, say, your HapMap or 1000G truth file?

    Thanks, Dilawer

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @dilawerkh4
    Hi Dilawer,

    Ah, I see what you were referring to in your original post. I misunderstood the context there.

    In this case, Geraldine was not saying to actually add your samples to the HapMap VCF. You can simply add your VCF to the resource files you input to VQSR. For example, you would specify a new line for the new resource and specify whether it is for truth/training and the prior. You do not need to combine the new VCF with the HapMap VCF.

    I hope that helps.

    -Sheila
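
As a rough illustration of "a new line for the new resource", the sketch below builds one resource argument per file in the GATK 3 `-resource:name,known=...,training=...,truth=...,prior=... file` style. The helper name, the file names, and the custom resource's settings are all hypothetical, and the exact flag spelling depends on the GATK version, so check it against the VariantRecalibrator documentation for your release.

```python
def resource_argument(name: str, path: str, *, known: bool, training: bool,
                      truth: bool, prior: float) -> str:
    """Format one VQSR resource as a command-line fragment (GATK 3 style assumed)."""
    flags = (
        f"known={str(known).lower()},"
        f"training={str(training).lower()},"
        f"truth={str(truth).lower()},"
        f"prior={prior:.1f}"
    )
    return f"-resource:{name},{flags} {path}"


# HapMap keeps its own line; the custom call set is simply an extra line.
# File names are placeholders, and Q20 is the prior proposed in the question.
resources = [
    resource_argument("hapmap", "hapmap.vcf",
                      known=False, training=True, truth=True, prior=15.0),
    resource_argument("mysites", "my_truth_sites.vcf",
                      known=False, training=True, truth=True, prior=20.0),
]
print(" \\\n".join(resources))
```

Each resource stays in its own file with its own truth/training flags and prior, so nothing has to be merged into the HapMap VCF.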

  • @Sheila said:
    @dilawerkh4
    Hi Dilawer,

    Ah, I see what you were referring to in your original post. I misunderstood the context there.

    In this case, Geraldine was not saying to actually add your samples to the HapMap VCF. You can simply add your VCF to the resource files you input to VQSR. For example, you would specify a new line for the new resource and specify whether it is for truth/training and the prior. You do not need to combine the new VCF with the HapMap VCF.

    I hope that helps.

    -Sheila

    Thanks, Sheila, for the partial answer, but my main question, which no one seems to have an answer to (and I would appreciate it if you could ask around), is: how did Broad merge so many HapMap or 1000G sample genotypes into just 1 genotype column?

    If you look inside one of those truth/training resource files from the resource bundle, you will notice that there is only one genotype column, yet the file represents the merger of many samples.

    When we try to merge many samples using CombineVariants or vcftools vcf-merge, we get a VCF file with many genotype columns, one for each sample.
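
Nobody in this thread confirms how the bundle files were built, so the following is only a sketch of one way a merged multi-sample VCF can end up without per-sample columns. It keeps the eight fixed VCF columns (CHROM through INFO) and drops FORMAT plus every sample column, giving a "sites-only" file. The function name and the example records are made up, and the assumption that VQSR only needs the sites from a resource should be checked against the documentation.

```python
from typing import Iterable, List


def to_sites_only(vcf_lines: Iterable[str]) -> List[str]:
    """Drop FORMAT and per-sample genotype columns, keeping CHROM..INFO."""
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            out.append(line)
        else:
            # The "#CHROM" header line and data records keep the first 8 columns.
            out.append("\t".join(line.split("\t")[:8]))
    return out


# Hypothetical merged VCF with two samples, for illustration only.
merged = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "20\t10001\t.\tA\tG\t50\tPASS\tAC=3;AN=4\tGT\t0/1\t1/1",
]
print("\n".join(to_sites_only(merged)))
```

Summary annotations such as AC and AN in the INFO column survive, which may be what the "allele frequency notation" mentioned earlier refers to.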
