What does one do with the raw VCF output of joint genotyping many gVCF files?

mmats010 (Riverside CA, Member)

I have been following the Best Practices outline for calling SNPs on our samples, but I'm a little confused about what to do with the VCF file produced by the joint genotyping (GenotypeGVCFs) step.

I understand the principle of gVCF calling for the most part, but what are we supposed to do with the VCF file once the joint genotyping step is done? We are looking at an F1 mapping population of a non-model organism, so does this VCF file have the individual progeny (BAM file names) indicated within it? I think not, since I can't find any of the sample names while scrolling through it.

Can this VCF file be used to construct a pedigree file for genotype refinement? Should it somehow be fed back into HaplotypeCaller to inform likely calls during a second round of variant calling? Do you use it to go back to the individual gVCF files and extract the high-confidence variants?

There seems to be a good amount of literature on the Broad websites about what a gVCF file is and how to perform joint genotyping, but not much direction about what to do with the joint genotyped VCF file once it is produced.

Any advice or referral to other walkthroughs/guides would be very appreciated.

Michael

[Extra project information: My project involves calling SNPs across a mapping population of a non-model organism with the intent of mapping a trait. The goal is to produce robust SNP calls for each individual progeny (of which we currently have 30, with >60 in the near future) and the two parents. We only have halfway-decent sequencing coverage of ~10-20x per sample, which is why gVCF calling and joint genotyping sound attractive to us. Since we work on a non-model organism, we also lack previously produced "gold standard" SNP sets or other resources that would allow us to refine genotypes.]

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator admin)

    @mmats010
    Hi Michael,

    After getting a raw VCF, we recommend filtering to remove false positives. You can either use VQSR or hard filtering.

    Also, have a look at the Best Practices, which may help.

    -Sheila
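
Hard filtering as mentioned above can be sketched as follows. This is only an illustration of the idea, not the GATK implementation: in practice GATK's VariantFiltration tool does this job. The thresholds (QD < 2.0, FS > 60.0, MQ < 40.0) follow GATK's generic SNP hard-filtering suggestions and are assumed starting points that would need tuning for a non-model organism; the function names are hypothetical, and a fully valid VCF would also need matching ##FILTER header lines, which this sketch omits.

```python
# Assumed starting thresholds for SNPs; each test returns True when a site FAILS.
SNP_FILTERS = {
    "QD": lambda v: v < 2.0,   # low quality by depth
    "FS": lambda v: v > 60.0,  # strong strand bias
    "MQ": lambda v: v < 40.0,  # low mapping quality
}


def parse_info(info_field):
    """Turn 'QD=12.3;FS=1.2;DB' into {'QD': '12.3', 'FS': '1.2', 'DB': True}."""
    parsed = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            parsed[key] = value
        elif item:
            parsed[item] = True
    return parsed


def failed_filters(info_field):
    """Return the names of the annotations that fail their threshold."""
    info = parse_info(info_field)
    failed = []
    for key, is_bad in SNP_FILTERS.items():
        if key not in info:
            continue  # a missing annotation is not treated as a failure here
        try:
            value = float(info[key])
        except (TypeError, ValueError):
            continue
        if is_bad(value):
            failed.append(key)
    return failed


def hard_filter_lines(lines):
    """Yield VCF lines with FILTER (column 7) set to PASS or the failed names."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        failed = failed_filters(fields[7])  # column 8 is INFO
        fields[6] = ";".join(failed) if failed else "PASS"
        yield "\t".join(fields) + "\n"
```

Sites are annotated rather than dropped, so the failing records remain available for inspection when adjusting the thresholds.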

  • mmats010 (Riverside CA, Member)

    Hi Sheila,
    I understand how to hard filter (and we can't use VQSR because I work on a non-model organism), so my question is more along the lines of "how do you APPLY a jointly genotyped VCF file to your project?" That is, once you have hard-filtered the false-positive variants, what are the downstream applications of joint genotyping? The Best Practices guidelines don't really make this clear.

    Michael

  • mmats010 (Riverside CA, Member)

    Thanks for the clarification Geraldine,
    When adding the read group names during the bwa alignment, I used the same generic names for all the read groups without varying them between samples, including SM. The SM for each of my ~60 BAM files is identically "sample1", and I do in fact see that the final column of my merged gVCF file is named "sample1". So, would using Picard to change the SM of each BAM file to its actual name, such as "my1.bam", "my2.bam", "my3.bam", and so on, and then re-running the joint genotyping step, solve this problem?

    Thanks,
    Michael
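
The check described above (looking at which sample columns a VCF or gVCF actually carries) can be done without scrolling through the file. A minimal sketch, assuming a standard VCF layout in which the `#CHROM` header line has nine fixed columns followed by one column per sample; the function names and the file name in the usage note are hypothetical.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF for text reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def sample_names_from_header(lines):
    """Return the sample names listed on the #CHROM header line.

    The first nine columns (CHROM through FORMAT) are fixed by the VCF
    format; every column after them is one sample.
    """
    for line in lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached the records without finding a column header
    return []
```

For example, `with open_vcf("joint.vcf") as handle: print(sample_names_from_header(handle))` would print a single name if every input shared the same SM tag, which is the situation described in this post.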

  • Sheila (Broad Institute; Member, Broadie, Moderator admin)

    @mmats010
    Hi Michael,

    Yes, you can use Picard's AddOrReplaceReadGroups to change the sample names in your BAM files, then re-run the variant calling step. But did you use the GVCF workflow? If you already have individual GVCFs for each sample, you can simply replace the sample name in each GVCF manually and run GenotypeGVCFs on those GVCFs.

    -Sheila
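
Replacing the sample name in a single-sample GVCF by hand, as suggested above, amounts to editing the last column of the `#CHROM` header line. A minimal sketch of that edit, assuming an uncompressed single-sample GVCF read line by line; `rename_sample` and the name `progeny1` are hypothetical. Each file would need a unique name, and a compressed, indexed GVCF would need to be re-compressed and re-indexed afterwards.

```python
def rename_sample(lines, new_name):
    """Yield the lines of a single-sample GVCF with its sample column renamed."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Nine fixed columns plus exactly one sample column are expected.
            if len(columns) != 10:
                raise ValueError("expected exactly one sample column")
            columns[9] = new_name
            yield "\t".join(columns) + "\n"
        else:
            yield line  # meta lines and records are left untouched
```

Only the header changes, so the per-site records stay valid as they are. Usage would look like writing `rename_sample(infile, "progeny1")` to a new file with `outfile.writelines(...)`.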
