How to filter jointly called variants

Hi,

I am genotyping a couple of hundred haploid yeast genomes, and did the following per sample:

map with bwa
mark duplicates
call variants using HaplotypeCaller
apply VariantFiltration using recommended filters
selecting passing variants, also requiring minimum coverage, and a minimum fraction of reads supporting variants
used these variants to recalibrate quality scores
called variants on the recalibrated data, in GVCF mode

Then, I combined all vcf files into a giant, 200 sample gvcf file, then:

perform joint calling on that file, using GenotypeGVCFs

I understand that I can't do variant recalibration, because I don't have a sufficient number of variants for training. What I don't understand is:

a) will the variants be 'better' - i.e. more accurate - from the joint calling than they were from the single sample calling (i.e. will variant metrics be different?), or was the joint calling superfluous as I am unable to do variant recalibration?
b) how do I now proceed with this giant gvcf file to apply hard filters - I see no documentation as to how to perform filtering on all samples, or even each sample from this multi-sample gvcf file.

Many thanks,
Gavin

Answers

  • Gavin_SherlockGavin_Sherlock StanfordMember

    To be more specific - I can apply a filter to the whole file thus:

    gatk VariantFiltration -R Saccharomyces_cerevisiae.R64.fasta -V called_genotypes.vcf.gz -filter 'QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -10.5 || ReadPosRankSum < -8.0' -output variants/final_pass_filtered.vcf -filter-name "hard_filter"

    but I get either PASS or hard_filter on a per line basis in the output file, not a per line per sample basis, which I don't really know how to interpret

    With the resulting file, I then don't know how to do the selection step - if I select based on attributes for a given sample, will I end up with one vcf file per sample in the end, e.g., I can do:

    gatk SelectVariants -R Saccharomyces_cerevisiae.R64.fasta -V variants/final_pass_selected.vcf -select '1.0vc.getGenotype("sample1").getDP() > 0 && (1.0vc.getGenotype("sample1").getAD().1)/(1.0vc.getGenotype("sample1").getDP()) > 0.9 && 1.0vc.getGenotype("sample1").getDP() > 6' -output {output.selected} --exclude-filtered

    will this just give me SNPs for sample1? I could then do this for each sample.

    I guess in the end, my question is, should I split the multi-sample gvcf file into multiple files, one per sample (and if so how?), and then perform the filtering on that? And if so, will it be any different that what I got earlier in my pipeline from variant calling on a per sample basis?

    Thanks,
    Gavin

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @Gavin_Sherlock

    I am investigating this question on the moment, I am trying to figure out how your workflow can be customized for yeast.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @Gavin_Sherlock

    Have you taken a look at the tutorial on hard filtering?

    This might be a good alternative to SelectVariants.

Sign In or Register to comment.