Calling variants at known sites with HaplotypeCaller

jphekman (Broad Institute, Member)

Hi, all. We are calling variants on large numbers of dogs (WGS) of a variety of breeds using HaplotypeCaller followed by GenotypeGVCFs.

When we call VCFs of dogs of different breeds in one group, that goes well, but when we call a smaller number of dogs of a single breed, we lose variants in the final VCF. We suspect that the problem is that breed-specific variants are being lost. In other words, if we call variants on several Golden Retrievers, the pipeline will save variants that differ between those dogs, as well as variants where the dogs differ from the reference. However, German Shepherd-specific variants will be lost, as they do not appear in the Goldens.

We would like to specify a list of variant sites of interest, based on the large number of sequenced samples we currently have available to us. We'd then use this list of sites in our pipeline so that when we call a smaller number of dogs, all variants of interest are retained in that VCF, even if those sites are invariant in the samples and match the reference. We would also retain variants new to these samples (so we would not be LIMITED to sites previously of interest).

I've been struggling with the documentation and can't quite see how to do this, although there are a variety of parameters that are ALMOST what we want. What am I missing?

Currently using GATK 3.3, about to move to 4.0 and happy to find a 4.0 solution.

Best,
Jessica


Answers

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    Hi @jphekman

    I have a few questions.

    Are you running HaplotypeCaller on each breed and then merging the samples in GenotypeGVCFs? Or are you combining all the samples in the HaplotypeCaller run?

    There are ways to instruct GenotypeGVCFs to output all variants every time, so I was wondering what you have tried so far.

    It would also be helpful to see the list of commands with the parameter settings you have chosen.

  • jphekman (Broad Institute, Member)

    Hi @AdelaideR!

    We are using HC on small sets of samples (often, but not always, single breed - but always missing alleles due simply to small sample size) and then merging them. Currently we have to run GenotypeGVCFs on all existing gVCFs, which we save for this purpose, but as the number of samples rapidly increases, this becomes more and more computationally expensive. We would really like to just be able to run a small set of samples through HC and then generate a VCF for just those samples, but with all the variants we are interested in called, so that we do not lose information. If possible, this would save us a lot of compute.

    We did see the parameter to save all sites, but worry that this would generate enormous VCFs (essentially the equivalent of FASTAs) which would again be a problem with the amount of data we're dealing with. We're hoping to find a way to save only the variants we want.

    We call HC like so (currently GATK 3.3; $CAN_FAM is our reference; $chr is the chromosome we're calling this job on):

            $Gatk10g -T HaplotypeCaller -R $CAN_FAM -ERC GVCF -minPruning 2 -pcrModel NONE -L $chr -I "${dest}/bqsr/${1}-recal.bam" -o "${dest}/gvcfs/${1}-${2}.g.vcf"
    

    We then call GenotypeGVCFs like so - currently on massive (for us) numbers of samples, which means we can only do it once a year or so and it is a HUGE deal. I would love to be able to do this on an ongoing basis on smaller numbers of samples without loss of variant sites!

            $Gatk10g -T GenotypeGVCFs -R $CAN_FAM -L $1 \
                    $(sed "s/^\(.*\)$/-V \1-${1}.g.vcf/" ${tmpdir}/gvcflist.txt) \
                    -o "${dest}/vcfs/${1}.vcf" \
                    --disable_auto_index_creation_and_locking_when_reading_rods
    
  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    @jphekman

    I am curious whether you have set up a FireCloud account for this analysis? I understand the need to cut the costs of analysis, and FireCloud lets you see the cost per compute type. Setting the machine parameters to "preemptible" can save a lot of money on the overall cost of the run.

    What are your compute limitations? Memory, CPU? This will help me make a better suggestion on this workflow.

    Thanks for any info.

  • jphekman (Broad Institute, Member)

    @AdelaideR

    Good question. We are struggling right now in our group with whether it makes sense to continue to maintain large amounts of data on the Broad servers (expensive disk space but free compute) vs. move into the cloud (cheaper disk space, but unknown compute cost).

    Is it possible for you to help us figure out the workflow less in terms of existing compute limitations and more in terms of tradeoffs? This would help us figure out whether to go forward with moving more of our work onto FireCloud vs. staying on-prem.

    (I still hold out hope, though, that we'd be able to call 1-2 new dogs with all interesting variants without having to re-generate a VCF that might have 1K dogs in it.)

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    Hi @jphekman I am going to take this conversation offline; I think we could set up a meeting to troubleshoot your workflow and make it more efficient and less costly per run. Look for my PM in the forum.

  • jphekman (Broad Institute, Member)

    Thanks @AdelaideR. I'm assuming you have not sent it yet, but if you have, I'm not seeing where to look. You can ping me on Slack as jphekman or email jphekman at broad.

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    I have sent it. Thanks.

  • shlee (Cambridge, Member, Broadie, Moderator, admin)

    Hi @jphekman,

    It was nice to meet you and your team yesterday. Thanks, @AdelaideR for organizing. Let me summarize your analysis needs as I understand them.

    The workflow under development requires a sustainable solution that enables calling on a cohort of WGS samples as the samples trickle in over time. The solution (i) balances storage and compute costs and (ii) enables calling nonvariant sites. Ultimately, the cohort will contain ~1000 WGS samples.

    The workflow we discussed follows GATK's Best Practices for short variant calling, using GATK4. Roughly, it consists of:

    1. Per-sample GVCF calling (HaplotypeCaller -ERC GVCF mode) to .g.vcf.gz files.
    2. Import the cohort GVCFs into a GenomicsDB database and joint-call on the cohort with GenotypeGVCFs (a command-line sketch of both steps follows below).
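
    To make these steps concrete, here is a minimal command-line sketch of both steps with GATK4. The reference, sample, and workspace names (canFam3.fasta, sample1.recal.bam, cohort_db_chr1, and so on) are placeholders rather than your actual paths:

    # 1. Per-sample GVCF calling
    gatk HaplotypeCaller \
        -R canFam3.fasta \
        -I sample1.recal.bam \
        -ERC GVCF \
        -O sample1.g.vcf.gz

    # 2a. Import the per-sample GVCFs into a GenomicsDB workspace
    #     (one interval, here a chromosome, per workspace)
    gatk GenomicsDBImport \
        -V sample1.g.vcf.gz \
        -V sample2.g.vcf.gz \
        --genomicsdb-workspace-path cohort_db_chr1 \
        -L chr1

    # 2b. Joint-call the cohort from the database
    gatk GenotypeGVCFs \
        -R canFam3.fasta \
        -V gendb://cohort_db_chr1 \
        -O cohort_chr1.vcf.gz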

    Here are the technical details I promised to look into on your behalf and a few more I thought of while thinking about your needs. To be clear, my answers here do not imply any optimal solution for cost-effectiveness.


    Currently, GenomicsDBImport only allows one import per database. A feature to allow incremental addition of samples to an existing database is in the works and should be available in 1-2 months. This work is being contributed by our collaborators at Intel Health and Life Sciences.

    Until this feature becomes available, you will have to import all cohort GVCFs into a database each time you want to call on an n+1 cohort. After this feature becomes available, you will be able to incrementally add samples to an existing GenomicsDB database.
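
    Until then, one way to make those repeated whole-cohort imports less painful is to drive GenomicsDBImport from a sample map instead of listing each GVCF with -V. A minimal sketch, assuming a hypothetical tab-separated cohort.sample_map file:

    # cohort.sample_map has one line per sample:
    #   sample_name<TAB>/path/to/sample.g.vcf.gz
    gatk GenomicsDBImport \
        --sample-name-map cohort.sample_map \
        --genomicsdb-workspace-path cohort_db_chr1 \
        -L chr1 \
        --batch-size 50

    The --batch-size option limits how many GVCF readers are held open at once, which helps keep memory in check as the cohort grows toward ~1000 samples.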

    I did a quick comparison of the storage footprint for some extremely small files I had on hand.

    Three .g.vcf.gz files and their corresponding indices, totaling 5.5 MiB, create a GenomicsDB database of 4.2 MiB. I think it worthwhile to check (i) what the storage differential might be for large files and (ii) what the compute cost of import is for a full cohort. The latter bears on whether it is better to store a database or recreate it each time it is needed.
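
    If you want to make the same comparison on your own data, a quick back-of-the-envelope check (the paths here are hypothetical) is:

    # total size of the compressed GVCFs plus their indices
    du -sch gvcfs/*.g.vcf.gz gvcfs/*.g.vcf.gz.tbi

    # size of the GenomicsDB workspace built from them
    du -sh cohort_db_chr1/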


    You are in luck: GATK v4.1.0.0, released a week ago, implements a much-requested option to genotype nonvariant sites, --include-non-variant-sites. I looked into its requirements, and you can use it with GenotypeGVCFs like so:

    gatk4100 GenotypeGVCFs \
    -R ref/ref.fasta \
    -V gendb://precomputed/trio \
    --include-non-variant-sites \
    -O sandbox/trioGGVCF_invs_onelocus.vcf.gz \
    -L 20:10000000
    

    This gives the nonvariant site at the desired locus.

    If you do not specify any intervals, then --include-non-variant-sites produces a record for every basepair. In the feature's current state, you will have to perform a 2-pass analysis to rescue the nonvariant sites you are interested in. I believe this 2-pass method can produce overlapping records that are incompatible with those from the 1-pass calling and cannot be merged easily, because of a GATK4 feature that calls spanning deletions. However, you will lose these spanning deletions anyway, given your hard-filtering approach of separating SNPs and indels before filtering.
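
    For reference, here is a minimal sketch of that 2-pass approach, assuming a hypothetical sites_of_interest.interval_list holding your known sites of interest (the reference and workspace names are placeholders as above):

    # Pass 1: standard joint genotyping, variant sites only
    gatk GenotypeGVCFs \
        -R canFam3.fasta \
        -V gendb://cohort_db_chr1 \
        -O cohort_chr1.vcf.gz

    # Pass 2: genotype only the known sites of interest, emitting records
    # even where these samples are nonvariant
    gatk GenotypeGVCFs \
        -R canFam3.fasta \
        -V gendb://cohort_db_chr1 \
        --include-non-variant-sites \
        -L sites_of_interest.interval_list \
        -O cohort_chr1.sites_of_interest.vcf.gz

    Reconciling the two outputs is where the overlapping-record caveat above comes into play.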

    If you would find it useful to have a feature where --include-non-variant-sites accepts a set of intervals, so that the tool does its usual calling and then additionally emits the sites within those intervals that were not already called, I can put in such a feature request on your behalf.


    Finally, I tested the following for you with GATK 4.1.0.0: it is not possible to import combined GVCFs produced by CombineGVCFs into a GenomicsDB database.

    I hope this is helpful. Again, I highly recommend reaching out to members of the MacArthur lab to learn what pipelining solutions they have settled on for their similar research needs. If you find during your pipelining that certain GATK features would greatly enhance the ease with which you can analyze your data, please do let us know. Our developers are keen to enable science.

    Best,
    Soo Hee
