HaplotypeCaller, GenotypeGVCFs, and VQSR for alternatives in dbSNP

dmyersturnbulldmyersturnbull Stanford UniversityMember

From my whole-genome (human) BAM files, I want to obtain:
For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.

Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs:
CHROM POS ID REF ALT
chr1 1 . A # my data
chr1 1 . A T # dbSNP
I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T. or T/T. The call of isn't useful to me for the reason explained below.

How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.

My detailed reason for needing this is below:

For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases:
1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2.
3. We do not have sufficient evidence to make any call for position n.

I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.

Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./.. However, GenotypeGVCF removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR.

Consequently, this seems to distinguish only between these two cases:
1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).

This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".

Thanks in advance!

Douglas

Issue · Github
by Geraldine_VdAuwera

Issue Number
870
State
closed
Last Updated
Closed By
vdauwera

Best Answers

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Given sufficient depth, GenotypeGVCFs should be able to answer your second question (ref vs unknown). What's the depth at the sites that are failing?

  • dmyersturnbulldmyersturnbull Stanford UniversityMember

    Thanks!

    I made a mistake in my question: I do get GT=0/0, not just ./.

    Here's what happens:
    GVCF:

    chr1    10391   .       C       <NON_REF\>       .       .       END=10392       GT:DP:GQ:MIN_DP:PL      0/0:133:99:132:0,120,1800
    

    VCF:

    chr1    10391   .       C       .       .       .       AN=2;DP=132;NCC=0       GT:DP   0/0:132
    

    In the GVCF, I do get GQ and PL. However, GenotypeGVCFs removes those properties.

    We're interested in the whole genome, but we're especially interested in a few hundred critical variants, all of which are in dbSNP. For these, we really need to know how confident we are in the call. The two options we'd be happy with are:
    1. GQ, PL, and VQSLOD for "NONREF", or
    2. GQ, PL, and VQSLOD for each specific variant at that position in dbSNP.


    Correction to my question (my angle brackets were escaped); it should read:

    CHROM POS ID REF ALT
    chr1 1 . A <NONREF\> # my data
    chr1 1 . A T # dbSNP
    

    I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T. or T/T. The call of <NONREF> isn't useful to me for the reason explained below.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Normally GenotypeGVCFs should give you PLs -- which version are you using? And can you post your command line?

  • dmyersturnbulldmyersturnbull Stanford UniversityMember

    I'm using 3.3 currently, but I was using either 3.3 or 3.2 when I ran this:

    gatk -T GenotypeGVCFs --reference_sequence grch38.fna --variant variants.gvcf --dbsnp All.vcf.gz --standard_min_confidence_threshold_for_emitting 5 --standard_min_confidence_threshold_for_calling 20 --includeNonVariantSites --out variants.vcf
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    My bad -- I thought GGVCFs gave PLs for hom-ref sites but @Sheila reminded me it doesn't. We'll check with the devs if there is some way to emit the PLs since as far as I recall they do get calculated internally.

  • dmyersturnbulldmyersturnbull Stanford UniversityMember

    Thanks for looking into this, Geraldine. Any way to do this would be useful.

    I should mention that I found these line in the source for GenotypeGVCFs.cleanupGenotypeAnnotations:

    // also, the PLs are technically no longer usable
    builder.noPL();
    

    I don't understand why the PLs wouldn't be usable for homozygous calls. And if they're not, I still need some idea of the confidence (even if it's something other than PL and GQ).

  • dmyersturnbulldmyersturnbull Stanford UniversityMember

    @Geraldine_VdAuwera said:
    My bad -- I thought GGVCFs gave PLs for hom-ref sites but Sheila reminded me it doesn't. We'll check with the devs if there is some way to emit the PLs since as far as I recall they do get calculated internally.

    Hi Geraldine,

    Do you have any updates on this issue? It's currently bottling up our development.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @dmyersturnbull
    Hi,

    Thanks for the bump. I will put in a note to try to get this pushed up in priority. Our developers are really swamped with other priorities now, so I cannot make any promises.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @dmyersturnbull To follow up on Sheila's answer, I've initiated an internal discussion on this topic. This has actually sparked a lot of back-and-forth debate about the right thing to do; we'll let you know the outcome as soon as the dust clears.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    FYI the new annotation has been added and will be in the next nightly build under the name RGQ for "unconditional reference genotype quality".

  • dmyersturnbulldmyersturnbull Stanford UniversityMember
    edited March 2015

    Great! Yes, this sounds like what I need. Thanks @Geraldine_VdAuwera and @Sheila for leading this.

    I have two questions:

    1. Does the new annotation have the same statistical meaning as GQ? I'm wondering whether I can interpret it in the same way (and why it was given a new name).

    2. At least previously, GenotypeGVCFs would output INFO annotations that VQSR uses like MQ, MQRankSum, and ReadPosRankSum for variant calls but not for 0/0 calls. Because we want to keep only high-confidence genotype calls, I plan to run VQSR on the high-GQ variant and high-RGQ homozygous-reference calls. How can I get around the issue of missing annotations for 0/0 calls? Can I just re-add them with VariantAnnotator, and will VQSR run on those calls?

  • dmyersturnbulldmyersturnbull Stanford UniversityMember
    edited March 2015

    Okay, thanks. That explains (1) perfectly.

    Regarding (2), can I also include MQ, MQRankSum, and ReadPosRankSum in hard-filtering for monomorphic calls?
    And just so I understand, is the issue with VQSR that it uses only variant sites for training?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Sure, you can use those annotations for hard-filtering non-variants in the same way you would for variant calls. And yes, the issue is that VQSR will simply ignore any non-variant sites (at all stages in the process).

  • SteveLSteveL BarcelonaMember ✭✭

    Hi @dmyersturnbull @Geraldine_VdAuwera . This new feature is very interesting for us too. I am currently running GenotypeGVCFs on ~100 GVCFs using today's nightly build (26th March, 2015). I can see ##FORMAT=<ID=RGQ in the header of the output, but I don't see RGQ anywhere else in the file.

    My command is shown below - do I need to add something to get the RGQ? I see that my 0/0 calls do have normal "GQ". For example, calls for one individual at two different positions.

    GT:AD:DP:GQ:PL 0/0:5,0:5:0:0,0,4
    GT:AD:DP:GQ:PGT:PID:PL 0/0:3,0:3:0:.:.:0,0,47

    Or am I misunderstanding and the RGQ will reappear following VQSR?

    java -Xmx33600m -Djava.io.tmpdir=$TMPDIR -jar /apps/GATK/3.3-0-g42bfc64/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R /project/production/Indexes/samtools/hsapiens.hs37d5.fasta \ -V /project/production/RD-Connect/ForUpload/sample1.g.vcf.gz \ . . . -V /project/production/RD-Connect/ForUpload/sample100.g.vcf.gz \ -o 100Samples.Emit30.Call30.Genotyped.vcf \ -nt 4 \ --standard_min_confidence_threshold_for_calling 30 \ --standard_min_confidence_threshold_for_emitting 30

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @SteveL, I assume what you're showing are genotypes from hom-ref samples at sites where other samples in your cohort are variant, since you are not running with the -allSites argument that is necessary to have the tool emit non-variant sites. For those genotypes, the "regular" GQ can be calculated because there is an ALT allele that can be used to contrast the likelihood of REF against. If you re-run with -allSites, you'll get the non-variant sites. At those sites, we can't provide GQ for any samples since there is no ALT allele; so that's where RGQ comes in. That, we can annotate for non-variants because it only depends on the REF allele. So to see this annotation, you need to emit non-variants using -allSites.

  • SteveLSteveL BarcelonaMember ✭✭

    Thanks for the speedy clarification Geraldine - I hadn't realised it was for the -allSites setting. You are quite correct that I was reporting sites that were variant in another member of the cohort. Nevertheless, as we are doing WES, -allSites may be a viable option, and RGQ will be very useful for Doug's original scenario 2.

Sign In or Register to comment.