Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

getting high quality monomorphic sites from GenotypeGVCFs

hello,

I am doing a population genetic analysis and am consequently very interested in obtaining high confidence monomorphic sites.

I used GATK 3.1 HaplotypeCaller in the new incremental variant discovery pipeline and then ran GenotypeGVCFs using the -inv option so as to print non variant sites after combining.

A monomorphic record looks like this:

scaffold_1 202986 . G . . . . GT:AD:DP:PL ./.:81:81:0 ./.:6:12:0 ./.:0:0:0 .....etc

I had a couple of questions:

  1. It seems GT, GQ, and PL is no longer reported after combining for monomorphic sites? is there a reason why? I'm asking because I'd like to be able to distinguish high qual (first genotype) / low qual (second genotype) monomorphic sites so I can change the low qual sites to missing (third genotype).

  2. If getting high confidence monomorphic sites is my goal, would you recommend that I do my own parsing starting with the individual gvcfs rather than using the GenotypeGVCFs walker?

thanks much for your help!

best,
Young Wha Lee

Tagged:

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    Sorry for the late response. This is something I need to discuss with team members in order to get a complete answer. For now I can tell you that there were some issues with the propagation of genotype information at the CombineGVCFs step. The correct behavior (which is now in the nightlies and will be in the next release) is that genotype information should be considered unknown in the combined GVCFs, which is only meant to be an intermediate file, to be processed by GenotypeGVCFs. The output of the joint genotyping step is the one that you can analyze, which will have all the correct genotype info. I believe this is now working properly in the nightlies, so you shouldn't need to do your own parsing.

  • leeyoungwhaleeyoungwha Member

    thanks Geraldine!

    I have been skipping CombineGVCFs as I have only 189 samples - the off genotype information I had been getting was from GenotypeGVCFs output directly. One thing to note is that I have BP_RESOLUTION outputs, if that makes a difference. Will the nightly build version allow me to correctly genotype?

    We just tried out the nightly build and had a question - it seems the -inv option for GenotypeGVCFs is not supported in the nightly build. Will that continue to be an option?

    thank you always for your help!

    Young Wha

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh right -- then it might be a slightly different issue, but I believe you should get the ref GTs etc properly now with the latest nightly. I don't think BP_RESOLUTION should be a problem as opposed to GVCF.

    The disappeared -inv option is probably the result of some work we've been doing on refactoring the argument collections for variant calling. I'll look into it and make sure that functionality is restored in some form (probably the same argument but no guarantee).

  • JCGrenierJCGrenier Montreal, QCMember ✭✭

    Hi Geraldine, I have about the same question. I did try the latest nightly build but I'm still unable to recover the GQ for the calls. Is it normal and/or is it something that you hope to developp pretty soon? Note, I'm running HaplotypeCaller with the -ERF GVCF.

    Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @JCGrenier‌,

    To clarify, are you saying you see records without GQ annotation in the output of GenotypeGVCFs? If so, can you please post a few records here?

  • JCGrenierJCGrenier Montreal, QCMember ✭✭
    edited June 2014

    Here's an example for one of my monomorphic sites :

    Nightly Build 2014-06-03-g6ec26d6

    chr1 14688 . G . . . DP=105 GT:DP 0/0:52 0/0:23 0/0:30

    Version 3.1-1

    chr1 14688 . G . . . DP=105 GT:DP:PL ./.:52:0 ./.:23:0 ./.:30:0

    Coming from a gVCF where the line is formatted like that for one of my samples :

    chr1 14674 . G . . END=14698 GT:DP:GQ:MIN_DP:PL 0/0:41:99:30:0,90,1084

    Here's how I launched HaplotypeCaller for my 3 samples:

    java -Djava.io.tmpdir=/lscratch -Xmx22g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller -I sample.bam -R ref.fasta --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o sample.kit.q1.g.vcf --dbsnp dbsnp_138.b37.excluding_sites_after_129.vcf -stand_call_conf 30.0 -stand_emit_conf 0.0 -nct 12 -L 120430_HG19_ExomeV3_UTR_EZ_HX1_sorted_merged.bed --emitRefConfidence -dontUseSoftClippedBases -mmq 1

    Is the problem coming from the gVCF itself?

    Thanks a lot for your help.

    Post edited by JCGrenier on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh I see, you mean the invariant sites have no GQs. That is expected in the current version; since GQs are collapsed by band in the intermediate GVF for non-variant sites, they are not available to GenotypeGVCFs. Currently they are not recalculated in the interest of speed performance, however I believe that in future we will provide an option to have the GQs recalculated on request.

  • leeyoungwhaleeyoungwha Member

    hi Geraldine,

    just to clarify - by the above comment do you mean that the -inv option will be restored in future versions of GATK GenotypeGVCFs, and that this will report GQs for the invariant reference sites?

    thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @leeyoungwha‌,

    Yes, that's my understanding of what the plan is for GenotypeGVCFs functionalities.

  • JCGrenierJCGrenier Montreal, QCMember ✭✭

    Thanks Geraldine! In the meanwhile, do you think parsing out the values in the gvcf is a good approximation for this? Do you think a VQ value will be implemented as well, like what samtools mpileup is doing?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @JCGrenier, can you explain what you mean by parsing out the values in the gvcf? Which values are you thinking of using exactly?

    I'm not familiar with VQ so I can't comment on that off the top of my head...

  • JCGrenierJCGrenier Montreal, QCMember ✭✭

    In the gvcf files, we have GQ values for those intervals. For example :

    chr1 14700 . G . . END=14769 GT:DP:GQ:MIN_DP:PL 0/0:51:99:36:0,60,900

    So for all those ref/ref, positions, I suppose I can put GQ=99.

    And for those ones :

    chr1 14838 . A . . END=14843 GT:DP:GQ:MIN_DP:PL 0/0:16:48:7:0,21,235

    I can put GQ=48.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh, I see. Yes, it's reasonable to use the GQ from the corresponding GVCF band as a proxy, though you'll probably want to look up the respective GQ value bounds of the bands, which are defined in the header, since the actual GQ of each position can effectively be any value in that range. It's good to be aware of the potential spread of values. Does that make sense?

  • JCGrenierJCGrenier Montreal, QCMember ✭✭

    Yes, I think it's making a lot of sense. Do you know for which date the next stable release is supposed to be available?

    Thanks a lot for your support!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hard to say but I would guess the next release is still 4 to 6 weeks away at least. Some of the more cutting edge new stuff needs a lot of testing before we can be comfortable releasing.

  • mikedmiked Member

    @Geraldine_VdAuwera,

    I'm running the gVCF reference model pipeline as follows:
    HC using v3.1
    CombineGVCFs nightly build
    GenotypeGVCFs nightly build

    I'm trying to understand the difference between 0/0 and ./. for the GT tag when looking at the final VCF generated by GenotypeGVCFs:

    1       10443   .       C       T       256.69  ....... GT:AD:DP:GQ:PL 0/0:40,0:40:0:0,0,824   0/0:33,0:33:0:0,0,628   ./.:27,0:27     0/0:37,0:37:0:0,0,952   0/1:9,2:11:31:31,0,411
    

    My understanding was that ./. meant that no data was available to make a call but the example shows 27 reads are there for the reference base for one of the samples.

    Thanks for the help.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @miked‌

    Hi,

    Haplotype Caller finds active regions which are regions that have the potential for variation. It then does a reassembly of the reads in the active regions. A lot of times the reads that were originally mapped to a region are not mapped to the same region after reassembly.

    Haplotpye Caller emits the DP and AD of the original file to the VCF record, instead of the DP and AD observed after reassembly. I think this is where your issue occurs. If you use -bamout argument, you will see the reads that are mapped to the reference after reassembly.

    I hope this is helpful.

    -Sheila

  • mikedmiked Member

    Thanks for the response.

    It's very helpful to know that DP and AD are depth values coming out of the original BAM and not after reassembly. For this discussion, I'm focusing on the VCF that is coming out of GenotypeGVCFs.

    1. Does this mean every value for GT that is ./. should be interpreted as 'NoData' even if the DP and AD values are filled in as in this example:
      GT:AD:DP:GQ:PL ./.:27,0:27

    2. I'm also confused about the difference between ./. and 0/0. This thread suggests that monomorphic sites are represented as ./. and I'm wondering on why they are not represented as 0/0.

    Thanks for all the help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @leeyoungwha‌,

    FYI the -inv option has been restored in GenotypeGVCFs in version 3.2.

  • hi Geraldine,

    I've had a chance to try out the new version of GenotypeGVCFs with -allSites (new argument for -inv).

    I get output like this:

    scaffold_1 30115 . G . . . DP=89;NCC=3 GT:AD:DP 0/0:25:25 0/0:34:34 0/0:30:30

    So GQs are not recalculated in the current version for monomorphic sites, is that correct? Is there a plan to add that functionality in future versions?

    thanks!

    Young Wha

  • hi Geraldine,

    followup question! are the recalculated GQs for invariant sites in the joint file for a given individual expected to be quite different from corresponding ones in the individual gvcfs, if I have run the individual gvcfs on HaplotypeCaller with BP_RESOLUTION? I guess I am wondering if I can just take the GQs from the individuals gvcfs to use for filtering, if I have them in BP_RESOLUTION format..

    thanks as always!

    best,
    Young Wha

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @leeyoungwha,

    I'm not sure whether we recalculate the GQs for monomorphic sites but in theory the confidence won't be affected by observation of other hom-refs in the population, so you should get roughly the same result anyway.

    I suppose technically you could use the values from the BP GVCF for invariant sites, but I can't see why you would want to do that. The GVCF is really only meant to be an intermediate; we consider that the analysis is incomplete (and results are not ready for filtering) until you have run joint genotyping.

  • hi Geraldine,

    thanks for your reply! I had a follow up question:

    if GQs are not recalculated for monomorphic sites, how should one tell the difference from a high confidence site from a low confidence one? Do you recommend that we filter on the depth of reads instead? Or is there a plan to recalculate GQs in the joint genotyping step for invariant sites as well as for variant sites?

    thanks much,

    Young Wha

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    If you're using the new -ERC GVCF-based pipeline, any sites that come out monomorphic have already been determined to be high-confidence hom-ref sites. What remains to do is filter the sites that are called variant using variant recalibration (VQSR). Sites that are rejected by VQSR are the sites that cannot be confidently called as reference or variant.

  • hi Geraldine,

    to check that I am understanding correctly:

    1. I ran my original single samples with BP_RESOLUTION. When I ran GenotypeGVCFs on those single sample GVCFs, I get invariant sites where the genotypes are actually reported as 0/0, rather than ./. I am inferring from your reply and also from the reply to miked that if I had run the ERC option instead, I would have gotten ./. reported on all my hom-ref invariant sites, and that those would be good to take as is - is that correct?

    2. If 1. is correct, how should I interpret the 0/0 genotypes in my invariant sites from GenotypeGVCFs? can I also take those as is?

    3. The reason I ask the questino above is, when I look at my joint genotyping outputs on monomorphic sites (everything run with 3.2), I see odd looking genotypes:

      0/0:0:17 0/0:2:31 0/0:2:37 0/0:3:46 0/0:1:32 0/0:3:26 0/0:1:27 0/0:0:22 0/0:1:43 0/0:1:7

    where genotypes are called as hom-ref even if there are AD=0,1,2 etc. are these trustworthy?

    thanks very much for your advice - I'm sorry I keep bugging you with this!

    best,
    Young Wha

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @‌leeyoungwha

    To clarify, -ERC is the name of the parameter, which can take GVCF, BP_RESOLUTION or NONE as options. The first two produce GVCFs (banded or base-pair resolution, respectively).

    In older versions (3.1 and 3.0), GenotypeGVCFs reported invariant sites as ./., but this was changed to report them as 0/0 in version 3.2. In principle you should be able to take all 0/0 calls as is, yes.

    If you are concerned about odd-looking genotypes, please post the corresponding lines from the individual GVCFs and the full line from the final VCF. I'm happy to take a look but I need more complete information.

  • leeyoungwhaleeyoungwha Member
    edited August 2014

    hi Geraldine!
    here is one example - the genotypes are reported even if the GQs are very low in the individual files:

    scaffold_7      46      .       A       .       .       .       DP=2628;NCC=179 GT:AD:DP      0/0:7:27        0/0:14:29       0/0:0:5 
    

    individually:

    scaffold_7      46      .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:7,20:27:0:0,0,0
    scaffold_7      46      .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:14,15:29:0:0,0,45
    scaffold_7      46      .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:0,5:5:0:0,0,0
    

    thanks much, and please let me know if you need further information from me!

    Young Wha

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, now I'm wondering if I'm wrong about the expected behavior of the joint genotyper and how to interpret the output (because those certainly look like extremely low-confidence calls). Let me discuss with the devs to make sure I'm giving you the right info.

  • EvoStevoEvoStevo JGIMember
    edited August 2014

    @Geraldine_VdAuwera‌,
    I have a similar interest in the quality of the invariant sites emitted by GenotypeGVCFs v3.2-2 with -allsites. Has this been clarified or resolved? I have examples like this:

    A10 373 . T . . . DP=123;NCC=4 GT:DP 0/0:21 0/0:75 0/0:0 0/0:27

    as you can see, the third sample is being reported as homozygous reference, but does not have any reported reads . Earlier in this thread you said, "If you're using the new -ERC GVCF-based pipeline, any sites that come out monomorphic have already been determined to be high-confidence hom-ref sites." So am in misinterpreting the DP field here or am I misunderstanding the genotype call?

    Thanks for your consistently high quality clarifications.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @EvoStevo‌ What's probably happening here is that there are reads at the site, but they are filtered out by the internal QC filters that are applied before DP is reported. Have you looked at the site in IGV? You can use the -outbam argument to output the post-reassembly reads for the active region, and check if there are reads there or not. If not then we may have a problem (as we shouldn't output a genotype at a site that has no coverage at all) but otherwise it is the expected behavior (although I'm surprised there are no other sample annotations -- was this record edited?).

  • EvoStevoEvoStevo JGIMember

    @Geraldine_VdAuwera‌. Thanks for the reply. I finally circled back around on this. This record was not edited. I do have reads mapped to this location. If the reads are not of sufficient quality to pass the internal QC filter, isn't this the same as no reads? So, then how is a call being made? Thanks!

  • SheilaSheila Broad InstituteMember, Broadie admin

    @EvoStevo‌

    Hi,

    This may have been fixed today. Can you please try this again with tonight's nightly build and let us know if the issue still occurs? Please find the nightly builds here: http://www.broadinstitute.org/gatk/nightly

    Thanks,
    Sheila

  • EvoStevoEvoStevo JGIMember
    edited September 2014

    @Sheila‌ , I re-ran the nightly build 2014-09-08-ga8f8a5b of GenotypeGVCFs with --includeNonVariantSites and I get a similar result, at least as far as calls without supporting reads:

    A10 373 . T . . . AN=8;DP=123;NCC=0 GT:DP 0/0:21 0/0:75 0/0:0 0/0:27

    @Geraldine_VdAuwera‌ is correct that there are reads that are being filtered out, but if the quality of the mapping is insufficient, I would think that that sample should be listed as ./. unless I am misunderstanding something. I apologize if I am being dense.

    Post edited by EvoStevo on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    If the reads are not of sufficient quality to pass the internal QC filter, isn't this the same as no reads?

    Not quite. If there are reads, the genotyper will still try to make a call because it is set to be as greedy as possible. So it will actually use reads that are considered not good enough for variant calling. Basically, the standards for genotyping are lower than for variant calling. Typically this is something that should show up in the qualitative annotations of the genotype, which would allow you to decide whether you want to trust that genotype or not. I'm surprised they are not showing up in your file; you may need to explicitly request them in your command line or add them using VariantAnnotator.

  • EvoStevoEvoStevo JGIMember

    okay, thanks.

Sign In or Register to comment.