We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Genotype likelihoods from HapMap SNPs

Hello the team,
My ultimate objective is to generate genotype likelihoods for SNPs throughout samples. I first used GATK to make a complete VCF file based the SNP variants in HapMap format, and then am going to employ the CalculateGenotypePosteriors module for that specific calculation. I indeed got a VCF file without any warning via calling "-V VariantsToVCF". However, the INFO field is empty (filled by zero's). I wonder how to supplement unbiased GL (and GP) to the VCF. It should be mentioned that I have no BAM file for the SNPs. thanks



  • Thanks for your reply. Your recommendation is not workable. Actually I checked VariantAnnotator before posting this in the forum.
    As per your documentation, PL calculation is not impossible via GATK but only in the HaplotypeCaller module (https://software.broadinstitute.org/gatk/documentation/article?id=11075). However, this module requires Bam file and unfortunately it is unavailable to me.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @NicolasLIU, please share with us the exact command you used.

  • Here is the code for the VariantsToVCF module I used,

    java -Xmx4g -jar GenomeAnalysisTK.jar -R Ptric.JGI2.0.dna.fa -T VariantsToVCF -o snp681v2.vcf --variant:RawHapMap test.csv -writeFullFormat true

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭


    I see you are using VariantsToVCF. Given you say

    the INFO field is empty (filled by zero's)

    I suspect some mismatch in expectations in file format or content. Does your original data contain the information that the INFO level field annotations would expect to be derived from? Unfortunately for you, I am not altogether familiar with this tool. I suggest you search the forum using the upper right search box with VariantsToTable to see if others have encountered similar issues. If you don't find the answer, then do report back to this same thread. @Sheila will be back on the forum and can follow up with you.

  • I attempted to annotate that VCF file using the following code,
    java -jar GenomeAnalysisTK.jar -R Ptric.JGI2.0.dna.fa -T VariantAnnotator -G StandardAnnotation -V snp681v2.vcf -o snp681v2.full.vcf
    but still in vain. I also tried to calculate PL via including the parameter "--Annotation ". However, there is no annotation module directly for genotype likelihood calculation.
    Thank you all the same, anyway...

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    I'm sorry to hear that @NicolasLIU. Given you say,

    My ultimate objective is to generate genotype likelihoods for SNPs throughout samples.

    A number of the annotations require the BAM and PL is one of them. You can view a list of the individual annotations available in GATK4 to VariantAnnotator at https://software.broadinstitute.org/gatk/documentation/tooldocs/current/. Click on Variant Annotations to expand the list of annotations. Perhaps one of these would work for you. It would be helpful for us to see what a typical record in your VCF looks like. Can you post one?

    Is there no way you can reanalyze the BAM-level data?

  • NicolasLIUNicolasLIU Member
    edited March 2018

    I've used several modules to supplement the missing information of the VCF. Here is what the main body of my generated VCF file looks like. The header is lengthy and is not shown here. Obviously, in the FORMAT field, only GT is available.

    1   3169814 rs01160702  C   T   .   .   AC=242;AF=0.307;AN=788;HW=35.2;PG=0,1,7 GT  0/1 0/0
    1   3350136 rs01340704  G   T   .   .   AC=180;AF=0.221;AN=814;HW=33.7;PG=0,2,11    GT  0/1 0/0

    Here is a record of running one module,

    $ java -jar GenomeAnalysisTK.jar -R Ptric.JGI2.0.dna.fa -T VariantAnnotator -
    A AlleleBalance -A AlleleBalanceBySample -A HardyWeinberg -V snp681v2.full.vcf -o snp681v2.full1.vcf
    INFO  10:45:46,616 HelpFormatter - ------------------------------------------------------------------------------------ 
    INFO  10:45:46,619 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50 
    INFO  10:45:46,619 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
    INFO  10:45:46,619 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
    INFO  10:45:46,620 HelpFormatter - [Tue Mar 20 10:45:46 PDT 2018] Executing on Linux 4.13.0-37-generic amd64 
    INFO  10:45:46,620 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 
    INFO  10:45:46,625 HelpFormatter - Program Args: -R Ptric.JGI2.0.dna.fa -T VariantAnnotator -A AlleleBalance -A AlleleBalanceBySample -A HardyWeinberg -V snp681v2.full.vcf -o snp681v2.full1.vcf 
    INFO  10:45:46,629 HelpFormatter - Executing as [email protected] on Linux 4.13.0-37-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12. 
    INFO  10:45:46,630 HelpFormatter - Date/Time: 2018/03/20 10:45:46 
    INFO  10:45:46,630 HelpFormatter - ------------------------------------------------------------------------------------ 
    INFO  10:45:46,630 HelpFormatter - ------------------------------------------------------------------------------------ 
    INFO  10:45:46,822 NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/media/sf_H_DRIVE/Project%20-%20GWAS%20(pop)/Demography/gatk-!/com/intel/gkl/native/libgkl_compression.so 
    INFO  10:45:46,873 GenomeAnalysisEngine - Deflater: IntelDeflater 
    INFO  10:45:46,874 GenomeAnalysisEngine - Inflater: IntelInflater 
    INFO  10:45:46,875 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  10:45:48,389 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 
    INFO  10:45:50,230 GenomeAnalysisEngine - Preparing for traversal 
    INFO  10:45:50,236 GenomeAnalysisEngine - Done preparing for traversal 
    INFO  10:45:50,237 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
    INFO  10:45:50,237 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
    INFO  10:46:20,433 ProgressMeter -       4:2556001   9.4146626E7    30.0 s       0.0 s       22.7%     2.2 m     102.0 s 
    INFO  10:46:50,782 ProgressMeter -      8:12649401   1.95133407E8    60.0 s       0.0 s       46.9%     2.1 m      67.0 s 
    INFO  10:47:21,167 ProgressMeter -       15:297501   3.03640053E8    90.0 s       0.0 s       72.9%     2.1 m      33.0 s 
    INFO  10:47:51,338 ProgressMeter - scaffold_590:10201   4.05919994E8     2.0 m       0.0 s       97.3%     2.1 m       3.0 s 
    INFO  10:47:54,922 VariantAnnotator - Processed 681 loci.
    INFO  10:47:55,284 ProgressMeter -            done   4.17137944E8     2.1 m       0.0 s      100.0%     2.1 m       0.0 s 
    INFO  10:47:55,284 ProgressMeter - Total runtime 125.05 secs, 2.08 min, 0.03 hours 
    Done. There were no warn messages.
    Post edited by shlee on
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭


    The article I linked to, https://software.broadinstitute.org/gatk/documentation/article?id=10836, says:

    If given the optional BAM input, VariantAnnotator will calculate annotations based on the pileup. Otherwise, VariantAnnotator and GenotypeGVCFs calculate summary metrics based on existing VCF fields.

    There is little you can calculate given your format field has one annotation, GT.

  • Thanks for this confirmation. Your reply further corroborates my understanding of how GATK deals with input data and what data format is compulsory for specific calculations. Thank you again!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
Sign In or Register to comment.