Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

HaplotypeCaller, vcf mode, ./.

Hi,

maybe it is an old question but I can not find it in the forum...

I was using HaplotypeCaller (linux server, v4.1.0.0) in the CNN pipeline and I saw that it calls variants with ./.

My vcf has 31127 variants but 46 are like the following...

chr1 154590149 . G C 0 . FS=0.000;MBQ=0,0;MFRL=0,0;MLEAC=0;MLEAF=NaN;MMQ=0,0;MPOS=0;SOR=0.693 GT ./.

The code is:

${GATK4} --java-options "${javaOpt3}" HaplotypeCaller \
-R ${hg38} -I ${bqsr_BAM} -O ${VCF} -L ${INTERVAL} \
-bamout ${bamout_BAM} \
--dont-trim-active-regions -stand-call-conf 0 \
-A Coverage -A ChromosomeCounts -A BaseQuality -A FragmentLength -A MappingQuality -A ReadPosition \
--tmp-dir ${tmp}/

How/Why HaplotypeCaller calls variants with ./. genotype (without genotype?)

At the end all ./. variants are excluded with the FilterVariantTranches (CNN_2D) tool with a value ranging from -10.410 to -3.456.

Many thanks

Tagged:

Answers

  • manolismanolis Member ✭✭

    HaplotypeCaller without the "-ERC GVCF" option also running "similar" to the gVCF mode, calling ./. and 0/0 genotypes? I have also 1223 variants 0/0 like the following"

    chr1 962442 . G A 0 . AC=0;AF=0.00;AN=2;BaseQRankSum=-2.060;DP=19;ExcessHet=3.0103;FS=4.135;MBQ=30,20;MFRL=151,69;MLEAC=0;MLEAF=0.00;MMQ=60,60;MPOS=34;MQ=60.00;MQRankSum=0.000;ReadPosRankSum=2.138;SOR=0.069 GT:AD:DP:GQ:PL 0/0:17,2:19:8:0,8,501

    chr1 1312208 . GC G 0 . AC=0;AF=0.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MBQ=15,0;MFRL=183,0;MLEAC=0;MLEAF=0.00;MMQ=60,0;MPOS=0;MQ=60.00;SOR=0.105 GT:AD:DP:GQ:PL 0/0:2,0:2:5:0,5,24

    Here is the full code from the vcf header:

    ##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --dont-trim-active-regions true --bam-output /home/manolis/GATK4/3.HaplotypeCaller/VCF_gatk4100_20XwgWES_CustmoLiftOver_unknown_SGR/WES_16-1242_HC.bamout.bam --standard-min-confidence-threshold-for-calling 0.0 --output /home/manolis/GATK4/3.HaplotypeCaller/VCF_gatk4100_20XwgWES_CustmoLiftOver_unknown_SGR/WES_16-1242_HC.vcf.gz --intervals /home/manolis/GATK4/DB/hg38_CustomLiftOver_SanGiovRotondo_HC_1-22_XY.intervals --input WES_16-1242_bqsr.bam --reference /shared/resources/hgRef/hg38/Homo_sapiens_assembly38.fasta --tmp-dir /home/manolis/GATK4/tmp/ --annotation Coverage --annotation ChromosomeCounts --annotation BaseQuality --annotation FragmentLength --annotation MappingQuality --annotation ReadPosition --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --indel-size-to-eliminate-in-ref-model 10 --use-alleles-trigger false --disable-optimizations false --just-determine-active-regions false --dont-genotype false --max-mnp-distance 0 --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --consensus false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 1.0 --max-unpruned-variants 100 --debug-graph-transformations false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --debug false --use-filtered-reads-for-annotations false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --capture-assembly-failure-bam false --error-correct-reads false --do-not-run-physical-phasing false --min-base-quality-score 10 --smith-waterman JAVA --correct-overlapping-quality false --emit-ref-confidence NONE --use-new-qual-calculator true --use-old-qual-calculator false --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotyping-mode DISCOVERY --genotype-filtered-alleles false --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false",Version=4.1.0.0,Date="February 16, 2019 10:40:56 AM CET">

    All 0/0 variants have a negative CNN_2D value and some of them are marked as "CNN_2D_INDEL_Tranche_99.50_100.00" (23 variants) and as "CNN_2D_SNP_Tranche_99.90_100.00" (311 variants).

    As always many thanks for any future explanation!

  • manolismanolis Member ✭✭

    Sorry, just one more question about this output.

    @samwell , the ./. and 0/0 can affect the CNNScoreVariants tool? Is better first to exclude this kind of variants/lines from my vcf with the SelectVariants tool (--exclude-non-variants) and then to proceed with the CNNScoreVariants tool?

    All the best

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi there @manolis -

    HaplotypeCaller designates variants with "./." to indicate that there is a lack of evidence to designate genotype at that location for diploid and haploid samples respectively.

    And then, if you look at how GATK4 operates - more information can be found here - it provides some insight into how this might happen.

    4. Assign sample genotypes
    For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles given the read data to calculate the likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.
    

    Basically, a variant with coverage below a certain threshold cannot be definitively determined. It is a 'lack of evidence' and not an incorrect call, so I imagine that he CNNScoreVariants tool ignores the ./.

    A 0/0 call is not a lack of evidence, [it is a genotype:] (https://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it)

    GT : The genotype of this sample at this site.
    For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:
    
    0/0 - the sample is homozygous reference
    0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    1/1 - the sample is homozygous alternate
    In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively. 
    For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
    

    More information about the CNNSelectVariants tool can be found in this blog post if you want to read how reference sequences are annotated into tensors. It is a good read.

  • manolismanolis Member ✭✭

    Hi @AdelaideR , I was just totally convinced that HaplotypeCaller excluded ./. and 0/0 and only in the gVCF mode I could have them. Everything clear and as always thanks a lot!

Sign In or Register to comment.