Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

SNV variant not identified with haplotype caller

wcarrewcarre RennesMember

Hello,
I used haplotype caller on 3 bam files, producing gvcf and one groupe vcf afterwards. IGV show 2 HET and one REF.

But GATK gave me 1REF,1HET,1REF. The coverage at this position is around 100 with 50% variants. The only otions used for the calling are -L and -D. The bam have been preprocessed with GATK for duplicates and recalibration.
Freebayes give me the right genotypes !!!
I tried GATK 3.7, 4.0 and 4.1. I also tried on the raw bam and before recalibration. Still the same.
When I do a group calling without individual gvcf output, this variant completely disapear ???

I am stuck on this since I don't really know how to investigate further?

What should I do to figure out why this specific variant does not seems to be identified by haplotype caller.

Thanks

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @wcarre,

    @AdelaideR asked I look into your question. I suspect the difference you observe between HaplotypeCaller and Freebayes results is about differential variant representation arising from differences in graph assembly. It would be helpful to (1) have the exact HaplotypeCaller command you are using and (2) confirmation with IGV visualization of HaplotypeCaller BAMOUT (instructions here) that variants are not differentially represented. Alternatively or in addition, you can compare your VCF callsets using the vcfeval module of RTG-Tools, which can resolve differential representations of variants. If these steps do not shed light, then we would ask you for data to recapitulate your observations. Towards this, you can follow instructions for submitting a bug report on the left panel.

  • wcarrewcarre RennesMember

    Thanks for your answers.

    Here are the command lines I used :
    java -jar gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar HaplotypeCaller -R ucsc.hg19.fasta -I 10439G.recal.bam -O 10439G.r41.gvcf -bamout 10439G.recal.out.bam -L region.bed --min-pruning 3 -D dbsnp_138.hg19.vcf -stand-call-conf 10.0 -ERC GVCF --dont-use-soft-clipped-bases

    java -jar gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar HaplotypeCaller -R ucsc.hg19.fasta -I 10440G.recal.bam -O 10440G.r41.gvcf -bamout 10440G.recal.out.bam -L region.bed --min-pruning 3 -D dbsnp_138.hg19.vcf -stand-call-conf 10.0 -ERC GVCF --dont-use-soft-clipped-bases

    java -jar gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar HaplotypeCaller -R ucsc.hg19.fasta -I 10441G.recal.bam -O 10441G.r41.gvcf -bamout 10441G.recal.out.bam -L region.bed --min-pruning 3 -D dbsnp_138.hg19.vcf -stand-call-conf 10.0 -ERC GVCF --dont-use-soft-clipped-bases

    java -jar gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CombineGVCFs -R ucsc.hg19.fasta -V 10439G.r41.gvcf -V 10440G.r41.gvcf -V 10441G.r41.gvcf -O GATK.recal.4.1.gvcf -L region.bed -D dbsnp_138.hg19.vcf

    java -jar gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R ucsc.hg19.fasta -V GATK.recal.4.1.gvcf -O GATK.recal.4.1.vcf -L region.bed -D dbsnp_138.hg19.vcf -stand-call-conf 10.0

    and the bamout show me a variant when it's detected and nothing when it's not. On the screen shot you have original bam and bamout for each of the 3 samples. First one no variant identified, middle one variant identified and last one no variants.

    What would be the next step to have more clues of what is going on... and why this variant is not called ?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @wcarre,

    Is there a particular reason you are using the --dont-use-soft-clipped-bases option? Can you see how the results differ if you omit this? I ask this because variants maybe contained within soft-clipped bases and including these in the graph-assembly can be essential to revealing them.

    In general, towards debugging, you might want to stick to using default parameters. Also, do you have enough breadth in your -L region.bed?

    Again, it's really hard to tell what's going on without context. See point 5 of the ' Got a problem?' panel to the left.

  • wcarrewcarre RennesMember
    edited February 18

    HI Shlee,

    Thaks for your help.

    After removing the --dont-use-soft-clipped-bases, I got now ./. for the genotype of this individual.
    When I look in the gvcf, I got GT:DP:GQ:MIN_DP:PL 0/0:121:0:121:0,0,0. Which explain the final result.
    In this area, it seems that there is soft clipping going but the variant seems clear and not related to this soft clipping.

    Is there any solution to solve a situation like this ? Any options in gatk that could solve the issue ? I was thinking og removing freebayes from my pipeline and keeping only gatk, but can't afford to miss variants like this one.

  • wcarrewcarre RennesMember

    By looking on the net, I found that kmer size could help.
    By adding --kmer-size at 30;31;40;41;50;51;60;61 to the default values 10 and 25, the variant is identified.
    What is strange is that the kmersize other than these specific value don't help in the identification process.
    What is the risk to add kmer-size values to the output ?

    Thanks

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @wcarre,

    After removing the --dont-use-soft-clipped-bases, I got now ./. for the genotype of this individual.
    When I look in the gvcf, I got GT:DP:GQ:MIN_DP:PL 0/0:121:0:121:0,0,0. Which explain the final result.
    ...
    By adding --kmer-size at 30;31;40;41;50;51;60;61 to the default values 10 and 25, the variant is identified.

    I'm glad your variant site is now called despite the messiness of the region. It's hard to conclude the soft-clips are unrelated to the final call unless again we study the BAMOUT. The BAMOUT results will differ because you changed the parameters.

    Adding additional kmer sizes to cycle through increases compute expenditure. You can study what may be going on with the different kmer sizes using the --graph-output option. This generates a .dot format output that you can visualize with an external tool called Graphviz.

    Again, it's really hard to say what is going on without context. If you need further insight, please submit some test data.

  • wcarrewcarre RennesMember

    Thanks for your answer,
    If I want to provide you with sub bams (500ko) of two samples for further insight, How should I proceed ?
    Also If I add the --kmer-size 30, should I also specify the --kmer-size 10 and --kmer-size 25 to keep the default beviour ?

    Wilfrid

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hello @wcarre

    Here are the directions on how to upload your data for troubleshooting.

    If you could provide a screenshot of the --graph-output, that might provide some context for your kmer-size question. I am just wondering if your data may be not aligning properly to the genome. Is this human data? What is the coverage of your reads?

  • wcarrewcarre RennesMember

    Thanks for the information.
    I uploaded a Zip (10439G.zip) with all the bams, graphs and commande line used (cmd.txt).
    The variant that is not detected in one sample is in position chr5:179043958 on hg19 (alignment done on hg19 from UCSC). The bed file correspond to a wider region around the variant.
    The coverage at this position is 120; 90 and 90 for each of the 3 samples.

    java -jar gatk-package-4.1.0.0-local.jar HaplotypeCaller -R ucsc.hg19.fasta -I 10439G.recal.bam -O 10439G.r41.gvcf -L region.bed -stand-call-conf 10.0 -ERC GVCF -graph 10439G.r41.graph
    java -jar gatk-package-4.1.0.0-local.jar HaplotypeCaller -R ucsc.hg19.fasta -I 10440G.recal.bam -O 10440G.r41.gvcf -L region.bed -stand-call-conf 10.0 -ERC GVCF -graph 10440G.r41.graph
    java -jar gatk-package-4.1.0.0-local.jar HaplotypeCaller -R ucsc.hg19.fasta -I 10441G.recal.bam -O 10441G.r41.gvcf -L region.bed -stand-call-conf 10.0 -ERC GVCF -graph 10441G.r41.graph
    java -jar gatk-package-4.1.0.0-local.jar CombineGVCFs -R ucsc.hg19.fasta -V 10439G.r41.gvcf -V 10440G.r41.gvcf -V 10441G.r41.gvcf -O GATK.recal.4.1.gvcf -L region.bed
    java -jar gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R ucsc.hg19.fasta -V GATK.recal.4.1.gvcf -O GATK.recal.4.1.vcf -L region.bed -stand-call-conf 10.0

    Thanks for your help

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited February 26

    Hi Wilfrid (@wcarre),

    I've just processed your data and I observe what you observe for the three samples:

    • GVCF mode calling gives ./. no-call, het and hom-ref (for samples 39, 40, 41).
    • The no-call sample's PL annotation gives 0,0,0, indicating ambiguity for the site.

    Here are my thoughts.

    1. If this is a trio and/or a site that is variant in the population, i.e. in dbSNP, you can try genotype refinement to rescue the variant for sample-39. I see that this is a variant call in gnomAD at a low population frequency of 2.795e-5. The GATK Workshop materials provides a hands-on tutorial section illustrating use of CalculateGenotypePosteriors, in the Germline_Variant_Discovery worksheet. I had added the section ~ last summer.
    2. If you have other samples that are a part of your cohort, try padding with GVCFs from these to see if these boost the callability of the variant. GVCF mode calling is designed with large cohort sizes in mind.
    3. If these are the only samples in your cohort, consider joint genotyping directly with HaplotypeCaller's default mode, i.e. by providing HaplotypeCaller with all three BAMs at the same time for a final multi-sample callset. There can be differences between the two modes for low-quality variant sites and other edge cases in particular for small cohorts with low coverage. When I run this mode on the three samples for the region, I see the calls you expect (het, het, hom-ref).
    4. Finally, check out HaplotypeCaller's --genotyping-mode GENOTYPE_GIVEN_ALLELES option.
Sign In or Register to comment.