To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

DP differences between haploid and diploid mode

I would appreciate some help in understanding better the differences between haploid and diploid mode when it comes to calling and joint-genotyping (HaplotypeCaller + GenotypeGVCFs) in gatk. In particular, differences in the reported read depth (DP). Here my example case.

HaplotypeCaller using haploid mode.
(sample1_chrY.bam)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 1 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
--output ~/data/haploid_calls/sample1_chrY.g.vcf.gz

Y       2650345 .       A       <NON_REF>       .       .       END=2650345     GT:DP:GQ:MIN_DP:PL      0:13:99:13:0,429

HaplotypeCaller using diploid mode.
(sample1_chrY.bam)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 2 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
--output ~/data/diploid_calls/sample1_chrY.g.vcf.gz

Y       2650345 .       A       <NON_REF>       .       .       END=2650345     GT:DP:GQ:MIN_DP:PL      0/0:13:33:13:0,33,495

The only difference between the last two calls to HaplotypeCaller is the parameter --sample-ploidy. In both cases (ploidy 1 and ploidy 2), the reference call is being supported by 13 reads (DP field). Concordant with this, looking at this position using the bam file in IGV (see image below), it is possible to confirm that there are 14 reads covering the position and only one base in one of the reads is of low quality (QV 2), hence a DP of 13 makes sense.

What it's more, the number of reads spanning this position even increases (up to 24 DP + 2 artificial haplotypes) when looking at the same sample/position but using the already locally re-aligned reads that can be output by gatk in a bam file. See below.

HaplotypeCaller using haploid mode.
(sample1_chrY.bam)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 1 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
-forceActive -disableOptimizations --bamOutput ~/data/sample1_RE-AL_HAP_chrY.bam

HaplotypeCaller using diploid mode.
(sample1_chrY.bam)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 2 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
-forceActive -disableOptimizations --bamOutput ~/data/sample1_RE-AL_DIP_chrY.bam

However, when I do multi-sample joint genotyping using GenotypeGVCFs, DP values and the number of supporting reads reported vary significantly between g.vcf files produced in haploid and those produced in diploid mode. DP values get significantly reduced, in particular for reference calls it seems. To simplify, I added only a second extra sample in the example here below.

GenotypeGVCFs
(Using g.vcf files produced with HaplotypeCaller in haploid mode)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs  --intervals Y:2650345-2650345 \
--standard_min_confidence_threshold_for_calling 10 --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta --max_alternate_alleles 3 \
--variant ~/data/haploid_calls/sample1_chrY.g.vcf.gz --variant ~/data/haploid_calls/sample2_chrY.g.vcf.gz \
--out ~/data/raw_vcfs/raw_haploid_calls.vcf.gz

Y       2650345 .       A       G       497.76  .       AC=1;AF=0.500;AN=2;DP=23;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=31.11;SOR=0.941       GT:AD:DP:GQ:PL  0:6,0:6:99:0,109        1:0,16:16:99:529,0

GenotypeGVCFs
(Using g.vcf files produced with HaplotypeCaller in diploid mode)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs  --intervals Y:2650345-2650345 \
--standard_min_confidence_threshold_for_calling 10 --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta --max_alternate_alleles 3 \
--variant ~/data/diploid_calls/sample1_chrY.g.vcf.gz --variant ~/data/diploid_calls/sample2_chrY.g.vcf.gz \
--out ~/data/raw_vcfs/raw_diploid_calls.vcf.gz

Y       2650345 .       A       G       494.42  .       AC=2;AF=0.500;AN=4;DP=30;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=30.90;SOR=0.941      GT:AD:DP:GQ:PL  0/0:13,0:13:33:0,33,495 1/1:0,16:16:48:529,48,0

As can be seen, using g.vcf files produced in haploid mode, the final DP value for sample1 gets down to 6 reads, while previously was 13. The number 13 however, is reported when g.vcf files produced in diploid mode are used.

So, why?

I would be very thankful about some help understanding this. Additional information here below

.- I'm using gatk v3.7-0-gcfedb67, Java 1.8.0_40-b26
.- In the case of sample2, there is no difference in final DP values as reported using haploid vs diploid g.vcf files. In this case is an ALT call, but it does happen in REF calls just as the example for sample1.
.- This is an example with one SNP, but the issue is widespread across the call set, at least when it comes to REF calls.
.- I've checked with the help of IGV -> all the 13 reads/base_positions that I think should be reported in haploid mode (only 6 reported) pass -mmq 20 and -mbq 20
.- there is no significant strand bias
.- data is Illumina, PCR-free, 150 bp paired-reads, reads alligned with bwa-mem, and picard for marking duplicates.

Best,
Rodrigo

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @jrfe1
    Hi Rodrigo,

    Sorry for the delay. That is odd. Can you test this in GATK4 beta latest release on a small snippet? If it still occurs, I may need you to submit a bug report.

    -Sheila

  • Hello Sheila,

    Thanks for your reply. A few more comments.

    .- I'll do a test with GATK4 as you suggest as soon as possible. Indeed I quickly tried once already but had some problems on how to specify several samples since --variant seems to work differently in gatk4 (does not accepts multiple sample files or something). But yes.

    .- Using gatk v3.8 the issue still is there

    .- My understanding of the issue so far is: A certain gVCF block in sample 1 will only be capturing one estimated DP value for en entire region that has no variants being detected, right?. So, when doing joint-genotyping, if for that region, a SNP comes out in sample number 2 lets say, then, how to get back to the precise DP value in that position for the sample number 1? (it was not exactly recorded in the gVCF block) .... An estimation occurs here. ... not very accurate when using the haploid mode as far as my case/example goes. Diploid mode however, is accurately capturing the DP variations over the gVCF blocks even though no variants are detected.

    .- What I did so far to get around the issue - and still use haploid mode - is to produce BP_RESOLUTION g.vcf files. In this mode all sites keep their accurate DP values. But indeed I still think there is an issue with the --emitRefConfidence GVCF option in haploid mode.

    .- Coming back as I get more feedback with GATK4,

    Many thanks,
    Rodrigo

  • @Sheila,

    I've tested now this with gatk-4.beta.3-SNAPSHOT

    1.- I did HaplotypeCaller. I generated again g.vcf files in both haploid and diploid mode.
    (An issue here, using the option --contamination_fraction_to_filter, causes haplotyper to fail because a NullPointerException from java. But not using this option gives no issues and the option is not relevant for the case here presented)

    2.- I did GenomicsDBImport in order to merge the g.vcf files

    3.- I did GenotypeGVCFs. For both haploid and diploid g.vcf files in separate.

    The issue of DP differences persists in the same manner. Same details as before.
    Please let me know any further feedback here,

    Thanks
    Rodrigo

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @JRodrigo_F
    Hi Rodrigo,

    I think the DP of 13 in the GVCF that becomes a DP of 6 in the final VCF is odd. If you can submit a bug report, we can take a look locally. Instructions are here.

    Thanks,
    Sheila

  • Hi @Sheila,

    I've done ! : )
    The zip file with the bug report as instructed is named "report_jrfe230817_gatk-3.8-0-ge9d806836.zip"
    It has all the commands and input files necessary to replicate the issue, also a little readme file

    Please let me know any news on it,
    Many thanks
    Rodrigo

    Issue · Github
    by Sheila

    Issue Number
    2462
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @JRodrigo_F
    Hi Rodrigo,

    Thanks. I will take a look soon.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @JRodrigo_F
    Hi Rodrigo,

    TL;DR:
    I just had a look, and it seems like this is a "quirk" of the GVCF workflow and the ploidy model. The amount of evidence necessary to make a call is different with different ploidies, and that affects the GVCF blocks (which are grouped by GQ). The blocks then contain different DP information which is then output to the final VCF.

    For the haploid run, there has to be a lot more evidence for the variant at a site (you are assuming only 1 allele is real at the site so most if not all reads should have the allele). There really is not much evidence for any variant in the region, so HaplotypeCaller calls the entire region (Y:2650045-2650645) as reference. Because the GQs are 99 for all those sites, the GVCF block consists of all the sites, and the DP reflects the average coverage over all 600 sites. It looks like in the final output VCF, the min_DP is used (6).

    In the diploid run, there has to be less evidence for the variant at a site (you are assuming 2 alleles are real and ~50% of reads should contain the alleles). Because there are small amounts of noise in the region, the GQs are different for the different sites, and the GVCF blocks are broken up and show more resolution. The site you are interested in is part of a 2 base block so the DP is more accurate.

    I hope this makes sense. If you need the absoute DP, you should use BP_RESOLUTION instead of GVCF. That will give you block information for each site.

    -Sheila

  • jrodrigofjrodrigof EstoniaMember

    Hi @Sheila,

    Many thanks for your reply and for checking this issue. Indeed your explanation matches to a good extent and my observations above about DP being simply not recorded for all positions of a gvcf block, and makes me understand more precisely the issue around. As you say then, this is a peculiarity of the gvcf workflow, with DP values being less accurate when working with an haploid model.

    Yes, indeed I managed to get around this issue by using BP_RESOLUTION. I would suggest nevertheless to include somehow this information in the future manuals of info snippets for people interested in working with haploid models. I'm sure it would be useful to read and be warned and informed about.

    Many thanks !
    Best
    Rodrigo

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited October 2017

    @jrodrigof
    Hi Rodrigo,

    I just want to clarify one point.

    As you say then, this is a peculiarity of the gvcf workflow, with DP values being less accurate when working with an haploid model.

    It is true this is a peculiarity of the GVCF workflow, but the DP values will not always be less accurate in haploid model. It will occur at sites which do not have much evidence for a variant (also can happen with any ploidy when sites do not show much evidence for a variant). The reason all the sites were grouped into one large non-variant block is because the GQs were all high for the homozygous reference genotype. If there was more evidence at any of the sites for a variant, the block would have been broken up into "higher-resolution" blocks and have better estimates of the DP.

    I hope this makes sense.

    -Sheila

    P.S. I will make a note to write a doc or include this in the documentation for GATK4 :smiley: Thanks.

Sign In or Register to comment.