Attention:
The frontline support team will be unavailable to answer questions until May27th 2019. We will be back soon after. Thank you for your patience and we apologize for any inconvenience!

HaplotypeCaller on whole genome or chromosome by chromosome: different results

Hi,

I'm working on targeted resequencing data and I'm doing a multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions by doing the calling at one time on a cluster. I thus specified all the targeted regions with the -L option.

Then, as it was taking too long, I decided to cut my interval list, chromosome by chromosome and to do the calling on each chromosome. At the end, I merged the VCFs files that I had obtained for the callings on the different chromosomes.

Then, I compared this merged VCF file with the vcf file that I obtained by doing the calling on all the targeted regions at one time. I noticed 1% of variation between the two variants lists. And I can't explain this stochasticity. Any suggestion?

Thanks!

Maguelonne

Tagged:

Best Answer

Answers

  • MaguelonneMaguelonne ParisMember

    Well I thought about downsampling too but it does not make sens because if I do the calling on one chromosome twice I obtain the same results.. that's does not seem consistent with random sampling.

    In addition, if I choose a variant randomly in my VCF file and I look at the DP for each patient, the patient DP>250.. I don't understand that, as it's confirmed on the log file that the downsampling is done with dcov=250.. I hope you can help me on this point and find with me an other explanation for my first question.

    Thks,

    Maguelonne

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Maguelonne‌

    Hi Maguelonne,

    Can you post an IGV screenshot of the bamout file for any of the questionable positions?

    Thanks,
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Maguelonne‌

    Hi again,

    The downsampling random seed is dependent on the interval list, so it is normal to see no change when running twice on the same interval. However, you can see change when running on different intervals.

    Instead of looking at a random variant in your vcf, please check the variants that are missing in one of the callsets.

    Thanks,
    Sheila

  • MaguelonneMaguelonne ParisMember

    Hi,

    I give you one line corresponding to one variant which is in one vcf file and not in the other (cf onevariant.vcf). Then, I give you the pileups at this position for the four first patients. I hope this will be helpful.

    And, now, I understand why there is no change when running twice on the same interval, but why am I obtaining DP per patient > 250?!

    Thanks,

    Maguelonne

    image

    image

    image

    image

    image

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @Maguelonne In HC the downsampling is done over a region, not per-site, so you can still get sites with >250 reads depth.

    Regarding the screenshots you posted, it is generally impossible to say just by looking at one record if there is a problem or not. I advise you to collect annotation metrics (quals, PLs etc) on the subset of variants that are missing in one callset, and analyze them to determine if they are borderline calls that are potentially false positives, or if they look like good variants. If you find it is the latter, then let us know and we'll help you figure out why they are not getting called. But until you have those results, we cannot provide more guidance than that because we are too busy at the moment.

  • MaguelonneMaguelonne ParisMember

    @Geraldine_VdAuwera‌ I understand..

    To avoid any problems, I decided to do the calling without downsampling (by specifying '-dt NONE'). But still, the results at one locus depend on the length of the list of targeted regions specified with -L option (indeed each list I tried contained the locus I'm studying). Without downsampling, what could explain variations in the variant calling?

    Thanks,

    Maguelonne

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Maguelonne‌

    Hi Maguelonne,

    It is not recommended to remove downsampling in Haplotype Caller as it does not work properly. This is a known limitation, but it should not affect calls. If you find that the calls that are missing in one set are indeed of good quality, please let us know.

    -Sheila

  • MaguelonneMaguelonne ParisMember

    Ok and if I put -dcov at the max of coverage I have (~1000 reads), will I avoid the effects of downsampling?

    Maguelonne

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Maguelonne‌

    No, -dcov does not work either.

  • tregouettregouet parisMember

    Dear Maguelonne & Sheila,
    Interesting discussions about extremely high-depth sequencing . In clinical diagnostic settings, we have designed target resequencing project where the average read depth is around 500X. We wish we could use all the produced information to optimize our genotype calling and mosaicism detection. We are not keen to only use the by-default 250 DP parameter and wish to use as much as reads as possible. What are your proposal?
    Sincerely Yours
    David

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi David @tregouet,

    Unfortunately, fixing this is going to require rewriting significant portions of the downsampling machinery, so it might take a long time before we can deliver a solution. For standard use cases such depth is really unnecessary, but I understand that mosaicism detection may impose different conditions. I wonder if using a somatic variant caller like Mutect might be appropriate for mosaicism. But I'm not familiar enough with that use case to be able to give you a definite answer, sorry.

  • ymcymc Member

    Does "-dfrac 1.0" force GATK software to use all reads ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @ymc Not with HaplotypeCaller. HC has its own hardcoded downsampling settings which you cannot override in the current version.

  • MaguelonneMaguelonne ParisMember

    @Geraldine_VdAuwera, @Sheila,

    Is it still impossible to remove downsampling when using Haplotype Caller?

    I would like to call snps and indels on pooled samples and thus I need all reads to be used for the calling.

    If downsampling settings remain hardcoded in HC, should I use UnifiedGenotyper?

    Maguelonne

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Maguelonne
    Hi Maguelonne,

    No, you cannot change the downsampling setting in HaplotypeCaller. However, you may try changing -maxReadsInRegionPerSample and -minReadsPerAlignStart.

    -Sheila

  • MaguelonneMaguelonne ParisMember

    Thank you @Sheila for your answer.

    How should I use these options to get coverage of 1400 and not 500 for the calling?

    Or once again, should I use UnifiedGenotyper instead if I want all my coverage to be used for the calling?

    Maguelonne

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Maguelonne
    Hi Maguelonne,

    You cannot set exactly the number of reads to be used with HaplotypeCaller. However, if you set the -maxReadsInRegionPerSample to a very high number and play around with the -minReadsPerAlignStart by setting it to various values higher than the default of 10, you should get what you are looking for. Note, we have never tried this ourselves, but would love to hear how it works for you.

    -Sheila

  • manolismanolis Member ✭✭
    edited March 2018

    Hi! In the GATK4.0.2.1 change something about downsampling? I have to work with possible mosaic samples, on targeted regions with high coverage and with low AF variants(<5 % of chimeric variants), and I'm the same interesting toset up the coverage in HC.

    Then, relative to the original question, considering the different of the variants call, between targeted intervals and chr, what is better to use? Both ???

    Thanks

    Post edited by manolis on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @manolis
    Hi,

    Are these diploid samples? In general HaplotypeCaller expects AF to follow the ploidy assumption. In your case of very low AF, you may consider using Mutect2 instead, as it does not make any assumptions on ploidy.

    -Sheila

  • manolismanolis Member ✭✭

    Hi Sheila, are diploid, human DNA. Thanks

    Then, relative to the original question, considering the different of the variants call, between targeted intervals and chr, what is better to use? Both ???

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @manolis
    Hi,

    I think it is up to you. It is fine to run by chromosome to save time. Those differences should not be major, only at some edge cases. You may also consider using the WDLs we provide here.

    -Sheila

  • manolismanolis Member ✭✭

    Thank you very much Sheila!

    Have a nice day

  • picard_gatk_mjpicard_gatk_mj Unconfirmed

    so is it right to call by every chromosome and if right, how to do? thanks in advance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
Sign In or Register to comment.