
How is a haplotype called by HaplotypeCaller across the genome with RADseq data?

I have a question about how HaplotypeCaller calls haplotypes across the genome with reduced-representation sequencing data. I have ddRADseq data from a diploid organism, and I used HaplotypeCaller to produce the raw VCF file. Some heterozygous SNP sites were phased, but I also found unphased heterozygous sites in the VCF file. I guess that was because there was not enough information available to phase them.
How does the program deal with reduced-representation sequencing data when calling haplotypes across the whole genome?
Also, should I exclude unphased heterozygous sites from my downstream analysis? If so, how can I do that?

Hope my questions make sense. Thanks!

Best Answer


  • Hi @Sheila

    Thank you for your reply, I really appreciate it. Now it totally makes sense to me.

    I have a follow-up question. The organism I'm working on is diploid. Since I don't have enough information (because of the unphased het sites) to call haplotypes across the entire genome for each sample, should I use genotypes instead of haplotypes for my downstream analysis?

    My preliminary analysis seems to indicate that the organism I'm working on is a hybrid. Now I wonder whether it was correct to run that analysis on haplotypes with such limited phasing information. Do you have any comments or thoughts on that?

    Thanks for your time and help!

  • Sheila, Broad Institute (Member, Broadie, Moderator)


    I am not sure what you mean by "should I use genotypes instead of haplotypes for my downstream analysis?" Can you clarify? What is your end goal?


  • jingjin0322, USA (Member)

    Hi @Sheila ,

    I have ddRADseq data for my samples (diploid). From what I understand so far, I don't think any program can call the two haplotypes across the whole genome for each of my samples. In other words, there will always be heterozygous sites that are not phased. I guess my question is: with the partially phased data I have, how can I make the most of it in my downstream analysis?

    My goal is to see whether there is any population structure shaped by environment in my samples. I'm also trying to do some association work to find SNPs related to certain traits.

    Again, thanks for your time and help! I really appreciate it.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    Hi Jing,

    Okay, so one of your goals is to compare the SNPs in your different populations. You will be able to do this without the phasing information. You can use SelectVariants with discordance or concordance to find SNPs that are in all or only a few of your samples.
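
To illustrate the idea of finding SNPs that are present in all or only a few samples, here is a minimal Python sketch. It is not SelectVariants and not GATK code. It only assumes a standard multi-sample VCF data line (tab-separated, with GT among the FORMAT keys). The function name `carriers` and the sample names in the usage note are made up for this example.

```python
def carriers(vcf_line, sample_names):
    """Return the names of samples carrying at least one non-reference allele.

    Illustrative helper (not part of GATK). Assumes a tab-separated VCF data
    line whose FORMAT column (column 9) lists GT.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")

    carrying = []
    for name, sample_field in zip(sample_names, fields[9:]):
        gt = sample_field.split(":")[gt_index]
        # "/" separates unphased alleles and "|" separates phased alleles;
        # phasing does not matter for simply counting carriers.
        alleles = gt.replace("|", "/").split("/")
        # "0" is the reference allele and "." is a missing call.
        if any(allele not in ("0", ".") for allele in alleles):
            carrying.append(name)
    return carrying
```

For a record with genotypes `0/1`, `0/0`, `1|1` and `./.` for hypothetical samples `s1` to `s4`, this returns `["s1", "s3"]`. Comparing the length of that list with the number of samples tells you whether a SNP is shared by all samples or only a few, and none of it depends on phasing.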

    As for the phasing, I am not sure if other tools will be able to help much more, but you will have to look into outside tools that do imputation, such as Beagle.
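
On the earlier question of how to tell which heterozygous calls are unphased, here is a minimal Python sketch. It relies on the VCF convention that phased genotypes use `|` and unphased genotypes use `/`. It also assumes, which you should check against your own file's header, that HaplotypeCaller's physical phasing appears in a `PGT` FORMAT field when the GT itself is written with `/`. The function name and the category labels are made up for this example.

```python
def classify_genotype(format_keys, sample_field):
    """Classify one sample's call as missing, hom, het_phased or het_unphased.

    Illustrative helper (not part of GATK). `format_keys` is the VCF FORMAT
    column (e.g. "GT:AD:PGT:PID") and `sample_field` is that sample's column.
    """
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    gt = values.get("GT", ".")
    alleles = gt.replace("|", "/").split("/")

    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "hom"

    # Assumption: a "|" in GT, or in the PGT field if present, marks the
    # heterozygous call as phased.
    pgt = values.get("PGT", ".")
    if "|" in gt or "|" in pgt:
        return "het_phased"
    return "het_unphased"
```

For example, `classify_genotype("GT:AD", "0/1:5,4")` gives `"het_unphased"`, while the same call with a `PGT` value of `0|1` gives `"het_phased"`. Whether to drop the unphased sites depends on the analysis: anything that works on genotypes alone, like the SNP comparison above, can keep them.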

