How is a haplotype called by HaplotypeCaller across the genome with RADseq data?

Hi,
I have a question about how HaplotypeCaller calls haplotypes across the genome from reduced-representation sequencing data. I have ddRADseq data from a diploid organism and used HaplotypeCaller to produce the raw VCF file. Some heterozygous SNP sites were phased, but other heterozygous sites in the VCF file were left unphased; I guess that is because there was not enough information available to phase them.
How does the program deal with reduced-representation sequencing data when calling haplotypes across the whole genome?
Also, should I exclude unphased heterozygous sites from my downstream analysis? If so, how can I do that?

Hope my questions make sense. Thanks!

Answers

  • jingjin0322 (USA, Member)

    Hi @Sheila

    Thank you for your reply, I really appreciate it. Now it totally makes sense to me.

    I have a follow-up question. The organism I'm working on is diploid. Since the unphased het sites mean I don't have enough information to call haplotypes across the entire genome for each sample, should I use genotypes instead of haplotypes for my downstream analysis?

    The preliminary analysis I did seems to indicate that the organism I'm working on is a hybrid, and now I have started to wonder whether it was correct to run that analysis on haplotypes with limited phasing information. Do you have any comments or thoughts on that?

    Thanks for your time and help!

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jingjin0322
    Hi,

    I am not sure what you mean by "should I use genotypes instead of haplotypes for my downstream analysis?" Can you clarify? What is your end goal?

    Thanks,
    Sheila

  • jingjin0322 (USA, Member)

    Hi @Sheila ,

    So I have ddRADseq data for my samples (diploid). From what I know so far, I don't think programs can call the two haplotypes across the whole genome for each of my samples; in other words, there will always be heterozygous sites that are not phased. I guess my question is: with the partially phased data I have, how can I make the most of it in my downstream analysis?

    My goal is to see if there is any population structure shaped by environment in my samples. I'm also trying to do some association work to find SNPs related to certain traits.

    Again, thanks for your time and help! I really appreciate it.
    Jing

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jingjin0322
    Hi Jing,

    Okay, so one of your goals is to compare the SNPs in your different populations. You will be able to do this without the phasing information. You can use SelectVariants with discordance or concordance to find SNPs that are in all or only a few of your samples.
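
To illustrate why phase is not needed for this kind of SNP comparison: population structure and association analyses usually work on the number of non-reference alleles per sample, which is the same whether the GT field is written with `/` (unphased) or `|` (phased). The following is only a minimal sketch in plain Python; the function name and the example sample columns are made up for illustration and are not part of GATK.

```python
import re

def alt_allele_count(sample_field, format_keys):
    """Count non-reference alleles in one sample column of a VCF record.

    Returns 0, 1 or 2 for a diploid call, or None if the genotype is missing.
    Phase is deliberately ignored: '/' and '|' are treated the same.
    """
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    gt = values.get("GT", ".")
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")

# Hypothetical FORMAT string and sample columns, for illustration only
fmt = "GT:AD:DP"
for column in ["0/1:10,8:18", "1|0:9,9:18", "1/1:0,20:20", "./.:0,0:0"]:
    print(column, "->", alt_allele_count(column, fmt))
```

The unphased `0/1` and the phased `1|0` both give a count of 1, so a genotype matrix built this way can go into downstream tools without any phasing information.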

    As for the phasing, I am not sure if other tools will be able to help much more, but you will have to look into outside tools that do imputation, such as Beagle.
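
Before turning to an outside tool, it may help to check how much phasing the VCF already carries. The sketch below rests on two assumptions: that a phased genotype is marked either by a `|` separator in GT (the VCF convention) or by a non-missing PGT value, which is the FORMAT annotation HaplotypeCaller uses together with PID for physical phasing. That phasing is local to a PID group, not genome-wide. The function name and the example records are hypothetical.

```python
def classify_genotype(format_keys, sample_field):
    """Classify one sample column as 'missing', 'hom', 'het_phased' or 'het_unphased'.

    Assumption: phase is indicated by '|' in GT or by a non-missing PGT value.
    """
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    gt = values.get("GT", ".")
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "hom"
    pgt = values.get("PGT", ".")
    if "|" in gt or pgt not in (".", ""):
        return "het_phased"
    return "het_unphased"

# Hypothetical FORMAT/sample pairs, for illustration only
examples = [
    ("GT:AD:DP:GQ:PGT:PID:PL", "0/1:12,9:21:99:0|1:1042_A_G:300,0,400"),
    ("GT:AD:DP:GQ:PL", "0/1:10,8:18:99:250,0,300"),
    ("GT:AD:DP:GQ:PL", "1/1:0,20:20:60:700,60,0"),
]
for fmt, column in examples:
    print(classify_genotype(fmt, column))
```

Tallying these classes per sample shows how many heterozygous sites are actually left unphased, and the same check could be used to drop them if an analysis really does need phased data.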

    -Sheila
