Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

Calling complex pedigrees with HaplotypeCaller

Hi,

I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees?

The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend?

And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?

Thanks,
Alva

Best Answer

Answers

  • astrandastrand New YorkMember

    Thanks Sheila. Just to confirm, I should remove the option --emitRefConfidence, correct?

    Alva

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @astrand‌ Not quite: you should run HaplotypeCaller on each sample individually with the --emitRefConfidence option, then run GenotypeGVCFs on all the resulting GVCFs together. This is a new(ish) workflow for joint analysis that gives best results while being computationally efficient. See the methods documentation for more details.

  • astrandastrand New YorkMember
    edited December 2014

    I see. So from what I understand, the workflow should be the following:

    • Run HaplotypeCaller in regular mode on each sample bam file to create single-sample VCF
    • Run HaplotypeCaller in GVCF mode on each sample bam file to create single-sample GVCFs (using --emitConfidence GVCF, --variant_index_type LINEAR, --variant_index_parameter 128000)
    • Run GenotypeGVCFs on all the bam files together to get raw VCFs
    • Run VQSR on these raw VCFs

    Please correct me if I am wrong at any of these steps.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    All good except the first step -- creating single-sample VCFs in regular mode is redundant if you go with the GVCF workflow. And to be clear, the output of GenotypeGVCFs will be a single VCF containing all the samples that you put through it together.

  • astrandastrand New YorkMember

    Great, thanks!

    Alva

  • tommycarstensentommycarstensen United KingdomMember

    @Sheila said:
    Haplotype Caller actually does not take into account the pedigree file.

    Does this apply only when running HaplotypeCaller in GVCF mode? I am not sure, whether to pass a pedigree file to all GATK walkers or not. Maybe you can clarify? In another thread it reads:

    http://gatkforums.broadinstitute.org/discussion/37/pedigree-analysis
    

    @delangel said:

    Workflow

    To call variants with the GATK using pedigree information, you should base your workflow on the Best Practices recommendations -- the principles detailed there all apply to pedigree analysis.

    But there is one crucial addition: you should make sure to pass a pedigree file (PED file) to all GATK walkers that you use in your workflow. Some will deliver better results if they see the pedigree data.

    At the moment there are two of the standard annotations affected by pedigree:

    • Allele Frequency (computed on founders only)
    • Inbreeding coefficient (computed on founders only)

    @gauthier said:
    The new Genotype Refinement Pipeline is already in the codebase and should be available via the nightly builds. It has the capability (via CalculateGenotypePosteriors) to derive posterior genotype probabilities (in the new PP format field) based on the genotype likelihoods of the other members of the trio. (Genotypes will be modified based on these posteriors if necessary.) You can pass in population allele counts from HapMap or 1000 Genomes to help inform the posteriors as well. There's also a new possibleDeNovo annotation that can be applied with VariantAnnotator after CGP to tag high- and low-confidence de novo mutations in the trio offspring if that's something you're interested in.

    There's some information in the CalculateGenotypePosteriors tool docs, but more comprehensive documentation is forthcoming pending the completion of Geraldine_VdAuwera's trip abroad.

    If I'm not very clear, then it's because I'm not quite sure of the advantages of using pedigree information with the various GATK tools.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @tommycarstensen HC doesn't use the pedigree for calling variants (regardless of mode), but the ped file gets used for calculating annotations like InbreedingCoeff if you request them. The tools that are able to use the ped file are the ones that clearly require it (like PhaseByTransmission) plus the annotator-capable tools (HC, UG, VariantAnnotator and GenotypeGVCFs) if you request an annotation that has to do with population structure.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    FYI, I updated the method article to be a more concise and hopefully clearer FAQ.

  • @astrand said:
    I see. So from what I understand, the workflow should be the following:

    • Run HaplotypeCaller in regular mode on each sample bam file to create single-sample VCF
    • Run HaplotypeCaller in GVCF mode on each sample bam file to create single-sample GVCFs (using --emitConfidence GVCF, --variant_index_type LINEAR, --variant_index_parameter 128000)
    • Run GenotypeGVCFs on all the bam files together to get raw VCFs
    • Run VQSR on these raw VCFs

    Please correct me if I am wrong at any of these steps.

    Hi,

    I am using HaplotypeCaller in GVCF mode to call variants for my pedigree data (with ped file) by providing each sample bam file separately to create single-sample GVCFs.

    Therefore, in order to annotate all my samples together (multi-sample variant calling by using GenotypeGVCFs) do I need to provide all the bam files (using -I ) once again here as an input? along with all the GVCFs (using --variant ), in order to get a raw single VCF?

  • astrandastrand New YorkMember

    @ganapathivarma‌

    I don't believe GenotypeGVCFs takes bam files as input. It also seems to me that you shouldn't have to annotate twice... Seems to me that you could just pass the pedigree file when running GenotypeGVCFs and you would be fine. I have not used ped files with these tools though, so someone more experienced should confirm/correct my assertion.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ganapathivarma‌

    @astrand‌ is correct. GenotypeGVCFs does not take bam files as input. You can either pass the pedigree file into HaplotypeCaller or into GenotypeGVCFs. This article may help as well: http://gatkforums.broadinstitute.org/discussion/37/which-tools-use-pedigree-information

    -Sheila

  • I also have a family (of plants) for which I want to define haplotypes. I used to call Haplotypes using Haplotype Caller and phasing them by PhaseByTransmission. Has using PhaseByTransmission now become redundant when Haplotype Caller and CalculateGenotypePosteriors are used? What about ReadBackedPhasing? It also does physical phasing similiar to HC. What is the difference? In which order are the tools recommended to be used?

    Thank you.
    Nadia

  • jinghe1999jinghe1999 TennesseeMember

    @Geraldine_VdAuwera said:
    All good except the first step -- creating single-sample VCFs in regular mode is redundant if you go with the GVCF workflow. And to be clear, the output of GenotypeGVCFs will be a single VCF containing all the samples that you put through it together.

    Would you recommend the 4 following steps for pedigree sample analysis:

    1 Run HaplotypeCaller in GVCF mode on each sample bam file to create single-sample GVCFs (using --emitConfidence GVCF, --variant_index_type LINEAR, --variant_index_parameter 128000)
    2. run CombineGVCFs per ~200 samples
    3. Run GenotypeGVCFs on all the gvcf files together to get raw VCFs
    4. Run VQSR on these raw VCFs
    5. genotype refinement pipeline (CalculateGenotypePosteriors,VariantFiltration,VariantAnnotator,SnpEff)

    Since I have about 1200 samples, would you recommend process these samples per chromosome in step 2 and 3 ?

    Thanks,
    Jing

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @jinghe1999
    Hi Jing,

    This looks good. The only thing to remember is that the output of GenotypeGVCFs will be 1 single vcf that contains all the samples.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Nadia

    Hi Nadia,

    Sorry for the late response.

    If your plant family is in trios or you know the parents and the offspring, you can use PhaseByTransmission to get more accurate phasing.

    If you are interested in exact genotypes, you should use CalculateGenotypePosteriors.

    If you are interested in all the tools, you can run HaplotypeCaller to get phasing. Then, you can run Calculate Genotype Posteriors to get the updated genotypes and posteriors. Then, if you want, you can run PhaseByTransmission again to re-phase after the genotypes and posteriors have been updated.

    ReadBackedPhasing is redundant, but it does come in handy when merging MNPs, because Haplotype Caller does not currently do it.

    -Sheila

  • jinghe1999jinghe1999 TennesseeMember

    @Sheila said:
    jinghe1999
    Hi Jing,

    This looks good. The only thing to remember is that the output of GenotypeGVCFs will be 1 single vcf that contains all the samples.

    -Sheila

    Thanks Sheila, is that ok if I run CombineGVCFs and GenotypeGVCFs per chromosome ?

    Jing

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @jinghe1999
    Hi Jing,
    Sure, you can run both by chromosome.
    -Sheila

  • @Sheila

    If you are interested in all the tools, you can run HaplotypeCaller to get phasing. Then, you can run Calculate Genotype Posteriors to get the updated genotypes and posteriors. Then, if you want, you can run PhaseByTransmission again to re-phase after the genotypes and posteriors have been updated.

    I still have a question on this.... I run the HapotypeCaller on each genotype individually, then I use GenotypeGVCFs to merge the variants of within a family of mother, father and 35 children. After that I run CalculateGenotypePosteriors on the merged genomic vcf. In order to run PhasebyTransmission I create 35 individual vcfs each containing the parents and one children using SelectVariants. The resulting haplotypes of the parents after running PhaseByTransmission differ between the phased vcfs depending on the childs haplotypes. Do you have a recommondation on how to deal with this?

    Thanks,
    Nadia

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Nadia
    Hi Nadia,

    I am not sure what you mean. The phasing may change depending on the new genotype posterior calculated. Can you post an example of where the phasing differs?

    Thanks,
    Sheila

  • NadiaNadia Member
    edited September 2015

    ok so this is one example. 13391516 and 13380506 are children of GfGa4742 and Villard blanc. In the second example Villardblanc should also be heterozygous at this position but it's not.

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 13391516 GfGa4742 VillardBlanc
    VIT_201s0011g06410 162 . G A 35281.26 . AC=1;AF=0.167;AN=6;BaseQRankSum=-3.210e-01;ClippingRankSum=-1.110e-01;DP=249;FS=2.708;MQ=60.00;MQRankSum=0.542;QD=19.55;ReadPosRankSum=0.784;SOR=0.902 GT:AD:DP:GQ:PGT:PID:PL:TP 0|1:18,29:47:99:0|1:93_G_T:905,0,1438:79 0|0:90,0:90:99:.:.:0,120,1800:79 1|0:112,0:112:0:.:.:0,0,376:79

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 13380506 GfGa4742 VillardBlanc
    VIT_201s0011g06410 162 . G A 35281.26 . AC=0;AF=0.00;AN=6;BaseQRankSum=-3.210e-01;ClippingRankSum=-1.110e-01;DP=414;FS=2.708;MQ=60.00;MQRankSum=0.542;QD=19.55;ReadPosRankSum=0.784;SOR=0.902 GT:AD:DP:GQ:PL:TP 0|0:212,0:212:99:0,120,1800:3 0|0:90,0:90:99:0,120,1800:3 0|0:112,0:112:0:0,0,376:3

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Nadia
    Hi Nadia,

    It looks like what is happening is that the father (Villard Blanc's) genotype at that position is not really known, so whatever makes sense given the child is what is being output. Notice the PLs for the father. 0,0,376 means we are not sure if he is 0/0 or 0/1. When the child is 0/1 and the mother is 0/0, it makes sense to assume the father is 0/1. Unfortunately, the site is ambiguous because of the PLs, so there may not be anything we can do.

    -Sheila

Sign In or Register to comment.