Calling complex pedigrees with HaplotypeCaller

Hi,
I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees?
The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend?
And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?
Thanks,
Alva
Best Answer
-
Sheila Broad Institute admin
Hi Alva,
Haplotype Caller actually does not take into account the pedigree file. The best thing to do is run HaplotypeCaller on all 4 families together. After Haplotype Caller, you can use the pedigree file in PhaseByTransmission. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_phasing_PhaseByTransmission.php
-Sheila
Answers
@astrand
Hi Alva,
Haplotype Caller actually does not take into account the pedigree file. The best thing to do is run HaplotypeCaller on all 4 families together. After Haplotype Caller, you can use the pedigree file in PhaseByTransmission. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_phasing_PhaseByTransmission.php
-Sheila
Thanks Sheila. Just to confirm, I should remove the option --emitRefConfidence, correct?
Alva
@astrand Not quite: you should run HaplotypeCaller on each sample individually with the --emitRefConfidence option, then run GenotypeGVCFs on all the resulting GVCFs together. This is a new(ish) workflow for joint analysis that gives best results while being computationally efficient. See the methods documentation for more details.
I see. So from what I understand, the workflow should be the following:
Please correct me if I am wrong at any of these steps.
All good except the first step -- creating single-sample VCFs in regular mode is redundant if you go with the GVCF workflow. And to be clear, the output of GenotypeGVCFs will be a single VCF containing all the samples that you put through it together.
Great, thanks!
Alva
Does this apply only when running HaplotypeCaller in GVCF mode? I am not sure, whether to pass a pedigree file to all GATK walkers or not. Maybe you can clarify? In another thread it reads:
If I'm not very clear, then it's because I'm not quite sure of the advantages of using pedigree information with the various GATK tools.
@tommycarstensen HC doesn't use the pedigree for calling variants (regardless of mode), but the ped file gets used for calculating annotations like InbreedingCoeff if you request them. The tools that are able to use the ped file are the ones that clearly require it (like PhaseByTransmission) plus the annotator-capable tools (HC, UG, VariantAnnotator and GenotypeGVCFs) if you request an annotation that has to do with population structure.
FYI, I updated the method article to be a more concise and hopefully clearer FAQ.
Hi,
I am using HaplotypeCaller in GVCF mode to call variants for my pedigree data (with ped file) by providing each sample bam file separately to create single-sample GVCFs.
Therefore, in order to annotate all my samples together (multi-sample variant calling by using GenotypeGVCFs) do I need to provide all the bam files (using -I ) once again here as an input? along with all the GVCFs (using --variant ), in order to get a raw single VCF?
@ganapathivarma
I don't believe GenotypeGVCFs takes bam files as input. It also seems to me that you shouldn't have to annotate twice... Seems to me that you could just pass the pedigree file when running GenotypeGVCFs and you would be fine. I have not used ped files with these tools though, so someone more experienced should confirm/correct my assertion.
@ganapathivarma
@astrand is correct. GenotypeGVCFs does not take bam files as input. You can either pass the pedigree file into HaplotypeCaller or into GenotypeGVCFs. This article may help as well: http://gatkforums.broadinstitute.org/discussion/37/which-tools-use-pedigree-information
-Sheila
I also have a family (of plants) for which I want to define haplotypes. I used to call Haplotypes using Haplotype Caller and phasing them by PhaseByTransmission. Has using PhaseByTransmission now become redundant when Haplotype Caller and CalculateGenotypePosteriors are used? What about ReadBackedPhasing? It also does physical phasing similiar to HC. What is the difference? In which order are the tools recommended to be used?
Thank you.
Nadia
Would you recommend the 4 following steps for pedigree sample analysis:
1 Run HaplotypeCaller in GVCF mode on each sample bam file to create single-sample GVCFs (using --emitConfidence GVCF, --variant_index_type LINEAR, --variant_index_parameter 128000)
2. run CombineGVCFs per ~200 samples
3. Run GenotypeGVCFs on all the gvcf files together to get raw VCFs
4. Run VQSR on these raw VCFs
5. genotype refinement pipeline (CalculateGenotypePosteriors,VariantFiltration,VariantAnnotator,SnpEff)
Since I have about 1200 samples, would you recommend process these samples per chromosome in step 2 and 3 ?
Thanks,
Jing
@jinghe1999
Hi Jing,
This looks good. The only thing to remember is that the output of GenotypeGVCFs will be 1 single vcf that contains all the samples.
-Sheila
@Nadia
Hi Nadia,
Sorry for the late response.
If your plant family is in trios or you know the parents and the offspring, you can use PhaseByTransmission to get more accurate phasing.
If you are interested in exact genotypes, you should use CalculateGenotypePosteriors.
If you are interested in all the tools, you can run HaplotypeCaller to get phasing. Then, you can run Calculate Genotype Posteriors to get the updated genotypes and posteriors. Then, if you want, you can run PhaseByTransmission again to re-phase after the genotypes and posteriors have been updated.
ReadBackedPhasing is redundant, but it does come in handy when merging MNPs, because Haplotype Caller does not currently do it.
-Sheila
Thanks Sheila, is that ok if I run CombineGVCFs and GenotypeGVCFs per chromosome ?
Jing
@jinghe1999
Hi Jing,
Sure, you can run both by chromosome.
-Sheila
@Sheila
I still have a question on this.... I run the HapotypeCaller on each genotype individually, then I use GenotypeGVCFs to merge the variants of within a family of mother, father and 35 children. After that I run CalculateGenotypePosteriors on the merged genomic vcf. In order to run PhasebyTransmission I create 35 individual vcfs each containing the parents and one children using SelectVariants. The resulting haplotypes of the parents after running PhaseByTransmission differ between the phased vcfs depending on the childs haplotypes. Do you have a recommondation on how to deal with this?
Thanks,
Nadia
@Nadia
Hi Nadia,
I am not sure what you mean. The phasing may change depending on the new genotype posterior calculated. Can you post an example of where the phasing differs?
Thanks,
Sheila
ok so this is one example. 13391516 and 13380506 are children of GfGa4742 and Villard blanc. In the second example Villardblanc should also be heterozygous at this position but it's not.
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 13391516 GfGa4742 VillardBlanc
VIT_201s0011g06410 162 . G A 35281.26 . AC=1;AF=0.167;AN=6;BaseQRankSum=-3.210e-01;ClippingRankSum=-1.110e-01;DP=249;FS=2.708;MQ=60.00;MQRankSum=0.542;QD=19.55;ReadPosRankSum=0.784;SOR=0.902 GT:AD:DP:GQ:PGT:PID:PL:TP 0|1:18,29:47:99:0|1:93_G_T:905,0,1438:79 0|0:90,0:90:99:.:.:0,120,1800:79 1|0:112,0:112:0:.:.:0,0,376:79
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 13380506 GfGa4742 VillardBlanc
VIT_201s0011g06410 162 . G A 35281.26 . AC=0;AF=0.00;AN=6;BaseQRankSum=-3.210e-01;ClippingRankSum=-1.110e-01;DP=414;FS=2.708;MQ=60.00;MQRankSum=0.542;QD=19.55;ReadPosRankSum=0.784;SOR=0.902 GT:AD:DP:GQ:PL:TP 0|0:212,0:212:99:0,120,1800:3 0|0:90,0:90:99:0,120,1800:3 0|0:112,0:112:0:0,0,376:3
@Nadia
Hi Nadia,
It looks like what is happening is that the father (Villard Blanc's) genotype at that position is not really known, so whatever makes sense given the child is what is being output. Notice the PLs for the father. 0,0,376 means we are not sure if he is 0/0 or 0/1. When the child is 0/1 and the mother is 0/0, it makes sense to assume the father is 0/1. Unfortunately, the site is ambiguous because of the PLs, so there may not be anything we can do.
-Sheila