The current GATK version is 3.6-0

# Which tools use pedigree information?

Posts: 71Dev mod
edited December 2014 in FAQs

There are two types of GATK tools that are able to use pedigree (family structure) information:

### Tools that require a pedigree to operate

PhaseByTransmission and CalculateGenotypePosteriors will not run without a properly formatted pedigree file. These tools are part of the Genotype Refinement workflow, which is documented here.

### Tools that are able to use a pedigree to generate standard variant annotations

The two variant callers (HaplotypeCaller and the deprecated UnifiedGenotyper) as well as VariantAnnotator and GenotypeGVCFs are all able to use pedigree information if you request an annotation that involves population structure (e.g. Inbreeding Coefficient). To be clear though, the pedigree information is not used during the variant calling process; it is only used during the annotation step at the end.

If you already have VCF files that were called without pedigree information and you want to add pedigree-related annotations (e.g. to use Variant Quality Score Recalibration (VQSR) with the InbreedingCoefficient as a feature annotation), don't panic. Just run the latest version of VariantAnnotator to re-annotate your variants, requesting any missing annotations, and make sure you pass your PED file to VariantAnnotator as well. If you forget to provide the pedigree file, the tool will run successfully but pedigree-related annotations may not be generated (this behavior is different in some older versions).
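As a rough illustration of the re-annotation step, here is a small Python sketch that assembles a GATK 3 style VariantAnnotator command line with a PED file attached. All file names are hypothetical placeholders, and the annotation module name `InbreedingCoeff` is an assumption that should be checked against the annotation list of your GATK version. The sketch only builds and prints the command; it does not run GATK.

```python
import shlex
from typing import List


def variant_annotator_command(
    gatk_jar: str, reference: str, in_vcf: str, ped: str, out_vcf: str
) -> List[str]:
    """Build (but do not run) a VariantAnnotator command that passes a PED file."""
    return [
        "java", "-jar", gatk_jar,
        "-T", "VariantAnnotator",
        "-R", reference,
        "-V", in_vcf,
        "-A", "InbreedingCoeff",  # assumed annotation name; verify for your version
        "-ped", ped,              # engine-level pedigree argument
        "-o", out_vcf,
    ]


# Hypothetical file names, for illustration only.
cmd = variant_annotator_command(
    "GenomeAnalysisTK.jar", "reference.fasta",
    "calls.vcf", "family.ped", "calls.annotated.vcf",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the arguments as a list makes it easy to hand the command to `subprocess.run` later without worrying about shell quoting.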

### About the PED format

The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.

For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
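To make the six-column layout concrete, here is a minimal Python sketch of a PED reader. It is not GATK's own parser: the `PedRecord` type and `parse_ped` function are hypothetical names, and the choice to skip blank lines and `#` comments and to accept either tabs or spaces is an assumption of this sketch, so GATK itself may be stricter.

```python
from typing import List, NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str        # 1=male; 2=female; other=unknown
    phenotype: str  # -9 or 0=missing; 1=unaffected; 2=affected


def parse_ped(text: str) -> List[PedRecord]:
    """Parse six-column PED text, raising ValueError on any other column count."""
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # this sketch ignores blank lines and '#' comments
        fields = line.split()  # accepts tabs or spaces
        if len(fields) != 6:
            raise ValueError(
                f"line {lineno}: expected 6 columns, found {len(fields)}"
            )
        records.append(PedRecord(*fields))
    return records


example = (
    "FAM001\tNA00001\t0\t0\t1\t1\n"
    "FAM001\tNA00002\t0\t0\t2\t1\n"
    "FAM001\tNA00003\tNA00001\tNA00002\t1\t2\n"
)
for record in parse_ped(example):
    print(record.individual_id, record.paternal_id, record.maternal_id)
```

A full PLINK PED line with allele columns appended would fail the six-column check, which is the situation the paragraph above warns about.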


• Bay Area, CAPosts: 28Member

I've been looking all over for how to add a PED file to my VariantAnnotator run. I don't see an explanation on the VariantAnnotator page or here. I've tried to use the -list function to look at possible annotations, but I don't see ped file as an option. How should I pass a ped file to VariantAnnotator to re-annotate variants from an old version GATK run that wasn't originally run with a ped file?

PED files are passed through an engine argument (-ped / --pedigree), so they're not tool-specific.

Geraldine Van der Auwera, PhD

• Bay Area, CAPosts: 28Member

Thank you for the very prompt response. The program is now running away happily (hopefully).

• Posts: 61Member
edited April 2013

Hi Geraldine, can you tell me what's wrong with my PED file? Here it is attached!
PED file (a family with father (NA00001), mother (NA00002), son (NA00003), and daughter (NA00004)); the son has ALS, for example.

• #Family ID
• #Individual ID
• #Paternal ID
• #Maternal ID
• #Sex (1=male; 2=female; other=unknown)
• #Phenotype (-9=missing; 0=missing; 1=unaffected; 2=affected)
• FAM001 NA00001 0 0 1 1
• FAM001 NA00002 0 0 2 1
• FAM001 NA00003 NA00001 NA00002 1 2
• FAM001 NA00004 NA00001 NA00002 2 1
• Posts: 61Member

Can I also add the cousin as:

• FAM002 NA00005 0 0 2 1

Can I have only sisters and brothers in a PED file, without the mother and father?
And a last question: is it better to provide the PED file at variant-calling time with UnifiedGenotyper?

Hi @alirezakj,

Actually I think PED files can only contain trios, so if you want to phase siblings you have to put them in as different families (though obviously with the same parents). So FAM1 would have Mom, Dad and Kid1, FAM2 would have Mom, Dad and Kid2, and so on. Not sure what to do about cousins though.

Some GATK tools use the PED files and some don't. The simplest is to pass your PED file to every tool; those that can use it will do so, and those that can't will just ignore it.

Geraldine Van der Auwera, PhD

• Posts: 61Member
edited April 2013

Thank you so much Geraldine, very helpful. Three more questions:

1. For a disease with complex genetics, if a family member does not show the phenotype, which value should be used in the phenotype field of the PED file (0=missing or 1=unaffected)?
2. Should I always set -pedValidationType to SILENT when I pass the PED file to walkers? Why do most people set it to SILENT?
3. Why can a missing phenotype be coded as either 0 or -9? Why not just 0?

Thank you so much, you are being so helpful.

• Posts: 61Member

From your explanation, I understand that for a family of four (father "F", mother "M", son "S", and affected daughter "D") the PED file should look like the following:

• FAM1 F 0 0 1 0
• FAM1 M 0 0 2 0
• FAM1 S F M 1 0
• FAM2 F 0 0 1 0
• FAM2 M 0 0 2 0
• FAM2 D F M 2 2

Am I right?
Thanks

edited April 2013

To your previous three questions:

1. Set 1 if you know they are unaffected, 0 if you don't know.

2. By default validation is set to STRICT. Some people choose to use SILENT for various reasons, for example if they are using a BAM file containing a large cohort of individuals, but they are only analyzing one family trio. If they used STRICT, the program would complain that all the other samples are lacking pedigree information. If you don't care about those other samples then this is a bother. These are the validation options:

STRICT
If a pedigree file is provided at all, require that every sample in the VCF or BAM files has a corresponding entry in the pedigree file(s).
SILENT
Do not enforce any overlap between the VCF/BAM samples and the pedigree data.

3. This is typically a programmer's decision to allow different values to mean the same thing; in this case I don't know why they chose this. It doesn't really matter, just pick one and be sure to always use the same one in all your work to avoid confusion.
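The difference between the two validation modes can be pictured with a short Python sketch. This only illustrates the behavior described above and is not GATK's implementation; `check_pedigree_overlap` and the sample names are hypothetical.

```python
from typing import Iterable, List


def check_pedigree_overlap(
    data_samples: Iterable[str],
    ped_samples: Iterable[str],
    validation: str = "STRICT",
) -> List[str]:
    """Return the VCF/BAM samples that have no pedigree entry.

    In STRICT mode any such sample is an error; in SILENT mode it is tolerated.
    """
    missing = sorted(set(data_samples) - set(ped_samples))
    if validation == "STRICT" and missing:
        raise ValueError("samples without a pedigree entry: " + ", ".join(missing))
    return missing


# A large cohort file in which only one trio is described in the PED file.
cohort = ["dad", "mom", "kid", "unrelated1", "unrelated2"]
trio = ["dad", "mom", "kid"]

print(check_pedigree_overlap(cohort, trio, validation="SILENT"))
try:
    check_pedigree_overlap(cohort, trio, validation="STRICT")
except ValueError as err:
    print("STRICT would stop here:", err)
```

Running a check like this before launching a long job shows up front whether STRICT validation is likely to complain about samples you do not care about.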


Geraldine Van der Auwera, PhD

• Posts: 28Member
edited May 2013

Hi,
I just created a ped file in order to use it for the variant calling. However I'm receiving the following error:

INFO 12:45:21,838 PedReader - Reading PED file /Data/samples.ped with missing fields: []

INFO 12:45:21,949 PedReader - Phenotype is other? false

I already saw some comments about this error, which might occur due to a malformed PED file. However, I could not find the problem with my PED file. It looks like this, separated by tabs.

FAM001 S00002 0 0 1 1

FAM001 S00003 0 0 2 1

FAM001 S00001 S00002 S00003 1 2

FAM002 ....

Is there any option to get a more detailed error message, or can someone tell me what's wrong with my file? I should say that I created it by hand, since there are only 8 trios in it.


Hmm, any chance that you have spaces instead of tabs somewhere in there?

Geraldine Van der Auwera, PhD

• Posts: 28Member

I could not find any spaces, only tabs. Just checked it again.

I also created a minimal PED file, containing only the first trio. However, this resulted in the same warning message. I also tried to experiment with the values, changed the separator to spaces, without success.

Hey Max, that doesn't look like an error; it's just an info line. What is the output of the program?

• Posts: 28Member

Hi Carneiro,

Well, the output looks normal, but I wasn't sure whether everything had worked, since I saw this warning message. That's why I thought there might be a problem with my PED file.

problem solved then! great!

• Posts: 8Member

Hi,

It's my first time on the GATK pipeline and I'm happy to say I've made it to having VCF files. My issue is that I now need to filter variants based on inheritance. I have trios, but they are not what is usually meant by trios. I have two affected siblings and a parent, so my trio is: child1, child2, parent. I need to find homozygous or heterozygous variants that both children inherited from the parent. Can I do this with GATK's -T PhaseByTransmission or not? If not, should I use Annovar, or do you have any suggestions as to what I could use?

Thanks very much!

There is no tool designed to do specifically that, but you can use the toolkit (CombineVariants and SelectVariants are a good start). Or, you can write your own rod walker to do so very easily.
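If a quick script is an option, the filter could look something like the Python sketch below. It is not a GATK tool and does not reproduce CombineVariants or SelectVariants; `shared_with_parent` and the sample names are hypothetical. It keeps records where the parent and both children all carry the same non-reference allele, and it assumes GT is the first FORMAT field. Note that sharing an allele does not prove it was inherited from that parent, since the other parent is not sequenced.

```python
import re
from typing import Iterable, Iterator, List, Set


def alt_alleles(sample_field: str) -> Set[str]:
    """Return the non-reference allele indices in a sample's GT (first subfield)."""
    genotype = sample_field.split(":")[0]
    return {a for a in re.split(r"[/|]", genotype) if a not in ("0", ".")}


def shared_with_parent(
    vcf_lines: Iterable[str], parent: str, children: List[str]
) -> Iterator[str]:
    """Yield VCF records where the parent and all children share an alt allele."""
    columns = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            columns = {name: index for index, name in enumerate(fields)}
            continue
        if columns is None:
            raise ValueError("no #CHROM header line before the first record")
        shared = alt_alleles(fields[columns[parent]])
        for child in children:
            shared &= alt_alleles(fields[columns[child]])
        if shared:
            yield line


# Tiny made-up VCF for illustration.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tparent\tchild1\tchild2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/1\t1/1",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\t0/0\t0/1",
    "1\t300\t.\tG\tA\t50\tPASS\t.\tGT:DP\t./.\t0/1:8\t0/1:9",
]
for record in shared_with_parent(vcf, "parent", ["child1", "child2"]):
    print(record)
```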

• Posts: 56Member

@Geraldine_VdAuwera said:
Hi alirezakj,
Actually I think PED files can only contain trios, so if you want to phase siblings you have to put them in as different families (though obviously with the same parents). So FAM1 would have Mom, Dad and Kid1, FAM2 would have Mom, Dad and Kid2, and so on.

Geraldine, is this a confirmed workaround? I'm getting "##### ERROR MESSAGE: Inconsistent values detected for [Father] for field Family_ID value1 [FAM1] value2 [FAM2]" (IDs in [brackets] replaced) when trying this approach.
Performing two successive PhaseByTransmission runs gets around this, commenting out the other trio each time. (I just commented with '#' at the start of the line, so the problem is unlikely to be caused by a PED file format error, though I wish it were, as that would be the easiest fix ;-).) This seems to work.

However, what happens if the phase for a parent genotype in the second run doesn't agree with the phase determined in the first run? Does the second run simply over-write the first? ('academic' question only, have yet to observe this case)

Thanks

Hi Klaus,

I've never needed to do it myself, but I'm told that's how it's supposed to work. It may be that you need to pass the trios in separately -- that would fit with what you're seeing.

No idea about your hypothetical, but let me know if you run into the case, I'd be interested to find out what happens.
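For anyone who wants to script the "pass the trios in separately" approach rather than comment lines in and out by hand, here is a hedged Python sketch. `split_into_trios` is a hypothetical helper, not part of GATK; it produces one block of PED text per child whose parents are both listed, keeping the original family ID in each block.

```python
from typing import Dict


def split_into_trios(ped_text: str) -> Dict[str, str]:
    """Map each child ID to PED text containing only father, mother and that child."""
    rows = [
        line.split()
        for line in ped_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    by_id = {row[1]: row for row in rows}
    trios = {}
    for row in rows:
        child, father, mother = row[1], row[2], row[3]
        # Founders have parents coded as 0, which never match an individual ID here.
        if father in by_id and mother in by_id:
            members = [by_id[father], by_id[mother], row]
            trios[child] = "\n".join("\t".join(m) for m in members) + "\n"
    return trios


family = (
    "FAM001\tNA00001\t0\t0\t1\t1\n"
    "FAM001\tNA00002\t0\t0\t2\t1\n"
    "FAM001\tNA00003\tNA00001\tNA00002\t1\t2\n"
    "FAM001\tNA00004\tNA00001\tNA00002\t2\t1\n"
)
for child_id, trio_text in split_into_trios(family).items():
    print(f"--- trio for {child_id} ---")
    print(trio_text, end="")
```

Each block can be written to its own file and used for a separate PhaseByTransmission run, which matches the two-run workaround described above.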

Geraldine Van der Auwera, PhD

• UKPosts: 24Member

Hi Geraldine,

I think my questions fit here better than in a new topic. I have several questions about using pedigree files for which I could not find a clear answer on the forum. I hope you can help me with these.

I noticed that with --pedigree, one can pass several pedigree files at a time. A few things are not clear to me:

1. It's clear that each pedigree file can only have one phenotype, but what if each pedigree file describes samples with a different phenotype? Will GATK consider all the pedigrees to be under the same phenotype, or does it know that a different file means a different phenotype?

2. The note in this doc mentions what to do to add PED info for variants called with UnifiedGenotyper or VariantAnnotator. What about variants called with HaplotypeCaller? Can I do the same?

3. For trios in a bigger family, if I want to do the phasing on each trio later, should I put each trio in a separate PED file while calling variants with HaplotypeCaller?

4. I use Queue and a Scala script to run HaplotypeCaller. Could you shed some light on how to take PED files as an input parameter in the Scala script?

Many thanks,

Hi @byb121,

1. I'm sorry but I don't understand your first question, can you please clarify what you mean? Perhaps if you describe an example of use case (with hypothetical details) it will help.

2. The HaplotypeCaller will treat pedigrees the same way as UnifiedGenotyper, so the advice for UG is also applicable for HC.

3. As I recall you can use a single ped file for calling variants; only the phasing tools require that you break up the family into separate trio peds.

4. You can add a pedigree input the same way as you would add any input file argument to your script. What are you currently using as a base to develop your scala script? FYI, next week we are holding a workshop that includes details on how to write scala scripts for Queue. The presentations will be online by Monday 21st; be sure to have a look at them as they will include a lot of helpful documentation on this topic.

Geraldine Van der Auwera, PhD

• UKPosts: 24Member
edited October 2013

Thanks a lot for your reply. Sorry I didn't make myself clear on the first question. Here's an example: there are 30 exomes; 10 of them are for studying disease A (you will certainly have affected and unaffected ones), another 10 are for disease B, 5 are for disease C, and the remaining 5 are singletons for 5 other diseases. This situation is quite common in our lab now. As far as I know, you can't use one pedigree file for two different diseases (I'm not sure I'm correct on this). So when calling variants I will have to supply 3 PED files (the 5 singletons can be without PED info), for diseases A, B and C. My question is: will GATK know that the 3 files are for 3 diseases rather than one, so that correct annotations (e.g. allele frequency and inbreeding coefficient) are assigned?

For better variant calling and recalibration results, we need at least 30 samples, as suggested in the Best Practices. So there is also the situation where we add 20 or more additional exomes when we really only want to call variants from 10 other samples. If some of the additional exomes are related, should the pedigree file be used while calling variants? I guess the answer is 'Yes', but confirmation from an expert would be reassuring.

Cheers


Hi again and sorry for the delayed response. If you mean that you want to calculate frequency/inbreeding per disease group, this is a post-processing step that is performed after variant calling.

Geraldine Van der Auwera, PhD

• koreaPosts: 2Member

Hi, I'm trying to run PhaseByTransmission, but I keep getting NullPointerException in the stack trace, and a Code exception error message. I've copy-pasted the standard error output below.
The vcf file was produced using GATK UnifiedGenotyper from recalibrated bam files (with -ped option), and the variants were recalibrated as well. The "analysis ready" vcf file was then annotated using VariantAnnotator (-G StandardAnnotation and -ped parameters as recommended above). I've pasted the ped file contents below as well.

Can you advise me on anything that I can try? Thank you.

ped file

F1     case      dad   mom  1       2
F1     dad        0       0       1       1
F1     mom      0       0       2       1


stack trace

INFO 02:57:49,282 HelpFormatter - --------------------------------------------------------------------------------
INFO 02:57:49,284 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15
INFO 02:57:49,284 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 02:57:49,284 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 02:57:49,287 HelpFormatter - Program Args: -T PhaseByTransmission -R /home/adminrig/src/GATK.2.0/resource.bundle/2.8/b37/human_g1k_v37.fasta -V trio.annot2.vcf -ped input.ped -o trio.annot2.pbt.vcf
INFO 02:57:49,287 HelpFormatter - Date/Time: 2014/04/04 02:57:49
INFO 02:57:49,288 HelpFormatter - --------------------------------------------------------------------------------
INFO 02:57:49,288 HelpFormatter - --------------------------------------------------------------------------------
INFO 02:57:49,297 ArgumentTypeDescriptor - Dynamically determined type of trio.annot2.vcf to be VCF
INFO 02:57:49,785 GenomeAnalysisEngine - Strictness is SILENT
INFO 02:57:49,939 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 02:57:49,959 RMDTrackBuilder - Loading Tribble index from disk for file trio.annot2.vcf
INFO 02:57:50,036 PedReader - Reading PED file input.ped with missing fields: []
INFO 02:57:50,113 PedReader - Phenotype is other? false
INFO 02:57:50,162 GenomeAnalysisEngine - Preparing for traversal
INFO 02:57:50,176 GenomeAnalysisEngine - Done preparing for traversal
INFO 02:57:50,177 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 02:57:50,178 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 02:58:20,180 ProgressMeter - 7:95001555 2.63e+04 30.0 s 19.0 m 42.8% 70.0 s 40.0 s

##### ERROR stack trace

java.lang.NullPointerException
at org.broadinstitute.sting.gatk.walkers.phasing.PhaseByTransmission$TrioPhase.getPhasedGenotype(PhaseByTransmission.java:431)
at org.broadinstitute.sting.gatk.walkers.phasing.PhaseByTransmission$TrioPhase.getPhasedGenotypes(PhaseByTransmission.java:390)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)

##### ERROR ------------------------------------------------------------------------------------------

Have you checked if there is a blank line in your file? I think I've seen this happen before, and you do have the line "Reading PED file input.ped with missing fields: [] " in the log output.

Geraldine Van der Auwera, PhD

• koreaPosts: 2Member

I don't see any unnecessary spaces or blank lines.
I don't believe that my ped file format is incorrect.
Doesn't "missing fields: []" mean that I don't have any missing fields within the ped file?
Here is the output of "cat -A input.ped":

F1^Icase^Idad^Imom^I1^I2$
F1^Idad^I0^I0^I1^I1$
F1^Imom^I0^I0^I2^I1$

Do you think there could be other reasons for the error? Could it be a bug?
Thank you again for your help Geraldine. I really appreciate it.

• Posts: 10,382Administrator, Dev admin

Hmm, that looks fine. Can you please validate your VCF to make sure there is nothing wrong with it?

Geraldine Van der Auwera, PhD

• Posts: 6Member

Is this trio-aware in the sense that mendelian inheritance spelled out in the ped file can help the calling of genotypes where there is ambiguity?

btw, PhaseByTransmission links are broken

• Posts: 10,382Administrator, Dev admin

@JeremyLeipzig Yes, that's right. Note that we are currently working on upgrading trio-based genotyping capabilities, so we should have some improvements out in the next month or so.

Geraldine Van der Auwera, PhD

• United KingdomPosts: 400Member ✭✭✭

@Geraldine_VdAuwera said:
Note that we are currently working on upgrading trio-based genotyping capabilities, so we should have some improvements out in the next month or so.

I will be doing calling of variants from high coverage trio data within the next weeks. Will a newer version of GATK with upgraded trio-based capabilities be released before then? Is it already possible at this stage to reveal which improvements have been planned? Thanks!

• Posts: 6Member, Broadie, Dev

The new Genotype Refinement Pipeline is already in the codebase and should be available via the nightly builds. It has the capability (via CalculateGenotypePosteriors) to derive posterior genotype probabilities (in the new PP format field) based on the genotype likelihoods of the other members of the trio. (Genotypes will be modified based on these posteriors if necessary.) You can pass in population allele counts from HapMap or 1000 Genomes to help inform the posteriors as well.

There's also a new possibleDeNovo annotation that can be applied with VariantAnnotator after CGP to tag high- and low-confidence de novo mutations in the trio offspring if that's something you're interested in. There's some information in the CalculateGenotypePosteriors tool docs, but more comprehensive documentation is forthcoming pending the completion of @Geraldine_VdAuwera's trip abroad.

• Posts: 10,382Administrator, Dev admin

Yep, I'm holding up the process -- Laura @gauthier has produced some beautiful docs but I wasn't able to finalize the postings before leaving on my current conference trip. Will try to post them later this week, or next week at the very latest. Sorry for the delay, @tommycarstensen!

Geraldine Van der Auwera, PhD

• United KingdomPosts: 400Member ✭✭✭

@Geraldine_VdAuwera said:
Actually I think PED files can only contain trios, so if you want to phase siblings you have to put them in as different families (though obviously with the same parents). So FAM1 would have Mom, Dad and Kid1, FAM2 would have Mom, Dad and Kid2, and so on. Not sure what to do about cousins though.

Can ped/fam files still only contain trios (version 3.2)? Thanks!

• Posts: 10,382Administrator, Dev admin

I believe so but will check.

Geraldine Van der Auwera, PhD

• StockholmPosts: 2Member

Hi! I would like to phase my vcf generated with Ion Torrent Variant Caller using PhaseByTransmission, but the phasing is actually not happening. This is my command (running GATK 3.2):

java -Xmx8g -jar /home/bianca/bin/GenomeAnalysisTK.jar \
-R $REF_PFX \
-T PhaseByTransmission \
-V WESIonProton.vcf \
-ped /home/bianca/glob/bin/gemini/WES.ped \
-o WESIonProton_phased.vcf \
--pedigreeValidationType SILENT


And this is the output I get:

OUTPUT
INFO  12:15:11,411 ProgressMeter -   chr9:16419124        1.03e+05   30.0 s        4.9 m     50.2%        59.0 s    29.0 s
INFO  12:15:37,061 PhaseByTransmission - Number of complete trio-genotypes: 211628
INFO  12:15:37,061 PhaseByTransmission - Number of trio-genotypes containing no call(s): 1089220
INFO  12:15:37,061 PhaseByTransmission - Number of trio-genotypes phased: 0
INFO  12:15:37,062 PhaseByTransmission - Number of resulting Het/Het/Het trios: 44409
INFO  12:15:37,062 PhaseByTransmission - Number of remaining single mendelian violations in trios: 0
INFO  12:15:37,062 PhaseByTransmission - Number of remaining double mendelian violations in trios: 0
INFO  12:15:37,062 PhaseByTransmission - Number of complete pair-genotypes: 0
INFO  12:15:37,062 PhaseByTransmission - Number of pair-genotypes containing no call(s): 0
INFO  12:15:37,063 PhaseByTransmission - Number of pair-genotypes phased: 0
INFO  12:15:37,063 PhaseByTransmission - Number of resulting Het/Het pairs: 0
INFO  12:15:37,063 PhaseByTransmission - Number of remaining mendelian violations in pairs: 0
INFO  12:15:37,063 PhaseByTransmission - Number of genotypes updated: 0
INFO  12:15:38,924 ProgressMeter -            done        2.28e+05   57.0 s        4.2 m     98.6%        57.0 s     0.0 s
INFO  12:15:38,924 ProgressMeter - Total runtime 57.54 secs, 0.96 min, 0.02 hours


And this is how my vcf looks:

chr1    897325  .   G   C   38.35   PASS    AC=36;AF=0.900;AN=40;DP=520;FR=.;HRUN=1;LEN=1;OALT=C;OID=.;OMAPALT=C;OPOS=897325;OREF=G;SSEN=0;SSEP=0;TYPE=snp;set=variant-variant3-variant5-variant11-variant13-variant15-variant17-variant19-variant21-variant25-variant27-variant29-variant33-variant35-variant37-variant39-variant41-variant45-variant47-variant51-variant53-variant55-variant57-variant63-variant65-variant67-variant69-variant71-variant73-variant77-variant79-variant81-variant85-variant87-variant89-variant91-variant93-variant97-variant99-variant103 GT:AO:DP:FAO:FDP:FRO:FSAF:FSAR:FSRF:FSRR:GQ:RO:SAF:SAR:SRF:SRR  ./. 1/1:6:6:6:6:0:0:6:0:0:4:0:0:6:0:0   1/1:7:7:7:7:0:2:5:0:0:4:0:2:5:0:0   0/1:7:8:7:8:1:1:6:0:1:6:1:1:6:0:1   ./. 1/1:14:14:15:15:0:7:8:0:0:7:0:6:8:0:0   1/1:8:8:9:9:0:5:4:0:0:4:0:5:3:0:0   1/1:11:11:11:11:0:6:5:0:0:5:0:6:5:0:0   1/1:12:12:12:12:0:2:10:0:0:5:0:2:10:0:0 1/1:7:7:7:7:0:3:4:0:0:4:0:3:4:0:0   0/1:9:31:9:20:11:1:8:0:11:44:22:1:8:11:11   1/1:8:8:8:8:0:3:5:0:0:4:0:3:5:0:0   1/1:6:6:7:7:0:1:6:0:0:4:0:1:5:0:0   0/1:5:24:5:14:9:1:4:1:8:23:19:1:4:11:8  1/1:17:17:18:18:0:1:17:0:0:8:0:1:16:0:0 1/1:17:17:19:19:0:7:12:0:0:8:0:5:12:0:0 1/1:5:5:5:5:0:1:4:0:0:4:0:1:4:0:0   ./. ./. ./. 0/1:8:43:8:31:23:5:3:4:19:22:35:5:3:16:19   1/1:8:8:9:9:0:2:7:0:0:4:0:2:6:0:0   1/1:12:12:12:12:0:6:6:0:0:5:0:6:6:0:0   ./. 1/1:7:7:8:8:0:4:4:0:0:4:0:4:3:0:0   1/1:9:9:11:11:0:6:5:0:0:5:0:6:3:0:0


What is wrong? How can I get this to work?
Alternatively, which tool do you recommend for phasing VCFs produced by TVC?

Thanks,
Bianca

Try running without --pedigreeValidationType SILENT just in case there's something wrong with the pedigree file that is interfering with the phasing process.