The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Register now for the upcoming GATK Best Practices workshop, Nov 7-8 at the Broad in Cambridge, MA. Open to all comers! More info and signup at https://goo.gl/forms/mFeERlIeLuJ95Idm1

# Pedigree Analysis

Posts: 53Member
edited July 2013

Hi,

I am trying to use PED files when I run UnifiedGeotyper, variantannotator and VariantRecalibrator (The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14)

if I don't add the PED file , I have some InbreedingCoeff

annotation=[SnpEff, AlleleBalance, BaseCounts, GCContent, HardyWeinberg, IndelType, AlleleBalanceBySample, MappingQualityZeroBySample]

but when I add the PED file, I have nothing

annotation=[AlleleBalance, BaseCounts, GCContent, HardyWeinberg, IndelType, AlleleBalanceBySample, MappingQualityZeroBySample, InbreedingCoeff]
pedigree=[/scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped] pedigreeString=[] pedigreeValidationType=SILENT

1       14653   .       C       T       10.43   LowQual ABHom=0.792;AC=2;AF=0.091;AN=22;BaseCounts=1,1362,0,261;BaseQRankSum=-3.828;DP=1075;Dels=0.00;FS=2.373;GC=58.42;HW=10.2;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.045;MQ=6.20;MQ0=976;MQRankSum=-2.768;OND=0.190;QD=0.10;

and my PED file seems is bad format because I have this message

INFO  17:02:17,687 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped with missing fields: []
INFO  17:02:17,687 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped with missing fields: []
INFO  17:02:17,687 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped with missing fields: []
INFO  17:02:17,687 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped with missing fields: []
INFO  17:02:17,770 PedReader - Phenotype is other? false
INFO  17:02:17,770 PedReader - Phenotype is other? false
INFO  17:02:17,770 PedReader - Phenotype is other? false
INFO  17:02:17,775 PedReader - Phenotype is other? false

But I have 6 columns and the name and tabulation between each column

2 MN932002 0 0 2 1
2 PNB32015 0 MN932002 1 1
4 CS934001 0 0 2 2
4 RMAO8004 0 CS934001 1 2
6 KH948045 0 0 2 2
6 AHAO8001 0 0 2 2
6 MHAO8003 0 0 2 2
A LSB32012 0 0 2 2
A SBB32010 0 LSB32012 2 2
B CMSB32011 0 0 2 2
C MKB32014 0 0 2 2
Z M88BBTPL 0 0 other -9

for the phenotype :

-9 missing
1 unaffected
2 affected

Consequently, I cannot run " VariantRecalibrator with -an InbreedingCoeff". Could you help me ?

Thanks,

Tiphaine

Post edited by Geraldine_VdAuwera on
Tagged:

• Posts: 53Member
edited July 2013

I don't understand if I have an error in PED file or not, I don't think so but I don't have InbreedingCoef when I had PED file

INFO 16:50:55,539 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:50:55,541 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.6-4-g3e5ff60, Compiled 2013/06/24 14:48:56
INFO 16:50:55,541 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:50:55,544 HelpFormatter - Program Args: -nct 14 -R /scratch/cbrc/ref/broad_bundle_b37_2.2/human_g1k_v37.fasta -T HaplotypeCaller -stand_emit_conf 10 -L /scratch/cbrc/analysis/BHF-4/tmp/target-intervals.1026697[0].master01.xcat.cluster.intervals -I /scratch/cbrc/analysis/BHF-4/out/bam/PNB32_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/CMSB32_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/CS001_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/M88BL_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/MN932_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/LSB32_A.bam -I /scratch/cbrc/analysis/BostonHF-4/out/bam/SBB3_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/KH9_B.bam -I /scratch/cbrc/analysis/BostonHF-4/out/bam/MHAO_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/KH945_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/RMA04_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/AH1_A.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/CS001_C.bam -I /scratch/cbrc/analysis/HF-4/out/bam/MN02_B.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/C4001_B.bam -I /scratch/cbrc/analysis/BHF-4/out/bam/M32014_A.bam --dbsnp /scratch/cbrc/ref/broad_bundle_b37_2.2/dbsnp_137.b37.vcf -dcov 200 -A QualByDepth -A Coverage -A VariantType -A ClippingRankSumTest -A DepthPerSampleHC -o /scratch/cbrc/analysis/BostonHF-4/out/vcf/initialvariants-0.vcf -A InbreedingCoeff -ped /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BHF.ped -pedValidationType SILENT -et NO_ET -K /scratch/cbrc/groupapps/GATK/2.6_4/key/tm483_medschl.cam.ac.uk.key
INFO 16:50:55,545 HelpFormatter - Date/Time: 2013/07/12 16:50:55
INFO 16:50:55,545 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:50:55,545 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:50:55,571 ArgumentTypeDescriptor - Dynamically determined type of /scratch/cbrc/ref/broad_bundle_b37_2.2/dbsnp_137.b37.vcf to be VCF
INFO 16:50:56,252 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:50:56,320 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 200
INFO 16:50:56,327 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 16:50:56,400 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
INFO 16:50:56,415 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 16:50:56,608 IntervalUtils - Processing 34876904 bp from intervals
INFO 16:50:56,618 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BHF/PED/BHF.ped with missing fields: []
INFO 16:50:56,684 PedReader - Phenotype is other? false
INFO 16:50:56,689 MicroScheduler - Running the GATK in parallel mode with 14 total threads, 14 CPU thread(s) for each of 1 data thread(s), of 16 processors available on this machine
INFO 16:50:56,729 GenomeAnalysisEngine - Preparing for traversal over 16 BAM files
INFO 16:50:57,380 GenomeAnalysisEngine - Done preparing for traversal
INFO 16:50:57,380 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:50:57,380 ProgressMeter - Location processed.active regions runtime per.1M.active regions completed total.runtime remaining
INFO 16:50:57,456 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units

this is PED file

2 MN93 0 0 2 2
2 PNB32 0 MN93 1 2
4 CS001 0 0 2 2
4 RMA04 0 CS001 1 2
6 KH945 0 0 2 2
6 AH1 0 0 2 2
6 MHAO8003 0 0 2 2
A LSB32 0 0 2 2
A SBB3 0 LSB32 2 2
B CMSB32 0 0 2 2
C MKB32014 0 0 2 2
Z MBPL 0 0 other 0

could you help me ? I have only 12 samples and only 11 affected samples ... is it a limit to run InbreedingCoeff ?

If I remember correctly, the PED file is used to calculate the InbreedingCoeff, so if your PED file is not loading correctly then it will prevent the annotation from being made properly. So you need to figure out what's wrong with your PED file. Have you checked that there are no extra trailing spaces or blank lines? We've seen that cause issues in the past.

Geraldine Van der Auwera, PhD

• Posts: 53Member
edited July 2013

I have one extra space and I deleted that, but when I rerun with VariantAnnotator with option -A InbreedingCoeff -pedValidationType SILENT and PED file, I have the same error message. (I rerun VariantAnnotator and no HaplotypeCaller because VariantAnnotator run quicker than HaplotypeCaller and I suppose that VariantAnnotator updates this values).
I am not sure I understand your sentence "if your PED file is not loading correctly then it will prevent the annotation from being made properly ". Do you want to say that if my PED file is not loading correctly, your algorithm to define InbreedingCoeff does not take this PED file and define InbreedingCoeff as if we have unrelated samples?

• Posts: 543Member, Dev ✭✭✭✭

I'm a little confused here - what's the error? I don't see any problems in that log

• Posts: 53Member

I don't know if this message is an error and If it is an error message, how do I check that my PED file is ok because this file has 6 columns and I follow the recommandations ...
I don't understand these lines

INFO 16:50:56,618 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BHF/PED/BHF.ped with missing fields: [] INFO 16:50:56,684 PedReader - Phenotype is other? false

Tiphaine

• Posts: 543Member, Dev ✭✭✭✭

The first line tells you that no fields are missing (that is, the fields listed inside the square brackets are missing)

The second line says that Phenotype is not encoded as "other" in the file (Is that 9 in PEDs? Can't remember)

These are both just status lines - which you can tell because they start with "INFO". I can't remember ever seeing GATK encounter an error without throwing an exception (ie, stopping execution and complaining loudly)

• Posts: 53Member

ok thanks

@pdexheimer said:
I can't remember ever seeing GATK encounter an error without throwing an exception (ie, stopping execution and complaining loudly)

Hah that's very true in general. However we do occasionally see some bugs where something fails silently. In this case I'm concerned that the PED file issue may be causing a silent failure because of this:

I don't have InbreedingCoef when I had PED file

If the otherwise same command line fails to yield the InbreedingCoeff annotation when the PED file is provided, but does yield the annotation when the PED file is not provided, then it sounds like we might have a silent failure. If so we may need some test files to reproduce the error locally. @Tiphaine, can you confirm that that's what you're seeing?

Geraldine Van der Auwera, PhD

• Posts: 543Member, Dev ✭✭✭✭

I thought the effect of the PED file was to remove non-founders from the annotation. In this case, that may be enough to drop it below 10 subjects and not emit the annotation

That's a fair point, I hadn't considered that. I'd be happier if we emitted a message to that effect in the log though.

Geraldine Van der Auwera, PhD

• Posts: 53Member

in this case, I have 11 affected samples and one control data. And it doesn't explain why when I give a PED file, I don't have values of InbreedingCoeff but I have when I don't give a PED file . I dropped the -pedValidationType SILENT, I don't have more informations in log.

• Posts: 543Member, Dev ✭✭✭✭

Yes, but affected vs control is not the same as founder vs non-founder. In your 12 samples, I see 3 that are not founders (i.e., that have parents present in the PED file) - PNB32, RMA04, and SBB3. That leaves 9 samples considered for InbreedingCoefficient, which is only calculated when you give it at least 10.

@Tiphaine, I think @pdexheimer is correct; your PED file is actually OK, and the annotation isn't being added because once the non-founders are removed, there are not enough individuals to calculate it.

Geraldine Van der Auwera, PhD

• Posts: 53Member
edited July 2013

Ok, I can understand that but why with HaplotypeCaller (GATK 2.6.4) I obtain some values of InbreedingCoeff. isn't it the same algorithm ?
I think it will be good if we have also the number of non-founders and founders used for InbreedingCoeff in a log file.

• Posts: 53Member
edited July 2013

In doc, maybe it is better to write at least 10 founders and not 10 samples to define inbreedingCoeff.

Fair point, I'll make a note to update the doc.

Geraldine Van der Auwera, PhD

• Posts: 53Member

Ok thanks but Do you have an explanation why HaplotypeCaller of GATK 2.6.4 can do that and not UnifiedGenotyper of GATK 2.3.9 from the same PED file?