We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Why is Inbreeding Coefficient not always displayed in a vcf file?

Hi there,
Using GATK v3.5-0, I've generated vcf files for a group of 223 whole genomes, one per interval of the reference genome. I created GVCF files using Haplotype Caller in 'GVCF' mode, before using CombineGVCFs to generate cohorts, which were entered into GenotypeGVCFs runs. Curiously, not every SNP or indel called in the vcf file has an Inbreeding Coefficient value attributed to it. This happens with both SNPs and Indels. Below are two lines from one of the output vcf files - you can see that the top entry contains no Inbreeding Coefficient value, whilst the lower entry does.
NW_014444451.1 5724 . A C 29058.86 . AC=166;AF=0.382;AN=434;BaseQRankSum=0.322;ClippingRankSum=0.244;DP=1982;ExcessHet=0.0000;FS=0.000;MLEAC=172;MLEAF=0.396;MQ=59.39;MQRankSum=0.278;QD=31.55;ReadPosRankSum=0.00;SOR=0.711 GT:AD:DP:GQ:PGT:PID:PL 0/0:4,0:4:0:.:.:0,0,56 0/1:3,4:7:99:0|1:5714_G_C:159,0,829
NW_014444451.1 5731 . TG T 28376.05 . AC=167;AF=0.383;AN=436;BaseQRankSum=0.387;ClippingRankSum=0.00;DP=1963;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.3346;MLEAC=174;MLEAF=0.399;MQ=59.60;MQRankSum=-5.300e-02;QD=31.49;ReadPosRankSum=0.00;SOR=0.743 GT:AD:DP:GQ:PGT:PID:PL 0/0:5,0:5:0:.:.:0,0,790/1:2,4:6:99:0|1:5714_G_C:200,0,694
I plan to use hard filters to generate a list of HQ variants that I can then feed into BQSR and VQSR in a second run-through of the GATK best practices, and I was going to include Inbreeding Coefficient in this. Would someone be able to explain to me why not all variant sites have Inbreeding Coefficient values and whether this will impact upon my filtering, please?
Many thanks,
Ian
Best Answers
-
Sheila Broad Institute admin
@ianw
Hi Ian,I just looked at your files, and it seems like the sites where InbreedingCoefficient is not annotated are sites where 1 or more samples have a no-call. In the test files, I only have 10 samples, so if 1 of the samples is no-call at a site, there will not be enough samples to determine InbreedingCoefficient. Can you check in your files with all the samples and see if this seems to be the case?
Thanks,
Sheila -
Sheila Broad Institute admin
@ianw
Hi Ian,I just used some of the new samples you uploaded, and again I am seeing what I mentioned earlier. When there are 9 samples with called genotypes, InbreedingCoefficient is not annotated. However, when a 10th sample is added that has a called genotype, InbreedingCoefficient is annotated. Do you have an exact site that is never annotated with InbreedingCoefficient?
Thanks,
Sheila -
Sheila Broad Institute admin
@ianw
Hi Ian,As long as 10 samples have a "proper genotype" (not ./.) InbreedingCoeff will be present at the site. Note, at least one sample has to have a variant genotype (not 0/0). That said, I do not get any sites output in my final VCF where all the samples have 0/0 genotypes.
I ran HaplotypeCaller on ~12-13 of your sample BAMs, then I ran GenotypeGVCFs on the GVCFs.
I did not use multithreading and I used the latest version of GATK.I get InbreedingCoeff for positions 9074, 7699, 8751.
Positions 6913 and 6541 do not show up at all in my final VCF.
I can share my exact commands with you in a private message if you want.
-Sheila
Answers
@ianw
Hi Ian,
I need to check with the team on this. I will get back to you soon.
-Sheila
Thanks @Sheila - I look forward to hearing back from you.
Hi @ianw.
Could you please post the full command line you use when calling GenotypeGVCFs? That should be located in the header of the output VCF as #GATKCommadLine= or something similar. Feel free to change file names in there in order to make it more anonymous. Also can you tell us how many samples are you analyzing and whether they are all diploid?
Thanks, V.
Hi @valentin ,
Apologies for not responding yesterday. I'm analysing 223 samples (with another ~160 to come in the next few months), all of which are diploid. The command line was run from within a script (simply called using 'sh script.sh') so I've included the entire script so it makes more sense:
for i in UMD_1_4 UMD_5_10;
do
bsub -G cichlid -e genotype_e_%J -o genotype_o_%J -R'select[mem>28000] rusage[mem=28000]' -M28000 -n 32 -q normal /nfs/users/nfs_m/mm21/programs/jre1.7.0_71/bin/java -Djava.io.tmpdir=/lustre/scratch113/projects/cichlid/tmp -Xmx28000m -jar /software/vertres/bin-external/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /lustre/scratch113/projects/cichlid/assembly/ref.fa --variant ${i}aa_cohort.g.vcf.gz --variant ${i}ab_cohort.g.vcf.gz --variant ${i}ac_cohort.g.vcf.gz --variant ${i}ad_cohort.g.vcf.gz --variant ${i}ae_cohort.g.vcf.gz --variant ${i}af_cohort.g.vcf.gz --variant ${i}ag_cohort.g.vcf.gz --variant ${i}ah_cohort.g.vcf.gz --variant ${i}ai_cohort.g.vcf.gz --variant ${i}aj_cohort.g.vcf.gz -nt 32 -L /lustre/scratch113/projects/cichlid/ref/${i}.intervals --includeNonVariantSites --disable_auto_index_creation_and_locking_when_reading_rods --out ${i}_allSites.vcf.gz;
done
Many thanks,
Ian
Hi @ianw
Based on the command line and the fact that all samples are diploid I cannot see what would be the reason why the InbreedingCoefficient is not displayed at that site. If it is not much of an inconvenience we would need to debug using a minimal subset of the data. If it possible for you to extract from the inputs variants and non-variant blocks that overlap those two sites and then upload the data? @Sheila could you help with that I don't know what is the protocol for this. Is the reference that you are using public somewhere?
Hi @valentin
Yep, I can do that no problem, if you wouldn't mind. I'll wait to hear from Sheila so I know how to upload the data. The most recently updated version of the reference genome is available at http://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA60369
Many thanks,
Ian
@ianw
Hi Ian,
Thanks. Instructions for uploading data are here.
-Sheila
Hi @Sheila and @valentin,
I'll get this to you ASAP. Unfortunately, I deleted the cohort files to clear room on our server so I'll need to recreate them to send to you. I'll hopefully get it to you today or tomorrow.
Ian
Apologies for the delay here. Looking at my input files, I'm not sure what I should upload. The cohort gvcf files (i.e. the GenotypeGVCFs input files) do not contain non-variant blocks - instead they all look like this:
The 223 original gvcf files from the previous step in the pipeline (i.e. the inputs for CombineGVCFs) do, however, contain non-variant blocks but I assume you do not wish to be faced with 223 input files?
Sorry for being dense but would you mind clarifying which files I would be best sending you (alongside the command, program output and reference files, as listed in the link @Sheila kindly provided), please?
Many thanks,
Ian
@ianw
Hi Ian,
Yes, 223 files might be too much!
I think the best thing for me would be if you can send me snippets of your BAM files. So, if you can narrow your 223 samples down to 10 samples which reproduce the issue, that would be great. Ideally, you would send over the snippets of 10 sample BAM files so when I run HaplotypeCaller on them, I get InbreedingCoefficient for some sites, but not for others. I hope this makes sense.
Thanks,
Sheila
@Sheila
Hi Sheila,
Apologies for the delay - I had to go through things a couple of times as well as cracking on with some other work. I've uploaded a tarball of the requested files as per your instructions. The tarball is called ianw_troubleshoot_upload.tar.gz.
I look forward to hearing back from you and, in the meantime, if any other information or files are required, please don't hestitate to get in touch.
Best wishes,
Ian
Issue · Github
by Sheila
@ianw
Hi Ian,
Thanks for the test data. I will have a look soon and get back to you.
-Sheila
I have noted inconsistent output with same input for InbreedingCoeff when using multithreads - see the following thread for an example - same input 5 threads, two different values reported and one missing, for the same positions. http://gatkforums.broadinstitute.org/gatk/discussion/7840/qd-and-inbreedingcoefficient-inconsistent-when-different-thread-number-used-for-genotypegvcf#latest May be related to this issue.
Hi @SteveL
That's very interesting! It certainly seems possible that it's down the multithreading. I'll be interested to see what answers our questions get. Fingers crossed it's not a big issue!
@ianw
Hi Ian,
I just looked at your files, and it seems like the sites where InbreedingCoefficient is not annotated are sites where 1 or more samples have a no-call. In the test files, I only have 10 samples, so if 1 of the samples is no-call at a site, there will not be enough samples to determine InbreedingCoefficient. Can you check in your files with all the samples and see if this seems to be the case?
Thanks,
Sheila
@Sheila
Hi Sheila,
I've had a look at the full vcf file of all SNPs that passed my hard filters and I don't think no-call samples explain it, unfortunately. There are 2,547,075 sites without Inbreeding Coefficient values compared with only 417,503 sites with them. Having briefly glanced through the output of a Perl script, the sites without Inbreeding Coefficient values include calls from different numbers of samples, ranging from at least 47 to 223 (out of 223). The sites with Inbreeding Coefficient values, meanwhile, include calls from numbers of samples ranging from 43 to 214. The ranges may be larger - I just haven't had time to look extensively into them.
The plot - as with the fog in my brain - thickens...
Cheers,
Ian
I'm not using a pedigree file, no - we don't have that information for these individuals. So, presumably, each is considered unrelated to the others, yes?
Ah I see. Nope, afraid not!
@ianw
Hi Ian,
Hmm. Can you upload a few more samples that will definitely cause the issue? I just need a few more samples that will give me 10 called genotypes at a site but not give the InbreedingCoefficient.
Thanks,
Sheila
@Sheila
Hi Sheila,
No problem - I'll get this to you as quickly as I can next week. Thanks for the continued support and help.
Ian
@Sheila @Geraldine_VdAuwera
Hi again,
I've uploaded 20 snippet files which should do the trick. I've included also the reference fasta file and relevant interval file, as well as the gvcf file that I generated using the 20 snippets, just in case it proved useful. The upload is called 'ianw_Broad_test_files.tar.gz'. I look forward to hearing from you about this again
Thanks,
Ian
@ianw
Hi Ian,
I just used some of the new samples you uploaded, and again I am seeing what I mentioned earlier. When there are 9 samples with called genotypes, InbreedingCoefficient is not annotated. However, when a 10th sample is added that has a called genotype, InbreedingCoefficient is annotated. Do you have an exact site that is never annotated with InbreedingCoefficient?
Thanks,
Sheila
@Sheila
It's possible that I've misunderstood what you meant here. Do you mean that Inbreeding Coefficient won't be calculated when fewer than 10 sites have no call at all (./.) or when fewer than 10 call variants (0/1 or 1/1)? If it's the latter then you may be right that it's down to low numbers of variant calls, although this entry, at least, doesn't seem to fit:
If it's the former then the gVCF file I uploaded contains a number of site that don't seem to fit this rule, including:
Finally, and a little off-topic but since I spotted it, HC or GGVCFs seems to be calling variants for sites without any variant calls. Again, I assume I'm missing something here but I'd be grateful for your input or a pointer to a page explaining why this happens:
If I've just misunderstood your meaning regarding Inbreeding Coefficient then I'm sorry for wasting your time, although it's helpful for me to know what's causing the discrepancy and that it isn't cause for concern.
All the best,
Ian
@ianw
Hi Ian,
As long as 10 samples have a "proper genotype" (not ./.) InbreedingCoeff will be present at the site. Note, at least one sample has to have a variant genotype (not 0/0). That said, I do not get any sites output in my final VCF where all the samples have 0/0 genotypes.
I ran HaplotypeCaller on ~12-13 of your sample BAMs, then I ran GenotypeGVCFs on the GVCFs.
I did not use multithreading and I used the latest version of GATK.
I get InbreedingCoeff for positions 9074, 7699, 8751.
Positions 6913 and 6541 do not show up at all in my final VCF.
I can share my exact commands with you in a private message if you want.
-Sheila
Hi @Sheila
If you wouldn't mind sharing your commands, that would be very helpful, thank you. This does rather suggest that I'm doing something wrong somewhere along the line.
Many thanks,
Ian
Hi @Sheila,
Was the issue of
InbreedingCoeff
not always displayed ever solved?I have the same problem at the VariantRecalibrator step. GenotypeGVCFs was run in multithread, but both VariantRecalibrator and ApplyRecalibration were not.
I've run all my analyses with v3.5 (and just re-ran the ApplyRecalibration step with v3.7) and I still get seemingly good looking positions without
InbreedingCoeff
in a consistent manner like the ones below:Some rarer or less well covered variants –maximum
AN=2270
– do reportInbreedingCoeff
:Any idea what's going on?
Looking forward to your response.
Issue · Github
by Sheila
@Fer
Hi,
Yes, this was actually an issue of not enough samples having a variant genotype. Can you confirm that at least ten samples at the site have a variant genotype? Please post an example record that does not have the annotation as well.
Thanks,
Sheila
@Sheila
Thanks for your response. This is a diploid organism, thus I guess
AC>20
already guarantees that at least 10 samples carry the variant (n=1135, maxAN=2270
), at least for the first two examples from my previous post –the defective ones.Nevertheless, I'm afraid the issue may be worse than what I've thought, and I suspect it comes down to multithreading. I've gone back to the GenotypeGVCFs step with and without the
-nt
option and, besides many positions missingInbreedingCoeff
annotation in the output of the former, some that do display it have pretty odd values only for that annotation. Here an example for the exact same position:WITH multithreading:
WITHOUT multithreading:
By the way, I calculated inbreeding coefficient manually, and the latter is closer to reality (only two samples carry Het calls).
Regards,
Fernando
@Fer
Hi Fernando,
Hmm. It looks like this was reported a while back.
Can you test this with GATK4 latest beta and see if the issue still occurs?
Thanks,
Sheila
If I understand correctly, the warning "WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples" is issued per-call-site if less than 10 samples are called at that site? I'm feeding GenotypeGVCFs over 45 samples, and I still get that (misleading) warning filling up most of the log file.
I'm on version docker://broadinstitute/gatk:4.0.1.2 , and using it in a similar manner as the original post, with the exception that I import each sample gvcf into a tiledb with GenomicsDBImport first:
@init_js
Hi,
Yes, that is correct. However, the WARN also happens at sites where there are lots of no-calls for the samples. Can you check if this is the case?
-Sheila