
Standard practice for VCF filtering for the purpose of fingerprinting via proportion IBS?

Hi all,

I was wondering whether I could get some insight into the standard procedure for "fingerprinting" sequencing data by proportion IBS.

To be more precise: the proportion IBS between the variants called from sequencing data for two samples depends on the relationship between the individuals from whom the samples were taken. For example, the proportion IBS between two samples taken from the same person should be higher than that between siblings, which in turn should be higher than that between two unrelated people.

It should therefore be possible to classify a pair of samples whose relationship is unknown into one of the three aforementioned bins (self-self, self-sibling, self-unrelated). My attempt to do this with ~400 samples of targeted amplicon sequencing data yields fairly good separation when simply filtering at a MAF of 0.05, but there is still a lot of overlap (see attached image).

My question is whether there is a battery of filters to be applied to the VCFs that is generally accepted to be good practice for such a use.
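
To make the quantity in question concrete, here is a minimal sketch of one common way to compute proportion IBS. It assumes biallelic sites with genotypes encoded as alternate-allele counts (0, 1, 2) and `None` for no-calls; the function names and the encoding are illustrative assumptions, not something taken from GATK or Picard.

```python
from typing import Optional, Sequence

# Assumed encoding: alt-allele count (0, 1 or 2) at a biallelic site; None = no call.
Genotype = Optional[int]


def proportion_ibs(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Mean fraction of alleles shared identical-by-state over sites called in both samples."""
    shared = 0
    compared = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip sites missing in either sample
        shared += 2 - abs(ga - gb)  # IBS state: 0, 1 or 2 shared alleles
        compared += 1
    if compared == 0:
        raise ValueError("no sites called in both samples")
    return shared / (2 * compared)


def minor_allele_frequency(site_genotypes: Sequence[Genotype]) -> float:
    """Cohort MAF at one site, computed from the called genotypes only."""
    called = [g for g in site_genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


# Toy data, invented purely for illustration.
sample = [0, 1, 2, 1, None, 0]
replicate = [0, 1, 2, 1, 1, 0]
other = [2, 1, 0, 0, 2, 1]

print(proportion_ibs(sample, replicate))  # identical at every shared site
print(proportion_ibs(sample, other))
```

A MAF filter like the one described above would keep only the sites where `minor_allele_frequency` across the cohort is at least 0.05 before calling `proportion_ibs`.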


  • shlee (Cambridge; Member, Broadie)

    Hi @damagingcamel,

    You are asking about identity by state (IBS). GATK/Picard offers fingerprinting tools for such analyses. Please see CrosscheckFingerprints and ClusterCrosscheckMetrics. For details on the method, check out the fingerprinting poster. You can find a link to the poster folder at https://software.broadinstitute.org/gatk/documentation/presentations.

  • Hi @shlee, thanks for the response. I wasn't aware of those tools. Is there any reason why IBS wouldn't be a good metric for "fingerprinting" a genotype, given that I understand a fingerprint to mean simply a sample's unique genotype? I don't quite understand the difference between "fingerprinting" and comparing IBS values. Thanks!

  • Also, is there a tutorial/walkthrough for how to use the tools you pointed out? The documentation is a bit confusing.

  • shlee (Cambridge; Member, Broadie)

    @damagingcamel, all that I can offer you is this dictionary entry. We have no plans to write a tutorial for the fingerprinting tools. I hope you can figure out how to use these given the tool documentation and other forum threads. Good luck.

  • damagingcamel (Member)
    edited March 2018

    @shlee thanks for the response. It seems fairly straightforward, except for how one would actually construct a haplotype map for generic usage. I saw your post on a related GitHub issue, but it doesn't appear that there was ever any resolution:

    Do you have any advice as to how one would construct such a file for generic usage?

  • shlee (Cambridge; Member, Broadie)

    @damagingcamel, it is my desire to provide researchers with the pertinent information they need to enable science, hence the comment in that issue ticket. I did indeed ask to be able to look more into fingerprinting and to write a more detailed synopsis. However, our team has other priorities to focus on and the scope of such an endeavor goes beyond what the GATK/Picard programs cover. If you can make your case on why being able to generate your own haplotype map resource is important to biomedical research, then we can take it into consideration.

  • damagingcamel (Member)
    edited March 2018

    Hi @shlee, the main purpose would be to use it with the fingerprinting tools in Picard to track and prevent sample swaps in the context of PGS. In particular, we want to make sure that the correct embryo was implanted by cross-referencing the embryo's sequence information with the results from a NIPT. Since the fingerprinting tools in Picard can (according to the poster detailing the abilities of the tool) distinguish between self-self, self-unrelated, and self-sibling sample pairs, it seems to us that they would be ideal for our purposes.

  • shlee (Cambridge; Member, Broadie)
    edited March 2018

    @damagingcamel, there is a publication in the works that will enable making your own HaplotypeDB file. In the meantime, our developer says you can make a usable file by including many (~1000) common SNP sites that are spread far enough from each other to be in low LD. We have sites-only gnomAD resource files with population allele frequencies (AF) available for use with Mutect2 in the GATK Resource Bundle. In particular, the resource one would use with GetPileupSummaries contains fewer sites. Specifically, the small_exac_common resource contains ~60K biallelic common SNP sites ranging in population allele frequency from 0.051 to 0.499. Perhaps this may be a suitable resource to start with. Just a thought.
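
The site-selection idea above (many common SNPs spread far enough apart to be in low LD) could be sketched roughly as follows. This is only an illustration under stated assumptions: physical spacing is used as a crude stand-in for low LD, the 1 Mb default is an arbitrary placeholder rather than a recommended value, and the `Site` record and `thin_sites` function are hypothetical names. It selects candidate sites only and does not write the haplotype map file that the Picard tools expect.

```python
from typing import Dict, Iterable, List, NamedTuple


class Site(NamedTuple):
    # Hypothetical record for one biallelic SNP from a sites-only resource.
    chrom: str
    pos: int
    af: float  # population allele frequency of the alternate allele


def thin_sites(sites: Iterable[Site],
               min_maf: float = 0.05,
               min_spacing: int = 1_000_000) -> List[Site]:
    """Greedily keep common SNPs at least `min_spacing` bp apart on each chromosome.

    Spacing is only a rough proxy for low LD; both thresholds are assumptions.
    """
    kept: List[Site] = []
    last_kept_pos: Dict[str, int] = {}
    for site in sorted(sites, key=lambda s: (s.chrom, s.pos)):
        maf = min(site.af, 1 - site.af)
        if maf < min_maf:
            continue  # not a common variant
        prev = last_kept_pos.get(site.chrom)
        if prev is not None and site.pos - prev < min_spacing:
            continue  # too close to the previously kept site
        kept.append(site)
        last_kept_pos[site.chrom] = site.pos
    return kept


# Toy input, invented for illustration.
candidates = [
    Site("chr1", 100, 0.30),
    Site("chr1", 500, 0.40),
    Site("chr1", 2_000_000, 0.01),
    Site("chr1", 3_000_000, 0.20),
    Site("chr2", 50, 0.45),
]
for site in thin_sites(candidates):
    print(site.chrom, site.pos, site.af)
```

If the result has far more than the ~1000 sites mentioned above, increasing `min_spacing` is one simple way to reduce it while keeping the sites spread across the genome.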

  • yfarjoun (Broad Institute; Dev)

    In case you can make use of it, another user has made a pipeline of sorts for generating these hapmap files:


    I hope it is useful.

  • shlee (Cambridge; Member, Broadie)

    Thanks, @yfarjoun, for the resource. Researchers will find this useful.

  • golharam (Member)

    @shlee - CheckFingerprints is pretty much useless unless the GATK team can tell users how to generate the files.
