Cross comparison between Array and NGS data

santiagorevale Argentina Member

Dear GATK staff,

I have 11 samples that were sequenced using NGS (Illumina HiSeq), and 2 of these samples were also genotyped using an Illumina Human Global Screening Array (Illumina iScan). I was looking at your latest WDL script and noticed a few steps that I think are related, but I don't know how to prepare the inputs for them. Any help or additional explanation would be really appreciated!

# Check identity of fingerprints across readgroups
input: haplotype_database_file

What information should I use to create this file? Array data? I have already read these links but I'm still lost:

# Estimate level of cross-sample contamination
input: contamination_sites_vcf

What information should I use to create this file?

# Check the sample BAM fingerprint against the sample array 
input: haplotype_database_file
input: genotypes

What information should I use to create these files? What does each input stand for?

Thank you very much in advance.

Best regards,

  • shlee Cambridge Member, Broadie ✭✭✭✭✭
    edited June 2017

    Hi @santiagorevale,

    You will need to prepare a haplotype map file for the Picard fingerprinting tools, getting both (i) the content and (ii) the file format right. One approach would be to use a population common variant resource that contains allele frequencies. To tie together variants that are in high LD, you could use population stratification.

    Take a look at the resources I've made available in the FTP server's beta folder (see GATK Resource Bundle>FTP Server access>beta folder). The ContEst folder contains an older HapMap-based resource that stratifies allele frequencies by population. If formatted correctly, I believe this data could work for the fingerprinting tools. Note that ExAC offers a more recent dataset of population-stratified allele frequencies, but these are not in the correct format. The Mutect2 folder contains a gnomAD-based population common variant resource that contains allele frequencies, but these are not stratified by population, i.e. for variants in high LD. Prepping such a resource is beyond the scope of what our tools can do. If you are interested in going this route, you might look into external programs, e.g. bcftools.

    You'll also notice a GetPileupSummaries folder within the beta folder. This resource is for use with our new GATK4 contamination workflow, which uses GetPileupSummaries and CalculateContamination. GATK4 will become available sometime this week or next. In the meantime, you can read up on the tool here.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭
    edited June 2017

    @santiagorevale I've also asked for some followup from our developers who are more in-the-know with the statistical aspects of these tools.

  • santiagorevale Argentina Member

    Hi @shlee,

    I don't know much about population genomics. My goal in this case is to identify which NGS sample corresponds to each array sample. So why do I need a population file as input for any of the tools?

    On the other hand, what would be (i) the appropriate content for the haplotype map file in my case? Is there an already generated haplotype map file that you could share with me for this purpose?

    If I have to generate my own file starting from one of the VCF files from the bundle you described before, how can I get the MAF and how should I populate ANCHOR_SNP and PANELS fields?

    Thank you very much in advance.
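
For readers with the same question, here is a minimal Python sketch of one way to derive the MAF column from the `AF` annotation of a VCF and write one row per SNP. It is an illustration under stated assumptions, not a tested recipe: the column order follows the haplotype map format description linked later in this thread as best I understand it, leaving ANCHOR_SNP and PANELS empty (each SNP treated as its own block) relies on the "optionally arranged in high-LD blocks" wording quoted below, the function names are hypothetical, and the sketch omits any file header the real format may require.

```python
# Hypothetical helper: convert biallelic SNP lines of a VCF (with an AF
# annotation in INFO) into tab-separated haplotype-map-style rows.
# Assumed column order; verify against the format description before use.
COLUMNS = ["#CHROMOSOME", "POSITION", "NAME", "MAJOR_ALLELE",
           "MINOR_ALLELE", "MAF", "ANCHOR_SNP", "PANELS"]


def vcf_line_to_row(line):
    """Return a list of column values for one VCF data line, or None to skip it."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 8:
        return None
    chrom, pos, name, ref, alt, _qual, _filt, info = parts[:8]
    # Keep only simple biallelic SNPs; indels and multi-allelic sites are skipped.
    if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
        return None
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    if "AF" not in fields:
        return None
    alt_af = float(fields["AF"])
    # AF in a VCF is the ALT allele frequency; the minor allele is whichever
    # allele is rarer, so flip when the ALT allele is the common one.
    if alt_af <= 0.5:
        major, minor, maf = ref, alt, alt_af
    else:
        major, minor, maf = alt, ref, 1.0 - alt_af
    if name == ".":
        name = f"{chrom}:{pos}"  # fall back to a positional name
    # Assumption: empty ANCHOR_SNP and PANELS mean "unlinked SNP, no panel".
    return [chrom, pos, name, major, minor, f"{maf:.4f}", "", ""]


def vcf_to_haplotype_map(vcf_text):
    """Convert VCF text into header line plus one row per usable SNP."""
    out = ["\t".join(COLUMNS)]
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        row = vcf_line_to_row(line)
        if row is not None:
            out.append("\t".join(row))
    return "\n".join(out) + "\n"
```

Grouping SNPs into LD blocks would mean filling ANCHOR_SNP with the name of a block's anchor SNP, which needs linkage information that a plain allele-frequency VCF does not carry.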

  • shlee Cambridge Member, Broadie ✭✭✭✭✭
    edited June 2017

    Hi @santiagorevale,

    Links to our bundle are listed on https://software.broadinstitute.org/gatk/download/bundle. Resources are available either via FTP or Google Cloud storage where clicking on a file automatically starts a download.

    The HAPLOTYPE_MAP argument descriptions for the two fingerprinting tools state:

    The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting.


    The file of haplotype data to use to pick SNPs to fingerprint

    From this, I think the tool uses this file to narrow down the set of sites it cross-checks, which makes it more efficient than checking every site. The LD block linkage makes it even more efficient, but it appears from the wording that the linkage isn't necessary. I'm sorry I cannot be more definitive. I haven't tested the tools myself, so I have never generated a haplotype-map format file, and I believe accepting a VCF format is a feature that is still in the works. That's why we've gone to the lengths of describing the haplotype-map format in http://gatkforums.broadinstitute.org/gatk/discussion/9526/picard-haplotype-map-file-format.

    The fingerprinting tools are actually a bit more sophisticated than their names imply. I believe they also detect degrees of relatedness between samples, e.g. between samples from two sisters. Check out the poster in the posters drive listed here.

    If file manipulations are beyond your comfort level, then you may be interested to know that for production we currently use an external tool for sample-swap detection called VerifyBamID.
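
To illustrate why matching an array sample to an NGS sample benefits from population allele frequencies, here is a simplified Python sketch of a per-site log-odds (LOD) score comparing "same individual" against "two unrelated individuals". This is not Picard's actual model. The Hardy-Weinberg genotype priors, the flat genotyping error rate, the 0/1/2 minor-allele-count encoding, and all function names are assumptions made for illustration.

```python
import math


def genotype_priors(maf):
    """Hardy-Weinberg priors for carrying 0, 1, or 2 copies of the minor allele."""
    p = 1.0 - maf
    return [p * p, 2.0 * p * maf, maf * maf]


def obs_likelihood(observed, true, error_rate):
    """P(observed genotype | true genotype) under a flat error model."""
    return 1.0 - error_rate if observed == true else error_rate / 2.0


def site_lod(g1, g2, maf, error_rate=0.01):
    """log10 of P(g1, g2 | same person) / P(g1, g2 | unrelated people)."""
    priors = genotype_priors(maf)
    same = sum(priors[g]
               * obs_likelihood(g1, g, error_rate)
               * obs_likelihood(g2, g, error_rate) for g in range(3))
    marg1 = sum(priors[g] * obs_likelihood(g1, g, error_rate) for g in range(3))
    marg2 = sum(priors[g] * obs_likelihood(g2, g, error_rate) for g in range(3))
    return math.log10(same / (marg1 * marg2))


def pair_lod(genos_a, genos_b, mafs, error_rate=0.01):
    """Sum per-site LODs over sites where both samples have a call (not None)."""
    return sum(site_lod(a, b, maf, error_rate)
               for a, b, maf in zip(genos_a, genos_b, mafs)
               if a is not None and b is not None)


def best_match(array_genos, ngs_samples, mafs):
    """Return the name of the NGS sample with the highest LOD against the array."""
    return max(ngs_samples,
               key=lambda name: pair_lod(array_genos, ngs_samples[name], mafs))
```

Under this model, two samples that share a rare genotype score much higher than two that share a common one, which is the information the allele frequencies contribute. In use, `best_match` would take one array sample's genotypes at the fingerprint sites plus a dictionary of NGS sample genotypes at the same sites.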

  • santiagorevale Argentina Member

    Hi @shlee,

    Regarding the bundle, I was able to find it. However, I could not find any haplotype map file there. What I was asking is whether you are already using such a file and would be able to share it as well.

    Thanks for pointing me in the direction of VerifyBamID to use in this case. I was able to solve my issue with it.

    However, I'm still interested in using these commands in the future. Once these commands accept a haplotype map file in VCF format, would it be possible for you to develop a tutorial on how to build it? That would be great, because it's a required input for them.

    Thank you very much for your help.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @santiagorevale,

    I have already asked this on behalf of our users, and unfortunately we don't have permission at this time to share the particular file our group uses. I will ask again in a few months.

    Are you asking for a tutorial on how to process a VCF to the haplotype map format? Sure, I will keep this in mind for a tutorial going forward.

  • @shlee Is there any update with regard to whether you guys can share the particular file your group uses? Would be most useful rather than having to construct our own.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi again @damagingcamel,

    If we are sharing it, then it would be in https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0. I will double-check with the team.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Alright @damagingcamel,

    I have heard that there may be a haplotype map file accessible through a FireCloud workspace. I think your best bet is to ask on the FireCloud forum.

  • Thanks @shlee, I'll look into that.

  • FPBarthel Houston Member ✭✭

    I'm also very interested in finding a haplotype map file and have been scavenging around for one, but have had no luck so far. The fingerprint_map tool is not working as expected. At the moment I am using NGSCheckMate instead. Any pointers to finding this file (e.g. where on FireCloud) would be helpful.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @FPBarthel,

    Similar discussions asking for this resource are here and here. The latter discussion has a suggestion to make your own HaplotypeMap with the GetPileupSummaries resource (the small_exac_common resource) as well as a link to the fingerprint_map tool you reference. If you are having trouble using the fingerprint_map tool, I encourage you to open an issue ticket in their project repository. You will need to be signed into GitHub and can post the issue using this link. Otherwise, you can describe the problem you are having with the fingerprint_map in this discussion thread and I will try to pass on the word.

  • FPBarthel Houston Member ✭✭

    Thanks @shlee! I read those issues as well, and using fingerprint_maps seemed like the best approach, but I am having trouble running the code. I created an issue on GitHub here.
