Picard haplotype map file format

shlee, Cambridge. Member, Broadie, Moderator
edited July 2017 in Dictionary

The haplotype map that certain Picard tools require is a file that maps SNPs to LD (linkage disequilibrium) blocks. These tools include Picard CrosscheckReadGroupFingerprints and CheckFingerprint. For these tools, the HAPLOTYPE_MAP parameter defines the file.

  • For details on what the tools do and their parameters, see https://broadinstitute.github.io/picard/.
  • For an overview of fingerprinting math and comparative results for different data types, see the related poster. You can find a link to posters on this page.
  • To view the javadoc documentation for tools within the Picard Jar, type

    java -jar picard.jar <tool name> -h
    

Originally (as of 5/5/2017, Picard v2.9.0), the HAPLOTYPE_MAP file was a tab-separated text file only. Picard v2.10.1 and later also accept a VCF-based file ending in .vcf, .vcf.gz or .bcf. Tools interpret all other file extensions for this parameter as the original text-based format.

These two formats differ in their requirements as we outline below.


The original haplotype map file format

The file has a header and a body.

[image: example haplotype map file showing its header and body]

The header is a standard SAM header, with an @HD line to define the file type and @SQ lines to define the reference contigs. You can easily derive such a header from your reference dictionary file.
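As a minimal sketch of deriving the header, the Python below copies the @HD and @SQ lines out of a sequence dictionary. It assumes the dictionary is itself a SAM-style header, as Picard CreateSequenceDictionary writes it. The function name and the sample contigs are illustrative only and are not part of Picard.

```python
def header_from_dict_text(dict_text):
    """Return the @HD and @SQ lines of a sequence dictionary as header lines."""
    kept = []
    for line in dict_text.splitlines():
        # Keep only the record types the haplotype map header needs.
        if line.startswith("@HD") or line.startswith("@SQ"):
            kept.append(line)
    return kept


# Illustrative dictionary content with made-up contig names and lengths.
example_dict = (
    "@HD\tVN:1.5\tSO:unsorted\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@SQ\tSN:chr2\tLN:500\n"
    "@CO\ta comment line that is not copied\n"
)

for header_line in header_from_dict_text(example_dict):
    print(header_line)
```

The printed lines can then be placed at the top of the haplotype map file, ahead of the column header line.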

The body contains a column header line starting with a # hash followed by lines that annotate SNPs and blocks in high LD.

  • NAME is a SNP identifier, e.g. dbSNP rsID
  • MAF is minor allele frequency
  • ANCHOR_SNP refers to the NAME of a SNP that groups SNPs in high LD with each other. The tool counts all of the SNPs with the same ANCHOR_SNP as one group.
  • Although the column header requires the PANELS label, the PANELS column field value is optional.

SNPs listed with the same ANCHOR_SNP belong to the same haplotype block. If the MAFs within a block disagree, the tool uses the MAF of the first SNP, i.e. the SNP with the smallest genomic position, as the MAF of the block.
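To make the grouping rule concrete, here is a rough Python sketch that groups body records by ANCHOR_SNP and takes each block's MAF from the SNP with the smallest position. It is not Picard's implementation. The column order is our assumption, since this post names only NAME, MAF, ANCHOR_SNP and PANELS, so check it against the # column header line of your own file. The sketch also assumes that an empty ANCHOR_SNP means the SNP anchors its own block, and the SNP names are made up.

```python
# Assumed column order; confirm against the '#' column header line of your file.
COLUMNS = ["CHROMOSOME", "POSITION", "NAME", "MAJOR_ALLELE",
           "MINOR_ALLELE", "MAF", "ANCHOR_SNP", "PANELS"]


def group_blocks(body_lines):
    """Group SNP records by ANCHOR_SNP and report each block's SNPs and MAF."""
    blocks = {}
    for line in body_lines:
        if not line.strip() or line.startswith(("#", "@")):
            continue  # skip blanks, the column header, and SAM header lines
        fields = line.rstrip("\n").split("\t")
        record = dict(zip(COLUMNS, fields))  # a missing PANELS value is simply absent
        # Assumption: an empty ANCHOR_SNP means the SNP anchors its own block.
        anchor = record.get("ANCHOR_SNP") or record["NAME"]
        blocks.setdefault(anchor, []).append(record)

    summary = {}
    for anchor, records in blocks.items():
        # The block MAF is that of the SNP with the smallest genomic position.
        first = min(records, key=lambda r: int(r["POSITION"]))
        summary[anchor] = {
            "snps": [r["NAME"] for r in records],
            "maf": float(first["MAF"]),
        }
    return summary


# Made-up records: snp_b is in high LD with snp_a; snp_c stands alone.
example_body = [
    "#CHROMOSOME\tPOSITION\tNAME\tMAJOR_ALLELE\tMINOR_ALLELE\tMAF\tANCHOR_SNP\tPANELS",
    "chr1\t100\tsnp_a\tA\tG\t0.30\t\t",
    "chr1\t150\tsnp_b\tC\tT\t0.25\tsnp_a",
    "chr2\t200\tsnp_c\tG\tA\t0.10\t\t",
]

for anchor, block in group_blocks(example_body).items():
    print(anchor, block["snps"], block["maf"])
```

Here snp_a and snp_b form one block whose MAF comes from snp_a, the record at the smaller position, even though snp_b lists a different MAF.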


The VCF-based haplotype map

Picard v2.10.1+ (released 2017/7/11) accepts this format. Tools recognize the VCF format if the file name ends in .vcf, .vcf.gz or .bcf. Tools interpret all other file extensions as the original text-based format we describe above.
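The extension rule amounts to a simple suffix check, sketched below in Python. The function name is hypothetical. Whether Picard compares extensions case-sensitively is not stated here, so this sketch assumes lowercase extensions exactly as listed.

```python
VCF_EXTENSIONS = (".vcf", ".vcf.gz", ".bcf")


def haplotype_map_format(path):
    """Classify a HAPLOTYPE_MAP path as 'vcf' or 'text' by its extension."""
    # Any extension other than the three VCF ones means the text-based format.
    return "vcf" if path.endswith(VCF_EXTENSIONS) else "text"


for name in ("map.vcf", "map.vcf.gz", "map.bcf", "map.txt", "map.vcf.txt"):
    print(name, haplotype_map_format(name))
```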

The original post links to a downloadable example file and shows the body portion of that file.

[image: body portion of the example VCF-based haplotype map]

  • The VCF format haplotype map contains exactly one sample whose genotype calls are all heterozygous, e.g. 0/1 or 0|1.
  • The tool determines haplotype block grouping using phased genotypes (with a pipe |) and the PS (phase set) format field annotation.
  • The INFO field's AF annotation refers to the alternate allele frequency. This is not necessarily the minor allele frequency. This differs from the original haplotype map file format's requirement.
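The sketch below illustrates these three points in Python on made-up, single-sample VCF data lines. It groups phased records by contig and PS value, and converts AF to a minor allele frequency with min(AF, 1 - AF). It is a hand-rolled illustration, not Picard code. Treating an unphased record as a block of its own is our assumption, and the function names are hypothetical.

```python
def parse_vcf_record(line):
    """Extract contig, position, ID, AF, phasing and PS from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    phased = "|" in sample["GT"]
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "id": fields[2],
        "af": float(info["AF"]),
        "phased": phased,
        # The PS value is ignored when the genotype is unphased.
        "ps": sample.get("PS") if phased else None,
    }


def group_by_phase_set(records):
    """Group record IDs into blocks keyed by contig and phase set."""
    blocks = {}
    for rec in records:
        if rec["phased"]:
            key = (rec["chrom"], rec["ps"])
        else:
            # Assumption: an unphased SNP forms a block by itself.
            key = (rec["chrom"], "unphased", rec["pos"])
        blocks.setdefault(key, []).append(rec["id"])
    return blocks


def minor_allele_frequency(af):
    """AF is the alternate allele frequency; the minor allele is the rarer one."""
    return min(af, 1.0 - af)


example_lines = [
    "chr1\t100\tsnp_a\tA\tG\t.\t.\tAF=0.30\tGT:PS\t0|1:100",
    "chr1\t150\tsnp_b\tC\tT\t.\t.\tAF=0.75\tGT:PS\t0|1:100",
    "chr2\t200\tsnp_c\tG\tA\t.\t.\tAF=0.10\tGT\t0/1",
]

records = [parse_vcf_record(line) for line in example_lines]
print(group_by_phase_set(records))
print([minor_allele_frequency(rec["af"]) for rec in records])
```

In this example snp_b has AF=0.75, so its alternate allele is the major allele and the minor allele frequency is 0.25.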

Finally, the VCF specification (v4.2) defines the PS field as follows.

PS : phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer)

