picard CrosscheckReadGroupFingerprints

Hello,
In the picard CrosscheckReadGroupFingerprints command, it asks for a Haplotype_map file. What format does Haplotype_map file need to be in? If you have multiple BAM files with the same RG tag, would this command work? If not, what is the best way to edit 6000+ BAM files to have unique RG tag names?
Thank you.
Devin Porter
Best Answers
shlee Cambridge ✭✭✭✭✭
Oh sorry, I spoke on behalf of our new Picard tool, CheckFingerprint. I imagine the same would apply to CrosscheckReadGroupFingerprints.
Also, I double-checked with our developer and it appears the file format is not VCF. So I have to describe it to you. It is a text-based file with a header and body.
The header is much like a BAM header (edited; I originally wrote VCF), e.g.:
@HD	VN:1.4	GO:none	SO:coordinate
@SQ	SN:1	LN:249250621	AS:GRCh37	UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta	M5:1b22b98cdeb4a9304cb5d48026a85128	SP:Homo Sapiens
...
And the body contains these tab-separated columns:
#CHROMOSOME POSITION NAME MAJOR_ALLELE MINOR_ALLELE MAF ANCHOR_SNP PANELS
For example
@HD	VN:1.4	GO:none	SO:coordinate
@SQ	SN:1	LN:249250621	AS:GRCh37	UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta	M5:1b22b98cdeb4a9304cb5d48026a85128	SP:Homo Sapiens
#CHROMOSOME	POSITION	NAME	MAJOR_ALLELE	MINOR_ALLELE	MAF	ANCHOR_SNP	PANELS
1	29478419	rs2503005	C	T	0.483791	rs2230678	panel1
1	29497820	rs1994859	C	T	0.483791	rs2230678	panel2
1	29500995	rs2486204	T	C	0.484167	rs2230678
1	29509603	rs2428556	C	T	0.728976	rs2230679	panel1
NOTE: SNPs listed with the same ANCHOR_SNP will be in the same haplotype. In case of discrepancy between the MAFs within a block, the MAF of the first (smallest genomic position) SNP in the block is considered the MAF of the block.
NOTE: the PANEL field is optional (as a value, not in the header).
Where NAME is a SNP identifier, e.g. a dbSNP rsID, MAF is the minor allele frequency, and ANCHOR_SNP refers to the NAME of a SNP that groups SNPs in high LD with each other. The tool counts all of the SNPs with the same ANCHOR_SNP as one group. The last column is likely not necessary.

As for your second question, if I understand correctly, the purpose of the tool is to look for sample swaps. If you are feeding per-file data into the tool, then identical read groups across the files should not matter. If the identical read groups are within the same file, then I think you could take advantage of these tool options:
CROSSCHECK_SAMPLES=Boolean
Instead of producing the normal comparison of read-groups, roll fingerprints up to the sample level and print out a sample x sample matrix with LOD scores. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

CROSSCHECK_LIBRARIES=Boolean
Instead of producing the normal comparison of read-groups, roll fingerprints up to the library level and print out a library x library matrix with LOD scores. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

EXPECT_ALL_READ_GROUPS_TO_MATCH=Boolean
Expect all read groups' fingerprints to match, irrespective of their sample names. By default (with this value set to false), read groups with different sample names are expected to mismatch, and those with the same sample name are expected to match. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
If you can describe your setup in more detail, perhaps we can help more on this part of your question.
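To make the file layout described above concrete, here is a minimal Python sketch that reads a haplotype map and applies the block-MAF rule from the first NOTE. It is not part of Picard; the function names `parse_haplotype_map` and `block_mafs` are made up for illustration, the example header is abbreviated, and treating a SNP with an empty ANCHOR_SNP as its own block is an assumption, not something stated in this thread.

```python
from collections import defaultdict

COLUMNS = ["CHROMOSOME", "POSITION", "NAME", "MAJOR_ALLELE",
           "MINOR_ALLELE", "MAF", "ANCHOR_SNP", "PANELS"]


def parse_haplotype_map(text):
    """Split haplotype map text into header lines and a list of SNP rows."""
    header, snps = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):      # BAM-style header line
            header.append(line)
            continue
        if line.startswith("#"):      # column-name line
            continue
        fields = line.split("\t")
        # The PANELS value is optional, so pad short rows with empty strings.
        fields += [""] * (len(COLUMNS) - len(fields))
        row = dict(zip(COLUMNS, fields))
        row["POSITION"] = int(row["POSITION"])
        row["MAF"] = float(row["MAF"])
        snps.append(row)
    return header, snps


def block_mafs(snps):
    """Return {anchor: MAF}, using the MAF of the lowest-position SNP per block."""
    blocks = defaultdict(list)
    for snp in snps:
        # Assumption: a SNP with no anchor forms a block under its own name.
        blocks[snp["ANCHOR_SNP"] or snp["NAME"]].append(snp)
    return {anchor: min(members, key=lambda s: s["POSITION"])["MAF"]
            for anchor, members in blocks.items()}


EXAMPLE = "\n".join([
    "@HD\tVN:1.4\tGO:none\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",  # abbreviated header for the example
    "#" + "\t".join(COLUMNS),
    "1\t29478419\trs2503005\tC\tT\t0.483791\trs2230678\tpanel1",
    "1\t29497820\trs1994859\tC\tT\t0.483791\trs2230678\tpanel2",
    "1\t29500995\trs2486204\tT\tC\t0.484167\trs2230678",
    "1\t29509603\trs2428556\tC\tT\t0.728976\trs2230679\tpanel1",
])

header, snps = parse_haplotype_map(EXAMPLE)
print(block_mafs(snps))
```

Running a check like this on a hand-built file is a cheap way to catch rows with missing columns or non-numeric MAF values before handing the file to the tool.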
yfarjoun Broad Institute ✭✭✭
Hi,
I think that CrosscheckReadGroupFingerprints is very appropriate for your use case. However, it currently cannot compare a BAM to a VCF. In a short while there will be a version that can.
The one step that you need to do (in the meantime?) is to prepare a haplotype_database file as described above. Ideally your panel contains common and independent variants. If it does not, the results you get from the tool will be less accurate.
200 sites should be more than enough to identify your files.
Answers
Hi @dporter8,
I'll have to double-check what is absolutely required. What I can tell you now is that the file I have that works with the tool looks like a sites-only VCF file. It has a header and a body.
If this does not work for you, please let me know. I will look in more detail into what is required.
Thanks for gathering this information for me. It is quite helpful. Specifically, I am trying to deconvolute 12 cell lines from one sequencing lane. I performed a single-cell RNA-seq experiment using the 10X genomics chromium system. I have SNP information on hundreds of DO mESCs and from 10X barcode system, I can get reads that come from individual cells. I filtered my BAM file down to the SNP regions of interest, then sorted out the reads into 6000 individual barcode BAM files. I already have a reference vcf file containing eQTLs on each chromosome for 200 mouse ESCs and have it in a VCF format. My plan was to correlate the SNPs from my individual barcoded vcf files with my reference file to find a matching cell line, but when I read the description of the picard fingerprint tool, I thought maybe it could be a better way of doing this.
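The correlation plan described above could be sketched roughly as follows. This is only an illustration of simple allele concordance, not what the Picard fingerprinting tools do (they report LOD scores); the function name `best_matching_line` and both input layouts are hypothetical.

```python
def best_matching_line(cell_calls, reference_genotypes, min_sites=1):
    """Pick the reference cell line whose genotypes best agree with one cell.

    cell_calls: {snp_id: observed_allele} for a single barcode.
    reference_genotypes: {line_name: {snp_id: set_of_alleles}}.
    Returns (best_line_or_None, {line_name: fraction_of_matching_sites}).
    """
    scores = {}
    for line, genotypes in reference_genotypes.items():
        # Only sites genotyped in both the cell and the reference line count.
        shared = [snp for snp in cell_calls if snp in genotypes]
        if len(shared) < min_sites:
            continue
        matches = sum(cell_calls[snp] in genotypes[snp] for snp in shared)
        scores[line] = matches / len(shared)
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores


# Toy data, invented for the example.
cell = {"rs1": "C", "rs2": "T", "rs3": "G"}
reference = {
    "lineA": {"rs1": {"C"}, "rs2": {"T"}, "rs3": {"A", "G"}},
    "lineB": {"rs1": {"T"}, "rs2": {"T"}, "rs3": {"A"}},
}
best, scores = best_matching_line(cell, reference)
print(best, scores)
```

With sparse single-cell coverage, many barcodes will share only a handful of sites with the reference, so a `min_sites` threshold is worth tuning before trusting any assignment.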
Hi @dporter8, I'll ask a developer to address your question. Notice that I've made additions to my description of the haplotype map file above.
yfarjoun,
Thanks for confirming the utility of this tool. About creating the haplotype database: we have GigaMUGA SNP arrays on about 900 DO mESC lines. Do you know of any tools that could use this data to create the haplotype map database? Would DOQTL work for this?
Thanks.
Devin
I am not familiar with any tool that can generate a haplotype database. You should be able to LD-prune with plink (https://www.cog-genomics.org/plink2/ld) and convert the resulting file into the Haplotype Database format using hand-written scripts.
Actually, I think I misunderstood your question... If the regions you are looking at do not have variants, then the fingerprinting will not work. Are the sites in question variants, or just regions with differential coverage between the samples? Are we talking about one individual, or many individuals?
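A hand-written conversion script of the kind mentioned above might look like the sketch below. It assumes you already have per-SNP records (chromosome, position, ID, alleles, MAF) and a set of variant IDs retained by LD pruning. The function name `format_haplotype_map` and the record layout are hypothetical, and two choices are assumptions to verify against the official format description: writing each pruned SNP's own name as its ANCHOR_SNP, and leaving the PANELS value empty.

```python
COLUMNS = ["CHROMOSOME", "POSITION", "NAME", "MAJOR_ALLELE",
           "MINOR_ALLELE", "MAF", "ANCHOR_SNP", "PANELS"]


def format_haplotype_map(header_lines, snps, keep_ids=None):
    """Build haplotype map text from SNP records.

    header_lines: BAM-style header lines (@HD, @SQ, ...) for the reference.
    snps: iterable of dicts with keys chrom, pos, name, major, minor, maf,
          supplied in reference order (this sketch does not sort them).
    keep_ids: optional set of SNP names to retain, e.g. after LD pruning.
    """
    lines = list(header_lines)
    lines.append("#" + "\t".join(COLUMNS))
    for snp in snps:
        if keep_ids is not None and snp["name"] not in keep_ids:
            continue
        lines.append("\t".join([
            str(snp["chrom"]),
            str(snp["pos"]),
            snp["name"],
            snp["major"],
            snp["minor"],
            f"{snp['maf']:.6f}",
            snp["name"],  # assumption: an independent SNP anchors itself
            "",           # assumption: PANELS value left empty
        ]))
    return "\n".join(lines) + "\n"


# Toy records, invented for the example.
records = [
    {"chrom": "1", "pos": 100, "name": "rsA", "major": "C", "minor": "T", "maf": 0.25},
    {"chrom": "1", "pos": 150, "name": "rsB", "major": "G", "minor": "A", "maf": 0.25},
]
print(format_haplotype_map(["@HD\tVN:1.4\tSO:coordinate"], records, keep_ids={"rsA"}))
```

Because pruned SNPs are treated as independent here, every row ends up in its own block; if you want to keep SNPs in high LD together, you would instead assign them a shared ANCHOR_SNP.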
I now describe the haplotype map format officially at http://gatkforums.broadinstitute.org/dsde/discussion/9526. Thanks for bringing this need to our attention @dporter8.