vendor pre-built reference package

zwzhang Member, Broadie

Hi FireCloud team,
I am aligning my WGS data with a vendor's pre-built hg38 reference package, which contains decoy contigs but, unlike the standard hg38 reference assembly, omits the alternate haplotypes and MHC alleles.

I recently tried running the Mutect2 and somatic CNV workflows on FireCloud for BAMs aligned with the vendor reference package, but the FireCloud attributes I assigned were built from the standard hg38 assembly (which I got from the Broad data resource bundle, etc.). Unfortunately, I have not gotten it to work yet, and I am not sure whether that is because of the different reference used.

Do you suggest using the vendor reference for the ref_fasta, ref_dict, and ref_fai attributes on FireCloud? If so, I don't think I can use other reference resources such as the PoN, 1000G, or gnomAD files, because I think they were all made with the standard reference build.

Please advise on what I should do.

Thank you very much

Answers

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Hi @zwzhang, thank you so much for your patience while you've been waiting for me to get back to you.

    With regard to using your vendor hg38 versus the standard one, I do know you should use the same reference assembly throughout your process. However, your case may be an exception, because both references should have the same formatting. I am going to transfer this thread to the GATK forum, where one of my colleagues, @Sheila or @shlee, can better answer your question.

    I did want to ask you, though: when you say you have not made it work yet, what error message are you getting? If it is related to the reference issue you discussed, then Sheila or Soo Hee can help you. If it is a separate FireCloud issue, I'll come back to this thread to help with that.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @zwzhang,

    Kate has asked me to look into your question. I'm with GATK support.

    "vendor pre-built hg38 reference package, which contains decoy contigs, but not alternate haplotypes and MHC alleles"

    So you are using a reference set that omits the alternate haplotype contigs, and you tried to use it with the full GRCh38 reference set, which includes the alternate haplotypes, that a workflow had preconfigured.

    To figure out why your run is not working, it would help us to see the error message that GATK produces for the run. Can you post your error message? @KateN can direct you to where to find it on the FireCloud platform.

    The GATK engine performs checks to ensure that a provided reference matches the one indicated in the provided data. For example, if your command takes an alignment BAM and you also provide a reference with the -R parameter, then the check happens. One thing to note about GATK4, the latest release, which I believe you are using, is that it is more relaxed about reference checks than prior GATK versions. For example, many tools now make the -R reference optional where it was previously required. It is also possible to adjust the stringency of the check so that you can use mismatching reference sets (it is up to you to check the scientific validity of doing so). You can check the available options for a tool by running the tool by name, e.g. gatk Mutect2 for Mutect2. You will then see options such as --sequence-dictionary and --disable-sequence-dictionary-validation explained. These two options may be of interest to you, depending on the nature of your error message.
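
    Before relaxing any check, it can help to see exactly how two sequence dictionaries differ. The following is a minimal Python sketch, not part of GATK, that assumes the dictionaries are ordinary .dict files or SAM headers, where each contig is an @SQ line carrying SN (name) and LN (length) tags. The function names and the toy contig lengths are illustrative assumptions only; in practice you would read the text of your vendor .dict and the Broad bundle .dict from disk.

```python
from typing import Dict, List, Tuple


def parse_sequence_dictionary(text: str) -> Dict[str, int]:
    """Return {contig name: length} from the @SQ lines of a .dict file or SAM header."""
    contigs: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD, @PG, and other header records
        # Fields are tab-separated TAG:VALUE pairs. Split on the first colon only,
        # because HLA contig names themselves contain colons.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def compare_dictionaries(
    a: Dict[str, int], b: Dict[str, int]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (contigs only in a, contigs only in b, shared contigs whose lengths differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    length_mismatch = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_a, only_b, length_mismatch


# Toy dictionaries; lengths other than chr1 are placeholders, not real values.
VENDOR_DICT = (
    "@HD\tVN:1.5\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chrUn_KN707606v1_decoy\tLN:2000\n"
)
FULL_DICT = (
    "@HD\tVN:1.5\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr1_KI270762v1_alt\tLN:1000\n"
    "@SQ\tSN:chrUn_KN707606v1_decoy\tLN:2000\n"
    "@SQ\tSN:HLA-A*01:01:01:01\tLN:1000\n"
)

if __name__ == "__main__":
    vendor = parse_sequence_dictionary(VENDOR_DICT)
    full = parse_sequence_dictionary(FULL_DICT)
    only_vendor, only_full, mismatched = compare_dictionaries(vendor, full)
    print("Only in vendor reference:", only_vendor)
    print("Only in full reference:", only_full)
    print("Shared contigs with different lengths:", mismatched)
```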

    As Kate states, it is best for you to use the matching reference. However, your case is special in that what is dropped are the alternate contigs, whose regions are represented, at least for one haplotype, by the primary assembly.
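
    One way to confirm that only alternate and HLA contigs are missing from the vendor reference is to group the missing contig names by type. The sketch below is a hypothetical helper, not a GATK tool, and it assumes the naming conventions used in the Broad hg38 resource bundle (an _alt suffix for alternate contigs, a _decoy suffix for decoys, an HLA- prefix for HLA alleles, and _random or chrUn_ for unlocalized and unplaced contigs). A vendor package may name contigs differently, so treat the categories as a first pass.

```python
from collections import Counter
from typing import Dict, Iterable

# Primary assembly contigs under UCSC-style naming (assumed here).
PRIMARY_CONTIGS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def classify_contig(name: str) -> str:
    """Guess a contig's category from its name (naming conventions are assumptions)."""
    if name in PRIMARY_CONTIGS:
        return "primary"
    if name.startswith("HLA-"):
        return "hla"
    if name.endswith("_alt"):
        return "alt"
    # Check the decoy suffix before the chrUn_ prefix, since decoys may carry both.
    if name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_random") or name.startswith("chrUn_"):
        return "unplaced"
    return "other"


def summarize_contigs(names: Iterable[str]) -> Dict[str, int]:
    """Count how many contig names fall into each category."""
    return dict(Counter(classify_contig(name) for name in names))


if __name__ == "__main__":
    # Example: contigs present in the full reference but absent from the vendor one.
    missing = ["chr1_KI270762v1_alt", "HLA-A*01:01:01:01"]
    print(summarize_contigs(missing))
```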

    In general, we recommend that the resources you use match the tool-chain of the data under scrutiny as closely as possible. This allows you to fully leverage the resources. For example, consider the purpose of the PoN in the context of somatic mutation calling (see the last section of Article#11127). What would be the impact of using a mismatched PoN?

    How you perform alignments to a reference set will dictate your choice of resources. Given the vendor reference set, I assume you are performing neither alt-aware alignment nor post-alt processing. However, given the lack of alternate contigs in the vendor reference, it is as if you are performing alt-aware mapping, where alignments to the primary assembly take precedence. In this case, I think it is okay for you to use resources generated by alt-aware mapping to the full GRCh38 reference. However, you should be careful with resources that have additionally undergone post-alt processing.

    I am unfamiliar with the resources preconfigured for the Mutect2 FireCloud workspace. Factors to consider include (i) how they were aligned, (ii) to which reference they were originally aligned, and (iii) whether they were lifted over. I hope I've been helpful. Good luck.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    If you don't already know what error message you're receiving, there are a couple of places you can check. In the Monitor tab for the workflow that you ran, there should be a "Failures" section. If you click to expand it, it will give you an error message. If the error message points you to a specific task, you can scroll down, expand that task, and open the stderr.log file to see more information.
