Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Large number of scaffolds for the reference genome for SNP calling pipeline


I am running the SNP calling pipeline from RNA-seq data. I used STAR for alignment. My reference genome is composed of 32012 scaffolds from which more than 30000 are contigs with length of less than 1000 bp. I first decided to delete these contigs and just keep the superscaffolds, scaffolds and contigs longer than 1000 bp but in the STAR manual, it was recommended to keep all the sequences because other types of RNA can be mapped to them, so, I have retained them.

I have run the rest of the pipeline using these 32012 scaffolds as my reference. However, for the last step which is variant calling, I am going to extract only scaffolds that map to chromosome Z, so, I will have sthg around 40 scaffolds.

As I don't know the algorithms behind GATK and picard, I was wondering whether using these large numbers of primary scaffolds may be problematic at some steps? I am sorry for this general question, I have run the pipeline and have not encountered any problem so far but I was wondering if there is anything specific that I should look into the outputs to make sure everything has gone very well.

Thank you in advance.


Best Answer


  • SheilaSheila Broad InstituteMember, Broadie admin



    This can be an issue, because the GATK engine is not designed to process so many scaffolds. The best thing to do is pare your ~30000 scaffolds down to ~40 scaffolds. This should eliminate any performance issues.


  • HomaHoma Member

    Thank you for your answer. Are the large number of scaffolds the problem only for Haplotype caller or for the whole utilities of GATK? For example for Indel Realigner as I have already run this tool. Thanks for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Some of the tools will struggle more than others depending on how they process the data. To clarify, if the tools run to completion, the results will be completely accurate. The main issue is that the tools will require a lot more computational power, and some people's infrastructure just cannot fulfill the demand (mostly for RAM), so they may run extremely slowly or even freeze.

  • SheilaSheila Broad InstituteMember, Broadie admin


    We are currently trying to ensure GATK runs efficiently on references that have a large number of contigs. Would it be possible for you to share some test data with us? We would need your reference that has the large number of contigs and a snippet of a BAM file that is aligned to the reference. Instructions for sharing are here.


  • qwedsafqwedsaf Member

    Apologies for resurrecting an old thread, but I have a related question. I'm mapping reads and calling variants against the rhesus macaque reference Mmul_8.0.1 (http://useast.ensembl.org/Macaca_mulatta/Info/Index). The toplevel FASTA file contains 22 primary chromosomes and >250k unplaced scaffolds. I want to include these scaffolds in read mapping to minimize false mapping to chromosomes, but when I try to run HaplotypeCaller (3.6) against the reference genome containing all these sequences it stalls, likely because of the issue mentioned above.

    If I make a reference genome from only the chromosomes, I can get HaplotypeCaller to call variants against a BAM file with reads mapped to both the primary chromosome and the unplaced scaffolds, but only when running in unsafe mode.

    What is the best current advice for this sort of situation? Thanks!


Sign In or Register to comment.