
Large number of scaffolds for the reference genome for SNP calling pipeline


I am running a SNP-calling pipeline on RNA-seq data, using STAR for alignment. My reference genome consists of 32,012 scaffolds, of which more than 30,000 are contigs shorter than 1,000 bp. I initially planned to drop these short contigs and keep only the superscaffolds, scaffolds, and contigs longer than 1,000 bp, but the STAR manual recommends keeping all sequences because other types of RNA can map to them, so I retained them.
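For reference, the length-based filtering described above (keeping only sequences of at least 1,000 bp) can be sketched in plain Python with no dependencies. This is a minimal illustration, not part of the original pipeline; the file names (`genome.fa`, `filtered.fa`) are placeholders.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def filter_by_length(in_path, out_path, min_len=1000):
    """Write only records with sequence length >= min_len; return count kept."""
    kept = 0
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            if len(seq) >= min_len:
                out.write(f"{header}\n{seq}\n")
                kept += 1
    return kept

# Placeholder usage:
# filter_by_length("genome.fa", "filtered.fa", min_len=1000)
```

In practice `samtools faidx` on a list of scaffold names achieves the same subsetting, but the sketch shows the idea without extra tools.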

I have run the rest of the pipeline using all 32,012 scaffolds as my reference. However, for the last step, variant calling, I am going to extract only the scaffolds that map to chromosome Z, which leaves me with roughly 40 scaffolds.
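One way to restrict variant calling to a chosen subset of scaffolds, without rebuilding the reference, is GATK's `-L` option, which accepts a file containing one contig name per line. A minimal sketch for writing such a list (the scaffold names shown are invented placeholders, not from the poster's assembly):

```python
def write_interval_list(scaffold_names, out_path):
    """Write a GATK-style intervals file: one contig name per line."""
    with open(out_path, "w") as out:
        for name in scaffold_names:
            out.write(name + "\n")
    return len(scaffold_names)

# Placeholder usage; then pass the file to a GATK tool, e.g. -L z_scaffolds.list
# write_interval_list(["scaffold_12", "scaffold_87"], "z_scaffolds.list")
```

The contig names must match the reference's sequence dictionary exactly for GATK to accept the list.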

Since I don't know the algorithms behind GATK and Picard, I was wondering whether this large number of primary scaffolds could be problematic at some steps. Sorry for the general question; I have run the pipeline without encountering any problems so far, but I would like to know if there is anything specific I should check in the outputs to make sure everything has gone well.

Thank you in advance.


Best Answer


  • Sheila (Broad Institute; Member, Broadie)



    This can be an issue, because the GATK engine is not designed to process so many scaffolds. The best thing to do is pare your ~30000 scaffolds down to ~40 scaffolds. This should eliminate any performance issues.


  • Homa (Member)

    Thank you for your answer. Is the large number of scaffolds a problem only for HaplotypeCaller, or for all of the GATK tools? For example, IndelRealigner, which I have already run. Thanks for your help.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Some of the tools will struggle more than others, depending on how they process the data. To clarify: if the tools run to completion, the results will be completely accurate. The main issue is that the tools will require much more computational power (mostly RAM), and some people's infrastructure simply cannot meet the demand, so the tools may run extremely slowly or even freeze.

  • Sheila (Broad Institute; Member, Broadie)


    We are currently trying to ensure GATK runs efficiently on references that have a large number of contigs. Would it be possible for you to share some test data with us? We would need your reference that has the large number of contigs and a snippet of a BAM file that is aligned to the reference. Instructions for sharing are here.


  • qwedsaf (Member)

    Apologies for resurrecting an old thread, but I have a related question. I'm mapping reads and calling variants against the rhesus macaque reference Mmul_8.0.1 (http://useast.ensembl.org/Macaca_mulatta/Info/Index). The toplevel FASTA file contains 22 primary chromosomes and >250k unplaced scaffolds. I want to include these scaffolds during read mapping to minimize false mapping to the chromosomes, but when I try to run HaplotypeCaller (GATK 3.6) against the reference containing all of these sequences, it stalls, likely because of the issue mentioned above.

    If I make a reference genome from only the chromosomes, I can get HaplotypeCaller to call variants against a BAM file with reads mapped to both the primary chromosome and the unplaced scaffolds, but only when running in unsafe mode.

    What is the best current advice for this sort of situation? Thanks!

