If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Using hg19 with genomeSTRiP: chrM sorted first

I've been trying to get genomeSTRiP running with a set of BAMs that are the output from the GATK best practices pipeline. They have all been aligned to the gatk bundle version of hg19.

I'm using a modified version of the preprocessing queue script from the 2013 workshop tutorial.

I generated my own svmask based on my version of hg19

Previously I was fighting some problems with the CN2 file from the bundle ( not matching hg19. I solved this by
1. adding chr to each chromosome name
2. changing chrMT to chrM
3. editing the naming of the random regions:
sed \
-e 's/>chrGL000191.1/>chr1_gl000191_random/' \
-e 's/>chrGL000192.1/>chr1_gl000192_random/' \
-e 's/>chrGL000193.1/>chr4_gl000193_random/' \
-e 's/>chrGL000194.1/>chr4_gl000194_random/' \
... etc
4. adding in entries at the bottom of the CN2 mask corresponding to entries in my svmask that did not appear in the download CN2 containing lists of 0's the length of the given contigs in hg19

1. Does the above method of dealing with the CN2 file sound correct? In the CN2 mask file does 0 stand for exclude, or include

  1. I'm now hitting on a chromosome ordering issue now. Running the hg19 version from the resource bundle I'm getting the error:

INFO 15:18:52,483 MetaData - Loading insert size distributions ...
INFO 15:18:52,524 ComputeReadDepthCoverageWalker - Using genome mask /ifs/labs/cccb/projects/cccb/apps/svtoolkit_metadata/svmasks/ucsc.hg19.mask.75.fasta
INFO 15:19:20,086 ProgressMeter - Starting 0.00e+00 30.0 s 49.6 w 100.0% 30.0 s 0.0 s

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException: The 1th sequences in the reference /ifs/labs/cccb/projects/db/gatk/hg19 is named chrM
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(
at org.broadinstitute.sting.commandline.CommandLineProgram.start(
at org.broadinstitute.sting.commandline.CommandLineProgram.start(

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.5-2-g5671483):
ERROR Please check the documentation guide to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions
ERROR MESSAGE: The 1th sequences in the reference /ifs/labs/cccb/projects/db/gatk/hg19 is named chrM
ERROR ------------------------------------------------------------------------------------------

Do I really have to assemble a new FASTA with the hg19 individual chromosomes as described here:

And once I've done that, are my alignments that were done against the canonical hg19 contig order going to fail?

I'm just having a hard time believing that the reference from the GATK bundle is the one causing the issue.


Best Answer


  • Rock!
    That seems to have worked. On to the next problem.

Sign In or Register to comment.