
Errors parsing the sample map for GenomicsDBImport, different encoding expected?

I am running GenomicsDBImport from docker://broadinstitute/gatk:4.0.1.2 on a batch of 2400 samples over a 1 Mbp region, and I am getting a spurious file-not-found error, presumably due to a re-encoding of the sample name and file name.

The filenames passed into GenomicsDBImport are loaded from a sample map, and certain special characters get re-encoded further down during processing (note the %23 where the file name has a # in the paths below):

[April 25, 2018 7:13:05 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2076049408
htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to create BasicFeatureReader using feature file , for input source: file:///home/user/genomics/snake4/data/gwas/variants/Hidatsa%231.bam.g.vcf.gz
        at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:113)
        at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromPath(GenomicsDBImport.java:615)
        at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getHeaderFromPath(GenomicsDBImport.java:356)
        at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.initializeHeaderAndSampleMappings(GenomicsDBImport.java:342)
        at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.onStartup(GenomicsDBImport.java:297)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
        at org.broadinstitute.hellbender.Main.main(Main.java:277)
Caused by: java.io.FileNotFoundException: /home/user/genomics/snake4/data/gwas/variants/Hidatsa%231.bam.g.vcf.gz (No such file or directory)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at htsjdk.samtools.seekablestream.SeekableFileStream.<init>(SeekableFileStream.java:47)
        at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:99)
        at htsjdk.tribble.readers.TabixReader.<init>(TabixReader.java:129)
        at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:80)
        at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:106)
        ... 10 more
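
For illustration, the %23 in the failing path is what a # looks like after percent-encoding, which is what happens when a file name is turned into a URI. The sketch below uses Python's standard library only to show that effect on the file name from the trace; it is not GATK's or htsjdk's actual code path, which I have not inspected.

```python
from urllib.parse import quote, unquote

name = "Hidatsa#1.bam.g.vcf.gz"

# '#' (0x23) begins a URI fragment, so percent-encoding rewrites it as '%23'.
encoded = quote(name)
print(encoded)           # Hidatsa%231.bam.g.vcf.gz

# Decoding recovers the on-disk name. Using the encoded string directly as a
# file name would instead look for a file that does not exist.
print(unquote(encoded))  # Hidatsa#1.bam.g.vcf.gz
```

The encoded form matches the name in the FileNotFoundException, which is consistent with the path being encoded once and then opened without being decoded.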

My command line (inside singularity):

export HOME=data/tmp/tmp.gendbimport.XgXABL
/gatk/gatk GenomicsDBImport \
   --java-options '-Xmx20G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true' \
   --genomicsdb-workspace-path data/tmp/tmp.gendbimport.XgXABL/gendb_8beaa85294bc2d920308a33b61ad16f6a8508288_HanXRQChr01-000000001-001000000.db \
   -L HanXRQChr01:000000001-001000000 \
   --batch-size 100 \
   --validate-sample-name-map true \
   --sample-name-map data/gwas/gendb/sample_list_8beaa85294bc2d920308a33b61ad16f6a8508288.txt

(Note: The HOME= setting is a workaround to stop the tiledb library from writing metadata into my home directory and stomping on all the concurrent runs. Hopefully this will be fixed in the near future.)

My sample map has, amongst others, the following problematic entries, which have # characters (hex 0x23) in them:

Hidatsa#1       data/gwas/variants/Hidatsa#1.bam.g.vcf.gz
Mandan#2        data/gwas/variants/Mandan#2.bam.g.vcf.gz
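
With 2400 samples, it may help to list every entry that could be affected. The following is a minimal sketch that flags sample map lines whose path changes under percent-encoding. It assumes the map is tab-separated with the sample name in the first column and the path in the second; the function name is hypothetical, and Python's `quote` is only a stand-in for whatever encoding the tool applies.

```python
from urllib.parse import quote

def entries_changed_by_encoding(sample_map_text):
    """Return (sample, path) pairs whose path is altered by percent-encoding."""
    flagged = []
    for line in sample_map_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        sample, path = fields
        # quote() leaves '/', letters, digits and '._-~' alone by default.
        if quote(path) != path:
            flagged.append((sample, path))
    return flagged

sample_map = (
    "Hidatsa#1\tdata/gwas/variants/Hidatsa#1.bam.g.vcf.gz\n"
    "Mandan#2\tdata/gwas/variants/Mandan#2.bam.g.vcf.gz\n"
)
for sample, path in entries_changed_by_encoding(sample_map):
    print(sample, path)
```

Both entries above are flagged, since each path contains a # that would be rewritten as %23.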

They've been aligned, and then processed through HaplotypeCaller without issues in earlier steps.

The file data/gwas/variants/Hidatsa#1.bam.g.vcf.gz exists, and contains valid information.

I don't necessarily care about the specific name. I could rename the sample, but it might be easier for me to change the encoding in the sample map, if it is a simple matter of passing the info differently.
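
One possible workaround that avoids renaming anything on disk is to symlink each input under a name without special characters and point the sample map at the links. The sketch below is untested against GATK 4.0.1.2 and rests on several assumptions: the map is tab-separated, the tabix index sits next to each file with a .tbi suffix, and only the file path (not the sample name) triggers the problem. The function name and the link directory are hypothetical.

```python
import os
import re

def link_with_safe_names(sample_map_lines, link_dir):
    """Symlink each input under a name limited to [A-Za-z0-9._-] and return
    sample-map lines that point at the links. Sample names are left unchanged."""
    os.makedirs(link_dir, exist_ok=True)
    claimed = {}  # safe file name -> original path, to detect collisions
    rewritten = []
    for line in sample_map_lines:
        sample, path = line.rstrip("\n").split("\t")
        safe = re.sub(r"[^A-Za-z0-9._-]", "_", os.path.basename(path))
        if claimed.setdefault(safe, path) != path:
            raise ValueError(f"{path} and {claimed[safe]} both map to {safe}")
        link = os.path.join(link_dir, safe)
        # Link the VCF and, if present, its .tbi index (assumed naming).
        for suffix in ("", ".tbi"):
            source = os.path.abspath(path + suffix)
            if os.path.exists(source) and not os.path.lexists(link + suffix):
                os.symlink(source, link + suffix)
        rewritten.append(f"{sample}\t{link}")
    return rewritten
```

Calling this with the lines of the original sample map and a scratch directory would yield lines such as Hidatsa#1 mapped to a link named Hidatsa_1.bam.g.vcf.gz, which can be written to a new file and passed via --sample-name-map. Whether the sample name itself can safely keep its # is not something this sketch establishes.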
