SVPreprocess.qError: Cannot determine library identifier for read HS2000-1273_166:4:1116:21037:84252

Hi ,
I have a bam file without library name , although each bam file is one library. The SVPreprocess.q throws a warning and an error in the ComputeInsertSizeHistogramsWalker and ComputeGCProfileWalker steps respectively.

Below is the warning message from ComputeInsertSizeHistogramsWalker

Error: Cannot determine library identifier for read HS2000-1273_166:4:1215:5282:36837

Header from the bam :

samtools view -H test.bam
@SQ SN:GL000216.1 LN:172294
@SQ SN:GL000215.1 LN:172545
@SQ SN:GL000205.1 LN:174588
@SQ SN:GL000219.1 LN:179198
@SQ SN:GL000224.1 LN:179693
@SQ SN:GL000223.1 LN:180455
@SQ SN:GL000195.1 LN:182896
@SQ SN:GL000212.1 LN:186858
@SQ SN:GL000222.1 LN:186861
@SQ SN:GL000200.1 LN:187035
@SQ SN:GL000193.1 LN:189789
@SQ SN:GL000194.1 LN:191469
@SQ SN:GL000225.1 LN:211173
@SQ SN:GL000192.1 LN:547496
@SQ SN:NC_007605 LN:171823
@SQ SN:hs37d5 LN:35477943
@RG ID:LP6005443-DNA_B09 SM:LP6005443-DNA_B09
@PG ID:bwa PN:bwa VN:0.7.10-r1017-dirty CL:bwa-new.kit/bwa mem -p -t4 [email protected]\tID:LP6005443-DNA_B09\tSM:LP6005443-DNA_B09 /home/hl141/bwadb/hs37d5.fa -

Is there an alternate way to specify the library information to match the read group without fixing the headers of the bam since I have to fix hundreds of bams. Also is there a hack around runinng the genotyper. Below is the stack trace I get when I run the genotyper on the same bam.

INFO 14:46:43,661 26-Oct-2015 ReadCountDiskCache - Initializing read count disk cache [/net/eichler/vol24/projects/human_diversity/nobackups/araja/GenomeSTRiP/genotyping/metadata/rccach
e.bin] ...
INFO 14:46:43,662 26-Oct-2015 ReadCountDiskCache - Read count disk cache: no cache files found.
INFO 14:46:44,786 26-Oct-2015 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException: Unexpected library path: LP6005443-DNA_B09/null/null
at org.broadinstitute.sv.genotyping.ReadDepthAlgorithm.initLibraryPathMap(ReadDepthAlgorithm.java:236)
at org.broadinstitute.sv.genotyping.ReadDepthAlgorithm.initialize(ReadDepthAlgorithm.java:68)
at org.broadinstitute.sv.genotyping.GenotypingDepthModule.createReadDepthAlgorithm(GenotypingDepthModule.java:1967)
at org.broadinstitute.sv.genotyping.GenotypingDepthModule.init(GenotypingDepthModule.java:1896)
at org.broadinstitute.sv.genotyping.GenotypingAlgorithm.initModules(GenotypingAlgorithm.java:522)
at org.broadinstitute.sv.genotyping.GenotypingAlgorithm.initialize(GenotypingAlgorithm.java:87)
at org.broadinstitute.sv.genotyping.SVGenotyperWalker.initialize(SVGenotyperWalker.java:217)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:124)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:78)
at org.broadinstitute.sv.main.SVGenotyper.main(SVGenotyper.java:21)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version ):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Unexpected library path: LP6005443-DNA_B09/null/null
ERROR ------------------------------------------------------------------------------------------

Thanks in advance for the help.

Archana

Answers

  • bhandsakerbhandsaker Member, Broadie, Moderator
    edited October 2015

    This answer was incorrect (see below).
    Currently, Genome STRiP requires the LB tag.
    If you do not have one in your bam headers, the only solution that currently works is to reheader your bam files.

    There is one undocumented way, although it has been rarely used:

    In your metadata directory, you can create a file prior to preprocessing named "read_groups.dat".
    It should be tab delimited, with three columns and the following header line: READGROUP SAMPLE LIBRARY.
    This is keyed by read group ID (i.e. the ID field in the @RG record) and can be used to override the library and sample identifier.
    If you override, you need to override both.
    Currently, this can only be enabled by naming the file in this way and putting it in the metadata directory (i.e. prior to preprocessing).

    Post edited by bhandsaker on
  • Hi Bob,
    Thank you very much for the prompt reply. The fix worked and the SVPreprocess is still running but throwing an error in SVPreprocess-16.out. Below is the error message.

    INFO 12:35:16,109 27-Oct-2015 MetaData - Opening metadata ...
    INFO 12:35:16,109 27-Oct-2015 MetaData - Adding metadata directory /GenomeSTRiP/genotyping/metadata ...
    INFO 12:35:16,110 27-Oct-2015 MetaData - Loading read group information from /GenomeSTRiP/genotyping/metadata/read_groups.dat ...
    INFO 12:35:16,111 27-Oct-2015 MetaData - Opened metadata.
    INFO 12:35:16,134 27-Oct-2015 MetaData - Loading insert size distributions ...
    INFO 12:35:16,999 27-Oct-2015 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.NullPointerException
    at java.io.DataOutputStream.writeUTF(Unknown Source)
    at java.io.DataOutputStream.writeUTF(Unknown Source)
    at org.broadinstitute.sv.util.io.BinaryCodec$CodecForReadGroupPath.encodeBody(BinaryCodec.java:337)
    at org.broadinstitute.sv.util.io.BinaryCodec$Codec.encodeObject(BinaryCodec.java:282)
    at org.broadinstitute.sv.util.io.BinaryCodec.encodeStatic(BinaryCodec.java:143)
    at org.broadinstitute.sv.util.io.BinaryCodec.encodeStatic(BinaryCodec.java:124)
    at org.broadinstitute.sv.util.io.BinaryCodec.encode(BinaryCodec.java:63)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileWriter.writeHeaderInternal(ReadCountFileWriter.java:210)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileWriter.writeHeader(ReadCountFileWriter.java:98)
    at org.broadinstitute.sv.metadata.depth.ComputeReadCountsWalker.initialize(ComputeReadCountsWalker.java:101)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:124)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:78)
    at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:59)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version ):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Code exception (see stack trace for error itself)

    "/GenomeSTRiP/genotyping/logs/SVPreprocess-16.out" 53L, 5085C

    How to fix this error?

    Thanks. 1

  • bhandsakerbhandsaker Member, Broadie, Moderator

    I apologize - I was wrong.

    We require the LB tag. There is no way in the current version to fix this except to reheader your bam files.

  • Thanks for the prompt reply Bob.

  • @bhandsaker ,

    cc @Araja @bnelsj

    We are working on genotyping SVs for the Simons Genome Diversity Project: https://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project-dataset/

    Unfortunately, we have 30TBs of local bam files that do not have the LB tag (one library per genome). We do have the SM and RG tags. Re-headering the bams would put us back several weeks/month.

    Is there no work-around?

    We would really love to incorporate genome strip calls and genotypes into our diversity panel.

    Thank you,

    Zev

  • lindenblindenb FranceMember

    I've created a patch for htsjdk . It hacks the htsjdk library : the library returns getSample() when getLibrary() is null.

    https://gist.github.com/lindenb/580c675e52d47fc043fb4c265337701d

    A few questions: The old library was named htsjdk-2.1.0.gs. What is that "gs" ? is there any difference with the official htsjdk 2.1.0 release ?

    And for now, it seems to work but the process seems very slow:

    INFO 22:10:59,230 18-Jul-2016 ProgressMeter - Starting 0.0 20.0 m 1984.4 w 100.0% 20.0 m 0.0 s

  • nickpro9nickpro9 LisboaMember

    Good afternoon,
    I am facing a similar problem with Genome STRiP and read groups.

    This is how my @RD field looks like in one of my files:
    @RG ID:10 SM:176448550 PU:H0E3UALXX:3:TCCGGAGA LB:176448550 PL:ILLUMINA
    @RG ID:11 SM:176448550 PU:H0E3UALXX:4:TCCGGAGA LB:176448550 PL:ILLUMINA
    @RG ID:12 SM:176448550 PU:H0E3UALXX:5:TCCGGAGA LB:176448550 PL:ILLUMINA
    @RG ID:13 SM:176448550 PU:H0E3UALXX:6:TCCGGAGA LB:176448550 PL:ILLUMINA
    @RG ID:14 SM:176448550 PU:H0E3UALXX:7:TCCGGAGA LB:176448550 PL:ILLUMINA
    @RG ID:15 SM:176448550 PU:H0E3UALXX:8:TCCGGAGA LB:176448550 PL:ILLUMINA
    @RG ID:8 SM:176448550 PU:H0E3UALXX:1:TCCGGAGA LB:176448550 PL:ILLUMINA
    @RG ID:9 SM:176448550 PU:H0E3UALXX:2:TCCGGAGA LB:176448550 PL:ILLUMINA

    Since the LB field was absent in the original header, I introduced a mock one knowing that each sample comes from a single library.

    Genome strip fails giving the error :

    ERROR MESSAGE: Cannot determine library identifier for read H0E3UALXX:6:2212:5217826:0

    I was wondering whether this read identifiers are ok.
    To me it looks like these read identifiers are missing some fields (like the tile number within the flow cell lane).
    Do you have any suggestions?

    Thanks

  • bhandsakerbhandsaker Member, Broadie, Moderator

    I would extract the SAM record for that read named in the error message and make sure the RG tag on the SAM record looks correct. And of course make sure the reheadering was done correctly (e.g. tabs not spaces, etc.).

  • nickpro9nickpro9 LisboaMember

    Actually after having manipulated the header of the BAM file, the RG tags on the SAM records were not coherent anymore with the header @RG field.
    Thanks for the helpful suggestion

  • bhandsakerbhandsaker Member, Broadie, Moderator

    Glad you were able to figure it out.

Sign In or Register to comment.