Is there a way to remove the "|" from the SQ line in a .bam file?

grtavera, Case Western Reserve University, Member

We have several unique H. pylori genomes. We aligned each of them to a reference genome that had "|" in the FASTA (.fa) file name. Now, when running them through GATK, we cannot create our final .vcf file. Do we have to remove the "|" from the .fa file and rerun everything from the beginning, or is there another approach?



  • valentin, Cambridge, MA, Member, Dev ✭✭

    Could you post the exception you are getting and an extract of the header that is causing the issue? It sounds like a bug, or perhaps an obscure violation of the SAM format, but I suspect the former.

  • valentin, Cambridge, MA, Member, Dev ✭✭

    It seems that a similar issue has been reported here. There you will find a workaround that you should be able to adapt to your situation.

  • valentin, Cambridge, MA, Member, Dev ✭✭

    Also, could you confirm which version of GATK you are using?

  • valentin, Cambridge, MA, Member, Dev ✭✭

    It seems that the link I posted above does not actually work, so here is the workaround:

    It seems that something similar has been reported here.

    That might be a bug in GATK, so thanks for reporting. I guess the workaround with samtools would be:

    samtools view -h input.bam | sed 's/SN:gi|[0-9]*|gb|\(.*\)|/SN:\1/' | samtools view -b - > output.bam

    Note that the "|" characters are left unescaped: in sed's basic regular expressions a plain "|" is a literal pipe, whereas GNU sed treats "\|" as alternation. You may need to add more 'sed' commands if there are SNs that follow a different pattern. You can check whether 'sed' is doing the right thing like so:

    samtools view -H input.bam | sed 's/SN:gi|[0-9]*|gb|\(.*\)|/SN:\1/'
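A header-only variant of the same idea can be sketched with `samtools reheader`, which replaces the header and leaves the alignment records untouched. BAM records refer to their reference sequence by its index in the header, so renaming the @SQ entries is enough. This is only a sketch under assumptions, not a confirmed fix for the GATK error: the function names `fix_sq_names` and `reheader_bam`, the temporary header file name, and the sample accession are hypothetical, and it assumes the names look like gi|<number>|gb|<accession>|.

```shell
# Strip the "gi|<number>|gb|" prefix and the trailing "|" from @SQ names.
# Only lines starting with @SQ are edited; a literal pipe is written
# unescaped in sed's basic regular expression syntax.
fix_sq_names() {
    sed '/^@SQ/ s/SN:gi|[0-9]*|gb|\([^|[:space:]]*\)|/SN:\1/'
}

# Dry run on a made-up header line (the accession is hypothetical).
# prints: @SQ<TAB>SN:ABC123.1<TAB>LN:1600000
printf '@SQ\tSN:gi|12345|gb|ABC123.1|\tLN:1600000\n' | fix_sq_names

# Rewrite only the header of a BAM file. Requires samtools on PATH.
# Usage: reheader_bam input.bam output.bam
reheader_bam() {
    samtools view -H "$1" | fix_sq_names > "$2.header.sam" &&
    samtools reheader "$2.header.sam" "$1" > "$2"
}
```

After `reheader_bam input.bam output.bam`, the new file needs a fresh index (`samtools index output.bam`), and it is worth inspecting `samtools view -H output.bam` to confirm that every SN was renamed as intended.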

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    Please do have a look at the threads Valentin pointed to above. Also, why do you say you cannot create the final VCF? Are you getting an error message? If so, please post it.


