How to change the @SQ line in a .bam file

Sundance (Case Western Reserve) Member

I am trying to run GATK to get a g.vcf file using an interval list.
I get the following message:
ERROR MESSAGE: File associated with name interval.list is malformed: Interval file could not be parsed in any supported format. caused by Failed to parse Genome Location string: gi|409893163|gb|CP003904.1| : 74250-75830
In my .bam file I have the following:
@SQ SN:gi|409893163|gb|CP003904.1| LN:1667892

How can I change this to read:
@SQ SN:CP003904.1 LN:1667892

My genome is bacterial.

I don't think GATK likes the "|" in my files. Or is there some way to change my interval.list instead?

Thanks

Answers

  • valentin (Cambridge, MA) Member, Dev ✭✭

    It seems that something similar has been reported here.

    That might be a bug in GATK, so thanks for reporting. I guess the workaround with samtools would be:

    samtools view -h input.bam | sed 's/SN:gi|[0-9]*|gb|\([^|]*\)|/SN:\1/' | samtools view -b - > output.bam

    Note that the "|" characters are left unescaped: in a basic regular expression "|" is literal, whereas GNU sed treats "\|" as alternation. You may need to add more 'sed' commands if there are SNs that follow a different regular expression. You can check whether the 'sed' is doing the right thing like so:

    samtools view -H input.bam | sed 's/SN:gi|[0-9]*|gb|\([^|]*\)|/SN:\1/'
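
A variation on the workaround above, as a minimal sketch rather than a tested recipe: editing only the SAM header text leaves the alignment records with the old contig name in their RNAME column, so it may be safer to rewrite the header alone and apply it with `samtools reheader`, which replaces the header without touching the records. The function names `rename_sq` and `reheader_bam` and the temporary file name are hypothetical, and the pattern assumes every SN follows the `gi|<number>|gb|<accession>|` form shown in the question.

```shell
# Rewrite @SQ lines read from stdin: SN:gi|<digits>|gb|<accession>| -> SN:<accession>
# '|' is literal in a basic regular expression, so it is not escaped.
rename_sq() {
    sed '/^@SQ/ s/SN:gi|[0-9]*|gb|\([^|]*\)|/SN:\1/'
}

# Hypothetical wrapper: $1 = input BAM, $2 = output BAM.
# Requires samtools on the PATH; not called automatically.
reheader_bam() {
    samtools view -H "$1" | rename_sq > "$2.header.sam" &&
    samtools reheader "$2.header.sam" "$1" > "$2"
}
```

It would be called as `reheader_bam input.bam output.bam`. GATK checks contig names against the reference sequence dictionary, so the reference FASTA, its .dict and .fai files, and the interval list would need the same renamed contigs.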

  • Sheila (Broad Institute) Member, Broadie ✭✭✭✭✭

    @Sundance
    Hi,

    I think you simply need to add quotes around the contig names. For example, if you want to run on positions 1-1000 on contig gi|409893163|gb|CP003904.1|, you would specify "gi|409893163|gb|CP003904.1|":1-1000 in the interval list.
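
If the interval list is long, the quoting suggested above could be applied mechanically. This is only a sketch: the function name `quote_contigs` is hypothetical, it assumes one interval per line in `contig:start-stop` form with no spaces around the colon, and whether a given GATK version accepts quoted contig names is not verified here.

```shell
# Wrap the contig part of each "contig:start-stop" line in double quotes.
# Lines that are already quoted or do not match the pattern pass through unchanged.
quote_contigs() {
    sed 's/^\([^":]*\):\([0-9]*-[0-9]*\)$/"\1":\2/'
}
```

For example, `quote_contigs < interval.list > interval.quoted.list` would write the quoted intervals to a new file (the output file name is illustrative) and leave the original untouched.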

    -Sheila
