Error in GenomeStrip when using a bam file with added RG groups using AddOrReplaceReadGroups.jar

I have a question regarding the PicardTools AddOrReplaceReadGroups tool. I am using a single-sample WGS dataset and it arrived without any RG information, which I have now added with this command:

$HOME/bin/java -Xmx20g -jar /home/exome/bin/AddOrReplaceReadGroups.jar  \
I=DVH_GEONOME0002_novoalign.bam  \
O=DVH_GEONOME0002_novoalign_RG.bam  \
SORT_ORDER=coordinate \
RGID=HiSeq1 \
RGPL=illumina \

This seems to have worked, but when I use the output file for GenomeStrip (PreProcess), I get an error:

ERROR MESSAGE: Error parsing SAM header. Problem parsing @RG key:value pair. Line:
ERROR @RG ID:HiSeq1 PL:illumina PU: LB:WGS SM:DVH_genome0002; File /data_n2/vmistry/illumina_genome_SS6004426_HUSS1728_VM/DVH_genome0002_novoalign_RG.bam; Line number 86

I can't see any issues with line 86:

HS2000-1010_95:5:2202:3599:13204 99 chr1 10004 0 100M = 10023 118 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT <<<>>>===>??===>>>===>>?===>??===???===>??===???===???===??>===???===>?>===>??==<><>===???='9><899;= MD:Z:100 PG:Z:novoalign RG:Z:HiSeq1 AM:i:70 NM:i:0 SM:i:70 PQ:i:92 UQ:i:1 AS:i:

And the header of the RG_bam file looks like this:

@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621 AS:GRCh37
@SQ SN:chr2 LN:243199373 AS:GRCh37
@SQ SN:chr3 LN:198022430 AS:GRCh37
@SQ SN:chr4 LN:191154276 AS:GRCh37
@SQ SN:chr5 LN:180915260 AS:GRCh37
@SQ SN:chr6 LN:171115067 AS:GRCh37
@SQ SN:chr7 LN:159138663 AS:GRCh37
@SQ SN:chr8 LN:146364022 AS:GRCh37
@SQ SN:chr9 LN:141213431 AS:GRCh37
@SQ SN:chr10 LN:135534747 AS:GRCh37
@SQ SN:chr11 LN:135006516 AS:GRCh37
@SQ SN:chr12 LN:133851895 AS:GRCh37
@SQ SN:chr13 LN:115169878 AS:GRCh37
@SQ SN:chr14 LN:107349540 AS:GRCh37
@SQ SN:chr15 LN:102531392 AS:GRCh37
@SQ SN:chr16 LN:90354753 AS:GRCh37
@SQ SN:chr17 LN:81195210 AS:GRCh37
@SQ SN:chr18 LN:78077248 AS:GRCh37
@SQ SN:chr19 LN:59128983 AS:GRCh37
@SQ SN:chr20 LN:63025520 AS:GRCh37
@SQ SN:chr21 LN:48129895 AS:GRCh37
@SQ SN:chr22 LN:51304566 AS:GRCh37
@SQ SN:chrX LN:155270560 AS:GRCh37
@SQ SN:chrY LN:59373566 AS:GRCh37
@SQ SN:chrMT LN:16569 AS:GRCh37
@SQ SN:chrGL000207.1 LN:4262 AS:GRCh37
@SQ SN:chrGL000226.1 LN:15008 AS:GRCh37
@SQ SN:chrGL000229.1 LN:19913 AS:GRCh37
@SQ SN:chrGL000231.1 LN:27386 AS:GRCh37
@SQ SN:chrGL000210.1 LN:27682 AS:GRCh37
@SQ SN:chrGL000239.1 LN:33824 AS:GRCh37
@SQ SN:chrGL000235.1 LN:34474 AS:GRCh37
@SQ SN:chrGL000201.1 LN:36148 AS:GRCh37
@SQ SN:chrGL000247.1 LN:36422 AS:GRCh37
@SQ SN:chrGL000245.1 LN:36651 AS:GRCh37
@SQ SN:chrGL000197.1 LN:37175 AS:GRCh37
@SQ SN:chrGL000203.1 LN:37498 AS:GRCh37
@SQ SN:chrGL000246.1 LN:38154 AS:GRCh37
@SQ SN:chrGL000249.1 LN:38502 AS:GRCh37
@SQ SN:chrGL000196.1 LN:38914 AS:GRCh37
@SQ SN:chrGL000248.1 LN:39786 AS:GRCh37
@SQ SN:chrGL000244.1 LN:39929 AS:GRCh37
@SQ SN:chrGL000238.1 LN:39939 AS:GRCh37
@SQ SN:chrGL000202.1 LN:40103 AS:GRCh37
@SQ SN:chrGL000234.1 LN:40531 AS:GRCh37
@SQ SN:chrGL000232.1 LN:40652 AS:GRCh37
@SQ SN:chrGL000206.1 LN:41001 AS:GRCh37
@SQ SN:chrGL000240.1 LN:41933 AS:GRCh37
@SQ SN:chrGL000236.1 LN:41934 AS:GRCh37
@SQ SN:chrGL000241.1 LN:42152 AS:GRCh37
@SQ SN:chrGL000243.1 LN:43341 AS:GRCh37
@SQ SN:chrGL000242.1 LN:43523 AS:GRCh37
@SQ SN:chrGL000230.1 LN:43691 AS:GRCh37
@SQ SN:chrGL000237.1 LN:45867 AS:GRCh37
@SQ SN:chrGL000233.1 LN:45941 AS:GRCh37
@SQ SN:chrGL000204.1 LN:81310 AS:GRCh37
@SQ SN:chrGL000198.1 LN:90085 AS:GRCh37
@SQ SN:chrGL000208.1 LN:92689 AS:GRCh37
@SQ SN:chrGL000191.1 LN:106433 AS:GRCh37
@SQ SN:chrGL000227.1 LN:128374 AS:GRCh37
@SQ SN:chrGL000228.1 LN:129120 AS:GRCh37
@SQ SN:chrGL000214.1 LN:137718 AS:GRCh37
@SQ SN:chrGL000221.1 LN:155397 AS:GRCh37
@SQ SN:chrGL000209.1 LN:159169 AS:GRCh37
@SQ SN:chrGL000218.1 LN:161147 AS:GRCh37
@SQ SN:chrGL000220.1 LN:161802 AS:GRCh37
@SQ SN:chrGL000213.1 LN:164239 AS:GRCh37
@SQ SN:chrGL000211.1 LN:166566 AS:GRCh37
@SQ SN:chrGL000199.1 LN:169874 AS:GRCh37
@SQ SN:chrGL000217.1 LN:172149 AS:GRCh37
@SQ SN:chrGL000216.1 LN:172294 AS:GRCh37
@SQ SN:chrGL000215.1 LN:172545 AS:GRCh37
@SQ SN:chrGL000205.1 LN:174588 AS:GRCh37
@SQ SN:chrGL000219.1 LN:179198 AS:GRCh37
@SQ SN:chrGL000224.1 LN:179693 AS:GRCh37
@SQ SN:chrGL000223.1 LN:180455 AS:GRCh37
@SQ SN:chrGL000195.1 LN:182896 AS:GRCh37
@SQ SN:chrGL000212.1 LN:186858 AS:GRCh37
@SQ SN:chrGL000222.1 LN:186861 AS:GRCh37
@SQ SN:chrGL000200.1 LN:187035 AS:GRCh37
@SQ SN:chrGL000193.1 LN:189789 AS:GRCh37
@SQ SN:chrGL000194.1 LN:191469 AS:GRCh37
@SQ SN:chrGL000225.1 LN:211173 AS:GRCh37
@SQ SN:chrGL000192.1 LN:547496 AS:GRCh37
@RG ID:HiSeq1 PL:illumina PU: LB:WGS SM:DVH_genome0002
@PG ID:novoalign VN:V2.07.17 CL:novoalign -d /home/exome/repository/ref_genomes/GRCh37 -f /home/exome/samples/DVH_genome0002/fastq/DVH_genome0002_1_all.fastq.gz /home/exome/samples/DVH_genome0002/fastq/DVH_genome0002_2_all.fastq.gz --Q2Off -F STDFQ -i 200 30 -o SAM -o SoftClip -k -a -g 65 -x 7

Any idea what the issue could be?

Many thanks


  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    I suspect the error message is referring to line 86 in the header, the @RG line.
    And I further suspect the problem is the missing PU: field (i.e. $1 in your original script probably had no value).

  • VMistry13VMistry13 Member

    Ah OK, thanks
    I will try again

  • VMistry13VMistry13 Member
    edited July 2013

    I sorted the read groups in the .bam file, but have another issue when running PreProcess

    Here is my command:
    java -Xmx2g -cp /data_n2/vmistry/Software/svtoolkit/lib/gatk/Queue.jar:/data_n2/vmistry/Software/svtoolkit/lib/SVToolkit.jar:/data_n2/vmistry/Software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar org.broadinstitute.sting.queue.QCommandLine -S /data_n2/vmistry/Software/svtoolkit/qscript/SVPreprocess.q -S /data_n2/vmistry/Software/svtoolkit/qscript/SVQScript.q -gatk /data_n2/vmistry/Software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar -cp /data_n2/vmistry/Software/svtoolkit/lib/SVToolkit.jar:/data_n2/vmistry/Software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar -configFile /data_n2/vmistry/Software/svtoolkit/conf/genstrip_parameters.txt -tempDir /data_n2/vmistry/Software/svtoolkit/tmpdir -md /data_n2/vmistry/Software/svtoolkit/metadata -R /data_n2/vmistry/Software/svtoolkit/Homo_sapiens_assembly19.fasta -genomeMaskFile /data_n2/vmistry/Software/svtoolkit/Homo_sapiens_assembly19.mask.101.fasta -copyNumberMaskFile /data_n2/vmistry/Software/svtoolkit/cn2_mask_g1k_v37.fasta -ploidyMapFile /data_n2/vmistry/Software/svtoolkit/ -genderMapFile /data_n2/vmistry/Software/svtoolkit/ -reduceInsertSizeDistributions -computeGCProfiles -bamFilesAreDisjoint -I DVH_genome0002_novoalign_RG_v02.bam -run

    And I get an error when the ref profile is computing:

    DBG: Wed Jul 24 15:10:40 BST 2013 computing reference profile for the interval NC_007605:1-171823

    Caused by: java.lang.IllegalArgumentException: Invalid sequence position: NC_007605:201

    As far as I can see, there is no NC_007605:201 in the header of my bam file, it's the same as above

    Any idea where this is coming from?


  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Can you include a complete stack trace?
    The essence of the problem is that you are mixing incompatible reference sequences and masks (Homo_sapiens_assembly19 and g1k_v37) that have slight differences.
    The first thing you need to do is to find out (and be 100% sure) which exact fasta file was used to align your bams.
    Everything else needs to match that fasta file exactly.

  • rwhiterwhite London, UKMember

    NC_007605 is the Epstein-Barr virus genome - the virus is in all lymphoblastoid cell lines (as the transforming agent).

Sign In or Register to comment.