The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block.
Powered by Vanilla. Made with Bootstrap.
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Error in GenomeStrip when using a bam file with added RG groups using AddOrReplaceReadGroups.jar

VMistry13VMistry13 Member Posts: 5

I have a question regarding the PicardTools AddOrReplaceReadGroups tool. I am using a single-sample WGS dataset and it arrived without any RG information, which I have now added with this command:

$HOME/bin/java -Xmx20g -jar /home/exome/bin/AddOrReplaceReadGroups.jar  \
I=DVH_GEONOME0002_novoalign.bam  \
O=DVH_GEONOME0002_novoalign_RG.bam  \
SORT_ORDER=coordinate \
RGID=HiSeq1 \
RGPL=illumina \

This seems to have worked, but when I use the output file for GenomeStrip (PreProcess), I get an error:

ERROR MESSAGE: Error parsing SAM header. Problem parsing @RG key:value pair. Line:
ERROR @RG ID:HiSeq1 PL:illumina PU: LB:WGS SM:DVH_genome0002; File /data_n2/vmistry/illumina_genome_SS6004426_HUSS1728_VM/DVH_genome0002_novoalign_RG.bam; Line number 86

I can't see any issues with line 86:

HS2000-1010_95:5:2202:3599:13204 99 chr1 10004 0 100M = 10023 118 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT <<<>>>===>??===>>>===>>?===>??===???===>??===???===???===??>===???===>?>===>??==<><>===???='9><899;= MD:Z:100 PG:Z:novoalign RG:Z:HiSeq1 AM:i:70 NM:i:0 SM:i:70 PQ:i:92 UQ:i:1 AS:i:

And the header of the RG_bam file looks like this:

@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621 AS:GRCh37
@SQ SN:chr2 LN:243199373 AS:GRCh37
@SQ SN:chr3 LN:198022430 AS:GRCh37
@SQ SN:chr4 LN:191154276 AS:GRCh37
@SQ SN:chr5 LN:180915260 AS:GRCh37
@SQ SN:chr6 LN:171115067 AS:GRCh37
@SQ SN:chr7 LN:159138663 AS:GRCh37
@SQ SN:chr8 LN:146364022 AS:GRCh37
@SQ SN:chr9 LN:141213431 AS:GRCh37
@SQ SN:chr10 LN:135534747 AS:GRCh37
@SQ SN:chr11 LN:135006516 AS:GRCh37
@SQ SN:chr12 LN:133851895 AS:GRCh37
@SQ SN:chr13 LN:115169878 AS:GRCh37
@SQ SN:chr14 LN:107349540 AS:GRCh37
@SQ SN:chr15 LN:102531392 AS:GRCh37
@SQ SN:chr16 LN:90354753 AS:GRCh37
@SQ SN:chr17 LN:81195210 AS:GRCh37
@SQ SN:chr18 LN:78077248 AS:GRCh37
@SQ SN:chr19 LN:59128983 AS:GRCh37
@SQ SN:chr20 LN:63025520 AS:GRCh37
@SQ SN:chr21 LN:48129895 AS:GRCh37
@SQ SN:chr22 LN:51304566 AS:GRCh37
@SQ SN:chrX LN:155270560 AS:GRCh37
@SQ SN:chrY LN:59373566 AS:GRCh37
@SQ SN:chrMT LN:16569 AS:GRCh37
@SQ SN:chrGL000207.1 LN:4262 AS:GRCh37
@SQ SN:chrGL000226.1 LN:15008 AS:GRCh37
@SQ SN:chrGL000229.1 LN:19913 AS:GRCh37
@SQ SN:chrGL000231.1 LN:27386 AS:GRCh37
@SQ SN:chrGL000210.1 LN:27682 AS:GRCh37
@SQ SN:chrGL000239.1 LN:33824 AS:GRCh37
@SQ SN:chrGL000235.1 LN:34474 AS:GRCh37
@SQ SN:chrGL000201.1 LN:36148 AS:GRCh37
@SQ SN:chrGL000247.1 LN:36422 AS:GRCh37
@SQ SN:chrGL000245.1 LN:36651 AS:GRCh37
@SQ SN:chrGL000197.1 LN:37175 AS:GRCh37
@SQ SN:chrGL000203.1 LN:37498 AS:GRCh37
@SQ SN:chrGL000246.1 LN:38154 AS:GRCh37
@SQ SN:chrGL000249.1 LN:38502 AS:GRCh37
@SQ SN:chrGL000196.1 LN:38914 AS:GRCh37
@SQ SN:chrGL000248.1 LN:39786 AS:GRCh37
@SQ SN:chrGL000244.1 LN:39929 AS:GRCh37
@SQ SN:chrGL000238.1 LN:39939 AS:GRCh37
@SQ SN:chrGL000202.1 LN:40103 AS:GRCh37
@SQ SN:chrGL000234.1 LN:40531 AS:GRCh37
@SQ SN:chrGL000232.1 LN:40652 AS:GRCh37
@SQ SN:chrGL000206.1 LN:41001 AS:GRCh37
@SQ SN:chrGL000240.1 LN:41933 AS:GRCh37
@SQ SN:chrGL000236.1 LN:41934 AS:GRCh37
@SQ SN:chrGL000241.1 LN:42152 AS:GRCh37
@SQ SN:chrGL000243.1 LN:43341 AS:GRCh37
@SQ SN:chrGL000242.1 LN:43523 AS:GRCh37
@SQ SN:chrGL000230.1 LN:43691 AS:GRCh37
@SQ SN:chrGL000237.1 LN:45867 AS:GRCh37
@SQ SN:chrGL000233.1 LN:45941 AS:GRCh37
@SQ SN:chrGL000204.1 LN:81310 AS:GRCh37
@SQ SN:chrGL000198.1 LN:90085 AS:GRCh37
@SQ SN:chrGL000208.1 LN:92689 AS:GRCh37
@SQ SN:chrGL000191.1 LN:106433 AS:GRCh37
@SQ SN:chrGL000227.1 LN:128374 AS:GRCh37
@SQ SN:chrGL000228.1 LN:129120 AS:GRCh37
@SQ SN:chrGL000214.1 LN:137718 AS:GRCh37
@SQ SN:chrGL000221.1 LN:155397 AS:GRCh37
@SQ SN:chrGL000209.1 LN:159169 AS:GRCh37
@SQ SN:chrGL000218.1 LN:161147 AS:GRCh37
@SQ SN:chrGL000220.1 LN:161802 AS:GRCh37
@SQ SN:chrGL000213.1 LN:164239 AS:GRCh37
@SQ SN:chrGL000211.1 LN:166566 AS:GRCh37
@SQ SN:chrGL000199.1 LN:169874 AS:GRCh37
@SQ SN:chrGL000217.1 LN:172149 AS:GRCh37
@SQ SN:chrGL000216.1 LN:172294 AS:GRCh37
@SQ SN:chrGL000215.1 LN:172545 AS:GRCh37
@SQ SN:chrGL000205.1 LN:174588 AS:GRCh37
@SQ SN:chrGL000219.1 LN:179198 AS:GRCh37
@SQ SN:chrGL000224.1 LN:179693 AS:GRCh37
@SQ SN:chrGL000223.1 LN:180455 AS:GRCh37
@SQ SN:chrGL000195.1 LN:182896 AS:GRCh37
@SQ SN:chrGL000212.1 LN:186858 AS:GRCh37
@SQ SN:chrGL000222.1 LN:186861 AS:GRCh37
@SQ SN:chrGL000200.1 LN:187035 AS:GRCh37
@SQ SN:chrGL000193.1 LN:189789 AS:GRCh37
@SQ SN:chrGL000194.1 LN:191469 AS:GRCh37
@SQ SN:chrGL000225.1 LN:211173 AS:GRCh37
@SQ SN:chrGL000192.1 LN:547496 AS:GRCh37
@RG ID:HiSeq1 PL:illumina PU: LB:WGS SM:DVH_genome0002
@PG ID:novoalign VN:V2.07.17 CL:novoalign -d /home/exome/repository/ref_genomes/GRCh37 -f /home/exome/samples/DVH_genome0002/fastq/DVH_genome0002_1_all.fastq.gz /home/exome/samples/DVH_genome0002/fastq/DVH_genome0002_2_all.fastq.gz --Q2Off -F STDFQ -i 200 30 -o SAM -o SoftClip -k -a -g 65 -x 7

Any idea what the issue could be?

Many thanks


  • bhandsakerbhandsaker Member, Collaborator Posts: 380 ✭✭✭

    I suspect the error message is referring to line 86 in the header, the @RG line.
    And I further suspect the problem is the missing PU: field (i.e. $1 in your original script probably had no value).

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • VMistry13VMistry13 Member Posts: 5

    Ah OK, thanks
    I will try again

  • VMistry13VMistry13 Member Posts: 5
    edited July 2013

    I sorted the read groups in the .bam file, but have another issue when running PreProcess

    Here is my command:
    java -Xmx2g -cp /data_n2/vmistry/Software/svtoolkit/lib/gatk/Queue.jar:/data_n2/vmistry/Software/svtoolkit/lib/SVToolkit.jar:/data_n2/vmistry/Software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar org.broadinstitute.sting.queue.QCommandLine -S /data_n2/vmistry/Software/svtoolkit/qscript/SVPreprocess.q -S /data_n2/vmistry/Software/svtoolkit/qscript/SVQScript.q -gatk /data_n2/vmistry/Software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar -cp /data_n2/vmistry/Software/svtoolkit/lib/SVToolkit.jar:/data_n2/vmistry/Software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar -configFile /data_n2/vmistry/Software/svtoolkit/conf/genstrip_parameters.txt -tempDir /data_n2/vmistry/Software/svtoolkit/tmpdir -md /data_n2/vmistry/Software/svtoolkit/metadata -R /data_n2/vmistry/Software/svtoolkit/Homo_sapiens_assembly19.fasta -genomeMaskFile /data_n2/vmistry/Software/svtoolkit/Homo_sapiens_assembly19.mask.101.fasta -copyNumberMaskFile /data_n2/vmistry/Software/svtoolkit/cn2_mask_g1k_v37.fasta -ploidyMapFile /data_n2/vmistry/Software/svtoolkit/ -genderMapFile /data_n2/vmistry/Software/svtoolkit/ -reduceInsertSizeDistributions -computeGCProfiles -bamFilesAreDisjoint -I DVH_genome0002_novoalign_RG_v02.bam -run

    And I get an error when the ref profile is computing:

    DBG: Wed Jul 24 15:10:40 BST 2013 computing reference profile for the interval NC_007605:1-171823

    Caused by: java.lang.IllegalArgumentException: Invalid sequence position: NC_007605:201

    As far as I can see, there is no NC_007605:201 in the header of my bam file, it's the same as above

    Any idea where this is coming from?


  • bhandsakerbhandsaker Member, Collaborator Posts: 380 ✭✭✭

    Can you include a complete stack trace?
    The essence of the problem is that you are mixing incompatible reference sequences and masks (Homo_sapiens_assembly19 and g1k_v37) that have slight differences.
    The first thing you need to do is to find out (and be 100% sure) which exact fasta file was used to align your bams.
    Everything else needs to match that fasta file exactly.

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • rwhiterwhite London, UKMember Posts: 2

    NC_007605 is the Epstein-Barr virus genome - the virus is in all lymphoblastoid cell lines (as the transforming agent).

Sign In or Register to comment.