
Error with UnifiedGenotyper after using BaseRecalibrator twice

bhall7 Posts: 6 Member

Aloha,

I am calling SNPs on an organism without a reference genome or database of known polymorphisms, so I'm trying to follow the advice posted here (and in the BaseRecalibrator documentation).

I've successfully called SNPs on the un-recalibrated .bam file, then used those SNPs to recalibrate, then called SNPs on the recalibrated .bam file. As expected, I got significantly fewer (and presumably more accurate) results.

I then used the new, reduced set of SNPs to recalibrate again. When I attempted to call SNPs on this "Round Two" recalibrated .bam file, I got the following error:

SAM/BAM file recalibrated.2.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader!

I attempted to use PicardTools ValidateSamFile and CleanSam but received the same message (as an IllegalArgumentException). I would definitely consider myself a novice in the field. Any advice you can give will be greatly appreciated.

Answers

  • bhall7 Posts: 6 Member

    Hi Dr. Van der Auwera,

    Oops, I was running v2.4-7. (I'm not sure how that happened; I've only been doing this for a couple of weeks.) I will upgrade and retry, then post results. Thanks!

  • bhall7 Posts: 6 Member

    Update: I received the same error message using v2.4-9. I'll try using --no_pg_tag. Thanks for your help!

  • Carneiro Posts: 271 Administrator, GSA Member

    When you say "When I attempted to call SNPs on this "Round Two" recalibrated .bam file", do you mean that you saw this error when running BaseRecalibrator, PrintReads or UnifiedGenotyper?

  • bhall7 Posts: 6 Member

    Hi Dr. Carneiro,

    I saw the error running UnifiedGenotyper.

  • Carneiro Posts: 271 Administrator, GSA Member

    Can you take a look at your BAM file header to see if it has two @PG entries for PrintReads?

    You can do so with the following command (provided you have samtools):

    samtools view -H recalibrated.2.bam | grep @PG
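The error message complains about a repeated program record ID, so it can help to list only the IDs that occur more than once rather than scanning the grep output by eye. The following is a minimal sketch, not part of the original reply. It assumes the header fields are tab-separated, as in a real SAM header (the tabs are lost in the forum rendering), and the function name `duplicated_pg_ids` is made up for illustration.

```shell
# Reads SAM header text on stdin and prints every @PG ID that occurs
# more than once. Assumes tab-separated header fields.
duplicated_pg_ids() {
    awk -F'\t' '
        /^@PG/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ID:/) count[substr($i, 4)]++
        }
        END { for (id in count) if (count[id] > 1) print id }
    '
}

# Usage (requires samtools; file name taken from the error message
# in the original question):
#   samtools view -H recalibrated.2.bam | duplicated_pg_ids
```

An empty result would mean no @PG ID is repeated, in which case the cause of the error lies elsewhere.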

  • bhall7 Posts: 6 Member
    edited March 2013

    It does have two @PG PrintReads entries. They're identical and each reads as follows:

    @PG ID:GATK PrintReads VN:2.4-7-g5e89f01 CL:readGroup=null platform=null number=-1 downsample_coverage=1.0 sample_file=[] sample_name=[] simplify=false no_pg_tag=false

    I'm using samtools to remove these lines from the header and I'll try to run UnifiedGenotyper again, then report back.

  • Carneiro Posts: 271 Administrator, GSA Member

    The fix should be up in the nightly builds, if you want to try it.

  • caddymob Posts: 9 Member

    I just tried the latest nightly, nightly-2013-04-12-g3fc5478, and got the same error with UnifiedGenotyper. It's a merged BAM of a family: the samples were individually recalibrated, then merged and recalibrated again. Am I really going to have to spend the time re-headering this? Or is it because the recalibration was done with GATK 2.4-7-g5e89f01?

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version nightly-2013-04-12-g3fc5478):
    ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ##### ERROR Please do not post this error to the GATK forum
    ##### ERROR
    ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: SAM/BAM file XXX.MERGED.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader!
    ##### ERROR ------------------------------------------------------------------------------------------
    
    
    samtools view -H XXX.MERGED.bam | grep PG | grep PrintReads
    @PG ID:GATK PrintReads  VN:2.4-7-g5e89f01   CL:readGroup=null platform=null number=-1 downsample_coverage=1.0 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
    @PG ID:GATK PrintReads  VN:2.4-7-g5e89f01   CL:readGroup=null platform=null number=-1 downsample_coverage=1.0 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
    
  • caddymob Posts: 9 Member

    Also happens if I just try BaseRecalibrator again. BaseRecalibrator won't take the no_pg_tag as an option, so it dies there.

  • bhall7 Posts: 6 Member

    FWIW no_pg_tag didn't help me either; changing the header was the only thing that worked. A bit of a pain, but I only had to do it a few times before I got convergence on my call-snps-recalibrate-call-snps-recalibrate loop.

  • Geraldine_VdAuwera Posts: 5,230 Administrator, GSA Member

    Hmm, we'll take another look at this. Stay tuned, folks.

    Geraldine Van der Auwera, PhD

  • Carneiro Posts: 271 Administrator, GSA Member

    The problem here is that you are running the tools on a BAM that already contains multiple @PG tags (it was generated with the buggy version, which we have since fixed).

    For this to work you'll have to regenerate the BAM, or manually remove the erroneous duplicated @PG tags. In the new version (as far as I can test), running PrintReads multiple times will no longer add multiple @PG tags, which fixes the problem; you can then run any tool (UG, BQSR, ...) on that BAM.
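For readers who take the manual route, one possible way to remove the duplicated @PG lines is sketched below. It is not taken from the replies in this thread. The function name `drop_repeated_pg` and the output file names are hypothetical, and the sketch assumes that keeping the first @PG line for each ID is acceptable, which holds when the duplicates are identical, as reported above.

```shell
# Reads a SAM header on stdin and keeps only the first @PG line for each
# ID; every other header line passes through unchanged.
# Assumes tab-separated header fields.
drop_repeated_pg() {
    awk -F'\t' '
        /^@PG/ {
            id = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ID:/) id = $i
            if (seen[id]++) next
        }
        { print }
    '
}

# Usage (requires samtools; output file names are placeholders):
#   samtools view -H recalibrated.2.bam | drop_repeated_pg > fixed_header.sam
#   samtools reheader fixed_header.sam recalibrated.2.bam > recalibrated.2.fixed.bam
```

`samtools reheader` swaps the header without rewriting the alignment records, so this tends to be much faster than regenerating the BAM. It is worth checking the new header with `samtools view -H` before running any downstream tool.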

  • ymw Posts: 9 Member

    I am encountering the same problem.
    After reading some relevant discussions of the "call-snps-recalibrate-call-snps-recalibrate loop", I am wondering whether this problem stems from the choice of BAM file for each round of recalibration. That is, should we (a) use the unrecalibrated (original) BAM file for the second and later rounds of recalibration, or (b) use the newly recalibrated BAM for the next round? If we choose the latter, one @PG will be added after each round. If we choose the former, there should be only one @PG each time we run UnifiedGenotyper.

    Am I right? And which option, (a) or (b), is logically correct?

  • Geraldine_VdAuwera Posts: 5,230 Administrator, GSA Member

    Hi @ymw,

    We haven't compared the two possible options so we can't say definitively, but our default recommendation is indeed to do the successive rounds of recalibration on the original, unrecalibrated file each time. In principle the process should work on the recalibrated file too, but I think you could make the case that the recalibration works best to correct systematic (as opposed to random) errors, and the systematic error patterns are "cleanest" in the original file, while in the successively recalibrated files the patterns may get obscured by the recalibration attempts. So in terms of logic you are correct to say that (a) is the better option. Apologies to anyone who may have misunderstood our recommendations if they were not clear on this point.

    That being said, we have now changed the behavior of the @PG tagging so that if there is already a @PG tag for that program in your header, it will be taken out and replaced by the new one, to remain in compliance with the BAM spec.

    Geraldine Van der Auwera, PhD

  • Carneiro Posts: 271 Administrator, GSA Member

    Hi ymw.

    Always use the original BAM file on your iterations of recalibration. You always want the priors to be the original quality scores and the adjustments to be calculated on that, not on a biased observation.

    In terms of the @PG tag, either way should only add one @PG tag in the latest version. We fixed this when it got reported.
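To make the recommendation concrete, the sketch below prints (but does not run) the commands for a few bootstrap rounds in which BaseRecalibrator and PrintReads always read the original BAM, and only the set of known sites changes from round to round. All file names are hypothetical. The arguments shown (`-knownSites`, `-BQSR`, `-o`) are the GATK 2.x-era ones as best understood here, so check them against the documentation for your version before use.

```shell
# Hypothetical defaults; override via environment variables as needed.
GATK_JAR=${GATK_JAR:-GenomeAnalysisTK.jar}
REF=${REF:-reference.fa}
ORIGINAL_BAM=${ORIGINAL_BAM:-original.bam}

# Print the commands for one round.
# $1 = round number, $2 = VCF of calls from the previous round.
print_round() {
    # Recalibration model is always built from the ORIGINAL BAM.
    echo "java -jar $GATK_JAR -T BaseRecalibrator -R $REF -I $ORIGINAL_BAM -knownSites $2 -o recal.$1.grp"
    # The model is always applied to the ORIGINAL BAM as well.
    echo "java -jar $GATK_JAR -T PrintReads -R $REF -I $ORIGINAL_BAM -BQSR recal.$1.grp -o recalibrated.$1.bam"
    # Variants are called on this round's recalibrated BAM.
    echo "java -jar $GATK_JAR -T UnifiedGenotyper -R $REF -I recalibrated.$1.bam -o calls.$1.vcf"
}

# Print the commands for $1 rounds. calls.0.vcf stands for the calls made
# on the unrecalibrated BAM before the first round.
print_rounds() {
    total=$1
    known=calls.0.vcf
    n=1
    while [ "$n" -le "$total" ]; do
        print_round "$n" "$known"
        known=calls.$n.vcf
        n=$((n + 1))
    done
}

# Usage: print_rounds 3
```

Because every round starts from the original file, each recalibrated BAM has been through PrintReads only once, so the header should carry a single PrintReads @PG entry regardless of the GATK version used.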

  • vivekdas_1987 Milan Posts: 20 Member

    Error with UnifiedGenotyper with option -glm BOTH with GATK during Variant calling

    Hi,

    I am using the command below to call raw variants with GATK (GenomeAnalysisTK-2.3-4-g57ea19f) on the realigned, recalibrated BAM file after the BQSR and PrintReads steps, but I am getting an error. The command I am using is:

    java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /scratch/GT/vdas/test_exome/exome/hg19.fa -I /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.realigned.recal.bam -L /scratch/GT/vdas/referenceBed/hg19/ss_v4/SureSelect_XT_Human_All_Exon_V4.bed -D /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf –glm BOTH -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 200 -l INFO -A AlleleBalance -A DepthOfCoverage -A FisherStrand -log /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.GATKvariants.log -o /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.GATKvariants.raw.vcf

    Error:

    ERROR MESSAGE: Invalid argument value '???glm' at position 10.
    ERROR Invalid argument value 'BOTH' at position 11.

    I used this same command earlier, while testing my pipeline on a single sample from the 1000G project with this version of GATK, and did not get any error then, but I am encountering it with my tumor samples. Any suggestions? I have checked other posts and the suggestion I see is version compatibility, but I used this version 5 days ago with another sample and the same command worked. Any idea how to get rid of this error? It would be a great help.

  • Geraldine_VdAuwera Posts: 5,230 Administrator, GSA Member

    Hi @vivekdas_1987,

    This looks like some funky characters were introduced in your command line when you copied it over. Might be an issue with the encoding of whatever file format you store your command lines in. Or if it's some kind of word-processor document (e.g. MS Word) the program may have transformed the basic dash character into a special long-dash character that's not recognized by the shell. Just copy your command to a pure text document, fix the dash, and then you can copy-paste it and it should work.

    Geraldine Van der Auwera, PhD
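A quick way to check a saved command line for this kind of character is sketched below. It is an illustration, not part of the original answer, and the function names `find_non_ascii` and `fix_dashes` are made up. It assumes the file is UTF-8 encoded and that the offending character is an en dash (U+2013) or em dash (U+2014). The two dashes are written as octal escapes so the snippet itself stays pure ASCII.

```shell
# UTF-8 byte sequences for the en dash (U+2013) and em dash (U+2014).
EN_DASH=$(printf '\342\200\223')
EM_DASH=$(printf '\342\200\224')

# Print, with line numbers, every input line that contains a byte which is
# neither printable ASCII nor whitespace. Reads the named files, or stdin.
# -a forces text mode so that matching lines are always printed.
find_non_ascii() {
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$@"
}

# Replace typographic dashes on stdin with a plain ASCII hyphen.
fix_dashes() {
    sed -e "s/$EN_DASH/-/g" -e "s/$EM_DASH/-/g"
}

# Usage (commands.txt is a placeholder for wherever the command is stored):
#   find_non_ascii commands.txt
#   fix_dashes < commands.txt > commands.fixed.txt
```

In the command posted above, the argument before `BOTH` is the one to check, since the shell passes the special dash through and GATK then reports it as `???glm`.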

  • vivekdas_1987 Milan Posts: 20 Member
    edited December 2013

    Yes, this was solved long ago. Thanks for the input. I had this problem when I first used it.
