RealignerTargetCreator

System Posts: 226 Administrator
edited July 2012 in Tool Bulletin

A new tool has been released!

Check out the documentation at RealignerTargetCreator.

Comments

  • mmoisse Posts: 3 Member
    edited March 2013

    Is there a reason why e.g. IndelRealigner, BaseRecalibrator, ... in Queue 2.4.7 use target intervals other than the ones created by RealignerTargetCreator?

    e.g.

    The first interval created by RealignerTargetCreator is this one:

    chr1 1 123827759 + interval_1

    While the rest of the pipeline uses the following interval as first interval:

    chr1 1 249250621 + interval_1

  • Geraldine_VdAuwera Posts: 5,852 Administrator, GATK Developer

    I'm not sure I understand what you're describing... Are you saying the intervals are all different, or that the first interval is skipped?

    Geraldine Van der Auwera, PhD

  • mmoisse Posts: 3 Member

    When I use the example Scala script DataProcessingPipeline, I notice that the first step calls RealignerTargetCreator, which creates 25 interval files, of which the first is the following:

    'chr1 1 123827759 + interval_1'

    In the next step it invokes IndelRealigner. This step also uses 25 intervals, but they do not match the ones created by RealignerTargetCreator.

    In summary, RealignerTargetCreator creates interval files containing parts of one or several chromosomes, while the subsequent steps (IndelRealigner, BaseRecalibrator, ...) don't use these interval files but instead use interval files with one whole chromosome each. Is there a particular reason for this, or do I need to adapt the DataProcessingPipeline so that it uses the intervals created by RealignerTargetCreator?

  • Geraldine_VdAuwera Posts: 5,852 Administrator, GATK Developer

    Oh, I see. I believe what's happening here is that the 25 interval files you are seeing are the ones used to scatter-gather the jobs. This explains the differences between the steps, because different tools need the data to be scattered differently. Then, when each job is run, the actual realignment target interval files are produced and stored elsewhere. This is the expected behavior, but if you would prefer to set it up differently, feel free to adapt the script as you like. It is only provided as an example of what you can do, and we discourage users from using it "out of the box" because it is specifically tailored to our workflow needs.

    Geraldine Van der Auwera, PhD

  • smk_84 Posts: 59 Member

    Hi, I am running the GATK command as specified in the documentation:

    java -Xmx8g -jar /home/skhan/bio/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R Soybean_ref_genome.fasta -I AddOrRep_HN002.bam -o HN002_realigner.intervals -nt 3 --fix_misencoded_quality_scores -fixMisencodedQuals

    and I am getting the following error:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR

    ' at position 12.GE: Invalid argument value '

    ERROR ------------------------------------------------------------------------------------------

    The funny thing is that I ran the exact same command a few days ago and did not get any error. Am I missing something?

  • Geraldine_VdAuwera Posts: 5,852 Administrator, GATK Developer

    That's a strange error, can you post the stack trace too? (the part before the ERROR block, that gives the details of where in the code the error occurred).

    By the way, --fix_misencoded_quality_scores is not recommended for use by default. You should only use it if you tried without it and the program complained. Also, I don't know why you are specifying that flag twice (--fix_misencoded_quality_scores -fixMisencodedQuals). Did you see that in the documentation or in another user's post?

    Geraldine Van der Auwera, PhD

  • smk_84 Posts: 59 Member
    edited June 2013

    Yes, exactly, I saw it in another user's post. Also, I was running many files in batches, so I put this flag in the script so that it is checked on every file.

    Following is the complete error:

    ERROR ------------------------------------------------------------------------------------------

    ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR

    ' at position 12.GE: Invalid argument value '

  • Geraldine_VdAuwera Posts: 5,852 Administrator, GATK Developer

    You definitely should NOT apply this flag by default. Files are automatically checked for quality encoding, and the program will tell you to use the flag if necessary. To be clear, the flag should only be used in those cases where it is necessary. If you apply it when it is not necessary, it will mess up your data.

    I can't see the stack trace in your post -- I really need to see the bit before the ERROR part.

    Geraldine Van der Auwera, PhD

  • smk_84 Posts: 59 Member

    Is there a way I can do this in a Perl script? I mean, use the flag only when the run would otherwise throw an error, and not otherwise.

    Here is the stack trace when I don't use the flag:

    INFO 13:59:00,325 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:59:00,327 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
    INFO 13:59:00,328 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 13:59:00,328 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 13:59:00,333 HelpFormatter - Program Args: -T RealignerTargetCreator -R Soybean_ref_genome.fasta -I AddOrRep_HN002.bam -o HN002_realigner.intervals -nt 3
    INFO 13:59:00,333 HelpFormatter - Date/Time: 2013/06/24 13:59:00
    INFO 13:59:00,333 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:59:00,333 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:59:01,868 GenomeAnalysisEngine - Strictness is SILENT
    INFO 13:59:03,343 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 13:59:03,350 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 13:59:03,447 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.10
    INFO 13:59:03,507 MicroScheduler - Running the GATK in parallel mode with 3 total threads, 1 CPU thread(s) for each of 3 data thread(s), of 16 processors available on this machine
    INFO 13:59:03,649 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
    INFO 13:59:04,345 GenomeAnalysisEngine - Done creating shard strategy
    INFO 13:59:04,346 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 13:59:04,347 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 13:59:04,358 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 13:59:04,396 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
    INFO 13:59:04,397 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 13:59:04,431 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
    INFO 13:59:04,785 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 13:59:06,701 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 1.92
    INFO 13:59:07,377 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: SAM/BAM file SAMFileReader{/share/data/skhan/genome1_25_soyb/AddOrRep_HN002.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 62; please see the GATK --help documentation for options related to this error
    ERROR ------------------------------------------------------------------------------------------

    It seems to be running fine now if I use only one flag: --fix_misencoded_quality_scores

    Thanks

  • Geraldine_VdAuwera Posts: 5,852 Administrator, GATK Developer

    OK, that makes sense.

    You could set up your runs to try normally (no flag), and if the job quits with an error, do a grep of the error message, and automatically retry with the flag if a key phrase (like "extremely high quality score" or "wrong encoding", which are unique to this error message) is found. You can do that with a shell script, or perl, or whatever you are most comfortable scripting with. That would be much safer than appending the flag automatically.

    Alternatively, if it's likely that most of your files need the flag and you don't want to waste the runtime of all the jobs that fail and restart, you could do a QC step before entering the GATK pipeline, to verify that the encoding of each file is what you expect. We currently don't have any tools to do that but it may be worthwhile to script something like that.

    Geraldine Van der Auwera, PhD
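
The retry approach suggested in the last reply could be sketched in Python roughly as follows. This is only an illustration, not an official GATK utility: the names `needs_quality_fix` and `run_with_retry` and the injectable `runner` parameter are assumptions made for this example. The key phrases and the flag name are taken from the error message and command quoted in this thread. Because it is not certain which stream carries the error text, the sketch searches both stdout and stderr.

```python
import subprocess

# Key phrases copied from the error message quoted in this thread.
ENCODING_ERROR_PHRASES = ("extremely high quality score", "wrong encoding")
# Flag name as used in the thread; apply it only when GATK asks for it.
FIX_FLAG = "--fix_misencoded_quality_scores"


def needs_quality_fix(output):
    """Return True if the captured output contains the misencoded-quality error."""
    return any(phrase in output for phrase in ENCODING_ERROR_PHRASES)


def run_with_retry(cmd, runner=subprocess.run):
    """Run cmd once without the flag; rerun with FIX_FLAG only if needed.

    cmd is a list of arguments. runner is a hypothetical injection point so
    the logic can be exercised without GATK installed. Returns a tuple of
    (final result, whether the flag was added).
    """
    result = runner(list(cmd), capture_output=True, text=True)
    if result.returncode == 0:
        return result, False
    output = (result.stdout or "") + (result.stderr or "")
    if needs_quality_fix(output):
        # The first attempt failed specifically on quality encoding: retry once.
        result = runner(list(cmd) + [FIX_FLAG], capture_output=True, text=True)
        return result, True
    # Any other failure is reported as-is; the flag is not applied blindly.
    return result, False
```

In a batch script, one might call `run_with_retry` once per BAM file with the argument list from the command earlier in the thread (minus the flag), and log which files needed the retry so that unexpected encodings are noticed.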

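The pre-pipeline QC step mentioned in the last reply might look something like the sketch below, which guesses the quality-score offset from the QUAL column (field 11) of SAM text records. It is a heuristic under stated assumptions, not a GATK tool: the function names and the thresholds are choices made for this example. Characters below ASCII 59 cannot occur in offset-64 data, and characters above ASCII 74 are unusual for typical offset-33 Illumina data, so the result should be treated as a hint to review rather than a verdict. For reference, the score of 62 in the error above corresponds to ASCII 95 when decoded with offset 33, which would be Q31 under offset 64.

```python
def quals_from_sam(lines):
    """Yield the QUAL field (column 11) from SAM text lines, skipping headers."""
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 11:
            yield fields[10]


def guess_encoding(quals):
    """Guess the Phred offset from an iterable of QUAL strings.

    Returns "phred33", "phred64", "ambiguous", or "unknown" (no usable data).
    """
    lo, hi = 255, 0
    for q in quals:
        if not q or q == "*":
            continue  # "*" means qualities were not stored for this read
        codes = q.encode("ascii")
        lo = min(lo, min(codes))
        hi = max(hi, max(codes))
    if hi == 0:
        return "unknown"
    if lo < 59:
        return "phred33"  # characters this low are impossible with offset 64
    if hi > 74:
        return "phred64"  # no low characters, and values beyond typical offset-33 range
    return "ambiguous"
```

One way to use this, assuming samtools is available, is to pipe a sample of reads from `samtools view` into a small script that prints `guess_encoding(quals_from_sam(sys.stdin))`, and to add the flag only for files reported as "phred64".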