The current GATK version is 3.2-2



# RealignerTargetCreator

edited July 2012

A new tool has been released!

Check out the documentation at RealignerTargetCreator.


• Posts: 3Member
edited March 2013

Is there a reason why IndelRealigner, BaseRecalibrator, and the other tools in Queue 2.4.7 use target intervals other than the ones created by RealignerTargetCreator?

For example:

The first interval created by RealignerTargetCreator is this one:

chr1 1 123827759 + interval_1

While the rest of the pipeline uses the following interval as first interval:

chr1 1 249250621 + interval_1


I'm not sure I understand what you're describing... Are you saying the intervals are all different, or that the first interval is skipped?

Geraldine Van der Auwera, PhD

• Posts: 3Member

When I use the example Scala script DataProcessingPipeline, I noticed that in a first step it calls RealignerTargetCreator, which created 25 interval files, of which the first is the following:

'chr1 1 123827759 + interval_1'

In the next step it invokes IndelRealigner. This step also uses 25 intervals, but they do not match the ones created by RealignerTargetCreator.

In summary, RealignerTargetCreator creates interval files containing parts of one or several chromosomes, while the subsequent steps (IndelRealigner, BaseRecalibrator, ...) don't use these interval files but use interval files with one whole chromosome per file. Is there any particular reason for this, or do I need to adapt the DataProcessingPipeline so that it uses the intervals created by RealignerTargetCreator?

Oh, I see. I believe what's happening here is that the 25 interval files you are seeing are the ones used to scatter-gather the jobs. This explains the differences between the steps, because different tools need the data to be scattered differently. Then, when each job is run, the actual realignment target interval files are produced and stored elsewhere. This is the expected behavior, but if you would prefer to set it up differently, of course feel free to adapt the script as you like. It is only provided as an example of what you can do, and we discourage users from using it "out of the box" because it is specifically tailored to our workflow needs.
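To make the difference between the two kinds of scattering concrete, here is a minimal Python sketch. It is an illustration only, not Queue's actual code: the function names `scatter_by_contig` and `scatter_by_locus` and the toy contig lengths are made up for this example. One function gives each job a whole contig (like the `chr1 1 249250621` interval above, which spans all of chr1); the other cuts the genome into pieces of nearly equal size, so a piece can end in the middle of a chromosome (like `chr1 1 123827759`).

```python
def scatter_by_contig(contigs):
    """One scatter piece per whole contig."""
    return [[(name, 1, length)] for name, length in contigs]


def scatter_by_locus(contigs, pieces):
    """Cut the genome into `pieces` chunks of (nearly) equal base count,
    splitting contigs where necessary. Assumes pieces <= total bases."""
    total = sum(length for _, length in contigs)
    # Cumulative end of each chunk in genome-wide coordinates.
    bounds = [total * (i + 1) // pieces for i in range(pieces)]
    chunks, current = [], []
    offset = 0  # bases contained in the contigs already processed
    b = 0       # index of the chunk currently being filled
    for name, length in contigs:
        start = 1
        while start <= length:
            # Last base of this contig that still fits in chunk b.
            end = min(length, bounds[b] - offset)
            current.append((name, start, end))
            start = end + 1
            if offset + end == bounds[b]:
                # Chunk b is full: close it and start the next one.
                chunks.append(current)
                current = []
                b += 1
        offset += length
    return chunks


# Toy genome (hypothetical contig names and lengths).
toy = [("chrA", 100), ("chrB", 60), ("chrC", 40)]
by_contig = scatter_by_contig(toy)
by_locus = scatter_by_locus(toy, 4)
```

With the toy genome, `by_contig` holds three pieces (one per contig), while `by_locus` holds four pieces of 50 bases each, the last of which combines the end of `chrB` with all of `chrC`.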

Geraldine Van der Auwera, PhD

• Posts: 59Member

Hi, I am running the GATK command as specified in the documentation:

java -Xmx8g -jar /home/skhan/bio/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R Soybean_ref_genome.fasta -I AddOrRep_HN002.bam -o HN002_realigner.intervals -nt 3 --fix_misencoded_quality_scores -fixMisencodedQuals

and I am getting the following error:

##### ERROR

' at position 12.GE: Invalid argument value '

##### ERROR ------------------------------------------------------------------------------------------

The funny thing is that I ran the exact same command a few days ago and did not get any error. I am wondering if I am missing something?

That's a strange error, can you post the stack trace too? (the part before the ERROR block, that gives the details of where in the code the error occurred).

By the way, --fix_misencoded_quality_scores is not recommended for use by default. You should only use it if you tried without it and the program complained. Also, I don't know why you are specifying that flag twice (--fix_misencoded_quality_scores -fixMisencodedQuals). Did you see that in the documentation or in another user's post?

Geraldine Van der Auwera, PhD

• Posts: 59Member
edited June 2013

Yes, exactly, I did see it in another user's post. Also, I was running many files in batches, so I had put this flag in the script so that it is checked on every file.

Following is the complete error:

### ERROR ------------------------------------------------------------------------------------------

##### ERROR

' at position 12.GE: Invalid argument value '


You definitely should NOT apply this flag by default. Files are automatically checked for quality encoding, and the program will tell you to use the flag if necessary. To be clear, the flag should only be used in those cases where it is necessary. If you apply it when it is not necessary, it will mess up your data.
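As a rough illustration of what checking the quality encoding involves, here is a minimal Python sketch. It is a common heuristic based on the ASCII range of the quality characters, not the check GATK itself performs; the function name `guess_quality_offset` and the cutoffs are assumptions chosen for this example.

```python
def guess_quality_offset(quality_strings):
    """Guess the Phred offset (33 or 64) from ASCII quality characters.

    Returns None when the observed range fits both encodings.
    Raises ValueError if no quality characters are supplied.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (59) cannot occur in +64 encodings.
        return 33
    if highest > 74:
        # Characters above 'J' (74, i.e. Q41 in Phred+33) are unusual for
        # typical Illumina Phred+33 data, so +64 is the likelier encoding.
        return 64
    return None


offset = guess_quality_offset(["IIIIHHH#", "GGGFFF##"])
```

Because the heuristic only sees the characters it is given, a short or unusually uniform sample can be ambiguous, which is why the sketch returns `None` rather than guessing in that case.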

I can't see the stack trace in your post -- I really need to see the bit before the ERROR part.

Geraldine Van der Auwera, PhD

• Posts: 59Member

Is there a way I can do it in a Perl script? I mean, to use the flag only when the run would otherwise throw an error, and not otherwise.

Here is the stack trace when I don't use the flag

INFO 13:59:00,325 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:59:00,327 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
INFO 13:59:00,328 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:59:00,328 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:59:00,333 HelpFormatter - Program Args: -T RealignerTargetCreator -R Soybean_ref_genome.fasta -I AddOrRep_HN002.bam -o HN002_realigner.intervals -nt 3
INFO 13:59:00,333 HelpFormatter - Date/Time: 2013/06/24 13:59:00
INFO 13:59:00,333 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:59:00,333 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:59:01,868 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:59:03,343 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 13:59:03,350 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:59:03,447 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.10
INFO 13:59:03,507 MicroScheduler - Running the GATK in parallel mode with 3 total threads, 1 CPU thread(s) for each of 3 data thread(s), of 16 processors available on this machine
INFO 13:59:03,649 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 13:59:04,345 GenomeAnalysisEngine - Done creating shard strategy
INFO 13:59:04,346 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:59:04,347 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 13:59:04,358 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:59:04,396 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO 13:59:04,397 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:59:04,431 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 13:59:04,785 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:59:06,701 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 1.92
INFO 13:59:07,377 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

It seems to be running fine now if I only use one flag: --fix_misencoded_quality_scores

thanks
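On the earlier question of applying the flag from a script only when it is needed: one possible approach is to run the command without the flag first and retry with `--fix_misencoded_quality_scores` only if that run fails with a complaint about quality encoding. Below is a minimal sketch in Python rather than Perl. The helper names (`run_capture`, `run_with_fallback`) and the `asks_for_flag` predicate are hypothetical; in particular, the text GATK prints when it wants the flag is not shown in this thread, so the predicate must be written by the user after looking at a real failing run.

```python
import subprocess

FIX_FLAG = "--fix_misencoded_quality_scores"


def run_capture(cmd):
    """Run a command; return (exit code, combined stdout + stderr text)."""
    done = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return done.returncode, done.stdout


def run_with_fallback(cmd, asks_for_flag, runner=run_capture):
    """Run `cmd`; retry once with FIX_FLAG only if the first run failed
    and `asks_for_flag(output)` says the failure is about quality encoding.

    `asks_for_flag` is a user-supplied function (an assumption of this
    sketch); the exact GATK message it should match is not known here.
    Returns (exit code, output, whether the flag was used).
    """
    code, output = runner(cmd)
    if code != 0 and asks_for_flag(output):
        code, output = runner(cmd + [FIX_FLAG])
        return code, output, True
    return code, output, False
```

The command is passed as a list of arguments (for example `["java", "-Xmx8g", "-jar", "GenomeAnalysisTK.jar", "-T", "RealignerTargetCreator", ...]`), and the `runner` parameter exists so that the retry logic can be tried out with a stand-in function before pointing it at real data.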