RealignerTargetCreator gets a empty file, why?

livefallflylivefallfly Posts: 23Member
edited November 2012 in Ask the GATK team
java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam -T RealignerTargetCreator -o exampleTarget.intervals

I used the example files, and I got no error report, but in the output, there is no data, that is to say. The exampleTarget.intervals is empty.

Why is this happening?

And when I add --known dbsnp_135.hg19.vcf( that is in the bundle), it will get an error

ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig known = chr1 / 249250621
ERROR contig reference = chr1 / 100000.
Post edited by Geraldine_VdAuwera on

Best Answer

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin
    Answer ✓

    OK, I have good news -- you're not doing anything wrong. I checked the example file; the reason you're not getting any SNPs is that there aren't any in there. This example file was intended only to test the very simple walkers in the "first time GATK" tutorials; it's not appropriate for use with more advanced tools like the UnifiedGenotyper. I'm sorry that wasn't clear in the documentation, I'll try to correct that. To test the UG, you can use the file called NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam in the bundle. That contains a dataset for chromosome 20. You will get plenty of SNPs there.

    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Regarding your first problem, it looks like exampleBAM.bam does not have any problem regions that need to be realigned...

    For your second problem, that's because the example files are aligned to the b37 reference, but you're using the hg19 version of the dbsnp file. See the FAQs for why the b37/hg19 stuff is important.

    Geraldine Van der Auwera, PhD

  • livefallflylivefallfly Posts: 23Member
    edited November 2012

    Thanks very much.

    Post edited by livefallfly on
  • livefallflylivefallfly Posts: 23Member
    edited November 2012

    I have got another problem for calling SNPs. I can't get SNPs in the example.vcf.

    java -jar ~/chenlong/GenomeAnalysisTK-2.1-5/GenomeAnalysisTK.jar -T UnifiedGenotyper -I exampleFASTA.bam -R exampleFASTA.fasta --dbsnp exampleSNP.vcf -o example.vcf
    

    And it is all the same either with -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 or not.

    And I can't get the result either with my own data.

    Thanks.

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Can you tell me a little more about your problem? Specifically, I need to know what happens when you run the command. Is there no output file created, or is the file empty? Do you get any error messages?

    By the way, we recently released version 2.2, so I recommend you upgrade -- the tools are significantly faster in the new version.

    Geraldine Van der Auwera, PhD

  • livefallflylivefallfly Posts: 23Member

    Thanks for your answer. As for your questions, it creates the example.vcf file. And in the file it contains the annotation line. The final line in the example.vcf file is: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT exampleBAM But there is no error messages report to me. And I will upgrade the GATK, and have a try again.

  • livefallflylivefallfly Posts: 23Member
    edited November 2012

    java -jar ~/GenomeAnalysisTK-2.1-5/GenomeAnalysisTK.jar -T UnifiedGenotyper -I exampleBAM.bam -R exampleFASTA.fasta --dbsnp exampleSNP.vcf -o example.vcf

    INFO 21:58:41,035 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:58:41,036 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-5-gf3daab0, Compiled 2012/08/27 17:19:33 INFO 21:58:41,036 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 21:58:41,036 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 21:58:41,036 HelpFormatter - Program Args: -T UnifiedGenotyper -I exampleBAM.bam -R exampleFASTA.fasta --dbsnp exampleSNP.vcf -o example.vcf INFO 21:58:41,037 HelpFormatter - Date/Time: 2012/11/04 21:58:41 INFO 21:58:41,037 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:58:41,037 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:58:41,054 ArgumentTypeDescriptor - Dynamically determined type of exampleSNP.vcf to be VCF INFO 21:58:41,061 GenomeAnalysisEngine - Strictness is SILENT INFO 21:58:41,130 SAMDataSource$SAMReaders - Initializing SAMRecords in serial WARNING: BAM index file /home/livefallfly/chenlong/GenomeAnalysisTK-2.1-5/resources/exampleBAM.bam.bai is older than BAM /home/livefallfly/chenlong/GenomeAnalysisTK-2.1-5/resources/exampleBAM.bam INFO 21:58:41,140 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 21:58:41,148 RMDTrackBuilder - Loading Tribble index from disk for file exampleSNP.vcf WARN 21:58:41,234 VCFStandardHeaderLines$Standards - Repairing standard header line for field GQ because -- type disagree; header has Float but standard is Integer INFO 21:58:41,581 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] INFO 21:58:41,581 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 21:58:42,666 TraversalEngine - Total runtime 1.17 secs, 0.02 min, 0.00 hours INFO 21:58:42,667 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) INFO 21:58:44,838 GATKRunReport - Uploaded run statistics report to AWS S3

    This is the screen output with no error.

    Post edited by livefallfly on
  • livefallflylivefallfly Posts: 23Member
    edited November 2012

    I have tried the new version. But it is the same question. It created the example.vcf file with the annotation infomation. Below is the screen output

    INFO  14:25:36,452 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  14:25:36,465 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-3-gde33222, Compiled 2012/11/03 01:04:17 
    INFO  14:25:36,466 HelpFormatter - Copyright (c) 2010 The Broad Institute 
    INFO  14:25:36,466 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  14:25:36,468 HelpFormatter - Program Args: -T UnifiedGenotyper -I exampleBAM.bam --dbsnp ../../GenomeAnalysisTK-2.1-5/resources/exampleSNP.vcf -R exampleFASTA.fasta -o example.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 
    INFO  14:25:36,468 HelpFormatter - Date/Time: 2012/11/05 14:25:36 
    INFO  14:25:36,468 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  14:25:36,468 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  14:25:36,515 ArgumentTypeDescriptor - Dynamically determined type of ../../GenomeAnalysisTK-2.1-5/resources/exampleSNP.vcf to be VCF 
    INFO  14:25:36,520 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  14:25:36,599 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000  
    INFO  14:25:36,603 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    WARNING: BAM index file /home/livefallfly/chenlong/GenomeAnalysisTK-2.2-3/resources/exampleBAM.bam.bai is older than BAM /home/livefallfly/chenlong/GenomeAnalysisTK-2.2-3/resources/exampleBAM.bam
    INFO  14:25:36,616 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
    INFO  14:25:36,624 RMDTrackBuilder - Loading Tribble index from disk for file ../../GenomeAnalysisTK-2.1-5/resources/exampleSNP.vcf 
    WARN  14:25:36,692 VCFStandardHeaderLines$Standards - Repairing standard header line for field GQ because -- type disagree; header has Float but standard is Integer 
    INFO  14:25:36,705 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  14:25:36,706 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  14:25:37,992 ProgressMeter -            done        1.00e+05    1.3 s       12.9 s    100.0%         1.3 s     0.0 s 
    INFO  14:25:37,992 ProgressMeter - Total runtime 1.29 secs, 0.02 min, 0.00 hours 
    INFO  14:25:37,993 MicroScheduler - 0 reads were filtered out during traversal out of 33 total (0.00%) 
    INFO  14:25:37,993 NSRuntimeProfile - Input   time:    0.3 s (20.96%) 
    INFO  14:25:37,993 NSRuntimeProfile - Map     time:    0.7 s (57.91%) 
    INFO  14:25:37,993 NSRuntimeProfile - Reduce  time:    0.0 s ( 1.99%) 
    INFO  14:25:37,993 NSRuntimeProfile - Outside time:    0.2 s (19.13%) 
    INFO  14:25:41,040 GATKRunReport - Uploaded run statistics report to AWS S3 
    

    And next is the example.vcf

    ##..
    ...
    ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
    ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
    ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
    ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
    ##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
    ##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
    ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
    ##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
    ##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[exampleBAM.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=exampleFASTA.fasta nonDeterministicRandomSeed=false disableRandomization=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 enable_experimental_downsampling=false baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 defaultBaseQualities=-1 validation_strictness=SILENT remove_program_records=false keep_program_records=false unsafe=null num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false logging_level=INFO log_to_file=null help=false genotype_likelihoods_model=SNP pcr_error_rate=1.0E-4 computeSLOD=false annotateNDA=false pair_hmm_implementation=ORIGINAL min_base_quality_score=17 max_deletion_fraction=0.05 min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null sample_ploidy=2 min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 min_reference_depth=100 exclude_filtered_reference_sites=false heterozygosity=0.001 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=10.0 alleles=(RodBinding name= source=UNBOUND) max_alternate_alleles=6 p_nonref_model=EXACT_INDEPENDENT contamination_fraction_to_filter=0.05 logRemovedReadsFromContaminationFiltering=null exactcallslog=null dbsnp=(RodBinding name=dbsnp source=../../GenomeAnalysisTK-2.1-5/resources/exampleSNP.vcf) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false"
    ##contig=<ID=chr1,length=100000>
    ##reference=file:///home/livefallfly/chenlong/GenomeAnalysisTK-2.2-3/resources/exampleFASTA.fasta
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  exampleBAM.bam
    

    And no SNP, I don't why.

    Post edited by Geraldine_VdAuwera on
  • livefallflylivefallfly Posts: 23Member

    Thanks for your help. It works well now. And I have another question, IndelGenotyperV2 is not used again, so which replace the function? And makeIndelMask.py is in the GATK? where can I get it. if not ,which software can I get it?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    It sounds like you're following a really old pipeline example. Most of the tools have changed -- for example, now the variant calling is done by either the UnifiedGenotyper or the HaplotypeCaller. Please read the Best Practices document, it will tell you what tools to use and in what order:

    http://www.broadinstitute.org/gatk/guide/article?id=1186

    Geraldine Van der Auwera, PhD

  • livefallflylivefallfly Posts: 23Member

    It's very helpful. Thanks very much!

  • nanacnanac Posts: 9Member

    Hi, I'm trying to use the RealignerTargetCreator and I got empty Target.intervals files. I read that maybe it means that no regions need to be realigned. But then when I used the IndelRealigner tool with my empty Target.intervals files I got empty BAMs. I thought I should have gotten the same BAM (the one used as INPUT file) as there is nothing to realign? Why is it not the case?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Sounds like something is wrong with your data files. Have you validated your bam file? And if you try using PrintReads on a small interval, does it generate a small bam properly?

    Geraldine Van der Auwera, PhD

  • nanacnanac Posts: 9Member

    Actually, if I do not use the option -L my_selected_scaff when I'm using PrintReads I got this error message : Badly formed genome loc: Parameters to GenomeLocParser are incorrect:The contig index 23924 is greater than the stored sequence count (1)

    Actually I created a new bam only with the reads mapping on my_selected_scaff with this command: samtools view -h my_bam "gi|15616630|ref|NC_002528.1|" | samtools view -bS -> my_new_bam But I still have in the header a line for each scaff I had before. I runned then RealignerTargetCreator and PrintReads reads on this BAM.

    If I use the option -L "gi|15616630|ref|NC_002528.1|" I do not have any error but I got at the end an empty bam.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Something is definitely wrong with your new bam file. You should re-create it from the original file using PrintReads with the -L option. Keep in mind that the -L option does not use quotes for contig names.

    Geraldine Van der Auwera, PhD

  • nanacnanac Posts: 9Member
    edited February 2013

    Then I tried without quotes but I got an error because the tool is looking for a contig only named gi. So I tried -L gi\|15616630\|ref\|NC_002528.1\| and I got the same as before : no error message but no result in my final BAM. I really don't understand what's happening here. Anyway, as I got an empty interval file for realignment, can I suppose that I do not need to realign and then I can keep working on my initial BAM?

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    You would be better off simplifying the contig names, eg here change it to just NC_002528.1.

    Is that a very small contig? If so it's possible that the RTC is not finding anything that needs to be realigned. I would recommend doing one more try with PrintReads on that contig and make sure that the data in the new file is the same as the old one. Then you can move on to the next processing step.

    Geraldine Van der Auwera, PhD

  • nanacnanac Posts: 9Member

    I tried to change the scaffold name but when I use PrintReads to select the scaff I want to work on, it doesn't print anything in my BAM file but the header. But with this command : samtool view -h mybam.bam buch_scaff | samtools view -bS -> mybam_buchnera.bam with buch_scaff the name of the scaff for which I want to retrieve the reads mapping on it. The new name doesn't change anything about the realignment. My interval file is still empty. The length of the scaffold is 600 kb. Maybe it's not short enough to think that there is no need of a realignment step...?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    @nanac, if PrintReads isn't doing the right thing, there's a problem you need to fix before you can move on. You should check if the contigs are in the exact same order in your reference as in your bam file. If you changed the reference ordering after alignment that would explain the problem. If the contigs are not in the same order the GATK may fail to find the right contig and think there are no reads for that contig. We are working on improving the way the GATK handles this case, but in the meantime you need to check your files.

    Geraldine Van der Auwera, PhD

  • nanacnanac Posts: 9Member

    Thanks! I think I solve my problem. You were right, I had a problem in my reference file! Everything is working fine now!

Sign In or Register to comment.