v3.7-0-gcfedb67: -T RealignerTargetCreator does not support --known blah.vcf.gz

Hi,
it seems GATK does not realize that it has opened a vcf.gz (actually a vcf.bgz) file.
java -Djavaio.tmpdir=. -jar /scratch/mmokrejs/GATK/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /scratch/mmokrejs/db/ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/GRCh37.p13.genome.GRC-style.minimal.fasta -I 5584.removed_duplicates.bam -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/1000G_phase3_v4_20130502.sites.vcf.gz -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/hapmap_3.3.b37.vcf.gz -known /scratch/mmokrejs/db/ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/GATK/common_all_20160601.sorted.vcf.gz -known /scratch/mmokrejs/db/ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12877/NA12877.vcf.gz -known /scratch/mmokrejs/db/ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12878/NA12878.vcf.gz -o 5584.forIndelRealigner.intervals
INFO 16:51:39,199 HelpFormatter - ------------------------------------------------------------------------------------------
INFO 16:51:39,202 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 16:51:39,202 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 16:51:39,202 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 16:51:39,202 HelpFormatter - [Sat Jan 21 16:51:39 CET 2017] Executing on Linux 2.6.32-642.6.2.el6.Bull.104.x86_64 amd64
INFO 16:51:39,202 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15
INFO 16:51:39,206 HelpFormatter - Program Args: -T RealignerTargetCreator -R /scratch/mmokrejs/db/ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/GRCh37.p13.genome.GRC-style.minimal.fasta -I 5584.removed_duplicates.bam -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/1000G_phase3_v4_20130502.sites.vcf.gz -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/hapmap_3.3.b37.vcf.gz -known /scratch/mmokrejs/db/ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/GATK/common_all_20160601.sorted.vcf.gz -known /scratch/mmokrejs/db/ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12877/NA12877.vcf.gz -known /scratch/mmokrejs/db/ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12878/NA12878.vcf.gz -o 5584.forIndelRealigner.intervals
INFO 16:51:39,211 HelpFormatter - Executing as [email protected] on Linux 2.6.32-642.6.2.el6.Bull.104.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15.
INFO 16:51:39,211 HelpFormatter - Date/Time: 2017/01/21 16:51:39
INFO 16:51:39,211 HelpFormatter - ------------------------------------------------------------------------------------------
INFO 16:51:39,211 HelpFormatter - ------------------------------------------------------------------------------------------
INFO 16:51:39,257 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:51:39,366 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 16:51:39,374 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:51:39,406 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
WARN 16:51:39,803 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 16:51:39,804 IndexDictionaryUtils - Track known2 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 16:51:39,804 IndexDictionaryUtils - Track known3 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 16:51:39,804 IndexDictionaryUtils - Track known4 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 16:51:39,804 IndexDictionaryUtils - Track known5 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 16:51:39,804 IndexDictionaryUtils - Track known6 doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 16:51:39,950 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 16:51:40,233 GenomeAnalysisEngine - Done preparing for traversal
INFO 16:51:40,234 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:51:40,234 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 16:51:40,234 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Line 128: there aren't enough columns for line ?BCp9?}ksc???g?W ???B??11? (we expected 9 tokens, and saw 1 ), for input source: /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz
Answers
The format decoding is based on the extension, so if you have the wrong extension it's going to do the wrong thing.
Hi Geraldine, what do you mean? GATK and Picard can read, and actually require, bgzipped files. I do not know why most people use .vcf.gz instead of .vcf.bgz; it is confusing. However, 'picard.jar index blah.vcf.gz' works while 'picard.jar index blah.vcf.bgz' does not. That is why I use .vcf.gz with GATK, assuming that since the code base is shared it will work there as well. Anyway, BGZF files can be decompressed with gzip. What am I missing? That just the file extension is not recognized? Well, that is why I reported it. ;-)
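Because both plain gzip and BGZF files commonly carry the .gz extension, the extension alone does not say which one a file is. The following is a minimal sketch, not part of GATK or Picard, that inspects the first gzip header for the BGZF "BC" extra subfield described in the SAM/BAM specification. The function name `is_bgzf` is made up for illustration.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block header, else False.

    Illustrative helper (hypothetical name), not a GATK or Picard API.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        # gzip magic (1f 8b), deflate method (08), and the FEXTRA flag bit (04)
        if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)

    # Walk the extra subfields looking for 'B','C' with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if si1 == 66 and si2 == 67 and slen == 2:
            return True
        pos += 4 + slen
    return False
```

A file written by ordinary gzip would be expected to return False here, while one written by bgzip would return True. This only checks the first block and says nothing about whether a matching index exists.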
Good point, I did not have Mills_and_1000G_gold_standard.indels.b37.vcf.gz.tbi.
I had only indexes from picard's index:
Mills_and_1000G_gold_standard.indels.b37.vcf.gz.idx
Mills_and_1000G_gold_standard.indels.b37.vcf.idx
and the flatfile:
Mills_and_1000G_gold_standard.indels.b37.vcf
Hi @Geraldine_VdAuwera ,
I get the exact same WARN message that @mmokrejs got for the RealignerTargetCreator tool in GATK 3.7.
I made sure that I have a proper index file (vcf.gz.tbi, made with tabix version 1.4) in the same location as the original file (bgzipped.vcf.gz), and I do not understand why I get this WARN message.
What other index files or dictionaries do I need in order to make GATK happy?
I am using OpenJDK 1.8; would Oracle Java 1.8 make any difference here?
These are the -known files that I have for the Mills indels:
[[email protected] b37]$ ls -lhA Mills_and_1000G_gold_standard.indels.b37*
-rw-r--r-- 1 sgajja sbsuser 20M Apr 11 22:35 Mills_and_1000G_gold_standard.indels.b37_bgzipped.vcf.gz
-rw-r--r-- 1 sgajja sbsuser 1.5M Apr 11 22:36 Mills_and_1000G_gold_standard.indels.b37_bgzipped.vcf.gz.tbi
-rwxr--r-- 1 sgajja sbsuser 19M Nov 4 2015 Mills_and_1000G_gold_standard.indels.b37.vcf.gz
-rw-r--r-- 1 sgajja sbsuser 120 Nov 4 2015 Mills_and_1000G_gold_standard.indels.b37.vcf.gz.md5
-rwxr--r-- 1 sgajja sbsuser 536K Nov 4 2015 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
-rw-r--r-- 1 sgajja sbsuser 124 Nov 4 2015 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz.md5
[[email protected] b37]$
My GATK run and log are as follows:
INFO 22:56:54,778 HelpFormatter - --------------------------------------------------------------------------------
INFO 22:56:54,780 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 22:56:54,780 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 22:56:54,780 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 22:56:54,780 HelpFormatter - [Tue Apr 11 22:56:54 EDT 2017] Executing on Linux 3.10.0-229.el7.x86_64 amd64
INFO 22:56:54,781 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_111-b15
INFO 22:56:54,783 HelpFormatter - Program Args: -nt 12 -T RealignerTargetCreator -R human_g1k_v37.fasta -I bwaAlign_MDup.bam -L exome_targetedregions_v1.2.bed -ip 100 -o targetIntervals_forRealign.list -known Mills_and_1000G_gold_standard.indels.b37_bgzipped.vcf.gz -known 1000G_phase1.indels.b37_bgzipped.vcf.gz
INFO 22:56:54,786 HelpFormatter - Executing as [email protected] on Linux 3.10.0-229.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b15.
INFO 22:56:54,786 HelpFormatter - Date/Time: 2017/04/11 22:56:54
INFO 22:56:54,786 HelpFormatter - --------------------------------------------------------------------------------
INFO 22:56:54,786 HelpFormatter - --------------------------------------------------------------------------------
INFO 22:56:54,797 GenomeAnalysisEngine - Strictness is SILENT
INFO 22:56:54,876 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 22:56:54,881 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:54,902 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO 22:56:55,539 IntervalUtils - Processing 85392035 bp from intervals
WARN 22:56:55,551 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 22:56:55,551 IndexDictionaryUtils - Track known2 doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 22:56:55,557 MicroScheduler - Running the GATK in parallel mode with 12 total threads, 1 CPU thread(s) for each of 12 data thread(s), of 40 processors available on this machine
INFO 22:56:55,605 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 22:56:55,747 GenomeAnalysisEngine - Done preparing for traversal
INFO 22:56:55,748 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 22:56:55,748 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 22:56:55,748 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 22:56:55,752 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,757 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:56:55,758 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,762 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:56:55,762 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,768 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 22:56:55,768 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,772 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:56:55,772 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,775 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:56:55,776 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,779 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:56:55,780 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,783 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:56:55,784 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,789 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 22:56:55,790 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,793 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:56:55,793 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,797 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:56:55,797 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:56:55,800 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 22:57:25,764 ProgressMeter - 13:70681933 1.78714595E8 30.0 s 0.0 s 65.6% 45.0 s 15.0 s
INFO 22:57:36,832 ProgressMeter - done 2.80942824E8 41.0 s 0.0 s 100.0% 41.0 s 0.0 s
INFO 22:57:36,833 ProgressMeter - Total runtime 41.09 secs, 0.68 min, 0.01 hours
INFO 22:57:36,834 MicroScheduler - 1164392 reads were filtered out during the traversal out of approximately 11696894 total reads (9.95%)
INFO 22:57:36,834 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 22:57:36,834 MicroScheduler - -> 18377 reads (0.16% of total) failing BadMateFilter
INFO 22:57:36,834 MicroScheduler - -> 267967 reads (2.29% of total) failing DuplicateReadFilter
INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 22:57:36,835 MicroScheduler - -> 875795 reads (7.49% of total) failing MappingQualityZeroFilter
INFO 22:57:36,835 MicroScheduler - -> 2253 reads (0.02% of total) failing NotPrimaryAlignmentFilter
INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter
INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
Done. ------------------------------------------------------------------------------------------
Thanks,
mglclinical
@mglclinical
Hi,
Those WARN statements are nothing to worry about. Your run finished without an error. The WARNING statements are telling you the VCF does not have sequence information in the header. In that case, the tool won't check the VCF header sequence lines against the FASTA .dict file. As long as you are sure you are using the correct reference and VCF, you are good to go.
-Sheila
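Going by Sheila's description, one quick way to see whether a VCF carries sequence information in its header is to list its `##contig` lines. This is a rough sketch under that assumption. It only inspects the VCF header, and GATK may take its dictionary information from elsewhere (for example the index), so an empty result here is not proof of what triggers the WARN. The function name `contig_header_lines` is hypothetical.

```python
import gzip


def contig_header_lines(vcf_path):
    """Return the ##contig header lines of a plain or gzip/BGZF-compressed VCF.

    Illustrative helper (hypothetical name), not a GATK API.
    """
    # Python's gzip module reads multi-member gzip files, which covers BGZF.
    opener = gzip.open if vcf_path.endswith((".gz", ".bgz")) else open
    contigs = []
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line or the first record
            if line.startswith("##contig="):
                contigs.append(line.rstrip("\n"))
    return contigs
```

If the list comes back empty, the header has no contig entries to compare against the reference .dict file, which matches the situation Sheila describes.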
Thank you @Sheila for letting me know that I don't need to worry about those warnings
Hi @Sheila ,
I have got other WARN statements for HaplotypeCaller. Can I just ignore the following as well ?
WARN 07:19:17,107 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
WARN 07:19:25,890 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller
WARN 07:19:28,157 HaplotypeCallerGenotypingEngine - location 4:67783-67787: too many alternative alleles found (7) larger than the maximum requested with -maxAltAlleles (6), the following will be dropped: ATTTT.
WARN 07:22:22,642 AnnotationUtils - Annotation will not be calculated, genotype is not called
Thanks,
mglclinical
@mglclinical
Hi mglclinical,
Yes, you can ignore those as well. They are either telling you annotations cannot be calculated (for reasons why, check out the annotation documentation) or that there are more than the default number of alternate alleles allowed at a site. You can change the default value with --max_alternate_alleles.
-Sheila
Thank you so much @Sheila
Dear all,
I have generated Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi with tabix.
When running RealignerTargetCreator, should I list the .tbi file among the inputs?
Is it correct to use -known Mills_and_1000G_gold_standard.indels.hg38.vcf, or should I use -known Mills_and_1000G_gold_standard.indels.hg38.vcf.gz?
java-new -jar ../GenomeAnalysisTK.jar -T RealignerTargetCreator -R Homo_sapiens_assembly38.fasta -I unodedupsorted474_1_riccio-161221_GAATCTGA.bam -known Mills_and_1000G_gold_standard.indels.hg38.vcf -o unotarget_intervals.list &
thank you
vittoria
@vittoria
Hi Vittoria,
I am not sure I understand your question. Are you asking if you can use .gz files as input VCF files to GATK? You can indeed use .gz files in GATK, but you must make sure to specify the exact name of the VCF file in your command. If you unzipped the .gz file, then use .vcf. If you are leaving the file zipped, use .vcf.gz.
Note, you can also find the hg38 resource files on the cloud here.
-Sheila
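To make the naming rule concrete, here is a small sketch that builds the `-known` argument from the exact VCF file name and reports which index file would be expected next to it. It assumes, based on the file listings earlier in this thread, that a `.vcf.gz` is paired with a `.vcf.gz.tbi` and a plain `.vcf` with a `.vcf.idx`, and that the index sits beside the VCF rather than being passed as its own argument. The function name `known_argument` is made up for illustration.

```python
import os


def known_argument(vcf_path):
    """Return (['-known', vcf_path], expected_index_path, index_exists).

    Illustrative helper (hypothetical name); the index naming is an
    assumption taken from the file listings in this thread.
    """
    if vcf_path.endswith(".vcf.gz"):
        index_path = vcf_path + ".tbi"   # tabix index for a bgzipped VCF
    elif vcf_path.endswith(".vcf"):
        index_path = vcf_path + ".idx"   # index for an uncompressed VCF
    else:
        raise ValueError("expected a .vcf or .vcf.gz file: " + vcf_path)
    return ["-known", vcf_path], index_path, os.path.exists(index_path)


args, index_path, found = known_argument(
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
)
print(" ".join(args), "| expects", index_path, "| present:", found)
```

The point is only that the name given to `-known` must match the file on disk exactly, compressed or not.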