v3.7-0-gcfedb67: -T RealignerTargetCreator does not support --known blah.vcf.gz

mmokrejs, Czech Republic (Member)

Hi,
it seems GATK does not realize that it has opened a vcf.gz (actually a vcf.bgz) file.

java -Djavaio.tmpdir=. -jar /scratch/mmokrejs/GATK/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /scratch/mmokrejs/db/ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/GRCh37.p13.genome.GRC-style.minimal.fasta -I 5584.removed_duplicates.bam -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/1000G_phase3_v4_20130502.sites.vcf.gz -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/hapmap_3.3.b37.vcf.gz -known /scratch/mmokrejs/db/ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/GATK/common_all_20160601.sorted.vcf.gz -known /scratch/mmokrejs/db/ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12877/NA12877.vcf.gz -known /scratch/mmokrejs/db/ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12878/NA12878.vcf.gz -o 5584.forIndelRealigner.intervals
INFO  16:51:39,199 HelpFormatter - ------------------------------------------------------------------------------------------ 
INFO  16:51:39,202 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
INFO  16:51:39,202 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  16:51:39,202 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  16:51:39,202 HelpFormatter - [Sat Jan 21 16:51:39 CET 2017] Executing on Linux 2.6.32-642.6.2.el6.Bull.104.x86_64 amd64 
INFO  16:51:39,202 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15 
INFO  16:51:39,206 HelpFormatter - Program Args: -T RealignerTargetCreator -R /scratch/mmokrejs/db/ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/GRCh37.p13.genome.GRC-style.minimal.fasta -I 5584.removed_duplicates.bam -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/1000G_phase3_v4_20130502.sites.vcf.gz -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz -known /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/hapmap_3.3.b37.vcf.gz -known /scratch/mmokrejs/db/ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/GATK/common_all_20160601.sorted.vcf.gz -known /scratch/mmokrejs/db/ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12877/NA12877.vcf.gz -known /scratch/mmokrejs/db/ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12878/NA12878.vcf.gz -o 5584.forIndelRealigner.intervals 
INFO  16:51:39,211 HelpFormatter - Executing as [email protected] on Linux 2.6.32-642.6.2.el6.Bull.104.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15. 
INFO  16:51:39,211 HelpFormatter - Date/Time: 2017/01/21 16:51:39 
INFO  16:51:39,211 HelpFormatter - ------------------------------------------------------------------------------------------ 
INFO  16:51:39,211 HelpFormatter - ------------------------------------------------------------------------------------------ 
INFO  16:51:39,257 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:51:39,366 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  16:51:39,374 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:51:39,406 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 
WARN  16:51:39,803 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  16:51:39,804 IndexDictionaryUtils - Track known2 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  16:51:39,804 IndexDictionaryUtils - Track known3 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  16:51:39,804 IndexDictionaryUtils - Track known4 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  16:51:39,804 IndexDictionaryUtils - Track known5 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  16:51:39,804 IndexDictionaryUtils - Track known6 doesn't have a sequence dictionary built in, skipping dictionary validation 
INFO  16:51:39,950 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  16:51:40,233 GenomeAnalysisEngine - Done preparing for traversal 
INFO  16:51:40,234 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  16:51:40,234 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  16:51:40,234 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Line 128: there aren't enough columns for line ?BCp9?}ksc???g?W ???B??11? (we expected 9 tokens, and saw 1 ), for input source: /scratch/mmokrejs/db/ftp.broadinstitute.org/bundle/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    The format decoding is based on the extension, so if you have the wrong extension it's going to do the wrong thing.

  • mmokrejs, Czech Republic (Member)

    Hi Geraldine, what do you mean? GATK and Picard can read, and actually require, bgzipped files. I do not know why most people use .vcf.gz instead of .vcf.bgz; it is confusing. However, 'picard.jar index blah.vcf.gz' works while 'picard.jar index blah.vcf.bgz' does not. That is why I use .vcf.gz with GATK, assuming that since the code base is shared it will work there as well. Anyway, BGZF files can be decompressed with gzip. What am I missing? That just the file extension is not recognized? Well, that is why I reported it. ;-)

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)
    Ah, I must have misread your post, sorry. Does your file have a proper tabix index?
  • mmokrejs, Czech Republic (Member)

    Good point, I did not have Mills_and_1000G_gold_standard.indels.b37.vcf.gz.tbi.

    I had only indexes from picard's index:
    Mills_and_1000G_gold_standard.indels.b37.vcf.gz.idx
    Mills_and_1000G_gold_standard.indels.b37.vcf.idx

    and the flatfile:
    Mills_and_1000G_gold_standard.indels.b37.vcf
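
The error at the top of the thread shows compressed bytes being parsed as VCF text, so one quick check is whether a `.vcf.gz` file is really BGZF and not plain gzip. Below is a minimal sketch, assuming a POSIX shell with `od` and `tr`. The name `is_bgzf` is a hypothetical helper, not a GATK, Picard, or htslib command, and it only inspects the first 16 bytes under the assumption stated in the comments.

```shell
# is_bgzf FILE
# Succeeds (exit 0) if FILE starts with a BGZF block header, fails otherwise.
# Assumption: the "BC" extra subfield comes first in the gzip extra field,
# which is how bgzip writes it; the format allows other subfields before it.
is_bgzf() {
    # Hex-dump the first 16 bytes and strip all whitespace.
    header=$(od -An -tx1 -N16 "$1" 2>/dev/null | tr -d '[:space:]')
    case "$header" in
        # 1f8b = gzip magic, 08 = deflate, 04 = FEXTRA flag,
        # then 8 bytes that are not inspected (mtime, xfl, os, xlen),
        # then 'B' 'C' (42 43) with a subfield length of 2 (02 00).
        1f8b0804????????????????42430200) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_bgzf Mills_and_1000G_gold_standard.indels.b37.vcf.gz && echo BGZF`. A pass here says nothing about whether a matching `.tbi` file exists; that still has to be checked separately.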

  • mglclinical, USA (Member)
    edited April 2017

    Hi @Geraldine_VdAuwera,

    I get the exact same WARN message that @mmokrejs got from RealignerTargetCreator in GATK 3.7.

    I made sure that the tabix 1.4 index file (.vcf.gz.tbi) is in the same location as the bgzipped file (.vcf.gz), and I do not understand why I still get this WARN message.

    What other index files or dictionaries do I need in order to make GATK happy?

    I am using OpenJDK 1.8; would Oracle Java 1.8 make any difference here?

    The following are the -known files that I have for the Mills indels:

    [[email protected] b37]$ ls -lhA Mills_and_1000G_gold_standard.indels.b37*
    -rw-r--r-- 1 sgajja sbsuser 20M Apr 11 22:35 Mills_and_1000G_gold_standard.indels.b37_bgzipped.vcf.gz
    -rw-r--r-- 1 sgajja sbsuser 1.5M Apr 11 22:36 Mills_and_1000G_gold_standard.indels.b37_bgzipped.vcf.gz.tbi
    -rwxr--r-- 1 sgajja sbsuser 19M Nov 4 2015 Mills_and_1000G_gold_standard.indels.b37.vcf.gz
    -rw-r--r-- 1 sgajja sbsuser 120 Nov 4 2015 Mills_and_1000G_gold_standard.indels.b37.vcf.gz.md5
    -rwxr--r-- 1 sgajja sbsuser 536K Nov 4 2015 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
    -rw-r--r-- 1 sgajja sbsuser 124 Nov 4 2015 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz.md5
    [[email protected] b37]$

    My GATK run and log are as follows:

    INFO 22:56:54,778 HelpFormatter - --------------------------------------------------------------------------------
    INFO 22:56:54,780 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
    INFO 22:56:54,780 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 22:56:54,780 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 22:56:54,780 HelpFormatter - [Tue Apr 11 22:56:54 EDT 2017] Executing on Linux 3.10.0-229.el7.x86_64 amd64
    INFO 22:56:54,781 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_111-b15
    INFO 22:56:54,783 HelpFormatter - Program Args: -nt 12 -T RealignerTargetCreator -R human_g1k_v37.fasta -I bwaAlign_MDup.bam -L exome_targetedregions_v1.2.bed -ip 100 -o targetIntervals_forRealign.list -known Mills_and_1000G_gold_standard.indels.b37_bgzipped.vcf.gz -known 1000G_phase1.indels.b37_bgzipped.vcf.gz
    INFO 22:56:54,786 HelpFormatter - Executing as [email protected] on Linux 3.10.0-229.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b15.
    INFO 22:56:54,786 HelpFormatter - Date/Time: 2017/04/11 22:56:54
    INFO 22:56:54,786 HelpFormatter - --------------------------------------------------------------------------------
    INFO 22:56:54,786 HelpFormatter - --------------------------------------------------------------------------------
    INFO 22:56:54,797 GenomeAnalysisEngine - Strictness is SILENT
    INFO 22:56:54,876 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 22:56:54,881 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:54,902 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
    INFO 22:56:55,539 IntervalUtils - Processing 85392035 bp from intervals
    WARN 22:56:55,551 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation
    WARN 22:56:55,551 IndexDictionaryUtils - Track known2 doesn't have a sequence dictionary built in, skipping dictionary validation
    INFO 22:56:55,557 MicroScheduler - Running the GATK in parallel mode with 12 total threads, 1 CPU thread(s) for each of 12 data thread(s), of 40 processors available on this machine
    INFO 22:56:55,605 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
    INFO 22:56:55,747 GenomeAnalysisEngine - Done preparing for traversal
    INFO 22:56:55,748 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 22:56:55,748 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 22:56:55,748 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 22:56:55,752 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,757 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:56:55,758 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,762 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:56:55,762 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,768 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
    INFO 22:56:55,768 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,772 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:56:55,772 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,775 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:56:55,776 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,779 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:56:55,780 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,783 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:56:55,784 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,789 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
    INFO 22:56:55,790 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,793 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:56:55,793 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,797 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:56:55,797 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 22:56:55,800 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
    INFO 22:57:25,764 ProgressMeter - 13:70681933 1.78714595E8 30.0 s 0.0 s 65.6% 45.0 s 15.0 s
    INFO 22:57:36,832 ProgressMeter - done 2.80942824E8 41.0 s 0.0 s 100.0% 41.0 s 0.0 s
    INFO 22:57:36,833 ProgressMeter - Total runtime 41.09 secs, 0.68 min, 0.01 hours
    INFO 22:57:36,834 MicroScheduler - 1164392 reads were filtered out during the traversal out of approximately 11696894 total reads (9.95%)
    INFO 22:57:36,834 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 22:57:36,834 MicroScheduler - -> 18377 reads (0.16% of total) failing BadMateFilter
    INFO 22:57:36,834 MicroScheduler - -> 267967 reads (2.29% of total) failing DuplicateReadFilter
    INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 22:57:36,835 MicroScheduler - -> 875795 reads (7.49% of total) failing MappingQualityZeroFilter
    INFO 22:57:36,835 MicroScheduler - -> 2253 reads (0.02% of total) failing NotPrimaryAlignmentFilter
    INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter
    INFO 22:57:36,835 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

    Done. ------------------------------------------------------------------------------------------

    Thanks,
    mglclinical

  • Sheila, Broad Institute (Member, Broadie admin)

    @mglclinical
    Hi,

    Those WARN statements are nothing to worry about. Your run finished without an error. The WARNING statements are telling you the VCF does not have sequence information in the header. In that case, the tool won't check the VCF header sequence lines against the FASTA .dict file. As long as you are sure you are using the correct reference and VCF, you are good to go.

    -Sheila
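
Going by the explanation above, it can be useful to see whether a VCF header carries any sequence information at all. The sketch below counts `##contig=` header lines, which is where VCF files conventionally declare their sequences. It assumes a POSIX shell with `gzip` and `awk`; `count_contig_lines` is a hypothetical helper name. The thread does not confirm that adding such lines makes the WARN disappear, so treat the count as a diagnostic only.

```shell
# count_contig_lines FILE
# Prints the number of "##contig=" lines in the header of a VCF, which may be
# plain text or gzip/bgzip compressed (chosen by file extension).
count_contig_lines() {
    case "$1" in
        *.gz|*.bgz) gzip -dc < "$1" ;;   # bgzipped files decompress with gzip
        *)          cat "$1" ;;
    esac | awk '
        /^##contig=/ { n++ }             # one declared sequence per line
        !/^#/        { exit }            # first record reached: header is over
        END          { print n + 0 }     # print 0 when nothing matched
    '
}
```

A result of 0 for a `-known` file is consistent with the "doesn't have a sequence dictionary built in" message as Sheila describes it.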

  • mglclinical, USA (Member)

    Thank you @Sheila for letting me know that I don't need to worry about those warnings

  • mglclinical, USA (Member)

    Hi @Sheila ,

    I have got other WARN statements for HaplotypeCaller. Can I just ignore the following as well ?

    WARN 07:19:17,107 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
    WARN 07:19:25,890 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller
    WARN 07:19:28,157 HaplotypeCallerGenotypingEngine - location 4:67783-67787: too many alternative alleles found (7) larger than the maximum requested with -maxAltAlleles (6), the following will be dropped: ATTTT.
    WARN 07:22:22,642 AnnotationUtils - Annotation will not be calculated, genotype is not called

    Thanks,
    mglclinical

  • Sheila, Broad Institute (Member, Broadie admin)

    @mglclinical
    Hi mglclinical,

    Yes, you can ignore those as well. They are either telling you annotations cannot be calculated (for reasons why, check out the annotation documentation) or that there are more than the default number of alternate alleles allowed at a site. You can change the default value with --max_alternate_alleles.

    -Sheila

  • mglclinical, USA (Member)

    Thank you so much @Sheila

  • vittoria

    Dear all,

    I have generated Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi with tabix.
    When running RealignerTargetCreator, should I list the .tbi file among the inputs?
    Is it correct to use -known Mills_and_1000G_gold_standard.indels.hg38.vcf, or should I use
    -known Mills_and_1000G_gold_standard.indels.hg38.vcf.gz?

    java-new -jar ../GenomeAnalysisTK.jar -T RealignerTargetCreator -R Homo_sapiens_assembly38.fasta -I unodedupsorted474_1_riccio-161221_GAATCTGA.bam -known Mills_and_1000G_gold_standard.indels.hg38.vcf -o unotarget_intervals.list &
    thank you

    vittoria

  • Sheila, Broad Institute (Member, Broadie admin)

    @vittoria
    Hi Vittoria,

    I am not sure I understand your question. Are you asking if you can use .gz files as input VCF files to GATK? You can indeed use .gz files in GATK, but you must make sure to specify the exact name of the VCF file in your command. If you unzipped the .gz file, then use .vcf. If you are leaving the file zipped, use .vcf.gz.

    Note, you can also find the hg38 resource files on the cloud here.

    -Sheila
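
The advice above can be turned into a small sketch that picks which file name to pass to `-known`. It assumes, in line with mglclinical's run earlier in the thread, that the `.tbi` file sits next to the `.vcf.gz` and is not itself named on the command line. `pick_known_vcf` is a hypothetical helper, not part of GATK.

```shell
# pick_known_vcf BASE
# Prints the file name to pass to -known for a resource whose name starts
# with BASE. Hypothetical helper for illustration only.
pick_known_vcf() {
    if [ -f "$1.vcf.gz" ] && [ -f "$1.vcf.gz.tbi" ]; then
        # Compressed VCF with its tabix index sitting next to it.
        printf '%s\n' "$1.vcf.gz"
    elif [ -f "$1.vcf" ]; then
        # Uncompressed VCF.
        printf '%s\n' "$1.vcf"
    else
        echo "no usable VCF found for $1" >&2
        return 1
    fi
}
```

For example, `-known "$(pick_known_vcf Mills_and_1000G_gold_standard.indels.hg38)"` passes the `.vcf.gz` name when the compressed file and its index are both present, and the plain `.vcf` name otherwise.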
