Why does SelectVariants discard variants when the input vcf is unsorted?

I am using SelectVariants to subset based on samples, while trying to keep all variants. However, SelectVariants seems to discard variants when the input file is not sorted and an index file is present. When the index file is not present, SelectVariants throws an error like this:

ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature ....

This suggests that SelectVariants requires a sorted VCF. With an index file present, however, no error is thrown, and some of the variants are silently dropped from the output file. If I sort the input VCF, all of the variants appear in the output from SelectVariants.
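For anyone who wants to check a file before running SelectVariants, here is a minimal Python sketch of the two ordering conditions described above: each chromosome must form one contiguous block, and positions must not decrease within a block. The function name `find_order_problems` is hypothetical, the input is assumed to be uncompressed, tab-delimited VCF text, and this is not how GATK itself validates its input.

```python
def find_order_problems(lines):
    """Return (line_number, message) tuples for ordering problems in VCF lines.

    Hypothetical helper: it only looks at CHROM and POS and ignores
    header lines (those starting with '#') and blank lines.
    """
    problems = []
    finished = set()  # chromosomes whose block has already ended
    current = None    # chromosome of the block being read
    last_pos = None   # position of the previous record

    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])

        if chrom != current:
            # A chromosome seen earlier is showing up again: not contiguous.
            if chrom in finished:
                problems.append((number, f"{chrom} reappears after other chromosomes"))
            if current is not None:
                finished.add(current)
            current = chrom
        elif pos < last_pos:
            # Same chromosome, but the position went backwards.
            problems.append((number, f"{chrom}:{pos} comes after {chrom}:{last_pos}"))
        last_pos = pos

    return problems
```

Calling it as `find_order_problems(open("concatGATK.vcf"))` returns an empty list for a file that is contiguous and position-sorted. It does not check that the chromosome order matches the reference dictionary.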

So I have found a solution to my problem, but it took me a while and this looks like a bug to me. It could alternatively be a problem with CatVariants and the index file created.
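The sorting workaround can be sketched in Python as below. This is an illustration under assumptions, not a GATK tool: the function name `sort_vcf_lines` is hypothetical, the whole file is assumed to fit in memory, and chromosome order is taken from the `##contig` header lines when they exist, falling back to order of first appearance. If the header has no `##contig` lines, the resulting order may not match the reference dictionary.

```python
import re


def sort_vcf_lines(lines):
    """Return VCF lines with the header unchanged and records sorted.

    Hypothetical helper: records are ordered by chromosome, then by
    position. Chromosome order comes from ##contig header lines if
    present (an assumption about the input), otherwise first appearance.
    """
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]

    # Build the chromosome ranking: header contigs first, then anything else.
    order = {}
    for line in header:
        match = re.match(r"##contig=<ID=([^,>]+)", line)
        if match:
            order.setdefault(match.group(1), len(order))
    for line in records:
        order.setdefault(line.split("\t", 1)[0], len(order))

    def sort_key(line):
        chrom, pos = line.split("\t", 2)[:2]
        return (order[chrom], int(pos))

    # sorted() is stable, so records at the same site keep their relative order.
    return header + sorted(records, key=sort_key)
```

The sorted lines should be written to a new file rather than over the original, so that an index built for the unsorted file is not paired with the sorted one.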

To reproduce:

java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
-R hg19.fasta \
-V dummy1.vcf \
-V dummy2.vcf \
-out concatGATK.vcf

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fasta \
-V concatGATK.vcf \
-o subsetGATK.vcf \
-sn sample1

The output file, subsetGATK.vcf, may contain far fewer variants than the input file.
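To see exactly which sites were dropped, the input and output can be compared record by record. The sketch below is one possible way to do that in Python. The function names `record_keys` and `missing_records` are hypothetical, and the comparison assumes that subsetting samples leaves CHROM, POS and REF unchanged, which may not hold if alleles are trimmed.

```python
def record_keys(lines):
    """Return a (CHROM, POS, REF) tuple for every record in VCF lines."""
    keys = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ...
        keys.append((fields[0], int(fields[1]), fields[3]))
    return keys


def missing_records(input_lines, output_lines):
    """Return keys present in the input but absent from the output.

    Hypothetical helper: keys are reported in input order.
    """
    kept = set(record_keys(output_lines))
    return [key for key in record_keys(input_lines) if key not in kept]
```

For the files in this post, the call would be `missing_records(open("concatGATK.vcf"), open("subsetGATK.vcf"))`. An empty list means every input site is still present in the subset.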

I was using GenomeAnalysisTK-3.8 with Java version "1.8.0_31".
I followed the GATK Best Practices "Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence", hard-filtered SNPs and indels separately, concatenated the two files, and then subset to cases and controls with SelectVariants. The data were amplicon DNA-seq from Illumina fastq files, but I also reproduced the problem with VCFs from Ion Torrent's software.



  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    Hi vegard,

    Can you post an example record that is present in the original VCF but not present after SelectVariants? Can you also check whether this occurs in the latest GATK4 beta?


  • vegard (Member)


    Here is one example row that is not included in the output from SelectVariants:

    chr2 47702451 . G T . . AC=1;AF=0.500;AN=2;DP=126 GT:AF:AO:DP:FAO:FDP:FRO:FSAF:FSAR:FSRF:FSRR:FWDB:FXX:GQ:HRUN:LEN:MLLD:OALT:OMAPALT:OPOS:OREF:QD:QUAL:RBI:REFB:REVB:RO:SAF:SAR:SRF:SRR:SSEN:SSEP:SSSB:STB:STBP:VARB 0/1:0.0692308:7:126:9:130:119:7:2:98:21:0.00714747:0:27:1:1:52.28:T:T:47702451:G:1.80268:58.5871:0.0237369:-0.00677467:-0.0226352:107:0:7:98:9:0:0:-0.806182:0.565904:0.816:0.00948168

    That line is not from a VCF produced by a GATK tool (apart from CatVariants). I hope that's OK.

    When I tested GATK4-beta the problem was not there. All lines were included. I tested both with a vcf from the GATK workflow and from elsewhere.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    Hi vegard,

    Interesting. I am not sure why GATK3 did not include the records, but if it is not a problem in GATK4, please use that. Development has pretty much halted on GATK3.

