Why does SelectVariants discard variants when the input VCF is unsorted?
I am using SelectVariants to subset based on samples, while trying to keep all variants. However, SelectVariants seems to discard variants when the input file is not sorted and an index file is present. When the index file is not present, SelectVariants throws an error like this:
ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature ....
This may indicate that SelectVariants requires a sorted VCF. However, with an index file present, no error is thrown, and some of the variants are silently dropped from the output file. If I sort the input VCF, all of the variants appear in the SelectVariants output.
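To check whether a VCF would trip this requirement before running SelectVariants, something like the following can be used. It is only a minimal sketch and not GATK's own check: `find_order_problems` is a name I made up, it assumes a plain-text (uncompressed) tab-separated VCF, and it only tests that each chromosome forms one contiguous block with non-decreasing positions. It does not compare the chromosome order against the reference dictionary.

```python
def find_order_problems(lines):
    """Return (line_number, message) tuples for ordering violations.

    `lines` is any iterable of VCF text lines, for example an open file.
    Hypothetical helper for illustration; not part of GATK.
    """
    problems = []
    finished = set()   # chromosomes whose block has already ended
    current = None     # chromosome of the previous record
    last_pos = 0
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue   # skip header and blank lines
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        if chrom != current:
            if chrom in finished:
                # the chromosome was seen earlier, so it is not contiguous
                problems.append(
                    (number, f"{chrom} reappears after other chromosomes"))
            if current is not None:
                finished.add(current)
            current = chrom
        elif pos < last_pos:
            problems.append(
                (number, f"{chrom}:{pos} comes after {chrom}:{last_pos}"))
        last_pos = pos
    return problems
```

Usage would be along the lines of `with open("concatGATK.vcf") as fh: print(find_order_problems(fh))`; an empty list means no ordering problem was found by this check.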
So I have found a solution to my problem, but it took me a while and this looks like a bug to me. It could alternatively be a problem with CatVariants and the index file created.
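For reference, the sorting workaround can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool I would recommend for large files: `sort_vcf_lines` is a hypothetical helper, it loads the whole uncompressed VCF into memory, and it takes the chromosome order from the `##contig` header lines (falling back to order of first appearance). As far as I understand, GATK expects that order to agree with the reference dictionary, which this sketch does not verify.

```python
def sort_vcf_lines(lines):
    """Return header lines followed by records sorted by contig and position.

    Hypothetical helper for illustration. All lines starting with '#'
    are kept in their original order and placed first.
    """
    header, records = [], []
    contig_rank = {}   # contig name -> sort rank
    prefix = "##contig=<ID="
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            if line.startswith(prefix):
                # e.g. ##contig=<ID=chr1,length=249250621>
                name = line[len(prefix):].split(",", 1)[0].rstrip(">\r\n")
                contig_rank.setdefault(name, len(contig_rank))
        elif line.strip():
            records.append(line)
    # contigs missing from the header are ranked by first appearance
    for line in records:
        contig_rank.setdefault(line.split("\t", 1)[0], len(contig_rank))

    def sort_key(line):
        chrom, pos = line.split("\t", 2)[:2]
        return contig_rank[chrom], int(pos)

    return header + sorted(records, key=sort_key)
```

Any existing index file would need to be regenerated after sorting, since it describes the old record layout.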
java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
-R hg19.fasta \
-V dummy1.vcf \
-V dummy2.vcf \
-out concatGATK.vcf
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fasta \
-V concatGATK.vcf \
-o subsetGATK.vcf \
The output file, subsetGATK.vcf, may contain far fewer variants than the input file.
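Because nothing is reported when records are dropped, the loss only shows up when the two files are compared. A minimal sketch of such a comparison is below; `count_records_per_chrom` and `missing_records` are names I made up for illustration, and the code assumes plain-text (uncompressed) VCFs. It only shows that records are missing, not why.

```python
from collections import Counter


def count_records_per_chrom(lines):
    """Count non-header VCF records per chromosome (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue   # skip header and blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def missing_records(input_lines, output_lines):
    """Return {chromosome: number of records lost} between input and output."""
    before = count_records_per_chrom(input_lines)
    after = count_records_per_chrom(output_lines)
    # a Counter returns 0 for chromosomes that are absent from the output
    return {chrom: before[chrom] - after[chrom]
            for chrom in before if before[chrom] > after[chrom]}
```

With the files from the commands above, this would be called as `missing_records(open("concatGATK.vcf"), open("subsetGATK.vcf"))`; an empty dict means every chromosome has at least as many records in the output as in the input.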
I was using GenomeAnalysisTK-3.8 with Java version "1.8.0_31".
I followed the GATK Best Practices workflow "Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence", followed by hard filtering of SNPs and indels separately. I then concatenated the two files and subset them to case and control with SelectVariants. The data were amplicon DNA-seq from Illumina FASTQ files, but I also reproduced the problem with VCFs from Ion Torrent's software.