Why does SelectVariants discard variants when the input vcf is unsorted?
I am using SelectVariants to subset based on samples, while trying to keep all variants. However, SelectVariants seems to discard variants when the input file is not sorted and an index file is present. When the index file is not present, SelectVariants throws an error like this:
ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature ....
This suggests that SelectVariants requires a sorted VCF. However, when an index file is present, no error is thrown, and some of the variants are silently dropped from the output file. If I sort the input VCF first, all of the variants appear in the SelectVariants output.
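A quick way to check whether a VCF would trip the "contiguous chromosomes" requirement is a small awk scan. This is only a sketch: `check_vcf_sorted` is a made-up helper name, not a GATK tool, and it assumes a plain-text, tab-delimited VCF. It flags a contig that appears in more than one block and positions that go backwards within a contig; it does not check contig order against the reference.

```shell
# Hypothetical helper (not part of GATK): exits non-zero if a contig shows up
# in more than one block, or if positions go backwards within a contig.
check_vcf_sorted() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        /^#/ { next }                          # skip header lines
        $1 != prev {                           # a new block of records starts
            if ($1 in seen) {
                printf "line %d: contig %s is not contiguous\n", NR, $1
                bad = 1
            }
            seen[$1] = 1; prev = $1; last = 0
        }
        $2 + 0 < last {                        # position went backwards
            printf "line %d: position %d comes after %d on %s\n", NR, $2, last, $1
            bad = 1
        }
        { last = $2 + 0 }
        END { exit bad }
    ' "$1"
}
```

Running `check_vcf_sorted concatGATK.vcf` before SelectVariants should print one line per offending record and return a non-zero exit status if the file is out of order.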
So I have found a workaround for my problem, but it took me a while, and this looks like a bug to me. Alternatively, the problem could lie with CatVariants and the index file it creates.
java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
-R hg19.fasta \
-V dummy1.vcf \
-V dummy2.vcf \
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fasta \
-V concatGATK.vcf \
-o subsetGATK.vcf \
The output file, subsetGATK.vcf, may contain far fewer variants than the input file.
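To see how many records went missing, the non-header lines of the two files can be counted and compared. A minimal sketch, assuming uncompressed VCFs; `count_records` is just an illustrative name:

```shell
# Hypothetical helper: print the number of variant records (non-header lines).
count_records() {
    # grep -c prints the count but exits 1 when the count is 0, so mask that.
    grep -vc '^#' "$1" || true
}
```

Comparing `count_records concatGATK.vcf` with `count_records subsetGATK.vcf` should show the same number when SelectVariants is only subsetting samples and keeping all sites.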
I was using GenomeAnalysisTK-3.8 with Java version "1.8.0_31".
I followed the GATK Best Practices "Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence", then hard-filtered SNPs and indels separately, concatenated the two files, and finally subset to cases and controls with SelectVariants. I was working with amplicon DNA-seq data from Illumina FASTQ files, but I also reproduced the problem with VCFs from Ion Torrent's software.
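For reference, the sorting workaround can be sketched in plain shell. This assumes an uncompressed VCF and a FASTA index next to the reference (for example hg19.fasta.fai), whose first column lists the contigs in reference order. `sort_vcf_by_fai` is a made-up name, and a dedicated tool such as Picard SortVcf is probably the more robust choice.

```shell
# Hypothetical helper: write the VCF to stdout with records ordered by the
# contig order in a .fai file, then by position.
sort_vcf_by_fai() {
    vcf=$1
    fai=$2
    tab=$(printf '\t')
    grep '^#' "$vcf"                             # header lines pass through unchanged
    grep -v '^#' "$vcf" |
        awk -F'\t' -v OFS='\t' '
            NR == FNR { rank[$1] = FNR; next }   # first file: the .fai index
            { print rank[$1], $0 }               # records: prefix the contig rank
        ' "$fai" - |
        sort -t "$tab" -k1,1n -k3,3n |           # by rank, then by POS
        cut -f2-                                 # drop the rank column again
}
```

Something like `sort_vcf_by_fai concatGATK.vcf hg19.fasta.fai > concatGATK.sorted.vcf` would then give a file to pass to SelectVariants. Records on contigs missing from the index get an empty rank and sort first, so they are worth checking for separately.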