Why does SelectVariants discard variants when the input VCF is unsorted?

I am using SelectVariants to subset based on samples, while trying to keep all variants. However, SelectVariants seems to discard variants when the input file is not sorted and an index file is present. When the index file is not present, SelectVariants throws an error like this:

ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature ....

This suggests that SelectVariants requires a sorted VCF. However, when an index file is present, no error is thrown, and some of the variants are silently discarded from the result file. If I sort the input VCF, all the variants appear in the output from SelectVariants.
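
The ordering requirement hinted at by the error message can be checked before running SelectVariants. The following is a minimal sketch, not part of GATK: `find_order_problems` is a hypothetical helper that assumes a plain-text, tab-separated VCF and reports records whose chromosome reappears after its block has ended, or whose position goes backwards within a chromosome.

```python
from typing import Iterable, List, Tuple


def find_order_problems(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for each record that breaks ordering.

    Hypothetical helper for illustration; it does not reproduce GATK's own checks.
    """
    problems = []
    finished = set()  # chromosomes whose block of records has already ended
    current = None    # chromosome of the previous record
    last_pos = -1     # position of the previous record
    for number, line in enumerate(lines, start=1):
        # Skip blank lines and header lines ("##..." and "#CHROM").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t", 2)
        chrom, pos = fields[0], int(fields[1])
        if chrom != current:
            if chrom in finished:
                problems.append((number, f"{chrom} reappears after its block ended"))
            if current is not None:
                finished.add(current)
            current = chrom
        elif pos < last_pos:
            problems.append((number, f"{chrom}:{pos} follows {chrom}:{last_pos}"))
        last_pos = pos
    return problems
```

Calling it as `find_order_problems(open("concatGATK.vcf"))` would return an empty list for a file that is contiguous and position-sorted. Note that it does not check the chromosome order against the reference dictionary.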

So I have found a solution to my problem, but it took me a while, and this looks like a bug to me. Alternatively, it could be a problem with CatVariants and the index file it creates.

To reproduce:

java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
-R hg19.fasta \
-V dummy1.vcf \
-V dummy2.vcf \
-out concatGATK.vcf

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fasta \
-V concatGATK.vcf \
-o subsetGATK.vcf \
-sn sample1

The output file, subsetGATK.vcf, may contain far fewer variants than the input file.
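
To see exactly which records were lost, the input and output files can be compared site by site. This is a minimal sketch under stated assumptions: `site_keys` and `missing_sites` are hypothetical helpers, both files are plain-text tab-separated VCFs, and a site is identified by CHROM, POS and REF only (ALT is left out on the assumption that a subsetting tool might rewrite alleles).

```python
from typing import Iterable, List, Set, Tuple

SiteKey = Tuple[str, str, str]  # (CHROM, POS, REF)


def site_keys(lines: Iterable[str]) -> Set[SiteKey]:
    """Collect the (CHROM, POS, REF) key of every data line."""
    keys = set()
    for line in lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[0], fields[1], fields[3]))
    return keys


def missing_sites(input_lines: Iterable[str],
                  output_lines: Iterable[str]) -> List[SiteKey]:
    """List sites present in the input but absent from the output, in input order."""
    kept = site_keys(output_lines)
    missing = []
    reported = set()  # avoid listing a duplicated input site twice
    for line in input_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3])
        if key not in kept and key not in reported:
            missing.append(key)
            reported.add(key)
    return missing
```

For example, `missing_sites(open("concatGATK.vcf"), open("subsetGATK.vcf"))` would list the sites that did not make it into the subset file.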

I was using GenomeAnalysisTK-3.8 with Java version "1.8.0_31".
I followed the GATK best practices "Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence", then hard-filtered SNPs and indels separately, concatenated the two files, and finally subset to case and control with SelectVariants. I was working with amplicon DNA-seq data from Illumina fastq files, but I also reproduced the problem with VCFs from Ion Torrent's software.

vegard

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @vegard
    Hi vegard,

    Can you post an example record that is present in the original VCF that is not present after SelectVariants? Can you also check if this occurs in GATK4 latest beta?

    Thanks,
    Sheila

  • vegard (Member)

    Hi,

    Here is one example record that is not included in the output from SelectVariants:

    chr2 47702451 . G T . . AC=1;AF=0.500;AN=2;DP=126 GT:AF:AO:DP:FAO:FDP:FRO:FSAF:FSAR:FSRF:FSRR:FWDB:FXX:GQ:HRUN:LEN:MLLD:OALT:OMAPALT:OPOS:OREF:QD:QUAL:RBI:REFB:REVB:RO:SAF:SAR:SRF:SRR:SSEN:SSEP:SSSB:STB:STBP:VARB 0/1:0.0692308:7:126:9:130:119:7:2:98:21:0.00714747:0:27:1:1:52.28:T:T:47702451:G:1.80268:58.5871:0.0237369:-0.00677467:-0.0226352:107:0:7:98:9:0:0:-0.806182:0.565904:0.816:0.00948168

    That line is not from a VCF produced by a GATK tool (apart from CatVariants). I hope that's OK.

    When I tested GATK4-beta, the problem did not occur: all lines were included. I tested both with a VCF from the GATK workflow and with one from elsewhere.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @vegard
    Hi vegard,

    Interesting. I am not sure why GATK3 did not include the records, but if it is not a problem in GATK4, please use that. Development has pretty much halted on GATK3.

    -Sheila
