Incomplete output in ApplyRecalibration without generating an ERROR message
Dear GATK team,
I've been facing a problem involving incomplete output VCF files when running VariantRecalibrator followed by ApplyRecalibration. Neither command produced an ERROR message, though. I only caught the problem when I ran SelectVariants downstream in my pipeline, which failed with one of two kinds of ERROR message: "track variant is out of coordinate order..." or "The provided VCF file is malformed at approximately line number...". (More details below.)
My pipeline consists of three commands:
1) VariantRecalibrator. Doesn't report any ERROR and the recal file seems OK.
java -Djava.io.tmpdir=$mytmp -Xmx232g -jar GenomeAnalysisTK.jar -R $ref -T VariantRecalibrator -input 1245g.vcf -recalFile 1245g.$SNPs.recal -tranchesFile 1245g.$SNPs.tranches -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an QD -an InbreedingCoeff -tranche 100.0 -tranche 99.5 -tranche 99.0 -tranche 90.0 -mode SNP -resource:$SNPs,known=true,training=true,truth=true,prior=15 $SNPs.vcf -rscriptFile 1245g.$SNPs.recalibrate_SNP.R -nt 40
2) ApplyRecalibration. Doesn't report any ERROR, but the vcf file is smaller than expected.
java -Djava.io.tmpdir=$mytmp -Xmx232g -jar GenomeAnalysisTK.jar -R $ref -T ApplyRecalibration -input 1245g.vcf --ts_filter_level 99 -tranchesFile 1245g.$SNPs.tranches -recalFile 1245g.$SNPs.recal -mode SNP -o 1245g.$SNPs.vcf -nt 40
3) SelectVariants. Aborts after producing an ERROR.
java -Djava.io.tmpdir=$mytmp -Xmx232g -jar GenomeAnalysisTK.jar -R $ref -T SelectVariants -V 1245g.$SNPs.vcf -selectType SNP -restrictAllelesTo BIALLELIC -o 1245g.$SNPs.BIALLELIC.vcf -nt 40
Examples of ERROR messages from SelectVariants:
##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 122: 0/0:7,0:7:15:0,15,225 is not a valid start position in the VCF format, for input source: 1245g.$SNPs.vcf
##### ERROR MESSAGE: LocationAwareSeekableRODIterator: track variant is out of coordinate order on contig Chr2:14660901 compared to Chr2:14661030
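Both messages point at specific records, so it can help to scan the ApplyRecalibration output directly and list every suspicious line instead of stopping at the first one. The following is a minimal sketch, not part of GATK: the function name `scan_vcf` is hypothetical, and it assumes an uncompressed, tab-delimited VCF whose `#CHROM` header line defines the expected column count.

```python
def scan_vcf(lines):
    """Return (line_number, message) tuples for suspicious VCF lines.

    Hypothetical helper, not a GATK tool. `lines` is any iterable of
    text lines, for example an open file handle.
    """
    problems = []
    expected_cols = None          # set from the #CHROM header line
    last_contig, last_pos = None, 0
    for n, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("##"):
            continue              # meta-information line
        if line.startswith("#CHROM"):
            expected_cols = len(line.split("\t"))
            continue
        fields = line.split("\t")
        # A record cut short or fused with another has the wrong column count.
        if expected_cols is not None and len(fields) != expected_cols:
            problems.append(
                (n, "expected %d columns, found %d" % (expected_cols, len(fields)))
            )
            continue
        if len(fields) < 2 or not fields[1].isdigit():
            problems.append((n, "POS is not an integer"))
            continue
        contig, pos = fields[0], int(fields[1])
        # Positions must not decrease within a contig.
        if contig == last_contig and pos < last_pos:
            problems.append(
                (n, "out of coordinate order: %s:%d after %s:%d"
                    % (contig, pos, contig, last_pos))
            )
        last_contig, last_pos = contig, pos
    return problems
```

Passing an open handle on the output VCF and printing the returned tuples gives the line numbers to inspect. The check only covers column count, an integer POS, and ordering within a contig; it is not a full VCF validator.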
A few important notes:
- GATK version 3.4-45.
- $SNPs refers to the slightly different sets of training SNPs that I'm testing. With some sets the pipeline sometimes runs all the way through without visible errors, and sometimes it doesn't.
- The two main variants of ERROR messages can occur even for the same $SNPs training set, in independent runs.
- I'm dealing with ~12M variants from 1245 individuals. It is worth mentioning, however, that this same pipeline gave me no trouble with ~6M variants from a subset of 242 individuals.
- Attached is a file with the last two lines of 1245g.$SNPs.vcf from a failed run: you'll notice that the beginning of a new line gets written inside an unfinished one.
- Before posting this message I even tried a run without multi-threading and the problem persisted.
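Since ApplyRecalibration exits without an ERROR, a quick check on the end of its output before launching SelectVariants might catch a truncated file earlier. This is a minimal sketch under the assumption that the VCF is uncompressed and that a complete file ends with a newline; `ends_with_newline` is a hypothetical helper name, not a GATK feature.

```python
import os


def ends_with_newline(path):
    """Return True if the file is non-empty and its final byte is a newline.

    Hypothetical helper: a file whose last record was cut off mid-line
    will usually fail this check.
    """
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            return False          # an empty output is also suspicious
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"
```

A `False` result suggests the last record is incomplete. A `True` result does not prove the file is intact, because a broken record in the middle of the file would go unnoticed.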