ValidateVariants Error on low coverage WGS variant calling
Dear GATK Team,
First of all, thanks for the great work; I love your software package.
I am currently experiencing some problems on a low-coverage sequencing project. In a nutshell: we deep-sequenced 24 founder pigs to ~20x coverage and did low-coverage WGS (~0.8-1.2x) on 91 F1-generation pigs. I processed the founder pigs strictly according to your Best Practices and everything turned out fine. I ran variant calling on the F1 samples as follows (settings were taken from GATK forum threads):
java -Xmx64G -Djava.io.tmpdir=/home/falker/temp/ \
    -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
    -T HaplotypeCaller \
    -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
    -nct 14 \
    -minPruning 1 \
    -minDanglingBranchLength 1 \
    --dbsnp /home/falker/genomes/Sscrofa11.1/dbSNP_Ss10.2_to_Ss11.1_SortVcf.vcf \
    -o F1_Pigs.raw.snps.indels.vcf \
    -I Sample_10_bwa_mem_markduplicates_recal_reads_merged.bam \
    -I ... \
    -I Sample_9_bwa_mem_markduplicates_recal_reads_merged.bam
... denotes the other 89 BAM files I used as input.
HaplotypeCaller finished without major error messages, but it emitted thousands of warnings like this one:
WARN 09:10:15,857 HaplotypeCallerGenotypingEngine - location CM000812.5:12661-12667: too many alternative alleles found (10) larger than the maximum requested with -maxAltAlleles (6), the following will be dropped: AACACACACACACTCACACACACAC, A, AACACACACACACTCACACACAC, AAC.
The next downstream step after filtering will be imputation of the low-coverage pigs using the founder VCF. For that purpose, both datasets have to be merged into a single VCF file (as LB-Impute requires). LB-Impute fails with this error message (I know this is not your jurisdiction and I don't expect support on this one):
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
    at imputation.RetrieveParentals.retrieveparentals(RetrieveParentals.java:23)
    at imputation.ImputeBySample.imputebysample(ImputeBySample.java:43)
    at imputation.Impute.impute(Impute.java:54)
    at imputation.ImputeMain.start(ImputeMain.java:97)
    at imputation.ImputeMain.main(ImputeMain.java:34)
The developers of LB-Impute say this indicates a malformed VCF file, so I ran ValidateVariants on both datasets before merging. The founders are fine, but the low-coverage VCF produces this error:
##### ERROR MESSAGE: File /media/4TB/F1_Schweine/F1_Pigs.raw.snps.indels.vcf fails strict validation: the Allele Number (AN) tag is incorrect for the record at position CM000812.5:469, 108 vs. 106
Eliminating the offending line from the file only shifts the same error to a record three positions further down, so it is not an isolated record.
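To make sure I understand what ValidateVariants is checking: as far as I know, AN should equal the total number of called (non-missing) alleles across all GT fields of a record. A quick recount on a made-up three-sample record (not taken from my file; positions and genotypes are invented for illustration) reproduces the kind of mismatch I'm seeing:

```shell
# Recompute AN for a single, made-up VCF record from its GT fields.
# AN = number of called (non-missing) alleles across all samples.
# Here the INFO field claims AN=6, but only 4 alleles are actually called.
printf 'CM000812.5\t469\t.\tA\tG\t100\t.\tAN=6\tGT\t0/1\t./.\t1/1\n' |
awk -F'\t' '{
  an = 0
  for (i = 10; i <= NF; i++) {          # sample columns start at field 10
    n = split($i, alleles, "[/|]")      # split diploid GT on / or |
    for (j = 1; j <= n; j++)
      if (alleles[j] != ".") an++       # count only called alleles
  }
  print "computed AN:", an
}'
# prints: computed AN: 4
```

In my file, a mismatch of 108 vs. 106 would correspond to one diploid genotype's two alleles going missing from the count.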
My first guess was that running HaplotypeCaller on 14 threads (-nct 14) might have caused this (I ran the founders on 4 threads). The 14-thread run already took 6 days on a 3.2 GHz, 16-thread Intel Broadwell processor; running on one core would take weeks, and even 4 threads gives a predicted runtime of 16 days.
Can you please help me figure out the source of the Allele Number error? If you also suspect the multi-threading, is there another way to parallelize this kind of joint variant calling, maybe by splitting per chromosome?
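For the split-by-chromosome idea, what I had in mind is roughly the following sketch (the chromosome names are placeholders, the 91 -I arguments are omitted for brevity, and the leading echo only prints the commands instead of launching them):

```shell
#!/bin/sh
# Sketch: one single-threaded HaplotypeCaller job per chromosome via -L,
# run in parallel, instead of one 14-thread job over the whole genome.
REF=/home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna
GATK=/usr/local/bin/GenomeAnalysisTK-3.8-0.jar

for CHR in CM000812.5 CM000813.5 CM000814.5; do   # placeholder contig names
  # Remove 'echo' to actually launch the jobs; add the -I arguments as above.
  echo java -Xmx8G -jar "$GATK" \
    -T HaplotypeCaller \
    -R "$REF" \
    -L "$CHR" \
    -minPruning 1 -minDanglingBranchLength 1 \
    -o "F1_Pigs.$CHR.raw.vcf" &
done
wait   # wait for all per-chromosome jobs to finish
# The per-chromosome VCFs would then be concatenated (e.g. with CatVariants).
```

Would this per-chromosome approach be expected to produce the same calls as the single genome-wide run, and would it sidestep any -nct-related issues?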
Thanks in advance
Versions: GATK 3.8-0; Java: OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12