ValidateVariants Error on low coverage WGS variant calling
Dear GATK Team,
First of all, thanks for the great work; I love your software package.
I am currently experiencing problems with a low-coverage sequencing project. In a nutshell: we deep-sequenced 24 founder pigs (~20x coverage) and did low-coverage WGS (~0.8-1.2x) on 91 F1-generation pigs. I processed the founder pigs 100% according to your best practices and it turned out fine. I did the variant calling on the F1 samples as follows (settings were taken from GATK forum threads):
java -Xmx64G -Djava.io.tmpdir=/home/falker/temp/ \
    -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
    -T HaplotypeCaller \
    -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
    -nct 14 \
    -minPruning 1 \
    -minDanglingBranchLength 1 \
    --dbsnp /home/falker/genomes/Sscrofa11.1/dbSNP_Ss10.2_to_Ss11.1_SortVcf.vcf \
    -o F1_Pigs.raw.snps.indels.vcf \
    -I Sample_10_bwa_mem_markduplicates_recal_reads_merged.bam \
    -I ... \
    -I Sample_9_bwa_mem_markduplicates_recal_reads_merged.bam
... denotes the other 89 bam files I used as Input.
HaplotypeCaller finished without major error messages, but it emitted thousands of warnings like this:
WARN 09:10:15,857 HaplotypeCallerGenotypingEngine - location CM000812.5:12661-12667: too many alternative alleles found (10) larger than the maximum requested with -maxAltAlleles (6), the following will be dropped: AACACACACACACTCACACACACAC, A, AACACACACACACTCACACACAC, AAC.
So, the next downstream step after filtering will be imputation of the low-coverage pigs using the founder VCF file. For that purpose both datasets have to be merged into one VCF file (LB-Impute conventions). LB-Impute fails with this error message (I know this is not your jurisdiction and I don't expect support on this one):
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
    at imputation.RetrieveParentals.retrieveparentals(RetrieveParentals.java:23)
    at imputation.ImputeBySample.imputebysample(ImputeBySample.java:43)
    at imputation.Impute.impute(Impute.java:54)
    at imputation.ImputeMain.start(ImputeMain.java:97)
    at imputation.ImputeMain.main(ImputeMain.java:34)
The developers of LB-Impute say that this indicates a malformed VCF file. So I ran ValidateVariants on both datasets before merging. The founders are fine, but the low-coverage VCF file produces this error:
##### ERROR MESSAGE: File /media/4TB/F1_Schweine/F1_Pigs.raw.snps.indels.vcf fails strict validation: the Allele Number (AN) tag is incorrect for the record at position CM000812.5:469, 108 vs. 106
Eliminating this line from the file produces the same error three positions further down in the file.
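Rather than deleting offending records one at a time, it may be faster to scan the whole file and recompute AN directly from the genotypes. Here is a minimal sketch (plain Python, no VCF library; it assumes a standard multi-sample VCF with a GT entry in the FORMAT column, and only locates mismatching records rather than fixing them):

```python
import re
import sys

def recompute_an(record):
    """Recompute AN: count all called (non-missing) alleles across sample GTs."""
    cols = record.rstrip("\n").split("\t")
    gt_idx = cols[8].split(":").index("GT")
    an = 0
    for sample in cols[9:]:
        gt = sample.split(":")[gt_idx]
        an += sum(1 for allele in re.split(r"[/|]", gt) if allele != ".")
    return an

def info_an(record):
    """Extract the AN value written in the INFO column, or None if absent."""
    m = re.search(r"(?:^|;)AN=(\d+)", record.split("\t")[7])
    return int(m.group(1)) if m else None

if __name__ == "__main__":
    # Usage: python check_an.py F1_Pigs.raw.snps.indels.vcf
    with open(sys.argv[1]) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            written, actual = info_an(line), recompute_an(line)
            if written is not None and written != actual:
                cols = line.split("\t")
                print(f"{cols[0]}:{cols[1]}\tAN={written}\trecomputed={actual}")
```

This would at least show whether the mismatch is isolated or spread across the file, which could hint at whether a threading race corrupted only some shards of the output.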
My first guess was that running HaplotypeCaller on 14 threads might have caused this (I ran the founders on 4 threads). The run already took 6 days on a 3.2 GHz, 16-thread Intel Broadwell processor; running it on one core would take weeks, and even 4 threads gives a predicted runtime of 16 days.
Can you please help me figure out the source of the Allele Number (AN) error? If you also suspect the multi-threading, is there a way to parallelize this kind of joint variant calling differently, e.g. by splitting per chromosome?
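To illustrate the per-chromosome idea I have in mind: run one single-threaded HaplotypeCaller job per contig via -L, run the jobs in parallel, then concatenate the per-chromosome VCFs (e.g. with CatVariants). A hypothetical helper that just builds the command lines (memory, paths, and contig names below are placeholders, not a tested pipeline):

```python
# Sketch: one single-threaded HaplotypeCaller command per contig, to be run
# in parallel (GNU parallel, a cluster scheduler, ...). All paths and contig
# names are placeholders.

REF = "/home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna"
JAR = "/usr/local/bin/GenomeAnalysisTK-3.8-0.jar"

def hc_command(contig, bams, out_prefix="F1_Pigs"):
    """Build one HaplotypeCaller invocation restricted to `contig` via -L."""
    inputs = " ".join(f"-I {bam}" for bam in bams)
    return (f"java -Xmx8G -jar {JAR} -T HaplotypeCaller -R {REF} "
            f"-L {contig} -minPruning 1 -minDanglingBranchLength 1 "
            f"{inputs} -o {out_prefix}.{contig}.vcf")

if __name__ == "__main__":
    # Contig names would normally be read from the reference .fai index.
    contigs = ["CM000812.5", "CM000813.5"]
    bams = ["Sample_9_bwa_mem_markduplicates_recal_reads_merged.bam"]
    for contig in contigs:
        print(hc_command(contig, bams))
```

Would that be a sound alternative to -nct, or is there a recommended scatter-gather approach for GATK 3.8?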
Thanks in advance
using: GATK-3.8.0, Java: OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12