ValidateVariants Error on low coverage WGS variant calling

falker · Germany · Member

Dear GATK Team,

First of all, thanks for the great work. I love your software package.

I am currently experiencing some problems on a low-coverage sequencing project. In a nutshell: we deep-sequenced 24 founder pigs (~20x coverage) and did low-coverage WGS (~0.8-1.2x) on 91 F1-generation pigs. I processed the founder pigs strictly according to your Best Practices and everything turned out fine. I did the variant calling on the F1 samples as follows (settings were taken from GATK forum threads):

java -Xmx64G -Djava.io.tmpdir=/home/falker/temp/ -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
    -T HaplotypeCaller -nct 14 \
    -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
    -minPruning 1 -minDanglingBranchLength 1 \
    --dbsnp /home/falker/genomes/Sscrofa11.1/dbSNP_Ss10.2_to_Ss11.1_SortVcf.vcf \
    -o F1_Pigs.raw.snps.indels.vcf \
    -I Sample_10_bwa_mem_markduplicates_recal_reads_merged.bam -I ... -I Sample_9_bwa_mem_markduplicates_recal_reads_merged.bam

... denotes the other 89 BAM files I used as input.
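(For the record, a long -I list like that can be assembled with a small shell loop instead of typing it out; the wildcard pattern below is illustrative and assumes file names without spaces, not necessarily my exact naming:)

# build one ' -I <bam>' argument per F1 BAM file (illustrative pattern)
INPUT_ARGS=$(for bam in Sample_*_bwa_mem_markduplicates_recal_reads_merged.bam; do printf ' -I %s' "$bam"; done)
# then append $INPUT_ARGS to the HaplotypeCaller command shown above instead of the explicit -I list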

HaplotypeCaller finished without major error messages, but it gave me thousands of warnings like this:

WARN  09:10:15,857 HaplotypeCallerGenotypingEngine - location CM000812.5:12661-12667: too many alternative alleles found (10) larger than the maximum requested with -maxAltAlleles (6), the following will be dropped: AACACACACACACTCACACACACAC, A, AACACACACACACTCACACACAC, AAC. 
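(I left -maxAltAlleles at its default of 6; if those dropped alleles turned out to matter, I assume the cap could simply be raised by adding something like the following to the HaplotypeCaller command line above:)

    -maxAltAlleles 10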

The next downstream application after filtering will be imputation of the low-coverage pigs using the founder VCF file. For that purpose both datasets have to be merged into one VCF file (LB-Impute conventions). LB-Impute fails with this error message (I know this is not your jurisdiction and I don't expect support on this one):

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
        at imputation.RetrieveParentals.retrieveparentals(RetrieveParentals.java:23)
        at imputation.ImputeBySample.imputebysample(ImputeBySample.java:43)
        at imputation.Impute.impute(Impute.java:54)
        at imputation.ImputeMain.start(ImputeMain.java:97)
        at imputation.ImputeMain.main(ImputeMain.java:34)
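(For context, the merge of the founder and F1 call sets can be done with something like GATK's CombineVariants; the founder file name and output name below are placeholders, and this is a sketch rather than necessarily the exact command I used:)

java -Xmx16G -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
    -T CombineVariants \
    -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
    --variant Founders.raw.snps.indels.vcf \
    --variant F1_Pigs.raw.snps.indels.vcf \
    -o Founders_plus_F1.merged.vcf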

The developers of LB-Impute say that this indicates a malformed VCF file, so I ran ValidateVariants on both datasets before merging. The founders are fine, but the low-coverage VCF file produces this error:

##### ERROR MESSAGE: File /media/4TB/F1_Schweine/F1_Pigs.raw.snps.indels.vcf fails strict validation: the Allele Number (AN) tag is incorrect for the record at position CM000812.5:469, 108 vs. 106

Eliminating this record from the file produces the same error three positions further down.
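(The validation itself was a plain ValidateVariants run; the call below is reconstructed, so treat the argument details as approximate:)

java -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
    -T ValidateVariants \
    -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
    -V /media/4TB/F1_Schweine/F1_Pigs.raw.snps.indels.vcf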

My first guess was that running HaplotypeCaller on 14 threads might have caused this (I ran the founders on 4 threads). This already took 6 days on a 3.2 GHz, 16-thread Intel Broadwell processor. Running it on one core would take weeks, and even 4 threads gives me a predicted runtime of 16 days.

Can you please help me figure out the source of the Allele Number error? If you also suspect the multi-threading, is there another way to parallelize this kind of joint variant calling? Maybe split by chromosome, along the lines of the sketch below?
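Something like this per-chromosome scatter is what I have in mind; the contig list is a placeholder, the -I list is elided as in the command above, and CatVariants is just one way to stitch the per-chromosome VCFs back together, so please treat this as a sketch rather than a tested pipeline:

# scatter: one single-threaded HaplotypeCaller job per chromosome via -L
for CHR in CM000812.5 CM000813.5; do    # placeholder: extend to all chromosomes
    java -Xmx16G -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
        -T HaplotypeCaller -nct 1 -L $CHR \
        -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
        -minPruning 1 -minDanglingBranchLength 1 \
        --dbsnp /home/falker/genomes/Sscrofa11.1/dbSNP_Ss10.2_to_Ss11.1_SortVcf.vcf \
        -o F1_Pigs.$CHR.raw.snps.indels.vcf \
        -I Sample_10_bwa_mem_markduplicates_recal_reads_merged.bam -I ... &
done
wait
# gather: concatenate the per-chromosome VCFs in reference order, e.g. with CatVariants
java -cp /usr/local/bin/GenomeAnalysisTK-3.8-0.jar org.broadinstitute.gatk.tools.CatVariants \
    -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
    -V F1_Pigs.CM000812.5.raw.snps.indels.vcf \
    -V F1_Pigs.CM000813.5.raw.snps.indels.vcf \
    -out F1_Pigs.scatter_gather.raw.snps.indels.vcf -assumeSorted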

Thanks in advance
Clemens

using: GATK-3.8.0, Java: OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12

Answers

  • falker · Germany · Member

    UPDATE

    I tried HaplotypeCaller using -nct 1 on each chromosome separately with the same settings as above. I get similar errors from ValidateVariants:

    ##### ERROR MESSAGE: File /media/4TB/F1_Schweine/mapping/F1_Pigs_haplotypecaller_nct1_CM000813.5.raw.snps.indels.vcf fails strict validation: the Allele Number (AN) tag is incorrect for the record at position CM000813.5:3135, 138 vs. 134
    

    Any suggestions?

  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    @falker
    Hi Clemens,

    Can you try running with the latest GATK4 beta? You can test on just the one chromosome that is throwing the error.

    Thanks,
    Sheila

  • falker · Germany · Member

    Hi Sheila,

    I tried GATK4b5 with the following command:

    java -Xmx6G -Djava.io.tmpdir=/home/falker/temp/ -jar /usr/local/bin/gatk-package-4.beta.5-local.jar HaplotypeCaller \
        -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
        -minPruning 1 \
        -minDanglingBranchLength 1 \
        -L CM000829.5 \
        --dbsnp /home/falker/genomes/Sscrofa11.1/dbSNP_Ss10.2_to_Ss11.1_SortVcf.vcf \
        -O F1_Pigs_haplotypecaller_GATK4b5_nct1_CM000829.5.raw.snps.indels.vcf \
        -I
    

    I get countless warnings like this one:

    10:25:19.953 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    This makes total sense, because the read depth is only 1-2 in these very low-coverage samples.

    I cannot find an argument to ignore read depth. Can you help me out please?

    Thank you
    Clemens

  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    @falker
    Hi Clemens,

    Yikes, 1-2X coverage is not good for calling variants. Is that the depth over your entire BAM file?

    -Sheila

  • falker · Germany · Member

    Well, the mean coverage of each sample is actually only about 0.9x. Looking at a sample in IGV, there are positions in the BAM file covered by three reads, but there are also stretches with no reads at all. This is entirely intended, since recent findings in livestock population genetics suggest that imputation from low-coverage sequencing data is cheaper and more accurate than using a SNP chip.
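    (For reference, a per-sample mean coverage of this kind can be computed along the lines of samtools depth; this is a sketch rather than necessarily my exact command:)

    # approximate genome-wide mean coverage for one sample (-a includes zero-coverage positions)
    samtools depth -a Sample_10_bwa_mem_markduplicates_recal_reads_merged.bam \
        | awk '{sum += $3} END {print sum / NR}'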

    Based on this publication ( https://www.biorxiv.org/content/early/2017/07/28/169789 ) I resorted to using mpileup to call the low-coverage variants and to optimizing VQSR with exploratory runs using different settings (QUAL, maxGaussians and minNumBadVariants); a sketch of one such run is below.
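    The sketch assumes the dbSNP file as the only resource and a generic annotation set; in reality I vary the values (and the QUAL pre-filtering) between runs, so do not read this as my exact command:

    # one exploratory VQSR round (SNP mode); maxGaussians and the other knobs are varied between runs
    java -Xmx16G -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
        -T VariantRecalibrator \
        -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
        -input F1_Pigs.raw.snps.indels.vcf \
        -resource:dbsnp,known=true,training=true,truth=true,prior=6.0 /home/falker/genomes/Sscrofa11.1/dbSNP_Ss10.2_to_Ss11.1_SortVcf.vcf \
        -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum \
        -mode SNP \
        --maxGaussians 4 \
        -recalFile F1_Pigs.snps.recal \
        -tranchesFile F1_Pigs.snps.tranches
    java -Xmx16G -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
        -T ApplyRecalibration \
        -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
        -input F1_Pigs.raw.snps.indels.vcf \
        -mode SNP \
        --ts_filter_level 99.0 \
        -recalFile F1_Pigs.snps.recal \
        -tranchesFile F1_Pigs.snps.tranches \
        -o F1_Pigs.snps.vqsr.vcf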

    With the low-coverage SNPs from the HaplotypeCaller run I could find an adequate combination of settings leading to a Ti/Tv ratio around 2.1. With mpileup I am not as successful, but I will use that data if I have no other choice.
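    (For the Ti/Tv numbers, a check along the lines of VariantEval works; this is a sketch of the sort of command I mean rather than a verbatim record of what I ran:)

    # VariantEval reports the Ti/Tv ratio via its TiTvVariantEvaluator module
    java -jar /usr/local/bin/GenomeAnalysisTK-3.8-0.jar \
        -T VariantEval \
        -R /home/falker/genomes/Sscrofa11.1/GCA_000003025.6_Sscrofa11.1_genomic.fna \
        --eval F1_Pigs.snps.vqsr.vcf \
        --dbsnp /home/falker/genomes/Sscrofa11.1/dbSNP_Ss10.2_to_Ss11.1_SortVcf.vcf \
        -o F1_Pigs.snps.vqsr.eval.grp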

    Ideally, I would like to resolve the initial issue I described with GATK 3.8. Do you have any clue where this AN tag error might come from?

  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    @falker
    Hi,

    Alright. So, to recap, you have two VCFs (one from HaplotypeCaller and one from mpileup). They both validate, but when you merge them, you get validation errors? Is this correct? Or, do you get the errors after running a third-party tool on the merged VCF?

    Thanks,
    Sheila

  • falker · Germany · Member

    Hi Sheila,

    Variants called with mpileup do not cause any issues, but I am not satisfied with the VQSR results on them.

    Variants called with HaplotypeCaller produce the ValidateVariants error that you can find in my first post.

    I might have found a tool that solves my problem: conform-gt removes all variants from the query VCF that do not match the reference VCF prior to imputation. With that, I was able to use the dataset generated with HaplotypeCaller in Beagle 4.1 (roughly as sketched below).
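    (The jar names, the chrom argument and the reference-panel file name below are placeholders on my side, so this is a sketch rather than my exact commands; the founder panel is assumed here to be a phased VCF:)

    # 1) drop query records that do not match the reference panel (founders)
    java -jar conform-gt.jar \
        ref=Founders.vcf.gz \
        gt=F1_Pigs.raw.snps.indels.vcf.gz \
        chrom=CM000812.5 \
        out=F1_Pigs.conformed.CM000812.5
    # 2) impute the low-coverage F1 genotypes against the founder panel with Beagle 4.1
    java -Xmx16G -jar beagle.jar \
        gt=F1_Pigs.conformed.CM000812.5.vcf.gz \
        ref=Founders.vcf.gz \
        out=F1_Pigs.imputed.CM000812.5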

    Nevertheless, it would be nice to resolve the issue with HaplotypeCaller 3.8.

  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    @falker
    Hi,

    If the output of HaplotypeCaller is not validating, we would like to look into it. Can you share some test data with us?
    Instructions are here.

    Thanks,
    Sheila

  • falker · Germany · Member

    Hi Sheila,

    Sure, I would be glad to. I will get back to you next year. Have some nice holidays!

    Clemens

  • falker · Germany · Member

    Hi Sheila,

    Since I am able to continue my analysis despite the error, and GATK4 is now out of its beta phase, you can close the thread. I also realized you removed all the 3.8 tutorials from your website, so this may be a waste of time for you.

    Thank you for your help
    Clemens

  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    @falker
    Hi Clemens,

    Okay, thanks for the update. As for the GATK3 docs, you can still view them: simply choose the version you want from the orange dropdown box here. :smiley:

    -Sheila
