To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

GenotypeGVCFs on pooled data sets fail for ploidy > 10

Hi,

I'm currently analysing a data set of six pools, 25 individuals in each (ploidy 50), of a non-model organism. I initially ran HaplotypeCaller with parameter -ERC GVCF and then attempted to do joint genotyping on these files. The resulting vcf was empty however, so in order to troubleshoot, I set up tests on a subset of data (700k sequences, 200k bp contig), which runs mapping, followed by "regular" genotyping with HaplotypeCaller, and compared to running GenotypeGVCFs on output from HaplotypeCaller run with option -ERC GVCF. In addition, I varied the ploidy for each call in HaplotypeCaller, ranging from 10 to 50. I ran bcftools stats on the resulting vcfs and got the following results in terms of number of SNPs:

testout.sort.rg.10.hc.vcf.stats:SN 0 number of SNPs: 485
testout.sort.rg.10.hcnorm.vcf.stats:SN 0 number of SNPs: 494
testout.sort.rg.20.hc.vcf.stats:SN 0 number of SNPs: 0
testout.sort.rg.20.hcnorm.vcf.stats:SN 0 number of SNPs: 502
testout.sort.rg.30.hc.vcf.stats:SN 0 number of SNPs: 0
testout.sort.rg.30.hcnorm.vcf.stats:SN 0 number of SNPs: 508
testout.sort.rg.40.hc.vcf.stats:SN 0 number of SNPs: 0
testout.sort.rg.40.hcnorm.vcf.stats:SN 0 number of SNPs: 504
testout.sort.rg.50.hc.vcf.stats:SN 0 number of SNPs: 0
testout.sort.rg.50.hcnorm.vcf.stats:SN 0 number of SNPs: 503

Here, 'hcnorm' refers to HaplotypCaller in normal mode (no -ERC GVCF). Evidently, for ploidy>10, no SNPs are output. Are there any known problems running GenotypeGVCFs with high ploidy, or am I missing something evident here?

Cheers,

Per

Best Answer

Answers

  • valentinvalentin Cambridge, MAMember, Dev
    edited March 2017

    Strange... higher ploidy is normally more sensitive... that said there are some components (code) that may differ between diploid and non-diploid runs.... so is possible.

    However for us to look into it we would need you report the version that you are using (the VCF output #GATKCommand line should tell you that) and focus on a concrete single variant that is present in the regular VCF mode but is missing from the GVCF model.

    Please post the command lines you are using for both, the VCF and GVCF output at that overlap that position (if it falls in a GVCF non-var block then you need to look at the POS column and the END= info annotation) and ideally also an IGV plot corroborating that the variant seem to be real.

  • punnebergpunneberg Member
    edited March 2017

    Hi Valentin,

    thanks for the quick reply. I'm running GATK Version=3.7-0-gcfedb67 (sorry, I completely forgot to post that info even though I know it's essential).

    Please find attached a screenshot of position 20625, which has AF=1.00, DP=93, so this clearly is a variant.

    The command to generate hcnorm was:

    GenomeAnalysisTK -Xmx6g -Djava.io.tmpdir=/tmp -T HaplotypeCaller -ploidy 50 -I testout.sort.rg.bam -R scaffold430.fasta -o testout.sort.rg.50.hcnorm.vcf -L region.interval_list

    for which position 20625 in vcf is:

    scaffold430 20625 . A G 3731.30 . AC=50;AF=1.00;AN=50;DP=92;FS=0.000;MLEAC=50;MLEAF=1.00;MQ=60.00;QD=33.74;SOR=0.829 GT:AD:DP:GQ:PL 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,92:92:8:3754,1553,1280,1119,1005,917,844,783,730,683,641,603,569,537,507,480,454,430,407,386,365,346,327,310,293,276,261,246,231,217,204,191,178,166,154,142,131,120,109,99,89,79,70,60,51,42,33,25,16,8,0

    Similarly, the commands to generate gvcf and subsequent run of GenotypeGVCFs:

    GenomeAnalysisTK -Xmx6g -Djava.io.tmpdir=/tmp -T HaplotypeCaller -ERC GVCF -ploidy 50 -I testout.sort.rg.bam -R scaffold430.fasta -o testout.sort.rg.50.hcnorm.vcf -L region.interval_list GenomeAnalysisTK -Xmx6g -Djava.io.tmpdir=/tmp -T GenotypeGVCFs -l DEBUG -nt 1 -R scaffold430.fasta -o testout.sort.rg.50.hc.vcf -V testout.sort.rg.50.hc.g.vcf

    and for position 20625 in gvcf we have:

    scaffold430 20625 . A G,<NON_REF> 3732.30 . DP=92;MLEAC=50,0;MLEAF=1.00,0.00;RAW_MQ=331200.00 GT:AD:DP:GQ:SB 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,92,0:92:8:0,0,49,43

    Yet, the output file testout.sort.rg.50.hc.vcf is empty.

    The interval file is used to exclude a region of excessive coverage at the end of scaffold430. To make sure this isn't what's causing the problem, I removed that region from the scaffold, shortening it to ~50kb, realigned, and ran the above commands again. The snp counts this time:

    testout.sort.rg.10.hc.vcf.stats:SN 0 number of SNPs: 259 testout.sort.rg.10.hcnorm.vcf.stats:SN 0 number of SNPs: 270 testout.sort.rg.20.hc.vcf.stats:SN 0 number of SNPs: 0 testout.sort.rg.20.hcnorm.vcf.stats:SN 0 number of SNPs: 267 testout.sort.rg.30.hc.vcf.stats:SN 0 number of SNPs: 0 testout.sort.rg.30.hcnorm.vcf.stats:SN 0 number of SNPs: 273 testout.sort.rg.40.hc.vcf.stats:SN 0 number of SNPs: 0 testout.sort.rg.40.hcnorm.vcf.stats:SN 0 number of SNPs: 281 testout.sort.rg.50.hc.vcf.stats:SN 0 number of SNPs: 0 testout.sort.rg.50.hcnorm.vcf.stats:SN 0 number of SNPs: 280

    I hope that is all the information you need, if not, let me know and I'll provide you with more data.

    /P

  • valentinvalentin Cambridge, MAMember, Dev

    Yup, it looks like the variant should be in the .vcf and it is not.

    You are asking for the debug information (-l DEBUG) so I was wondering whether is there any output DEBUG, WARN or INFO logging message that is giving up a clue what is going on.

    Another think that you can try is to use the advance '-newQual' argument which is an alternative way to calculate the QUAL score that might solve the issue for now.

    Even if -newQual fixes it, we would like you to submit a full bug report so that we can look into it.
    For that, could you please follow the step in here to submit the required files to reproduce the error. That way I'll be able to debug it for you.

    You only need to provide a modified subset of testout.sort.rg.50.hc.g.vcf that includes the header and that position, and reference fasta file that contains scaffold430 up to position 20625; just saying this in case the full reference would be a bit too big to pack and ftp it over. There is no need for the input alignment as it seems that the problem lies with GenotypeGVCFs.

    Before you submit, those files try to reproduce the bug once more using those to make sure that I will be able to see the bug.

    Cheers, V.

  • Hi

    I've uploaded a test data set as requested called gatk_forum_discussion_9190.tar.gz. In it you'll find a README that describes the contents. I looked at the debugging output but nothing really caught my attention. I narrowed the region to 3kb, in which there are 30 SNPs, called correctly by GenotypeGVCFs only for ploidy=10 but not at all for 20, 30, 40, or 50, as previously.

    Please let me know if you need anything else.

    Cheers,

    Per

    Issue · Github
    by Sheila

    Issue Number
    1839
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • Hi

    I am also encountering this same problem. I was wondering if there was any update.

    Thanks,
    Matt

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @mconte
    Hi Matt,

    Sorry for the delay. I should get to processing the bug report this week.

    I will update when I have some news.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @punneberg
    Hi Per,

    Sorry for the delay, but I just tried testing your files. It seems the reference and BAM file have inconsistent sequence lengths. Can you please re-submit the in.bam or the reference file? It looks like you submitted a shorter reference than you mapped the BAM to.

    Thanks,
    Sheila

  • Hi Sheila,

    I completely missed your answer, sorry. The file in.bam is only used to extract fastq sequences used for subsequent mapping to the provided reference scaffold430.fasta. I could have been clearer on this point in the README, I see now.

    As a consequence, all variant calls are based on the generated file out.bam. Running make -B all -n will list all commands, with the top three being:

    bamToFastq -i in.bam -fq /dev/stdout | gzip -vc > read1.fastq.gz
    bwa index scaffold430.fasta
    bwa mem scaffold430.fasta read1.fastq.gz 2> bwa.log | samtools view -Sb - > out.bam

    and so on.

    I hope this helps.

    /Per

  • Hi Sheila,

    thanks for the update. Well, that was simple enough. According to the documentation the default value of --max_num_PL_values is set to 100, but apparently this is not directly related to the number of allele states then since it in that case should be sufficient for the ploidies in question here?

    Cheers,

    Per

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @punneberg
    Hi Per,

    Have a look here for some helpful information.

    -Sheila

Sign In or Register to comment.