AD field formatting errors

Hi. I have variants where the AD field is incorrectly formatted, such as

0/2:21,19,0:40:99:527,590,1204,0,614,557

for a MULTI-allelic SNP.

The AD field should be 21,0,19 in this case.

I've been using GenotypeGVCFs in version:
nightly-2014-06-07-g1006061, Compiled 2014/06/07 00:01:21

using the HC incremental discovery pipeline.

I know that AD field problems in this pipeline are a known issue. Is there any ETA for a fix? It's making downstream analysis with our scripts very difficult. Should I try the latest nightly build?

I've previously seen problems with BIALLELIC SNPs having AD output such as 32,0,15 instead of 32,15 (which also threw off our scripts), but I am so far not seeing that with the 2014/06/07 nightly build. So at least the AD problem is partially corrected as far as I know.
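
A minimal sketch of the length check implied above, assuming AD carries one value per allele (REF first, then each ALT in order). The function name `ad_length_ok` and the ALT value `"T"` are hypothetical and only for illustration; they are not from the affected VCF.

```python
def ad_length_ok(alt_field: str, ad_field: str) -> bool:
    """Return True if AD has one value per allele (REF plus each ALT).

    alt_field: the ALT column of the VCF record, e.g. "T" or "A,T".
    ad_field:  the AD value for one sample, e.g. "32,15".
    """
    n_alleles = 1 + len(alt_field.split(","))  # REF + number of ALT alleles
    return len(ad_field.split(",")) == n_alleles


# Biallelic site (ALT "T" is a made-up placeholder allele).
print(ad_length_ok("T", "32,15"))    # two alleles, two AD values
print(ad_length_ok("T", "32,0,15"))  # two alleles, three AD values
```

A check like this would catch the extra-value case on biallelic sites, but not the misplaced-value case on multiallelic sites, where the length is right and only the order is wrong.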

Thanks.

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jhomsy

    Hi,

    Yes, this should be fixed in the latest nightly build.

    -Sheila

  • jhomsy (Member)

    Hi Sheila,
    Thanks for your response.

    I downloaded two more nightly builds, including the latest one:

    GenomeAnalysisTK-nightly-2014-06-23-g672adf3
    and
    GenomeAnalysisTK-nightly-2014-06-20-g2c1530d

    and still no luck:
    0/2:21,19,0:40:99:527,590,1204,0,614,557

    Here are the Program Args:

    -T GenotypeGVCFs -R Homo_sapiens_assembly19.fasta -L 3:196388145 --variant database.bp-res.all.list -o test3.vcf

    Thanks,
    Jason

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jhomsy

    Hi Jason,

    Please upload a snippet of your file where the problem occurs. Instructions on how to do this are found here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • jhomsy (Member)

    Thanks Sheila. I just uploaded jhomsy.ADfield.vcf.gz. It is a vcf file with the multiallelic site in question.
    It was created using GenotypeGVCFs reading a database list of >5000 exomes (actually, multiple lists of about 50-75 exomes each, created using CombineGVCFs).

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)
    edited June 2014

    @jhomsy

    Hi,

    I see you have only uploaded the vcf file.

    Please upload these:

    1) The exact command line that you used when you had the problem (in a text file)

    2) The full stack trace (program output in the console) from the start of the run to the end or error message (in a text file)

    3) A snippet of the BAM file if applicable and the index (.bai) file associated with it

    4) If a non-standard reference (i.e. not available in our resource bundle) was used, we need the .fasta, .fai, and .dict files for the reference

    5) Any other relevant files such as recalibration plots

    Thanks.

    -Sheila

  • jhomsy (Member)

    Hi Sheila,

    Sorry, but it seems that most of these are not exactly applicable. For example:

    1. I can provide a full command for how I created the VCF.

    2. There is no error, so there won't be a stack trace. The problem I am describing is how the VCF is formatted.

    3. Snippet of BAM file: Don't know exactly how to do this: the VCF was made using the HaplotypeCaller incremental discovery pipeline involving over 5,000 GVCFs. Do you want me to create a combined GVCF for a token position from all the samples?

    4. N/A: human hg19

    5. No recalibration plots.

    I don't mean to be difficult, but the problem is within the VCF itself. The VCF I provided shows a multiallelic site. In the AD field of the non-ref genotyped individuals, the alternate read count appears in the same position (the second value) regardless of which alternate allele was called.
    i.e.:
    For Alt allele 1 (GT:AD):
    0/1:Ref,Alt-1,0
    and for Alt allele2:
    0/2:Ref,Alt-2,0

    where it should be:
    For Alt allele 1 (GT:AD):
    0/1:Ref,Alt-1,0
    and for Alt allele2:
    0/2:Ref,0,Alt-2
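
The mismatch described above can be flagged with a rough heuristic: a called allele has zero depth while an uncalled allele has reads. This is only a sketch, assuming the sample column starts with `GT:AD` (the remaining subfields are ignored) and that AD lists REF first and then each ALT in order. The function name `suspicious_ad` is hypothetical, and a genuine low-coverage call could occasionally trip the same test.

```python
def suspicious_ad(sample_field: str, fmt: str = "GT:AD") -> bool:
    """Flag a sample whose AD looks shifted relative to its genotype.

    Returns True when at least one called allele has AD == 0 and at least
    one allele that was NOT called has AD > 0.
    """
    # Pair FORMAT keys with sample values; extra trailing values are ignored.
    values = dict(zip(fmt.split(":"), sample_field.split(":")))
    gt, ad = values["GT"], values["AD"]
    if "." in gt or "." in ad:
        return False  # missing genotype or depth: nothing to compare

    called = {int(a) for a in gt.replace("|", "/").split("/")}
    depths = [int(d) for d in ad.split(",")]

    zero_called = [a for a in called if a < len(depths) and depths[a] == 0]
    uncalled_with_reads = [
        i for i, d in enumerate(depths) if i not in called and d > 0
    ]
    return bool(zero_called) and bool(uncalled_with_reads)


# The sample value quoted in this thread, and the layout it should have had.
print(suspicious_ad("0/2:21,19,0:40:99:527,590,1204,0,614,557"))
print(suspicious_ad("0/2:21,0,19"))
```

For the record quoted in this thread, allele 2 is called but has depth 0, while allele 1 is not called and has depth 19, so the first call reports a problem and the second does not.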

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jhomsy

    Hi,

    For 2, please provide the log output. I will change that in the FAQ!

    For 3, we need a data snippet on which we can reproduce the HC variant calling + joint genotyping. The VCF alone is not helpful. We can see it is not formatted properly, but we cannot track down why without the original data.

    -Sheila

  • jhomsy (Member)

    OK, I just uploaded everything you need to reproduce the problem.
    I was able to reproduce the problem with these snippet files, as explained in the README.
    The file is uploaded as jhomsy.zip.
    Let me know if you have any questions.
    Thanks!
    Jason

  • jhomsy (Member)

    Were you able to reproduce the problem?
    Thanks.

  • jhomsy (Member)

    Hi.
    Was a bit of a drag to see this wasn't fixed in the 3.2 release. Any idea how long this may take?
    Thanks.

  • jhomsy (Member)

    Oh Thanks! Nightly builds are fine!
    Appreciate your reply.
    Jason

  • jhomsy (Member)

    Any progress here as well? Thanks.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jhomsy

    Hi Jason,

    I made a note that you are excited to have this done, so I hope the developers will fix it soon. I will let you know as soon as I know it is fixed.

    -Sheila

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jhomsy

    Hi Jason,

    We have a possible fix going through code review right now, so this issue should be fixed very soon!

    -Sheila

  • jhomsy (Member)

    Thanks so much for the update!
    Jason

  • jhomsy (Member)

    The latest nightly build fixes the problem! Thank you so much.
    Jason
