AbstractVCFCodec error when running MuTect?

nstickle, Toronto (Member)

I just ran a whole group of paired BAM files through MuTect and keep getting warnings from "AbstractVCFCodec". Specifically, as it's processing reads near the beginning of chr2, it reports:

WARN AbstractVCFCodec - Allele detected with length 1133370 exceeding max size 1048576 at approximately line 35003, likely resulting in degraded VCF processing performance

The "approximately line" values range from 34792 to 40487, and the reported length is always the same. Then in chr5 it warns (again with the same length, for about 20 different line numbers):

WARN AbstractVCFCodec - Allele detected with length 1857070 exceeding max size 1048576 at approximately line 100968, likely resulting in degraded VCF processing performance

The MuTect command I used was:

java -Xmx4g -jar $mutect_dir/muTect-1.1.4.jar -T MuTect \
--reference_sequence hg19.genome.fa --cosmic Cosmic.hg19.vcf --dbsnp dbsnp_138.hg19.vcf \
--intervals Regions.bed -dt NONE -rf BadCigar \
--input_file:normal normal.cocleaned.bam --input_file:tumor tumor.cocleaned.bam \
--out pair..mutect.out --coverage_file pair.coverage.wig.txt --vcf pair.mutect.vcf

The line numbers and warning messages are the same for every pair that I ran, though oddly, the warning messages do not appear exactly in the same place in the output files (sometimes the first set of warnings were during processing of chr1, sometimes during processing of chr2). My input files are BAMs, which would lead me to suspect that this error comes from the dbsnp and cosmic files, but looking at those VCF files there doesn't seem to be anything odd about those line ranges.

The final VCFs produced by MuTect look reasonable based on a quick skim-through, but the strange warnings make me nervous. Do you know what might cause these warnings?
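To see at a glance how many distinct allele lengths are reported and which "approximately line" ranges they cover, the warnings can be tallied from a saved copy of the MuTect console output. The following is a minimal sketch, assuming the log lines look like the two warnings quoted above; the regular expression and the function name `summarize_warnings` are illustrative and not part of MuTect or GATK.

```python
import re
from collections import defaultdict

# Pattern modelled on the warning text quoted in the question.
# Assumption: every warning in the log follows this wording.
WARNING_RE = re.compile(
    r"Allele detected with length (\d+) exceeding max size (\d+) "
    r"at approximately line (\d+)"
)


def summarize_warnings(log_lines):
    """Map each reported allele length to (warning count, lowest line, highest line)."""
    lines_by_length = defaultdict(list)
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match is None:
            continue  # not an allele-length warning
        length, _max_size, approx_line = (int(value) for value in match.groups())
        lines_by_length[length].append(approx_line)
    return {
        length: (len(approx_lines), min(approx_lines), max(approx_lines))
        for length, approx_lines in lines_by_length.items()
    }
```

A typical use would be `summarize_warnings(open("mutect.log"))`, where `mutect.log` is a hypothetical file holding the redirected console output. If only one or two distinct lengths appear across all pairs, that points toward a shared input (such as a resource VCF) rather than the individual BAMs.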

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    This error means that the VCF parser (the part of the program that reads the information from the VCF into a format that it can work with) encountered an allele that is longer than the accepted maximum length defined by the VCF specification. At a basic level, the main problem this could cause is that the program may operate more slowly when it reads the file. I'm not familiar enough with those files to know if it is expected that you would find such long alleles. I would check the file to see if the alleles are real or if the file looks somehow malformed in that region. But otherwise I don't think this should be a cause for concern.

  • nstickle, Toronto (Member)

    Hi Geraldine. Thanks for your quick reply. The problem is that I have used the same dbSNP and COSMIC VCFs with MuTect previously with no warnings; the only difference I can see is the BAM files. Can you think of any reason these could be causing this warning? Thanks again for your time!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hmm, I'm not sure... It could be that it's the first time you have a sample hitting those sites, or something unusual in those bams that is causing MuTect to call unusually long alleles. Can you post the lines in the approximate region of the error in your output VCF? Do you actually see these huge alleles in the output VCF?
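Scrolling through a large VCF by eye makes it easy to miss a single very long REF or ALT string, so the check suggested here can also be done programmatically. This is a minimal sketch, assuming a standard tab-delimited VCF with REF and ALT in columns 4 and 5; the threshold 1048576 (2**20) is taken from the warning text, and the helper names `open_vcf` and `find_long_alleles` are made up for illustration.

```python
import gzip

# Threshold copied from the warning message (1048576 == 2**20).
MAX_ALLELE_SIZE = 1048576


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def find_long_alleles(lines, max_size=MAX_ALLELE_SIZE):
    """Return (line number, CHROM, POS, ID, allele length) for each oversized allele."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed record; too few columns to hold REF and ALT
        chrom, pos, variant_id, ref, alt = fields[:5]
        # ALT may hold several comma-separated alleles; check each one plus REF.
        for allele in [ref] + alt.split(","):
            if len(allele) > max_size:
                hits.append((line_number, chrom, pos, variant_id, len(allele)))
    return hits
```

Running `find_long_alleles(open_vcf("Cosmic.hg19.vcf"))` on each input and output VCF in turn would show which file, if any, actually contains alleles above the threshold. Passing a smaller `max_size` (for example 1000) lists large variants that fall below the warning cutoff but may still be worth inspecting.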

  • teamcoop, Ontario, Canada (Member)

    Hi Geraldine, I'm a coworker of @nstickle working on the same project. Thanks for all your help.

    I don't see any huge alleles when I scroll through the output VCF, the Cosmic VCF or the dbSNP VCF. I opened up the BAMs in IGV but don't see anything strange about the areas where the errors occur. We looked back at another project where we ran Mutect and saw the same warning messages.

    It's interesting to me that we see the same allele length reported in our warnings (AbstractVCFCodec - Allele detected with length 1133370 exceeding max size 1048576) as did the person who reported a similar warning from VariantAnnotator elsewhere on the forum. Is there anything special about that length?

    For the same samples, with slightly different BAM processing procedures, I see the same warnings at slightly different points in the file (e.g. during processing of chr1 instead of chr2). I am currently classifying the samples with Xenome and will report back whether, once aligned, those FASTQs lead to the same warning. If it's relevant, these data are from targeted sequencing with very high coverage.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hmm, could be a cap to the value a given variable can take. Or you're using the same file, but then I'd expect the error to occur reproducibly in the same place. If you can manage to nail down the error to a specific interval that reproduces consistently in test files I'd be happy to take a look in the debugger.

  • teamcoop, Ontario, Canada (Member)

    It looks like the issue was with our COSMIC VCF but there doesn't appear to have been any effect on the variants called.

    I downloaded the COSMIC and dbSNP files from the MuTect page as suggested in this forum discussion and chained them over to hg19 using the liftOverVariants.pl script. The warnings appeared only when I used our local COSMIC VCF, rather than the one from the MuTect page.
    I ran a quick vcf-compare on the variants called with the different combinations of input VCFs and there was no difference, so I feel comfortable proceeding with the variants that were called using our newer version of COSMIC.
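For readers without VCFtools at hand, the same kind of sanity check can be approximated by comparing the sets of called variants directly. The sketch below is a simplified stand-in, not a reimplementation of vcf-compare: it keys each record on CHROM, POS, REF and ALT (optionally the FILTER column as well) and ignores genotypes and annotations. The function names are hypothetical.

```python
def variant_keys(lines, with_filter=False):
    """Collect one key per (CHROM, POS, REF, ALT allele) found in VCF text lines.

    When with_filter is True, the FILTER column (column 7) is added to the key,
    so a site whose filter status changes between runs counts as a difference.
    """
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        status = fields[6] if with_filter else None
        for allele in alt.split(","):
            keys.add((chrom, pos, ref, allele, status))
    return keys


def compare_calls(lines_a, lines_b, with_filter=False):
    """Return the variants unique to each call set and those shared by both."""
    keys_a = variant_keys(lines_a, with_filter)
    keys_b = variant_keys(lines_b, with_filter)
    return {
        "only_a": keys_a - keys_b,
        "only_b": keys_b - keys_a,
        "shared": keys_a & keys_b,
    }
```

Two runs that differ only in the COSMIC input would be expected to give empty `only_a` and `only_b` sets if the resource file had no effect on the calls. Setting `with_filter=True` is the stricter comparison, since it also catches records whose FILTER value differs between runs.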

    After a closer look at our VCF, it seems like there are a few large variants near the end of chr1 and the beginning of chr2, though not at the line numbers reported in the warning messages. I can copy down the COSMIC IDs of those variants and post them if that's of interest. Our COSMIC VCF is built from the v64 VCF files that were provided by COSMIC via this page.

  • teamcoop, Ontario, Canada (Member)

    Geraldine,

    I think that's the issue, and it sounds like our large SNVs just won't be considered for weighting the model, which is fine. Thanks for your reply!
