Is it possible to use CNNScoreVariants with PacBio reads?

pjedge (La Jolla, CA; Member)

We have developed an SNV calling method for long reads (https://github.com/pjedge/longshot). The false-positive variants that result from our method tend to occur in certain sequence contexts and often carry signals that could be combined to filter them (including some based on assembled haplotype consistency). It would be nice to combine these signals (reference sequence context as well as annotations in our VCF) to filter variants using a supervised learning approach. I am interested in using CNNVariantWriteTensors, CNNVariantTrain, and CNNScoreVariants for this task, but I'm not sure that it's even possible. Are there design considerations that make these tools fundamentally incompatible with non-Illumina sequencing technologies? Further, our output VCF lacks most of the annotations specified in the GATK best practices, and many of those best-practice annotations are geared toward Illumina reads. I suspect many of them would not be good features for PacBio reads if I were to just plug my data into VariantAnnotator to fill in annotations. We would be especially interested in leveraging custom annotations that are long-read specific. Would it be possible for us to define our own annotation set to use with these tools?

Answers

  • bhanuGandham (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi @pjedge

    I have asked the CNN developer to look into this. We will get back to you shortly.

  • samwell (Cambridge, MA; Member, Broadie)

    @pjedge Thanks for your question. It is definitely possible to train a model that takes custom annotations as input, as long as you have a truth VCF and a confidence region for the sample VCF with the custom annotations. Unfortunately, this won't work right out of the box; a few code changes are necessary. Specifically, you would need to define a new annotation_set in the file defines.py. Then, when you call the tools CNNVariantWriteTensors and CNNVariantTrain, you would pass the argument --annotation-set my_long_read_annotation_set. The first question is whether you have a sample with the annotations you would like to use, along with an accompanying truth VCF and confidence region. If you have a VCF you can share, I would be happy to help with the code changes.
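
To make the idea of a custom annotation set concrete, here is a minimal, self-contained sketch of what such a definition might look like and how a named set could be turned into a numeric feature vector from a VCF INFO column. It is an illustration only: the dictionary name ANNOTATION_SETS, the helper functions, and the annotation keys HAP_CONSISTENCY and STRAND_FRAC are hypothetical and are not taken from GATK's defines.py or from Longshot's output; the actual variable name and layout in defines.py may differ.

```python
# Hypothetical stand-in for the annotation-set table in defines.py.
# The real variable name and structure in GATK may differ; the keys
# HAP_CONSISTENCY and STRAND_FRAC are made-up long-read annotations.
ANNOTATION_SETS = {
    "my_long_read_annotation_set": ["DP", "HAP_CONSISTENCY", "STRAND_FRAC"],
}


def parse_info(info_field):
    """Parse a VCF INFO column (e.g. 'DP=30;SOMEFLAG') into a dict of strings."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        # Flag-style entries have no '=' and get an empty string as value.
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def annotation_vector(info_field, annotation_set, missing=0.0):
    """Return the annotations of the named set as floats, in set order.

    Annotations that are absent or non-numeric are replaced by `missing`.
    """
    info = parse_info(info_field)
    vector = []
    for key in ANNOTATION_SETS[annotation_set]:
        try:
            vector.append(float(info[key]))
        except (KeyError, ValueError):
            vector.append(missing)
    return vector


if __name__ == "__main__":
    example_info = "DP=30;HAP_CONSISTENCY=0.9;SOMEFLAG"
    print(annotation_vector(example_info, "my_long_read_annotation_set"))
```

The name of the set, my_long_read_annotation_set, is the value that would be passed to --annotation-set as described above; each listed key would need to be present in the INFO field of the sample VCF.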