ROD files out of FASTA? + other questions

Hey all, newbie here.
tl;dr:
I have a FASTA file containing two sequences of my region of interest (~5.5 kbp) that differ by ~100 SNPs. What is the fastest way to generate a ROD file from these sequences, as an input to BQSR?

So, hey.
I'm trying to determine the frequency of a genetic fragment I introduced into a bacterial strain, in several different samples. As I wrote, my current challenge is to create the aforementioned ROD file; however, my project is a bit different from 'usual' variant calling projects, and any advice regarding processing and analysis would be appreciated.

  1. I have a WT bacterial strain. I introduced a 5.5 kbp genetic fragment into it by electroporation and homologous recombination. It is safe to assume that different parts of the fragment have invaded the host's genome with different efficiencies (so I may have 'hybrid' variants that are half WT and half mutated). The introduced fragment had ~100 SNPs compared to the WT fragment.
  2. I took that sample and grew it under different conditions, in order to determine whether the fragment I introduced is beneficial to the bacteria.
  3. The fragments were PCR-amplified, sheared into smaller DNA fragments (~300-500 bp), and sequenced (150 bp per read, paired-end). I have a coverage of 10^6 reads per base for each sample.
  4. I'd like to determine the frequency of each SNP in each sample, and ideally, the identity and frequency of each variant.

I have:
The sequencing samples (1 sample of the initial pool, 6 biological replicates for one condition, and 3 biological replicates for the second condition), the sequence of the WT's genome, and the sequence of the fragment I introduced.

My questions:
1. How do I turn the FASTA file containing my WT and modified fragments into a ROD file (type doesn't matter) for the BQSR procedure? I do not need to rely on the sequenced samples to determine the differences between the sequences; I already know them.
2. Since all my reads originate from a PCR-amplified fragment, can de-duplication introduce bias or underestimation into my data?
3. I have huge coverage. Does it require any different processing methods?
4. Any other advice?
4. Any other advice?

Thanks,
Omer
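
For question 1, one possible approach is to compare the two sequences position by position and write every mismatch as a VCF record, since VCF is one of the formats commonly accepted as a known-sites file. The sketch below is only an illustration under several assumptions that are not stated in the post: the two sequences have the same length and differ only by substitutions (no indels), the contig name `chr1` and the `offset` of the fragment within the reference genome are hypothetical placeholders, and the function names `read_fasta` and `snps_to_vcf` are made up for this example. The contig name and coordinates would need to match the reference the reads were aligned to, and the resulting file would probably still need to be indexed before a GATK tool accepts it.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks).upper()))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks).upper()))
    return records


def snps_to_vcf(wt, mod, chrom, offset=0):
    """Return minimal VCF text listing every substitution between wt and mod.

    ASSUMPTION: wt and mod are already aligned and of equal length (SNPs only).
    `offset` is the 0-based start of the fragment on contig `chrom`.
    """
    if len(wt) != len(mod):
        raise ValueError("sequences differ in length; align them first")
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for i, (ref, alt) in enumerate(zip(wt, mod)):
        # Skip matches and ambiguous bases such as N.
        if ref != alt and ref in "ACGT" and alt in "ACGT":
            pos = offset + i + 1  # VCF positions are 1-based
            lines.append(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.")
    return "\n".join(lines) + "\n"


# Toy input standing in for the real two-sequence FASTA file.
fasta_text = ">wt\nACGTACGT\n>modified\nACCTACGA\n"
(_, wt_seq), (_, mod_seq) = read_fasta(fasta_text)
vcf_text = snps_to_vcf(wt_seq, mod_seq, chrom="chr1", offset=100)
print(vcf_text)
```

With the real data, `fasta_text` would be read from the two-sequence FASTA file and `vcf_text` written to a `.vcf` file. If the fragments differ in length, they would have to be aligned first and any indels handled separately.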
