I have a question about the k-mer length used in the second step of HaplotypeCaller


As shown in my screenshot, HC splits the reference sequence of the ActiveRegion and the reads into k-mers of length 10 and 25, respectively.

Furthermore, the documentation (https://software.broadinstitute.org/gatk/documentation/article.php?id=4146) states that in the read threading process, HC starts with the first read and compares its first k-mer to the hash table to see whether it has a match.

Given this, I am confused about two points:
Shouldn't the k-mer length be an odd number?
If the k-mer length differs between the reference k-mers and the read k-mers, how can a read k-mer be matched against the reference k-mers in the hash table?

One more small question: at the time of this post, I cannot load the web page of your Bundle via FTP. Every time I try to log into that page, a small window pops up asking for a username and code. I enter the username and leave the code blank as instructed, but it does not work; the window pops up again every time I hit Enter.

Answers

  • shlee, Cambridge (Member, Broadie, Moderator, admin)
    edited November 15

    Hi @Yangyxt,

    As far as I know, graph assembly builds separate 10-mer and 25-mer graphs. The reference is always represented by a path in the graph (onto which read paths are also grafted), and cycles are resolved by increasing the k-mer size incrementally. I'm not sure if my description helps.
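
To illustrate the hash-table lookup mentioned in the question, here is a minimal sketch, assuming that within one graph the reference and the reads are split with the same k. It is not HaplotypeCaller's actual code; the function names (`kmers`, `build_kmer_table`, `first_match`) and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_table(reference, k):
    """Map each reference k-mer to the list of positions where it starts."""
    table = {}
    for pos, kmer in enumerate(kmers(reference, k)):
        table.setdefault(kmer, []).append(pos)
    return table


def first_match(read, table, k):
    """Return (read_offset, ref_positions) for the first read k-mer
    present in the table, or None if no read k-mer matches."""
    for offset, kmer in enumerate(kmers(read, k)):
        if kmer in table:
            return offset, table[kmer]
    return None


# Toy data (hypothetical): the read differs from the reference at one base.
reference = "ACGTTGCATGCCATGA"
read = "TTGCATGACATGA"
k = 5

table = build_kmer_table(reference, k)
hit = first_match(read, table, k)  # first read 5-mer "TTGCA" starts at reference position 3
```

Because the table keys all have length k, a read k-mer of any other length can never be found in it. Under this assumption, a lookup only makes sense when both sides use the same k.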

    As for the FTP site, I had our IT department reset the server on November 8th because it was buggy in exactly the way you describe:

    "Every time I tried to log into that page, a little window pops out requiring input of username and code. I input the username and leave the code blank as instructed. But it does not work, the little window keeps popping out every time I hit Enter."

    Sorry you are experiencing this issue too. After the reset a week ago, the FTP site seemed to work as expected. Could you please try again and let us know if it is still buggy? Thanks.

    P.S. The FTP site is working for me right now.

  • Yangyxt (Member)

    Dear @shlee ,

    Thank you for your reply! Your description has mostly answered my question, but I am still confused about one point.

    I was taught that the k-mer size used in de Bruijn graph assembly should be an odd number because of sequences like:

    CGCGCGCG

    For instance, with a k-mer size of 4, you get CGCG, which is its own reverse complement. This makes the source of the k-mer ambiguous: did it come from the sequence itself or from its reverse complement? That in turn makes it harder to assemble k-mers from the same read sequence together.

    So why do you allow a k-mer size of 10?
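
The self-reverse-complement issue described above can be checked by brute force. This is a small sketch; `reverse_complement` and `self_rc_kmers` are illustrative helper names, not part of any GATK API.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def self_rc_kmers(k):
    """Enumerate all k-mers over ACGT that equal their own reverse complement."""
    result = []
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            result.append(kmer)
    return result


cgcg_is_self_rc = reverse_complement("CGCG") == "CGCG"
even_count = len(self_rc_kmers(4))  # some even-length k-mers are their own reverse complement
odd_count = len(self_rc_kmers(5))   # for odd k the middle base would have to be its own complement
```

For odd k the list is always empty, because the middle base would need to pair with itself. That is the usual argument for odd k when strand orientation is unknown.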

  • shlee, Cambridge (Member, Broadie, Moderator, admin)

    Hi @Yangyxt,

    I've consulted with one of our methods developers and here is their reply:

    It's true that for naive assembly some even-k k-mers can be confused with their reverse complement. However, HaplotypeCaller takes k-mers from reads in a BAM file aligned to the reference. Thus we know the orientation and k-merize reads as forward-strand, i.e. for reverse-strand reads, we k-merize the reverse complement.
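
As a rough illustration of orientation-aware k-merization, here is a minimal sketch, assuming the read sequence is given as it came off the sequencer together with a flag saying whether it aligned to the reverse strand. (In a BAM file, the sequence of a reverse-strand alignment is already stored reverse-complemented, so this flip models the idea rather than a step you would repeat on BAM records.) `kmerize_forward` is a hypothetical name, not a HaplotypeCaller function.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmerize_forward(read_seq, is_reverse_strand, k):
    """Split a read into k-mers in forward-strand orientation.

    read_seq is assumed to be in as-sequenced orientation; reads that
    aligned to the reverse strand are flipped before k-merization.
    """
    seq = reverse_complement(read_seq) if is_reverse_strand else read_seq
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Toy example (hypothetical data): the same locus read from both strands.
forward_read = "AACGTG"
reverse_read = reverse_complement(forward_read)  # as-sequenced reverse-strand read

forward_kmers = kmerize_forward(forward_read, False, 4)
reverse_kmers = kmerize_forward(reverse_read, True, 4)
```

Both reads yield the same forward-strand k-mers, so a k-mer never has to be compared against its reverse complement, and the ambiguity for even k does not arise.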

    More generally, starting with k=10 is an optimistic attempt to use short k-mers that will be tolerant to sequencing error but are potentially too short to be unique. If that works, we have a good de Bruijn graph (maybe after pruning a few bad nodes). If not (e.g. there are loops), then k is increased successively. By default the subsequent larger k are all odd. k=10 is somewhat arbitrary (11 would probably be just as good), but it was chosen in the past because it worked well, and in this specific context the even-k problem for de Bruijn graphs is circumvented.

    I hope this answers your concern about even k-mers.
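
The "increase k until the loops disappear" strategy described above can be sketched as follows. This is a simplified illustration, not HaplotypeCaller's implementation: nodes are k-mers, edges join consecutive k-mers, there is no pruning, and the list of candidate k values is a hypothetical parameter chosen for the toy data.

```python
def build_kmer_graph(sequences, k):
    """Build a directed graph whose nodes are k-mers and whose edges
    connect each k-mer to the k-mer that follows it in a sequence."""
    graph = {}
    for seq in sequences:
        previous = None
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph.setdefault(kmer, set())
            if previous is not None:
                graph[previous].add(kmer)
            previous = kmer
    return graph


def has_cycle(graph):
    """Detect a directed cycle with an iterative three-color DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if color[nxt] == GRAY:
                    return True  # back edge: nxt is on the current DFS path
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                color[node] = BLACK  # all neighbors explored
                stack.pop()
    return False


def first_acyclic_k(sequences, candidate_ks):
    """Return the first k in candidate_ks whose graph has no cycle, else None."""
    for k in candidate_ks:
        graph = build_kmer_graph(sequences, k)
        if graph and not has_cycle(graph):
            return k
    return None


# Toy data (hypothetical): "ACGT" repeats, so short k-mers create a loop.
sequences = ["ACGTACGTTT", "GTACGTTT"]
chosen_k = first_acyclic_k(sequences, [3, 4, 5])
```

In this toy case k=3 and k=4 both produce a loop through the repeated ACGT, while at k=5 every k-mer is unique and the graph is a simple path.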
