
Creating a reference genome for DepthOfCoverage

mc482 (sussex, Member)

I am trying to use DepthOfCoverage to analyse some sequencing I have done in Saccharomyces cerevisiae.

I have had some problems creating the reference genome, dict and index files. First, the reference contained a non-IUPAC character, which turned out to be \n (line break). I removed all instances of the character and remade the index and dict, but now the index file is incorrect. Instead of listing all the contigs, it contains only the single line

tpg|BK006935.2| 0 12274494 -1 -1

which I think refers only to the first chromosome.

I'd be very grateful for any assistance.

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @mc482

    Hi,

    How are you making the index? Are you sure the reference and dictionary files are correct? Have a look at this article, which may help you: http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference

    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    Keep in mind that you have to have line breaks between contig (chromosome) records to identify them as separate in the FASTA file. Otherwise everything gets parsed as a single record, which I think is what you're seeing.
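To illustrate the point about record separation, here is a minimal Python sketch (not GATK or samtools code) of how a line-based FASTA reader finds records: a new record starts only where a line begins with ">". The function name `fasta_records` and the toy sequences are made up for this example.

```python
def fasta_records(text):
    """Return a list of (name, sequence_length) for each record in FASTA text.

    A record starts at a line beginning with '>'; every following line up to
    the next '>' line is counted as sequence.
    """
    records = []
    name, length = None, 0
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records.append((name, length))
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            length = 0
        elif name is not None:
            length += len(line.strip())
    if name is not None:
        records.append((name, length))
    return records


# Toy two-contig reference (hypothetical names and sequences).
good = ">chrI\nACGT\nAC\n>chrII\nGGCC\n"
# The same text with every line break removed.
flattened = good.replace("\n", "")

print(len(fasta_records(good)))       # 2
print(len(fasta_records(flattened)))  # 1
```

With the line breaks stripped, the whole file is a single header line, so only one record is seen. That is consistent with an index that lists just one contig.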

  • mc482 (sussex, Member)

    @Geraldine_VdAuwera said:
    Keep in mind that you have to have line breaks between contig (chromosome) records to identify them as separate in the FASTA file. Otherwise everything gets parsed as a single record, which I think is what you're seeing.

    Thanks for the reply. I think this could definitely be the answer, but I tried adding the line breaks back in between each chromosome and got the same result (only one line in the index file).

    I made the new line breaks by simply pressing return, but afterwards I searched the document and could not find the character \n. Do you know if there are different ways of denoting a line break, and which one I should use?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    It can depend on what kind of text editor you are using and what the file encoding is, but generally speaking if you type \n it should be interpreted as a line break.
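Since editors differ in which bytes they write when you press return, it can help to look at the raw bytes rather than search inside the editor. The sketch below is one possible check, not part of any GATK tooling; `count_line_endings` is a hypothetical helper name. It counts Windows (CRLF), old Mac (lone CR) and Unix (lone LF) line endings.

```python
def count_line_endings(data: bytes) -> dict:
    """Count each style of line ending in raw file bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the characters that stand alone.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf}


# Toy bytes mixing all three styles (hypothetical content).
sample = b">a\r\nAC\r\n>b\nGG\rTT"
print(count_line_endings(sample))  # {'CRLF': 2, 'CR': 1, 'LF': 1}
```

For a real file, the bytes would come from something like `open("reference.fasta", "rb").read()`, where the filename is a placeholder. A file with only `LF` counts has Unix line endings throughout.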

  • mc482 (sussex, Member)

    I'm still having trouble with this. I have tried making line breaks between each chromosome using \n, \r and \n\r. They all still create indexes of a single line. There's also an additional error when making the dict file:
    Exception in thread "main" htsjdk.samtools.SAMException: Found invalid line in index file:tpg|BK006935.2| 0 12274569 -1 -1

  • mc482 (sussex, Member)

    Never mind, I've fixed it now. I ditched the file where I had removed all the '\n's, went back to the original file, copied it into gedit (I was using Notepad++ before) and saved the file as UTF-8 with Unix/Linux line endings. Thanks for the replies.
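For anyone who would rather script this fix than re-save the file in an editor, the following is a rough Python equivalent, offered as a sketch under assumptions. It assumes the input is already ASCII or UTF-8 text (it does not re-encode other encodings), and the BOM removal is an extra precaution that was not mentioned in the thread. `to_unix_utf8` is a hypothetical helper name.

```python
import codecs


def to_unix_utf8(data: bytes) -> bytes:
    """Normalise raw text bytes to Unix (LF) line endings without a UTF-8 BOM."""
    # Drop a leading UTF-8 byte-order mark if an editor added one (assumption:
    # some tools may read it as part of the first header line).
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Convert CRLF first, then any remaining lone CR, to LF.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # End the file with a newline so the last sequence line is terminated.
    if data and not data.endswith(b"\n"):
        data += b"\n"
    return data


# Toy input with a BOM and mixed line endings (hypothetical content).
messy = codecs.BOM_UTF8 + b">a\r\nAC\r>b\nGG"
print(to_unix_utf8(messy))  # b'>a\nAC\n>b\nGG\n'
```

As in the fix described above, this should be run on the original file, not on a copy where the line breaks have already been deleted, since removed line breaks cannot be recovered. The index and dict files then need to be regenerated from the cleaned file.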
