BwaAndMarkDuplicatesPipelineSpark Input Format

Hi, I'm looking forward to the GATK 4 release, so I have started investigating 4 beta1. I like the tools that are available, calling method and Spark! I have therefore set up a Spark pipe of BwaAndMarkDuplicatesPipelineSpark and BQSRPipelineSpark: very handy, and it would be cool if there was a pipeline for BWA + MD + BQSR. My issue is with BwaAndMarkDuplicatesPipelineSpark, which specifies its input as BAM/SAM/CRAM; this seems odd. Is fastq not an option? I ran with 2 fastq files (R1, R2) as input and got the error:

Sorry, we only support a single reads input for spark tools for now

I know this is a beta, but can someone explain why input to alignment is in aligned format? I tried merging PE reads into a single fastq, and using just R1.fastq.

Appreciate any input on this,


Best Answer


  • bruce01 Member ✭✭

    OK, I see some mild discontent from comments on that page. FWIW, and in such a (necessarily) pedantic field, using a file format that is already well known, and the acronym of which actually stands for the opposite of what you are storing in it, seems strange to me. But I appreciate GATK with all the quirks. Thanks for your help as ever, looking forward to the full GATK 4 release!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
    @bruce01 That's fair, and I agree it's unfortunate -- but we felt it was better to repurpose an existing format in a not-so-intuitive way rather than come up with yet another format. If anything the format could be renamed to be more generic, since the presence or absence of mapping information is not a deal breaker... not that that's really going to happen ;)
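To make the "BAM without mapping information" idea concrete, here is a minimal sketch of what an unaligned SAM record looks like when built from a FASTQ pair. It is only an illustration of the format, not how GATK or Picard do it; in practice a tool such as Picard's FastqToSam is the usual way to produce a uBAM. The function names, the read group ID `rg1` and the sample name `sample1` are made up for the example.

```python
import io
from typing import Iterator, List, TextIO, Tuple

# SAM flags for an unmapped pair:
# 77  = paired (1) + read unmapped (4) + mate unmapped (8) + first in pair (64)
# 141 = paired (1) + read unmapped (4) + mate unmapped (8) + second in pair (128)
FLAG_R1_UNMAPPED = 77
FLAG_R2_UNMAPPED = 141


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, qualities) from a simple 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        name = header[1:].split()[0]  # drop '@' and any description
        if name.endswith("/1") or name.endswith("/2"):
            name = name[:-2]  # mates must share one query name in SAM
        yield name, seq, qual


def unaligned_sam_line(name: str, seq: str, qual: str, flag: int, read_group: str) -> str:
    """Build one SAM record with every mapping field left at its 'unset' value."""
    fields = [name, str(flag), "*", "0", "0", "*", "*", "0", "0",
              seq, qual, "RG:Z:" + read_group]
    return "\t".join(fields)


def fastq_pair_to_unaligned_sam(r1: TextIO, r2: TextIO,
                                read_group: str = "rg1",      # hypothetical ID
                                sample: str = "sample1") -> List[str]:  # hypothetical name
    """Return SAM text lines for a read pair; note the header has no @SQ lines."""
    lines = ["@HD\tVN:1.6\tSO:unsorted",
             "@RG\tID:%s\tSM:%s" % (read_group, sample)]
    for (n1, s1, q1), (n2, s2, q2) in zip(read_fastq(r1), read_fastq(r2)):
        if n1 != n2:
            raise ValueError("mate names differ: %s vs %s" % (n1, n2))
        lines.append(unaligned_sam_line(n1, s1, q1, FLAG_R1_UNMAPPED, read_group))
        lines.append(unaligned_sam_line(n2, s2, q2, FLAG_R2_UNMAPPED, read_group))
    return lines


if __name__ == "__main__":
    r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n")
    r2 = io.StringIO("@read1/2\nTTGA\n+\nFFFF\n")
    print("\n".join(fastq_pair_to_unaligned_sam(r1, r2)))
```

Both mates end up in a single file with no reference name, position or CIGAR, which is why one unaligned BAM can stand in for the R1/R2 FASTQ pair as the single reads input.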
  • bruce01 Member ✭✭

    After testing, I get this error:

    A USER ERROR has occurred: Input files reference and reads have incompatible contigs: No overlapping contigs found.
          reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]
          reads contigs = []

    This is reproducible with the example data from the 6484 uBAM tutorial.

    I will test further to check whether the error is specific to this Spark pipeline.


  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @bruce01, that looks like a garden variety dictionary mismatch to me. Have you tried running a different tool on the same inputs?

  • bruce01 Member ✭✭

    Hi, I skipped testing other tools because, as you note, it is looking for a sequence dictionary, so I used samtools reheader with the genome.dict that I am aligning against, and it works. But this is not in the tutorial 6484 files, nor is it specified. A nice little test for us, I assume =)

    Apropos file format, agree that no more are required. I reckon you should rename the BAM acronym to represent Unaligned in this instance though =D
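For anyone hitting the same error, the check behind it can be sketched as a comparison of the `@SQ` contig names in the two headers. This is a rough illustration under stated assumptions, not GATK's actual validation code: it assumes you have the header text of each file as a string (for example the contents of genome.dict, which is itself SAM header text, and the output of `samtools view -H` on the reads file), and the function names are invented for the example.

```python
from typing import List


def contigs_from_header(header_text: str) -> List[str]:
    """Collect the SN: values of all @SQ lines in SAM header text, in order."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def shared_contigs(reference_header: str, reads_header: str) -> List[str]:
    """Return the reference contigs that also appear in the reads header."""
    reads = set(contigs_from_header(reads_header))
    return [name for name in contigs_from_header(reference_header) if name in reads]


if __name__ == "__main__":
    # Toy headers: a two-contig dictionary and an unaligned BAM header without @SQ lines.
    reference_dict = "@HD\tVN:1.5\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:MT\tLN:16569\n"
    ubam_header = "@HD\tVN:1.5\tSO:queryname\n@RG\tID:rg1\tSM:sample1\n"
    print("reads contigs:", contigs_from_header(ubam_header))
    print("overlap:", shared_contigs(reference_dict, ubam_header))
```

An unaligned BAM typically carries no `@SQ` lines, so its contig list is empty and the overlap with any reference is empty too, which matches the `reads contigs = []` in the error above. Reheadering with the reference dictionary adds those lines, which is presumably why it resolves the error.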

  • bruce01 Member ✭✭

    Or just use --disableSequenceDictionaryValidation true ...