BwaAndMarkDuplicatesPipelineSpark Input Format

Hi, I'm looking forward to the GATK 4 release, so I have started investigating 4.beta.1. I like the tools that are available, the calling methods and Spark! I have therefore set up a Spark pipeline of BwaAndMarkDuplicatesPipelineSpark and BQSRPipelineSpark: very handy, and it would be cool if there were a single pipeline for BWA + MD + BQSR. My issue is with BwaAndMarkDuplicatesPipelineSpark, which specifies its input as BAM/SAM/CRAM; this seems odd. Is FASTQ not an option? I ran with two FASTQs (R1, R2) as input and got the error:

    Sorry, we only support a single reads input for spark tools for now

I know this is a beta, but can someone explain why the input to an alignment tool is expected in an already-aligned format? I also tried merging the paired-end reads into a single FASTQ, and using just R1.fastq.

Appreciate any input on this,

Bruce.

Answers

  • OK, I see some mild discontent in the comments on that page. FWIW, in such a (necessarily) pedantic field, using a file format that is already well known, and whose acronym stands for the opposite of what you are storing in it, seems strange to me. But I appreciate GATK with all its quirks. Thanks for your help as ever, and I'm looking forward to the full GATK 4 release!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    @bruce01 That's fair, and I agree it's unfortunate -- but we felt it was better to repurpose an existing format in a not-so-intuitive way rather than come up with yet another format. If anything the format could be renamed to be more generic, since the presence or absence of mapping information is not a deal breaker... not that that's really going to happen ;)
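
    In practice, the usual route from paired FASTQs to an unmapped BAM is Picard FastqToSam; roughly something like this (a sketch only, with placeholder file, sample and read group names):

        java -jar picard.jar FastqToSam \
            FASTQ=R1.fastq \
            FASTQ2=R2.fastq \
            OUTPUT=unmapped.bam \
            SAMPLE_NAME=sample1 \
            READ_GROUP_NAME=rg1 \
            PLATFORM=illumina

    The resulting unmapped.bam can then be passed to BwaAndMarkDuplicatesPipelineSpark as the single -I reads input.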
  • After testing, I get this error:

    A USER ERROR has occurred: Input files reference and reads have incompatible contigs: No overlapping contigs found.
          reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]
          reads contigs = []
    

    This is reproducible with the example data from the 6484 uBAM tutorial.

    I will test further to check whether the error is specific to this Spark pipeline (a quick way to confirm the empty reads contigs is in the PS below).

    Thanks,
    Bruce.
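
    PS: the empty "reads contigs = []" presumably just means the uBAM header carries no @SQ lines. A minimal check, assuming samtools is on the PATH and using placeholder file names:

        samtools view -H unmapped.bam | grep -c '^@SQ'   # prints 0 for a uBAM without a sequence dictionary
        grep -c '^@SQ' genome.dict                       # prints one line per contig in the reference's dictionary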

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @bruce01, that looks like a garden variety dictionary mismatch to me. Have you tried running a different tool on the same inputs?

  • Hi, I skipped testing other tools because, as you note, it is looking for a sequence dictionary. Instead I used samtools reheader with the genome.dict that I am aligning against (roughly as sketched below), and it works. But this step is not included in the tutorial 6484 files, nor is it mentioned there. A nice little test for us, I assume =)

    Apropos the file format, I agree that no more formats are needed. I reckon you should rename the BAM acronym so it represents Unaligned in this instance, though =D
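
    For anyone hitting the same thing, the reheader step was roughly along these lines (a sketch only; file names are placeholders, and genome.dict is the dictionary of the reference I align against):

        # keep the existing header but drop any @SQ records, then splice in
        # the @SQ records from the reference's .dict file
        samtools view -H unmapped.bam | grep -v '^@SQ' > new_header.sam
        grep '^@SQ' genome.dict >> new_header.sam
        samtools reheader new_header.sam unmapped.bam > fixed.bam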

  • Or just use --disableSequenceDictionaryValidation true ...
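
    For example (a sketch only; gatk-launch was the 4.beta wrapper script, and the file names are placeholders):

        ./gatk-launch BwaAndMarkDuplicatesPipelineSpark \
            -I unmapped.bam \
            -R genome.fasta \
            -O aligned.dedup.bam \
            --disableSequenceDictionaryValidation true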
