[GATK4.0] Which steps should be done before 'ReadsPipelineSpark'?

In the latest version, GATK4.0, the 'ReadsPipelineSpark' pipeline is described as follows:
Takes unaligned or aligned reads and runs BWA (if specified), MarkDuplicates, BQSR, and HaplotypeCaller to generate a VCF file of variants.
As I understand it, this means that BWA, MarkDuplicates, BQSR, and HaplotypeCaller are all contained in one command. So which steps should be done before 'ReadsPipelineSpark'?
I tried to use FastqToSam->AddOrReplaceReadGroups->ReadsPipelineSpark, but it is not correct. So what is the correct pipeline?

Answers

  • SkyWarrior (Turkey; Member)
    edited January 16

    According to my tests, just mapping and sorting should suffice. However, it is painfully slow on a single machine. My old GATK3.x-based pipeline could finish the whole job in 1 hour, whereas this one is still running after more than 1.5 hours.

    My suggestion: if you are on a single machine, don't bother with the Spark tools.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @GraceZou
    Hi,

    I tried to use FastqToSam->AddOrReplaceReadGroups->ReadsPipelineSpark, but it is not correct.

    What do you mean by that? Do you get an error message?

    Thanks,
    Sheila

  • GraceZou (China; Member)

    @Sheila Yes, I've tried several workflows over the past few days. These are the error messages I got:
    1. FastqToSam -> BwaMemIndexImageCreator -> ReorderSam -> ReadsPipelineSpark
    fails in ReadsPipelineSpark with the error "Caused by: org.broadinstitute.hellbender.exceptions.UserException$MalformedRead: Read SRR015438.8113429 chr7:55107383-55107414 is malformed: The input .bam file contains reads with no platform information. First observed at read with name = SRR015438.8113429".
    2. FastqToSam -> BwaMemIndexImageCreator -> ReorderSam -> SortReadFileSpark -> ReadsPipelineSpark
    fails in ReadsPipelineSpark with the error "Caused by: org.broadinstitute.hellbender.exceptions.GATKException: We're supposed to be aligning paired reads, but there are an odd number of them."
    3. FastqToSam -> BwaMemIndexImageCreator -> ReorderSam -> SortReadFileSpark -> AddOrReplaceReadGroups -> ReadsPipelineSpark
    fails in ReadsPipelineSpark with the same error "Caused by: org.broadinstitute.hellbender.exceptions.GATKException: We're supposed to be aligning paired reads, but there are an odd number of them."
    4. FastqToSam -> BwaMemIndexImageCreator -> ReadsPipelineSpark
    fails with the error "Caused by: org.broadinstitute.hellbender.exceptions.UserException$MalformedRead: Read SRR015438.13933175 chr21:10086706-10086724 is malformed: The input .bam file contains reads with no platform information. First observed at read with name = SRR015438.13933175".
    5. FastqToSam -> BwaMemIndexImageCreator -> BwaSpark produces the correct result.

    So what should be performed before ReadsPipelineSpark?

  • GraceZou (China; Member)

    @SkyWarrior

    Hi, I have three servers in a Spark cluster. Could you tell me all the steps from FASTQ to VCF using ReadsPipelineSpark? Thanks a lot.

  • SkyWarrior (Turkey; Member)

    The only thing that I did was to prepare a clean bam file.

    Map with BWA MEM.

    Create an unmapped BAM with Picard, including the read group and other meta-information.

    Merge the unmapped BAM with the SAM file from BWA MEM.

    Optional but recommended: sort, and fix the NM, MD, and UQ tags with Picard.

    Give this final BAM file to ReadsPipelineSpark with a 2bit-compressed reference file and wait for the VCF file to be produced. This method won't give you a recalibrated BAM file. All the data goes through the pipe, and the result is your VCF.

    Good luck.
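
The preparation steps above could be scripted roughly as follows. This is only a sketch that assembles and prints the command lines without running them. Every file name, the sample name, the read-group values, the `picard` wrapper executable, and the known-sites file are hypothetical placeholders, and the exact arguments (in particular `--known-sites`, which BQSR needs but which is not mentioned in the post) should be checked against the help output of your own Picard and GATK versions.

```python
import shlex

# Hypothetical placeholders: adjust to your own data and installation.
REF_FASTA = "ref.fasta"   # reference used by BWA and Picard
REF_2BIT = "ref.2bit"     # 2bit-compressed reference for ReadsPipelineSpark


def build_commands(fq1, fq2, sample, known_sites):
    """Return the preparation steps as argument lists, in the order of the post."""
    return [
        # 1. Map with BWA MEM (SAM goes to stdout; redirect it to aligned.sam).
        ["bwa", "mem", REF_FASTA, fq1, fq2],
        # 2. Unmapped BAM that carries the read group, sample, and platform.
        ["picard", "FastqToSam", f"FASTQ={fq1}", f"FASTQ2={fq2}",
         "OUTPUT=unmapped.bam", f"SAMPLE_NAME={sample}",
         "READ_GROUP_NAME=rg1", "PLATFORM=ILLUMINA"],
        # 3. Merge the unmapped BAM with the BWA MEM alignments, sorting the result.
        ["picard", "MergeBamAlignment", "UNMAPPED_BAM=unmapped.bam",
         "ALIGNED_BAM=aligned.sam", f"REFERENCE_SEQUENCE={REF_FASTA}",
         "OUTPUT=merged.bam", "SORT_ORDER=coordinate"],
        # 4. Optional but recommended: fix the NM, MD, and UQ tags.
        ["picard", "SetNmMdAndUqTags", "INPUT=merged.bam",
         "OUTPUT=fixed.bam", f"REFERENCE_SEQUENCE={REF_FASTA}"],
        # 5. Hand the clean BAM to ReadsPipelineSpark (assumed arguments).
        ["gatk", "ReadsPipelineSpark", "-I", "fixed.bam", "-R", REF_2BIT,
         "-O", "out.vcf", "--known-sites", known_sites],
    ]


commands = build_commands("r1.fastq", "r2.fastq", "sample1", "known.vcf")
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands first makes it easy to review them before running anything on a cluster.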

  • Sheila (Broad Institute; Member, Broadie, Moderator)
    edited January 29

    @GraceZou
    Hi,

    The error messages you are getting tell you what the problems are.

    The input .bam file contains reads with no platform information.

    This can be fixed with Picard's AddOrReplaceReadGroups.
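
Before rerunning, it may help to confirm which read groups actually lack platform information. The sketch below is an illustration, not part of GATK or Picard: it scans SAM header lines (for example the output of `samtools view -H input.bam`) and reports `@RG` entries without a `PL` field. The function name and the example header are hypothetical.

```python
def read_groups_missing_platform(header_lines):
    """Return the IDs of @RG header lines that have no PL (platform) field."""
    missing = []
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        # After the record type, header fields are tab-separated TAG:VALUE pairs.
        fields = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        if not fields.get("PL"):
            missing.append(fields.get("ID", "<no ID>"))
    return missing


# Hypothetical header for illustration only.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:rg1\tSM:sample1\tLB:lib1",
    "@RG\tID:rg2\tSM:sample1\tLB:lib1\tPL:ILLUMINA",
]
print(read_groups_missing_platform(example_header))  # prints ['rg1']
```

Note that this only inspects the header. A file with no `@RG` lines at all returns an empty list here, yet still has no platform information. AddOrReplaceReadGroups sets the platform through its `RGPL` argument.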

    We're supposed to be aligning paired reads, but there are an odd number of them.

    Does your FASTQ contain paired-end reads or single-end reads? The tool expects an even number of reads for paired-end data, but it is seeing an odd number.
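
One way to answer that question for a pair of FASTQ files is to compare read counts and read names between the two mates. The following is a minimal sketch, assuming plain four-line FASTQ records and mate names that differ at most by a trailing `/1` or `/2`. The helper names and the in-memory example data are hypothetical, and a clean result here does not rule out pairing being lost in a later step.

```python
import io
from itertools import zip_longest


def fastq_names(handle):
    """Yield read names from a FASTQ stream, assuming 4 lines per record."""
    index = 0
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        if index % 4 == 0:
            # Drop the leading '@', any comment after whitespace, and /1 or /2.
            name = line[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name
        index += 1


def check_pairing(r1_handle, r2_handle):
    """Return (reads in R1, reads in R2, first mismatching name pair or None)."""
    n1 = n2 = 0
    first_mismatch = None
    for a, b in zip_longest(fastq_names(r1_handle), fastq_names(r2_handle)):
        if a is not None:
            n1 += 1
        if b is not None:
            n2 += 1
        if a != b and first_mismatch is None:
            first_mismatch = (a, b)
    return n1, n2, first_mismatch


# Hypothetical data: R2 is missing the mate of read2.
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n")
r2 = io.StringIO("@read1/2\nTGCA\n+\nIIII\n")
result = check_pairing(r1, r2)
print(result)  # prints (2, 1, ('read2', None))
```

For real files, pass open file handles instead of the `io.StringIO` objects (and `gzip.open(path, "rt")` for compressed FASTQ).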

    -Sheila

  • GraceZou (China; Member)

    @Sheila

    It is paired-end FASTQ, and it worked with the BwaSpark pipeline.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @GraceZou
    Hi,

    Were you able to fix the read group platform issue with Picard?

    -Sheila
