Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Difference in output file from Picard's MarkDuplicates

Hi, I'm working on a pipeline which takes a RNA-seq bam file and looks for overlaps using R. When I run the R script on my original BAM file it works fine, however when I run it on the bam file produced by MarkDuplicates it throws this error:

Error in $<-.data.frame(*tmp*, "queryHits", value = integer(0)) :
replacement has 0 rows, data has 2
Calls: $<- -> $<-.data.frame
Execution halted

From what I can find online this means that the R script has been asked to find a variable in the file which it cannot find. The R script uses these packages: library(GenomicRanges) and library(Biostrings)
This is the section of R script which falls over:

overlaps <- findOverlaps(GRbam,GR)
printWithTimeStamp("Collating data:\n")
overs <- data.frame(NA,rownames=c(1:length(overlaps)))
printWithTimeStamp(" queryHits\n")
overs$queryHits<-queryHits(overlaps)

I included both the following options when calling MarkDuplicates to try and reduce the formatting changes in the new file:
REMOVE_DUPLICATES=TRUE
PROGRAM_RECORD_ID=null

Has anyone come across a similar issue and know what might be different in the new bam file? I've compared both the headers and the only difference is that the original file has @HD:VN 1.0, while the MarkDuplicates output has @HD:VN 1.5. Could this be the issue? I can't find much online about the differences.
Please feel free to ask for more information if I haven't been clear.

Tagged:

Best Answer

  • Accepted Answer

    Problem solved! I had the input and output as the same name (to try to reduce the effect on the rest of the pipeline), but I changed the output name and just had to deal with altering the pipeline as such, and it seems to be working fine now!

Answers

Sign In or Register to comment.