
Sample consolidated uBAM from public data

In the public data, I am looking for a single consolidated uBAM for the files below (full list):

gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.4.NA12878.downsampled.query.sorted.unmapped.bam
gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.5.NA12878.downsampled.query.sorted.unmapped.bam

Answers

  • I see the file below, but I am not sure it is the right one. Please confirm whether it corresponds to the list above and, if not, where I can find the correct file.

    .....//storage.cloud.google.com/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/NA12878.full.1.unaligned.bam?_ga=2.82104849.-683196340.1545165918
  • SChaluvadi Member, Broadie, Moderator admin

    @write2sethu Please let me confirm and get back to you.

  • SChaluvadi Member, Broadie, Moderator admin

    @write2sethu We do not have a combined uBAM file available. If you want to combine the contents of the 2 uBAM files listed above, you can convert them to FASTQ files, concatenate the FASTQ files, and then convert the result into a single uBAM.
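
The conversion route described above (uBAM to FASTQ, concatenate, FASTQ back to one uBAM) can be sketched roughly as follows. This is only an illustration under several assumptions that are not stated in the thread: the reads are paired-end, Picard is callable as `picard` (with a plain jar it would be `java -jar picard.jar` instead), the intermediate FASTQ files are uncompressed, and the helper names, the `combined_*.fastq` file names, and the read group name are all hypothetical. Note that rebuilding the uBAM this way places every read in one read group, so the original per-lane read groups are not preserved.

```python
import shutil
from pathlib import Path


def sam_to_fastq_cmd(ubam, fastq1, fastq2):
    """Build a Picard SamToFastq command for one paired-end uBAM (assumed paired)."""
    return ["picard", "SamToFastq",
            f"INPUT={ubam}", f"FASTQ={fastq1}", f"SECOND_END_FASTQ={fastq2}"]


def fastq_to_sam_cmd(fastq1, fastq2, out_ubam, sample, read_group):
    """Build a Picard FastqToSam command that writes a single unmapped BAM."""
    return ["picard", "FastqToSam",
            f"FASTQ={fastq1}", f"FASTQ2={fastq2}", f"OUTPUT={out_ubam}",
            f"SAMPLE_NAME={sample}", f"READ_GROUP_NAME={read_group}"]


def concatenate(parts, dest):
    """Append the given files to dest in order (uncompressed FASTQ assumed)."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def plan_commands(ubams, out_ubam, sample, read_group):
    """Return (one SamToFastq command per uBAM, the final FastqToSam command).

    The caller is expected to run the first list, concatenate the *_1 and
    *_2 FASTQ files into combined_1.fastq and combined_2.fastq (hypothetical
    names, same file order for both), and then run the final command.
    """
    extract = []
    for ubam in ubams:
        stem = Path(ubam).name
        if stem.endswith(".bam"):
            stem = stem[:-4]
        extract.append(sam_to_fastq_cmd(ubam, f"{stem}_1.fastq", f"{stem}_2.fastq"))
    rebuild = fastq_to_sam_cmd("combined_1.fastq", "combined_2.fastq",
                               out_ubam, sample, read_group)
    return extract, rebuild
```

Each returned command is a plain list of arguments, so it can be inspected first and then passed to `subprocess.run(cmd, check=True)` once the paths look right.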

  • Thanks for the response.

    So do the 2 uBAM files below (from public data)

    ..//storage.cloud.google.com/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/NA12878.full.1.unaligned.bam?_ga=2.55292076.-683196340.1545165918

    ..//storage.cloud.google.com/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/NA12878.full.2.unaligned.bam?_ga=2.55292076.-683196340.1545165918

    correspond to the full list of files below? In other words, if I consolidate the files above, will the result be equivalent to one file consolidated from the ones below? I am just looking for one uBAM file covering all of the files below.

    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.4.NA12878.downsampled.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.5.NA12878.downsampled.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.6.NA12878.downsampled.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.7.NA12878.downsampled.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.8.NA12878.downsampled.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYN2.1.NA12878.downsampled.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35M.1.NA12878.downsampled.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35M.2.NA12878.downsampled.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35M.3.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35M.4.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35M.5.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35M.6.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35M.7.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35M.8.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35N.1.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK35N.2.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK3T5.1.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK3T5.2.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK3T5.3.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK3T5.4.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK3T5.5.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK3T5.6.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK3T5.7.NA12878.interval.filtered.query.sorted.unmapped.bam
    gs://gatk-test-data/wgs_ubam/NA12878_24RG/small/HK3T5.8.NA12878.interval.filtered.query.sorted.unmapped.bam
  • AdelaideR Member admin

    @write2sethu

    It is my understanding that it is NOT a good idea to merge the files directly using MergeSamFiles. Check the discussion here.

    One problem that can occur is that "Merging unmapped bam and mapped bam restores hardclipped bases back into alignment in the form of softclips which may significantly affect your SNP and INDEL discovery".

    More information about the process can be found at this link.

    Whether you can use MergeSamFiles depends on whether the RG (read group) values are consistent across all ubams in a group.

    This is not the case with the gatk-test-data, where the read groups are based on the uBAM file name.

    For example,

    HK3T5.8.NA12878.interval.filtered.query.sorted.unmapped.bam

    has a read group that looks like

    RG:Z:HJYFJ.8

    So MergeSamFiles may not group this uBAM with the other uBAMs correctly unless the uBAM is first converted back to FASTQ.

    Alternatively, the existing read groups can be replaced using the Picard tool AddOrReplaceReadGroups.

    If you are using GATK on FireCloud, the uBAMs do not need to be merged to run the protocol with the test data set. Check out this discussion to learn how to run your data model on FireCloud.

    What problem are you trying to solve by merging the gatk-test-data ubams?
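
The read-group consistency check mentioned above can be sketched as follows. This is a minimal illustration, assuming the header text of each uBAM has already been extracted (for example with `samtools view -H`); the function names are hypothetical, and "consistent" is interpreted here simply as every file declaring the same set of `@RG` IDs.

```python
def read_group_ids(header_text):
    """Return the ID of every @RG line in a SAM/BAM header, in order."""
    ids = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        for field in line.split("\t")[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


def share_read_groups(headers):
    """True if every header in the {file name: header text} dict
    declares the same set of read group IDs."""
    id_sets = {frozenset(read_group_ids(text)) for text in headers.values()}
    return len(id_sets) <= 1
```

If `share_read_groups` returns False for a set of uBAMs, their read groups differ from file to file, which is the situation described above for the gatk-test-data files.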

  • I am looking to use the same BAM input for pipelines run locally and in the cloud. Any suggestions, please?
  • Hi AdelaideR,
    Thank you so much for all the information.
    I would like to proceed with the first option.
    This is the pipeline I am trying to run in the cloud:
    ...//portal.firecloud.org/#workspaces/help-gatk/five-dollar-genome-analysis-pipeline

    It takes the list of uBAMs below as inputs.

    ...//gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.4.NA12878.downsampled.query.sorted.unmapped.bam
    ...//gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.5.NA12878.downsampled.query.sorted.unmapped.bam
    .......

    Can I change the input configuration to use just the first uBAM as input and run the pipeline?
    That way I can download the same single uBAM file and run it locally.

    Also, how can I download the uBAM file below?
    //gatk-test-data/wgs_ubam/NA12878_24RG/small/HJYFJ.4.NA12878.downsampled.query.sorted.unmapped.bam