The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Combining BAM files from two separate tests on same person, from same company

Member Posts: 4

Hello. I poked around a bit but didn't find the answer. It was recommended to me that picard was appropriate for the task I have, but I was hoping for confirmation, and perhaps comments.

My question stems from an odd situation, so let me quickly explain it. I purchased a 30x full genome test, and on getting the results found it was 20x. When I asked the testing company, they said they would make good on the offer by sending a new test kit and running it all over again, from scratch. Any week now I expect to get the new 30x coverage BAM (or rather, set of BAM files, as they deliver it as one BAM per chromosome).

It seems that it should be possible to combine all these data from the same company on the same person, and yield a 50x coverage data set (20x + 30x).

First off, does that seem like a reasonable goal? And if so, can I use picard to do it? Or is there a more appropriate tool? When I initially asked in igv-help forum, I got this suggestion:

Yes you should be able to combine the 2 bams. The best way to do this is to use an offline tool, such as the Picard tool MergeSamFiles

Use the "coordinate" SORT_ORDER option. See the bottom of this page for more help on Picard:
https://broadinstitute.github.io/picard/index.html#QuickStart. After merging you will need to create a new index (.bai file).

To simply view the 2 files you can load them both, they will appear in 2 separate panels but that might be an advantage, you can quickly see if the 2 runs were consistent. I assume you were supplied index (".bai") files with the bams.

So the merge process, itself, would not create a new index even if I used the standard option CREATE_INDEX=true ...?

If not, after the merge, I gather I would use:
java -jar picard.jar BuildBamIndex \
I=input.bam

I suppose it would be safest to take it in separate steps if there is no confirmation on using the Create_Index option during the merge.

The respondent hadn't used this tool, and wisely sent me here. Does the initial advice sound right? And what about the CREATE_INDEX question?

I am relatively new to this type of software, but am a pretty fast learner. I am hoping it won't get deep into complexities, but if that's how it has to be then so be it.

Thank you... !!!!!

:-D

Tagged:

Hi @D_Nicolson,

Yes, that all sounds reasonable; run MergeSamFiles with coordinate sorting, and set CREATE_INDEX=true to get an index produced for you while you're at it (it's available to all Picard tools -- but when in doubt, you can always test it out on a small file; you can use some of our tutorial data for that purpose).

One additional point -- make sure the two bam files are tagged with appropriate read group (RG) information. The key here is that the two bams should have different RG:ID (read group identifier) but the same RG:SM (sample identifier). They probably also need different RG:LB (library identifier), since it sounds like the company is processing new samples for you, not just re-sequencing the original one.

I can't promise it won't get a bit complex -- or a lot -- but we're here to help, so don't hesitate to ask! After duly poking around of course

Geraldine Van der Auwera, PhD

Hi @D_Nicolson,

Yes, that all sounds reasonable; run MergeSamFiles with coordinate sorting, and set CREATE_INDEX=true to get an index produced for you while you're at it (it's available to all Picard tools -- but when in doubt, you can always test it out on a small file; you can use some of our tutorial data for that purpose).

One additional point -- make sure the two bam files are tagged with appropriate read group (RG) information. The key here is that the two bams should have different RG:ID (read group identifier) but the same RG:SM (sample identifier). They probably also need different RG:LB (library identifier), since it sounds like the company is processing new samples for you, not just re-sequencing the original one.

I can't promise it won't get a bit complex -- or a lot -- but we're here to help, so don't hesitate to ask! After duly poking around of course

Geraldine Van der Auwera, PhD

• Member Posts: 4

OK, I will try this. Yes, it was a new sample from same person, so it sounds like I will need to modify the RG:SM of one to match the other, and check to be sure that they have different RG:ID & RG:LB. Of course I'll poke around to familiarize myself with what those all are and how to get into them. Thank you!

• Member Posts: 4

It took several more weeks than expected but I finally have my additional 30x BAM files. I have downloaded and installed samtools and it is working. I have downloaded and have both picard.jar and GenomeAnalysisTK.jar, and have managed to invoke both for a basic review of their functions. This is on a Mac with OSX Yosemite 10.10.5, if it matters. The BAM files are one file per chromosome, and with a .bai file for each one. To try to make it simple, I put the Y chromosome files from both the 20x and the 30x files together with the two .jar files in ~/wk/ directory.

Now my problem is I don't seem to find any indication that there are any RG bits in the BAM files... I was expecting to see something after running the following, but it just went back to command prompt after thinking for a while:
samtools view -H Sample_PU6HKD6-EXT.chrY.bam | grep '@RG'

Same goes for the other Y BAM file:
samtools view -H WGC074701D_combined.chrY.bam | grep '@RG'

I can open these in Genome Browse (GoldenHelix) and pull up the headers, but there is nothing in them about RG, RG:ID, RG:SM, or RG:LB or anything that is at all similar... I tried running various stats-oriented commands in these packages in the hopes of seeing these at some point, but so far nothing.

I've poked all around in the FAQs here, and done somegoogling on getting started with samtools, but I'm not getting anywhere. Can someone please give me a push or a pointer to where I can find the RG info for these BAM files? As noted above, I'm looking to make sure they have the same RG:SM (both are from the same person), but different RG:ID & RG:LB (these are two different tests, each run from different swabs taken at different times from the same person). But I'm banging into the same wall over and over here.

Thank you for any mercy ;-)

If there is no RG information in the files you can add it yourself using Picard AddOrReplaceReadgroup. See the dictionary entry on read groups for guidance on what the content of the RG should be.

Geraldine Van der Auwera, PhD

• Member Posts: 4

Just for the record, after beginning to wade into understanding the RG aspects, I decided to give a try at merging the two files just as they were, without any RG info. I used this command:
java -jar picard.jar MergeSamFiles SO=coordinate CREATE_INDEX=true I=[file1].bam I=[file2].bam O=MergedTest.chrY.bam

I was able to open all three chrY.bam files using GenomeBrowse and zoom in to a number of areas with known mutations (from a prior Big Y test), and everything looked as expected. It appears that it works well enough for my present purposes (a Y analysis by YFull.com) without any RG information. However, I realize I will have to come to grips with RG info before I get further into the data (e.g., to generate my own combined vcf file across all chromosomes and other purposes).

Thank you to Geraldine for the help & tips!