tedtoal
edited November 2017

The SAM standard and the GATK documentation describe the SM field of the RG tag similarly: it contains the sample name, except that when sequencing pools of samples, the pool name should be used instead of an individual sample name. The LB field is described as containing the library name.

Say I have samples S1 through S10. I make a barcoded library from each one, so I have libraries S1 through S10. I then pool S1 through S5 together into Pool1 and sequence that on one lane, and pool S6 through S10 together into Pool2 and sequence that on a second lane. My understanding is that I would then set the SM field to either Pool1 or Pool2, and LB to S1, S2, ... S10.

This is in fact what I did. Now I discover that UnifiedGenotyper puts the SM value, rather than the LB value, into the sample column header of the VCF file.

How did I misinterpret the seemingly clear documents?

I suggest that the description of SM be changed to NOT say that it should be the pool name, since that can be interpreted in more than one way.

  Sheila


    When we refer to a "pool", we mean samples that are not individually barcoded. In your case, you know which reads come from each sample, and you have simply run the samples together in one lane. For that multiplexing case, keep the SM tag as the sample name, not the "pooled name". Your LB tag should likewise reflect the library prep. Have a look at this dictionary entry for a hopefully helpful example.
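
To make the multiplexed case concrete, here is a minimal Python sketch of the `@RG` header lines that the S1 through S10 example might end up with. Only the tag names (ID, SM, LB, PL, PU) come from the SAM specification. The ID scheme, the `lib_` prefix, the `FLOWCELL1` flowcell name, and both function names are made up for illustration, so substitute whatever matches your actual run.

```python
def read_group_line(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Build one SAM @RG header line from tab-separated TAG:VALUE fields."""
    fields = [
        ("ID", rg_id),           # unique read group identifier
        ("SM", sample),          # individual sample name, NOT the lane pool
        ("LB", library),         # library prepared from that sample
        ("PL", platform),        # sequencing platform
        ("PU", platform_unit),   # platform unit (hypothetical flowcell.lane.barcode)
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


def multiplexed_read_groups():
    """@RG lines for barcoded samples S1..S10, five per lane (hypothetical layout)."""
    lines = []
    for i in range(1, 11):
        sample = f"S{i}"
        lane = 1 if i <= 5 else 2  # S1-S5 on lane 1, S6-S10 on lane 2
        lines.append(
            read_group_line(
                rg_id=f"{sample}.lane{lane}",
                sample=sample,
                library=f"lib_{sample}",  # assumed naming: one library per sample
                platform_unit=f"FLOWCELL1.{lane}.{sample}",
            )
        )
    return lines


if __name__ == "__main__":
    for line in multiplexed_read_groups():
        print(line)
```

With this layout the lane shows up only in ID and PU, while SM stays the individual sample, so the VCF sample columns would be S1 through S10 rather than Pool1 and Pool2.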


  tedtoal

    Ok, that's a clear explanation. BUT, I still suggest that the description of SM be improved. As I said, the word "pool" can be interpreted in multiple ways. I have never heard a wet lab person say they are "preparing a multiplex" of samples, or that they are "multiplexing the samples together". Rather, they say they are "pooling the samples", even though the samples are barcoded. SM should say that it is the sample name unless multiple samples are pooled without being barcoded, in which case SM is the pool name.

