UnifiedGenotyper doesn't generate 1 vcf per sample when bams from multiple subjects are input

We are running tests to get UnifiedGenotyper (UG) to produce 1 VCF per sample when given BAMs from multiple subjects. Our situation is slightly complicated by the fact that each sample has 3 BAMs. When we input all 6 BAMs into UG, hoping to get 2 VCFs (1 per sample), we instead get a single VCF. We found some relevant advice in this post:
but still haven't solved the issue.

Details:
1) We are inputting 6 BAMs for our test, 3 per sample for 2 samples.
2) The BAMs were generated using Bioscope from targeted capture reads sequenced on a SOLiD 4.
3) As recommended in the post above, we checked the @RG lines in the BAM headers using Samtools. The lines for the 6 BAMs are as follows:

sample 1:

@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1

@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1

@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1

sample 2:

@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1

@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1

@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1

Based on the former post, I would have expected each of these BAMs to generate a separate VCF, since the read group IDs are all different (which would not have been desirable either, as we are hoping to generate 2 VCFs in this test). So it is not clear whether, or how, we should use the Picard tool AddOrReplaceReadGroups to modify the @RG headers.

Does that make sense? Any advice?
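
For readers checking their own headers: GATK identifies samples by the SM tag of each read group, not by the ID tag, so read groups with different IDs but the same SM are treated as one sample. The following is a minimal Python sketch (the function names are hypothetical, not part of any GATK or Samtools API) that groups @RG lines by SM. It is run here on the six lines exactly as posted above, which may or may not match the actual files if there was a copy-and-paste slip.

```python
from collections import defaultdict

# The six @RG lines exactly as posted in the question.
POSTED_RG_LINES = [
    "@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1",
    "@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1",
    "@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1",
    "@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1",
    "@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1",
    "@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1",
]


def parse_rg_line(line):
    """Parse one @RG header line into a dict of tag -> value.

    Splits on any whitespace, so it accepts both real (tab-separated)
    headers and the space-separated form pasted into the forum.
    """
    fields = line.split()
    if not fields or fields[0] != "@RG":
        raise ValueError("not an @RG line: %r" % line)
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only; values such as DT contain colons.
        key, _, value = field.partition(":")
        tags[key] = value
    return tags


def group_read_groups_by_sample(rg_lines):
    """Return {sample name (SM): [read group IDs]} for the given @RG lines."""
    groups = defaultdict(list)
    for line in rg_lines:
        tags = parse_rg_line(line)
        groups[tags.get("SM", "<missing SM>")].append(tags.get("ID", "<missing ID>"))
    return dict(groups)


if __name__ == "__main__":
    for sample, read_group_ids in group_read_groups_by_sample(POSTED_RG_LINES).items():
        print(sample, len(read_group_ids), "read group(s)")
```

On the lines as posted, all six read groups fall under the single sample name S1, including the three listed under "sample 2". If the real headers look the same, that would be worth ruling out before anything else.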

Best Answers


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie; admin)

    Hi there,

    I think we have a small misunderstanding here. The way you're doing it is correct in the sense that your RG groups are fine, and the UG is treating your samples separately. However, the UG will always output the results of samples that were called together into a single VCF file. This doesn't mean that the sample data got lumped all together. If you look at your VCF, you should see that there are per-sample metrics reported, with separate values for each sample.

    If you want the results reported separately because the samples are unrelated, then you need to call the samples separately in distinct runs of the UG. If the samples are related (in the sense that they're part of a study cohort), then I would recommend calling them together, and if you really want to you can separate out the calls using SelectVariants to produce per-sample VCFs.
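
As a rough illustration of the SelectVariants step, here is a small Python sketch that assembles one GATK 3-style command line per sample. The file names (`GenomeAnalysisTK.jar`, `reference.fasta`, `joint_calls.vcf`) and the sample names `S1` and `S2` are placeholders, and the helper function is hypothetical. The sketch only prints the commands, so check the arguments against the documentation for your GATK version before running them.

```python
import shlex


def select_variants_command(gatk_jar, reference, joint_vcf, sample, out_vcf):
    """Build a SelectVariants command (GATK 3-style arguments) for one sample."""
    return [
        "java", "-jar", gatk_jar,
        "-T", "SelectVariants",
        "-R", reference,   # reference FASTA used for calling
        "-V", joint_vcf,   # multi-sample VCF produced by UnifiedGenotyper
        "-sn", sample,     # sample name, i.e. the SM tag of its read groups
        "-o", out_vcf,     # per-sample output VCF
    ]


if __name__ == "__main__":
    # Placeholder sample names; use the SM values from your own BAM headers.
    for sample in ["S1", "S2"]:
        command = select_variants_command(
            "GenomeAnalysisTK.jar", "reference.fasta", "joint_calls.vcf",
            sample, sample + ".vcf",
        )
        print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list rather than a single string keeps each argument separate, so the same list can later be passed to `subprocess.run` without shell quoting problems.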

  • Many thanks for the response and apologies for my misunderstanding -- I am obviously a newbie. To briefly follow up with another newbie question: here is the first line of (non-metadata) output from the VCF (the next line moves to a different locus):

    chr1 19193918 rs12745794 T C 313.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.127;DB;DP=57;Dels=0.00;FS=1.596;HaplotypeScore=6.7844;MLEAC=1;MLEAF=0.500;MQ=39.93;MQ0=0;MQRankSum=-2.666;QD=5.50;ReadPosRankSum=0.381 GT:AD:DP:GQ:PL 0/1:38,19:55:99:342,0,1001

    I do not see anything in this output indicating calls (or genotype assignments) have been made for multiple samples. The structure looks the same as the vcf output in our test of a single sample. Could you please clarify?

    Many thanks for any help.
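
One way to check how many samples a VCF record carries: a data line has 8 fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), then a FORMAT column, then one genotype column per sample. The Python sketch below (the helper name is hypothetical) pairs the FORMAT keys with each sample column. It splits on any whitespace so that it works on the line as pasted here. Real VCF files are tab-separated.

```python
FIXED_COLUMNS = 8  # CHROM POS ID REF ALT QUAL FILTER INFO

# The record exactly as posted above.
POSTED_RECORD = (
    "chr1 19193918 rs12745794 T C 313.77 . "
    "AC=1;AF=0.500;AN=2;BaseQRankSum=0.127;DB;DP=57;Dels=0.00;FS=1.596;"
    "HaplotypeScore=6.7844;MLEAC=1;MLEAF=0.500;MQ=39.93;MQ0=0;"
    "MQRankSum=-2.666;QD=5.50;ReadPosRankSum=0.381 "
    "GT:AD:DP:GQ:PL 0/1:38,19:55:99:342,0,1001"
)


def per_sample_fields(vcf_line):
    """Return one dict of FORMAT key -> value per sample column in a VCF data line."""
    columns = vcf_line.split()
    if len(columns) <= FIXED_COLUMNS + 1:
        return []  # no FORMAT column or no genotype columns
    format_keys = columns[FIXED_COLUMNS].split(":")
    sample_columns = columns[FIXED_COLUMNS + 1:]
    return [dict(zip(format_keys, column.split(":"))) for column in sample_columns]


if __name__ == "__main__":
    samples = per_sample_fields(POSTED_RECORD)
    print("sample columns:", len(samples))
    for fields in samples:
        print(fields)
```

For the record as posted, this finds a single genotype column (GT 0/1, AD 38,19). A two-sample call set would have two such columns after GT:AD:DP:GQ:PL, and the `#CHROM` header line would list two sample names.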

  • Thanks Geraldine. Very helpful!
